From 31a62c2eb8aa0b0f836e72d18c5626e70d9f3917 Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Tue, 8 Apr 2025 00:24:47 +0530
Subject: [PATCH] Updated Model-card for donut (#37290)

* Updated documentation for Donut model

* Update docs/source/en/model_doc/donut.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/donut.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/donut.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/donut.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Updated code suggestions

* Update docs/source/en/model_doc/donut.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Updated code suggestion to Align with the AutoModel example

* Update docs/source/en/model_doc/donut.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Updated notes section included code examples

* close hfoption block and indent

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/donut.md | 301 ++++++++++++++++--------
 1 file changed, 156 insertions(+), 145 deletions(-)

diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 6e5cfe648d0..1bc1a3bcfd0 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -13,180 +13,191 @@ rendered properly in your Markdown viewer.
 specific language governing permissions and limitations under the License.
 -->
 
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
 # Donut
 
-## Overview
-
+[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies, making it more accurate and adaptable to diverse languages and formats.
 
-The Donut model was proposed in [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by
-Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
-Donut consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform document understanding
-tasks such as document image classification, form understanding and visual question answering.
+Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings and BART processes them into meaningful text sequences.
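+
+A minimal sketch of this pairing, assuming the base `naver-clova-ix/donut-base` checkpoint, is to load it and inspect which encoder and decoder classes it resolves to:
+
+```py
+from transformers import VisionEncoderDecoderModel
+
+# Donut checkpoints load as a VisionEncoderDecoderModel: a DonutSwin image encoder
+# paired with a BART-style autoregressive text decoder.
+model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
+print(type(model.encoder).__name__)  # DonutSwinModel
+print(type(model.decoder).__name__)  # a BART-style decoder, e.g. MBartForCausalLM
+```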
 
-The abstract from the paper is the following:
+You can find all the original Donut checkpoints under the [Naver Clova Information Extraction](https://huggingface.co/naver-clova-ix) organization.
 
-*Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains.*
+> [!TIP]
+> Click on the Donut models in the right sidebar for more examples of how to apply Donut to different language and vision tasks.
 
-
+The examples below demonstrate how to perform document understanding tasks using Donut with [`Pipeline`] and [`AutoModel`].
 
- Donut high-level overview. Taken from the original paper.
-
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
-[here](https://github.com/clovaai/donut).
-
-## Usage tips
-
-- The quickest way to get started with Donut is by checking the [tutorial
-  notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model
-  at inference time as well as fine-tuning on custom data.
-- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
-
-## Inference examples
-
-Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
-[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
-
-The [`DonutImageProcessor`] class is responsible for preprocessing the input image and
-[`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The
-[`DonutProcessor`] wraps [`DonutImageProcessor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
-into a single instance to both extract the input features and decode the predicted token ids.
-
-- Step-by-step Document Image Classification
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
 ```py
->>> import re
+# pip install datasets
+import torch
+from datasets import load_dataset
+from transformers import pipeline
+from PIL import Image
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
+
+pipeline = pipeline(
+    task="document-question-answering",
+    model="naver-clova-ix/donut-base-finetuned-docvqa",
+    device=0,
+    torch_dtype=torch.float16
+)
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device)  # doctest: +IGNORE_RESULT
-
->>> # load document image
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[1]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = "<s_rvlcdip>"
->>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-...     pixel_values.to(device),
-...     decoder_input_ids=decoder_input_ids.to(device),
-...     max_length=model.decoder.config.max_position_embeddings,
-...     pad_token_id=processor.tokenizer.pad_token_id,
-...     eos_token_id=processor.tokenizer.eos_token_id,
-...     use_cache=True,
-...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
-...     return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
->>> print(processor.token2json(sequence))
-{'class': 'advertisement'}
+pipeline(image=image, question="What time is the coffee break?")
 ```
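+
+The same `pipeline` object can be reused for follow-up questions; a minimal sketch, reusing `pipeline` and `image` from the example above with made-up questions:
+
+```py
+# Ask additional questions about the same document (illustrative questions only).
+for question in ["Where is the coffee break?", "Who gives the opening remarks?"]:
+    print(pipeline(image=image, question=question))
+```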
 
-- Step-by-step Document Parsing
+</hfoption>
+<hfoption id="AutoModel">
 
 ```py
->>> import re
+# pip install datasets
+import torch
+from datasets import load_dataset
+from transformers import AutoProcessor, AutoModelForVision2Seq
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
+
+processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+question = "What time is the coffee break?"
+task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
+inputs = processor(image, task_prompt, return_tensors="pt")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device)  # doctest: +IGNORE_RESULT
-
->>> # load document image
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[2]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = "<s_cord-v2>"
->>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-...     pixel_values.to(device),
-...     decoder_input_ids=decoder_input_ids.to(device),
-...     max_length=model.decoder.config.max_position_embeddings,
-...     pad_token_id=processor.tokenizer.pad_token_id,
-...     eos_token_id=processor.tokenizer.eos_token_id,
-...     use_cache=True,
-...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
-...     return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
->>> print(processor.token2json(sequence))
-{'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
+
+outputs = model.generate(
+    input_ids=inputs.input_ids,
+    pixel_values=inputs.pixel_values,
+    max_length=512
+)
+answer = processor.decode(outputs[0], skip_special_tokens=True)
+print(answer)
 ```
 
-- Step-by-step Document Visual Question Answering (DocVQA)
+</hfoption>
+</hfoptions>
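+
+The decoded string above still follows Donut's prompt format. For a structured result, an optional post-processing step with `processor.token2json`, reusing `outputs` and `processor` from the [`AutoModel`] example, looks like this:
+
+```py
+import re
+
+# Strip special tokens and the task prompt, then convert the sequence to a dict.
+sequence = processor.batch_decode(outputs)[0]
+sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
+print(processor.token2json(sequence))
+```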
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
 
 ```py
->>> import re
+# pip install datasets torchao
+import torch
+from datasets import load_dataset
+from transformers import TorchAoConfig, AutoProcessor, AutoModelForVision2Seq
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config)
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+question = "What time is the coffee break?"
+task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
+inputs = processor(image, task_prompt, return_tensors="pt")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device)  # doctest: +IGNORE_RESULT
-
->>> # load document image from the DocVQA dataset
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[0]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
->>> question = "When is the coffee break?"
->>> prompt = task_prompt.replace("{user_input}", question)
->>> decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-...     pixel_values.to(device),
-...     decoder_input_ids=decoder_input_ids.to(device),
-...     max_length=model.decoder.config.max_position_embeddings,
-...     pad_token_id=processor.tokenizer.pad_token_id,
-...     eos_token_id=processor.tokenizer.eos_token_id,
-...     use_cache=True,
-...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
-...     return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
->>> print(processor.token2json(sequence))
-{'question': 'When is the coffee break?', 'answer': '11-14 to 11:39 a.m.'}
+
+outputs = model.generate(
+    input_ids=inputs.input_ids,
+    pixel_values=inputs.pixel_values,
+    max_length=512
+)
+answer = processor.decode(outputs[0], skip_special_tokens=True)
+print(answer)
 ```
 
-See the [model hub](https://huggingface.co/models?filter=donut) to look for Donut checkpoints.
+## Notes
 
-## Training
+- Use Donut for document image classification as shown below.
 
-We refer to the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut).
+
+    ```py
+    >>> import re
+    >>> from transformers import DonutProcessor, VisionEncoderDecoderModel
+    >>> from datasets import load_dataset
+    >>> import torch
+
+    >>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+    >>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+
+    >>> device = "cuda" if torch.cuda.is_available() else "cpu"
+    >>> model.to(device)  # doctest: +IGNORE_RESULT
+
+    >>> # load document image
+    >>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+    >>> image = dataset[1]["image"]
+
+    >>> # prepare decoder inputs
+    >>> task_prompt = "<s_rvlcdip>"
+    >>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
+
+    >>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+    >>> outputs = model.generate(
+    ...     pixel_values.to(device),
+    ...     decoder_input_ids=decoder_input_ids.to(device),
+    ...     max_length=model.decoder.config.max_position_embeddings,
+    ...     pad_token_id=processor.tokenizer.pad_token_id,
+    ...     eos_token_id=processor.tokenizer.eos_token_id,
+    ...     use_cache=True,
+    ...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
+    ...     return_dict_in_generate=True,
+    ... )
+
+    >>> sequence = processor.batch_decode(outputs.sequences)[0]
+    >>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+    >>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
+    >>> print(processor.token2json(sequence))
+    {'class': 'advertisement'}
+    ```
+
+- Use Donut for document parsing as shown below.
+
+    ```py
+    >>> import re
+    >>> from transformers import DonutProcessor, VisionEncoderDecoderModel
+    >>> from datasets import load_dataset
+    >>> import torch
+
+    >>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+    >>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+
+    >>> device = "cuda" if torch.cuda.is_available() else "cpu"
+    >>> model.to(device)  # doctest: +IGNORE_RESULT
+
+    >>> # load document image
+    >>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+    >>> image = dataset[2]["image"]
+
+    >>> # prepare decoder inputs
+    >>> task_prompt = "<s_cord-v2>"
+    >>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
+
+    >>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+    >>> outputs = model.generate(
+    ...     pixel_values.to(device),
+    ...     decoder_input_ids=decoder_input_ids.to(device),
+    ...     max_length=model.decoder.config.max_position_embeddings,
+    ...     pad_token_id=processor.tokenizer.pad_token_id,
+    ...     eos_token_id=processor.tokenizer.eos_token_id,
+    ...     use_cache=True,
+    ...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
+    ...     return_dict_in_generate=True,
+    ... )
+
+    >>> sequence = processor.batch_decode(outputs.sequences)[0]
+    >>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+    >>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
+    >>> print(processor.token2json(sequence))
+    {'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total':
+    {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
+    ```
 
 ## DonutSwinConfig