<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Nougat

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=">
</div>

## Overview

The Nougat model was proposed in [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by
Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat uses the same architecture as [Donut](donut): an image Transformer
encoder and an autoregressive text Transformer decoder that together translate scientific PDFs to Markdown, making their content easier to access.

The abstract from the paper is the following:

*Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/nougat_architecture.jpg"
alt="drawing" width="600"/>

<small> Nougat high-level overview. Taken from the <a href="https://arxiv.org/abs/2308.13418">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/facebookresearch/nougat).

## Usage tips

- The quickest way to get started with Nougat is the [tutorial
  notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Nougat), which show how to use the model
  at inference time and how to fine-tune it on custom data.
- Nougat is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework. The model is identical to [Donut](donut) in terms of architecture.

## Inference

Nougat's [`VisionEncoderDecoderModel`] accepts images as input and uses
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.

The [`NougatImageProcessor`] class preprocesses the input image, and
[`NougatTokenizerFast`] decodes the generated target tokens to the target string. The
[`NougatProcessor`] wraps the [`NougatImageProcessor`] and [`NougatTokenizerFast`] classes
into a single instance that both extracts the input features and decodes the predicted token ids.

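The division of labor between the image processor and the tokenizer can be sketched as a small function. This is an illustration only, not library code: `transcribe_with_components` is a hypothetical helper, and it assumes the two components were obtained from a loaded processor as `processor.image_processor` and `processor.tokenizer`.

```python
def transcribe_with_components(image, image_processor, tokenizer, model, device="cpu", max_new_tokens=30):
    """Run the two halves of NougatProcessor explicitly (hypothetical helper)."""
    # Step 1: the image processor turns the page image into model-ready pixel values.
    pixel_values = image_processor(image, return_tensors="pt").pixel_values

    # Step 2: the encoder-decoder model generates target token ids autoregressively.
    token_ids = model.generate(pixel_values.to(device), max_new_tokens=max_new_tokens)

    # Step 3: the tokenizer decodes the generated ids back into a string.
    return tokenizer.batch_decode(token_ids, skip_special_tokens=True)[0]
```

Passing the combined processor instead of its two components is the more common pattern; splitting them this way is mainly useful when only one of the two stages needs to be customized.
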
- Step-by-step PDF transcription

```py
>>> from huggingface_hub import hf_hub_download
>>> from PIL import Image

>>> from transformers import NougatProcessor, VisionEncoderDecoderModel
>>> import torch

>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base")
>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)  # doctest: +IGNORE_RESULT

>>> # prepare PDF image for the model
>>> filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
>>> image = Image.open(filepath)
>>> pixel_values = processor(image, return_tensors="pt").pixel_values

>>> # generate transcription (here we only generate 30 tokens)
>>> outputs = model.generate(
...     pixel_values.to(device),
...     min_length=1,
...     max_new_tokens=30,
...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
... )

>>> sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
>>> sequence = processor.post_process_generation(sequence, fix_markdown=False)
>>> # note: we use repr here so that the \n characters are printed; feel free to just print the sequence
>>> print(repr(sequence))
'\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@'
```

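To transcribe several pages at once, the same calls can be applied to a list of images. The following is a minimal sketch rather than part of the library: `transcribe_pages` is a hypothetical helper, and the `max_new_tokens=512` default is a placeholder, not a recommended setting. It assumes `processor` and `model` have been loaded with `from_pretrained` as in the step-by-step example.

```python
def transcribe_pages(images, processor, model, device="cpu", max_new_tokens=512):
    """Transcribe a list of page images into one string per page (hypothetical helper)."""
    # The processor preprocesses every page, so the pages can be stacked into one batch.
    pixel_values = processor(images, return_tensors="pt").pixel_values

    # Generate token ids for the whole batch, forbidding the unknown token
    # exactly as in the single-image example.
    outputs = model.generate(
        pixel_values.to(device),
        min_length=1,
        max_new_tokens=max_new_tokens,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    # Decode each generated sequence and apply Nougat's post-processing to it.
    sequences = processor.batch_decode(outputs, skip_special_tokens=True)
    return [processor.post_process_generation(s, fix_markdown=False) for s in sequences]
```

Called as `transcribe_pages([image], processor, model, device=device)`, it returns a list with one transcription per input page. Larger batches and longer generations both increase memory use, so the batch size may need to be tuned to the available hardware.
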
See the [model hub](https://huggingface.co/models?filter=nougat) to look for Nougat checkpoints.

## NougatImageProcessor

[[autodoc]] NougatImageProcessor
    - preprocess

## NougatTokenizerFast

[[autodoc]] NougatTokenizerFast

## NougatProcessor

[[autodoc]] NougatProcessor
    - __call__
    - from_pretrained
    - save_pretrained
    - batch_decode
    - decode
    - post_process_generation