Add Nougat (#25942)

* Add conversion script
* Add NougatImageProcessor
* Add crop margin
* More improvements
* Add docs, READMEs
* Remove print statements
* Include model_max_length
* Add NougatTokenizerFast
* Fix imports
* Improve postprocessing
* Improve image processor
* Fix image processor
* Improve normalize method
* More improvements
* More improvements
* Add processor, improve docs
* Simplify fast tokenizer
* Remove test file
* Fix docstrings
* Use NougatProcessor in conversion script
* Add is_levensthein_available
* Add tokenizer tests
* More improvements
* Use numpy instead of opencv
* Add is_cv2_available
* Fix cv2_available
* Add is_nltk_available
* Add image processor tests, improve crop_margin
* Add integration tests
* Improve integration test
* Use do_rescale instead of hacks, thanks Amy
* Remove random_padding
* Address comments
* Address more comments
* Add import
* Address more comments
* Address more comments
* Address comment
* Address comment
* Set max_model_input_sizes
* Add tests
* Add requires_backends
* Add Nougat to exotic tests
* Use to_pil_image
* Address comment regarding nltk
* Add NLTK
* Improve variable names, integration test
* Add test
* refactor, document, and test regexes
* remove named capture groups, add comments
* format
* add non-markdown fixed tokenization
* format
* correct flakyness of args parse
* add regex comments
* test functionalities for crop_image, align long axis and expected output
* add regex tests
* remove cv2 dependency
* test crop_margin equality between cv2 and python
* refactor table regexes to markdown add newline
* change print to log, improve doc
* fix high count tables correction
* address PR comments: naming, linting, asserts
* Address comments
* Add copied from
* Update conversion script
* Update conversion script to convert both small and base versions
* Add inference example
* Add more info
* Fix style
* Add require annotators to test
* Define all keyword arguments explicitly
* Move cv2 annotator
* Add tokenizer init method
* Transfer checkpoints
* Add reference to Donut
* Address comments
* Skip test
* Remove cv2 method
* Add copied from statements
* Use cached_property
* Fix docstring
* Add file to not doctested

---------

Co-authored-by: Pablo Montalvo <pablo.montalvo.leroux@gmail.com>
2025-07-31 02:02:21 +06:00 · 2023-09-26 07:06:04 +02:00 · 2023-09-26 07:06:04 +02:00 · ace74d16bd
commit ace74d16bd
parent 5e09af2acd
31 changed files with 2347 additions and 5 deletions
--- a/.circleci/create_circleci_config.py
+++ b/.circleci/create_circleci_config.py
@@ -466,13 +466,15 @@ exotic_models_job = CircleCIJob(
        "sudo apt install tesseract-ocr",
        "pip install -U --upgrade-strategy eager pytesseract",
        "pip install -U --upgrade-strategy eager natten",
-        # TODO (ydshieh): Remove this line once `https://github.com/facebookresearch/detectron2/issues/5010` is resolved
-        'pip install -U --upgrade-strategy eager "Pillow<10.0.0"',
+        "pip install -U --upgrade-strategy eager python-Levenshtein",
+        "pip install -U --upgrade-strategy eager opencv-python",
+        "pip install -U --upgrade-strategy eager nltk",
    ],
    tests_to_run=[
        "tests/models/*layoutlmv*",
        "tests/models/*nat",
        "tests/models/deta",
+        "tests/models/nougat",
    ],
    pytest_num_workers=1,
    pytest_options={"durations": 100},
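The three packages added to the exotic-models job (`python-Levenshtein`, `opencv-python`, `nltk`) are optional at runtime, and the commit message mentions availability guards such as `is_cv2_available` and `is_nltk_available`. As a rough illustration of how a guard of that kind can be written with the standard library alone, here is a minimal sketch. The names `is_package_available` and `requires` are hypothetical and are not taken from the transformers source; the real helpers may work differently.

```python
import importlib.util
from functools import wraps


def is_package_available(module_name: str) -> bool:
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def requires(module_name: str):
    """Hypothetical decorator: raise a clear ImportError when a backend is missing."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not is_package_available(module_name):
                raise ImportError(f"{func.__name__} requires the `{module_name}` package.")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires("json")  # stdlib module, used here as a backend that is always present
def dump(obj):
    import json

    return json.dumps(obj)


@requires("surely_not_installed_pkg")  # placeholder name for a missing backend
def needs_missing_backend():
    return "unreachable"
```

Note that the import name can differ from the pip name: `python-Levenshtein` is imported as `Levenshtein` and `opencv-python` as `cv2`, so a check like this would be given the import name.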
--- a/README.md
+++ b/README.md
@@ -426,6 +426,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
--- a/README_es.md
+++ b/README_es.md
@@ -402,6 +402,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
--- a/README_hd.md
+++ b/README_hd.md
@@ -374,6 +374,7 @@ conda install -c huggingface transformers
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (हुआवेई नूह के आर्क लैब से) साथ में कागज़ [NEZHA: चीनी भाषा समझ के लिए तंत्रिका प्रासंगिक प्रतिनिधित्व](https :/ /arxiv.org/abs/1909.00204) जुन्किउ वेई, ज़ियाओज़े रेन, ज़िआओगुआंग ली, वेनयोंग हुआंग, यी लियाओ, याशेंग वांग, जियाशू लिन, शिन जियांग, जिओ चेन और कुन लियू द्वारा।
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (फ्रॉम मेटा) साथ में पेपर [नो लैंग्वेज लेफ्ट बिहाइंड: स्केलिंग ह्यूमन-सेंटेड मशीन ट्रांसलेशन] (https://arxiv.org/abs/2207.04672) एनएलएलबी टीम द्वारा प्रकाशित।
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta से) the NLLB team. द्वाराअनुसंधान पत्र [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) के साथ जारी किया गया
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (Meta AI से) Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. द्वाराअनुसंधान पत्र [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) के साथ जारी किया गया
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (विस्कॉन्सिन विश्वविद्यालय - मैडिसन से) साथ में कागज [Nyströmformer: A Nyström- आधारित एल्गोरिथम आत्म-ध्यान का अनुमान लगाने के लिए ](https://arxiv.org/abs/2102.03902) युनयांग ज़िओंग, झानपेंग ज़ेंग, रुद्रसिस चक्रवर्ती, मिंगक्सिंग टैन, ग्लेन फंग, यिन ली, विकास सिंह द्वारा पोस्ट किया गया।
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs से) पेपर [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) जितेश जैन, जिआचेन ली, मांगटिक चिउ, अली हसनी, निकिता ओरलोव, हम्फ्री शि के द्वारा जारी किया गया है।
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
--- a/README_ja.md
+++ b/README_ja.md
@@ -436,6 +436,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noah’s Ark Lab から) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu から公開された研究論文: [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204)
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta から) the NLLB team から公開された研究論文: [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta から) the NLLB team. から公開された研究論文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (Meta AI から) Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. から公開された研究論文 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison から) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh から公開された研究論文: [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902)
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs から) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi から公開された研究論文: [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220)
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
--- a/README_ko.md
+++ b/README_ko.md
@@ -351,6 +351,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noah’s Ark Lab 에서) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 의 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 논문과 함께 발표했습니다.
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta 에서) the NLLB team 의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 논문과 함께 발표했습니다.
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta 에서 제공)은 the NLLB team.의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)논문과 함께 발표했습니다.
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (Meta AI 에서 제공)은 Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.의 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)논문과 함께 발표했습니다.
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison 에서) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 의 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 논문과 함께 발표했습니다.
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs 에서) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 의 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 논문과 함께 발표했습니다.
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -375,6 +375,7 @@ conda install -c huggingface transformers
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (来自华为诺亚方舟实验室) 伴随论文 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 由 Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 发布。
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (来自 Meta AI) 伴随论文 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) 由 Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic 发布。
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (来自 the University of Wisconsin - Madison) 伴随论文 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 由 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 发布。
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (来自 SHI Labs)  伴随论文 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 由 Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 发布。
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (来自 [s-JoL](https://huggingface.co/s-JoL)) 由 [Open-Llama](https://github.com/s-JoL/Open-Llama) 发布.
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -387,6 +387,7 @@ conda install -c huggingface transformers
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
 1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
+1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
 1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -685,6 +685,8 @@
        title: MatCha
      - local: model_doc/mgp-str
        title: MGP-STR
+      - local: model_doc/nougat
+        title: Nougat
      - local: model_doc/oneformer
        title: OneFormer
      - local: model_doc/owlvit
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -191,6 +191,7 @@ The documentation is organized into five sections:
 1. **[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
 1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
 1. **[NLLB-MOE](model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
+1. **[Nougat](model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
 1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
 1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
 1. **[OpenLlama](model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
@@ -411,6 +412,7 @@ Flax), PyTorch, and/or TensorFlow.
 |              NAT              |       ✅        |         ❌         |      ❌      |
 |             Nezha             |       ✅        |         ❌         |      ❌      |
 |           NLLB-MOE            |       ✅        |         ❌         |      ❌      |
+|            Nougat             |       ✅        |         ✅         |      ✅      |
 |         Nyströmformer         |       ✅        |         ❌         |      ❌      |
 |           OneFormer           |       ✅        |         ❌         |      ❌      |
 |          OpenAI GPT           |       ✅        |         ✅         |      ❌      |
--- a/docs/source/en/model_doc/nougat.md
+++ b/docs/source/en/model_doc/nougat.md
@@ -0,0 +1,109 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
+License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Nougat
+
+## Overview
+
+The Nougat model was proposed in [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by
+Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. Nougat uses the same architecture as [Donut](donut): an image Transformer
+encoder and an autoregressive text Transformer decoder that together translate scientific PDFs to Markdown, making their content easier to access.
+
+The abstract from the paper is the following:
+
+*Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.*
+
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/nougat_architecture.jpg"
+alt="drawing" width="600"/>
+
+<small> Nougat high-level overview. Taken from the <a href="https://arxiv.org/abs/2308.13418">original paper</a>. </small>
+
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
+[here](https://github.com/facebookresearch/nougat).
+
+Tips:
+
+- The quickest way to get started with Nougat is to check the [tutorial
+  notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Nougat), which show how to use the model
+  at inference time and how to fine-tune it on custom data.
+- Nougat is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework. The model is identical to [Donut](donut) in terms of architecture.
+
+## Inference
+
+Nougat's [`VisionEncoderDecoder`] model accepts images as input and makes use of
+[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
+
+The [`NougatImageProcessor`] class is responsible for preprocessing the input image and
+[`NougatTokenizerFast`] decodes the generated target tokens to the target string. The
+[`NougatProcessor`] wraps [`NougatImageProcessor`] and [`NougatTokenizerFast`] classes
+into a single instance to both extract the input features and decode the predicted token ids.
+
+- Step-by-step PDF transcription
+
+```py
+>>> from huggingface_hub import hf_hub_download
+>>> import re
+>>> from PIL import Image
+
+>>> from transformers import NougatProcessor, VisionEncoderDecoderModel
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base")
+>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
+
+>>> device = "cuda" if torch.cuda.is_available() else "cpu"
+>>> model.to(device)  # doctest: +IGNORE_RESULT
+
+>>> # prepare PDF image for the model
+>>> filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
+>>> image = Image.open(filepath)
+>>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+>>> # generate transcription (here we only generate 30 tokens)
+>>> outputs = model.generate(
+...     pixel_values.to(device),
+...     min_length=1,
+...     max_new_tokens=30,
+...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
+... )
+
+>>> sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+>>> sequence = processor.post_process_generation(sequence, fix_markdown=False)
+>>> # note: we use repr here so that the \n characters are visible; feel free to print the sequence directly
+>>> print(repr(sequence))
+'\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@'
+```
+
+See the [model hub](https://huggingface.co/models?filter=nougat) to look for Nougat checkpoints.
+
+## NougatImageProcessor
+
+[[autodoc]] NougatImageProcessor
+    - preprocess
+
+## NougatTokenizerFast
+
+[[autodoc]] NougatTokenizerFast
+
+## NougatProcessor
+
+[[autodoc]] NougatProcessor
+    - __call__
+    - from_pretrained
+    - save_pretrained
+    - batch_decode
+    - decode
+    - post_process_generation
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -453,6 +453,7 @@ _import_structure = {
    "models.nezha": ["NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP", "NezhaConfig"],
    "models.nllb": [],
    "models.nllb_moe": ["NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP", "NllbMoeConfig"],
+    "models.nougat": ["NougatProcessor"],
    "models.nystromformer": [
        "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "NystromformerConfig",
@@ -859,6 +860,7 @@ else:
    _import_structure["models.mt5"].append("MT5TokenizerFast")
    _import_structure["models.mvp"].append("MvpTokenizerFast")
    _import_structure["models.nllb"].append("NllbTokenizerFast")
+    _import_structure["models.nougat"].append("NougatTokenizerFast")
    _import_structure["models.openai"].append("OpenAIGPTTokenizerFast")
    _import_structure["models.pegasus"].append("PegasusTokenizerFast")
    _import_structure["models.realm"].append("RealmTokenizerFast")
@@ -973,6 +975,7 @@ else:
    _import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
    _import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"])
    _import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
+    _import_structure["models.nougat"].append("NougatImageProcessor")
    _import_structure["models.oneformer"].extend(["OneFormerImageProcessor"])
    _import_structure["models.owlvit"].extend(["OwlViTFeatureExtractor", "OwlViTImageProcessor"])
    _import_structure["models.perceiver"].extend(["PerceiverFeatureExtractor", "PerceiverImageProcessor"])
@@ -4566,6 +4569,7 @@ if TYPE_CHECKING:
    from .models.nat import NAT_PRETRAINED_CONFIG_ARCHIVE_MAP, NatConfig
    from .models.nezha import NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP, NezhaConfig
    from .models.nllb_moe import NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP, NllbMoeConfig
+    from .models.nougat import NougatProcessor
    from .models.nystromformer import NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, NystromformerConfig
    from .models.oneformer import ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, OneFormerConfig, OneFormerProcessor
    from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer
@@ -4944,6 +4948,7 @@ if TYPE_CHECKING:
        from .models.mt5 import MT5TokenizerFast
        from .models.mvp import MvpTokenizerFast
        from .models.nllb import NllbTokenizerFast
+        from .models.nougat import NougatTokenizerFast
        from .models.openai import OpenAIGPTTokenizerFast
        from .models.pegasus import PegasusTokenizerFast
        from .models.realm import RealmTokenizerFast
@@ -5029,6 +5034,7 @@ if TYPE_CHECKING:
        from .models.mobilenet_v1 import MobileNetV1FeatureExtractor, MobileNetV1ImageProcessor
        from .models.mobilenet_v2 import MobileNetV2FeatureExtractor, MobileNetV2ImageProcessor
        from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
+        from .models.nougat import NougatImageProcessor
        from .models.oneformer import OneFormerImageProcessor
        from .models.owlvit import OwlViTFeatureExtractor, OwlViTImageProcessor
        from .models.perceiver import PerceiverFeatureExtractor, PerceiverImageProcessor
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -147,6 +147,7 @@ from . import (
    nezha,
    nllb,
    nllb_moe,
+    nougat,
    nystromformer,
    oneformer,
    openai,
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -152,6 +152,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("nat", "NatConfig"),
        ("nezha", "NezhaConfig"),
        ("nllb-moe", "NllbMoeConfig"),
+        ("nougat", "VisionEncoderDecoderConfig"),
        ("nystromformer", "NystromformerConfig"),
        ("oneformer", "OneFormerConfig"),
        ("open-llama", "OpenLlamaConfig"),
@@ -580,6 +581,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("nezha", "Nezha"),
        ("nllb", "NLLB"),
        ("nllb-moe", "NLLB-MOE"),
+        ("nougat", "Nougat"),
        ("nystromformer", "Nyströmformer"),
        ("oneformer", "OneFormer"),
        ("open-llama", "OpenLlama"),
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -82,6 +82,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("mobilevit", "MobileViTImageProcessor"),
        ("mobilevitv2", "MobileViTImageProcessor"),
        ("nat", "ViTImageProcessor"),
+        ("nougat", "NougatImageProcessor"),
        ("oneformer", "OneFormerImageProcessor"),
        ("owlvit", "OwlViTImageProcessor"),
        ("perceiver", "PerceiverImageProcessor"),
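The two auto-mapping hunks register the `"nougat"` model type against an existing configuration class (`VisionEncoderDecoderConfig`) and against the new `NougatImageProcessor`. A minimal sketch of what a lookup over such tables could look like is given below. The tables are short excerpts copied from the hunks, and `resolve_class_names` is a hypothetical helper; how the Auto classes actually consume these tables is not shown in this diff.

```python
from collections import OrderedDict

# Excerpts copied from the hunks above; the real tables contain many more entries.
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        ("nllb-moe", "NllbMoeConfig"),
        ("nougat", "VisionEncoderDecoderConfig"),
        ("nystromformer", "NystromformerConfig"),
    ]
)
IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        ("nat", "ViTImageProcessor"),
        ("nougat", "NougatImageProcessor"),
        ("oneformer", "OneFormerImageProcessor"),
    ]
)


def resolve_class_names(model_type: str) -> dict:
    """Hypothetical helper: return the class names registered for a model type."""
    if model_type not in CONFIG_MAPPING_NAMES:
        raise KeyError(f"Unknown model type: {model_type!r}")
    return {
        "config": CONFIG_MAPPING_NAMES[model_type],
        # Not every model type registers an image processor, hence .get().
        "image_processor": IMAGE_PROCESSOR_MAPPING_NAMES.get(model_type),
    }
```

In this sketch, `resolve_class_names("nougat")` pairs the shared encoder-decoder configuration with the Nougat-specific image processor, which mirrors the statement in the docs that Nougat reuses the VisionEncoderDecoder framework.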
--- a/src/transformers/models/donut/image_processing_donut.py
+++ b/src/transformers/models/donut/image_processing_donut.py
@@ -265,8 +265,7 @@ class DonutImageProcessor(BaseImageProcessor):
        **kwargs,
    ) -> np.ndarray:
        """
-        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
-        resized to keep the input aspect ratio.
+        Resizes `image` to `(height, width)` specified by `size` using the PIL library.

        Args:
            image (`np.ndarray`):
--- a/src/transformers/models/nougat/__init__.py
+++ b/src/transformers/models/nougat/__init__.py
@@ -0,0 +1,63 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_vision_available
+
+
+_import_structure = {
+    "processing_nougat": ["NougatProcessor"],
+}
+
+try:
+    if not is_tokenizers_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["tokenization_nougat_fast"] = ["NougatTokenizerFast"]
+
+try:
+    if not is_vision_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["image_processing_nougat"] = ["NougatImageProcessor"]
+
+
+if TYPE_CHECKING:
+    from .processing_nougat import NougatProcessor
+
+    try:
+        if not is_tokenizers_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .tokenization_nougat_fast import NougatTokenizerFast
+
+    try:
+        if not is_vision_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .image_processing_nougat import NougatImageProcessor
+
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
--- a/src/transformers/models/nougat/convert_nougat_to_hf.py
+++ b/src/transformers/models/nougat/convert_nougat_to_hf.py
@@ -0,0 +1,282 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert Nougat checkpoints using the original `nougat` library. URL:
+https://github.com/facebookresearch/nougat/tree/main"""
+
+import argparse
+
+import torch
+from huggingface_hub import hf_hub_download
+from nougat import NougatModel
+from nougat.dataset.rasterize import rasterize_paper
+from nougat.utils.checkpoint import get_checkpoint
+from PIL import Image
+
+from transformers import (
+    DonutSwinConfig,
+    DonutSwinModel,
+    MBartConfig,
+    MBartForCausalLM,
+    NougatImageProcessor,
+    NougatProcessor,
+    NougatTokenizerFast,
+    VisionEncoderDecoderModel,
+)
+
+
+def get_configs(model):
+    original_config = model.config
+
+    encoder_config = DonutSwinConfig(
+        image_size=original_config.input_size,
+        patch_size=4,
+        depths=original_config.encoder_layer,
+        num_heads=[4, 8, 16, 32],
+        window_size=original_config.window_size,
+        embed_dim=128,
+    )
+    decoder_config = MBartConfig(
+        is_decoder=True,
+        is_encoder_decoder=False,
+        add_cross_attention=True,
+        decoder_layers=original_config.decoder_layer,
+        max_position_embeddings=original_config.max_position_embeddings,
+        vocab_size=len(
+            model.decoder.tokenizer
+        ),  # several special tokens are added to the vocab of XLMRobertaTokenizer, see repo on the hub (added_tokens.json)
+        scale_embedding=True,
+        add_final_layer_norm=True,
+        tie_word_embeddings=False,
+    )
+
+    return encoder_config, decoder_config
+
+
+# Copied from transformers.models.donut.convert_donut_to_pytorch.rename_key
+def rename_key(name):
+    if "encoder.model" in name:
+        name = name.replace("encoder.model", "encoder")
+    if "decoder.model" in name:
+        name = name.replace("decoder.model", "decoder")
+    if "patch_embed.proj" in name:
+        name = name.replace("patch_embed.proj", "embeddings.patch_embeddings.projection")
+    if "patch_embed.norm" in name:
+        name = name.replace("patch_embed.norm", "embeddings.norm")
+    if name.startswith("encoder"):
+        if "layers" in name:
+            name = "encoder." + name
+        if "attn.proj" in name:
+            name = name.replace("attn.proj", "attention.output.dense")
+        if "attn" in name and "mask" not in name:
+            name = name.replace("attn", "attention.self")
+        if "norm1" in name:
+            name = name.replace("norm1", "layernorm_before")
+        if "norm2" in name:
+            name = name.replace("norm2", "layernorm_after")
+        if "mlp.fc1" in name:
+            name = name.replace("mlp.fc1", "intermediate.dense")
+        if "mlp.fc2" in name:
+            name = name.replace("mlp.fc2", "output.dense")
+
+        if name == "encoder.norm.weight":
+            name = "encoder.layernorm.weight"
+        if name == "encoder.norm.bias":
+            name = "encoder.layernorm.bias"
+
+    return name
+
+
+# Copied from transformers.models.donut.convert_donut_to_pytorch.convert_state_dict
+def convert_state_dict(orig_state_dict, model):
+    for key in orig_state_dict.copy().keys():
+        val = orig_state_dict.pop(key)
+
+        if "qkv" in key:
+            key_split = key.split(".")
+            layer_num = int(key_split[3])
+            block_num = int(key_split[5])
+            dim = model.encoder.encoder.layers[layer_num].blocks[block_num].attention.self.all_head_size
+
+            if "weight" in key:
+                orig_state_dict[
+                    f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.query.weight"
+                ] = val[:dim, :]
+                orig_state_dict[
+                    f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.key.weight"
+                ] = val[dim : dim * 2, :]
+                orig_state_dict[
+                    f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.value.weight"
+                ] = val[-dim:, :]
+            else:
+                orig_state_dict[
+                    f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.query.bias"
+                ] = val[:dim]
+                orig_state_dict[
+                    f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.key.bias"
+                ] = val[dim : dim * 2]
+                orig_state_dict[
+                    f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.value.bias"
+                ] = val[-dim:]
+        elif "attn_mask" in key or key in ["encoder.model.norm.weight", "encoder.model.norm.bias"]:
+            # HuggingFace implementation doesn't use attn_mask buffer
+            # and model doesn't use final LayerNorms for the encoder
+            pass
+        else:
+            orig_state_dict[rename_key(key)] = val
+
+    return orig_state_dict
+
+
+def convert_nougat_checkpoint(model_tag, pytorch_dump_folder_path=None, push_to_hub=False):
+    # load original model
+    checkpoint_path = get_checkpoint(None, model_tag)
+    original_model = NougatModel.from_pretrained(checkpoint_path)
+    original_model.eval()
+
+    # load HuggingFace model
+    encoder_config, decoder_config = get_configs(original_model)
+    encoder = DonutSwinModel(encoder_config)
+    decoder = MBartForCausalLM(decoder_config)
+    model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)
+    model.eval()
+
+    state_dict = original_model.state_dict()
+    new_state_dict = convert_state_dict(state_dict, model)
+    model.load_state_dict(new_state_dict)
+
+    # verify results on PDF
+    filepath = hf_hub_download(repo_id="ysharma/nougat", filename="input/nougat.pdf", repo_type="space")
+    images = rasterize_paper(pdf=filepath, return_pil=True)
+    image = Image.open(images[0])
+
+    tokenizer_file = checkpoint_path / "tokenizer.json"
+    tokenizer = NougatTokenizerFast(tokenizer_file=str(tokenizer_file))
+    tokenizer.pad_token = "<pad>"
+    tokenizer.bos_token = "<s>"
+    tokenizer.eos_token = "</s>"
+    tokenizer.unk_token = "<unk>"
+    tokenizer.model_max_length = original_model.config.max_length
+
+    size = {"height": original_model.config.input_size[0], "width": original_model.config.input_size[1]}
+    image_processor = NougatImageProcessor(
+        do_align_long_axis=original_model.config.align_long_axis,
+        size=size,
+    )
+    processor = NougatProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+    # verify pixel_values
+    pixel_values = processor(image, return_tensors="pt").pixel_values
+    original_pixel_values = original_model.encoder.prepare_input(image).unsqueeze(0)
+
+    assert torch.allclose(original_pixel_values, pixel_values)
+
+    # verify patch embeddings
+    original_patch_embed = original_model.encoder.model.patch_embed(pixel_values)
+    patch_embeddings, _ = model.encoder.embeddings(pixel_values)
+    assert torch.allclose(original_patch_embed, patch_embeddings)
+
+    # verify encoder hidden states
+    original_last_hidden_state = original_model.encoder(pixel_values)
+    last_hidden_state = model.encoder(pixel_values).last_hidden_state
+    assert torch.allclose(original_last_hidden_state, last_hidden_state, atol=1e-2)
+
+    # NOTE original model does not use tied weights for embeddings of decoder
+    original_embeddings = original_model.decoder.model.model.decoder.embed_tokens
+    embeddings = model.decoder.model.decoder.embed_tokens
+    assert torch.allclose(original_embeddings.weight, embeddings.weight, atol=1e-3)
+
+    # verify decoder hidden states
+    prompt = "hello world"
+    decoder_input_ids = original_model.decoder.tokenizer(
+        prompt, add_special_tokens=False, return_tensors="pt"
+    ).input_ids
+    decoder_attention_mask = torch.ones_like(decoder_input_ids)
+    original_logits = original_model(
+        image_tensors=pixel_values, decoder_input_ids=decoder_input_ids, attention_mask=decoder_attention_mask
+    ).logits
+    logits = model(
+        pixel_values,
+        decoder_input_ids=decoder_input_ids[:, :-1],
+        decoder_attention_mask=decoder_attention_mask[:, :-1],
+    ).logits
+    assert torch.allclose(original_logits, logits, atol=1e-3)
+
+    # verify generation
+    outputs = model.generate(
+        pixel_values,
+        min_length=1,
+        max_length=30,
+        pad_token_id=tokenizer.pad_token_id,
+        eos_token_id=tokenizer.eos_token_id,
+        use_cache=True,
+        bad_words_ids=[
+            [tokenizer.unk_token_id],
+        ],
+        return_dict_in_generate=True,
+        do_sample=False,
+    )
+    generated = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
+
+    if model_tag == "0.1.0-base":
+        expected_generation = "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblec"
+    elif model_tag == "0.1.0-small":
+        expected_generation = (
+            "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lble"
+        )
+    else:
+        raise ValueError(f"Unexpected model tag: {model_tag}")
+
+    assert generated == expected_generation
+    print("Looks ok!")
+
+    if pytorch_dump_folder_path is not None:
+        print(f"Saving model and processor to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        processor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        tag_to_name = {"0.1.0-base": "nougat-base", "0.1.0-small": "nougat-small"}
+        model_name = tag_to_name[model_tag]
+
+        model.push_to_hub(f"facebook/{model_name}")
+        processor.push_to_hub(f"facebook/{model_name}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Optional parameters (each has a default)
+    parser.add_argument(
+        "--model_tag",
+        default="0.1.0-base",
+        required=False,
+        type=str,
+        choices=["0.1.0-base", "0.1.0-small"],
+        help="Tag of the original model you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default=None,
+        required=False,
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument(
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model and processor to the 🤗 hub.",
+    )
+
+    args = parser.parse_args()
+    convert_nougat_checkpoint(args.model_tag, args.pytorch_dump_folder_path, args.push_to_hub)
--- a/src/transformers/models/nougat/image_processing_nougat.py
+++ b/src/transformers/models/nougat/image_processing_nougat.py
@@ -0,0 +1,510 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Image processor class for Nougat."""
+
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import (
+    get_resize_output_image_size,
+    pad,
+    resize,
+    to_channel_dimension_format,
+    to_pil_image,
+)
+from ...image_utils import (
+    IMAGENET_DEFAULT_MEAN,
+    IMAGENET_DEFAULT_STD,
+    ChannelDimension,
+    ImageInput,
+    PILImageResampling,
+    get_image_size,
+    infer_channel_dimension_format,
+    is_scaled_image,
+    make_list_of_images,
+    to_numpy_array,
+    valid_images,
+)
+from ...utils import TensorType, logging
+from ...utils.import_utils import is_cv2_available, is_vision_available
+
+
+logger = logging.get_logger(__name__)
+
+
+if is_cv2_available():
+    pass
+
+
+if is_vision_available():
+    import PIL
+
+
+class NougatImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs a Nougat image processor.
+
+    Args:
+        do_crop_margin (`bool`, *optional*, defaults to `True`):
+            Whether to crop the image margins.
+        do_resize (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
+            `do_resize` in the `preprocess` method.
+        size (`Dict[str, int]` *optional*, defaults to `{"height": 896, "width": 672}`):
+            Size of the image after resizing. Can be overridden by `size` in the `preprocess` method.
+        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
+        do_thumbnail (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image using the thumbnail method.
+        do_align_long_axis (`bool`, *optional*, defaults to `False`):
+            Whether to align the long axis of the image with the long axis of `size` by rotating by 90 degrees.
+        do_pad (`bool`, *optional*, defaults to `True`):
+            Whether to pad the images to the largest image size in the batch.
+        do_rescale (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
+            parameter in the `preprocess` method.
+        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
+            Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
+            `preprocess` method.
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
+        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_MEAN`):
+            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
+            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
+        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
+            Image standard deviation.
+    """
+
+    model_input_names = ["pixel_values"]
+
+    def __init__(
+        self,
+        do_crop_margin: bool = True,
+        do_resize: bool = True,
+        size: Dict[str, int] = None,
+        resample: PILImageResampling = PILImageResampling.BILINEAR,
+        do_thumbnail: bool = True,
+        do_align_long_axis: bool = False,
+        do_pad: bool = True,
+        do_rescale: bool = True,
+        rescale_factor: Union[int, float] = 1 / 255,
+        do_normalize: bool = True,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+
+        size = size if size is not None else {"height": 896, "width": 672}
+        size = get_size_dict(size)
+
+        self.do_crop_margin = do_crop_margin
+        self.do_resize = do_resize
+        self.size = size
+        self.resample = resample
+        self.do_thumbnail = do_thumbnail
+        self.do_align_long_axis = do_align_long_axis
+        self.do_pad = do_pad
+        self.do_rescale = do_rescale
+        self.rescale_factor = rescale_factor
+        self.do_normalize = do_normalize
+        self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
+        self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
+
+    def python_find_non_zero(self, image: np.ndarray):
+        """This is a reimplementation of a findNonZero function equivalent to cv2."""
+        non_zero_indices = np.column_stack(np.nonzero(image))
+        idxvec = non_zero_indices[:, [1, 0]]
+        idxvec = idxvec.reshape(-1, 1, 2)
+        return idxvec
+
+    def python_bounding_rect(self, coordinates):
+        """This is a reimplementation of a BoundingRect function equivalent to cv2."""
+        min_values = np.min(coordinates, axis=(0, 1)).astype(int)
+        max_values = np.max(coordinates, axis=(0, 1)).astype(int)
+        x_min, y_min = min_values[0], min_values[1]
+        width = max_values[0] - x_min + 1
+        height = max_values[1] - y_min + 1
+        return x_min, y_min, width, height
+
+    def crop_margin(
+        self,
+        image: np.ndarray,
+        gray_threshold: int = 200,
+        data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+    ) -> np.ndarray:
+        """
+        Crops the margin of the image. Pixels whose min-max normalized grayscale value is below the threshold are
+        treated as content, and the image is cropped to the bounding box of those pixels.
+
+        Args:
+            image (`np.ndarray`):
+                The image to be cropped.
+            gray_threshold (`int`, *optional*, defaults to `200`):
+                Value below which pixels are considered to be gray.
+            data_format (`ChannelDimension`, *optional*):
+                The channel dimension format of the output image. If unset, will use the inferred format from the
+                input.
+            input_data_format (`ChannelDimension`, *optional*):
+                The channel dimension format of the input image. If unset, will use the inferred format from the input.
+        """
+        if input_data_format is None:
+            input_data_format = infer_channel_dimension_format(image)
+
+        image = to_pil_image(image, input_data_format=input_data_format)
+        data = np.array(image.convert("L")).astype(np.uint8)
+        max_val = data.max()
+        min_val = data.min()
+        if max_val == min_val:
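+            # PIL yields channels-last data, so restore the input layout before any further conversion.
+            image = np.array(image).astype(np.uint8)
+            image = to_channel_dimension_format(image, input_data_format, ChannelDimension.LAST)
+            if data_format is not None:
+                image = to_channel_dimension_format(image, data_format, input_data_format)
+
+            return image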
+        data = (data - min_val) / (max_val - min_val) * 255
+        gray = data < gray_threshold
+        coords = self.python_find_non_zero(gray)
+        x_min, y_min, width, height = self.python_bounding_rect(coords)
+        image = image.crop((x_min, y_min, x_min + width, y_min + height))
+        image = np.array(image).astype(np.uint8)
+        image = to_channel_dimension_format(image, input_data_format, ChannelDimension.LAST)
+
+        image = (
+            to_channel_dimension_format(image, data_format, input_data_format) if data_format is not None else image
+        )
+
+        return image
+
+    # Copied from transformers.models.donut.image_processing_donut.DonutImageProcessor.align_long_axis
+    def align_long_axis(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+    ) -> np.ndarray:
+        """
+        Align the long axis of the image to the longest axis of the specified size.
+
+        Args:
+            image (`np.ndarray`):
+                The image to be aligned.
+            size (`Dict[str, int]`):
+                The size `{"height": h, "width": w}` to align the long axis to.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The data format of the output image. If unset, the same format as the input image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+
+        Returns:
+            `np.ndarray`: The aligned image.
+        """
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
+        output_height, output_width = size["height"], size["width"]
+
+        if (output_width < output_height and input_width > input_height) or (
+            output_width > output_height and input_width < input_height
+        ):
+            image = np.rot90(image, 3)
+
+        if data_format is not None:
+            image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+
+        return image
+
+    def pad_image(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+    ) -> np.ndarray:
+        """
+        Pad the image to the specified size at the top, bottom, left and right.
+
+        Args:
+            image (`np.ndarray`):
+                The image to be padded.
+            size (`Dict[str, int]`):
+                The size `{"height": h, "width": w}` to pad the image to.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The data format of the output image. If unset, the same format as the input image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+        """
+        output_height, output_width = size["height"], size["width"]
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
+
+        delta_width = output_width - input_width
+        delta_height = output_height - input_height
+
+        pad_top = delta_height // 2
+        pad_left = delta_width // 2
+
+        pad_bottom = delta_height - pad_top
+        pad_right = delta_width - pad_left
+
+        padding = ((pad_top, pad_bottom), (pad_left, pad_right))
+        return pad(image, padding, data_format=data_format, input_data_format=input_data_format)
+
+    # Copied from transformers.models.donut.image_processing_donut.DonutImageProcessor.thumbnail
+    def thumbnail(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Resize the image to make a thumbnail. The image is resized so that no dimension is larger than any
+        corresponding dimension of the specified size.
+
+        Args:
+            image (`np.ndarray`):
+                The image to be resized.
+            size (`Dict[str, int]`):
+                The size `{"height": h, "width": w}` to resize the image to.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+                The resampling filter to use.
+            data_format (`Optional[Union[str, ChannelDimension]]`, *optional*):
+                The data format of the output image. If unset, the same format as the input image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+        """
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
+        output_height, output_width = size["height"], size["width"]
+
+        # We always resize to the smallest of either the input or output size.
+        height = min(input_height, output_height)
+        width = min(input_width, output_width)
+
+        if height == input_height and width == input_width:
+            return image
+
+        if input_height > input_width:
+            width = int(input_width * height / input_height)
+        elif input_width > input_height:
+            height = int(input_height * width / input_width)
+
+        return resize(
+            image,
+            size=(height, width),
+            resample=resample,
+            reducing_gap=2.0,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
+        )
+
+    # Copied from transformers.models.donut.image_processing_donut.DonutImageProcessor.resize
+    def resize(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Resizes `image` to `(height, width)` specified by `size` using the PIL library.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Size of the output image.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+                Resampling filter to use when resizing the image.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+        """
+        size = get_size_dict(size)
+        shortest_edge = min(size["height"], size["width"])
+        output_size = get_resize_output_image_size(
+            image, size=shortest_edge, default_to_square=False, input_data_format=input_data_format
+        )
+        resized_image = resize(
+            image,
+            size=output_size,
+            resample=resample,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
+        )
+        return resized_image
+
+    def preprocess(
+        self,
+        images: ImageInput,
+        do_crop_margin: bool = None,
+        do_resize: bool = None,
+        size: Dict[str, int] = None,
+        resample: PILImageResampling = None,
+        do_thumbnail: bool = None,
+        do_align_long_axis: bool = None,
+        do_pad: bool = None,
+        do_rescale: bool = None,
+        rescale_factor: Union[int, float] = None,
+        do_normalize: bool = None,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Preprocess an image or batch of images.
+
+        Args:
+            images (`ImageInput`):
+                Image to preprocess. Expects a single image or a batch of images with pixel values from 0 to 255.
+            do_crop_margin (`bool`, *optional*, defaults to `self.do_crop_margin`):
+                Whether to crop the image margins.
+            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                Whether to resize the image.
+            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
+                Size of the image after resizing. Shortest edge of the image is resized to min(size["height"],
+                size["width"]) with the longest edge resized to keep the input aspect ratio.
+            resample (`int`, *optional*, defaults to `self.resample`):
+                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
+                has an effect if `do_resize` is set to `True`.
+            do_thumbnail (`bool`, *optional*, defaults to `self.do_thumbnail`):
+                Whether to resize the image using the thumbnail method.
+            do_align_long_axis (`bool`, *optional*, defaults to `self.do_align_long_axis`):
+                Whether to align the long axis of the image with the long axis of `size` by rotating by 90 degrees.
+            do_pad (`bool`, *optional*, defaults to `self.do_pad`):
+                Whether to pad the images to the largest image size in the batch.
+            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                Whether to rescale the image by the specified scale `rescale_factor`.
+            rescale_factor (`int` or `float`, *optional*, defaults to `self.rescale_factor`):
+                Scale factor to use if rescaling the image.
+            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                Whether to normalize the image.
+            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean to use for normalization.
+            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation to use for normalization.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: defaults to the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+        """
+        do_crop_margin = do_crop_margin if do_crop_margin is not None else self.do_crop_margin
+        do_resize = do_resize if do_resize is not None else self.do_resize
+        size = size if size is not None else self.size
+        resample = resample if resample is not None else self.resample
+        do_thumbnail = do_thumbnail if do_thumbnail is not None else self.do_thumbnail
+        do_align_long_axis = do_align_long_axis if do_align_long_axis is not None else self.do_align_long_axis
+        do_pad = do_pad if do_pad is not None else self.do_pad
+        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+
+        images = make_list_of_images(images)
+
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
+
+        if do_resize and size is None:
+            raise ValueError("Size must be specified if do_resize is True.")
+
+        if do_pad and size is None:
+            raise ValueError("Size must be specified if do_pad is True.")
+
+        if do_rescale and rescale_factor is None:
+            raise ValueError("Rescale factor must be specified if do_rescale is True.")
+
+        if do_normalize and (image_mean is None or image_std is None):
+            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+
+        # All transformations expect numpy arrays.
+        images = [to_numpy_array(image) for image in images]
+
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
+        if do_crop_margin:
+            images = [self.crop_margin(image, input_data_format=input_data_format) for image in images]
+
+        if do_align_long_axis:
+            images = [self.align_long_axis(image, size=size, input_data_format=input_data_format) for image in images]
+
+        if do_resize:
+            images = [
+                self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        if do_thumbnail:
+            images = [self.thumbnail(image=image, size=size, input_data_format=input_data_format) for image in images]
+
+        if do_pad:
+            images = [self.pad_image(image=image, size=size, input_data_format=input_data_format) for image in images]
+
+        if do_rescale:
+            images = [
+                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        if do_normalize:
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
+
+        data = {"pixel_values": images}
+        return BatchFeature(data=data, tensor_type=return_tensors)
--- a/src/transformers/models/nougat/processing_nougat.py
+++ b/src/transformers/models/nougat/processing_nougat.py
@@ -0,0 +1,160 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Processor class for Nougat.
+"""
+
+from typing import Dict, List, Optional, Union
+
+from transformers.image_utils import ChannelDimension, PILImageResampling
+from transformers.tokenization_utils_base import PreTokenizedInput, TextInput, TruncationStrategy
+
+from ...processing_utils import ProcessorMixin
+from ...utils import PaddingStrategy, TensorType
+
+
+class NougatProcessor(ProcessorMixin):
+    r"""
+    Constructs a Nougat processor which wraps a Nougat image processor and a Nougat tokenizer into a single processor.
+
+    [`NougatProcessor`] offers all the functionalities of [`NougatImageProcessor`] and [`NougatTokenizerFast`]. See the
+    [`~NougatProcessor.__call__`] and [`~NougatProcessor.decode`] for more information.
+
+    Args:
+        image_processor ([`NougatImageProcessor`]):
+            An instance of [`NougatImageProcessor`]. The image processor is a required input.
+        tokenizer ([`NougatTokenizerFast`]):
+            An instance of [`NougatTokenizerFast`]. The tokenizer is a required input.
+    """
+    attributes = ["image_processor", "tokenizer"]
+    image_processor_class = "AutoImageProcessor"
+    tokenizer_class = "AutoTokenizer"
+
+    def __init__(self, image_processor, tokenizer):
+        super().__init__(image_processor, tokenizer)
+        self.current_processor = self.image_processor
+
+    def __call__(
+        self,
+        images=None,
+        text=None,
+        do_crop_margin: bool = None,
+        do_resize: bool = None,
+        size: Dict[str, int] = None,
+        resample: PILImageResampling = None,
+        do_thumbnail: bool = None,
+        do_align_long_axis: bool = None,
+        do_pad: bool = None,
+        do_rescale: bool = None,
+        rescale_factor: Union[int, float] = None,
+        do_normalize: bool = None,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
+        text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+        text_pair_target: Optional[
+            Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
+        ] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TruncationStrategy] = None,
+        max_length: Optional[int] = None,
+        stride: int = 0,
+        is_split_into_words: bool = False,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_overflowing_tokens: bool = False,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+    ):
+        if images is None and text is None:
+            raise ValueError("You need to specify either an `images` or `text` input to process.")
+
+        if images is not None:
+            inputs = self.image_processor(
+                images,
+                do_crop_margin=do_crop_margin,
+                do_resize=do_resize,
+                size=size,
+                resample=resample,
+                do_thumbnail=do_thumbnail,
+                do_align_long_axis=do_align_long_axis,
+                do_pad=do_pad,
+                do_rescale=do_rescale,
+                rescale_factor=rescale_factor,
+                do_normalize=do_normalize,
+                image_mean=image_mean,
+                image_std=image_std,
+                return_tensors=return_tensors,
+                data_format=data_format,
+                input_data_format=input_data_format,
+            )
+        if text is not None:
+            encodings = self.tokenizer(
+                text,
+                text_pair=text_pair,
+                text_target=text_target,
+                text_pair_target=text_pair_target,
+                add_special_tokens=add_special_tokens,
+                padding=padding,
+                truncation=truncation,
+                max_length=max_length,
+                stride=stride,
+                is_split_into_words=is_split_into_words,
+                pad_to_multiple_of=pad_to_multiple_of,
+                return_tensors=return_tensors,
+                return_token_type_ids=return_token_type_ids,
+                return_attention_mask=return_attention_mask,
+                return_overflowing_tokens=return_overflowing_tokens,
+                return_special_tokens_mask=return_special_tokens_mask,
+                return_offsets_mapping=return_offsets_mapping,
+                return_length=return_length,
+                verbose=verbose,
+            )
+
+        if text is None:
+            return inputs
+        elif images is None:
+            return encodings
+        else:
+            inputs["labels"] = encodings["input_ids"]
+            return inputs
+
+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to NougatTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
+        refer to the docstring of this method for more information.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to NougatTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer
+        to the docstring of this method for more information.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
+
+    def post_process_generation(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to NougatTokenizerFast's
+        [`~NougatTokenizerFast.post_process_generation`]. Please refer to the docstring of this method for more
+        information.
+        """
+        return self.tokenizer.post_process_generation(*args, **kwargs)
--- a/src/transformers/models/nougat/tokenization_nougat_fast.py
+++ b/src/transformers/models/nougat/tokenization_nougat_fast.py
@@ -0,0 +1,634 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fast tokenizer class for Nougat.
+"""
+import re
+from functools import partial
+from multiprocessing import Pool
+from typing import List, Union
+
+import numpy as np
+
+from transformers.tokenization_utils_base import INIT_TOKENIZER_DOCSTRING
+from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
+from transformers.utils import add_end_docstrings
+
+from ...utils import is_levenshtein_available, is_nltk_available, logging, requires_backends
+
+
+if is_levenshtein_available():
+    from Levenshtein import ratio
+
+if is_nltk_available():
+    import nltk
+
+
+logger = logging.get_logger(__name__)
+
+
+INIT_TOKENIZER_DOCSTRING += """
+        tokenizer_object ([`tokenizers.Tokenizer`]):
+            A [`tokenizers.Tokenizer`] object from 🤗 tokenizers to instantiate from. See [Using tokenizers from 🤗
+            tokenizers](../fast_tokenizers) for more information.
+        tokenizer_file ([`str`]):
+            A path to a local JSON file representing a previously serialized [`tokenizers.Tokenizer`] object from 🤗
+            tokenizers.
+"""
+
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "tokenizer_file": {
+        "facebook/nougat-base": "https://huggingface.co/facebook/nougat-base/tokenizer/blob/main/tokenizer.json",
+    },
+}
+
+VOCAB_FILES_NAMES = {"tokenizer_file": "tokenizer.json"}
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"facebook/nougat-base": 3584}
+
+
+def markdown_compatible(text: str) -> str:
+    """
+    Make text compatible with Markdown formatting.
+
+    This function makes various text formatting adjustments to make it compatible with Markdown.
+
+    Args:
+        text (`str`):
+            The input text to be made Markdown-compatible.
+
+    Returns:
+        `str`: The Markdown-compatible text.
+    """
+    # equation tag
+    # Replace lines that start with a pattern like (decimal) \[some text\] with \[[some text] \tag{decimal}\].
+    text = re.sub(r"^\(([\d.]+[a-zA-Z]?)\) \\\[(.+?)\\\]$", r"\[\2 \\tag{\1}\]", text, flags=re.M)
+    # Replace lines that start with a pattern like \[some text\] (decimal)  with \[[some text] \tag{decimal}\].
+    text = re.sub(r"^\\\[(.+?)\\\] \(([\d.]+[a-zA-Z]?)\)$", r"\[\1 \\tag{\2}\]", text, flags=re.M)
+    # Replace lines that start with a pattern like \[some text\] (digits) \[another text\]  with \[[some text] \tag{digits}\] [another text].
+    text = re.sub(
+        r"^\\\[(.+?)\\\] \(([\d.]+[a-zA-Z]?)\) (\\\[.+?\\\])$",
+        r"\[\1 \\tag{\2}\] \3",
+        text,
+        flags=re.M,
+    )
+    # multi line
+    text = text.replace(r"\. ", ". ")
+    # bold formatting
+    text = text.replace(r"\bm{", r"\mathbf{").replace(r"{\\bm ", r"\mathbf{")
+    text = re.sub(r"\\mbox{ ?\\boldmath\$(.*?)\$}", r"\\mathbf{\1}", text)
+    # Reformat urls (http, ftp and https only) to markdown [url](url) clickable format
+    text = re.sub(
+        r"((?:http|ftp|https):\/\/(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-]))",
+        r"[\1](\1)",
+        text,
+    )
+    # algorithms
+    text = re.sub(r"```\s*(.+?)\s*```", r"```\n\1\n```", text, flags=re.S)
+
+    return text
+
+
+def normalize_list_like_lines(generation):
+    """
+    Normalize lines in the given text that resemble list items. The function looks for lines that start optionally with
+    '-' or '*', possibly followed by Roman numerals or digits indicating nesting levels. The function reformats such
+    lines to make them more structured.
+
+    Args:
+        generation (str): The input text containing lines that need to be normalized.
+
+    Returns:
+        str: The input text with the list-like lines normalized.
+
+    Note:
+        The function uses regular expressions to identify and reformat the list-like lines. The patterns capture
+        optional bullet points, nesting levels indicated by numerals, and the actual list item content. The
+        normalization adjusts the bullet point style and nesting levels based on the captured patterns.
+    """
+
+    # This matches lines starting with - or *, not followed by - or * (lists)
+    # that are then numbered by digits \d or roman numerals (one or more)
+    # and then, optional additional numbering of this line is captured
+    # this is then fed to re.finditer.
+    pattern = r"(?:^)(-|\*)?(?!-|\*) ?((?:\d|[ixv])+ )?.+? (-|\*) (((?:\d|[ixv])+)\.(\d|[ixv]) )?.*(?:$)"
+
+    for match in reversed(list(re.finditer(pattern, generation, flags=re.I | re.M))):
+        start, stop = match.span()
+        delim = match.group(3) + " "
+        splits = match.group(0).split(delim)
+        replacement = ""
+
+        if match.group(1) is not None:
+            splits = splits[1:]
+            delim1 = match.group(1) + " "
+        else:
+            delim1 = ""
+            continue  # Skip false positives
+
+        pre, post = generation[:start], generation[stop:]
+
+        for i, item in enumerate(splits):
+            level = 0
+            potential_numeral, _, rest = item.strip().partition(" ")
+            if not rest:
+                continue
+            # Infer current nesting level based on detected numbering
+            if re.match(r"^[\dixv]+((?:\.[\dixv])?)+$", potential_numeral, flags=re.I | re.M):
+                level = potential_numeral.count(".")
+
+            replacement += (
+                ("\n" if i > 0 else "") + ("\t" * level) + (delim if i > 0 or start == 0 else delim1) + item.strip()
+            )
+
+        if post == "":
+            post = "\n"
+
+        generation = pre + replacement + post
+
+    return generation
+
+
+def find_next_punctuation(text: str, start_idx=0):
+    """
+    Find the index of the next punctuation mark.
+
+    Args:
+        text (`str`):
+            String to examine
+        start_idx (`int`, *optional*, defaults to 0):
+            Index where to start
+    """
+
+    for i in range(start_idx, len(text)):
+        if text[i] in [".", "?", "!", "\n"]:
+            return i
+
+    return None
+
+
+def truncate_repetitions(text: str, min_len: int = 30) -> str:
+    """
+    Attempt to truncate repeating segments in the input string.
+
+    This function looks for the longest repeating substring at the end of the input string and truncates it to appear
+    only once. To be considered for removal, repetitions need to be continuous.
+
+    Args:
+        text (`str`):
+            The input raw prediction to be truncated.
+        min_len (`int`, *optional*, defaults to 30):
+            The minimum length of the repeating segment.
+
+    Returns:
+        `str`: The input string with repeated segments truncated.
+    """
+    text_lower = text.lower()
+    text_length = len(text_lower)
+
+    if text_length < 2 * min_len:
+        return text
+
+    # try to find a length at which the tail is repeating
+    max_repetition_length = None
+    for repetition_length in range(min_len, int(text_length / 2)):
+        # check if there is a repetition at the end
+        same = True
+        for i in range(0, repetition_length):
+            if text_lower[text_length - repetition_length - i - 1] != text_lower[text_length - i - 1]:
+                same = False
+                break
+
+        if same:
+            max_repetition_length = repetition_length
+
+    if max_repetition_length is None:
+        return text
+
+    lcs = text_lower[-max_repetition_length:]
+
+    # remove all but the last repetition
+    substituted_text = text
+    substituted_text_lower = text_lower
+    while substituted_text_lower.endswith(lcs):
+        substituted_text = substituted_text[:-max_repetition_length]
+        substituted_text_lower = substituted_text_lower[:-max_repetition_length]
+
+    # this is the tail with the repetitions
+    repeating_tail = text_lower[len(substituted_text_lower) :]
+
+    # add until next punctuation and make sure last sentence is not repeating
+    substituted_text_lower_out = substituted_text_lower
+    while True:
+        sentence_end = find_next_punctuation(text_lower, len(substituted_text_lower_out))
+        sentence_start = find_next_punctuation(text_lower[::-1], len(substituted_text_lower_out))
+        if sentence_end and sentence_start:
+            sentence = text_lower[sentence_start:sentence_end]
+            substituted_text_lower_out = text_lower[: sentence_end + 1]
+            if sentence in repeating_tail:
+                break
+        else:
+            break
+
+    text_out = text[: len(substituted_text_lower_out)]
+
+    return text_out
+
+
+def remove_numbers(lines):
+    # Strip digits, underscores and bold markers ("**") from a string, or from each string in a list.
+    def _clean(s):
+        return re.sub(r"(?:[\d_]|\*\*)", "", s).strip()
+
+    if isinstance(lines, str):
+        return _clean(lines)
+    return [_clean(line) for line in lines]
+
+
+def get_slices(lines, clean_lines):
+    """
+    Get slices of text based on specific criteria within the lines.
+
+    This function identifies and returns slices of text from the input lines based on certain conditions.
+
+    These conditions were chosen by the Nougat authors:
+    - The slice is less than 200 characters long.
+    - The slice is more than 3 characters long.
+    - The slice does not start with "[MISSING_PAGE".
+    - The slice is either the same as the next slice or their Levenshtein similarity ratio is greater than 0.9.
+
+    Args:
+        lines (`List[str]`):
+            The list of lines containing the text.
+        clean_lines (`List[str]`):
+            A cleaned version of the text (without numbers).
+
+    Returns:
+        `List[tuple]`: A list of tuples representing the start and end indices of text slices.
+    """
+    indices = np.zeros(len(lines))
+    for i in range(len(lines) - 1):
+        j = i + 1
+        while not clean_lines[j] and j < len(lines) - 1:
+            j += 1
+        if (
+            len(clean_lines[i]) < 200
+            and len(clean_lines[i]) > 3
+            and len(clean_lines[j]) < 200
+            and len(clean_lines[j]) > 3
+            and not clean_lines[i].startswith("[MISSING_PAGE")
+            and (clean_lines[i] == clean_lines[j] or ratio(clean_lines[i], clean_lines[j]) > 0.9)
+        ):
+            indices[i:j] = 1
+    ids = np.where(indices)[0]
+    slices = []
+    if len(ids) == 0:
+        return slices
+    j0 = 0
+    for j, x in enumerate(np.diff(ids) > 3):
+        if x:
+            slices.append((ids[j0], ids[j] + 2))
+            j0 = j + 1
+    slices.append((ids[j0], ids[-1] + 2))
+    return [sli for sli in slices if sli[1] - sli[0] > 15]
+
+
+def remove_slice_from_lines(lines, clean_text, slice) -> str:
+    """
+    Remove a slice of text from the lines based on specific criteria.
+
+    This function identifies a slice of text within the lines and removes it based on certain conditions.
+
+    Args:
+        lines (list of str): The list of lines containing the text.
+        clean_text (list of str): A cleaned version of the text (without numbers).
+        slice (tuple): A tuple representing the start and end indices of the slice to be removed.
+
+    Returns:
+        str: The removed slice of text as a single string.
+    """
+    base = clean_text[slice[0]]
+    section = list(slice)
+    check_start_flag = False
+    # backwards pass, at most 5 lines
+    for line_idx in range(max(0, slice[0] - 1), max(0, slice[0] - 5), -1):
+        if not lines[line_idx]:
+            continue
+        if lines[line_idx] == "## References":
+            section[0] = line_idx
+            break
+        elif ratio(base, remove_numbers(lines[line_idx])) < 0.9:
+            section[0] = line_idx + 1
+            potential_ref = remove_numbers(lines[max(0, line_idx - 1)].partition("* [")[-1])
+            if len(potential_ref) >= 0.75 * len(base) and ratio(base, potential_ref) < 0.9:
+                section[0] = line_idx
+            check_start_flag = True
+            break
+    # forward pass, at most 5 lines
+    for line_idx in range(min(len(lines), slice[1]), min(len(lines), slice[1] + 5)):
+        if ratio(base, remove_numbers(lines[line_idx])) < 0.9:
+            section[1] = line_idx
+            break
+    if len(lines) <= section[1]:
+        section[1] = len(lines) - 1
+    to_delete = "\n".join(lines[section[0] : section[1] + 1])
+    # cut off next page content
+    itera, iterb = enumerate(lines[section[1] - 1]), enumerate(lines[section[1]])
+    while True:
+        try:
+            (ia, a) = next(itera)
+            while a.isnumeric():
+                (ia, a) = next(itera)
+            (ib, b) = next(iterb)
+            while b.isnumeric():
+                (ib, b) = next(iterb)
+            if a != b:
+                break
+        except StopIteration:
+            break
+    if check_start_flag and "* [" in to_delete:
+        to_delete = "* [" + to_delete.partition("* [")[-1]
+    try:
+        delta = len(lines[section[1]]) - ib - 1
+        if delta > 0:
+            to_delete = to_delete[:-delta]
+    except UnboundLocalError:
+        pass
+
+    return to_delete.strip()
+
+
+@add_end_docstrings(INIT_TOKENIZER_DOCSTRING)
+class NougatTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Fast tokenizer for Nougat (backed by HuggingFace tokenizers library).
+
+    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+    refer to this superclass for more information regarding those methods. This class mainly adds Nougat-specific
+    methods for postprocessing the generated text.
+
+    Args:
+        vocab_file (`str`):
+            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
+            contains the vocabulary necessary to instantiate a tokenizer.
+        tokenizer_file (`str`):
+            [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
+            contains everything needed to load the tokenizer.
+
+        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+            Whether to clean up spaces after decoding. Cleanup consists of removing potential artifacts like extra
+            spaces.
+
+        bos_token (`str`, *optional*, defaults to `"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
+            token.
+
+        eos_token (`str`, *optional*, defaults to `"</s>"`):
+            The end of sequence token.
+
+        unk_token (`str`, *optional*, defaults to `"<unk>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+
+        pad_token (`str`, *optional*, defaults to `"<pad>"`):
+            The token used for padding, for example when batching sequences of different lengths.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+    slow_tokenizer_class = None
+
+    def __init__(
+        self,
+        vocab_file=None,
+        tokenizer_file=None,
+        clean_up_tokenization_spaces=False,
+        unk_token="<unk>",
+        bos_token="<s>",
+        eos_token="</s>",
+        pad_token="<pad>",
+        **kwargs,
+    ):
+        super().__init__(
+            vocab_file=vocab_file,
+            tokenizer_file=tokenizer_file,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+        self.vocab_file = vocab_file
+
+    def remove_hallucinated_references(self, text: str) -> str:
+        """
+        Remove hallucinated or missing references from the text.
+
+        This function identifies and removes references that are marked as missing or hallucinated from the input text.
+
+        Args:
+            text (`str`):
+                The input text containing references.
+
+        Returns:
+            `str`: The text with hallucinated references removed.
+        """
+        lines = text.split("\n")
+        if len(lines) == 0:
+            return ""
+        clean_lines = remove_numbers(lines)
+        slices = get_slices(lines, clean_lines)
+        to_delete = [remove_slice_from_lines(lines, clean_lines, sli) for sli in slices]
+        for segment in reversed(to_delete):
+            text = text.replace(segment, "\n\n[MISSING_PAGE_POST]\n\n")
+        text = re.sub(
+            r"## References\n+\[MISSING_PAGE_POST(:\d+)?\]",
+            "\n\n[MISSING_PAGE_POST\\1]",
+            text,
+        )
+        return text
+
+    def correct_tables(self, generation: str) -> str:
+        """
+        Takes a generated string and fixes tables/tabulars to make them match the markdown format needed.
+
+        Args:
+            generation (str): The generated text to be postprocessed.
+
+        Returns:
+            str: The postprocessed text.
+
+        Example:
+
+        ```python
+        correct_tables("\\begin{table} \\begin{tabular}{l l} & \\ \\end{tabular} \\end{table}")
+        "\\begin{table}\n\\begin{tabular}{l l} & \\ \\end{tabular}\n\\end{table}"
+        ```
+        """
+        # remove obviously wrong tables
+        for l in generation.split("\n"):
+            if l.count("\\begin{tabular}") > 15 or l.count("\\multicolumn") > 60 or l.count("&") > 400:
+                generation = generation.replace(l, "")
+        # whitespace corrections
+
+        generation = generation.replace("\\begin{table} \\begin{tabular}", "\\begin{table}\n\\begin{tabular}")
+        generation = generation.replace("\\end{tabular} \\end{table}", "\\end{tabular}\n\\end{table}")
+        generation = generation.replace("\\end{table} Tab", "\\end{table}\nTab")
+
+        generation = re.sub(r"(^.+)\\begin{tab", r"\1\n\\begin{tab", generation, flags=re.M)
+
+        # Remove left-aligned empty LaTeX tabular blocks.
+        generation = generation.replace(r"\begin{tabular}{l l}  & \\ \end{tabular}", "")
+        # Remove tabulars with just 2 newline characters.
+        generation = generation.replace("\\begin{tabular}{}\n\n\\end{tabular}", "")
+        return generation
+
+    def post_process_single(self, generation: str, fix_markdown: bool = True) -> str:
+        """
+        Postprocess a single generated text. Regular expressions used here are taken directly from the Nougat article
+        authors. These expressions are commented for clarity and tested end-to-end in most cases.
+
+        Args:
+            generation (str): The generated text to be postprocessed.
+            fix_markdown (bool, optional): Whether to perform Markdown formatting fixes. Default is True.
+
+        Returns:
+            str: The postprocessed text.
+        """
+        generation = re.sub(
+            r"(?:\n|^)#+ \d*\W? ?(.{100,})", r"\n\1", generation
+        )  # overly long section titles are probably not titles
+        generation = generation.strip()
+        # Remove LaTeX left margin tag
+        generation = generation.replace("\n* [leftmargin=*]\n", "\n")
+        # Remove lines with markdown headings starting with #, with numerals,
+        # and possibly roman numerals with trailing spaces and newlines
+        generation = re.sub(r"^#+ (?:\.?(?:\d|[ixv])+)*\s*(?:$|\n\s*)", "", generation, flags=re.M)
+        # most likely hallucinated titles
+        lines = generation.split("\n")
+        if lines[-1].startswith("#") and lines[-1].lstrip("#").startswith(" ") and len(lines) > 1:
+            logger.info("Likely hallucinated title at the end of the page: " + lines[-1])
+            generation = "\n".join(lines[:-1])
+        # obvious repetition detection
+        generation = truncate_repetitions(generation)
+        # Reference corrections
+        generation = self.remove_hallucinated_references(generation)
+        # Remove lines starting with an asterisk and a bracketed number like "* [1]", followed by ten or more capital letters with periods (i.e. overlong references)
+        generation = re.sub(r"^\* \[\d+\](\s?[A-W]\.+\s?){10,}.*$", "", generation, flags=re.M)
+        # Remove empty brackets after a reference number in brackets. "* [12][]ABC" will become "* [12]ABC"
+        generation = re.sub(r"^(\* \[\d+\])\[\](.*)$", r"\1\2", generation, flags=re.M)
+        # Remove single characters before or after 2 new lines
+        generation = re.sub(r"(^\w\n\n|\n\n\w$)", "", generation)
+        # pmc math artifact correction
+        generation = re.sub(
+            r"([\s.,()])_([a-zA-Z0-9])__([a-zA-Z0-9]){1,3}_([\s.,:()])",
+            r"\1\(\2_{\3}\)\4",
+            generation,
+        )
+        generation = re.sub(r"([\s.,\d])_([a-zA-Z0-9])_([\s.,\d;])", r"\1\(\2\)\3", generation)
+        # footnote mistakes
+        generation = re.sub(
+            r"(\nFootnote .*?:) (?:footnotetext|thanks):\W*(.*(?:\n\n|$))",
+            r"\1 \2",
+            generation,
+        )
+        # TODO Come up with footnote formatting inside a table
+        generation = re.sub(r"\[FOOTNOTE:.+?\](.*?)\[ENDFOOTNOTE\]", "", generation)
+        # itemize post processing
+        generation = normalize_list_like_lines(generation)
+
+        if generation.endswith((".", "}")):
+            generation += "\n\n"
+        if re.match(r"[A-Z0-9,;:]$", generation):
+            # add a space in case there is a comma or a word ending
+            generation += " "
+        elif generation.startswith(("#", "**", "\\begin")):
+            generation = "\n\n" + generation
+        elif generation.split("\n")[-1].startswith(("#", "Figure", "Table")):
+            generation = generation + "\n\n"
+        else:
+            try:
+                last_word = generation.split(" ")[-1]
+                if last_word in nltk.corpus.words.words():
+                    generation += " "
+            except LookupError:
+                # add space just in case. Will split words but better than concatenating them
+                generation += " "
+
+        # table corrections
+        generation = self.correct_tables(generation)
+        # Remove optional, empty square brackets after begin{array}
+        generation = generation.replace("\\begin{array}[]{", "\\begin{array}{")
+        # Remove empty or malformed LaTeX tabular blocks with 2 or more columns specified, with spaces and ampersands.
+        generation = re.sub(
+            r"\\begin{tabular}{([clr ]){2,}}\s*[& ]*\s*(\\\\)? \\end{tabular}",
+            "",
+            generation,
+        )
+        # Remove runs of two or more consecutive "**S. A. B.**" lines. Was included in Nougat's code.
+        generation = re.sub(r"(\*\*S\. A\. B\.\*\*\n+){2,}", "", generation)
+        # Remove markdown-style headers that are incomplete or empty on multiple lines.
+        generation = re.sub(r"^#+( [\[\d\w])?$", "", generation, flags=re.M)
+        # Remove lines with just one period.
+        generation = re.sub(r"^\.\s*$", "", generation, flags=re.M)
+        # Replace instances of three or more newlines with just two newlines.
+        generation = re.sub(r"\n{3,}", "\n\n", generation)
+        if fix_markdown:
+            return markdown_compatible(generation)
+        else:
+            return generation
+
+    def post_process_generation(
+        self,
+        generation: Union[str, List[str]],
+        fix_markdown: bool = True,
+        num_workers: int = None,
+    ) -> Union[str, List[str]]:
+        """
+        Postprocess a generated text or a list of generated texts.
+
+        This function can be used to perform postprocessing on generated text, such as fixing Markdown formatting.
+
+        Postprocessing is quite slow so it is recommended to use multiprocessing to speed up the process.
+
+        Args:
+            generation (Union[str, List[str]]):
+                The generated text or a list of generated texts.
+            fix_markdown (`bool`, *optional*, defaults to `True`):
+                Whether to perform Markdown formatting fixes.
+            num_workers (`int`, *optional*):
+                Optional number of workers to pass to leverage multiprocessing (postprocessing several texts in
+                parallel).
+
+        Returns:
+            Union[str, List[str]]: The postprocessed text or list of postprocessed texts.
+        """
+        requires_backends(self, ["nltk", "levenshtein"])
+
+        if isinstance(generation, list):
+            if num_workers is not None and isinstance(num_workers, int):
+                with Pool(num_workers) as p:
+                    return p.map(partial(self.post_process_single, fix_markdown=fix_markdown), generation)
+            else:
+                return [self.post_process_single(s, fix_markdown=fix_markdown) for s in generation]
+        else:
+            return self.post_process_single(generation, fix_markdown=fix_markdown)
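Note on the `correct_tables` hunk above: the whitespace corrections can be tried in isolation. The sketch below copies only those four substitutions into a free function so their effect is easy to inspect. The name `correct_table_whitespace` is hypothetical and not part of this PR; the removal of oversized or empty tables is deliberately left out.

```python
import re


def correct_table_whitespace(text: str) -> str:
    """Hypothetical standalone copy of the whitespace fixes in `correct_tables`."""
    # Put `\begin{tabular}` on its own line inside a table environment.
    text = text.replace("\\begin{table} \\begin{tabular}", "\\begin{table}\n\\begin{tabular}")
    # Put `\end{table}` on its own line after the tabular block.
    text = text.replace("\\end{tabular} \\end{table}", "\\end{tabular}\n\\end{table}")
    # Start a caption beginning with "Tab" on a new line.
    text = text.replace("\\end{table} Tab", "\\end{table}\nTab")
    # If anything precedes `\begin{tab...` on the same line, break the line before it.
    text = re.sub(r"(^.+)\\begin{tab", r"\1\n\\begin{tab", text, flags=re.M)
    return text


if __name__ == "__main__":
    sample = "See text \\begin{tabular}{c} x \\end{tabular}"
    print(repr(correct_table_whitespace(sample)))
```

The last substitution only inserts a newline; it does not strip the space that preceded `\begin{tab`, so the previous line keeps its trailing whitespace.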
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -55,6 +55,7 @@ from .utils import (
    is_auto_gptq_available,
    is_bitsandbytes_available,
    is_bs4_available,
+    is_cv2_available,
    is_cython_available,
    is_decord_available,
    is_detectron2_available,
@@ -69,8 +70,10 @@ from .utils import (
    is_jinja_available,
    is_jumanpp_available,
    is_keras_nlp_available,
+    is_levenshtein_available,
    is_librosa_available,
    is_natten_available,
+    is_nltk_available,
    is_onnx_available,
    is_optimum_available,
    is_pandas_available,
@@ -311,6 +314,36 @@ def require_bs4(test_case):
    return unittest.skipUnless(is_bs4_available(), "test requires BeautifulSoup4")(test_case)


+def require_cv2(test_case):
+    """
+    Decorator marking a test that requires OpenCV.
+
+    These tests are skipped when OpenCV isn't installed.
+
+    """
+    return unittest.skipUnless(is_cv2_available(), "test requires OpenCV")(test_case)
+
+
+def require_levenshtein(test_case):
+    """
+    Decorator marking a test that requires Levenshtein.
+
+    These tests are skipped when Levenshtein isn't installed.
+
+    """
+    return unittest.skipUnless(is_levenshtein_available(), "test requires Levenshtein")(test_case)
+
+
+def require_nltk(test_case):
+    """
+    Decorator marking a test that requires NLTK.
+
+    These tests are skipped when NLTK isn't installed.
+
+    """
+    return unittest.skipUnless(is_nltk_available(), "test requires NLTK")(test_case)
+
+
 def require_accelerate(test_case):
    """
    Decorator marking a test that requires accelerate. These tests are skipped when accelerate isn't installed.
--- a/src/transformers/utils/__init__.py
+++ b/src/transformers/utils/__init__.py
@@ -108,6 +108,7 @@ from .import_utils import (
    is_bitsandbytes_available,
    is_bs4_available,
    is_coloredlogs_available,
+    is_cv2_available,
    is_cython_available,
    is_datasets_available,
    is_decord_available,
@@ -125,9 +126,11 @@ from .import_utils import (
    is_jumanpp_available,
    is_kenlm_available,
    is_keras_nlp_available,
+    is_levenshtein_available,
    is_librosa_available,
    is_natten_available,
    is_ninja_available,
+    is_nltk_available,
    is_onnx_available,
    is_openai_available,
    is_optimum_available,
--- a/src/transformers/utils/dummy_tokenizers_objects.py
+++ b/src/transformers/utils/dummy_tokenizers_objects.py
@@ -310,6 +310,13 @@ class NllbTokenizerFast(metaclass=DummyObject):
        requires_backends(self, ["tokenizers"])


+class NougatTokenizerFast(metaclass=DummyObject):
+    _backends = ["tokenizers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tokenizers"])
+
+
 class OpenAIGPTTokenizerFast(metaclass=DummyObject):
    _backends = ["tokenizers"]

--- a/src/transformers/utils/dummy_vision_objects.py
+++ b/src/transformers/utils/dummy_vision_objects.py
@@ -359,6 +359,13 @@ class MobileViTImageProcessor(metaclass=DummyObject):
        requires_backends(self, ["vision"])


+class NougatImageProcessor(metaclass=DummyObject):
+    _backends = ["vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["vision"])
+
+
 class OneFormerImageProcessor(metaclass=DummyObject):
    _backends = ["vision"]

--- a/src/transformers/utils/import_utils.py
+++ b/src/transformers/utils/import_utils.py
@@ -75,6 +75,8 @@ _flash_attn_available = _is_package_available("flash_attn")
 # `importlib.metadata.version` doesn't work with `bs4` but `beautifulsoup4`. For `importlib.util.find_spec`, reversed.
 _bs4_available = importlib.util.find_spec("bs4") is not None
 _coloredlogs_available = _is_package_available("coloredlogs")
+# `importlib.metadata.version` doesn't work with `cv2` (distributed as e.g. `opencv-python-headless`).
+_cv2_available = importlib.util.find_spec("cv2") is not None
 _datasets_available = _is_package_available("datasets")
 _decord_available = importlib.util.find_spec("decord") is not None
 _detectron2_available = _is_package_available("detectron2")
@@ -95,8 +97,10 @@ _jieba_available = _is_package_available("jieba")
 _jinja_available = _is_package_available("jinja2")
 _kenlm_available = _is_package_available("kenlm")
 _keras_nlp_available = _is_package_available("keras_nlp")
+_levenshtein_available = _is_package_available("Levenshtein")
 _librosa_available = _is_package_available("librosa")
 _natten_available = _is_package_available("natten")
+_nltk_available = _is_package_available("nltk")
 _onnx_available = _is_package_available("onnx")
 _openai_available = _is_package_available("openai")
 _optimum_available = _is_package_available("optimum")
@@ -240,6 +244,10 @@ def is_kenlm_available():
    return _kenlm_available


+def is_cv2_available():
+    return _cv2_available
+
+
 def is_torch_available():
    return _torch_available

@@ -629,6 +637,10 @@ def is_auto_gptq_available():
    return _auto_gptq_available


+def is_levenshtein_available():
+    return _levenshtein_available
+
+
 def is_optimum_neuron_available():
    return _optimum_available and _is_package_available("optimum.neuron")

@@ -759,6 +771,10 @@ def is_natten_available():
    return _natten_available


+def is_nltk_available():
+    return _nltk_available
+
+
 def is_torchaudio_available():
    return _torchaudio_available

@@ -813,6 +829,16 @@ def is_jinja_available():
    return _jinja_available


+# docstyle-ignore
+CV2_IMPORT_ERROR = """
+{0} requires the OpenCV library but it was not found in your environment. You can install it with:
+```
+pip install opencv-python
+```
+Please note that you may need to restart your runtime after installation.
+"""
+
+
 # docstyle-ignore
 DATASETS_IMPORT_ERROR = """
 {0} requires the 🤗 Datasets library but it was not found in your environment. You can install it with:
@@ -959,6 +985,11 @@ installation section: https://github.com/rspeer/python-ftfy/tree/master#installi
 that match your environment. Please note that you may need to restart your runtime after installation.
 """

+LEVENSHTEIN_IMPORT_ERROR = """
+{0} requires the python-Levenshtein library but it was not found in your environment. You can install it with pip:
+`pip install python-Levenshtein`. Please note that you may need to restart your runtime after installation.
+"""
+
 # docstyle-ignore
 PYTORCH_QUANTIZATION_IMPORT_ERROR = """
 {0} requires the pytorch-quantization library but it was not found in your environment. You can install it with pip:
@@ -1028,6 +1059,14 @@ shi-labs.com/natten . You can also install it with pip (may take longer to build
 `pip install natten`. Please note that you may need to restart your runtime after installation.
 """

+
+# docstyle-ignore
+NLTK_IMPORT_ERROR = """
+{0} requires the NLTK library but it was not found in your environment. You can install it by referring to:
+https://www.nltk.org/install.html. Please note that you may need to restart your runtime after installation.
+"""
+
+
 # docstyle-ignore
 VISION_IMPORT_ERROR = """
 {0} requires the PIL library but it was not found in your environment. You can install it with pip:
@@ -1109,6 +1148,7 @@ jinja2`. Please note that you may need to restart your runtime after installatio
 BACKENDS_MAPPING = OrderedDict(
    [
        ("bs4", (is_bs4_available, BS4_IMPORT_ERROR)),
+        ("cv2", (is_cv2_available, CV2_IMPORT_ERROR)),
        ("datasets", (is_datasets_available, DATASETS_IMPORT_ERROR)),
        ("detectron2", (is_detectron2_available, DETECTRON2_IMPORT_ERROR)),
        ("essentia", (is_essentia_available, ESSENTIA_IMPORT_ERROR)),
@@ -1118,6 +1158,7 @@ BACKENDS_MAPPING = OrderedDict(
        ("pandas", (is_pandas_available, PANDAS_IMPORT_ERROR)),
        ("phonemizer", (is_phonemizer_available, PHONEMIZER_IMPORT_ERROR)),
        ("pretty_midi", (is_pretty_midi_available, PRETTY_MIDI_IMPORT_ERROR)),
+        ("levenshtein", (is_levenshtein_available, LEVENSHTEIN_IMPORT_ERROR)),
        ("librosa", (is_librosa_available, LIBROSA_IMPORT_ERROR)),
        ("protobuf", (is_protobuf_available, PROTOBUF_IMPORT_ERROR)),
        ("pyctcdecode", (is_pyctcdecode_available, PYCTCDECODE_IMPORT_ERROR)),
@@ -1132,6 +1173,7 @@ BACKENDS_MAPPING = OrderedDict(
        ("tensorflow_text", (is_tensorflow_text_available, TENSORFLOW_TEXT_IMPORT_ERROR)),
        ("timm", (is_timm_available, TIMM_IMPORT_ERROR)),
        ("natten", (is_natten_available, NATTEN_IMPORT_ERROR)),
+        ("nltk", (is_nltk_available, NLTK_IMPORT_ERROR)),
        ("tokenizers", (is_tokenizers_available, TOKENIZERS_IMPORT_ERROR)),
        ("torch", (is_torch_available, PYTORCH_IMPORT_ERROR)),
        ("torchvision", (is_torchvision_available, TORCHVISION_IMPORT_ERROR)),
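Note on the availability helpers added in `import_utils.py` and `testing_utils.py`: the pattern is a module-level availability check paired with a `unittest.skipUnless` decorator. A minimal generic sketch follows, assuming only the standard library. The names `is_module_available` and `require_module` are hypothetical and are not the helpers defined in this PR, which compute each flag once at import time.

```python
import importlib.util
import unittest


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    # `find_spec` looks up the import name (e.g. "cv2"), not the pip distribution name.
    return importlib.util.find_spec(module_name) is not None


def require_module(module_name: str):
    """Build a decorator that skips a test unless `module_name` is importable."""
    return unittest.skipUnless(is_module_available(module_name), f"test requires {module_name}")


class ExampleTest(unittest.TestCase):
    @require_module("json")
    def test_runs_when_available(self):
        self.assertTrue(is_module_available("json"))


if __name__ == "__main__":
    unittest.main()
```

Checking by import name is what makes the `cv2` case work: the package is installed under a different distribution name, so a metadata lookup keyed on `cv2` would not find it.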
--- a/tests/models/nougat/__init__.py
+++ b/tests/models/nougat/__init__.py
--- a/tests/models/nougat/test_image_processing_nougat.py
+++ b/tests/models/nougat/test_image_processing_nougat.py
@@ -0,0 +1,194 @@
+# coding=utf-8
+# Copyright 2023 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import unittest
+
+import numpy as np
+from huggingface_hub import hf_hub_download
+
+from transformers.testing_utils import require_torch, require_vision
+from transformers.utils import cached_property, is_torch_available, is_vision_available
+
+from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
+
+
+if is_torch_available():
+    import torch
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import NougatImageProcessor
+
+
+class NougatImageProcessingTester(unittest.TestCase):
+    def __init__(
+        self,
+        parent,
+        batch_size=7,
+        num_channels=3,
+        image_size=18,
+        min_resolution=30,
+        max_resolution=400,
+        do_crop_margin=True,
+        do_resize=True,
+        size=None,
+        do_thumbnail=True,
+        do_align_long_axis: bool = False,
+        do_pad=True,
+        do_normalize: bool = True,
+        image_mean=[0.5, 0.5, 0.5],
+        image_std=[0.5, 0.5, 0.5],
+    ):
+        size = size if size is not None else {"height": 20, "width": 20}
+        self.parent = parent
+        self.batch_size = batch_size
+        self.num_channels = num_channels
+        self.image_size = image_size
+        self.min_resolution = min_resolution
+        self.max_resolution = max_resolution
+        self.do_crop_margin = do_crop_margin
+        self.do_resize = do_resize
+        self.size = size
+        self.do_thumbnail = do_thumbnail
+        self.do_align_long_axis = do_align_long_axis
+        self.do_pad = do_pad
+        self.do_normalize = do_normalize
+        self.image_mean = image_mean
+        self.image_std = image_std
+
+    def prepare_image_processor_dict(self):
+        return {
+            "do_crop_margin": self.do_crop_margin,
+            "do_resize": self.do_resize,
+            "size": self.size,
+            "do_thumbnail": self.do_thumbnail,
+            "do_align_long_axis": self.do_align_long_axis,
+            "do_pad": self.do_pad,
+            "do_normalize": self.do_normalize,
+            "image_mean": self.image_mean,
+            "image_std": self.image_std,
+        }
+
+    def expected_output_image_shape(self, images):
+        return self.num_channels, self.size["height"], self.size["width"]
+
+    def prepare_dummy_image(self):
+        filepath = hf_hub_download(
+            repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_pdf.png", repo_type="dataset"
+        )
+        image = Image.open(filepath).convert("RGB")
+        return image
+
+    def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
+        return prepare_image_inputs(
+            batch_size=self.batch_size,
+            num_channels=self.num_channels,
+            min_resolution=self.min_resolution,
+            max_resolution=self.max_resolution,
+            equal_resolution=equal_resolution,
+            numpify=numpify,
+            torchify=torchify,
+        )
+
+
+@require_torch
+@require_vision
+class NougatImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
+    image_processing_class = NougatImageProcessor if is_vision_available() else None
+
+    def setUp(self):
+        self.image_processor_tester = NougatImageProcessingTester(self)
+
+    @property
+    def image_processor_dict(self):
+        return self.image_processor_tester.prepare_image_processor_dict()
+
+    @cached_property
+    def image_processor(self):
+        return self.image_processing_class(**self.image_processor_dict)
+
+    def test_image_processor_properties(self):
+        image_processing = self.image_processing_class(**self.image_processor_dict)
+        self.assertTrue(hasattr(image_processing, "do_resize"))
+        self.assertTrue(hasattr(image_processing, "size"))
+        self.assertTrue(hasattr(image_processing, "do_normalize"))
+        self.assertTrue(hasattr(image_processing, "image_mean"))
+        self.assertTrue(hasattr(image_processing, "image_std"))
+
+    def test_image_processor_from_dict_with_kwargs(self):
+        image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
+        self.assertEqual(image_processor.size, {"height": 20, "width": 20})
+
+        image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
+        self.assertEqual(image_processor.size, {"height": 42, "width": 42})
+
+    def test_expected_output(self):
+        dummy_image = self.image_processor_tester.prepare_dummy_image()
+        image_processor = self.image_processor
+        inputs = image_processor(dummy_image, return_tensors="pt")
+        self.assertTrue(torch.allclose(inputs["pixel_values"].mean(), torch.tensor(0.4906), atol=1e-3, rtol=1e-3))
+
+    def test_crop_margin_all_white(self):
+        image = np.uint8(np.ones((100, 100, 3)) * 255)
+        image_processor = self.image_processor
+        cropped_image = image_processor.crop_margin(image)
+        self.assertTrue(np.array_equal(image, cropped_image))
+
+    def test_crop_margin_centered_black_square(self):
+        image = np.ones((100, 100, 3), dtype=np.uint8) * 255
+        image[45:55, 45:55, :] = 0
+        image_processor = self.image_processor
+        cropped_image = image_processor.crop_margin(image)
+        expected_cropped = image[45:55, 45:55, :]
+        self.assertTrue(np.array_equal(expected_cropped, cropped_image))
+
+    def test_align_long_axis_no_rotation(self):
+        image = np.uint8(np.ones((100, 200, 3)) * 255)
+        image_processor = self.image_processor
+        size = {"height": 200, "width": 300}
+        aligned_image = image_processor.align_long_axis(image, size)
+        self.assertEqual(image.shape, aligned_image.shape)
+
+    def test_align_long_axis_with_rotation(self):
+        image = np.uint8(np.ones((200, 100, 3)) * 255)
+        image_processor = self.image_processor
+        size = {"height": 300, "width": 200}
+        aligned_image = image_processor.align_long_axis(image, size)
+        self.assertEqual((200, 100, 3), aligned_image.shape)
+
+    def test_align_long_axis_data_format(self):
+        image = np.uint8(np.ones((100, 200, 3)) * 255)
+        data_format = "channels_first"
+        size = {"height": 200, "width": 300}
+        image_processor = self.image_processor
+        aligned_image = image_processor.align_long_axis(image, size, data_format=data_format)
+        self.assertEqual((3, 100, 200), aligned_image.shape)
+
+    def prepare_dummy_np_image(self):
+        filepath = hf_hub_download(
+            repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_pdf.png", repo_type="dataset"
+        )
+        image = Image.open(filepath).convert("RGB")
+        return np.array(image)
+
+    def test_crop_margin_equality_cv2_python(self):
+        image = self.prepare_dummy_np_image()
+        image_processor = self.image_processor
+        image_cropped_python = image_processor.crop_margin(image)
+
+        self.assertEqual(image_cropped_python.shape, (850, 685, 3))
+        self.assertAlmostEqual(image_cropped_python.mean(), 237.43881150708458)
--- a/tests/models/nougat/test_tokenization_nougat.py
+++ b/tests/models/nougat/test_tokenization_nougat.py
@@ -0,0 +1,196 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+from transformers import NougatTokenizerFast
+from transformers.models.nougat.tokenization_nougat_fast import markdown_compatible, normalize_list_like_lines
+from transformers.testing_utils import require_levenshtein, require_nltk, require_tokenizers
+
+from ...test_tokenization_common import TokenizerTesterMixin
+
+
+@require_tokenizers
+class NougatTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
+    slow_tokenizer_class = None
+    rust_tokenizer_class = NougatTokenizerFast
+    tokenizer_class = NougatTokenizerFast
+    test_rust_tokenizer = True
+    test_slow_tokenizer = False
+    from_pretrained_vocab_key = "tokenizer_file"
+    special_tokens_map = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
+
+    def setUp(self):
+        super().setUp()
+        tokenizer = NougatTokenizerFast.from_pretrained("facebook/nougat-base")
+        tokenizer.save_pretrained(self.tmpdirname)
+
+    def get_rust_tokenizer(self, **kwargs):
+        kwargs.update(self.special_tokens_map)
+        return NougatTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
+
+    def test_padding(self, max_length=6):
+        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
+            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
+                tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
+                # Simple input
+                sentence1 = "This is a simple input"
+                sentence2 = ["This is a simple input 1", "This is a simple input 2"]
+                pair1 = ("This is a simple input", "This is a pair")
+                pair2 = [
+                    ("This is a simple input 1", "This is a simple input 2"),
+                    ("This is a simple pair 1", "This is a simple pair 2"),
+                ]
+
+                # Simple input tests
+                try:
+                    tokenizer_r.encode(sentence1, max_length=max_length)
+                    tokenizer_r.encode_plus(sentence1, max_length=max_length)
+
+                    tokenizer_r.batch_encode_plus(sentence2, max_length=max_length)
+                    tokenizer_r.encode(pair1, max_length=max_length)
+                    tokenizer_r.batch_encode_plus(pair2, max_length=max_length)
+                except ValueError:
+                    self.fail("Nougat Tokenizer should be able to deal with padding")
+
+                tokenizer_r.pad_token = None  # Hotfixing padding = None
+                self.assertRaises(
+                    ValueError, tokenizer_r.encode, sentence1, max_length=max_length, padding="max_length"
+                )
+
+                # Simple input
+                self.assertRaises(
+                    ValueError, tokenizer_r.encode_plus, sentence1, max_length=max_length, padding="max_length"
+                )
+
+                # Simple input
+                self.assertRaises(
+                    ValueError,
+                    tokenizer_r.batch_encode_plus,
+                    sentence2,
+                    max_length=max_length,
+                    padding="max_length",
+                )
+
+                # Pair input
+                self.assertRaises(ValueError, tokenizer_r.encode, pair1, max_length=max_length, padding="max_length")
+
+                # Pair input
+                self.assertRaises(
+                    ValueError, tokenizer_r.encode_plus, pair1, max_length=max_length, padding="max_length"
+                )
+
+                # Pair input
+                self.assertRaises(
+                    ValueError,
+                    tokenizer_r.batch_encode_plus,
+                    pair2,
+                    max_length=max_length,
+                    padding="max_length",
+                )
+
+    @unittest.skip("NougatTokenizerFast does not have tokenizer_file in its signature")
+    def test_rust_tokenizer_signature(self):
+        pass
+
+    @unittest.skip("NougatTokenizerFast does not support pretokenized inputs")
+    def test_pretokenized_inputs(self):
+        pass
+
+    @unittest.skip("NougatTokenizerFast directly inherits from PreTrainedTokenizerFast")
+    def test_prepare_for_model(self):
+        pass
+
+    @unittest.skip("This needs a slow tokenizer. Nougat does not have one!")
+    def test_encode_decode_with_spaces(self):
+        pass
+
+
+class MarkdownCompatibleTest(unittest.TestCase):
+    def test_equation_tag(self):
+        input_text = "(3.2) \\[Equation Text\\]"
+        expected_output = "\\[Equation Text \\tag{3.2}\\]"
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+    def test_equation_tag_letters(self):
+        input_text = "(18a) \\[Equation Text\\]"
+        expected_output = "\\[Equation Text \\tag{18a}\\]"
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+    def test_bold_formatting(self):
+        input_text = r"This is \bm{bold} text."
+        expected_output = r"This is \mathbf{bold} text."
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+    def test_url_conversion(self):
+        input_text = "Visit my website at https://www.example.com"
+        expected_output = "Visit my website at [https://www.example.com](https://www.example.com)"
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+    def test_algorithm_code_block(self):
+        input_text = "```python\nprint('Hello, world!')\n```"
+        expected_output = "```\npython\nprint('Hello, world!')\n```"
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+    def test_escape_characters(self):
+        input_text = r"Escaped characters like \n should not be \\[affected\\]"
+        expected_output = r"Escaped characters like \n should not be \\[affected\\]"
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+    def test_nested_tags(self):
+        input_text = r"This is a super nested \bm{\bm{\bm{\bm{\bm{bold}}}}} tag."
+        expected_output = r"This is a super nested \mathbf{\mathbf{\mathbf{\mathbf{\mathbf{bold}}}}} tag."
+        self.assertEqual(markdown_compatible(input_text), expected_output)
+
+
+class TestNormalizeListLikeLines(unittest.TestCase):
+    def test_two_level_lines(self):
+        input_str = "* Item 1 * Item 2"
+        expected_output = "* Item 1\n* Item 2\n"
+        self.assertEqual(normalize_list_like_lines(input_str), expected_output)
+
+    def test_three_level_lines(self):
+        input_str = "- I. Item 1 - II. Item 2 - III. Item 3"
+        expected_output = "- I. Item 1\n- II. Item 2\n- III. Item 3\n"
+        self.assertEqual(normalize_list_like_lines(input_str), expected_output)
+
+    def test_nested_lines(self):
+        input_str = "- I. Item 1 - I.1 Sub-item 1 - I.1.1 Sub-sub-item 1 - II. Item 2"
+        expected_output = "- I. Item 1\n\t- I.1 Sub-item 1\n\t\t- I.1.1 Sub-sub-item 1\n- II. Item 2\n"
+        self.assertEqual(normalize_list_like_lines(input_str), expected_output)
+
+
+@require_tokenizers
+class NougatPostProcessingTest(unittest.TestCase):
+    def setUp(self):
+        super().setUp()
+        self.tokenizer = NougatTokenizerFast.from_pretrained("facebook/nougat-base")
+
+    def test_correct_tables_basic(self):
+        input_str = "\\begin{table} \\begin{tabular}{l l}  & \\ \\end{tabular} \\end{table}"
+        expected_output = "\\begin{table}\n\\begin{tabular}{l l}  & \\ \\end{tabular}\n\\end{table}"
+        self.assertEqual(self.tokenizer.correct_tables(input_str), expected_output)
+
+    def test_correct_tables_high_count(self):
+        input_str = "\\begin{tabular}" * 20
+        expected_output = ""
+        self.assertEqual(self.tokenizer.correct_tables(input_str), expected_output)
+
+    @require_levenshtein
+    @require_nltk
+    def test_postprocess_as_nougat_no_markdown(self):
+        input_str = "# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. 
This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books"  # noqa: E231
+        expected_output = "\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. 
This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books"  # noqa: E231
+        self.assertEqual(self.tokenizer.post_process_single(input_str, fix_markdown=False), expected_output)
--- a/tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py
+++ b/tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py
@@ -18,10 +18,13 @@ import tempfile
 import unittest

 from datasets import load_dataset
+from huggingface_hub import hf_hub_download
 from packaging import version

-from transformers import DonutProcessor, TrOCRProcessor
+from transformers import DonutProcessor, NougatProcessor, TrOCRProcessor
 from transformers.testing_utils import (
+    require_levenshtein,
+    require_nltk,
    require_sentencepiece,
    require_torch,
    require_vision,
@@ -998,3 +1001,79 @@ class DonutModelIntegrationTest(unittest.TestCase):
                outputs.scores[0][0, :3], torch.tensor([-17.6490, -4.8381, -15.7577], device=torch_device), atol=1e-4
            )
        )
+
+
+@require_levenshtein
+@require_nltk
+@require_torch
+@require_vision
+@slow
+class NougatModelIntegrationTest(unittest.TestCase):
+    @cached_property
+    def default_processor(self):
+        return NougatProcessor.from_pretrained("facebook/nougat-base") if is_vision_available() else None
+
+    @cached_property
+    def default_model(self):
+        return VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base").to(torch_device)
+
+    @cached_property
+    def default_image(self):
+        filepath = hf_hub_download(
+            repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_pdf.png", repo_type="dataset"
+        )
+        image = Image.open(filepath).convert("RGB")
+        return image
+
+    def test_forward_pass(self):
+        processor = self.default_processor
+        model = self.default_model
+        image = self.default_image
+        pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch_device)
+
+        decoder_input_ids = torch.tensor([[0]]).to(torch_device)
+        outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
+        logits = outputs.logits
+
+        # verify the logits
+        expected_shape = torch.Size((1, 1, model.decoder.config.vocab_size))
+        self.assertEqual(logits.shape, expected_shape)
+
+        expected_slice = torch.tensor(
+            [1.6253, -4.2179, 5.8532, -2.7911, -5.0609, -4.7397, -4.2890, -5.1073, -4.8908, -4.9729]
+        ).to(torch_device)
+
+        self.assertTrue(torch.allclose(logits[0, 0, :10], expected_slice, atol=1e-4))
+
+    def test_generation(self):
+        processor = self.default_processor
+        model = self.default_model
+        image = self.default_image
+        pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch_device)
+
+        outputs = model.generate(
+            pixel_values,
+            min_length=1,
+            max_length=3584,
+            bad_words_ids=[[processor.tokenizer.unk_token_id]],
+            return_dict_in_generate=True,
+            output_scores=True,
+        )
+
+        # verify generated sequence
+        generated = processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
+        expected_raw_generation = "# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. 
This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books"
+        self.assertEqual(generated, expected_raw_generation)
+
+        # verify postprocessed sequence
+        generated = processor.post_process_generation(generated, fix_markdown=False)
+        expected_generation = "\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. 
This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books"
+        self.assertEqual(generated, expected_generation)
+
+        # verify scores
+        self.assertEqual(len(outputs.scores), 741)
+        self.assertTrue(
+            torch.allclose(
+                outputs.scores[0][0, :3], torch.tensor([1.6253, -4.2179, 5.8532], device=torch_device), atol=1e-4
+            )
+        )
--- a/utils/not_doctested.txt
+++ b/utils/not_doctested.txt
@@ -692,6 +692,7 @@ src/transformers/models/nezha/modeling_nezha.py
 src/transformers/models/nllb_moe/configuration_nllb_moe.py
 src/transformers/models/nllb_moe/convert_nllb_moe_sharded_original_checkpoint_to_pytorch.py
 src/transformers/models/nllb_moe/modeling_nllb_moe.py
+src/transformers/models/nougat/convert_nougat_to_hf.py
 src/transformers/models/nystromformer/configuration_nystromformer.py
 src/transformers/models/nystromformer/convert_nystromformer_original_pytorch_checkpoint_to_pytorch.py
 src/transformers/models/nystromformer/modeling_nystromformer.py