Add Nougat (#25942)

* Add conversion script

* Add NougatImageProcessor

* Add crop margin

* More improvements

* Add docs, READMEs

* Remove print statements

* Include model_max_length

* Add NougatTokenizerFast

* Fix imports

* Improve postprocessing

* Improve image processor

* Fix image processor

* Improve normalize method

* More improvements

* More improvements

* Add processor, improve docs

* Simplify fast tokenizer

* Remove test file

* Fix docstrings

* Use NougatProcessor in conversion script

* Add is_levensthein_available

* Add tokenizer tests

* More improvements

* Use numpy instead of opencv

* Add is_cv2_available

* Fix cv2_available

* Add is_nltk_available

* Add image processor tests, improve crop_margin

* Add integration tests

* Improve integration test

* Use do_rescale instead of hacks, thanks Amy

* Remove random_padding

* Address comments

* Address more comments

* Add import

* Address more comments

* Address more comments

* Address comment

* Address comment

* Set max_model_input_sizes

* Add tests

* Add requires_backends

* Add Nougat to exotic tests

* Use to_pil_image

* Address comment regarding nltk

* Add NLTK

* Improve variable names, integration test

* Add test

* refactor, document, and test regexes

* remove named capture groups, add comments

* format

* add non-markdown fixed tokenization

* format

* correct flakyness of args parse

* add regex comments

* test functionalities for crop_image, align long axis and expected output

* add regex tests

* remove cv2 dependency

* test crop_margin equality between cv2 and python

* refactor table regexes to markdown

add newline

* change print to log, improve doc

* fix high count tables correction

* address PR comments: naming, linting, asserts

* Address comments

* Add copied from

* Update conversion script

* Update conversion script to convert both small and base versions

* Add inference example

* Add more info

* Fix style

* Add require annotators to test

* Define all keyword arguments explicitly

* Move cv2 annotator

* Add tokenizer init method

* Transfer checkpoints

* Add reference to Donut

* Address comments

* Skip test

* Remove cv2 method

* Add copied from statements

* Use cached_property

* Fix docstring

* Add file to not doctested

---------

Co-authored-by: Pablo Montalvo <pablo.montalvo.leroux@gmail.com>
This commit is contained in:
NielsRogge 2023-09-26 07:06:04 +02:00 committed by GitHub
parent 5e09af2acd
commit ace74d16bd
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
31 changed files with 2347 additions and 5 deletions

View File

@ -466,13 +466,15 @@ exotic_models_job = CircleCIJob(
"sudo apt install tesseract-ocr",
"pip install -U --upgrade-strategy eager pytesseract",
"pip install -U --upgrade-strategy eager natten",
# TODO (ydshieh): Remove this line once `https://github.com/facebookresearch/detectron2/issues/5010` is resolved
'pip install -U --upgrade-strategy eager "Pillow<10.0.0"',
"pip install -U --upgrade-strategy eager python-Levenshtein",
"pip install -U --upgrade-strategy eager opencv-python",
"pip install -U --upgrade-strategy eager nltk",
],
tests_to_run=[
"tests/models/*layoutlmv*",
"tests/models/*nat",
"tests/models/deta",
"tests/models/nougat",
],
pytest_num_workers=1,
pytest_options={"durations": 100},

View File

@ -426,6 +426,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).

View File

@ -402,6 +402,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).

View File

@ -374,6 +374,7 @@ conda install -c huggingface transformers
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (हुआवेई नूह के आर्क लैब से) साथ में कागज़ [NEZHA: चीनी भाषा समझ के लिए तंत्रिका प्रासंगिक प्रतिनिधित्व](https :/ /arxiv.org/abs/1909.00204) जुन्किउ वेई, ज़ियाओज़े रेन, ज़िआओगुआंग ली, वेनयोंग हुआंग, यी लियाओ, याशेंग वांग, जियाशू लिन, शिन जियांग, जिओ चेन और कुन लियू द्वारा।
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (फ्रॉम मेटा) साथ में पेपर [नो लैंग्वेज लेफ्ट बिहाइंड: स्केलिंग ह्यूमन-सेंटेड मशीन ट्रांसलेशन] (https://arxiv.org/abs/2207.04672) एनएलएलबी टीम द्वारा प्रकाशित।
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta से) the NLLB team. द्वाराअनुसंधान पत्र [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) के साथ जारी किया गया
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (Meta AI से) Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. द्वाराअनुसंधान पत्र [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) के साथ जारी किया गया
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (विस्कॉन्सिन विश्वविद्यालय - मैडिसन से) साथ में कागज [Nyströmformer: A Nyström- आधारित एल्गोरिथम आत्म-ध्यान का अनुमान लगाने के लिए ](https://arxiv.org/abs/2102.03902) युनयांग ज़िओंग, झानपेंग ज़ेंग, रुद्रसिस चक्रवर्ती, मिंगक्सिंग टैन, ग्लेन फंग, यिन ली, विकास सिंह द्वारा पोस्ट किया गया।
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs से) पेपर [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) जितेश जैन, जिआचेन ली, मांगटिक चिउ, अली हसनी, निकिता ओरलोव, हम्फ्री शि के द्वारा जारी किया गया है।
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).

View File

@ -436,6 +436,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noahs Ark Lab から) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu から公開された研究論文: [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204)
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta から) the NLLB team から公開された研究論文: [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta から) the NLLB team. から公開された研究論文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (Meta AI から) Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. から公開された研究論文 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison から) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh から公開された研究論文: [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902)
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs から) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi から公開された研究論文: [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220)
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).

View File

@ -351,6 +351,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noahs Ark Lab 에서) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 의 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 논문과 함께 발표했습니다.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta 에서) the NLLB team 의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 논문과 함께 발표했습니다.
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta 에서 제공)은 the NLLB team.의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)논문과 함께 발표했습니다.
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (Meta AI 에서 제공)은 Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.의 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)논문과 함께 발표했습니다.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison 에서) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 의 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 논문과 함께 발표했습니다.
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs 에서) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 의 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 논문과 함께 발표했습니다.
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).

View File

@ -375,6 +375,7 @@ conda install -c huggingface transformers
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (来自华为诺亚方舟实验室) 伴随论文 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 由 Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 发布。
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (来自 Meta AI) 伴随论文 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) 由 Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic 发布。
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (来自 the University of Wisconsin - Madison) 伴随论文 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 由 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 发布。
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (来自 SHI Labs) 伴随论文 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 由 Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 发布。
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (来自 [s-JoL](https://huggingface.co/s-JoL)) 由 [Open-Llama](https://github.com/s-JoL/Open-Llama) 发布.

View File

@ -387,6 +387,7 @@ conda install -c huggingface transformers
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nougat](https://huggingface.co/docs/transformers/main/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).

View File

@ -685,6 +685,8 @@
title: MatCha
- local: model_doc/mgp-str
title: MGP-STR
- local: model_doc/nougat
title: Nougat
- local: model_doc/oneformer
title: OneFormer
- local: model_doc/owlvit

View File

@ -191,6 +191,7 @@ The documentation is organized into five sections:
1. **[Nezha](model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[NLLB-MOE](model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nougat](model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OpenLlama](model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released in [Open-Llama](https://github.com/s-JoL/Open-Llama).
@ -411,6 +412,7 @@ Flax), PyTorch, and/or TensorFlow.
| NAT | ✅ | ❌ | ❌ |
| Nezha | ✅ | ❌ | ❌ |
| NLLB-MOE | ✅ | ❌ | ❌ |
| Nougat | ✅ | ✅ | ✅ |
| Nyströmformer | ✅ | ❌ | ❌ |
| OneFormer | ✅ | ❌ | ❌ |
| OpenAI GPT | ✅ | ✅ | ❌ |

View File

@ -0,0 +1,109 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
specific language governing permissions and limitations under the License. -->
# Nougat
## Overview
The Nougat model was proposed in [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by
Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. Nougat uses the same architecture as [Donut](donut), meaning an image Transformer
encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.
The abstract from the paper is the following:
*Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/nougat_architecture.jpg"
alt="drawing" width="600"/>
<small> Nougat high-level overview. Taken from the <a href="https://arxiv.org/abs/2308.13418">original paper</a>. </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/facebookresearch/nougat).
Tips:
- The quickest way to get started with Nougat is by checking the [tutorial
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Nougat), which show how to use the model
at inference time as well as fine-tuning on custom data.
- Nougat is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework. The model is identical to [Donut](donut) in terms of architecture.
## Inference
Nougat's [`VisionEncoderDecoder`] model accepts images as input and makes use of
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
The [`NougatImageProcessor`] class is responsible for preprocessing the input image and
[`NougatTokenizerFast`] decodes the generated target tokens to the target string. The
[`NougatProcessor`] wraps [`NougatImageProcessor`] and [`NougatTokenizerFast`] classes
into a single instance to both extract the input features and decode the predicted token ids.
- Step-by-step PDF transcription
```py
>>> from huggingface_hub import hf_hub_download
>>> import re
>>> from PIL import Image
>>> from transformers import NougatProcessor, VisionEncoderDecoderModel
>>> from datasets import load_dataset
>>> import torch
>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base")
>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device) # doctest: +IGNORE_RESULT
>>> # prepare PDF image for the model
>>> filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
>>> image = Image.open(filepath)
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> # generate transcription (here we only generate 30 tokens)
>>> outputs = model.generate(
... pixel_values.to(device),
... min_length=1,
... max_new_tokens=30,
... bad_words_ids=[[processor.tokenizer.unk_token_id]],
... )
>>> sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
>>> sequence = processor.post_process_generation(sequence, fix_markdown=False)
>>> # note: we're using repr here such for the sake of printing the \n characters, feel free to just print the sequence
>>> print(repr(sequence))
'\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@'
```
See the [model hub](https://huggingface.co/models?filter=nougat) to look for Nougat checkpoints.
## NougatImageProcessor
[[autodoc]] NougatImageProcessor
- preprocess
## NougatTokenizerFast
[[autodoc]] NougatTokenizerFast
## NougatProcessor
[[autodoc]] NougatProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
- post_process_generation

View File

@ -453,6 +453,7 @@ _import_structure = {
"models.nezha": ["NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP", "NezhaConfig"],
"models.nllb": [],
"models.nllb_moe": ["NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP", "NllbMoeConfig"],
"models.nougat": ["NougatProcessor"],
"models.nystromformer": [
"NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"NystromformerConfig",
@ -859,6 +860,7 @@ else:
_import_structure["models.mt5"].append("MT5TokenizerFast")
_import_structure["models.mvp"].append("MvpTokenizerFast")
_import_structure["models.nllb"].append("NllbTokenizerFast")
_import_structure["models.nougat"].append("NougatTokenizerFast")
_import_structure["models.openai"].append("OpenAIGPTTokenizerFast")
_import_structure["models.pegasus"].append("PegasusTokenizerFast")
_import_structure["models.realm"].append("RealmTokenizerFast")
@ -973,6 +975,7 @@ else:
_import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
_import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"])
_import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
_import_structure["models.nougat"].append("NougatImageProcessor")
_import_structure["models.oneformer"].extend(["OneFormerImageProcessor"])
_import_structure["models.owlvit"].extend(["OwlViTFeatureExtractor", "OwlViTImageProcessor"])
_import_structure["models.perceiver"].extend(["PerceiverFeatureExtractor", "PerceiverImageProcessor"])
@ -4566,6 +4569,7 @@ if TYPE_CHECKING:
from .models.nat import NAT_PRETRAINED_CONFIG_ARCHIVE_MAP, NatConfig
from .models.nezha import NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP, NezhaConfig
from .models.nllb_moe import NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP, NllbMoeConfig
from .models.nougat import NougatProcessor
from .models.nystromformer import NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, NystromformerConfig
from .models.oneformer import ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, OneFormerConfig, OneFormerProcessor
from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer
@ -4944,6 +4948,7 @@ if TYPE_CHECKING:
from .models.mt5 import MT5TokenizerFast
from .models.mvp import MvpTokenizerFast
from .models.nllb import NllbTokenizerFast
from .models.nougat import NougatTokenizerFast
from .models.openai import OpenAIGPTTokenizerFast
from .models.pegasus import PegasusTokenizerFast
from .models.realm import RealmTokenizerFast
@ -5029,6 +5034,7 @@ if TYPE_CHECKING:
from .models.mobilenet_v1 import MobileNetV1FeatureExtractor, MobileNetV1ImageProcessor
from .models.mobilenet_v2 import MobileNetV2FeatureExtractor, MobileNetV2ImageProcessor
from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
from .models.nougat import NougatImageProcessor
from .models.oneformer import OneFormerImageProcessor
from .models.owlvit import OwlViTFeatureExtractor, OwlViTImageProcessor
from .models.perceiver import PerceiverFeatureExtractor, PerceiverImageProcessor

View File

@ -147,6 +147,7 @@ from . import (
nezha,
nllb,
nllb_moe,
nougat,
nystromformer,
oneformer,
openai,

View File

@ -152,6 +152,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("nat", "NatConfig"),
("nezha", "NezhaConfig"),
("nllb-moe", "NllbMoeConfig"),
("nougat", "VisionEncoderDecoderConfig"),
("nystromformer", "NystromformerConfig"),
("oneformer", "OneFormerConfig"),
("open-llama", "OpenLlamaConfig"),
@ -580,6 +581,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("nezha", "Nezha"),
("nllb", "NLLB"),
("nllb-moe", "NLLB-MOE"),
("nougat", "Nougat"),
("nystromformer", "Nyströmformer"),
("oneformer", "OneFormer"),
("open-llama", "OpenLlama"),

View File

@ -82,6 +82,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
("mobilevit", "MobileViTImageProcessor"),
("mobilevitv2", "MobileViTImageProcessor"),
("nat", "ViTImageProcessor"),
("nougat", "NougatImageProcessor"),
("oneformer", "OneFormerImageProcessor"),
("owlvit", "OwlViTImageProcessor"),
("perceiver", "PerceiverImageProcessor"),

View File

@ -265,8 +265,7 @@ class DonutImageProcessor(BaseImageProcessor):
**kwargs,
) -> np.ndarray:
"""
Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
resized to keep the input aspect ratio.
Resizes `image` to `(height, width)` specified by `size` using the PIL library.
Args:
image (`np.ndarray`):

View File

@ -0,0 +1,63 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_vision_available
_import_structure = {
"processing_nougat": ["NougatProcessor"],
}
try:
if not is_tokenizers_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["tokenization_nougat_fast"] = ["NougatTokenizerFast"]
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["image_processing_nougat"] = ["NougatImageProcessor"]
if TYPE_CHECKING:
from .processing_nougat import NougatProcessor
try:
if not is_tokenizers_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .tokenization_nougat_fast import NougatTokenizerFast
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .image_processing_nougat import NougatImageProcessor
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)

View File

@ -0,0 +1,282 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert Nougat checkpoints using the original `nougat` library. URL:
https://github.com/facebookresearch/nougat/tree/main"""
import argparse
import torch
from huggingface_hub import hf_hub_download
from nougat import NougatModel
from nougat.dataset.rasterize import rasterize_paper
from nougat.utils.checkpoint import get_checkpoint
from PIL import Image
from transformers import (
DonutSwinConfig,
DonutSwinModel,
MBartConfig,
MBartForCausalLM,
NougatImageProcessor,
NougatProcessor,
NougatTokenizerFast,
VisionEncoderDecoderModel,
)
def get_configs(model):
original_config = model.config
encoder_config = DonutSwinConfig(
image_size=original_config.input_size,
patch_size=4,
depths=original_config.encoder_layer,
num_heads=[4, 8, 16, 32],
window_size=original_config.window_size,
embed_dim=128,
)
decoder_config = MBartConfig(
is_decoder=True,
is_encoder_decoder=False,
add_cross_attention=True,
decoder_layers=original_config.decoder_layer,
max_position_embeddings=original_config.max_position_embeddings,
vocab_size=len(
model.decoder.tokenizer
), # several special tokens are added to the vocab of XLMRobertaTokenizer, see repo on the hub (added_tokens.json)
scale_embedding=True,
add_final_layer_norm=True,
tie_word_embeddings=False,
)
return encoder_config, decoder_config
# Copied from transformers.models.donut.convert_donut_to_pytorch.rename_key
def rename_key(name):
if "encoder.model" in name:
name = name.replace("encoder.model", "encoder")
if "decoder.model" in name:
name = name.replace("decoder.model", "decoder")
if "patch_embed.proj" in name:
name = name.replace("patch_embed.proj", "embeddings.patch_embeddings.projection")
if "patch_embed.norm" in name:
name = name.replace("patch_embed.norm", "embeddings.norm")
if name.startswith("encoder"):
if "layers" in name:
name = "encoder." + name
if "attn.proj" in name:
name = name.replace("attn.proj", "attention.output.dense")
if "attn" in name and "mask" not in name:
name = name.replace("attn", "attention.self")
if "norm1" in name:
name = name.replace("norm1", "layernorm_before")
if "norm2" in name:
name = name.replace("norm2", "layernorm_after")
if "mlp.fc1" in name:
name = name.replace("mlp.fc1", "intermediate.dense")
if "mlp.fc2" in name:
name = name.replace("mlp.fc2", "output.dense")
if name == "encoder.norm.weight":
name = "encoder.layernorm.weight"
if name == "encoder.norm.bias":
name = "encoder.layernorm.bias"
return name
# Copied from transformers.models.donut.convert_donut_to_pytorch.convert_state_dict
def convert_state_dict(orig_state_dict, model):
for key in orig_state_dict.copy().keys():
val = orig_state_dict.pop(key)
if "qkv" in key:
key_split = key.split(".")
layer_num = int(key_split[3])
block_num = int(key_split[5])
dim = model.encoder.encoder.layers[layer_num].blocks[block_num].attention.self.all_head_size
if "weight" in key:
orig_state_dict[
f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.query.weight"
] = val[:dim, :]
orig_state_dict[
f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.key.weight"
] = val[dim : dim * 2, :]
orig_state_dict[
f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.value.weight"
] = val[-dim:, :]
else:
orig_state_dict[
f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.query.bias"
] = val[:dim]
orig_state_dict[
f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.key.bias"
] = val[dim : dim * 2]
orig_state_dict[
f"encoder.encoder.layers.{layer_num}.blocks.{block_num}.attention.self.value.bias"
] = val[-dim:]
elif "attn_mask" in key or key in ["encoder.model.norm.weight", "encoder.model.norm.bias"]:
# HuggingFace implementation doesn't use attn_mask buffer
# and model doesn't use final LayerNorms for the encoder
pass
else:
orig_state_dict[rename_key(key)] = val
return orig_state_dict
def convert_nougat_checkpoint(model_tag, pytorch_dump_folder_path=None, push_to_hub=False):
# load original model
checkpoint_path = get_checkpoint(None, model_tag)
original_model = NougatModel.from_pretrained(checkpoint_path)
original_model.eval()
# load HuggingFace model
encoder_config, decoder_config = get_configs(original_model)
encoder = DonutSwinModel(encoder_config)
decoder = MBartForCausalLM(decoder_config)
model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)
model.eval()
state_dict = original_model.state_dict()
new_state_dict = convert_state_dict(state_dict, model)
model.load_state_dict(new_state_dict)
# verify results on PDF
filepath = hf_hub_download(repo_id="ysharma/nougat", filename="input/nougat.pdf", repo_type="space")
images = rasterize_paper(pdf=filepath, return_pil=True)
image = Image.open(images[0])
tokenizer_file = checkpoint_path / "tokenizer.json"
tokenizer = NougatTokenizerFast(tokenizer_file=str(tokenizer_file))
tokenizer.pad_token = "<pad>"
tokenizer.bos_token = "<s>"
tokenizer.eos_token = "</s>"
tokenizer.unk_token = "<unk>"
tokenizer.model_max_length = original_model.config.max_length
size = {"height": original_model.config.input_size[0], "width": original_model.config.input_size[1]}
image_processor = NougatImageProcessor(
do_align_long_axis=original_model.config.align_long_axis,
size=size,
)
processor = NougatProcessor(image_processor=image_processor, tokenizer=tokenizer)
# verify pixel_values
pixel_values = processor(image, return_tensors="pt").pixel_values
original_pixel_values = original_model.encoder.prepare_input(image).unsqueeze(0)
assert torch.allclose(original_pixel_values, pixel_values)
# verify patch embeddings
original_patch_embed = original_model.encoder.model.patch_embed(pixel_values)
patch_embeddings, _ = model.encoder.embeddings(pixel_values)
assert torch.allclose(original_patch_embed, patch_embeddings)
# verify encoder hidden states
original_last_hidden_state = original_model.encoder(pixel_values)
last_hidden_state = model.encoder(pixel_values).last_hidden_state
assert torch.allclose(original_last_hidden_state, last_hidden_state, atol=1e-2)
# NOTE original model does not use tied weights for embeddings of decoder
original_embeddings = original_model.decoder.model.model.decoder.embed_tokens
embeddings = model.decoder.model.decoder.embed_tokens
assert torch.allclose(original_embeddings.weight, embeddings.weight, atol=1e-3)
# verify decoder hidden states
prompt = "hello world"
decoder_input_ids = original_model.decoder.tokenizer(
prompt, add_special_tokens=False, return_tensors="pt"
).input_ids
decoder_attention_mask = torch.ones_like(decoder_input_ids)
original_logits = original_model(
image_tensors=pixel_values, decoder_input_ids=decoder_input_ids, attention_mask=decoder_attention_mask
).logits
logits = model(
pixel_values,
decoder_input_ids=decoder_input_ids[:, :-1],
decoder_attention_mask=decoder_attention_mask[:, :-1],
).logits
assert torch.allclose(original_logits, logits, atol=1e-3)
# verify generation
outputs = model.generate(
pixel_values,
min_length=1,
max_length=30,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
use_cache=True,
bad_words_ids=[
[tokenizer.unk_token_id],
],
return_dict_in_generate=True,
do_sample=False,
)
generated = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
if model_tag == "0.1.0-base":
expected_generation = "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblec"
elif model_tag == "0.1.0-small":
expected_generation = (
"# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lble"
)
else:
raise ValueError(f"Unexpected model tag: {model_tag}")
assert generated == expected_generation
print("Looks ok!")
if pytorch_dump_folder_path is not None:
print(f"Saving model and processor to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
processor.save_pretrained(pytorch_dump_folder_path)
if push_to_hub:
tag_to_name = {"0.1.0-base": "nougat-base", "0.1.0-small": "nougat-small"}
model_name = tag_to_name[model_tag]
model.push_to_hub(f"facebook/{model_name}")
processor.push_to_hub(f"facebook/{model_name}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--model_tag",
default="0.1.0-base",
required=False,
type=str,
choices=["0.1.0-base", "0.1.0-small"],
help="Tag of the original model you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path",
default=None,
required=False,
type=str,
help="Path to the output PyTorch model directory.",
)
parser.add_argument(
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model and processor to the 🤗 hub.",
)
args = parser.parse_args()
convert_nougat_checkpoint(args.model_tag, args.pytorch_dump_folder_path, args.push_to_hub)

View File

@ -0,0 +1,510 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for Nougat."""
from typing import Dict, List, Optional, Union
import numpy as np
from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
from ...image_transforms import (
get_resize_output_image_size,
pad,
resize,
to_channel_dimension_format,
to_pil_image,
)
from ...image_utils import (
IMAGENET_DEFAULT_MEAN,
IMAGENET_DEFAULT_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
get_image_size,
infer_channel_dimension_format,
is_scaled_image,
make_list_of_images,
to_numpy_array,
valid_images,
)
from ...utils import TensorType, logging
from ...utils.import_utils import is_cv2_available, is_vision_available
logger = logging.get_logger(__name__)
if is_cv2_available():
pass
if is_vision_available():
import PIL
class NougatImageProcessor(BaseImageProcessor):
r"""
Constructs a Nougat image processor.
Args:
do_crop_margin (`bool`, *optional*, defaults to `True`):
Whether to crop the image margins.
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
`do_resize` in the `preprocess` method.
size (`Dict[str, int]` *optional*, defaults to `{"height": 896, "width": 672}`):
Size of the image after resizing. Can be overridden by `size` in the `preprocess` method.
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
do_thumbnail (`bool`, *optional*, defaults to `True`):
Whether to resize the image using thumbnail method.
do_align_long_axis (`bool`, *optional*, defaults to `False`):
Whether to align the long axis of the image with the long axis of `size` by rotating by 90 degrees.
do_pad (`bool`, *optional*, defaults to `True`):
Whether to pad the images to the largest image size in the batch.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
parameter in the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
`preprocess` method.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_MEAN`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
Image standard deviation.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_crop_margin: bool = True,
do_resize: bool = True,
size: Dict[str, int] = None,
resample: PILImageResampling = PILImageResampling.BILINEAR,
do_thumbnail: bool = True,
do_align_long_axis: bool = False,
do_pad: bool = True,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
**kwargs,
) -> None:
super().__init__(**kwargs)
size = size if size is not None else {"height": 896, "width": 672}
size = get_size_dict(size)
self.do_crop_margin = do_crop_margin
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_thumbnail = do_thumbnail
self.do_align_long_axis = do_align_long_axis
self.do_pad = do_pad
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
def python_find_non_zero(self, image: np.array):
"""This is a reimplementation of a findNonZero function equivalent to cv2."""
non_zero_indices = np.column_stack(np.nonzero(image))
idxvec = non_zero_indices[:, [1, 0]]
idxvec = idxvec.reshape(-1, 1, 2)
return idxvec
def python_bounding_rect(self, coordinates):
"""This is a reimplementation of a BoundingRect function equivalent to cv2."""
min_values = np.min(coordinates, axis=(0, 1)).astype(int)
max_values = np.max(coordinates, axis=(0, 1)).astype(int)
x_min, y_min = min_values[0], min_values[1]
width = max_values[0] - x_min + 1
height = max_values[1] - y_min + 1
return x_min, y_min, width, height
def crop_margin(
self,
image: np.array,
gray_threshold: int = 200,
data_format: Optional[ChannelDimension] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.array:
"""
Crops the margin of the image. Gray pixels are considered margin (i.e., pixels with a value below the
threshold).
Args:
image (`np.array`):
The image to be cropped.
gray_threshold (`int`, *optional*, defaults to `200`)
Value below which pixels are considered to be gray.
data_format (`ChannelDimension`, *optional*):
The channel dimension format of the output image. If unset, will use the inferred format from the
input.
input_data_format (`ChannelDimension`, *optional*):
The channel dimension format of the input image. If unset, will use the inferred format from the input.
"""
if input_data_format is None:
input_data_format = infer_channel_dimension_format(image)
image = to_pil_image(image, input_data_format=input_data_format)
data = np.array(image.convert("L")).astype(np.uint8)
max_val = data.max()
min_val = data.min()
if max_val == min_val:
image = np.array(image)
image = (
to_channel_dimension_format(image, data_format, input_data_format)
if data_format is not None
else image
)
return image
data = (data - min_val) / (max_val - min_val) * 255
gray = data < gray_threshold
coords = self.python_find_non_zero(gray)
x_min, y_min, width, height = self.python_bounding_rect(coords)
image = image.crop((x_min, y_min, x_min + width, y_min + height))
image = np.array(image).astype(np.uint8)
image = to_channel_dimension_format(image, input_data_format, ChannelDimension.LAST)
image = (
to_channel_dimension_format(image, data_format, input_data_format) if data_format is not None else image
)
return image
# Copied from transformers.models.donut.image_processing_donut.DonutImageProcessor.align_long_axis
def align_long_axis(
self,
image: np.ndarray,
size: Dict[str, int],
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.ndarray:
"""
Align the long axis of the image to the longest axis of the specified size.
Args:
image (`np.ndarray`):
The image to be aligned.
size (`Dict[str, int]`):
The size `{"height": h, "width": w}` to align the long axis to.
data_format (`str` or `ChannelDimension`, *optional*):
The data format of the output image. If unset, the same format as the input image is used.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format of the input image. If not provided, it will be inferred.
Returns:
`np.ndarray`: The aligned image.
"""
input_height, input_width = get_image_size(image, channel_dim=input_data_format)
output_height, output_width = size["height"], size["width"]
if (output_width < output_height and input_width > input_height) or (
output_width > output_height and input_width < input_height
):
image = np.rot90(image, 3)
if data_format is not None:
image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
return image
def pad_image(
self,
image: np.ndarray,
size: Dict[str, int],
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.ndarray:
"""
Pad the image to the specified size at the top, bottom, left and right.
Args:
image (`np.ndarray`):
The image to be padded.
size (`Dict[str, int]`):
The size `{"height": h, "width": w}` to pad the image to.
data_format (`str` or `ChannelDimension`, *optional*):
The data format of the output image. If unset, the same format as the input image is used.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format of the input image. If not provided, it will be inferred.
"""
output_height, output_width = size["height"], size["width"]
input_height, input_width = get_image_size(image, channel_dim=input_data_format)
delta_width = output_width - input_width
delta_height = output_height - input_height
pad_top = delta_height // 2
pad_left = delta_width // 2
pad_bottom = delta_height - pad_top
pad_right = delta_width - pad_left
padding = ((pad_top, pad_bottom), (pad_left, pad_right))
return pad(image, padding, data_format=data_format, input_data_format=input_data_format)
# Copied from transformers.models.donut.image_processing_donut.DonutImageProcessor.thumbnail
def thumbnail(
self,
image: np.ndarray,
size: Dict[str, int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Resize the image to make a thumbnail. The image is resized so that no dimension is larger than any
corresponding dimension of the specified size.
Args:
image (`np.ndarray`):
The image to be resized.
size (`Dict[str, int]`):
The size `{"height": h, "width": w}` to resize the image to.
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
The resampling filter to use.
data_format (`Optional[Union[str, ChannelDimension]]`, *optional*):
The data format of the output image. If unset, the same format as the input image is used.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format of the input image. If not provided, it will be inferred.
"""
input_height, input_width = get_image_size(image, channel_dim=input_data_format)
output_height, output_width = size["height"], size["width"]
# We always resize to the smallest of either the input or output size.
height = min(input_height, output_height)
width = min(input_width, output_width)
if height == input_height and width == input_width:
return image
if input_height > input_width:
width = int(input_width * height / input_height)
elif input_width > input_height:
height = int(input_height * width / input_width)
return resize(
image,
size=(height, width),
resample=resample,
reducing_gap=2.0,
data_format=data_format,
input_data_format=input_data_format,
**kwargs,
)
# Copied from transformers.models.donut.image_processing_donut.DonutImageProcessor.resize
def resize(
self,
image: np.ndarray,
size: Dict[str, int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Resizes `image` to `(height, width)` specified by `size` using the PIL library.
Args:
image (`np.ndarray`):
Image to resize.
size (`Dict[str, int]`):
Size of the output image.
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
Resampling filter to use when resiizing the image.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format of the image. If not provided, it will be the same as the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format of the input image. If not provided, it will be inferred.
"""
size = get_size_dict(size)
shortest_edge = min(size["height"], size["width"])
output_size = get_resize_output_image_size(
image, size=shortest_edge, default_to_square=False, input_data_format=input_data_format
)
resized_image = resize(
image,
size=output_size,
resample=resample,
data_format=data_format,
input_data_format=input_data_format,
**kwargs,
)
return resized_image
def preprocess(
self,
images: ImageInput,
do_crop_margin: bool = None,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
do_thumbnail: bool = None,
do_align_long_axis: bool = None,
do_pad: bool = None,
do_rescale: bool = None,
rescale_factor: Union[int, float] = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> PIL.Image.Image:
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255.
do_crop_margin (`bool`, *optional*, defaults to `self.do_crop_margin`):
Whether to crop the image margins.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Size of the image after resizing. Shortest edge of the image is resized to min(size["height"],
size["width"]) with the longest edge resized to keep the input aspect ratio.
resample (`int`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
do_thumbnail (`bool`, *optional*, defaults to `self.do_thumbnail`):
Whether to resize the image using thumbnail method.
do_align_long_axis (`bool`, *optional*, defaults to `self.do_align_long_axis`):
Whether to align the long axis of the image with the long axis of `size` by rotating by 90 degrees.
do_pad (`bool`, *optional*, defaults to `self.do_pad`):
Whether to pad the images to the largest image size in the batch.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image by the specified scale `rescale_factor`.
rescale_factor (`int` or `float`, *optional*, defaults to `self.rescale_factor`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: defaults to the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_crop_margin = do_crop_margin if do_crop_margin is not None else self.do_crop_margin
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
resample = resample if resample is not None else self.resample
do_thumbnail = do_thumbnail if do_thumbnail is not None else self.do_thumbnail
do_align_long_axis = do_align_long_axis if do_align_long_axis is not None else self.do_align_long_axis
do_pad = do_pad if do_pad is not None else self.do_pad
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
if do_resize and size is None:
raise ValueError("Size must be specified if do_resize is True.")
if do_pad and size is None:
raise ValueError("Size must be specified if do_pad is True.")
if do_rescale and rescale_factor is None:
raise ValueError("Rescale factor must be specified if do_rescale is True.")
if do_normalize and (image_mean is None or image_std is None):
raise ValueError("Image mean and std must be specified if do_normalize is True.")
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if is_scaled_image(images[0]) and do_rescale:
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
if do_crop_margin:
images = [self.crop_margin(image, input_data_format=input_data_format) for image in images]
if do_align_long_axis:
images = [self.align_long_axis(image, size=size, input_data_format=input_data_format) for image in images]
if do_resize:
images = [
self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
for image in images
]
if do_thumbnail:
images = [self.thumbnail(image=image, size=size, input_data_format=input_data_format) for image in images]
if do_pad:
images = [self.pad_image(image=image, size=size, input_data_format=input_data_format) for image in images]
if do_rescale:
images = [
self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
for image in images
]
if do_normalize:
images = [
self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
for image in images
]
images = [
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)

View File

@ -0,0 +1,160 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for Nougat.
"""
from typing import Dict, List, Optional, Union
from transformers.image_utils import ChannelDimension, PILImageResampling
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput, TruncationStrategy
from ...processing_utils import ProcessorMixin
from ...utils import PaddingStrategy, TensorType
class NougatProcessor(ProcessorMixin):
r"""
Constructs a Nougat processor which wraps a Nougat image processor and a Nougat tokenizer into a single processor.
[`NougatProcessor`] offers all the functionalities of [`NougatImageProcessor`] and [`NougatTokenizerFast`]. See the
[`~NougatProcessor.__call__`] and [`~NougatProcessor.decode`] for more information.
Args:
image_processor ([`NougatImageProcessor`]):
An instance of [`NougatImageProcessor`]. The image processor is a required input.
tokenizer ([`NougatTokenizerFast`]):
An instance of [`NougatTokenizerFast`]. The tokenizer is a required input.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(self, image_processor, tokenizer):
super().__init__(image_processor, tokenizer)
self.current_processor = self.image_processor
def __call__(
self,
images=None,
text=None,
do_crop_margin: bool = None,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
do_thumbnail: bool = None,
do_align_long_axis: bool = None,
do_pad: bool = None,
do_rescale: bool = None,
rescale_factor: Union[int, float] = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
text_pair_target: Optional[
Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = None,
max_length: Optional[int] = None,
stride: int = 0,
is_split_into_words: bool = False,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
):
if images is None and text is None:
raise ValueError("You need to specify either an `images` or `text` input to process.")
if images is not None:
inputs = self.image_processor(
images,
do_crop_margin=do_crop_margin,
do_resize=do_resize,
size=size,
resample=resample,
do_thumbnail=do_thumbnail,
do_align_long_axis=do_align_long_axis,
do_pad=do_pad,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
return_tensors=return_tensors,
data_format=data_format,
input_data_format=input_data_format,
)
if text is not None:
encodings = self.tokenizer(
text,
text_pair=text_pair,
text_target=text_target,
text_pair_target=text_pair_target,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
stride=stride,
is_split_into_words=is_split_into_words,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
)
if text is None:
return inputs
elif images is None:
return encodings
else:
inputs["labels"] = encodings["input_ids"]
return inputs
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to NougatTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please refer
to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to NougatTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
def post_process_generation(self, *args, **kwargs):
"""
This method forwards all its arguments to NougatTokenizer's [`~PreTrainedTokenizer.post_process_generation`].
Please refer to the docstring of this method for more information.
"""
return self.tokenizer.post_process_generation(*args, **kwargs)

View File

@ -0,0 +1,634 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fast tokenizer class for Nougat.
"""
import re
from functools import partial
from multiprocessing import Pool
from typing import List, Union
import numpy as np
from transformers.tokenization_utils_base import INIT_TOKENIZER_DOCSTRING
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
from transformers.utils import add_end_docstrings
from ...utils import is_levenshtein_available, is_nltk_available, logging, requires_backends
if is_levenshtein_available():
from Levenshtein import ratio
if is_nltk_available():
import nltk
logger = logging.get_logger(__name__)
INIT_TOKENIZER_DOCSTRING += """
tokenizer_object ([`tokenizers.Tokenizer`]):
A [`tokenizers.Tokenizer`] object from 🤗 tokenizers to instantiate from. See [Using tokenizers from 🤗
tokenizers](../fast_tokenizers) for more information.
tokenizer_file ([`str`]):
A path to a local JSON file representing a previously serialized [`tokenizers.Tokenizer`] object from 🤗
tokenizers.
"""
PRETRAINED_VOCAB_FILES_MAP = {
"tokenizer_file": {
"facebook/nougat-base": "https://huggingface.co/facebook/nougat-base/tokenizer/blob/main/tokenizer.json",
},
}
VOCAB_FILES_NAMES = {"tokenizer_file": "tokenizer.json"}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"facebook/nougat-base": 3584}
def markdown_compatible(text: str) -> str:
"""
Make text compatible with Markdown formatting.
This function makes various text formatting adjustments to make it compatible with Markdown.
Args:
text (`str`):
The input text to be made Markdown-compatible.
Returns:
`str`: The Markdown-compatible text.
"""
# equation tag
# Replace lines that start with a pattern like (decimal) \[some text\] with \[[some text] \tag{decimal}\].
text = re.sub(r"^\(([\d.]+[a-zA-Z]?)\) \\\[(.+?)\\\]$", r"\[\2 \\tag{\1}\]", text, flags=re.M)
# Replace lines that start with a pattern like \[some text\] (decimal) with \[[some text] \tag{decimal}\].
text = re.sub(r"^\\\[(.+?)\\\] \(([\d.]+[a-zA-Z]?)\)$", r"\[\1 \\tag{\2}\]", text, flags=re.M)
# Replace lines that start with a pattern like \[some text\] (digits) \[another text\] with \[[some text] \tag{digits}\] [another text].
text = re.sub(
r"^\\\[(.+?)\\\] \(([\d.]+[a-zA-Z]?)\) (\\\[.+?\\\])$",
r"\[\1 \\tag{\2}\] \3",
text,
flags=re.M,
)
# multi line
text = text.replace(r"\. ", ". ")
# bold formatting
text = text.replace(r"\bm{", r"\mathbf{").replace(r"{\\bm ", r"\mathbf{")
text = re.sub(r"\\mbox{ ?\\boldmath\$(.*?)\$}", r"\\mathbf{\1}", text)
# Reformat urls (http, ftp and https only) to markdown [url](url) clickable format
text = re.sub(
r"((?:http|ftp|https):\/\/(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-]))",
r"[\1](\1)",
text,
)
# algorithms
text = re.sub(r"```\s*(.+?)\s*```", r"```\n\1\n```", text, flags=re.S)
return text
def normalize_list_like_lines(generation):
"""
Normalize lines in the given text that resemble list items. The function looks for lines that start optionally with
'-' or '*', possibly followed by Roman numerals or digits indicating nesting levels. The function reformats such
lines to make them more structured.
Args:
generation (str): The input text containing lines that need to be normalized.
Returns:
str: The input text with the list-like lines normalized.
Note:
The function uses regular expressions to identify and reformat the list-like lines. The patterns capture
optional bullet points, nesting levels indicated by numerals, and the actual list item content. The
normalization adjusts the bullet point style and nesting levels based on the captured patterns.
"""
# This matches lines starting with - or *, not followed by - or * (lists)
# that are then numbered by digits \d or roman numerals (one or more)
# and then, optional additional numbering of this line is captured
# this is then fed to re.finditer.
pattern = r"(?:^)(-|\*)?(?!-|\*) ?((?:\d|[ixv])+ )?.+? (-|\*) (((?:\d|[ixv])+)\.(\d|[ixv]) )?.*(?:$)"
for match in reversed(list(re.finditer(pattern, generation, flags=re.I | re.M))):
start, stop = match.span()
delim = match.group(3) + " "
splits = match.group(0).split(delim)
replacement = ""
if match.group(1) is not None:
splits = splits[1:]
delim1 = match.group(1) + " "
else:
delim1 = ""
continue # Skip false positives
pre, post = generation[:start], generation[stop:]
for i, item in enumerate(splits):
level = 0
potential_numeral, _, rest = item.strip().partition(" ")
if not rest:
continue
# Infer current nesting level based on detected numbering
if re.match(r"^[\dixv]+((?:\.[\dixv])?)+$", potential_numeral, flags=re.I | re.M):
level = potential_numeral.count(".")
replacement += (
("\n" if i > 0 else "") + ("\t" * level) + (delim if i > 0 or start == 0 else delim1) + item.strip()
)
if post == "":
post = "\n"
generation = pre + replacement + post
return generation
def find_next_punctuation(text: str, start_idx=0):
"""
Find the index of the next punctuation mark.
Args:
text (`str`):
String to examine
start_idx (`int`, *optional*)
Index where to start
"""
for i in range(start_idx, len(text)):
if text[i] in [".", "?", "!", "\n"]:
return i
return None
def truncate_repetitions(text: str, min_len: int = 30) -> str:
"""
Attempt to truncate repeating segments in the input string.
This function looks for the longest repeating substring at the end of the input string and truncates it to appear
only once. To be considered for removal, repetitions need to be continuous.
Args:
text (`str`):
The input raw prediction to be truncated.
min_len (int):
The minimum length of the repeating segment.
Returns:
`str`: The input string with repeated segments truncated.
"""
text_lower = text.lower()
text_length = len(text_lower)
if text_length < 2 * min_len:
return text
# try to find a length at which the tail is repeating
max_repetition_length = None
for repetition_length in range(min_len, int(text_length / 2)):
# check if there is a repetition at the end
same = True
for i in range(0, repetition_length):
if text_lower[text_length - repetition_length - i - 1] != text_lower[text_length - i - 1]:
same = False
break
if same:
max_repetition_length = repetition_length
if max_repetition_length is None:
return text
lcs = text_lower[-max_repetition_length:]
# remove all but the last repetition
substituted_text = text
substituted_text_lower = text_lower
while substituted_text_lower.endswith(lcs):
substituted_text = substituted_text[:-max_repetition_length]
substituted_text_lower = substituted_text_lower[:-max_repetition_length]
# this is the tail with the repetitions
repeating_tail = text_lower[len(substituted_text_lower) :]
# add until next punctuation and make sure last sentence is not repeating
substituted_text_lower_out = substituted_text_lower
while True:
sentence_end = find_next_punctuation(text_lower, len(substituted_text_lower_out))
sentence_start = find_next_punctuation(text_lower[::-1], len(substituted_text_lower_out))
if sentence_end and sentence_start:
sentence = text_lower[sentence_start:sentence_end]
substituted_text_lower_out = text_lower[: sentence_end + 1]
if sentence in repeating_tail:
break
else:
break
text_out = text[: len(substituted_text_lower_out)]
return text_out
def remove_numbers(lines):
def _clean(s):
return re.sub(r"(?:[\d_]|\*\*)", "", s).strip()
if type(lines) is str:
return _clean(lines)
out = []
for l in lines:
out.append(_clean(l))
return out
def get_slices(lines, clean_lines):
"""
Get slices of text based on specific criteria within the lines.
This function identifies and returns slices of text from the input lines based on certain conditions.
These conditions were chosen by the Nougat authors:
- The slice is less than 200 characters long.
- The slice is more than 3 characters long.
- The slice does not start with "[MISSING_PAGE".
- The slice is either the same as the next slice or the ratio of the two in terms of Levensthein distance is
greater than 0.9.
Args:
lines (`List[str]`):
The list of lines containing the text.
clean_lines (`List[str]`):
A cleaned version of the text (without numbers).
Returns:
`List[tuple]`: A list of tuples representing the start and end indices of text slices.
"""
indices = np.zeros(len(lines))
for i in range(len(lines) - 1):
j = i + 1
while not clean_lines[j] and j < len(lines) - 1:
j += 1
if (
len(clean_lines[i]) < 200
and len(clean_lines[i]) > 3
and len(clean_lines[j]) < 200
and len(clean_lines[j]) > 3
and not clean_lines[i].startswith("[MISSING_PAGE")
and (clean_lines[i] == clean_lines[j] or ratio(clean_lines[i], clean_lines[j]) > 0.9)
):
indices[i:j] = 1
ids = np.where(indices)[0]
slices = []
if len(ids) == 0:
return slices
j0 = 0
for j, x in enumerate(np.diff(ids) > 3):
if x:
slices.append((ids[j0], ids[j] + 2))
j0 = j + 1
slices.append((ids[j0], ids[-1] + 2))
return [sli for sli in slices if sli[1] - sli[0] > 15]
def remove_slice_from_lines(lines, clean_text, slice) -> str:
"""
Remove a slice of text from the lines based on specific criteria.
This function identifies a slice of text within the lines and removes it based on certain conditions.
Args:
lines (list of str): The list of lines containing the text.
clean_text (list of str): A cleaned version of the text (without numbers).
slice (tuple): A tuple representing the start and end indices of the slice to be removed.
Returns:
str: The removed slice of text as a single string.
"""
base = clean_text[slice[0]]
section = list(slice)
check_start_flag = False
# backwards pass, at most 5 lines
for line_idx in range(max(0, slice[0] - 1), max(0, slice[0] - 5), -1):
if not lines[line_idx]:
continue
if lines[line_idx] == "## References":
section[0] = line_idx
break
elif ratio(base, remove_numbers(lines[line_idx])) < 0.9:
section[0] = line_idx + 1
potential_ref = remove_numbers(lines[max(0, line_idx - 1)].partition("* [")[-1])
if len(potential_ref) >= 0.75 * len(base) and ratio(base, potential_ref) < 0.9:
section[0] = line_idx
check_start_flag = True
break
# forward pass, at most 5 lines
for line_idx in range(min(len(lines), slice[1]), min(len(lines), slice[1] + 5)):
if ratio(base, remove_numbers(lines[line_idx])) < 0.9:
section[1] = line_idx
break
if len(lines) <= section[1]:
section[1] = len(lines) - 1
to_delete = "\n".join(lines[section[0] : section[1] + 1])
# cut off next page content
itera, iterb = enumerate(lines[section[1] - 1]), enumerate(lines[section[1]])
while True:
try:
(ia, a) = next(itera)
while a.isnumeric():
(ia, a) = next(itera)
(ib, b) = next(iterb)
while b.isnumeric():
(ib, b) = next(iterb)
if a != b:
break
except StopIteration:
break
if check_start_flag and "* [" in to_delete:
to_delete = "* [" + to_delete.partition("* [")[-1]
try:
delta = len(lines[section[1]]) - ib - 1
if delta > 0:
to_delete = to_delete[:-delta]
except UnboundLocalError:
pass
return to_delete.strip()
@add_end_docstrings(INIT_TOKENIZER_DOCSTRING)
class NougatTokenizerFast(PreTrainedTokenizerFast):
"""
Fast tokenizer for Nougat (backed by HuggingFace tokenizers library).
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods. This class mainly adds Nougat-specific
methods for postprocessing the generated text.
Args:
vocab_file (`str`):
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
contains the vocabulary necessary to instantiate a tokenizer.
tokenizer_file (`str`):
[tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
contains everything needed to load the tokenizer.
clean_up_tokenization_spaces (`str`, *optional*, defaults to `False`):
Wether to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra
spaces.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
slow_tokenizer_class = None
def __init__(
self,
vocab_file=None,
tokenizer_file=None,
clean_up_tokenization_spaces=False,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token="<pad>",
**kwargs,
):
super().__init__(
vocab_file=vocab_file,
tokenizer_file=tokenizer_file,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
unk_token=unk_token,
bos_token=bos_token,
eos_token=eos_token,
pad_token=pad_token,
**kwargs,
)
self.vocab_file = vocab_file
def remove_hallucinated_references(self, text: str) -> str:
"""
Remove hallucinated or missing references from the text.
This function identifies and removes references that are marked as missing or hallucinated from the input text.
Args:
text (`str`):
The input text containing references.
Returns:
`str`: The text with hallucinated references removed.
"""
lines = text.split("\n")
if len(lines) == 0:
return ""
clean_lines = remove_numbers(lines)
slices = get_slices(lines, clean_lines)
to_delete = []
for slice in slices:
to_delete.append(remove_slice_from_lines(lines, clean_lines, slice))
for to_delete in reversed(to_delete):
text = text.replace(to_delete, "\n\n[MISSING_PAGE_POST]\n\n")
text = re.sub(
r"## References\n+\[MISSING_PAGE_POST(:\d+)?\]",
"\n\n[MISSING_PAGE_POST\\1]",
text,
)
return text
def correct_tables(self, generation: str) -> str:
"""
Takes a generated string and fixes tables/tabulars to make them match the markdown format needed.
Args:
generation (str): The generated text to be postprocessed.
Returns:
str: The postprocessed text.
Example:
```python
correct_tables("\\begin{table} \\begin{tabular}{l l} & \\ \\end{tabular} \\end{table}")
"\\begin{table}\n\\begin{tabular}{l l} & \\ \\end{tabular}\n\\end{table}"
```
"""
# remove obvious wrong tables
for l in generation.split("\n"):
if l.count("\\begin{tabular}") > 15 or l.count("\\multicolumn") > 60 or l.count("&") > 400:
generation = generation.replace(l, "")
# whitespace corrections
generation = generation.replace("\\begin{table} \\begin{tabular}", "\\begin{table}\n\\begin{tabular}")
generation = generation.replace("\\end{tabular} \\end{table}", "\\end{tabular}\n\\end{table}")
generation = generation.replace("\\end{table} Tab", "\\end{table}\nTab")
generation = re.sub(r"(^.+)\\begin{tab", r"\1\n\\begin{tab", generation, flags=re.M)
# Remove left-aligned empty LaTeX tabular blocks.
generation = generation.replace(r"\begin{tabular}{l l} & \\ \end{tabular}", "")
# Remove tabulars with just 2 newline characters.
generation = generation.replace("\\begin{tabular}{}\n\n\\end{tabular}", "")
return generation
def post_process_single(self, generation: str, fix_markdown: bool = True) -> str:
"""
Postprocess a single generated text. Regular expressions used here are taken directly from the Nougat article
authors. These expressions are commented for clarity and tested end-to-end in most cases.
Args:
generation (str): The generated text to be postprocessed.
fix_markdown (bool, optional): Whether to perform Markdown formatting fixes. Default is True.
Returns:
str: The postprocessed text.
"""
generation = re.sub(
r"(?:\n|^)#+ \d*\W? ?(.{100,})", r"\n\1", generation
) # too long section titles probably are none
generation = generation.strip()
# Remove LaTeX left margin tag
generation = generation.replace("\n* [leftmargin=*]\n", "\n")
# Remove lines with markdown headings starting with #, with numerals,
# and possibly roman numerals with trailing spaces and newlines
generation = re.sub(r"^#+ (?:\.?(?:\d|[ixv])+)*\s*(?:$|\n\s*)", "", generation, flags=re.M)
# most likely hallucinated titles
lines = generation.split("\n")
if lines[-1].startswith("#") and lines[-1].lstrip("#").startswith(" ") and len(lines) > 1:
logger.info("Likely hallucinated title at the end of the page: " + lines[-1])
generation = "\n".join(lines[:-1])
# obvious repetition detection
generation = truncate_repetitions(generation)
# Reference corrections
generation = self.remove_hallucinated_references(generation)
# Remove lines starting with asterisks and numbers like "*[1]" and followed by capital letters and periods (ie too long references)
generation = re.sub(r"^\* \[\d+\](\s?[A-W]\.+\s?){10,}.*$", "", generation, flags=re.M)
# Remove empty brackets after a reference number in brackets. *[12][]ABC will become *[12]ABC
generation = re.sub(r"^(\* \[\d+\])\[\](.*)$", r"\1\2", generation, flags=re.M)
# Remove single characters before or after 2 new lines
generation = re.sub(r"(^\w\n\n|\n\n\w$)", "", generation)
# pmc math artifact correction
generation = re.sub(
r"([\s.,()])_([a-zA-Z0-9])__([a-zA-Z0-9]){1,3}_([\s.,:()])",
r"\1\(\2_{\3}\)\4",
generation,
)
generation = re.sub(r"([\s.,\d])_([a-zA-Z0-9])_([\s.,\d;])", r"\1\(\2\)\3", generation)
# footnote mistakes
generation = re.sub(
r"(\nFootnote .*?:) (?:footnotetext|thanks):\W*(.*(?:\n\n|$))",
r"\1 \2",
generation,
)
# TODO Come up with footnote formatting inside a table
generation = re.sub(r"\[FOOTNOTE:.+?\](.*?)\[ENDFOOTNOTE\]", "", generation)
# itemize post processing
generation = normalize_list_like_lines(generation)
if generation.endswith((".", "}")):
generation += "\n\n"
if re.match(r"[A-Z0-9,;:]$", generation):
# add space in case it there is a comma or word ending
generation += " "
elif generation.startswith(("#", "**", "\\begin")):
generation = "\n\n" + generation
elif generation.split("\n")[-1].startswith(("#", "Figure", "Table")):
generation = generation + "\n\n"
else:
try:
last_word = generation.split(" ")[-1]
if last_word in nltk.corpus.words.words():
generation += " "
except LookupError:
# add space just in case. Will split words but better than concatenating them
generation += " "
# table corrections
generation = self.correct_tables(generation)
# Remove optional, empty square brackets after begin{array}
generation = generation.replace("\\begin{array}[]{", "\\begin{array}{")
# Remove empty or malformed LaTeX tabular blocks with 2 or more columns specified, with spaces and ampersands.
generation = re.sub(
r"\\begin{tabular}{([clr ]){2,}}\s*[& ]*\s*(\\\\)? \\end{tabular}",
"",
generation,
)
# Remove lines containing "S.A.B." one or more times. Was included in Nougat's code.
generation = re.sub(r"(\*\*S\. A\. B\.\*\*\n+){2,}", "", generation)
# Remove markdown-style headers that are incomplete or empty on multiple lines.
generation = re.sub(r"^#+( [\[\d\w])?$", "", generation, flags=re.M)
# Remove lines with just one period.
generation = re.sub(r"^\.\s*$", "", generation, flags=re.M)
# Replace instances of three or more newlines with just two newlines.
generation = re.sub(r"\n{3,}", "\n\n", generation)
if fix_markdown:
return markdown_compatible(generation)
else:
return generation
def post_process_generation(
self,
generation: Union[str, List[str]],
fix_markdown: bool = True,
num_workers: int = None,
) -> Union[str, List[str]]:
"""
Postprocess a generated text or a list of generated texts.
This function can be used to perform postprocessing on generated text, such as fixing Markdown formatting.
Postprocessing is quite slow so it is recommended to use multiprocessing to speed up the process.
Args:
generation (Union[str, List[str]]):
The generated text or a list of generated texts.
fix_markdown (`bool`, *optional*, defaults to `True`):
Whether to perform Markdown formatting fixes.
num_workers (`int`, *optional*):
Optional number of workers to pass to leverage multiprocessing (postprocessing several texts in
parallel).
Returns:
Union[str, List[str]]: The postprocessed text or list of postprocessed texts.
"""
requires_backends(self, ["nltk", "levenshtein"])
if isinstance(generation, list):
if num_workers is not None and isinstance(num_workers, int):
with Pool(num_workers) as p:
return p.map(partial(self.post_process_single, fix_markdown=fix_markdown), generation)
else:
return [self.post_process_single(s, fix_markdown=fix_markdown) for s in generation]
else:
return self.post_process_single(generation, fix_markdown=fix_markdown)

View File

@ -55,6 +55,7 @@ from .utils import (
is_auto_gptq_available,
is_bitsandbytes_available,
is_bs4_available,
is_cv2_available,
is_cython_available,
is_decord_available,
is_detectron2_available,
@ -69,8 +70,10 @@ from .utils import (
is_jinja_available,
is_jumanpp_available,
is_keras_nlp_available,
is_levenshtein_available,
is_librosa_available,
is_natten_available,
is_nltk_available,
is_onnx_available,
is_optimum_available,
is_pandas_available,
@ -311,6 +314,36 @@ def require_bs4(test_case):
return unittest.skipUnless(is_bs4_available(), "test requires BeautifulSoup4")(test_case)
def require_cv2(test_case):
"""
Decorator marking a test that requires OpenCV.
These tests are skipped when OpenCV isn't installed.
"""
return unittest.skipUnless(is_cv2_available(), "test requires OpenCV")(test_case)
def require_levenshtein(test_case):
"""
Decorator marking a test that requires Levenshtein.
These tests are skipped when Levenshtein isn't installed.
"""
return unittest.skipUnless(is_levenshtein_available(), "test requires Levenshtein")(test_case)
def require_nltk(test_case):
"""
Decorator marking a test that requires NLTK.
These tests are skipped when NLTK isn't installed.
"""
return unittest.skipUnless(is_nltk_available(), "test requires NLTK")(test_case)
def require_accelerate(test_case):
"""
Decorator marking a test that requires accelerate. These tests are skipped when accelerate isn't installed.

View File

@ -108,6 +108,7 @@ from .import_utils import (
is_bitsandbytes_available,
is_bs4_available,
is_coloredlogs_available,
is_cv2_available,
is_cython_available,
is_datasets_available,
is_decord_available,
@ -125,9 +126,11 @@ from .import_utils import (
is_jumanpp_available,
is_kenlm_available,
is_keras_nlp_available,
is_levenshtein_available,
is_librosa_available,
is_natten_available,
is_ninja_available,
is_nltk_available,
is_onnx_available,
is_openai_available,
is_optimum_available,

View File

@ -310,6 +310,13 @@ class NllbTokenizerFast(metaclass=DummyObject):
requires_backends(self, ["tokenizers"])
class NougatTokenizerFast(metaclass=DummyObject):
_backends = ["tokenizers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["tokenizers"])
class OpenAIGPTTokenizerFast(metaclass=DummyObject):
_backends = ["tokenizers"]

View File

@ -359,6 +359,13 @@ class MobileViTImageProcessor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class NougatImageProcessor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class OneFormerImageProcessor(metaclass=DummyObject):
_backends = ["vision"]

View File

@ -75,6 +75,8 @@ _flash_attn_available = _is_package_available("flash_attn")
# `importlib.metadata.version` doesn't work with `bs4` but `beautifulsoup4`. For `importlib.util.find_spec`, reversed.
_bs4_available = importlib.util.find_spec("bs4") is not None
_coloredlogs_available = _is_package_available("coloredlogs")
# `importlib.metadata.util` doesn't work with `opencv-python-headless`.
_cv2_available = importlib.util.find_spec("cv2") is not None
_datasets_available = _is_package_available("datasets")
_decord_available = importlib.util.find_spec("decord") is not None
_detectron2_available = _is_package_available("detectron2")
@ -95,8 +97,10 @@ _jieba_available = _is_package_available("jieba")
_jinja_available = _is_package_available("jinja2")
_kenlm_available = _is_package_available("kenlm")
_keras_nlp_available = _is_package_available("keras_nlp")
_levenshtein_available = _is_package_available("Levenshtein")
_librosa_available = _is_package_available("librosa")
_natten_available = _is_package_available("natten")
_nltk_available = _is_package_available("nltk")
_onnx_available = _is_package_available("onnx")
_openai_available = _is_package_available("openai")
_optimum_available = _is_package_available("optimum")
@ -240,6 +244,10 @@ def is_kenlm_available():
return _kenlm_available
def is_cv2_available():
return _cv2_available
def is_torch_available():
return _torch_available
@ -629,6 +637,10 @@ def is_auto_gptq_available():
return _auto_gptq_available
def is_levenshtein_available():
return _levenshtein_available
def is_optimum_neuron_available():
return _optimum_available and _is_package_available("optimum.neuron")
@ -759,6 +771,10 @@ def is_natten_available():
return _natten_available
def is_nltk_available():
return _nltk_available
def is_torchaudio_available():
return _torchaudio_available
@ -813,6 +829,16 @@ def is_jinja_available():
return _jinja_available
# docstyle-ignore
CV2_IMPORT_ERROR = """
{0} requires the OpenCV library but it was not found in your environment. You can install it with:
```
pip install opencv-python
```
Please note that you may need to restart your runtime after installation.
"""
# docstyle-ignore
DATASETS_IMPORT_ERROR = """
{0} requires the 🤗 Datasets library but it was not found in your environment. You can install it with:
@ -959,6 +985,11 @@ installation section: https://github.com/rspeer/python-ftfy/tree/master#installi
that match your environment. Please note that you may need to restart your runtime after installation.
"""
LEVENSHTEIN_IMPORT_ERROR = """
{0} requires the python-Levenshtein library but it was not found in your environment. You can install it with pip: `pip
install python-Levenshtein`. Please note that you may need to restart your runtime after installation.
"""
# docstyle-ignore
PYTORCH_QUANTIZATION_IMPORT_ERROR = """
{0} requires the pytorch-quantization library but it was not found in your environment. You can install it with pip:
@ -1028,6 +1059,14 @@ shi-labs.com/natten . You can also install it with pip (may take longer to build
`pip install natten`. Please note that you may need to restart your runtime after installation.
"""
# docstyle-ignore
NLTK_IMPORT_ERROR = """
{0} requires the NLTK library but it was not found in your environment. You can install it by referring to:
https://www.nltk.org/install.html. Please note that you may need to restart your runtime after installation.
"""
# docstyle-ignore
VISION_IMPORT_ERROR = """
{0} requires the PIL library but it was not found in your environment. You can install it with pip:
@ -1109,6 +1148,7 @@ jinja2`. Please note that you may need to restart your runtime after installatio
BACKENDS_MAPPING = OrderedDict(
[
("bs4", (is_bs4_available, BS4_IMPORT_ERROR)),
("cv2", (is_cv2_available, CV2_IMPORT_ERROR)),
("datasets", (is_datasets_available, DATASETS_IMPORT_ERROR)),
("detectron2", (is_detectron2_available, DETECTRON2_IMPORT_ERROR)),
("essentia", (is_essentia_available, ESSENTIA_IMPORT_ERROR)),
@ -1118,6 +1158,7 @@ BACKENDS_MAPPING = OrderedDict(
("pandas", (is_pandas_available, PANDAS_IMPORT_ERROR)),
("phonemizer", (is_phonemizer_available, PHONEMIZER_IMPORT_ERROR)),
("pretty_midi", (is_pretty_midi_available, PRETTY_MIDI_IMPORT_ERROR)),
("levenshtein", (is_levenshtein_available, LEVENSHTEIN_IMPORT_ERROR)),
("librosa", (is_librosa_available, LIBROSA_IMPORT_ERROR)),
("protobuf", (is_protobuf_available, PROTOBUF_IMPORT_ERROR)),
("pyctcdecode", (is_pyctcdecode_available, PYCTCDECODE_IMPORT_ERROR)),
@ -1132,6 +1173,7 @@ BACKENDS_MAPPING = OrderedDict(
("tensorflow_text", (is_tensorflow_text_available, TENSORFLOW_TEXT_IMPORT_ERROR)),
("timm", (is_timm_available, TIMM_IMPORT_ERROR)),
("natten", (is_natten_available, NATTEN_IMPORT_ERROR)),
("nltk", (is_nltk_available, NLTK_IMPORT_ERROR)),
("tokenizers", (is_tokenizers_available, TOKENIZERS_IMPORT_ERROR)),
("torch", (is_torch_available, PYTORCH_IMPORT_ERROR)),
("torchvision", (is_torchvision_available, TORCHVISION_IMPORT_ERROR)),

View File

View File

@ -0,0 +1,194 @@
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
from huggingface_hub import hf_hub_download
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import cached_property, is_torch_available, is_vision_available
from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
if is_torch_available():
import torch
if is_vision_available():
from PIL import Image
from transformers import NougatImageProcessor
class NougatImageProcessingTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
num_channels=3,
image_size=18,
min_resolution=30,
max_resolution=400,
do_crop_margin=True,
do_resize=True,
size=None,
do_thumbnail=True,
do_align_long_axis: bool = False,
do_pad=True,
do_normalize: bool = True,
image_mean=[0.5, 0.5, 0.5],
image_std=[0.5, 0.5, 0.5],
):
size = size if size is not None else {"height": 20, "width": 20}
self.parent = parent
self.batch_size = batch_size
self.num_channels = num_channels
self.image_size = image_size
self.min_resolution = min_resolution
self.max_resolution = max_resolution
self.do_crop_margin = do_crop_margin
self.do_resize = do_resize
self.size = size
self.do_thumbnail = do_thumbnail
self.do_align_long_axis = do_align_long_axis
self.do_pad = do_pad
self.do_normalize = do_normalize
self.image_mean = image_mean
self.image_std = image_std
def prepare_image_processor_dict(self):
return {
"do_crop_margin": self.do_crop_margin,
"do_resize": self.do_resize,
"size": self.size,
"do_thumbnail": self.do_thumbnail,
"do_align_long_axis": self.do_align_long_axis,
"do_pad": self.do_pad,
"do_normalize": self.do_normalize,
"image_mean": self.image_mean,
"image_std": self.image_std,
}
def expected_output_image_shape(self, images):
return self.num_channels, self.size["height"], self.size["width"]
def prepare_dummy_image(self):
filepath = hf_hub_download(
repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_pdf.png", repo_type="dataset"
)
image = Image.open(filepath).convert("RGB")
return image
def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
return prepare_image_inputs(
batch_size=self.batch_size,
num_channels=self.num_channels,
min_resolution=self.min_resolution,
max_resolution=self.max_resolution,
equal_resolution=equal_resolution,
numpify=numpify,
torchify=torchify,
)
@require_torch
@require_vision
class NougatImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
image_processing_class = NougatImageProcessor if is_vision_available() else None
def setUp(self):
self.image_processor_tester = NougatImageProcessingTester(self)
@property
def image_processor_dict(self):
return self.image_processor_tester.prepare_image_processor_dict()
@cached_property
def image_processor(self):
return self.image_processing_class(**self.image_processor_dict)
def test_image_processor_properties(self):
image_processing = self.image_processing_class(**self.image_processor_dict)
self.assertTrue(hasattr(image_processing, "do_resize"))
self.assertTrue(hasattr(image_processing, "size"))
self.assertTrue(hasattr(image_processing, "do_normalize"))
self.assertTrue(hasattr(image_processing, "image_mean"))
self.assertTrue(hasattr(image_processing, "image_std"))
def test_image_processor_from_dict_with_kwargs(self):
image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
self.assertEqual(image_processor.size, {"height": 20, "width": 20})
image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
self.assertEqual(image_processor.size, {"height": 42, "width": 42})
def test_expected_output(self):
dummy_image = self.image_processor_tester.prepare_dummy_image()
image_processor = self.image_processor
inputs = image_processor(dummy_image, return_tensors="pt")
self.assertTrue(torch.allclose(inputs["pixel_values"].mean(), torch.tensor(0.4906), atol=1e-3, rtol=1e-3))
def test_crop_margin_all_white(self):
image = np.uint8(np.ones((100, 100, 3)) * 255)
image_processor = self.image_processor
cropped_image = image_processor.crop_margin(image)
self.assertTrue(np.array_equal(image, cropped_image))
def test_crop_margin_centered_black_square(self):
image = np.ones((100, 100, 3), dtype=np.uint8) * 255
image[45:55, 45:55, :] = 0
image_processor = self.image_processor
cropped_image = image_processor.crop_margin(image)
expected_cropped = image[45:55, 45:55, :]
self.assertTrue(np.array_equal(expected_cropped, cropped_image))
def test_align_long_axis_no_rotation(self):
image = np.uint8(np.ones((100, 200, 3)) * 255)
image_processor = self.image_processor
size = {"height": 200, "width": 300}
aligned_image = image_processor.align_long_axis(image, size)
self.assertEqual(image.shape, aligned_image.shape)
def test_align_long_axis_with_rotation(self):
image = np.uint8(np.ones((200, 100, 3)) * 255)
image_processor = self.image_processor
size = {"height": 300, "width": 200}
aligned_image = image_processor.align_long_axis(image, size)
self.assertEqual((200, 100, 3), aligned_image.shape)
def test_align_long_axis_data_format(self):
image = np.uint8(np.ones((100, 200, 3)) * 255)
data_format = "channels_first"
size = {"height": 200, "width": 300}
image_processor = self.image_processor
aligned_image = image_processor.align_long_axis(image, size, data_format=data_format)
self.assertEqual((3, 100, 200), aligned_image.shape)
def prepare_dummy_np_image(self):
filepath = hf_hub_download(
repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_pdf.png", repo_type="dataset"
)
image = Image.open(filepath).convert("RGB")
return np.array(image)
def test_crop_margin_equality_cv2_python(self):
image = self.prepare_dummy_np_image()
image_processor = self.image_processor
image_cropped_python = image_processor.crop_margin(image)
self.assertEqual(image_cropped_python.shape, (850, 685, 3))
self.assertEqual(image_cropped_python.mean(), 237.43881150708458)

View File

@ -0,0 +1,196 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import NougatTokenizerFast
from transformers.models.nougat.tokenization_nougat_fast import markdown_compatible, normalize_list_like_lines
from transformers.testing_utils import require_levenshtein, require_nltk, require_tokenizers
from ...test_tokenization_common import TokenizerTesterMixin
@require_tokenizers
class NougatTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
slow_tokenizer_class = None
rust_tokenizer_class = NougatTokenizerFast
tokenizer_class = NougatTokenizerFast
test_rust_tokenizer = True
test_slow_tokenizer = False
from_pretrained_vocab_key = "tokenizer_file"
special_tokens_map = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
def setUp(self):
super().setUp()
tokenizer = NougatTokenizerFast.from_pretrained("facebook/nougat-base")
tokenizer.save_pretrained(self.tmpdirname)
def get_rust_tokenizer(self, **kwargs):
kwargs.update(self.special_tokens_map)
return NougatTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
def test_padding(self, max_length=6):
for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs)
# Simple input
sentence1 = "This is a simple input"
sentence2 = ["This is a simple input 1", "This is a simple input 2"]
pair1 = ("This is a simple input", "This is a pair")
pair2 = [
("This is a simple input 1", "This is a simple input 2"),
("This is a simple pair 1", "This is a simple pair 2"),
]
# Simple input tests
try:
tokenizer_r.encode(sentence1, max_length=max_length)
tokenizer_r.encode_plus(sentence1, max_length=max_length)
tokenizer_r.batch_encode_plus(sentence2, max_length=max_length)
tokenizer_r.encode(pair1, max_length=max_length)
tokenizer_r.batch_encode_plus(pair2, max_length=max_length)
except ValueError:
self.fail("Nougat Tokenizer should be able to deal with padding")
tokenizer_r.pad_token = None # Hotfixing padding = None
self.assertRaises(
ValueError, tokenizer_r.encode, sentence1, max_length=max_length, padding="max_length"
)
# Simple input
self.assertRaises(
ValueError, tokenizer_r.encode_plus, sentence1, max_length=max_length, padding="max_length"
)
# Simple input
self.assertRaises(
ValueError,
tokenizer_r.batch_encode_plus,
sentence2,
max_length=max_length,
padding="max_length",
)
# Pair input
self.assertRaises(ValueError, tokenizer_r.encode, pair1, max_length=max_length, padding="max_length")
# Pair input
self.assertRaises(
ValueError, tokenizer_r.encode_plus, pair1, max_length=max_length, padding="max_length"
)
# Pair input
self.assertRaises(
ValueError,
tokenizer_r.batch_encode_plus,
pair2,
max_length=max_length,
padding="max_length",
)
@unittest.skip("NougatTokenizerFast does not have tokenizer_file in its signature")
def test_rust_tokenizer_signature(self):
pass
@unittest.skip("NougatTokenizerFast does not support pretokenized inputs")
def test_pretokenized_inputs(self):
pass
@unittest.skip("NougatTokenizerFast directly inherits from PreTrainedTokenizerFast")
def test_prepare_for_model(self):
pass
@unittest.skip("This needs a slow tokenizer. Nougat does not have one!")
def test_encode_decode_with_spaces(self):
pass
class MarkdownCompatibleTest(unittest.TestCase):
def test_equation_tag(self):
input_text = "(3.2) \\[Equation Text\\]"
excepted_output = "\\[Equation Text \\tag{3.2}\\]"
self.assertEqual(markdown_compatible(input_text), excepted_output)
def test_equation_tag_letters(self):
input_text = "(18a) \\[Equation Text\\]"
excepted_output = "\\[Equation Text \\tag{18a}\\]"
self.assertEqual(markdown_compatible(input_text), excepted_output)
def test_bold_formatting(self):
input_text = r"This is \bm{bold} text."
expected_output = r"This is \mathbf{bold} text."
self.assertEqual(markdown_compatible(input_text), expected_output)
def test_url_conversion(self):
input_text = "Visit my website at https://www.example.com"
expected_output = "Visit my website at [https://www.example.com](https://www.example.com)"
self.assertEqual(markdown_compatible(input_text), expected_output)
def test_algorithm_code_block(self):
input_text = "```python\nprint('Hello, world!')\n```"
expected_output = "```\npython\nprint('Hello, world!')\n```"
self.assertEqual(markdown_compatible(input_text), expected_output)
def test_escape_characters(self):
input_text = r"Escaped characters like \n should not be \\[affected\\]"
expected_output = r"Escaped characters like \n should not be \\[affected\\]"
self.assertEqual(markdown_compatible(input_text), expected_output)
def test_nested_tags(self):
input_text = r"This is a super nested \bm{\bm{\bm{\bm{\bm{bold}}}}} tag."
expected_output = r"This is a super nested \mathbf{\mathbf{\mathbf{\mathbf{\mathbf{bold}}}}} tag."
self.assertEqual(markdown_compatible(input_text), expected_output)
class TestNormalizeListLikeLines(unittest.TestCase):
def test_two_level_lines(self):
input_str = "* Item 1 * Item 2"
expected_output = "* Item 1\n* Item 2\n"
self.assertEqual(normalize_list_like_lines(input_str), expected_output)
def test_three_level_lines(self):
input_str = "- I. Item 1 - II. Item 2 - III. Item 3"
expected_output = "- I. Item 1\n- II. Item 2\n- III. Item 3\n"
self.assertEqual(normalize_list_like_lines(input_str), expected_output)
def test_nested_lines(self):
input_str = "- I. Item 1 - I.1 Sub-item 1 - I.1.1 Sub-sub-item 1 - II. Item 2"
expected_output = "- I. Item 1\n\t- I.1 Sub-item 1\n\t\t- I.1.1 Sub-sub-item 1\n- II. Item 2\n"
self.assertEqual(normalize_list_like_lines(input_str), expected_output)
@require_tokenizers
class NougatPostProcessingTest(unittest.TestCase):
def setUp(self):
super().setUp()
self.tokenizer = NougatTokenizerFast.from_pretrained("facebook/nougat-base")
def test_correct_tables_basic(self):
input_str = "\\begin{table} \\begin{tabular}{l l} & \\ \\end{tabular} \\end{table}"
expected_output = "\\begin{table}\n\\begin{tabular}{l l} & \\ \\end{tabular}\n\\end{table}"
self.assertEqual(self.tokenizer.correct_tables(input_str), expected_output)
def test_correct_tables_high_count(self):
input_str = "\\begin{tabular}" * 20
expected_output = ""
self.assertEqual(self.tokenizer.correct_tables(input_str), expected_output)
@require_levenshtein
@require_nltk
def test_postprocess_as_nougat_no_markdown(self):
input_str = "# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books" # noqa: E231
expected_output = "\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books" # noqa: E231
self.assertEqual(self.tokenizer.post_process_single(input_str, fix_markdown=False), expected_output)

View File

@ -18,10 +18,13 @@ import tempfile
import unittest
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from packaging import version
from transformers import DonutProcessor, TrOCRProcessor
from transformers import DonutProcessor, NougatProcessor, TrOCRProcessor
from transformers.testing_utils import (
require_levenshtein,
require_nltk,
require_sentencepiece,
require_torch,
require_vision,
@ -998,3 +1001,79 @@ class DonutModelIntegrationTest(unittest.TestCase):
outputs.scores[0][0, :3], torch.tensor([-17.6490, -4.8381, -15.7577], device=torch_device), atol=1e-4
)
)
@require_levenshtein
@require_nltk
@require_torch
@require_vision
@slow
class NougatModelIntegrationTest(unittest.TestCase):
@cached_property
def default_processor(self):
return NougatProcessor.from_pretrained("facebook/nougat-base") if is_vision_available() else None
@cached_property
def default_model(self):
return VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base").to(torch_device)
@cached_property
def default_image(self):
filepath = hf_hub_download(
repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_pdf.png", repo_type="dataset"
)
image = Image.open(filepath).convert("RGB")
return image
def test_forward_pass(self):
processor = self.default_processor
model = self.default_model
image = self.default_image
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch_device)
decoder_input_ids = torch.tensor([[0]]).to(torch_device)
outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
logits = outputs.logits
# verify the logits
expected_shape = torch.Size((1, 1, model.decoder.config.vocab_size))
self.assertEqual(outputs.logits.shape, expected_shape)
expected_slice = torch.tensor(
[1.6253, -4.2179, 5.8532, -2.7911, -5.0609, -4.7397, -4.2890, -5.1073, -4.8908, -4.9729]
).to(torch_device)
self.assertTrue(torch.allclose(logits[0, 0, :10], expected_slice, atol=1e-4))
def test_generation(self):
processor = self.default_processor
model = self.default_model
image = self.default_image
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch_device)
outputs = model.generate(
pixel_values,
min_length=1,
max_length=3584,
bad_words_ids=[[processor.tokenizer.unk_token_id]],
return_dict_in_generate=True,
output_scores=True,
)
# verify generated sequence
generated = processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
expected_raw_generation = "# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books"
self.assertTrue(generated == expected_raw_generation)
# verify postprocessed sequence
generated = processor.post_process_generation(generated, fix_markdown=False)
expected_generation = "\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucurull\n\nThomas Scialom\n\nRobert Stojnic\n\nMeta AI\n\nThe paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\n###### Abstract\n\nScientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (**N**eural **O**ptical **U**nderstanding for **A**cademic Documents), a Visual Transformer model that performs an _Optical Character Recognition_ (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.\n\n## 1 Introduction\n\nThe majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.\n\nExisting Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.\n\nConverting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M2 papers using GROBID [4], but are missing meaningful representations of the mathematical equations.\n\nFootnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc\n\nTo this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text.\n\nThe primary contributions in this paper are\n\n* Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub3 Footnote 3: https://github.com/facebookresearch/nougat\n* We introduce a pipeline to create dataset for pairing PDFs to source code\n* Our method is only dependent on the image of a page, allowing access to scanned papers and books"
self.assertTrue(generated == expected_generation)
# verify scores
self.assertEqual(len(outputs.scores), 741)
self.assertTrue(
torch.allclose(
outputs.scores[0][0, :3], torch.tensor([1.6253, -4.2179, 5.8532], device=torch_device), atol=1e-4
)
)

View File

@ -692,6 +692,7 @@ src/transformers/models/nezha/modeling_nezha.py
src/transformers/models/nllb_moe/configuration_nllb_moe.py
src/transformers/models/nllb_moe/convert_nllb_moe_sharded_original_checkpoint_to_pytorch.py
src/transformers/models/nllb_moe/modeling_nllb_moe.py
src/transformers/models/nougat/convert_nougat_to_hf.py
src/transformers/models/nystromformer/configuration_nystromformer.py
src/transformers/models/nystromformer/convert_nystromformer_original_pytorch_checkpoint_to_pytorch.py
src/transformers/models/nystromformer/modeling_nystromformer.py