Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-04 13:20:12 +06:00)
add new model of MGP-STR (#21418)
* add new model of MGP-STR
* fix the check failings
* remove torch and numpy from mgp_tokenization
* remove unused import from modeling_mgp_str
* add test_processing_mgp_str
* rm test_processing_mgp_str.py
* add test_processing_mgp_str
* add test_processing_mgp_str
* add test_processing_mgp_str
* rm test_processing_mgp_str and add softmax outs to model
* rm test_processing_mgp_str and add softmax outs to model
* rewrite the code of mgp-str according to PR suggestions
* rewrite the code of mgp-str according to PR suggestions
* add new model of MGP-STR
* fix the check failings
* remove torch and numpy from mgp_tokenization
* remove unused import from modeling_mgp_str
* add test_processing_mgp_str
* rm test_processing_mgp_str.py
* add test_processing_mgp_str
* add test_processing_mgp_str
* add test_processing_mgp_str
* rm test_processing_mgp_str and add softmax outs to model
* rewrite the code of mgp-str according to PR suggestions
* rewrite the code of mgp-str according to PR suggestions
* remove representation_size from MGPSTRConfig
* reformat configuration_mgp_str.py
* format test_processor_mgp_str.py
* add test for tokenizer and complete model/processer test and model file
* rm Unnecessary tupple in modeling_mgp_str
* reduce hidden_size/layers/label_size in test_model
* add integration tests and change MGPSTR to Mgpstr
* add test for logit values
* reformat test model file

---------

Co-authored-by: yue kun <yuekun.wp@alibaba-inc.com>
This commit is contained in:
parent 32e3466d38
commit 102b5ff4a8
@@ -377,6 +377,7 @@ Current number of checkpoints:
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

@@ -498,3 +499,4 @@ We now have a [paper](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) you
pages = "38--45"
}
```

@@ -365,6 +365,7 @@ Número actual de puntos de control:
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

@@ -337,6 +337,7 @@ conda install -c huggingface transformers
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (फेसबुक से) साथ में पेपर [एक्स्टेंसिबल बहुभाषी प्रीट्रेनिंग और फाइनट्यूनिंग के साथ बहुभाषी अनुवाद](https://arxiv युकिंग टैंग, चाउ ट्रान, जियान ली, पेंग-जेन चेन, नमन गोयल, विश्रव चौधरी, जियाताओ गु, एंजेला फैन द्वारा .org/abs/2008.00401)।
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA से) कागज के साथ [Megatron-LM: मॉडल का उपयोग करके बहु-अरब पैरामीटर भाषा मॉडल का प्रशिक्षण Parallelism](https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा।
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA से) साथ वाला पेपर [Megatron-LM: ट्रेनिंग मल्टी-बिलियन पैरामीटर लैंग्वेज मॉडल्स यूजिंग मॉडल पैरेललिज़्म] (https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा पोस्ट किया गया।
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research से) Peng Wang, Cheng Da, and Cong Yao. द्वाराअनुसंधान पत्र [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) के साथ जारी किया गया
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (फ्रॉम Studio Ousia) साथ में पेपर [mLUKE: द पावर ऑफ एंटिटी रिप्रेजेंटेशन इन मल्टीलिंगुअल प्रीट्रेन्ड लैंग्वेज मॉडल्स](https://arxiv.org/abs/2110.08151) रयोकन री, इकुया यामाडा, और योशिमासा त्सुरोका द्वारा।
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [मोबाइलबर्ट: संसाधन-सीमित उपकरणों के लिए एक कॉम्पैक्ट टास्क-अज्ञेय बीईआरटी] (https://arxiv.org/abs/2004.02984) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, और Denny Zhou द्वारा पोस्ट किया गया।
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

@@ -399,6 +399,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook から) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan から公開された研究論文: [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401)
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA から) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro から公開された研究論文: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA から) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro から公開された研究論文: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research から) Peng Wang, Cheng Da, and Cong Yao. から公開された研究論文 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592)
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (Studio Ousia から) Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka から公開された研究論文: [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151)
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (CMU/Google Brain から) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou から公開された研究論文: [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984)
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (Google Inc. から) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam から公開された研究論文: [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)

@@ -314,6 +314,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook 에서) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 의 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 논문과 함께 발표했습니다.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA 에서) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 의 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 논문과 함께 발표했습니다.
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA 에서) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 의 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 논문과 함께 발표했습니다.
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research 에서 제공)은 Peng Wang, Cheng Da, and Cong Yao.의 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592)논문과 함께 발표했습니다.
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (Studio Ousia 에서) Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka 의 [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) 논문과 함께 발표했습니다.
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (CMU/Google Brain 에서) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou 의 [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 논문과 함께 발표했습니다.
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (Google Inc. 에서) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam 의 [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 논문과 함께 발표했습니다.

@@ -338,6 +338,7 @@ conda install -c huggingface transformers
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 由 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 发布。
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (来自 Alibaba Research) 伴随论文 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) 由 Peng Wang, Cheng Da, and Cong Yao 发布。
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (来自 Studio Ousia) 伴随论文 [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) 由 Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka 发布。
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (来自 CMU/Google Brain) 伴随论文 [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 由 Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou 发布。
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (来自 Google Inc.) 伴随论文 [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 由 Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam 发布。

@@ -350,6 +350,7 @@ conda install -c huggingface transformers
1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

@@ -580,6 +580,8 @@
      title: LiLT
    - local: model_doc/lxmert
      title: LXMERT
    - local: model_doc/mgp-str
      title: MGP-STR
    - local: model_doc/oneformer
      title: OneFormer
    - local: model_doc/owlvit

@@ -151,6 +151,7 @@ The documentation is organized into five sections:
1. **[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1. **[MGP-STR](model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
1. **[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1. **[MobileBERT](model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1. **[MobileNetV1](model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.

@@ -340,6 +341,7 @@ Flax), PyTorch, and/or TensorFlow.
| MaskFormerSwin | ❌ | ❌ | ❌ | ❌ | ❌ |
| mBART | ✅ | ✅ | ✅ | ✅ | ✅ |
| Megatron-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
| MGP-STR | ✅ | ❌ | ✅ | ❌ | ❌ |
| MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
| MobileNetV1 | ❌ | ❌ | ✅ | ❌ | ❌ |
| MobileNetV2 | ❌ | ❌ | ✅ | ❌ | ❌ |

docs/source/en/model_doc/mgp-str.mdx (new file, 86 lines)
@@ -0,0 +1,86 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# MGP-STR

## Overview

The MGP-STR model was proposed in [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. MGP-STR is a conceptually **simple** yet **powerful** vision Scene Text Recognition (STR) model, which is built upon the [Vision Transformer (ViT)](vit). To integrate linguistic knowledge, a Multi-Granularity Prediction (MGP) strategy is proposed to inject information from the language modality into the model in an implicit way.

The abstract from the paper is the following:

*Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely-used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelop of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/mgp_str_architecture.png"
alt="drawing" width="600"/>

<small> MGP-STR architecture. Taken from the <a href="https://arxiv.org/abs/2209.03592">original paper</a>. </small>

Tips:

- MGP-STR is trained on two synthetic datasets, [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST), without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
- This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR).
## Inference

[`MgpstrModel`] accepts images as input and generates three types of predictions, which represent textual information at different granularities.
The three types of predictions are fused to give the final prediction result.

The [`ViTImageProcessor`] class is responsible for preprocessing the input image and
[`MgpstrTokenizer`] decodes the generated character tokens to the target string. The
[`MgpstrProcessor`] wraps [`ViTImageProcessor`] and [`MgpstrTokenizer`]
into a single instance to both extract the input features and decode the predicted token ids.

- Step-by-step Optical Character Recognition (OCR)

``` py
>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
>>> import requests
>>> from PIL import Image

>>> processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
>>> model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')

>>> # load image from the IIIT-5k dataset
>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> outputs = model(pixel_values)

>>> generated_text = processor.batch_decode(outputs.logits)['generated_text']
```
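
The `logits` in the example above are a tuple with one tensor per granularity (character, BPE and WordPiece). A minimal sketch of inspecting them individually is shown below; the exact tensor shapes follow the configuration defaults added in this PR (`max_token_length=27`, `num_character_labels=38`), and the `scores` entry of the decoded dictionary is an assumption rather than guaranteed API:

``` py
>>> # one logits tensor per granularity: character, BPE and WordPiece heads
>>> char_logits, bpe_logits, wp_logits = outputs.logits
>>> char_logits.shape  # expected (batch_size, max_token_length, num_character_labels), e.g. (1, 27, 38)

>>> decoded = processor.batch_decode(outputs.logits)
>>> decoded['generated_text']  # final fused string prediction
>>> decoded['scores']  # assumed: per-sample confidence of the fused prediction
```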

## MgpstrConfig

[[autodoc]] MgpstrConfig

## MgpstrTokenizer

[[autodoc]] MgpstrTokenizer
    - save_vocabulary

## MgpstrProcessor

[[autodoc]] MgpstrProcessor
    - __call__
    - batch_decode

## MgpstrModel

[[autodoc]] MgpstrModel
    - forward

## MgpstrForSceneTextRecognition

[[autodoc]] MgpstrForSceneTextRecognition
    - forward

@@ -370,6 +370,7 @@ _import_structure = {
    "models.mctct": ["MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MCTCTConfig", "MCTCTProcessor"],
    "models.megatron_bert": ["MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MegatronBertConfig"],
    "models.megatron_gpt2": [],
    "models.mgp_str": ["MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP", "MgpstrConfig", "MgpstrProcessor", "MgpstrTokenizer"],
    "models.mluke": [],
    "models.mmbt": ["MMBTConfig"],
    "models.mobilebert": ["MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MobileBertConfig", "MobileBertTokenizer"],

@@ -1904,6 +1905,14 @@ else:
            "MegatronBertPreTrainedModel",
        ]
    )
    _import_structure["models.mgp_str"].extend(
        [
            "MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST",
            "MgpstrForSceneTextRecognition",
            "MgpstrModel",
            "MgpstrPreTrainedModel",
        ]
    )
    _import_structure["models.mmbt"].extend(["MMBTForClassification", "MMBTModel", "ModalEmbeddings"])
    _import_structure["models.mobilebert"].extend(
        [

@@ -3962,6 +3971,7 @@ if TYPE_CHECKING:
    from .models.mbart import MBartConfig
    from .models.mctct import MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP, MCTCTConfig, MCTCTProcessor
    from .models.megatron_bert import MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MegatronBertConfig
    from .models.mgp_str import MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP, MgpstrConfig, MgpstrProcessor, MgpstrTokenizer
    from .models.mmbt import MMBTConfig
    from .models.mobilebert import MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileBertConfig, MobileBertTokenizer
    from .models.mobilenet_v1 import MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileNetV1Config

@@ -5249,6 +5259,12 @@ if TYPE_CHECKING:
            MegatronBertModel,
            MegatronBertPreTrainedModel,
        )
        from .models.mgp_str import (
            MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST,
            MgpstrForSceneTextRecognition,
            MgpstrModel,
            MgpstrPreTrainedModel,
        )
        from .models.mmbt import MMBTForClassification, MMBTModel, ModalEmbeddings
        from .models.mobilebert import (
            MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
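
With the `_import_structure` and `TYPE_CHECKING` entries above, the new symbols become importable from the top-level `transformers` package. A minimal sketch of what this enables (assuming torch is installed; the class names are taken from this PR):

```python
from transformers import MgpstrConfig, MgpstrForSceneTextRecognition, MgpstrProcessor, MgpstrTokenizer

config = MgpstrConfig()                        # mgp-str-base style defaults
model = MgpstrForSceneTextRecognition(config)  # randomly initialized weights
```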

@@ -116,6 +116,7 @@ from . import (
    mctct,
    megatron_bert,
    megatron_gpt2,
    mgp_str,
    mluke,
    mmbt,
    mobilebert,

@@ -121,6 +121,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("mbart", "MBartConfig"),
        ("mctct", "MCTCTConfig"),
        ("megatron-bert", "MegatronBertConfig"),
        ("mgp-str", "MgpstrConfig"),
        ("mobilebert", "MobileBertConfig"),
        ("mobilenet_v1", "MobileNetV1Config"),
        ("mobilenet_v2", "MobileNetV2Config"),

@@ -294,6 +295,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("mbart", "MBART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mctct", "MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("megatron-bert", "MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mgp-str", "MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mobilenet_v1", "MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mobilenet_v2", "MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mobilevit", "MOBILEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),

@@ -476,6 +478,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("mctct", "M-CTC-T"),
        ("megatron-bert", "Megatron-BERT"),
        ("megatron_gpt2", "Megatron-GPT2"),
        ("mgp-str", "MGP-STR"),
        ("mluke", "mLUKE"),
        ("mobilebert", "MobileBERT"),
        ("mobilenet_v1", "MobileNetV1"),

@@ -69,6 +69,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("levit", "LevitImageProcessor"),
        ("mask2former", "Mask2FormerImageProcessor"),
        ("maskformer", "MaskFormerImageProcessor"),
        ("mgp-str", "ViTImageProcessor"),
        ("mobilenet_v1", "MobileNetV1ImageProcessor"),
        ("mobilenet_v2", "MobileNetV2ImageProcessor"),
        ("mobilenet_v2", "MobileNetV2ImageProcessor"),

@@ -119,6 +119,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("mbart", "MBartModel"),
        ("mctct", "MCTCTModel"),
        ("megatron-bert", "MegatronBertModel"),
        ("mgp-str", "MgpstrForSceneTextRecognition"),
        ("mobilebert", "MobileBertModel"),
        ("mobilenet_v1", "MobileNetV1Model"),
        ("mobilenet_v2", "MobileNetV2Model"),

@@ -57,6 +57,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("layoutlmv2", "LayoutLMv2Processor"),
        ("layoutlmv3", "LayoutLMv3Processor"),
        ("markuplm", "MarkupLMProcessor"),
        ("mgp-str", "MgpstrProcessor"),
        ("oneformer", "OneFormerProcessor"),
        ("owlvit", "OwlViTProcessor"),
        ("sew", "Wav2Vec2Processor"),

@@ -194,6 +194,7 @@ else:
            ),
        ),
        ("megatron-bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("mgp-str", ("MgpstrTokenizer", None)),
        ("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
        ("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
        ("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
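
The `mgp-str` entries in the auto mappings above mean the generic Auto classes should resolve to the new MGP-STR classes. A rough sketch, assuming the `alibaba-damo/mgp-str-base` checkpoint referenced elsewhere in this PR is available on the Hub:

```python
from transformers import AutoConfig, AutoModel, AutoProcessor, AutoTokenizer

config = AutoConfig.from_pretrained("alibaba-damo/mgp-str-base")        # MgpstrConfig
processor = AutoProcessor.from_pretrained("alibaba-damo/mgp-str-base")  # MgpstrProcessor (ViTImageProcessor + MgpstrTokenizer)
tokenizer = AutoTokenizer.from_pretrained("alibaba-damo/mgp-str-base")  # MgpstrTokenizer
model = AutoModel.from_pretrained("alibaba-damo/mgp-str-base")          # MgpstrForSceneTextRecognition, per MODEL_MAPPING_NAMES
```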

src/transformers/models/mgp_str/__init__.py (new file, 62 lines)
@@ -0,0 +1,62 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available


_import_structure = {
    "configuration_mgp_str": ["MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP", "MgpstrConfig"],
    "processing_mgp_str": ["MgpstrProcessor"],
    "tokenization_mgp_str": ["MgpstrTokenizer"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_mgp_str"] = [
        "MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST",
        "MgpstrModel",
        "MgpstrPreTrainedModel",
        "MgpstrForSceneTextRecognition",
    ]


if TYPE_CHECKING:
    from .configuration_mgp_str import MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP, MgpstrConfig
    from .processing_mgp_str import MgpstrProcessor
    from .tokenization_mgp_str import MgpstrTokenizer

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_mgp_str import (
            MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST,
            MgpstrForSceneTextRecognition,
            MgpstrModel,
            MgpstrPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
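
A short usage note on the lazy structure above: the lightweight configuration, processing and tokenization modules are always exposed, while the torch-backed modeling module is only registered when torch is available. A sketch of the resulting subpackage imports (assuming torch is installed):

```python
from transformers.models.mgp_str import (
    MgpstrConfig,
    MgpstrProcessor,
    MgpstrTokenizer,
    MgpstrForSceneTextRecognition,  # only resolvable when torch is available
)

print(MgpstrConfig().max_token_length)  # 27 by default, per configuration_mgp_str.py below
```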

src/transformers/models/mgp_str/configuration_mgp_str.py (new file, 137 lines)
@@ -0,0 +1,137 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" MGP-STR model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging


logger = logging.get_logger(__name__)

MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "alibaba-damo/mgp-str-base": "https://huggingface.co/alibaba-damo/mgp-str-base/resolve/main/config.json",
}


class MgpstrConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of an [`MgpstrModel`]. It is used to instantiate an
    MGP-STR model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the MGP-STR
    [alibaba-damo/mgp-str-base](https://huggingface.co/alibaba-damo/mgp-str-base) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        image_size (`List[int]`, *optional*, defaults to `[32, 128]`):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 4):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        max_token_length (`int`, *optional*, defaults to 27):
            The maximum number of output tokens.
        num_character_labels (`int`, *optional*, defaults to 38):
            The number of classes for the character head.
        num_bpe_labels (`int`, *optional*, defaults to 50257):
            The number of classes for the BPE head.
        num_wordpiece_labels (`int`, *optional*, defaults to 30522):
            The number of classes for the WordPiece head.
        hidden_size (`int`, *optional*, defaults to 768):
            The embedding dimension.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        mlp_ratio (`float`, *optional*, defaults to 4.0):
            The ratio of the MLP hidden dimension to the embedding dimension.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.
        distilled (`bool`, *optional*, defaults to `False`):
            Whether the model includes a distillation token and head as in DeiT models.
        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        drop_rate (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings and encoder.
        attn_drop_rate (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        drop_path_rate (`float`, *optional*, defaults to 0.0):
            The stochastic depth rate.
        output_a3_attentions (`bool`, *optional*, defaults to `False`):
            Whether or not the model should return A^3 module attentions.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

    Example:

    ```python
    >>> from transformers import MgpstrConfig, MgpstrForSceneTextRecognition

    >>> # Initializing a Mgpstr mgp-str-base style configuration
    >>> configuration = MgpstrConfig()

    >>> # Initializing a model (with random weights) from the mgp-str-base style configuration
    >>> model = MgpstrForSceneTextRecognition(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "mgp-str"

    def __init__(
        self,
        image_size=[32, 128],
        patch_size=4,
        num_channels=3,
        max_token_length=27,
        num_character_labels=38,
        num_bpe_labels=50257,
        num_wordpiece_labels=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        mlp_ratio=4.0,
        qkv_bias=True,
        distilled=False,
        layer_norm_eps=1e-5,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.0,
        output_a3_attentions=False,
        initializer_range=0.02,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.image_size = image_size
        self.patch_size = patch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.max_token_length = max_token_length
|
||||||
|
self.num_character_labels = num_character_labels
|
||||||
|
self.num_bpe_labels = num_bpe_labels
|
||||||
|
self.num_wordpiece_labels = num_wordpiece_labels
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.mlp_ratio = mlp_ratio
|
||||||
|
self.distilled = distilled
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.drop_rate = drop_rate
|
||||||
|
self.qkv_bias = qkv_bias
|
||||||
|
self.attn_drop_rate = attn_drop_rate
|
||||||
|
self.drop_path_rate = drop_path_rate
|
||||||
|
self.output_a3_attentions = output_a3_attentions
|
||||||
|
self.initializer_range = initializer_range
|
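A short sketch of the usual `PretrainedConfig` round trip with the new class (not part of the diff; `save_pretrained`/`from_pretrained` are inherited behaviour, only the mgp-str fields are new):

```python
import tempfile

from transformers import MgpstrConfig

# Sketch: persist and reload a customized configuration.
config = MgpstrConfig(max_token_length=27, num_character_labels=38, drop_path_rate=0.1)

with tempfile.TemporaryDirectory() as tmp_dir:
    config.save_pretrained(tmp_dir)  # writes config.json with model_type "mgp-str"
    reloaded = MgpstrConfig.from_pretrained(tmp_dir)

assert reloaded.drop_path_rate == 0.1
```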
512 src/transformers/models/mgp_str/modeling_mgp_str.py Normal file
@@ -0,0 +1,512 @@
# coding=utf-8
# Copyright 2023 Alibaba Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch MGP-STR model."""

import collections.abc
from dataclasses import dataclass
from typing import Optional, Tuple, Union

import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn

from ...modeling_outputs import BaseModelOutput
from ...modeling_utils import PreTrainedModel
from ...utils import (
    ModelOutput,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    logging,
    replace_return_docstrings,
)
from .configuration_mgp_str import MgpstrConfig


logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "MgpstrConfig"
_TOKENIZER_FOR_DOC = "MgpstrTokenizer"

# Base docstring
_CHECKPOINT_FOR_DOC = "alibaba-damo/mgp-str-base"

MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST = [
    "alibaba-damo/mgp-str-base",
    # See all MGP-STR models at https://huggingface.co/models?filter=mgp-str
]


# Copied from transformers.models.beit.modeling_beit.drop_path
def drop_path(input, drop_prob: float = 0.0, training: bool = False):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

    Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
    however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
    layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
    argument.
    """
    if drop_prob == 0.0 or not training:
        return input
    keep_prob = 1 - drop_prob
    shape = (input.shape[0],) + (1,) * (input.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
    random_tensor.floor_()  # binarize
    output = input.div(keep_prob) * random_tensor
    return output


# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->Mgpstr
class MgpstrDropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""

    def __init__(self, drop_prob: Optional[float] = None) -> None:
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return drop_path(hidden_states, self.drop_prob, self.training)

    def extra_repr(self) -> str:
        return "p={}".format(self.drop_prob)


@dataclass
class MgpstrModelOutput(ModelOutput):
    """
    Base class for the MGP-STR model's outputs, containing the classification logits of the character, bpe and
    wordpiece heads along with the optional hidden states and attentions.

    Args:
        logits (`tuple(torch.FloatTensor)` of shape `(batch_size, config.num_character_labels)`):
            Tuple of `torch.FloatTensor` (one for the output of character of shape `(batch_size,
            config.max_token_length, config.num_character_labels)`, + one for the output of bpe of shape `(batch_size,
            config.max_token_length, config.num_bpe_labels)`, + one for the output of wordpiece of shape `(batch_size,
            config.max_token_length, config.num_wordpiece_labels)`).

            Classification scores (before SoftMax) of character, bpe and wordpiece.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, config.max_token_length,
            sequence_length, sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        a3_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_a3_attentions=True` is passed or when `config.output_a3_attentions=True`):
            Tuple of `torch.FloatTensor` (one for the attention of character, + one for the attention of bpe, + one
            for the attention of wordpiece) of shape `(batch_size, config.max_token_length, sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    logits: Tuple[torch.FloatTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    a3_attentions: Optional[Tuple[torch.FloatTensor]] = None


class MgpstrEmbeddings(nn.Module):
    """2D Image to Patch Embedding"""

    def __init__(self, config: MgpstrConfig):
        super().__init__()
        image_size = (
            config.image_size
            if isinstance(config.image_size, collections.abc.Iterable)
            else (config.image_size, config.image_size)
        )
        patch_size = (
            config.patch_size
            if isinstance(config.patch_size, collections.abc.Iterable)
            else (config.patch_size, config.patch_size)
        )
        self.image_size = image_size
        self.patch_size = patch_size
        self.grid_size = (image_size[0] // patch_size[0], image_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.num_tokens = 2 if config.distilled else 1

        self.proj = nn.Conv2d(config.num_channels, config.hidden_size, kernel_size=patch_size, stride=patch_size)

        self.cls_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))

        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + self.num_tokens, config.hidden_size))
        self.pos_drop = nn.Dropout(p=config.drop_rate)

    def forward(self, pixel_values):
        batch_size, channel, height, width = pixel_values.shape
        if height != self.image_size[0] or width != self.image_size[1]:
            raise ValueError(
                f"Input image size ({height}*{width}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})."
            )

        patch_embeddings = self.proj(pixel_values)
        patch_embeddings = patch_embeddings.flatten(2).transpose(1, 2)  # BCHW -> BNC

        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        embedding_output = torch.cat((cls_tokens, patch_embeddings), dim=1)
        embedding_output = embedding_output + self.pos_embed
        embedding_output = self.pos_drop(embedding_output)

        return embedding_output


class MgpstrMlp(nn.Module):
    """MLP as used in Vision Transformer, MLP-Mixer and related networks"""

    def __init__(self, config: MgpstrConfig, hidden_features):
        super().__init__()
        hidden_features = hidden_features or config.hidden_size
        self.fc1 = nn.Linear(config.hidden_size, hidden_features)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_features, config.hidden_size)
        self.drop = nn.Dropout(config.drop_rate)

    def forward(self, hidden_states):
        hidden_states = self.fc1(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.drop(hidden_states)
        hidden_states = self.fc2(hidden_states)
        hidden_states = self.drop(hidden_states)
        return hidden_states


class MgpstrAttention(nn.Module):
    def __init__(self, config: MgpstrConfig):
        super().__init__()
        self.num_heads = config.num_attention_heads
        head_dim = config.hidden_size // config.num_attention_heads
        self.scale = head_dim**-0.5

        self.qkv = nn.Linear(config.hidden_size, config.hidden_size * 3, bias=config.qkv_bias)
        self.attn_drop = nn.Dropout(config.attn_drop_rate)
        self.proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.proj_drop = nn.Dropout(config.drop_rate)

    def forward(self, hidden_states):
        batch_size, num, channel = hidden_states.shape
        qkv = (
            self.qkv(hidden_states)
            .reshape(batch_size, num, 3, self.num_heads, channel // self.num_heads)
            .permute(2, 0, 3, 1, 4)
        )
        query, key, value = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        attention_probs = (query @ key.transpose(-2, -1)) * self.scale
        attention_probs = attention_probs.softmax(dim=-1)
        attention_probs = self.attn_drop(attention_probs)

        context_layer = (attention_probs @ value).transpose(1, 2).reshape(batch_size, num, channel)
        context_layer = self.proj(context_layer)
        context_layer = self.proj_drop(context_layer)
        return (context_layer, attention_probs)


class MgpstrLayer(nn.Module):
    def __init__(self, config: MgpstrConfig, drop_path=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.attn = MgpstrAttention(config)
        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        self.drop_path = MgpstrDropPath(drop_path) if drop_path is not None else nn.Identity()
        self.norm2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        mlp_hidden_dim = int(config.hidden_size * config.mlp_ratio)
        self.mlp = MgpstrMlp(config, mlp_hidden_dim)

    def forward(self, hidden_states):
        self_attention_outputs = self.attn(self.norm1(hidden_states))
        attention_output = self_attention_outputs[0]
        outputs = self_attention_outputs[1]

        # first residual connection
        hidden_states = self.drop_path(attention_output) + hidden_states

        # second residual connection is done here
        layer_output = hidden_states + self.drop_path(self.mlp(self.norm2(hidden_states)))

        outputs = (layer_output, outputs)
        return outputs


class MgpstrEncoder(nn.Module):
    def __init__(self, config: MgpstrConfig):
        super().__init__()
        # stochastic depth decay rule
        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]

        self.blocks = nn.Sequential(
            *[MgpstrLayer(config=config, drop_path=dpr[i]) for i in range(config.num_hidden_layers)]
        )

    def forward(self, hidden_states, output_attentions=False, output_hidden_states=False, return_dict=True):
        all_hidden_states = () if output_hidden_states else None
        all_self_attentions = () if output_attentions else None

        for _, blk in enumerate(self.blocks):
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            layer_outputs = blk(hidden_states)
            hidden_states = layer_outputs[0]

            if output_attentions:
                all_self_attentions = all_self_attentions + (layer_outputs[1],)

        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
        )


class MgpstrA3Module(nn.Module):
    def __init__(self, config: MgpstrConfig):
        super().__init__()
        self.token_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.tokenLearner = nn.Sequential(
            nn.Conv2d(config.hidden_size, config.hidden_size, kernel_size=(1, 1), stride=1, groups=8, bias=False),
            nn.Conv2d(config.hidden_size, config.max_token_length, kernel_size=(1, 1), stride=1, bias=False),
        )
        self.feat = nn.Conv2d(
            config.hidden_size, config.hidden_size, kernel_size=(1, 1), stride=1, groups=8, bias=False
        )
        self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        hidden_states = self.token_norm(hidden_states)
        hidden_states = hidden_states.transpose(1, 2).unsqueeze(-1)
        selected = self.tokenLearner(hidden_states)
        selected = selected.flatten(2)
        attentions = F.softmax(selected, dim=-1)

        feat = self.feat(hidden_states)
        feat = feat.flatten(2).transpose(1, 2)
        feat = torch.einsum("...si,...id->...sd", attentions, feat)
        a3_out = self.norm(feat)

        return (a3_out, attentions)


class MgpstrPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = MgpstrConfig
    base_model_prefix = "mgp_str"

    def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
        """Initialize the weights"""
        if isinstance(module, MgpstrEmbeddings):
            nn.init.trunc_normal_(module.pos_embed, mean=0.0, std=self.config.initializer_range)
            nn.init.trunc_normal_(module.cls_token, mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.data = nn.init.trunc_normal_(module.weight.data, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def _set_gradient_checkpointing(self, module: MgpstrEncoder, value: bool = False) -> None:
        if isinstance(module, MgpstrEncoder):
            module.gradient_checkpointing = value


MGP_STR_START_DOCSTRING = r"""
    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
    as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
    behavior.

    Parameters:
        config ([`MgpstrConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""

MGP_STR_INPUTS_DOCSTRING = r"""
    Args:
        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`ViTImageProcessor.__call__`]
            for details.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""


@add_start_docstrings(
    "The bare MGP-STR Model transformer outputting raw hidden-states without any specific head on top.",
    MGP_STR_START_DOCSTRING,
)
class MgpstrModel(MgpstrPreTrainedModel):
    def __init__(self, config: MgpstrConfig):
        super().__init__(config)
        self.config = config
        self.embeddings = MgpstrEmbeddings(config)
        self.encoder = MgpstrEncoder(config)

    def get_input_embeddings(self) -> nn.Module:
        return self.embeddings.proj

    @add_start_docstrings_to_model_forward(MGP_STR_INPUTS_DOCSTRING)
    def forward(self, pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")

        embedding_output = self.embeddings(pixel_values)

        encoder_outputs = self.encoder(
            embedding_output,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        if not return_dict:
            return encoder_outputs
        return BaseModelOutput(
            last_hidden_state=encoder_outputs.last_hidden_state,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )


@add_start_docstrings(
    """
    MGP-STR Model transformer with three classification heads on top (three A^3 modules and three linear layers on top
    of the transformer encoder output) for scene text recognition (STR).
    """,
    MGP_STR_START_DOCSTRING,
)
class MgpstrForSceneTextRecognition(MgpstrPreTrainedModel):
    config_class = MgpstrConfig
    main_input_name = "pixel_values"

    def __init__(self, config: MgpstrConfig) -> None:
        super().__init__(config)

        self.num_labels = config.num_labels
        self.mgp_str = MgpstrModel(config)

        self.char_a3_module = MgpstrA3Module(config)
        self.bpe_a3_module = MgpstrA3Module(config)
        self.wp_a3_module = MgpstrA3Module(config)

        self.char_head = nn.Linear(config.hidden_size, config.num_character_labels)
        self.bpe_head = nn.Linear(config.hidden_size, config.num_bpe_labels)
        self.wp_head = nn.Linear(config.hidden_size, config.num_wordpiece_labels)

    @add_start_docstrings_to_model_forward(MGP_STR_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=MgpstrModelOutput, config_class=MgpstrConfig)
    def forward(
        self,
        pixel_values,
        output_attentions=None,
        output_a3_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        output_a3_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of a3 modules. See `a3_attentions` under returned tensors
            for more detail.

        Returns:

        Example:

        ```python
        >>> from transformers import (
        ...     MgpstrProcessor,
        ...     MgpstrForSceneTextRecognition,
        ... )
        >>> import requests
        >>> from PIL import Image

        >>> # load image from the IIIT-5k dataset
        >>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
        >>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

        >>> processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
        >>> pixel_values = processor(images=image, return_tensors="pt").pixel_values

        >>> model = MgpstrForSceneTextRecognition.from_pretrained("alibaba-damo/mgp-str-base")

        >>> # inference
        >>> outputs = model(pixel_values)
        >>> out_strs = processor.batch_decode(outputs.logits)
        >>> out_strs["generated_text"]
        '["ticket"]'
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        mgp_outputs = self.mgp_str(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = mgp_outputs[0]

        char_a3_out, char_attention = self.char_a3_module(sequence_output)
        bpe_a3_out, bpe_attention = self.bpe_a3_module(sequence_output)
        wp_a3_out, wp_attention = self.wp_a3_module(sequence_output)

        char_logits = self.char_head(char_a3_out)
        bpe_logits = self.bpe_head(bpe_a3_out)
        wp_logits = self.wp_head(wp_a3_out)

        all_a3_attentions = (char_attention, bpe_attention, wp_attention) if output_a3_attentions else None
        all_logits = (char_logits, bpe_logits, wp_logits)

        if not return_dict:
            outputs = (all_logits, all_a3_attentions) + mgp_outputs[1:]
            return tuple(output for output in outputs if output is not None)
        return MgpstrModelOutput(
            logits=all_logits,
            hidden_states=mgp_outputs.hidden_states,
            attentions=mgp_outputs.attentions,
            a3_attentions=all_a3_attentions,
        )
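To make the shape contract of `MgpstrModelOutput.logits` concrete, here is a small forward-pass sketch with random weights (not part of the diff; sizes mirror the test suite rather than the released checkpoint):

```python
import torch

from transformers import MgpstrConfig, MgpstrForSceneTextRecognition

# Sketch: each logits tensor follows (batch, max_token_length, num_*_labels).
config = MgpstrConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=4,
    num_bpe_labels=99, num_wordpiece_labels=99,
)
model = MgpstrForSceneTextRecognition(config).eval()

pixel_values = torch.rand(1, config.num_channels, *config.image_size)
with torch.no_grad():
    outputs = model(pixel_values)

char_logits, bpe_logits, wp_logits = outputs.logits
print(char_logits.shape)  # torch.Size([1, 27, 38])
print(bpe_logits.shape)   # torch.Size([1, 27, 99])
```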
228 src/transformers/models/mgp_str/processing_mgp_str.py Normal file
@@ -0,0 +1,228 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Processor class for MGP-STR."""

import warnings

from transformers import AutoTokenizer
from transformers.utils import is_torch_available
from transformers.utils.generic import ExplicitEnum

from ...processing_utils import ProcessorMixin


if is_torch_available():
    import torch


class DecodeType(ExplicitEnum):
    CHARACTER = "char"
    BPE = "bpe"
    WORDPIECE = "wp"


SUPPORTED_ANNOTATION_FORMATS = (DecodeType.CHARACTER, DecodeType.BPE, DecodeType.WORDPIECE)


class MgpstrProcessor(ProcessorMixin):
    r"""
    Constructs a MGP-STR processor which wraps an image processor and the MGP-STR tokenizers into a single processor.

    [`MgpstrProcessor`] offers all the functionalities of [`ViTImageProcessor`] and [`MgpstrTokenizer`]. See the
    [`~MgpstrProcessor.__call__`] and [`~MgpstrProcessor.batch_decode`] for more information.

    Args:
        image_processor (`ViTImageProcessor`):
            An instance of `ViTImageProcessor`. The image processor is a required input.
        tokenizer ([`MgpstrTokenizer`]):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "char_tokenizer"]
    image_processor_class = "ViTImageProcessor"
    char_tokenizer_class = "MgpstrTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        if "feature_extractor" in kwargs:
            warnings.warn(
                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
                " instead.",
                FutureWarning,
            )
            feature_extractor = kwargs.pop("feature_extractor")

        image_processor = image_processor if image_processor is not None else feature_extractor
        if image_processor is None:
            raise ValueError("You need to specify an `image_processor`.")
        if tokenizer is None:
            raise ValueError("You need to specify a `tokenizer`.")

        self.char_tokenizer = tokenizer
        self.bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

        super().__init__(image_processor, tokenizer)

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        """
        When used in normal mode, this method forwards all its arguments to ViTImageProcessor's
        [`~ViTImageProcessor.__call__`] and returns its output. This method also forwards the `text` and `kwargs`
        arguments to MgpstrTokenizer's [`~MgpstrTokenizer.__call__`] if `text` is not `None` to encode the text. Please
        refer to the docstring of the above methods for more information.
        """
        if images is None and text is None:
            raise ValueError("You need to specify either an `images` or `text` input to process.")

        if images is not None:
            inputs = self.image_processor(images, return_tensors=return_tensors, **kwargs)
        if text is not None:
            encodings = self.char_tokenizer(text, return_tensors=return_tensors, **kwargs)

        if text is None:
            return inputs
        elif images is None:
            return encodings
        else:
            inputs["labels"] = encodings["input_ids"]
            return inputs

    def batch_decode(self, sequences):
        """
        Convert a list of lists of token ids into a list of strings by calling decode.

        Args:
            sequences (`torch.Tensor`):
                List of tokenized input ids.

        Returns:
            `Dict[str, any]`: Dictionary of all the outputs of the decoded results.
                generated_text (`List[str]`): The final results after fusion of char, bpe, and wp. scores
                (`List[float]`): The final scores after fusion of char, bpe, and wp. char_preds (`List[str]`): The list
                of character decoded sentences. bpe_preds (`List[str]`): The list of bpe decoded sentences. wp_preds
                (`List[str]`): The list of wp decoded sentences.

        This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        char_preds, bpe_preds, wp_preds = sequences
        batch_size = char_preds.size(0)

        char_strs, char_scores = self._decode_helper(char_preds, "char")
        bpe_strs, bpe_scores = self._decode_helper(bpe_preds, "bpe")
        wp_strs, wp_scores = self._decode_helper(wp_preds, "wp")

        final_strs = []
        final_scores = []
        for i in range(batch_size):
            scores = [char_scores[i], bpe_scores[i], wp_scores[i]]
            strs = [char_strs[i], bpe_strs[i], wp_strs[i]]
            max_score_index = scores.index(max(scores))
            final_strs.append(strs[max_score_index])
            final_scores.append(scores[max_score_index])

        out = {}
        out["generated_text"] = final_strs
        out["scores"] = final_scores
        out["char_preds"] = char_strs
        out["bpe_preds"] = bpe_strs
        out["wp_preds"] = wp_strs
        return out

    def _decode_helper(self, pred_logits, format):
        """
        Convert a list of lists of bpe token ids into a list of strings by calling bpe tokenizer.

        Args:
            pred_logits (`torch.Tensor`):
                List of model prediction logits.
            format (`Union[DecoderType, str]`):
                Type of model prediction. Must be one of ['char', 'bpe', 'wp'].
        Returns:
            `tuple`:
                dec_strs(`str`): The decode strings of model prediction. conf_scores(`List[float]`): The confidence
                score of model prediction.
        """
        if format == DecodeType.CHARACTER:
            decoder = self.char_decode
            eos_token = 1
            eos_str = "[s]"
        elif format == DecodeType.BPE:
            decoder = self.bpe_decode
            eos_token = 2
            eos_str = "#"
        elif format == DecodeType.WORDPIECE:
            decoder = self.wp_decode
            eos_token = 102
            eos_str = "[SEP]"
        else:
            raise ValueError(f"Format {format} is not supported.")

        dec_strs, conf_scores = [], []
        batch_size = pred_logits.size(0)
        batch_max_length = pred_logits.size(1)
        _, preds_index = pred_logits.topk(1, dim=-1, largest=True, sorted=True)
        preds_index = preds_index.view(-1, batch_max_length)[:, 1:]
        preds_str = decoder(preds_index)
        preds_max_prob, _ = torch.nn.functional.softmax(pred_logits, dim=2).max(dim=2)
        preds_max_prob = preds_max_prob[:, 1:]

        for index in range(batch_size):
            pred_eos = preds_str[index].find(eos_str)
            pred = preds_str[index][:pred_eos]
            pred_index = preds_index[index].cpu().tolist()
            pred_eos_index = pred_index.index(eos_token) if eos_token in pred_index else -1
            pred_max_prob = preds_max_prob[index][: pred_eos_index + 1]
            confidence_score = pred_max_prob.cumprod(dim=0)[-1] if pred_max_prob.nelement() != 0 else 0.0
            dec_strs.append(pred)
            conf_scores.append(confidence_score)

        return dec_strs, conf_scores

    def char_decode(self, sequences):
        """
        Convert a list of lists of char token ids into a list of strings by calling char tokenizer.

        Args:
            sequences (`torch.Tensor`):
                List of tokenized input ids.
        Returns:
            `List[str]`: The list of char decoded sentences.
        """
        decode_strs = [seq.replace(" ", "") for seq in self.char_tokenizer.batch_decode(sequences)]
        return decode_strs

    def bpe_decode(self, sequences):
        """
        Convert a list of lists of bpe token ids into a list of strings by calling bpe tokenizer.

        Args:
            sequences (`torch.Tensor`):
                List of tokenized input ids.
        Returns:
            `List[str]`: The list of bpe decoded sentences.
        """
        return self.bpe_tokenizer.batch_decode(sequences)

    def wp_decode(self, sequences):
        """
        Convert a list of lists of word piece token ids into a list of strings by calling word piece tokenizer.

        Args:
            sequences (`torch.Tensor`):
                List of tokenized input ids.
        Returns:
            `List[str]`: The list of wp decoded sentences.
        """
        decode_strs = [seq.replace(" ", "") for seq in self.wp_tokenizer.batch_decode(sequences)]
        return decode_strs
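The fusion in `batch_decode` simply keeps, per image, the prediction of whichever head reports the highest cumulative-probability confidence. A standalone toy sketch of that selection rule (made-up strings and scores, not part of the diff):

```python
# Sketch of the fusion rule used in MgpstrProcessor.batch_decode:
# per sample, keep the prediction of the head with the highest confidence score.
char_strs, bpe_strs, wp_strs = ["ticket"], ["tickef"], ["ticket"]
char_scores, bpe_scores, wp_scores = [0.96], [0.53], [0.91]

final_strs, final_scores = [], []
for i in range(len(char_strs)):
    scores = [char_scores[i], bpe_scores[i], wp_scores[i]]
    strs = [char_strs[i], bpe_strs[i], wp_strs[i]]
    best = scores.index(max(scores))
    final_strs.append(strs[best])
    final_scores.append(scores[best])

print(final_strs, final_scores)  # ['ticket'] [0.96]
```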
110 src/transformers/models/mgp_str/tokenization_mgp_str.py Normal file
@@ -0,0 +1,110 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for MGP-STR CHAR."""

import json
import os
from typing import Optional, Tuple

from ...tokenization_utils import PreTrainedTokenizer
from ...utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.json"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "mgp-str": "https://huggingface.co/alibaba-damo/mgp-str-base/blob/main/vocab.json",
    }
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"mgp-str": 27}


class MgpstrTokenizer(PreTrainedTokenizer):
    """
    Construct a MGP-STR char tokenizer.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        unk_token (`str`, *optional*, defaults to `"[GO]"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str`, *optional*, defaults to `"[GO]"`):
            The beginning of sequence token.
        eos_token (`str`, *optional*, defaults to `"[s]"`):
            The end of sequence token.
        pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[GO]"`):
            A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
            attention mechanisms or loss computation.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(self, vocab_file, unk_token="[GO]", bos_token="[GO]", eos_token="[s]", pad_token="[GO]", **kwargs):
        super().__init__(
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            **kwargs,
        )

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.vocab = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.vocab.items()}

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        """Tokenize a string."""
        char_tokens = []
        for s in text:
            char_tokens.extend(s)
        return char_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index)

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
            return
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.vocab, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        return (vocab_file,)
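A quick hedged sketch of the character-level tokenizer in isolation, using a throwaway vocabulary similar to the one the processor tests build (not part of the diff; the printed ids assume exactly this vocabulary order):

```python
import json
import tempfile

from transformers import MgpstrTokenizer

# Sketch: every character becomes one token; unknown characters fall back to "[GO]" (id 0).
vocab = ["[GO]", "[s]"] + list("0123456789abcdefghijklmnopqrstuvwxyz")
vocab_tokens = {tok: idx for idx, tok in enumerate(vocab)}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fp:
    json.dump(vocab_tokens, fp)
    vocab_file = fp.name

tokenizer = MgpstrTokenizer(vocab_file=vocab_file)
print(tokenizer.tokenize("ticket"))      # ['t', 'i', 'c', 'k', 'e', 't']
print(tokenizer("ticket")["input_ids"])  # [31, 20, 14, 22, 16, 31]
```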
@@ -4219,6 +4219,30 @@ class MegatronBertPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST = None


class MgpstrForSceneTextRecognition(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class MgpstrModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class MgpstrPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class MMBTForClassification(metaclass=DummyObject):
    _backends = ["torch"]
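These dummy objects only exist so that importing the new names without PyTorch still works and fails with a readable message on use. A sketch of the behaviour they guard (assumes a torch-less environment; the error wording is paraphrased):

```python
# Sketch: without PyTorch installed, the dummy class is what gets imported,
# and instantiating it raises an ImportError naming the missing backend.
from transformers import MgpstrForSceneTextRecognition

try:
    MgpstrForSceneTextRecognition()
except ImportError as err:
    print(err)  # explains that this class requires the PyTorch library
```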
0
tests/models/mgp_str/__init__.py
Normal file
0
tests/models/mgp_str/__init__.py
Normal file
269
tests/models/mgp_str/test_modeling_mgp_str.py
Normal file
269
tests/models/mgp_str/test_modeling_mgp_str.py
Normal file
@ -0,0 +1,269 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Testing suite for the PyTorch MGP-STR model. """
|
||||||
|
|
||||||
|
import inspect
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import requests
|
||||||
|
|
||||||
|
from transformers import MgpstrConfig
|
||||||
|
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
|
||||||
|
from transformers.utils import is_torch_available, is_vision_available
|
||||||
|
|
||||||
|
from ...test_configuration_common import ConfigTester
|
||||||
|
from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
from torch import nn
|
||||||
|
|
||||||
|
from transformers import MgpstrForSceneTextRecognition
|
||||||
|
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
from transformers import MgpstrProcessor
|
||||||
|
|
||||||
|
|
||||||
|
class MgpstrModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
is_training=False,
|
||||||
|
batch_size=13,
|
||||||
|
image_size=(32, 128),
|
||||||
|
patch_size=4,
|
||||||
|
num_channels=3,
|
||||||
|
max_token_length=27,
|
||||||
|
num_character_labels=38,
|
||||||
|
num_bpe_labels=99,
|
||||||
|
num_wordpiece_labels=99,
|
||||||
|
hidden_size=32,
|
||||||
|
num_hidden_layers=5,
|
||||||
|
num_attention_heads=4,
|
||||||
|
mlp_ratio=4.0,
|
||||||
|
patch_embeds_hidden_size=257,
|
||||||
|
output_hidden_states=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.is_training = is_training
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.image_size = image_size
|
||||||
|
self.patch_size = patch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.max_token_length = max_token_length
|
||||||
|
self.num_character_labels = num_character_labels
|
||||||
|
self.num_bpe_labels = num_bpe_labels
|
||||||
|
self.num_wordpiece_labels = num_wordpiece_labels
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.mlp_ratio = mlp_ratio
|
||||||
|
self.patch_embeds_hidden_size = patch_embeds_hidden_size
|
||||||
|
self.output_hidden_states = output_hidden_states
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size[0], self.image_size[1]])
|
||||||
|
config = self.get_config()
|
||||||
|
return config, pixel_values
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return MgpstrConfig(
|
||||||
|
image_size=self.image_size,
|
||||||
|
patch_size=self.patch_size,
|
||||||
|
num_channels=self.num_channels,
|
||||||
|
max_token_length=self.max_token_length,
|
||||||
|
num_character_labels=self.num_character_labels,
|
||||||
|
num_bpe_labels=self.num_bpe_labels,
|
||||||
|
num_wordpiece_labels=self.num_wordpiece_labels,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
mlp_ratio=self.mlp_ratio,
|
||||||
|
output_hidden_states=self.output_hidden_states,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(self, config, pixel_values):
|
||||||
|
model = MgpstrForSceneTextRecognition(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
generated_ids = model(pixel_values)
|
||||||
|
self.parent.assertEqual(
|
||||||
|
generated_ids[0][0].shape, (self.batch_size, self.max_token_length, self.num_character_labels)
|
||||||
|
)
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
config, pixel_values = config_and_inputs
|
||||||
|
inputs_dict = {"pixel_values": pixel_values}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class MgpstrModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (MgpstrForSceneTextRecognition,) if is_torch_available() else ()
|
||||||
|
fx_compatible = False
|
||||||
|
|
||||||
|
test_pruning = False
|
||||||
|
test_resize_embeddings = False
|
||||||
|
test_head_masking = False
|
||||||
|
test_attention_outputs = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = MgpstrModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=MgpstrConfig, has_text_modality=False)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
@unittest.skip(reason="MgpstrModel does not use inputs_embeds")
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_model_common_attributes(self):
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
|
||||||
|
x = model.get_output_embeddings()
|
||||||
|
self.assertTrue(x is None or isinstance(x, nn.Linear))
|
||||||
|
|
||||||
|
def test_forward_signature(self):
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
signature = inspect.signature(model.forward)
|
||||||
|
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||||
|
arg_names = [*signature.parameters.keys()]
|
||||||
|
|
||||||
|
expected_arg_names = ["pixel_values"]
|
||||||
|
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||||
|
|
||||||
|
@unittest.skip(reason="MgpstrModel does not support feedforward chunking")
|
||||||
|
def test_feed_forward_chunking(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_gradient_checkpointing_backward_compatibility(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
if not model_class.supports_gradient_checkpointing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
config.gradient_checkpointing = True
|
||||||
|
model = model_class(config)
|
||||||
|
self.assertTrue(model.is_gradient_checkpointing)
|
||||||
|
|
||||||
|
def test_hidden_states_output(self):
|
||||||
|
def check_hidden_states_output(inputs_dict, config, model_class):
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||||
|
|
||||||
|
hidden_states = outputs.hidden_states
|
||||||
|
|
||||||
|
expected_num_layers = getattr(
|
||||||
|
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
|
||||||
|
)
|
||||||
|
self.assertEqual(len(hidden_states), expected_num_layers)
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
list(hidden_states[0].shape[-2:]),
|
||||||
|
[self.model_tester.patch_embeds_hidden_size, self.model_tester.hidden_size],
|
||||||
|
)
|
||||||
|
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
inputs_dict["output_hidden_states"] = True
|
||||||
|
check_hidden_states_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
# check that output_hidden_states also work using config
|
||||||
|
del inputs_dict["output_hidden_states"]
|
||||||
|
config.output_hidden_states = True
|
||||||
|
|
||||||
|
check_hidden_states_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
# override as the `logit_scale` parameter initilization is different for MgpstrModel
|
||||||
|
def test_initialization(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
configs_no_init = _config_zero_init(config)
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config=configs_no_init)
|
||||||
|
for name, param in model.named_parameters():
|
||||||
|
if isinstance(param, (nn.Linear, nn.Conv2d, nn.LayerNorm)):
|
||||||
|
if param.requires_grad:
|
||||||
|
self.assertIn(
|
||||||
|
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||||
|
[0.0, 1.0],
|
||||||
|
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||||
|
)
|
||||||
|
|
||||||
|
@unittest.skip(reason="Retain_grad is tested in individual model tests")
|
||||||
|
def test_retain_grad_hidden_states_attentions(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||


# We will verify our results on an image from the IIIT-5k dataset
def prepare_img():
    url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
    im = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return im


@require_vision
@require_torch
class MgpstrModelIntegrationTest(unittest.TestCase):
    @slow
    def test_inference(self):
        model_name = "alibaba-damo/mgp-str-base"
        model = MgpstrForSceneTextRecognition.from_pretrained(model_name).to(torch_device)
        processor = MgpstrProcessor.from_pretrained(model_name)

        image = prepare_img()
        inputs = processor(images=image, return_tensors="pt").pixel_values.to(torch_device)

        # forward pass
        with torch.no_grad():
            outputs = model(inputs)

        # verify the logits
        self.assertEqual(outputs.logits[0].shape, torch.Size((1, 27, 38)))

        out_strs = processor.batch_decode(outputs.logits)
        expected_text = "ticket"

        self.assertEqual(out_strs["generated_text"][0], expected_text)

        expected_slice = torch.tensor(
            [[[-39.7358, -44.8562, -36.6253], [-62.3605, -64.5908, -59.0069], [-74.6127, -68.9724, -71.7150]]],
            device=torch_device,
        )

        self.assertTrue(torch.allclose(outputs.logits[0][:, 1:4, 1:4], expected_slice, atol=1e-4))
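
A minimal end-to-end usage sketch mirroring the integration test above (illustrative only; it assumes the `alibaba-damo/mgp-str-base` checkpoint is reachable and that `scene_text.png` is a local crop of scene text, neither of which is part of the test suite):

    from PIL import Image
    from transformers import MgpstrForSceneTextRecognition, MgpstrProcessor

    processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
    model = MgpstrForSceneTextRecognition.from_pretrained("alibaba-damo/mgp-str-base")

    image = Image.open("scene_text.png").convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    outputs = model(pixel_values)
    print(processor.batch_decode(outputs.logits)["generated_text"][0])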
tests/models/mgp_str/test_processor_mgp_str.py (new file)
@ -0,0 +1,211 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the MgpstrProcessor. """

import json
import os
import shutil
import tempfile
import unittest

import numpy as np
import pytest

from transformers import MgpstrTokenizer
from transformers.models.mgp_str.tokenization_mgp_str import VOCAB_FILES_NAMES
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import IMAGE_PROCESSOR_NAME, is_torch_available, is_vision_available


if is_torch_available():
    import torch


if is_vision_available():
    from PIL import Image

    from transformers import MgpstrProcessor, ViTImageProcessor


@require_torch
@require_vision
class MgpstrProcessorTest(unittest.TestCase):
    image_processing_class = ViTImageProcessor if is_vision_available() else None

    @property
    def image_processor_dict(self):
        return self.image_processor_tester.prepare_image_processor_dict()

    def setUp(self):
        self.image_size = (3, 32, 128)
        self.tmpdirname = tempfile.mkdtemp()

        # fmt: off
        vocab = ['[GO]', '[s]', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
        # fmt: on
        vocab_tokens = dict(zip(vocab, range(len(vocab))))

        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
        with open(self.vocab_file, "w", encoding="utf-8") as fp:
            fp.write(json.dumps(vocab_tokens) + "\n")

        image_processor_map = {
            "do_normalize": False,
            "do_resize": True,
            "feature_extractor_type": "ViTFeatureExtractor",
            "resample": 3,
            "size": {"height": 32, "width": 128},
        }
        self.image_processor_file = os.path.join(self.tmpdirname, IMAGE_PROCESSOR_NAME)
        with open(self.image_processor_file, "w", encoding="utf-8") as fp:
            json.dump(image_processor_map, fp)

    def get_tokenizer(self, **kwargs):
        return MgpstrTokenizer.from_pretrained(self.tmpdirname, **kwargs)

    def get_image_processor(self, **kwargs):
        return ViTImageProcessor.from_pretrained(self.tmpdirname, **kwargs)

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def prepare_image_inputs(self):
        """This function prepares a single PIL image."""

        image_input = np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)

        image_input = Image.fromarray(np.moveaxis(image_input, 0, -1))

        return image_input

    def test_save_load_pretrained_default(self):
        tokenizer = self.get_tokenizer()
        image_processor = self.get_image_processor()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
        processor.save_pretrained(self.tmpdirname)
        processor = MgpstrProcessor.from_pretrained(self.tmpdirname, use_fast=False)

        self.assertEqual(processor.char_tokenizer.get_vocab(), tokenizer.get_vocab())
        self.assertIsInstance(processor.char_tokenizer, MgpstrTokenizer)

        self.assertEqual(processor.image_processor.to_json_string(), image_processor.to_json_string())
        self.assertIsInstance(processor.image_processor, ViTImageProcessor)

    def test_save_load_pretrained_additional_features(self):
        tokenizer = self.get_tokenizer()
        image_processor = self.get_image_processor()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
        processor.save_pretrained(self.tmpdirname)

        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
        image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)

        processor = MgpstrProcessor.from_pretrained(
            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
        )

        self.assertEqual(processor.char_tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
        self.assertIsInstance(processor.char_tokenizer, MgpstrTokenizer)

        self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
        self.assertIsInstance(processor.image_processor, ViTImageProcessor)

    def test_image_processor(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)

        image_input = self.prepare_image_inputs()

        input_image_proc = image_processor(image_input, return_tensors="np")
        input_processor = processor(images=image_input, return_tensors="np")

        for key in input_image_proc.keys():
            self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2)

    def test_tokenizer(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)

        input_str = "test"

        encoded_processor = processor(text=input_str)

        encoded_tok = tokenizer(input_str)
        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key])

    def test_processor(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)

        input_str = "test"
        image_input = self.prepare_image_inputs()

        inputs = processor(text=input_str, images=image_input)

        self.assertListEqual(list(inputs.keys()), ["pixel_values", "labels"])

        # test if it raises when no input is passed
        with pytest.raises(ValueError):
            processor()

    def test_tokenizer_decode(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.char_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)
        decode_strs = [seq.replace(" ", "") for seq in decoded_tok]

        self.assertListEqual(decode_strs, decoded_processor)

    def test_model_input_names(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)

        input_str = None
        image_input = self.prepare_image_inputs()

        inputs = processor(text=input_str, images=image_input)

        self.assertListEqual(list(inputs.keys()), processor.model_input_names)

    def test_processor_batch_decode(self):
        image_processor = self.get_image_processor()
        tokenizer = self.get_tokenizer()

        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)

        char_input = torch.randn(1, 27, 38)
        bpe_input = torch.randn(1, 27, 50257)
        wp_input = torch.randn(1, 27, 30522)

        results = processor.batch_decode([char_input, bpe_input, wp_input])

        self.assertListEqual(list(results.keys()), ["generated_text", "scores", "char_preds", "bpe_preds", "wp_preds"])
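
The three dummy logit tensors above mirror MGP-STR's three prediction heads: a 38-symbol character head, a BPE head with a GPT-2-sized vocabulary (50257), and a WordPiece head with a BERT-sized vocabulary (30522), each over 27 positions. A hedged sketch of decoding such outputs with a pretrained processor (assumes the `alibaba-damo/mgp-str-base` checkpoint is reachable; the random logits only demonstrate the output structure):

    import torch
    from transformers import MgpstrProcessor

    processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
    char_logits = torch.randn(1, 27, 38)
    bpe_logits = torch.randn(1, 27, 50257)
    wp_logits = torch.randn(1, 27, 30522)

    out = processor.batch_decode([char_logits, bpe_logits, wp_logits])
    # decoded strings and scores; per-head results sit under char_preds / bpe_preds / wp_preds
    print(out["generated_text"], out["scores"])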
tests/models/mgp_str/test_tokenization_mgp_str.py (new file)
@ -0,0 +1,96 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import json
import os
import unittest

from transformers import MgpstrTokenizer
from transformers.models.mgp_str.tokenization_mgp_str import VOCAB_FILES_NAMES
from transformers.testing_utils import require_tokenizers

from ...test_tokenization_common import TokenizerTesterMixin


@require_tokenizers
class MgpstrTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = MgpstrTokenizer
    test_rust_tokenizer = False
    from_pretrained_kwargs = {}
    test_seq2seq = False

    def setUp(self):
        super().setUp()

        # fmt: off
        vocab = ['[GO]', '[s]', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
        # fmt: on
        vocab_tokens = dict(zip(vocab, range(len(vocab))))

        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
        with open(self.vocab_file, "w", encoding="utf-8") as fp:
            fp.write(json.dumps(vocab_tokens) + "\n")

    def get_tokenizer(self, **kwargs):
        return MgpstrTokenizer.from_pretrained(self.tmpdirname, **kwargs)

    def get_input_output_texts(self, tokenizer):
        input_text = "tester"
        output_text = "tester"
        return input_text, output_text

    @unittest.skip("MGP-STR always lowercases letters.")
    def test_added_tokens_do_lower_case(self):
        pass

    def test_add_special_tokens(self):
        tokenizers = self.get_tokenizers(do_lower_case=False)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                special_token = "[SPECIAL_TOKEN]"

                tokenizer.add_special_tokens({"cls_token": special_token})
                encoded_special_token = tokenizer.encode([special_token], add_special_tokens=False)
                self.assertEqual(len(encoded_special_token), 1)

                decoded = tokenizer.decode(encoded_special_token, skip_special_tokens=True)
                self.assertTrue(special_token not in decoded)

    def test_internal_consistency(self):
        tokenizers = self.get_tokenizers()
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                input_text, output_text = self.get_input_output_texts(tokenizer)

                tokens = tokenizer.tokenize(input_text)
                ids = tokenizer.convert_tokens_to_ids(tokens)
                ids_2 = tokenizer.encode(input_text, add_special_tokens=False)
                self.assertListEqual(ids, ids_2)

                tokens_2 = tokenizer.convert_ids_to_tokens(ids)
                self.assertNotEqual(len(tokens_2), 0)
                text_2 = tokenizer.decode(ids)
                self.assertIsInstance(text_2, str)

                self.assertEqual(text_2.replace(" ", ""), output_text)

    @unittest.skip("MGP-STR tokenizer only handles one sequence.")
    def test_maximum_encoding_length_pair_input(self):
        pass

    @unittest.skip("inputs cannot be pretokenized in MgpstrTokenizer")
    def test_pretokenized_inputs(self):
        pass
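
Because the tokenizer is backed by a plain JSON character vocabulary (the 38 symbols written in setUp above), a throwaway instance can be built the same way outside the test harness; a minimal sketch (the temporary directory and example string are illustrative):

    import json
    import os
    import tempfile

    from transformers import MgpstrTokenizer
    from transformers.models.mgp_str.tokenization_mgp_str import VOCAB_FILES_NAMES

    vocab = ["[GO]", "[s]"] + list("0123456789abcdefghijklmnopqrstuvwxyz")
    tmpdir = tempfile.mkdtemp()
    with open(os.path.join(tmpdir, VOCAB_FILES_NAMES["vocab_file"]), "w", encoding="utf-8") as fp:
        json.dump(dict(zip(vocab, range(len(vocab)))), fp)

    tokenizer = MgpstrTokenizer.from_pretrained(tmpdir)
    ids = tokenizer.encode("ticket", add_special_tokens=False)
    print(ids, tokenizer.decode(ids))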
utils/check_repo.py
@ -95,6 +95,7 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    "M2M100Encoder",  # Building part of bigger (tested) model.
    "M2M100Decoder",  # Building part of bigger (tested) model.
    "MCTCTEncoder",  # Building part of bigger (tested) model.
    "MgpstrModel",  # Building part of bigger (tested) model.
    "Speech2TextEncoder",  # Building part of bigger (tested) model.
    "Speech2TextDecoder",  # Building part of bigger (tested) model.
    "LEDEncoder",  # Building part of bigger (tested) model.
@ -269,6 +270,7 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "LukeForEntityClassification",
    "LukeForEntityPairClassification",
    "LukeForEntitySpanClassification",
    "MgpstrModel",
    "OpenAIGPTDoubleHeadsModel",
    "OwlViTTextModel",
    "OwlViTVisionModel",