diff --git a/README.md b/README.md index 35f5312148a..1c0a7189c09 100644 --- a/README.md +++ b/README.md @@ -235,6 +235,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[DPR](https://huggingface.co/transformers/model_doc/dpr.html)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. +1. **[EncoderDecoder](https://huggingface.co/transformers/model_doc/encoderdecoder.html)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[FlauBERT](https://huggingface.co/transformers/model_doc/flaubert.html)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[Funnel Transformer](https://huggingface.co/transformers/model_doc/funnel.html)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. @@ -265,7 +266,9 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. 1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. 1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/master/model_doc/speechencoderdecoder.html)** 1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. +1. 
**[SpeechToTextTransformer2](https://huggingface.co/transformers/master/model_doc/speech_to_text_2.html)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. 1. **[Splinter](https://huggingface.co/transformers/model_doc/splinter.html)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. 1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. 1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. diff --git a/docs/source/index.rst b/docs/source/index.rst index 48e0fdbe634..cbdd1f8fd1a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -176,133 +176,141 @@ Supported models 25. :doc:`DPR ` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain Question Answering `__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. -26. :doc:`ELECTRA ` (from Google Research/Stanford University) released with the paper `ELECTRA: +26. :doc:`EncoderDecoder ` (from Google Research) released with the paper `Leveraging + Pre-trained Checkpoints for Sequence Generation Tasks `__ by Sascha Rothe, Shashi + Narayan, Aliaksei Severyn. +27. :doc:`ELECTRA ` (from Google Research/Stanford University) released with the paper `ELECTRA: Pre-training text encoders as discriminators rather than generators `__ by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. -27. :doc:`FlauBERT ` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model +28. :doc:`FlauBERT ` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model Pre-training for French `__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. -28. :doc:`Funnel Transformer ` (from CMU/Google Brain) released with the paper `Funnel-Transformer: +29. :doc:`Funnel Transformer ` (from CMU/Google Brain) released with the paper `Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing `__ by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. -29. :doc:`GPT ` (from OpenAI) released with the paper `Improving Language Understanding by Generative +30. :doc:`GPT ` (from OpenAI) released with the paper `Improving Language Understanding by Generative Pre-Training `__ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. -30. :doc:`GPT-2 ` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask +31. 
:doc:`GPT-2 ` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask Learners `__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. -31. :doc:`GPT-J ` (from EleutherAI) released in the repository `kingoflolz/mesh-transformer-jax +32. :doc:`GPT-J ` (from EleutherAI) released in the repository `kingoflolz/mesh-transformer-jax `__ by Ben Wang and Aran Komatsuzaki. -32. :doc:`GPT Neo ` (from EleutherAI) released in the repository `EleutherAI/gpt-neo +33. :doc:`GPT Neo ` (from EleutherAI) released in the repository `EleutherAI/gpt-neo `__ by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. -33. :doc:`Hubert ` (from Facebook) released with the paper `HuBERT: Self-Supervised Speech +34. :doc:`Hubert ` (from Facebook) released with the paper `HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units `__ by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. -34. :doc:`I-BERT ` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization +35. :doc:`I-BERT ` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization `__ by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer -35. :doc:`LayoutLM ` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training +36. :doc:`LayoutLM ` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training of Text and Layout for Document Image Understanding `__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. -36. :doc:`LayoutLMv2 ` (from Microsoft Research Asia) released with the paper `LayoutLMv2: +37. :doc:`LayoutLMv2 ` (from Microsoft Research Asia) released with the paper `LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding `__ by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. -37. :doc:`LayoutXLM ` (from Microsoft Research Asia) released with the paper `LayoutXLM: +38. :doc:`LayoutXLM ` (from Microsoft Research Asia) released with the paper `LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding `__ by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. -38. :doc:`LED ` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer +39. :doc:`LED ` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer `__ by Iz Beltagy, Matthew E. Peters, Arman Cohan. -39. :doc:`Longformer ` (from AllenAI) released with the paper `Longformer: The Long-Document +40. :doc:`Longformer ` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer `__ by Iz Beltagy, Matthew E. Peters, Arman Cohan. -40. :doc:`LUKE ` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity +41. :doc:`LUKE ` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention `__ by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. -41. :doc:`LXMERT ` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality +42. :doc:`LXMERT ` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering `__ by Hao Tan and Mohit Bansal. -42. 
:doc:`M2M100 ` (from Facebook) released with the paper `Beyond English-Centric Multilingual +43. :doc:`M2M100 ` (from Facebook) released with the paper `Beyond English-Centric Multilingual Machine Translation `__ by by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. -43. :doc:`MarianMT ` Machine translation models trained using `OPUS `__ data by +44. :doc:`MarianMT ` Machine translation models trained using `OPUS `__ data by Jörg Tiedemann. The `Marian Framework `__ is being developed by the Microsoft Translator Team. -44. :doc:`MBart ` (from Facebook) released with the paper `Multilingual Denoising Pre-training for +45. :doc:`MBart ` (from Facebook) released with the paper `Multilingual Denoising Pre-training for Neural Machine Translation `__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. -45. :doc:`MBart-50 ` (from Facebook) released with the paper `Multilingual Translation with Extensible +46. :doc:`MBart-50 ` (from Facebook) released with the paper `Multilingual Translation with Extensible Multilingual Pretraining and Finetuning `__ by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. -46. :doc:`Megatron-BERT ` (from NVIDIA) released with the paper `Megatron-LM: Training +47. :doc:`Megatron-BERT ` (from NVIDIA) released with the paper `Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism `__ by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. -47. :doc:`Megatron-GPT2 ` (from NVIDIA) released with the paper `Megatron-LM: Training +48. :doc:`Megatron-GPT2 ` (from NVIDIA) released with the paper `Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism `__ by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. -48. :doc:`MPNet ` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted +49. :doc:`MPNet ` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted Pre-training for Language Understanding `__ by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. -49. :doc:`MT5 ` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained +50. :doc:`MT5 ` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained text-to-text transformer `__ by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. -50. :doc:`Pegasus ` (from Google) released with the paper `PEGASUS: Pre-training with Extracted +51. :doc:`Pegasus ` (from Google) released with the paper `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization `__> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. -51. :doc:`ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: Predicting +52. :doc:`ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training `__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. -52. :doc:`Reformer ` (from Google Research) released with the paper `Reformer: The Efficient +53. 
:doc:`Reformer ` (from Google Research) released with the paper `Reformer: The Efficient Transformer `__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. -53. :doc:`RemBERT ` (from Google Research) released with the paper `Rethinking embedding coupling in +54. :doc:`RemBERT ` (from Google Research) released with the paper `Rethinking embedding coupling in pre-trained language models `__ by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. -54. :doc:`RoBERTa ` (from Facebook), released together with the paper a `Robustly Optimized BERT +55. :doc:`RoBERTa ` (from Facebook), released together with the paper a `Robustly Optimized BERT Pretraining Approach `__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. -55. :doc:`RoFormer ` (from ZhuiyiTechnology), released together with the paper a `RoFormer: +56. :doc:`RoFormer ` (from ZhuiyiTechnology), released together with the paper a `RoFormer: Enhanced Transformer with Rotary Position Embedding `__ by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. -56. :doc:`SpeechToTextTransformer ` (from Facebook), released together with the paper +57. `SpeechEncoderDecoder `__ +58. :doc:`SpeechToTextTransformer ` (from Facebook), released together with the paper `fairseq S2T: Fast Speech-to-Text Modeling with fairseq `__ by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. -57. :doc:`Splinter ` (from Tel Aviv University), released together with the paper `Few-Shot +59. `SpeechToTextTransformer2 `__ (from + Facebook), released together with the paper `Large-Scale Self- and Semi-Supervised Learning for Speech Translation + `__ by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis + Conneau. +60. :doc:`Splinter ` (from Tel Aviv University), released together with the paper `Few-Shot Question Answering by Pretraining Span Selection `__ by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. -58. :doc:`SqueezeBert ` released with the paper `SqueezeBERT: What can computer vision teach NLP +61. :doc:`SqueezeBert ` released with the paper `SqueezeBERT: What can computer vision teach NLP about efficient neural networks? `__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. -59. :doc:`T5 ` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a +62. :doc:`T5 ` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer `__ by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. -60. :doc:`TAPAS ` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via +63. :doc:`TAPAS ` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via Pre-training `__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. -61. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: +64. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `__ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. -62. :doc:`Vision Transformer (ViT) ` (from Google AI) released with the paper `An Image is Worth 16x16 +65. 
:doc:`Vision Transformer (ViT) ` (from Google AI) released with the paper `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale `__ by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. -63. :doc:`VisualBERT ` (from UCLA NLP) released with the paper `VisualBERT: A Simple and +66. :doc:`VisualBERT ` (from UCLA NLP) released with the paper `VisualBERT: A Simple and Performant Baseline for Vision and Language `__ by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. -64. :doc:`Wav2Vec2 ` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for +67. :doc:`Wav2Vec2 ` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations `__ by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. -65. :doc:`XLM ` (from Facebook) released together with the paper `Cross-lingual Language Model +68. :doc:`XLM ` (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining `__ by Guillaume Lample and Alexis Conneau. -66. :doc:`XLM-ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: +69. :doc:`XLM-ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training `__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. -67. :doc:`XLM-RoBERTa ` (from Facebook AI), released together with the paper `Unsupervised +70. :doc:`XLM-RoBERTa ` (from Facebook AI), released together with the paper `Unsupervised Cross-lingual Representation Learning at Scale `__ by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. -68. :doc:`XLNet ` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive +71. :doc:`XLNet ` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding `__ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. -69. :doc:`XLSR-Wav2Vec2 ` (from Facebook AI) released with the paper `Unsupervised +72. :doc:`XLSR-Wav2Vec2 ` (from Facebook AI) released with the paper `Unsupervised Cross-Lingual Representation Learning For Speech Recognition `__ by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. @@ -426,8 +434,12 @@ Flax), PyTorch, and/or TensorFlow. 
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 | RoFormer | ✅ | ✅ | ✅ | ✅ | ❌ |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
+| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
++-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 | Speech2Text | ✅ | ❌ | ✅ | ❌ | ❌ |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
+| Speech2Text2 | ✅ | ❌ | ❌ | ❌ | ❌ |
++-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 | Splinter | ✅ | ✅ | ✅ | ❌ | ❌ |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 | SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
@@ -590,7 +602,9 @@ Flax), PyTorch, and/or TensorFlow.
     model_doc/retribert
     model_doc/roberta
     model_doc/roformer
+    model_doc/speechencoderdecoder
     model_doc/speech_to_text
+    model_doc/speech_to_text_2
     model_doc/splinter
     model_doc/squeezebert
     model_doc/t5
diff --git a/docs/source/model_doc/speech_to_text_2.rst b/docs/source/model_doc/speech_to_text_2.rst
new file mode 100644
index 00000000000..0f179b5caa7
--- /dev/null
+++ b/docs/source/model_doc/speech_to_text_2.rst
@@ -0,0 +1,123 @@
+..
+    Copyright 2021 The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations under the License.
+
+Speech2Text2
+-----------------------------------------------------------------------------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Speech2Text2 model is used together with :doc:`Wav2Vec2 <wav2vec2>` for Speech Translation models proposed in
+`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
+Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
+
+Speech2Text2 is a *decoder-only* transformer model that can be used with any speech *encoder-only* model, such as
+:doc:`Wav2Vec2 <wav2vec2>` or :doc:`HuBERT <hubert>`, for Speech-to-Text tasks. Please refer to the
+:doc:`SpeechEncoderDecoder <speechencoderdecoder>` class for how to combine Speech2Text2 with any speech *encoder-only*
+model.
+
+This model was contributed by `Patrick von Platen <https://huggingface.co/patrickvonplaten>`__.
+
+The original code can be found `here
+`__.
+
+
+Tips:
+
+- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
+  the `official models `__.
+- Speech2Text2 is always used within the :doc:`SpeechEncoderDecoder <speechencoderdecoder>` framework, as sketched in
+  the example below.
+- Speech2Text2's tokenizer currently only supports inference, but not training.
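+
+As a minimal sketch of this pairing (randomly initialized toy models for illustration only; a trained checkpoint such
+as ``facebook/s2t-wav2vec2-large-en-de`` already bundles both parts and would normally be loaded with
+``from_pretrained``):
+
+.. code-block::
+
+    >>> from transformers import (
+    ...     Speech2Text2Config,
+    ...     Speech2Text2ForCausalLM,
+    ...     SpeechEncoderDecoderModel,
+    ...     Wav2Vec2Config,
+    ...     Wav2Vec2Model,
+    ... )
+
+    >>> # any speech *encoder-only* model can act as the encoder
+    >>> encoder = Wav2Vec2Model(Wav2Vec2Config())
+    >>> # Speech2Text2 provides the autoregressive text decoder
+    >>> decoder = Speech2Text2ForCausalLM(Speech2Text2Config(is_decoder=True, add_cross_attention=True))
+    >>> # differing hidden sizes are bridged by a projection layer the model adds automatically
+    >>> model = SpeechEncoderDecoderModel(encoder=encoder, decoder=decoder)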
+
+Inference
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Speech2Text2's :class:`~transformers.SpeechEncoderDecoderModel` accepts raw waveform input values from speech and
+makes use of :func:`~transformers.generation_utils.GenerationMixin.generate` to translate the input speech
+autoregressively to the target language.
+
+The :class:`~transformers.Wav2Vec2FeatureExtractor` class is responsible for preprocessing the input speech and
+:class:`~transformers.Speech2Text2Tokenizer` decodes the generated target tokens to the target string. The
+:class:`~transformers.Speech2Text2Processor` wraps :class:`~transformers.Wav2Vec2FeatureExtractor` and
+:class:`~transformers.Speech2Text2Tokenizer` into a single instance to both extract the input features and decode the
+predicted token ids.
+
+- Step-by-step Speech Translation
+
+.. code-block::
+
+    >>> import torch
+    >>> from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
+    >>> from datasets import load_dataset
+    >>> import soundfile as sf
+
+    >>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
+    >>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
+
+    >>> def map_to_array(batch):
+    ...     speech, _ = sf.read(batch["file"])
+    ...     batch["speech"] = speech
+    ...     return batch
+
+    >>> ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+    >>> ds = ds.map(map_to_array)
+
+    >>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
+    >>> generated_ids = model.generate(input_ids=inputs["input_values"], attention_mask=inputs["attention_mask"])
+
+    >>> transcription = processor.batch_decode(generated_ids)
+
+
+- Speech Translation via Pipelines
+
+  The automatic speech recognition pipeline can also be used to translate speech in just a couple of lines of code:
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> from transformers import pipeline
+
+    >>> librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+    >>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")
+
+    >>> translation_de = asr(librispeech_en[0]["file"])
+
+
+See the `model hub `__ to look for Speech2Text2 checkpoints.
+
+
+Speech2Text2Config
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Speech2Text2Config
+    :members:
+
+
+Speech2Text2Tokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Speech2Text2Tokenizer
+    :members: batch_decode, decode, save_vocabulary
+
+
+Speech2Text2Processor
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Speech2Text2Processor
+    :members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor
+
+
+Speech2Text2ForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+..
autoclass:: transformers.Speech2Text2ForCausalLM
+    :members: forward
diff --git a/docs/source/model_doc/speechencoderdecoder.rst b/docs/source/model_doc/speechencoderdecoder.rst
new file mode 100644
index 00000000000..c0773e69f65
--- /dev/null
+++ b/docs/source/model_doc/speechencoderdecoder.rst
@@ -0,0 +1,40 @@
+..
+    Copyright 2021 The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations under the License.
+
+Speech Encoder Decoder Models
+-----------------------------------------------------------------------------------------------------------------------
+
+The :class:`~transformers.SpeechEncoderDecoderModel` can be used to initialize a speech-sequence-to-text-sequence model
+with any pretrained speech autoencoding model as the encoder (*e.g.* :doc:`Wav2Vec2 <wav2vec2>`, :doc:`Hubert
+<hubert>`) and any pretrained autoregressive model as the decoder.
+
+The effectiveness of initializing speech-sequence-to-text-sequence models with pretrained checkpoints for speech
+recognition and speech translation has, *e.g.*, been shown in `Large-Scale Self- and Semi-Supervised Learning for
+Speech Translation <https://arxiv.org/abs/2104.06678>`__ by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael
+Auli, Alexis Conneau.
+
+An example of how to use a :class:`~transformers.SpeechEncoderDecoderModel` for inference can be seen in
+:doc:`Speech2Text2 <speech_to_text_2>`.
+
+
+SpeechEncoderDecoderConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SpeechEncoderDecoderConfig
+    :members:
+
+
+SpeechEncoderDecoderModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+..
autoclass:: transformers.SpeechEncoderDecoderModel + :members: forward, from_encoder_decoder_pretrained diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index d7b06c7275f..d993e13f98b 100755 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -248,10 +248,17 @@ _import_structure = { "models.retribert": ["RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RetriBertConfig", "RetriBertTokenizer"], "models.roberta": ["ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "RobertaConfig", "RobertaTokenizer"], "models.roformer": ["ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "RoFormerConfig", "RoFormerTokenizer"], + "models.speech_encoder_decoder": ["SpeechEncoderDecoderConfig"], "models.speech_to_text": [ "SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", "Speech2TextConfig", ], + "models.speech_to_text_2": [ + "SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Speech2Text2Config", + "Speech2Text2Processor", + "Speech2Text2Tokenizer", + ], "models.splinter": ["SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SplinterConfig", "SplinterTokenizer"], "models.squeezebert": ["SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "SqueezeBertConfig", "SqueezeBertTokenizer"], "models.t5": ["T5_PRETRAINED_CONFIG_ARCHIVE_MAP", "T5Config"], @@ -1074,6 +1081,7 @@ if is_torch_available(): "load_tf_weights_in_roformer", ] ) + _import_structure["models.speech_encoder_decoder"].extend(["SpeechEncoderDecoderModel"]) _import_structure["models.speech_to_text"].extend( [ "SPEECH_TO_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1082,6 +1090,7 @@ if is_torch_available(): "Speech2TextPreTrainedModel", ] ) + _import_structure["models.speech_to_text_2"].extend(["Speech2Text2ForCausalLM", "Speech2Text2PreTrainedModel"]) _import_structure["models.splinter"].extend( [ "SPLINTER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2015,7 +2024,14 @@ if TYPE_CHECKING: from .models.retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig, RetriBertTokenizer from .models.roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig, RobertaTokenizer from .models.roformer import ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, RoFormerConfig, RoFormerTokenizer + from .models.speech_encoder_decoder import SpeechEncoderDecoderConfig from .models.speech_to_text import SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, Speech2TextConfig + from .models.speech_to_text_2 import ( + SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP, + Speech2Text2Config, + Speech2Text2Processor, + Speech2Text2Tokenizer, + ) from .models.splinter import SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP, SplinterConfig, SplinterTokenizer from .models.squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig, SqueezeBertTokenizer from .models.t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config @@ -2710,12 +2726,14 @@ if TYPE_CHECKING: RoFormerPreTrainedModel, load_tf_weights_in_roformer, ) + from .models.speech_encoder_decoder import SpeechEncoderDecoderModel from .models.speech_to_text import ( SPEECH_TO_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST, Speech2TextForConditionalGeneration, Speech2TextModel, Speech2TextPreTrainedModel, ) + from .models.speech_to_text_2 import Speech2Text2ForCausalLM, Speech2Text2PreTrainedModel from .models.splinter import ( SPLINTER_PRETRAINED_MODEL_ARCHIVE_LIST, SplinterForQuestionAnswering, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index efdd964324a..0fecd385126 100644 --- a/src/transformers/models/auto/configuration_auto.py +++ 
b/src/transformers/models/auto/configuration_auto.py @@ -40,6 +40,7 @@ CONFIG_MAPPING_NAMES = OrderedDict( ("detr", "DetrConfig"), ("gpt_neo", "GPTNeoConfig"), ("big_bird", "BigBirdConfig"), + ("speech_to_text_2", "Speech2Text2Config"), ("speech_to_text", "Speech2TextConfig"), ("vit", "ViTConfig"), ("wav2vec2", "Wav2Vec2Config"), @@ -83,6 +84,7 @@ CONFIG_MAPPING_NAMES = OrderedDict( ("xlm", "XLMConfig"), ("ctrl", "CTRLConfig"), ("electra", "ElectraConfig"), + ("speech-encoder-decoder", "SpeechEncoderDecoderConfig"), ("encoder-decoder", "EncoderDecoderConfig"), ("funnel", "FunnelConfig"), ("lxmert", "LxmertConfig"), @@ -113,6 +115,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ("big_bird", "BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("megatron-bert", "MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("speech_to_text", "SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("speech_to_text_2", "SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit", "VIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("m2m_100", "M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -174,6 +177,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ("detr", "DETR"), ("gpt_neo", "GPT Neo"), ("big_bird", "BigBird"), + ("speech_to_text_2", "Speech2Text2"), ("speech_to_text", "Speech2Text"), ("vit", "ViT"), ("wav2vec2", "Wav2Vec2"), @@ -211,6 +215,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ("ctrl", "CTRL"), ("electra", "ELECTRA"), ("encoder-decoder", "Encoder decoder"), + ("speech-encoder-decoder", "Speech Encoder decoder"), ("funnel", "Funnel Transformer"), ("lxmert", "LXMERT"), ("deberta-v2", "DeBERTa-v2"), diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index 54c03a3dcb5..726f2adc261 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -156,11 +156,11 @@ class AutoFeatureExtractor: model_type = config_class_to_model_type(type(config).__name__) - if model_type is not None: - return FEATURE_EXTRACTOR_MAPPING[type(config)].from_dict(config_dict, **kwargs) - elif "feature_extractor_type" in config_dict: + if "feature_extractor_type" in config_dict: feature_extractor_class = feature_extractor_class_from_name(config_dict["feature_extractor_type"]) return feature_extractor_class.from_dict(config_dict, **kwargs) + elif model_type is not None: + return FEATURE_EXTRACTOR_MAPPING[type(config)].from_dict(config_dict, **kwargs) raise ValueError( f"Unrecognized feature extractor in {pretrained_model_name_or_path}. 
Should have a `feature_extractor_type` key in " diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index be3a964904e..a3569d90ae6 100644 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -212,6 +212,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict( ("blenderbot", "BlenderbotForCausalLM"), ("blenderbot-small", "BlenderbotSmallForCausalLM"), ("megatron-bert", "MegatronBertForCausalLM"), + ("speech_to_text_2", "Speech2Text2ForCausalLM"), ] ) diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 7542781f686..c8751df1745 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -154,6 +154,7 @@ else: ("rag", ("RagTokenizer", None)), ("xlm-prophetnet", ("XLMProphetNetTokenizer" if is_sentencepiece_available() else None, None)), ("speech_to_text", ("Speech2TextTokenizer" if is_sentencepiece_available() else None, None)), + ("speech_to_text_2", ("Speech2Text2Tokenizer", None)), ("m2m_100", ("M2M100Tokenizer" if is_sentencepiece_available() else None, None)), ("prophetnet", ("ProphetNetTokenizer", None)), ("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)), diff --git a/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py b/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py index c240d1079e9..17c288d0e3c 100644 --- a/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py +++ b/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py @@ -53,7 +53,7 @@ ENCODER_DECODER_START_DOCSTRING = r""" general usage and behavior. Parameters: - config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model. + config (:class:`~transformers.PretrainedConfig`): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights. diff --git a/src/transformers/models/speech_encoder_decoder/__init__.py b/src/transformers/models/speech_encoder_decoder/__init__.py new file mode 100644 index 00000000000..01d552038a2 --- /dev/null +++ b/src/transformers/models/speech_encoder_decoder/__init__.py @@ -0,0 +1,40 @@ +# flake8: noqa +# There's no way to ignore "F401 '...' imported but unused" warnings in this +# module, but to preserve other warnings. So, don't check this module at all. + +# Copyright 2021 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
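+
+# Note: this submodule follows the library's lazy-import pattern. The names declared in
+# `_import_structure` below are only materialized on first attribute access (via `_LazyModule`),
+# which keeps `import transformers` fast as the number of models grows.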
+ +from typing import TYPE_CHECKING + +from ...file_utils import _LazyModule, is_torch_available + + +_import_structure = { + "configuration_speech_encoder_decoder": ["SpeechEncoderDecoderConfig"], +} + +if is_torch_available(): + _import_structure["modeling_speech_encoder_decoder"] = ["SpeechEncoderDecoderModel"] + +if TYPE_CHECKING: + from .configuration_speech_encoder_decoder import SpeechEncoderDecoderConfig + + if is_torch_available(): + from .modeling_speech_encoder_decoder import SpeechEncoderDecoderModel + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure) diff --git a/src/transformers/models/speech_encoder_decoder/configuration_speech_encoder_decoder.py b/src/transformers/models/speech_encoder_decoder/configuration_speech_encoder_decoder.py new file mode 100644 index 00000000000..dff79b35c38 --- /dev/null +++ b/src/transformers/models/speech_encoder_decoder/configuration_speech_encoder_decoder.py @@ -0,0 +1,119 @@ +# coding=utf-8 +# Copyright 2021 The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy + +from ...configuration_utils import PretrainedConfig +from ...utils import logging +from ..auto.configuration_auto import AutoConfig + + +logger = logging.get_logger(__name__) + + +class SpeechEncoderDecoderConfig(PretrainedConfig): + r""" + :class:`~transformers.SpeechEncoderDecoderConfig` is the configuration class to store the configuration of a + :class:`~transformers.SpeechEncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to + the specified arguments, defining the encoder and decoder configs. + + Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model + outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + + Args: + kwargs (`optional`): + Dictionary of keyword arguments. Notably: + + - **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration + object that defines the encoder config. + - **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration + object that defines the decoder config. 
+
+    Examples::
+
+        >>> from transformers import BertConfig, Wav2Vec2Config, SpeechEncoderDecoderConfig, SpeechEncoderDecoderModel
+
+        >>> # Initializing a Wav2Vec2 & BERT style configuration
+        >>> config_encoder = Wav2Vec2Config()
+        >>> config_decoder = BertConfig()
+
+        >>> config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
+
+        >>> # Initializing a Wav2Vec2Bert model from Wav2Vec2 & bert-base-uncased style configurations
+        >>> model = SpeechEncoderDecoderModel(config=config)
+
+        >>> # Accessing the model configuration
+        >>> config_encoder = model.config.encoder
+        >>> config_decoder = model.config.decoder
+        >>> # set decoder config to causal lm
+        >>> config_decoder.is_decoder = True
+        >>> config_decoder.add_cross_attention = True
+
+        >>> # Saving the model, including its configuration
+        >>> model.save_pretrained('my-model')
+
+        >>> # loading model and config from pretrained folder
+        >>> encoder_decoder_config = SpeechEncoderDecoderConfig.from_pretrained('my-model')
+        >>> model = SpeechEncoderDecoderModel.from_pretrained('my-model', config=encoder_decoder_config)
+    """
+    model_type = "speech-encoder-decoder"
+    is_composition = True
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        if "encoder" not in kwargs or "decoder" not in kwargs:
+            raise ValueError(
+                f"A configuration of type {self.model_type} cannot be instantiated because both `encoder` and "
+                f"`decoder` sub-configurations must be passed, but only {kwargs} was given."
+            )
+
+        encoder_config = kwargs.pop("encoder")
+        encoder_model_type = encoder_config.pop("model_type")
+        decoder_config = kwargs.pop("decoder")
+        decoder_model_type = decoder_config.pop("model_type")
+
+        self.encoder = AutoConfig.for_model(encoder_model_type, **encoder_config)
+        self.decoder = AutoConfig.for_model(decoder_model_type, **decoder_config)
+        self.is_encoder_decoder = True
+
+    @classmethod
+    def from_encoder_decoder_configs(
+        cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig, **kwargs
+    ) -> PretrainedConfig:
+        r"""
+        Instantiate a :class:`~transformers.SpeechEncoderDecoderConfig` (or a derived class) from a pre-trained encoder
+        model configuration and decoder model configuration.
+
+        Returns:
+            :class:`SpeechEncoderDecoderConfig`: An instance of a configuration object
+        """
+        logger.info("Setting `config.is_decoder=True` and `config.add_cross_attention=True` for decoder_config")
+        decoder_config.is_decoder = True
+        decoder_config.add_cross_attention = True
+
+        return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict(), **kwargs)
+
+    def to_dict(self):
+        """
+        Serializes this instance to a Python dictionary. Overrides the default `to_dict()` from `PretrainedConfig`.
+
+        Returns:
+            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
+        """
+        output = copy.deepcopy(self.__dict__)
+        output["encoder"] = self.encoder.to_dict()
+        output["decoder"] = self.decoder.to_dict()
+        output["model_type"] = self.__class__.model_type
+        return output
diff --git a/src/transformers/models/speech_encoder_decoder/convert_speech_to_text_wav2vec2_seq2seq_original_to_pytorch.py b/src/transformers/models/speech_encoder_decoder/convert_speech_to_text_wav2vec2_seq2seq_original_to_pytorch.py
new file mode 100644
index 00000000000..40433bba134
--- /dev/null
+++ b/src/transformers/models/speech_encoder_decoder/convert_speech_to_text_wav2vec2_seq2seq_original_to_pytorch.py
@@ -0,0 +1,312 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert Wav2Vec2 checkpoint.""" + + +import argparse +import json +import os + +import fairseq +import torch +from torch import nn + +from transformers import ( + Speech2Text2Config, + Speech2Text2ForCausalLM, + Speech2Text2Tokenizer, + SpeechEncoderDecoderConfig, + SpeechEncoderDecoderModel, + Wav2Vec2Config, + Wav2Vec2FeatureExtractor, + Wav2Vec2Model, + logging, +) + + +logging.set_verbosity_info() +logger = logging.get_logger(__name__) + +MAPPING = { + "post_extract_proj": "feature_projection.projection", + "encoder.pos_conv.0": "encoder.pos_conv_embed.conv", + "self_attn.k_proj": "encoder.layers.*.attention.k_proj", + "self_attn.v_proj": "encoder.layers.*.attention.v_proj", + "self_attn.q_proj": "encoder.layers.*.attention.q_proj", + "self_attn.out_proj": "encoder.layers.*.attention.out_proj", + "self_attn_layer_norm": "encoder.layers.*.layer_norm", + "fc1": "encoder.layers.*.feed_forward.intermediate_dense", + "fc2": "encoder.layers.*.feed_forward.output_dense", + "final_layer_norm": "encoder.layers.*.final_layer_norm", + "encoder.layer_norm": "encoder.layer_norm", + "w2v_model.layer_norm": "feature_projection.layer_norm", + "quantizer.weight_proj": "quantizer.weight_proj", + "quantizer.vars": "quantizer.codevectors", + "project_q": "project_q", + "final_proj": "project_hid", + "w2v_encoder.proj": "lm_head", + "mask_emb": "masked_spec_embed", +} +TOP_LEVEL_KEYS = [ + "lm_head", + "quantizer.weight_proj", + "quantizer.codevectors", + "project_q", + "project_hid", +] + + +def set_recursively(hf_pointer, key, value, full_name, weight_type): + for attribute in key.split("."): + hf_pointer = getattr(hf_pointer, attribute) + + if weight_type is not None: + hf_shape = getattr(hf_pointer, weight_type).shape + else: + hf_shape = hf_pointer.shape + + assert ( + hf_shape == value.shape + ), f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be {value.shape} for {full_name}" + + if weight_type == "weight": + hf_pointer.weight.data = value + elif weight_type == "weight_g": + hf_pointer.weight_g.data = value + elif weight_type == "weight_v": + hf_pointer.weight_v.data = value + elif weight_type == "bias": + hf_pointer.bias.data = value + else: + hf_pointer.data = value + + logger.info(f"{key + '.' 
+ weight_type if weight_type is not None else ''} was initialized from {full_name}.") + + +def recursively_load_weights_wav2vec2(fairseq_model, hf_model): + unused_weights = [] + fairseq_dict = fairseq_model.state_dict() + + feature_extractor = hf_model.feature_extractor + + # if encoder has different dim to decoder -> use proj_weight + proj_weight = None + + for name, value in fairseq_dict.items(): + is_used = False + if "conv_layers" in name: + load_conv_layer( + name, + value, + feature_extractor, + unused_weights, + hf_model.config.feat_extract_norm == "group", + ) + is_used = True + elif name.split(".")[0] == "proj": + proj_weight = fairseq_model.proj + is_used = True + else: + for key, mapped_key in MAPPING.items(): + if key in name or key.split("w2v_model.")[-1] == name.split(".")[0]: + is_used = True + if "*" in mapped_key: + layer_index = name.split(key)[0].split(".")[-2] + mapped_key = mapped_key.replace("*", layer_index) + if "weight_g" in name: + weight_type = "weight_g" + elif "weight_v" in name: + weight_type = "weight_v" + elif "bias" in name: + weight_type = "bias" + elif "weight" in name: + weight_type = "weight" + else: + weight_type = None + set_recursively(hf_model, mapped_key, value, name, weight_type) + continue + if not is_used: + unused_weights.append(name) + + logger.warning(f"Unused weights: {unused_weights}") + + return proj_weight + + +def load_conv_layer(full_name, value, feature_extractor, unused_weights, use_group_norm): + name = full_name.split("conv_layers.")[-1] + items = name.split(".") + layer_id = int(items[0]) + type_id = int(items[1]) + + if type_id == 0: + if "bias" in name: + assert ( + value.shape == feature_extractor.conv_layers[layer_id].conv.bias.data.shape + ), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.bias.data.shape} was found." + feature_extractor.conv_layers[layer_id].conv.bias.data = value + logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.") + elif "weight" in name: + assert ( + value.shape == feature_extractor.conv_layers[layer_id].conv.weight.data.shape + ), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.weight.data.shape} was found." + feature_extractor.conv_layers[layer_id].conv.weight.data = value + logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.") + elif (type_id == 2 and not use_group_norm) or (type_id == 2 and layer_id == 0 and use_group_norm): + if "bias" in name: + assert ( + value.shape == feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape + ), f"{full_name} has size {value.shape}, but {feature_extractor[layer_id].layer_norm.bias.data.shape} was found." + feature_extractor.conv_layers[layer_id].layer_norm.bias.data = value + logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.") + elif "weight" in name: + assert ( + value.shape == feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape + ), f"{full_name} has size {value.shape}, but {feature_extractor[layer_id].layer_norm.weight.data.shape} was found." 
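+            # shape agreement was checked by the assert above; copy the fairseq layer-norm weight over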
+            feature_extractor.conv_layers[layer_id].layer_norm.weight.data = value
+            logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
+    else:
+        unused_weights.append(full_name)
+
+
+def make_linear_from_emb(emb):
+    vocab_size, emb_size = emb.weight.shape
+    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)
+    lin_layer.weight.data = emb.weight.data
+    return lin_layer
+
+
+def create_vocab_dict(dict_path):
+    with open(dict_path, "r", encoding="utf-8") as f:
+        lines = f.readlines()
+        words = [line.split(" ")[0] for line in lines]
+
+    num_words = len(words)
+
+    # fairseq dictionaries reserve the first four ids for the special tokens
+    vocab_dict = {
+        "<s>": 0,
+        "<pad>": 1,
+        "</s>": 2,
+        "<unk>": 3,
+    }
+
+    vocab_dict.update({k: v for k, v in zip(words, range(4, num_words + 4))})
+    return vocab_dict
+
+
+@torch.no_grad()
+def convert_wav2vec2_checkpoint(
+    checkpoint_path,
+    pytorch_dump_folder_path,
+    dict_path,
+    encoder_config_path,
+    decoder_config_path,
+    vocab_size,
+    num_decoder_layers,
+):
+    """
+    Copy/paste/tweak model's weights to transformers design.
+    """
+    encoder_config = Wav2Vec2Config.from_pretrained(encoder_config_path)
+    decoder_config = Speech2Text2Config.from_pretrained(
+        decoder_config_path, vocab_size=vocab_size, decoder_layers=num_decoder_layers, do_stable_layer_norm=True
+    )
+
+    feature_extractor = Wav2Vec2FeatureExtractor(
+        feature_size=1,
+        sampling_rate=16000,
+        padding_value=0,
+        do_normalize=True,
+        return_attention_mask=True,
+    )
+
+    model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
+        [checkpoint_path], arg_overrides={"data": "/".join(dict_path.split("/")[:-1])}
+    )
+    model = model[0].eval()
+
+    # set weights for wav2vec2 encoder
+    hf_encoder = Wav2Vec2Model(encoder_config)
+    projection_layer = recursively_load_weights_wav2vec2(model.encoder, hf_encoder)
+
+    hf_decoder = Speech2Text2ForCausalLM(decoder_config)
+    missing_keys, unexpected_keys = hf_decoder.model.decoder.load_state_dict(model.decoder.state_dict(), strict=False)
+
+    # set output linear layer
+    unexpected_keys.remove("embed_out")
+    hf_decoder.lm_head.weight = nn.Parameter(model.decoder.embed_out.detach())
+
+    # layer norm is init to identity matrix so leaving it is fine
+    logger.warning(f"The following keys are missing when loading the decoder weights: {missing_keys}")
+    logger.warning(f"The following keys are unexpected when loading the decoder weights: {unexpected_keys}")
+
+    hf_wav2vec = SpeechEncoderDecoderModel(encoder=hf_encoder, decoder=hf_decoder)
+    hf_wav2vec.config.tie_word_embeddings = False
+
+    # add projection layer
+    hf_wav2vec.enc_to_dec_proj.weight = nn.Parameter(projection_layer.weight)
+    hf_wav2vec.enc_to_dec_proj.bias = nn.Parameter(projection_layer.bias)
+
+    vocab_dict = create_vocab_dict(dict_path)
+
+    with open(os.path.join(pytorch_dump_folder_path, "vocab.json"), "w") as fp:
+        json.dump(vocab_dict, fp)
+
+    tokenizer = Speech2Text2Tokenizer(os.path.join(pytorch_dump_folder_path, "vocab.json"))
+    tokenizer.save_pretrained(pytorch_dump_folder_path)
+
+    config = hf_wav2vec.config.to_dict()
+    config["pad_token_id"] = tokenizer.pad_token_id
+    config["bos_token_id"] = tokenizer.bos_token_id
+    config["eos_token_id"] = tokenizer.eos_token_id
+    config["tokenizer_class"] = "speech_to_text_2"
+    config["feature_extractor_type"] = "wav2vec2"
+
+    hf_wav2vec.config = SpeechEncoderDecoderConfig.from_dict(config)
+
+    hf_wav2vec.save_pretrained(pytorch_dump_folder_path)
+    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
+    parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
+    parser.add_argument("--dict_path", default=None, type=str, help="Path to dict of fine-tuned model")
+    parser.add_argument(
+        "--encoder_config_path",
+        default="facebook/wav2vec2-large-lv60",
+        type=str,
+        help="Path to hf encoder wav2vec2 checkpoint config",
+    )
+    parser.add_argument(
+        "--decoder_config_path",
+        default="facebook/s2t-small-mustc-en-fr-st",
+        type=str,
+        help="Path to hf decoder s2t checkpoint config",
+    )
+    parser.add_argument("--vocab_size", default=10224, type=int, help="Vocab size of decoder")
+    parser.add_argument("--num_decoder_layers", default=7, type=int, help="Number of decoder layers")
+
+    args = parser.parse_args()
+    convert_wav2vec2_checkpoint(
+        args.checkpoint_path,
+        args.pytorch_dump_folder_path,
+        args.dict_path,
+        encoder_config_path=args.encoder_config_path,
+        decoder_config_path=args.decoder_config_path,
+        vocab_size=args.vocab_size,
+        num_decoder_layers=args.num_decoder_layers,
+    )
diff --git a/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py b/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py
new file mode 100644
index 00000000000..06ce8937dfa
--- /dev/null
+++ b/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py
@@ -0,0 +1,522 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Classes to support Speech-Encoder-Text-Decoder architectures """
+
+
+from typing import Optional
+
+from torch import nn
+
+from ...configuration_utils import PretrainedConfig
+from ...file_utils import add_start_docstrings, add_start_docstrings_to_model_forward, replace_return_docstrings
+from ...modeling_outputs import Seq2SeqLMOutput
+from ...modeling_utils import PreTrainedModel
+from ...utils import logging
+from ..auto.configuration_auto import AutoConfig
+from ..auto.modeling_auto import AutoModel, AutoModelForCausalLM
+from .configuration_speech_encoder_decoder import SpeechEncoderDecoderConfig
+
+
+logger = logging.get_logger(__name__)
+
+_CONFIG_FOR_DOC = "SpeechEncoderDecoderConfig"
+
+SPEECH_ENCODER_DECODER_START_DOCSTRING = r"""
+    This class can be used to initialize a speech-sequence-to-text-sequence model with any pretrained speech
+    autoencoding model as the encoder and any pretrained text autoregressive model as the decoder. The encoder is
+    loaded via the :meth:`~transformers.AutoModel.from_pretrained` method and the decoder is loaded via the
+    :meth:`~transformers.AutoModelForCausalLM.from_pretrained` method. Cross-attention layers are automatically added
+    to the decoder and should be fine-tuned on a downstream generative task, like speech recognition or speech
+    translation.
+
+    The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation
+    tasks was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
+    <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+
+    Additionally, in `Large-Scale Self- and Semi-Supervised Learning for Speech Translation
+    <https://arxiv.org/abs/2104.06678>`__ it is shown how leveraging large pretrained speech models for speech
+    translation yields a significant performance improvement.
+
+    After such a Speech-Encoder-Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other
+    model (see the examples for more information).
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
+    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
+    general usage and behavior.
+
+    Parameters:
+        config (:class:`~transformers.PretrainedConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
+            weights.
+"""
+
+SPEECH_ENCODER_DECODER_INPUTS_DOCSTRING = r"""
+    Args:
+        input_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Float values of input raw speech waveform. Values can be obtained by loading a `.flac` or `.wav` audio file
+            into an array of type `List[float]` or a `numpy.ndarray`, *e.g.* via the soundfile library (`pip install
+            soundfile`). To prepare the array into `input_values`, the :class:`~transformers.Wav2Vec2Processor` should
+            be used for padding and conversion into a tensor of type `torch.FloatTensor`. See
+            :meth:`transformers.Wav2Vec2Processor.__call__` for details.
+        input_features (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, feature_size)`, `optional`):
+            Float values of fbank features extracted from the raw speech waveform. Raw speech waveform can be obtained
+            by loading a ``.flac`` or ``.wav`` audio file into an array of type :obj:`List[float]` or a
+            :obj:`numpy.ndarray`, *e.g.* via the soundfile library (``pip install soundfile``). To prepare the array
+            into :obj:`input_features`, the :class:`~transformers.Speech2TextTokenizer` should be used for extracting
+            the fbank features, padding and conversion into a tensor of type :obj:`torch.FloatTensor`. See
+            :meth:`~transformers.Speech2TextTokenizer.__call__` for details.
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
+        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
+            Indices of decoder input sequence tokens in the vocabulary.
+
+            Indices can be obtained using :class:`~transformers.PreTrainedTokenizer`.
See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + details. + + `What are input IDs? <../glossary.html#input-ids>`__ + + If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see + :obj:`past_key_values`). + + Provide for sequence to sequence training to the decoder. Indices can be obtained using + :class:`~transformers.PreTrainedTokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and + :meth:`transformers.PreTrainedTokenizer.__call__` for details. + decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): + Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will + also be used by default. + encoder_outputs (:obj:`tuple(torch.FloatTensor)`, `optional`): + This tuple must consist of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: + :obj:`attentions`) :obj:`last_hidden_state` (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, + sequence_length, hidden_size)`) is a tensor of hidden-states at the output of the last layer of the + encoder. Used in the cross-attention of the decoder. + past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + + If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` + (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` + instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded + representation. This is useful if you want more control over how to convert :obj:`decoder_input_ids` + indices into associated vectors than the model's internal embedding lookup matrix. + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Labels for computing the masked language modeling loss for the decoder. Indices should be in ``[-100, 0, + ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored + (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]`` + use_cache (:obj:`bool`, `optional`): + If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up + decoding (see :obj:`past_key_values`). + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. 
+ output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + If set to ``True``, the model will return a :class:`~transformers.file_utils.Seq2SeqLMOutput` instead of a + plain tuple. + kwargs: (`optional`) Remaining dictionary of keyword arguments. Keyword arguments come in two flavors: + + - Without a prefix which will be input as ``**encoder_kwargs`` for the encoder forward function. + - With a `decoder_` prefix which will be input as ``**decoder_kwargs`` for the decoder forward function. +""" + + +@add_start_docstrings(SPEECH_ENCODER_DECODER_START_DOCSTRING) +class SpeechEncoderDecoderModel(PreTrainedModel): + r""" + :class:`~transformers.EncoderDecoder` is a generic model class that will be instantiated as a transformer + architecture with one of the base model classes of the library as encoder and another one as decoder when created + with the :meth`~transformers.AutoModel.from_pretrained` class method for the encoder and + :meth`~transformers.AutoModelForCausalLM.from_pretrained` class method for the decoder. + """ + config_class = SpeechEncoderDecoderConfig + base_model_prefix = "speech_encoder_decoder" + + def __init__( + self, + config: Optional[PretrainedConfig] = None, + encoder: Optional[PreTrainedModel] = None, + decoder: Optional[PreTrainedModel] = None, + ): + if config is None and (encoder is None or decoder is None): + raise ValueError("Either a configuration or an encoder and a decoder has to be provided.") + if config is None: + config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config) + else: + if not isinstance(config, self.config_class): + raise ValueError(f"Config: {config} has to be of type {self.config_class}") + + # initialize with config + # make sure input & output embeddings is not tied + config.tie_word_embeddings = False + super().__init__(config) + + if encoder is None: + encoder = AutoModel.from_config(config.encoder) + + if decoder is None: + decoder = AutoModelForCausalLM.from_config(config.decoder) + + self.encoder = encoder + self.decoder = decoder + + if self.encoder.config.to_dict() != self.config.encoder.to_dict(): + logger.warning( + f"Config of the encoder: {self.encoder.__class__} is overwritten by shared encoder config: {self.config.encoder}" + ) + if self.decoder.config.to_dict() != self.config.decoder.to_dict(): + logger.warning( + f"Config of the decoder: {self.decoder.__class__} is overwritten by shared decoder config: {self.config.decoder}" + ) + + # make sure that the individual model's config refers to the shared config + # so that the updates to the config will be synced + self.encoder.config = self.config.encoder + self.decoder.config = self.config.decoder + + if self.encoder.config.hidden_size != self.decoder.config.hidden_size: + # encoder outputs might need to be projected to different dimension for decoder + self.enc_to_dec_proj = nn.Linear(self.encoder.config.hidden_size, self.decoder.config.hidden_size) + + if self.encoder.get_output_embeddings() is not None: + raise ValueError( + f"The encoder {self.encoder} should not have a LM Head. 
Please use a model without LM Head" + ) + + def get_encoder(self): + return self.encoder + + def get_decoder(self): + return self.decoder + + def get_output_embeddings(self): + return self.decoder.get_output_embeddings() + + def set_output_embeddings(self, new_embeddings): + return self.decoder.set_output_embeddings(new_embeddings) + + @classmethod + def from_pretrained(cls, *args, **kwargs): + # At the moment fast initialization is not supported + # for composite models + if kwargs.get("_fast_init", False): + logger.warning( + "Fast initialization is currently not supported for SpeechEncoderDecoderModel. Falling back to slow intialization..." + ) + kwargs["_fast_init"] = False + return super().from_pretrained(*args, **kwargs) + + @classmethod + def from_encoder_decoder_pretrained( + cls, + encoder_pretrained_model_name_or_path: str = None, + decoder_pretrained_model_name_or_path: str = None, + *model_args, + **kwargs + ) -> PreTrainedModel: + r""" + Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model + checkpoints. + + + The model is set in evaluation mode by default using :obj:`model.eval()` (Dropout modules are deactivated). To + train the model, you need to first set it back in training mode with :obj:`model.train()`. + + Params: + encoder_pretrained_model_name_or_path (:obj: `str`, `optional`): + Information necessary to initiate the encoder. Can be either: + + - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co. + Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under + a user or organization name, like ``dbmdz/bert-base-german-cased``. + - A path to a `directory` containing model weights saved using + :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``. + - A path or url to a `tensorflow index checkpoint file` (e.g, ``./tf_model/model.ckpt.index``). In + this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be provided + as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in + a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + + decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`): + Information necessary to initiate the decoder. Can be either: + + - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co. + Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under + a user or organization name, like ``dbmdz/bert-base-german-cased``. + - A path to a `directory` containing model weights saved using + :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``. + - A path or url to a `tensorflow index checkpoint file` (e.g, ``./tf_model/model.ckpt.index``). In + this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be provided + as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in + a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + + model_args (remaining positional arguments, `optional`): + All remaning positional arguments will be passed to the underlying model's ``__init__`` method. 
+
+            kwargs (remaining dictionary of keyword arguments, `optional`):
+                Can be used to update the configuration object (after it being loaded) and initiate the model (e.g.,
+                :obj:`output_attentions=True`).
+
+                - To update the encoder configuration, use the prefix `encoder_` for each configuration parameter.
+                - To update the decoder configuration, use the prefix `decoder_` for each configuration parameter.
+                - To update the parent model configuration, do not use a prefix for each configuration parameter.
+
+                Behaves differently depending on whether a :obj:`config` is provided or automatically loaded.
+
+        Example::
+
+            >>> from transformers import SpeechEncoderDecoderModel
+            >>> # initialize a wav2vec2bert from a pretrained Wav2Vec2 and a pretrained BERT model. Note that the cross-attention layers will be randomly initialized
+            >>> model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained('facebook/wav2vec2-base-960h', 'bert-base-uncased')
+            >>> # saving model after fine-tuning
+            >>> model.save_pretrained("./wav2vec2bert")
+            >>> # load fine-tuned model
+            >>> model = SpeechEncoderDecoderModel.from_pretrained("./wav2vec2bert")
+
+        """
+
+        kwargs_encoder = {
+            argument[len("encoder_") :]: value for argument, value in kwargs.items() if argument.startswith("encoder_")
+        }
+
+        kwargs_decoder = {
+            argument[len("decoder_") :]: value for argument, value in kwargs.items() if argument.startswith("decoder_")
+        }
+
+        # remove encoder, decoder kwargs from kwargs
+        for key in kwargs_encoder.keys():
+            del kwargs["encoder_" + key]
+        for key in kwargs_decoder.keys():
+            del kwargs["decoder_" + key]
+
+        # Load and initialize the encoder and decoder
+        # The distinction between encoder and decoder at the model level is made
+        # by the value of the flag `is_decoder` that we need to set correctly.
+        encoder = kwargs_encoder.pop("model", None)
+        if encoder is None:
+            if encoder_pretrained_model_name_or_path is None:
+                raise ValueError(
+                    f"No `encoder_model` is passed to kwargs: {kwargs_encoder}. In this case make sure that `encoder_pretrained_model_name_or_path` is defined"
+                )
+
+            if "config" not in kwargs_encoder:
+                encoder_config = AutoConfig.from_pretrained(encoder_pretrained_model_name_or_path)
+                if encoder_config.is_decoder is True or encoder_config.add_cross_attention is True:
+
+                    logger.info(
+                        f"Initializing {encoder_pretrained_model_name_or_path} as an encoder model "
+                        "from a decoder model. Cross-attention and causal mask are disabled."
+                    )
+                    encoder_config.is_decoder = False
+                    encoder_config.add_cross_attention = False
+
+                kwargs_encoder["config"] = encoder_config
+
+            encoder = AutoModel.from_pretrained(encoder_pretrained_model_name_or_path, *model_args, **kwargs_encoder)
+
+        decoder = kwargs_decoder.pop("model", None)
+        if decoder is None:
+            if decoder_pretrained_model_name_or_path is None:
+                raise ValueError(
+                    "If `decoder_model` is not defined as an argument, a `decoder_pretrained_model_name_or_path` has to be defined"
+                )
+
+            if "config" not in kwargs_decoder:
+                decoder_config = AutoConfig.from_pretrained(decoder_pretrained_model_name_or_path)
+                if decoder_config.is_decoder is False or decoder_config.add_cross_attention is False:
+                    logger.info(
+                        f"Initializing {decoder_pretrained_model_name_or_path} as a decoder model. "
+                        f"Cross attention layers are added to {decoder_pretrained_model_name_or_path} "
+                        f"and randomly initialized if {decoder_pretrained_model_name_or_path}'s architecture allows for cross attention layers."
+ ) + decoder_config.is_decoder = True + decoder_config.add_cross_attention = True + + kwargs_decoder["config"] = decoder_config + + if kwargs_decoder["config"].is_decoder is False or kwargs_decoder["config"].add_cross_attention is False: + logger.warning( + f"Decoder model {decoder_pretrained_model_name_or_path} is not initialized as a decoder." + f"In order to initialize {decoder_pretrained_model_name_or_path} as a decoder, " + "make sure that the attributes `is_decoder` and `add_cross_attention` of `decoder_config`" + "passed to `.from_encoder_decoder_pretrained(...)` are set to `True` or do not pass a `decoder_config` to `.from_encoder_decoder_pretrained(...)`" + ) + + decoder = AutoModelForCausalLM.from_pretrained(decoder_pretrained_model_name_or_path, **kwargs_decoder) + + # instantiate config with corresponding kwargs + config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config, **kwargs) + + # make sure input & output embeddings is not tied + config.tie_word_embeddings = False + return cls(encoder=encoder, decoder=decoder, config=config) + + @add_start_docstrings_to_model_forward(SPEECH_ENCODER_DECODER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_values=None, + input_features=None, + attention_mask=None, + decoder_input_ids=None, + decoder_attention_mask=None, + encoder_outputs=None, + past_key_values=None, + decoder_inputs_embeds=None, + labels=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + **kwargs, + ): + r""" + Returns: + + Examples:: + + >>> from transformers import SpeechEncoderDecoderModel, Speech2Text2Processor + >>> import torch + + >>> processor = Speech2Text2Processor.from_pretrained('facebook/s2t-wav2vec2-large-en-de') + >>> model = SpeechEncoderDecoderModel.from_pretrained('facebook/s2t-wav2vec2-large-en-de') + + >>> # process dataset + >>> def map_to_array(batch): + >>> speech, _ = sf.read(batch["file"]) + >>> batch["speech"] = speech + >>> return batch + + >>> ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation") + >>> ds = ds.map(map_to_array) + + >>> input_values = processor(ds["speech"][0], return_tensors="pt").input_values # Batch size 1 + >>> decoder_input_ids = torch.tensor([[model.config.decoder.decoder_start_token_id]]) + >>> outputs = model(input_values=input_values, decoder_input_ids=decoder_input_ids) + + >>> # generation + >>> generated = model.generate(input_values) + >>> translation = processor.batch_decode(generated) + + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith("decoder_")} + + kwargs_decoder = { + argument[len("decoder_") :]: value for argument, value in kwargs.items() if argument.startswith("decoder_") + } + + if encoder_outputs is None: + if input_values is not None and input_features is not None: + raise ValueError("You cannot specify both input_values and input_features at the same time") + elif input_values is not None: + inputs = input_values + elif input_features is not None: + inputs = input_features + else: + raise ValueError("You have to specify either input_values or input_features") + + encoder_outputs = self.encoder( + inputs, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + 
**kwargs_encoder, + ) + + encoder_hidden_states = encoder_outputs[0] + + # project encoder_hidden_states + if self.encoder.config.hidden_size != self.decoder.config.hidden_size: + encoder_hidden_states = self.enc_to_dec_proj(encoder_hidden_states) + + # compute correct encoder attention mask + if attention_mask is not None: + encoder_attention_mask = self.encoder._get_feature_vector_attention_mask( + encoder_hidden_states.shape[1], attention_mask + ) + else: + encoder_attention_mask = None + + # Decode + decoder_outputs = self.decoder( + input_ids=decoder_input_ids, + attention_mask=decoder_attention_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + inputs_embeds=decoder_inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + use_cache=use_cache, + past_key_values=past_key_values, + return_dict=return_dict, + **kwargs_decoder, + ) + + if not return_dict: + return decoder_outputs + encoder_outputs + + return Seq2SeqLMOutput( + logits=decoder_outputs.logits, + past_key_values=decoder_outputs.past_key_values, + decoder_hidden_states=decoder_outputs.hidden_states, + decoder_attentions=decoder_outputs.attentions, + cross_attentions=decoder_outputs.cross_attentions, + encoder_last_hidden_state=encoder_hidden_states, + encoder_hidden_states=encoder_outputs.hidden_states, + encoder_attentions=encoder_outputs.attentions, + ) + + def prepare_inputs_for_generation( + self, input_ids, past=None, attention_mask=None, use_cache=None, encoder_outputs=None, **kwargs + ): + decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids, past=past) + decoder_attention_mask = decoder_inputs["attention_mask"] if "attention_mask" in decoder_inputs else None + input_dict = { + "attention_mask": attention_mask, + "decoder_attention_mask": decoder_attention_mask, + "decoder_input_ids": decoder_inputs["input_ids"], + "encoder_outputs": encoder_outputs, + "past_key_values": decoder_inputs["past_key_values"], + "use_cache": use_cache, + } + return input_dict + + def resize_token_embeddings(self, *args, **kwargs): + raise NotImplementedError( + "Resizing the embedding layers via the SpeechEncoderDecoderModel directly is not supported." 
+ "Please use the respective methods of the wrapped decoder object (model.decoder.resize_token_embeddings(...))" + ) + + def _reorder_cache(self, past, beam_idx): + # apply decoder cache reordering here + return self.decoder._reorder_cache(past, beam_idx) diff --git a/src/transformers/models/speech_to_text/modeling_speech_to_text.py b/src/transformers/models/speech_to_text/modeling_speech_to_text.py index 91a12dffdce..491983751b6 100755 --- a/src/transformers/models/speech_to_text/modeling_speech_to_text.py +++ b/src/transformers/models/speech_to_text/modeling_speech_to_text.py @@ -543,26 +543,26 @@ class Speech2TextPreTrainedModel(PreTrainedModel): if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() - def _get_subsampled_output_lengths(self, input_lengths: torch.LongTensor): + def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor): """ Computes the output length of the convolutional layers """ - for i in range(self.config.num_conv_layers): input_lengths = (input_lengths - 1) // 2 + 1 return input_lengths - def _get_subsampled_encoder_attn_mask(self, attention_mask): + def _get_feature_vector_attention_mask(self, feature_vector_length, attention_mask): # generate creates 3D attention mask, because of the shape of input_features # convert it to 2D if thats the case if len(attention_mask.shape) > 2: attention_mask = attention_mask[:, :, -1] - subsampled_lengths = self._get_subsampled_output_lengths(attention_mask.sum(-1)) - max_len = subsampled_lengths.max().item() + subsampled_lengths = self._get_feat_extract_output_lengths(attention_mask.sum(-1)) bsz = attention_mask.size()[0] - attention_mask = torch.zeros((bsz, max_len), dtype=attention_mask.dtype, device=attention_mask.device) + attention_mask = torch.zeros( + (bsz, feature_vector_length), dtype=attention_mask.dtype, device=attention_mask.device + ) # these two operations makes sure that all values # before the output lengths indices are attended to @@ -758,16 +758,16 @@ class Speech2TextEncoder(Speech2TextPreTrainedModel): output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states ) return_dict = return_dict if return_dict is not None else self.config.use_return_dict - if attention_mask is not None: - attention_mask = self._get_subsampled_encoder_attn_mask(attention_mask) - inputs_embeds = self.conv(input_features) inputs_embeds = self.embed_scale * inputs_embeds - if attention_mask is None: - padding_mask = torch.zeros_like(inputs_embeds, dtype=torch.long) - else: + # subsample attention mask if necessary + if attention_mask is not None: + attention_mask = self._get_feature_vector_attention_mask(inputs_embeds.shape[1], attention_mask) padding_mask = attention_mask.ne(1).long() + else: + padding_mask = torch.zeros_like(inputs_embeds, dtype=torch.long) + embed_pos = self.embed_positions(padding_mask) hidden_states = inputs_embeds + embed_pos @@ -857,7 +857,9 @@ class Speech2TextDecoder(Speech2TextPreTrainedModel): config.d_model, self.padding_idx, ) + self.layers = nn.ModuleList([Speech2TextDecoderLayer(config) for _ in range(config.decoder_layers)]) + self.layer_norm = nn.LayerNorm(config.d_model) self.init_weights() @@ -999,7 +1001,6 @@ class Speech2TextDecoder(Speech2TextPreTrainedModel): # expand encoder attention mask if encoder_hidden_states is not None and encoder_attention_mask is not None: - encoder_attention_mask = self._get_subsampled_encoder_attn_mask(encoder_attention_mask) # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] 
encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) @@ -1176,12 +1177,20 @@ class Speech2TextModel(Speech2TextPreTrainedModel): attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None, ) + # downsample encoder attention mask + if attention_mask is not None: + encoder_attention_mask = self._get_feature_vector_attention_mask( + encoder_outputs[0].shape[1], attention_mask + ) + else: + encoder_attention_mask = None + # decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn) decoder_outputs = self.decoder( input_ids=decoder_input_ids, attention_mask=decoder_attention_mask, encoder_hidden_states=encoder_outputs[0], - encoder_attention_mask=attention_mask, + encoder_attention_mask=encoder_attention_mask, head_mask=decoder_head_mask, cross_attn_head_mask=cross_attn_head_mask, past_key_values=past_key_values, diff --git a/src/transformers/models/speech_to_text_2/__init__.py b/src/transformers/models/speech_to_text_2/__init__.py new file mode 100644 index 00000000000..08ed5957208 --- /dev/null +++ b/src/transformers/models/speech_to_text_2/__init__.py @@ -0,0 +1,56 @@ +# flake8: noqa +# There's no way to ignore "F401 '...' imported but unused" warnings in this +# module, but to preserve other warnings. So, don't check this module at all. + +# Copyright 2021 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from ...file_utils import _LazyModule, is_sentencepiece_available, is_speech_available, is_torch_available + + +_import_structure = { + "configuration_speech_to_text_2": [ + "SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Speech2Text2Config", + ], + "processing_speech_to_text_2": ["Speech2Text2Processor"], + "tokenization_speech_to_text_2": ["Speech2Text2Tokenizer"], +} + + +if is_torch_available(): + _import_structure["modeling_speech_to_text_2"] = [ + "SPEECH_TO_TEXT_2_PRETRAINED_MODEL_ARCHIVE_LIST", + "Speech2Text2ForCausalLM", + "Speech2Text2PreTrainedModel", + ] + + +if TYPE_CHECKING: + from .configuration_speech_to_text_2 import SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP, Speech2Text2Config + from .processing_speech_to_text_2 import Speech2Text2Processor + from .tokenization_speech_to_text_2 import Speech2Text2Tokenizer + + if is_torch_available(): + from .modeling_speech_to_text_2 import ( + SPEECH_TO_TEXT_2_PRETRAINED_MODEL_ARCHIVE_LIST, + Speech2Text2ForCausalLM, + Speech2Text2PreTrainedModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure) diff --git a/src/transformers/models/speech_to_text_2/configuration_speech_to_text_2.py b/src/transformers/models/speech_to_text_2/configuration_speech_to_text_2.py new file mode 100644 index 00000000000..a26d1643c67 --- /dev/null +++ b/src/transformers/models/speech_to_text_2/configuration_speech_to_text_2.py @@ -0,0 +1,151 @@ +# coding=utf-8 +# Copyright 2021 The HuggingFace Inc. team. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Speech2Text model configuration """ + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "facebook/s2t-small-librispeech-asr": "https://huggingface.co/facebook/s2t-wav2vec2-large-en-de/resolve/main/config.json", + # See all Speech2Text models at https://huggingface.co/models?filter=speech2text2 +} + + +class Speech2Text2Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a :class:`~transformers.Speech2Text2ForCausalLM`. It + is used to instantiate an Speech2Text2 model according to the specified arguments, defining the model architecture. + Instantiating a configuration with the defaults will yield a similar configuration to that of the Speech2Text2 + `facebook/s2t-small-librispeech-asr `__ architecture. + + Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model + outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + + + Args: + vocab_size (:obj:`int`, `optional`, defaults to 50265): + Vocabulary size of the Speech2Text model. Defines the number of different tokens that can be represented by + the :obj:`inputs_ids` passed when calling :class:`~transformers.Speech2TextModel` + d_model (:obj:`int`, `optional`, defaults to 1024): + Dimensionality of the layers and the pooler layer. + decoder_layers (:obj:`int`, `optional`, defaults to 12): + Number of decoder layers. + decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): + Number of attention heads for each attention layer in the Transformer decoder. + decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): + Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. + activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): + The non-linear activation function (function or string) in the pooler. If string, :obj:`"gelu"`, + :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported. + dropout (:obj:`float`, `optional`, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, and pooler. + attention_dropout (:obj:`float`, `optional`, defaults to 0.0): + The dropout ratio for the attention probabilities. + activation_dropout (:obj:`float`, `optional`, defaults to 0.0): + The dropout ratio for activations inside the fully connected layer. + classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): + The dropout ratio for classifier. + init_std (:obj:`float`, `optional`, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + https://arxiv.org/abs/1909.11556>`__ for more details. + decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): + The LayerDrop probability for the decoder. See the `LayerDrop paper `__ for more details. 
+ use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not the model should return the last key/values attentions (not used by all models). + max_source_positions (:obj:`int`, `optional`, defaults to 6000): + The maximum sequence length of log-mel filter-bank features that this model might ever be used with. + max_target_positions: (:obj:`int`, `optional`, defaults to 1024): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + + Example:: + + >>> from transformers import Speech2Text2ForCausalLM, Speech2Text2Config + + >>> # Initializing a Speech2Text2 s2t_transformer_s style configuration + >>> configuration = Speech2Text2Config() + + >>> # Initializing a model from the s2t_transformer_s style configuration + >>> model = Speech2Text2ForCausalLM(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + """ + model_type = "speech_to_text_2" + keys_to_ignore_at_inference = ["past_key_values"] + + def __init__( + self, + vocab_size=10000, + decoder_layers=6, + decoder_ffn_dim=2048, + decoder_attention_heads=4, + decoder_layerdrop=0.0, + use_cache=True, + activation_function="relu", + d_model=256, + dropout=0.1, + attention_dropout=0.0, + activation_dropout=0.0, + init_std=0.02, + decoder_start_token_id=2, + classifier_dropout=0.0, + scale_embedding=True, + gradient_checkpointing=False, + pad_token_id=1, + bos_token_id=0, + eos_token_id=2, + max_source_positions=6000, + max_target_positions=1024, + **kwargs + ): + super().__init__( + pad_token_id=pad_token_id, + bos_token_id=bos_token_id, + eos_token_id=eos_token_id, + decoder_start_token_id=decoder_start_token_id, + **kwargs, + ) + + self.vocab_size = vocab_size + self.d_model = d_model + self.decoder_ffn_dim = decoder_ffn_dim + self.decoder_layers = decoder_layers + self.decoder_attention_heads = decoder_attention_heads + self.dropout = dropout + self.attention_dropout = attention_dropout + self.activation_dropout = activation_dropout + self.activation_function = activation_function + self.init_std = init_std + self.decoder_layerdrop = decoder_layerdrop + self.classifier_dropout = classifier_dropout + self.use_cache = use_cache + self.num_hidden_layers = decoder_layers + self.gradient_checkpointing = gradient_checkpointing + self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True + self.max_source_positions = max_source_positions + self.max_target_positions = max_target_positions + + @property + def num_attention_heads(self) -> int: + return self.decoder_attention_heads + + @property + def hidden_size(self) -> int: + return self.d_model diff --git a/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py b/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py new file mode 100755 index 00000000000..35bd8f30881 --- /dev/null +++ b/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py @@ -0,0 +1,923 @@ +# coding=utf-8 +# Copyright 2021 The Fairseq Authors and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch Speech2Text2 model. """ + + +import copy +import math +import random +from typing import Optional, Tuple + +import torch +from torch import nn +from torch.nn import CrossEntropyLoss + +from ...activations import ACT2FN +from ...file_utils import add_start_docstrings, replace_return_docstrings +from ...modeling_outputs import BaseModelOutputWithPastAndCrossAttentions, CausalLMOutputWithCrossAttentions +from ...modeling_utils import PreTrainedModel +from ...utils import logging +from .configuration_speech_to_text_2 import Speech2Text2Config + + +logger = logging.get_logger(__name__) + +_CONFIG_FOR_DOC = "Speech2Text2Config" +_TOKENIZER_FOR_DOC = "Speech2Text2Tokenizer" +_CHECKPOINT_FOR_DOC = "facebook/s2t-small-librispeech-asr" + + +SPEECH_TO_TEXT_2_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "facebook/s2t-small-librispeech-asr", + # See all Speech2Text2 models at https://huggingface.co/models?filter=speech_to_text +] + + +# Copied from transformers.models.bart.modeling_bart._make_causal_mask +def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): + """ + Make causal mask used for bi-directional self-attention. + """ + bsz, tgt_len = input_ids_shape + mask = torch.full((tgt_len, tgt_len), float("-inf")) + mask_cond = torch.arange(mask.size(-1)) + mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) + mask = mask.to(dtype) + + if past_key_values_length > 0: + mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask], dim=-1) + return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) + + +# Copied from transformers.models.bart.modeling_bart._expand_mask +def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): + """ + Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
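+
+    For example, a padding mask ``[[1, 1, 0]]`` is expanded to shape ``[1, 1, tgt_len, 3]``: the two attended
+    positions hold ``0.0`` while the masked position holds the most negative value representable in ``dtype``,
+    so adding the result to the attention scores removes the masked position from the softmax.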
+ """ + bsz, src_len = mask.size() + tgt_len = tgt_len if tgt_len is not None else src_len + + expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) + + inverted_mask = 1.0 - expanded_mask + + return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min) + + +# Copied from transformers.models.speech_to_text.modeling_speech_to_text.Speech2TextSinusoidalPositionalEmbedding with Speech2Text->Speech2Text2 +class Speech2Text2SinusoidalPositionalEmbedding(nn.Module): + """This module produces sinusoidal positional embeddings of any length.""" + + def __init__(self, num_positions: int, embedding_dim: int, padding_idx: Optional[int] = None): + super().__init__() + self.offset = 2 + self.embedding_dim = embedding_dim + self.padding_idx = padding_idx + self.make_weights(num_positions + self.offset, embedding_dim, padding_idx) + + def make_weights(self, num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None): + emb_weights = self.get_embedding(num_embeddings, embedding_dim, padding_idx) + if hasattr(self, "weights"): + # in forward, put the weights on correct device + emb_weights = emb_weights.to(self.weights.device) + + self.weights = nn.Parameter(emb_weights) + self.weights.requires_grad = False + self.weights.detach_() + + @staticmethod + def get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None): + """ + Build sinusoidal embeddings. This matches the implementation in tensor2tensor, but differs slightly from the + description in Section 3.5 of "Attention Is All You Need". + """ + half_dim = embedding_dim // 2 + emb = math.log(10000) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb) + emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0) + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1) + if embedding_dim % 2 == 1: + # zero pad + emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1) + if padding_idx is not None: + emb[padding_idx, :] = 0 + return emb + + @torch.no_grad() + def forward(self, input_ids: torch.Tensor, past_key_values_length: int = 0): + bsz, seq_len = input_ids.size() + # Create the position ids from the input token ids. Any padded tokens remain padded. + position_ids = self.create_position_ids_from_input_ids(input_ids, self.padding_idx, past_key_values_length).to( + input_ids.device + ) + + # expand embeddings if needed + max_pos = self.padding_idx + 1 + seq_len + if max_pos > self.weights.size(0): + self.make_weights(max_pos + self.offset, self.embedding_dim, self.padding_idx) + + return self.weights.index_select(0, position_ids.view(-1)).view(bsz, seq_len, -1).detach() + + def create_position_ids_from_input_ids( + self, input_ids: torch.Tensor, padding_idx: int, past_key_values_length: Optional[int] = 0 + ): + """ + Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding + symbols are ignored. This is modified from fairseq's `utils.make_positions`. + + Args: + x: torch.Tensor x: + Returns: torch.Tensor + """ + # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA. 
+ mask = input_ids.ne(padding_idx).int() + incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask + return incremental_indices.long() + padding_idx + + +# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->Speech2Text2 +class Speech2Text2Attention(nn.Module): + """Multi-headed attention from 'Attention Is All You Need' paper""" + + def __init__( + self, + embed_dim: int, + num_heads: int, + dropout: float = 0.0, + is_decoder: bool = False, + bias: bool = True, + ): + super().__init__() + self.embed_dim = embed_dim + self.num_heads = num_heads + self.dropout = dropout + self.head_dim = embed_dim // num_heads + assert ( + self.head_dim * num_heads == self.embed_dim + ), f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`: {num_heads})." + self.scaling = self.head_dim ** -0.5 + self.is_decoder = is_decoder + + self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() + + def forward( + self, + hidden_states: torch.Tensor, + key_value_states: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + """Input shape: Batch x Time x Channel""" + + # if key_value_states are provided this layer is used as a cross-attention layer + # for the decoder + is_cross_attention = key_value_states is not None + bsz, tgt_len, embed_dim = hidden_states.size() + + # get query proj + query_states = self.q_proj(hidden_states) * self.scaling + # get key, value proj + if is_cross_attention and past_key_value is not None: + # reuse k,v, cross_attentions + key_states = past_key_value[0] + value_states = past_key_value[1] + elif is_cross_attention: + # cross_attentions + key_states = self._shape(self.k_proj(key_value_states), -1, bsz) + value_states = self._shape(self.v_proj(key_value_states), -1, bsz) + elif past_key_value is not None: + # reuse k, v, self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + key_states = torch.cat([past_key_value[0], key_states], dim=2) + value_states = torch.cat([past_key_value[1], value_states], dim=2) + else: + # self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. 
Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_states, value_states) + + proj_shape = (bsz * self.num_heads, -1, self.head_dim) + query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) + key_states = key_states.view(*proj_shape) + value_states = value_states.view(*proj_shape) + + src_len = key_states.size(1) + attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) + + if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len): + raise ValueError( + f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is {attn_weights.size()}" + ) + + if attention_mask is not None: + if attention_mask.size() != (bsz, 1, tgt_len, src_len): + raise ValueError( + f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}" + ) + attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len) + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if layer_head_mask is not None: + if layer_head_mask.size() != (self.num_heads,): + raise ValueError( + f"Head mask for a single layer should be of size {(self.num_heads,)}, but is {layer_head_mask.size()}" + ) + attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len) + + if output_attentions: + # this operation is a bit awkward, but it's required to + # make sure that attn_weights keeps its gradient. 
+ # In order to do so, attn_weights have to be reshaped + # twice and have to be reused in the following + attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len) + else: + attn_weights_reshaped = None + + attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training) + + attn_output = torch.bmm(attn_probs, value_states) + + if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): + raise ValueError( + f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is {attn_output.size()}" + ) + + attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) + attn_output = attn_output.transpose(1, 2) + attn_output = attn_output.reshape(bsz, tgt_len, embed_dim) + + attn_output = self.out_proj(attn_output) + + return attn_output, attn_weights_reshaped, past_key_value + + +class Speech2Text2DecoderLayer(nn.Module): + def __init__(self, config: Speech2Text2Config): + super().__init__() + self.embed_dim = config.d_model + + self.self_attn = Speech2Text2Attention( + embed_dim=self.embed_dim, + num_heads=config.decoder_attention_heads, + dropout=config.attention_dropout, + is_decoder=True, + ) + self.dropout = config.dropout + self.activation_fn = ACT2FN[config.activation_function] + self.activation_dropout = config.activation_dropout + + self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) + + if config.is_decoder: + self.encoder_attn = Speech2Text2Attention( + self.embed_dim, + config.decoder_attention_heads, + dropout=config.attention_dropout, + is_decoder=True, + ) + self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim) + + self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) + self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim) + self.final_layer_norm = nn.LayerNorm(self.embed_dim) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + cross_attn_layer_head_mask: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + output_attentions: Optional[bool] = False, + use_cache: Optional[bool] = True, + ): + """ + Args: + hidden_states (:obj:`torch.FloatTensor`): input to the layer of shape :obj:`(seq_len, batch, embed_dim)` + attention_mask (:obj:`torch.FloatTensor`): attention mask of size + :obj:`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + encoder_hidden_states (:obj:`torch.FloatTensor`): cross attention input to the layer of shape :obj:`(seq_len, batch, embed_dim)` + encoder_attention_mask (:obj:`torch.FloatTensor`): encoder attention mask of size + :obj:`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + layer_head_mask (:obj:`torch.FloatTensor`): mask for attention heads in a given layer of size + :obj:`(encoder_attention_heads,)`. + cross_attn_layer_head_mask (:obj:`torch.FloatTensor`): mask for cross-attention heads in a given layer of + size `(decoder_attention_heads,)`. + past_key_value (:obj:`Tuple(torch.FloatTensor)`): cached past key and value projection states + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. 
See ``attentions`` under + returned tensors for more detail. + """ + residual = hidden_states + + # Self Attention + # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None + # add present self-attn cache to positions 1,2 of present_key_value tuple + hidden_states, self_attn_weights, present_key_value = self.self_attn( + hidden_states=hidden_states, + past_key_value=self_attn_past_key_value, + attention_mask=attention_mask, + layer_head_mask=layer_head_mask, + output_attentions=output_attentions, + ) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + hidden_states = self.self_attn_layer_norm(hidden_states) + + # Cross-Attention Block + cross_attn_present_key_value = None + cross_attn_weights = None + if encoder_hidden_states is not None: + residual = hidden_states + + # cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple + cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None + hidden_states, cross_attn_weights, cross_attn_present_key_value = self.encoder_attn( + hidden_states=hidden_states, + key_value_states=encoder_hidden_states, + attention_mask=encoder_attention_mask, + layer_head_mask=cross_attn_layer_head_mask, + past_key_value=cross_attn_past_key_value, + output_attentions=output_attentions, + ) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + hidden_states = self.encoder_attn_layer_norm(hidden_states) + + # add cross-attn to positions 3,4 of present_key_value tuple + present_key_value = present_key_value + cross_attn_present_key_value + + # Fully Connected + residual = hidden_states + hidden_states = self.activation_fn(self.fc1(hidden_states)) + hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training) + hidden_states = self.fc2(hidden_states) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + hidden_states = self.final_layer_norm(hidden_states) + + outputs = (hidden_states,) + + if output_attentions: + outputs += (self_attn_weights, cross_attn_weights) + + if use_cache: + outputs += (present_key_value,) + + return outputs + + +class Speech2Text2PreTrainedModel(PreTrainedModel): + config_class = Speech2Text2Config + base_model_prefix = "model" + + def _init_weights(self, module): + std = self.config.init_std + if isinstance(module, (nn.Linear, nn.Conv1d)): + module.weight.data.normal_(mean=0.0, std=std) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=std) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + + +SPEECH_TO_TEXT_2_START_DOCSTRING = r""" + This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic + methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, + pruning heads etc.) + + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. 
+ + Parameters: + config (:class:`~transformers.Speech2Text2Config`): + Model configuration class with all the parameters of the model. Initializing with a config file does not + load the weights associated with the model, only the configuration. Check out the + :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + + +class Speech2Text2Decoder(Speech2Text2PreTrainedModel): + """ + Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a :class:`Speech2Text2DecoderLayer` + + Args: + config: Speech2Text2Config + embed_tokens (nn.Embedding): output embedding + """ + + def __init__(self, config: Speech2Text2Config): + super().__init__(config) + self.dropout = config.dropout + self.layerdrop = config.decoder_layerdrop + self.padding_idx = config.pad_token_id + self.max_target_positions = config.max_target_positions + self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0 + + self.embed_tokens = nn.Embedding(config.vocab_size, config.d_model, self.padding_idx) + + self.embed_positions = Speech2Text2SinusoidalPositionalEmbedding( + self.max_target_positions, + config.d_model, + self.padding_idx, + ) + + self.layers = nn.ModuleList([Speech2Text2DecoderLayer(config) for _ in range(config.decoder_layers)]) + + self.init_weights() + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, value): + self.embed_tokens = value + + def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): + # create causal mask + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + combined_attention_mask = None + if input_shape[-1] > 1: + combined_attention_mask = _make_causal_mask( + input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length + ).to(self.device) + + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) + combined_attention_mask = ( + expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask + ) + + return combined_attention_mask + + def forward( + self, + input_ids=None, + attention_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + head_mask=None, + cross_attn_head_mask=None, + past_key_values=None, + inputs_embeds=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you + provide it. + + Indices can be obtained using :class:`~transformers.Speech2Text2Tokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` + for details. + + `What are input IDs? <../glossary.html#input-ids>`__ + attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? 
<../glossary.html#attention-mask>`__
+            encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, encoder_sequence_length, hidden_size)`, `optional`):
+                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
+                of the decoder.
+            encoder_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, encoder_sequence_length)`, `optional`):
+                Mask to avoid performing cross-attention on padding token indices of encoder input_ids. Mask values
+                selected in ``[0, 1]``:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+                `What are attention masks? <../glossary.html#attention-mask>`__
+            head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
+                cross-attention on hidden heads. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            past_key_values (:obj:`tuple(tuple(torch.FloatTensor))`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
+                Tuple of :obj:`tuple(torch.FloatTensor)` of length :obj:`config.n_layers`, with each tuple having 2
+                tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional
+                tensors of shape :obj:`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
+
+                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
+                cross-attention blocks) that can be used (see :obj:`past_key_values` input) to speed up sequential
+                decoding.
+
+                If :obj:`past_key_values` are used, the user can optionally input only the last
+                :obj:`decoder_input_ids` (those that don't have their past key value states given to this model) of
+                shape :obj:`(batch_size, 1)` instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size,
+                sequence_length)`.
+            inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
+                representation. This is useful if you want more control over how to convert :obj:`input_ids` indices
+                into associated vectors than the model's internal embedding lookup matrix.
+            output_attentions (:obj:`bool`, `optional`):
+                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
+                returned tensors for more detail.
+            output_hidden_states (:obj:`bool`, `optional`):
+                Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
+                for more detail.
+            return_dict (:obj:`bool`, `optional`):
+                Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        """
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # retrieve input_ids and inputs_embeds
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
+        elif input_ids is not None:
+            input_shape = input_ids.size()
+            input_ids = input_ids.view(-1, input_shape[-1])
+        elif inputs_embeds is not None:
+            input_shape = inputs_embeds.size()[:-1]
+        else:
+            raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
+
+        # past_key_values_length
+        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
+
+        attention_mask = self._prepare_decoder_attention_mask(
+            attention_mask, input_shape, inputs_embeds, past_key_values_length
+        )
+
+        # expand encoder attention mask
+        if encoder_hidden_states is not None and encoder_attention_mask is not None:
+            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+            encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1])
+
+        # embed positions
+        positions = self.embed_positions(input_ids, past_key_values_length=past_key_values_length)
+
+        hidden_states = inputs_embeds + positions
+        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None
+        next_decoder_cache = () if use_cache else None
+
+        # check if head_mask/cross_attn_head_mask has a correct number of layers specified if desired
+        for attn_mask, mask_name in zip([head_mask, cross_attn_head_mask], ["head_mask", "cross_attn_head_mask"]):
+            if attn_mask is not None:
+                if attn_mask.size()[0] != len(self.layers):
+                    raise ValueError(
+                        f"The `{mask_name}` should be specified for {len(self.layers)} layers, but it is for {attn_mask.size()[0]}."
+                    )
+
+        for idx, decoder_layer in enumerate(self.layers):
+            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+            dropout_probability = random.uniform(0, 1)
+            if self.training and (dropout_probability < self.layerdrop):
+                continue
+
+            past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+            if getattr(self.config, "gradient_checkpointing", False) and self.training:
+
+                if use_cache:
+                    logger.warning(
+                        "`use_cache = True` is incompatible with `config.gradient_checkpointing = True`. Setting `use_cache = False`..."
+ ) + use_cache = False + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, output_attentions, use_cache) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(decoder_layer), + hidden_states, + attention_mask, + encoder_hidden_states, + encoder_attention_mask, + head_mask[idx] if head_mask is not None else None, + cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None, + None, + ) + else: + + layer_outputs = decoder_layer( + hidden_states, + attention_mask=attention_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + layer_head_mask=(head_mask[idx] if head_mask is not None else None), + cross_attn_layer_head_mask=( + cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None + ), + past_key_value=past_key_value, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = layer_outputs[0] + + if use_cache: + next_decoder_cache += (layer_outputs[3 if output_attentions else 1],) + + if output_attentions: + all_self_attns += (layer_outputs[1],) + + if encoder_hidden_states is not None: + all_cross_attentions += (layer_outputs[2],) + + # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + + next_cache = next_decoder_cache if use_cache else None + if not return_dict: + return tuple( + v + for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_cross_attentions] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_cache, + hidden_states=all_hidden_states, + attentions=all_self_attns, + cross_attentions=all_cross_attentions, + ) + + +@add_start_docstrings( + "The Speech2Text2 Model with a language modeling head. Can be used for summarization.", + SPEECH_TO_TEXT_2_START_DOCSTRING, +) +class Speech2Text2DecoderWrapper(Speech2Text2PreTrainedModel): + """ + This wrapper class is a helper class to correctly load pretrained checkpoints when the causal language model is + used in combination with the :class:`~transformers.EncoderDecoderModel` framework. + """ + + def __init__(self, config): + super().__init__(config) + self.decoder = Speech2Text2Decoder(config) + + def forward(self, *args, **kwargs): + return self.decoder(*args, **kwargs) + + +@add_start_docstrings( + "The Speech2Text2 Decoder with a language modeling head. 
Can be used as the decoder part of :class:`~transformers.EncoderDecoderModel` and :class:`~transformers.SpeechEncoderDecoderModel`.",
+    SPEECH_TO_TEXT_2_START_DOCSTRING,
+)
+class Speech2Text2ForCausalLM(Speech2Text2PreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        config = copy.deepcopy(config)
+        config.is_decoder = True
+        config.is_encoder_decoder = False
+        self.model = Speech2Text2DecoderWrapper(config)
+
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        self.init_weights()
+
+    def get_input_embeddings(self):
+        return self.model.decoder.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.decoder.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model.decoder = decoder
+
+    def get_decoder(self):
+        return self.model.decoder
+
+    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        head_mask=None,
+        cross_attn_head_mask=None,
+        past_key_values=None,
+        inputs_embeds=None,
+        labels=None,
+        use_cache=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+    ):
+        r"""
+        Args:
+            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
+                provide it.
+
+                Indices can be obtained using :class:`~transformers.Speech2Text2Tokenizer`. See
+                :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__`
+                for details.
+
+                `What are input IDs? <../glossary.html#input-ids>`__
+            attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+                Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+                `What are attention masks? <../glossary.html#attention-mask>`__
+            encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
+                if the model is configured as a decoder.
+            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is
+                used in the cross-attention if the model is configured as a decoder. Mask values selected in
+                ``[0, 1]``:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+            head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            cross_attn_head_mask (:obj:`torch.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the cross-attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            past_key_values (:obj:`tuple(tuple(torch.FloatTensor))`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
+                Tuple of :obj:`tuple(torch.FloatTensor)` of length :obj:`config.n_layers`, with each tuple having 2
+                tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional
+                tensors of shape :obj:`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. The two
+                additional tensors are only required when the model is used as a decoder in a Sequence to Sequence
+                model.
+
+                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
+                cross-attention blocks) that can be used (see :obj:`past_key_values` input) to speed up sequential
+                decoding.
+
+                If :obj:`past_key_values` are used, the user can optionally input only the last ``decoder_input_ids``
+                (those that don't have their past key value states given to this model) of shape :obj:`(batch_size,
+                1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
+            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+                Labels for computing the language modeling loss. Indices should either be in ``[0, ...,
+                config.vocab_size]`` or -100 (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are
+                ignored (masked), the loss is only computed for the tokens with labels in ``[0, ...,
+                config.vocab_size]``.
+            use_cache (:obj:`bool`, `optional`):
+                If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed
+                up decoding (see :obj:`past_key_values`).
+            output_attentions (:obj:`bool`, `optional`):
+                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
+                returned tensors for more detail.
+            output_hidden_states (:obj:`bool`, `optional`):
+                Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
+                for more detail.
+            return_dict (:obj:`bool`, `optional`):
+                Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+
+        Returns:
+
+        Example::
+
+            >>> from transformers import SpeechEncoderDecoderModel, Speech2Text2ForCausalLM, Wav2Vec2Model, Speech2Text2Config, Wav2Vec2Config
+
+            >>> encoder = Wav2Vec2Model(Wav2Vec2Config())
+            >>> decoder = Speech2Text2ForCausalLM(Speech2Text2Config())
+
+            >>> # init speech2text model
+            >>> model = SpeechEncoderDecoderModel(encoder=encoder, decoder=decoder)
+        """
+
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model.decoder(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_attention_mask,
+            head_mask=head_mask,
+            cross_attn_head_mask=cross_attn_head_mask,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        logits = self.lm_head(outputs[0])
+
+        loss = None
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithCrossAttentions(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+            cross_attentions=outputs.cross_attentions,
+        )
+
+    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, use_cache=None, **kwargs):
+        # if the model is used as a decoder in an encoder-decoder model, the decoder attention mask is created on the fly
+        if attention_mask is None:
+            attention_mask = input_ids.new_ones(input_ids.shape)
+
+        if past:
+            input_ids = input_ids[:, -1:]
+        # first step, decoder_cached_states are empty
+        return {
+            "input_ids": input_ids,  # encoder_outputs is defined. input_ids not needed
+            "attention_mask": attention_mask,
+            "past_key_values": past,
+            "use_cache": use_cache,
+        }
+
+    @staticmethod
+    def _reorder_cache(past, beam_idx):
+        reordered_past = ()
+        for layer_past in past:
+            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+        return reordered_past
diff --git a/src/transformers/models/speech_to_text_2/processing_speech_to_text_2.py b/src/transformers/models/speech_to_text_2/processing_speech_to_text_2.py
new file mode 100644
index 00000000000..b1e9b3692d6
--- /dev/null
+++ b/src/transformers/models/speech_to_text_2/processing_speech_to_text_2.py
@@ -0,0 +1,144 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
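+# Usage sketch (illustrative only): the processor pairs a feature extractor for
+# audio inputs with a tokenizer for decoding generated ids. Note that the
+# tokenizer shipped here cannot encode text yet (see `_tokenize` in
+# tokenization_speech_to_text_2.py), so it is used below only for decoding.
+# `raw_speech` is assumed to be a 1D float array sampled at 16 kHz, and `model`
+# a :class:`~transformers.SpeechEncoderDecoderModel` loaded from the same
+# checkpoint.
+#
+#     from transformers import Speech2Text2Processor
+#
+#     processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
+#     inputs = processor(raw_speech, sampling_rate=16_000, return_tensors="pt")
+#     generated_ids = model.generate(inputs["input_values"])
+#     transcription = processor.batch_decode(generated_ids)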
+"""
+Speech processor class for Speech2Text2
+"""
+from contextlib import contextmanager
+
+from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ..auto.feature_extraction_auto import AutoFeatureExtractor
+from .tokenization_speech_to_text_2 import Speech2Text2Tokenizer
+
+
+class Speech2Text2Processor:
+    r"""
+    Constructs a Speech2Text2 processor which wraps a Speech2Text2 feature extractor and a Speech2Text2 tokenizer into
+    a single processor.
+
+    :class:`~transformers.Speech2Text2Processor` offers all the functionalities of
+    :class:`~transformers.AutoFeatureExtractor` and :class:`~transformers.Speech2Text2Tokenizer`. See the
+    :meth:`~transformers.Speech2Text2Processor.__call__` and :meth:`~transformers.Speech2Text2Processor.decode` for
+    more information.
+
+    Args:
+        feature_extractor (:obj:`AutoFeatureExtractor`):
+            An instance of :class:`~transformers.AutoFeatureExtractor`. The feature extractor is a required input.
+        tokenizer (:obj:`Speech2Text2Tokenizer`):
+            An instance of :class:`~transformers.Speech2Text2Tokenizer`. The tokenizer is a required input.
+    """
+
+    def __init__(self, feature_extractor, tokenizer):
+        if not isinstance(feature_extractor, SequenceFeatureExtractor):
+            raise ValueError(
+                f"`feature_extractor` has to be of type {SequenceFeatureExtractor}, but is {type(feature_extractor)}"
+            )
+        if not isinstance(tokenizer, Speech2Text2Tokenizer):
+            raise ValueError(
+                f"`tokenizer` has to be of type {Speech2Text2Tokenizer}, but is {type(tokenizer)}"
+            )
+
+        self.feature_extractor = feature_extractor
+        self.tokenizer = tokenizer
+        self.current_processor = self.feature_extractor
+
+    def save_pretrained(self, save_directory):
+        """
+        Save a Speech2Text2 feature extractor object and Speech2Text2 tokenizer object to the directory
+        ``save_directory``, so that it can be re-loaded using the
+        :func:`~transformers.Speech2Text2Processor.from_pretrained` class method.
+
+        .. note::
+
+            This class method is simply calling :meth:`~transformers.PreTrainedFeatureExtractor.save_pretrained` and
+            :meth:`~transformers.tokenization_utils_base.PreTrainedTokenizer.save_pretrained`. Please refer to the
+            docstrings of the methods above for more information.
+
+        Args:
+            save_directory (:obj:`str` or :obj:`os.PathLike`):
+                Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
+                be created if it does not exist).
+        """
+
+        self.feature_extractor.save_pretrained(save_directory)
+        self.tokenizer.save_pretrained(save_directory)
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
+        r"""
+        Instantiate a :class:`~transformers.Speech2Text2Processor` from a pretrained Speech2Text2 processor.
+
+        .. note::
+
+            This class method is simply calling AutoFeatureExtractor's
+            :meth:`~transformers.PreTrainedFeatureExtractor.from_pretrained` and Speech2Text2Tokenizer's
+            :meth:`~transformers.tokenization_utils_base.PreTrainedTokenizer.from_pretrained`. Please refer to the
+            docstrings of the methods above for more information.
+
+        Args:
+            pretrained_model_name_or_path (:obj:`str` or :obj:`os.PathLike`):
+                This can be either:
+
+                - a string, the `model id` of a pretrained feature_extractor hosted inside a model repo on
+                  huggingface.co. Valid model ids can be located at the root-level, like ``bert-base-uncased``, or
+                  namespaced under a user or organization name, like ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing a feature extractor file saved using the
+                  :meth:`~transformers.PreTrainedFeatureExtractor.save_pretrained` method, e.g.,
+                  ``./my_model_directory/``.
+                - a path or url to a saved feature extractor JSON `file`, e.g.,
+                  ``./my_model_directory/preprocessor_config.json``.
+            **kwargs
+                Additional keyword arguments passed along to both :class:`~transformers.PreTrainedFeatureExtractor`
+                and :class:`~transformers.PreTrainedTokenizer`
+        """
+        feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        tokenizer = Speech2Text2Tokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
+
+        return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
+
+    def __call__(self, *args, **kwargs):
+        """
+        When used in normal mode, this method forwards all its arguments to AutoFeatureExtractor's
+        :meth:`~transformers.AutoFeatureExtractor.__call__` and returns its output. If used in the context
+        :meth:`~transformers.Speech2Text2Processor.as_target_processor` this method forwards all its arguments to
+        Speech2Text2Tokenizer's :meth:`~transformers.Speech2Text2Tokenizer.__call__`. Please refer to the docstring
+        of the above two methods for more information.
+        """
+        return self.current_processor(*args, **kwargs)
+
+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to Speech2Text2Tokenizer's
+        :meth:`~transformers.PreTrainedTokenizer.batch_decode`. Please refer to the docstring of this method for more
+        information.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to Speech2Text2Tokenizer's
+        :meth:`~transformers.PreTrainedTokenizer.decode`. Please refer to the docstring of this method for more
+        information.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
+
+    @contextmanager
+    def as_target_processor(self):
+        """
+        Temporarily sets the tokenizer for processing the input. Useful for encoding the labels when fine-tuning
+        Speech2Text2.
+        """
+        self.current_processor = self.tokenizer
+        yield
+        self.current_processor = self.feature_extractor
diff --git a/src/transformers/models/speech_to_text_2/tokenization_speech_to_text_2.py b/src/transformers/models/speech_to_text_2/tokenization_speech_to_text_2.py
new file mode 100644
index 00000000000..8cf6702eae4
--- /dev/null
+++ b/src/transformers/models/speech_to_text_2/tokenization_speech_to_text_2.py
@@ -0,0 +1,131 @@
+# coding=utf-8
+# Copyright 2021 The Facebook Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
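+# Decoding sketch (illustrative; the tokens are a hypothetical German example):
+# the vocabulary uses "@@" as a BPE continuation marker, so
+# `convert_tokens_to_string` below joins tokens on spaces and then removes every
+# "@@ " boundary:
+#
+#     tokens = ["hal@@", "lo", "wel@@", "t"]
+#     " ".join(tokens)                           # -> "hal@@ lo wel@@ t"
+#     "".join("hal@@ lo wel@@ t".split("@@ "))   # -> "hallo welt"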
+"""Tokenization class for Speech2Text2."""
+
+import json
+import os
+from typing import Dict, List, Optional, Tuple
+
+from ...tokenization_utils import PreTrainedTokenizer
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+VOCAB_FILES_NAMES = {
+    "vocab_file": "vocab.json",
+    "tokenizer_config_file": "tokenizer_config.json",
+}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "facebook/s2t-wav2vec2-large-en-de": "https://huggingface.co/facebook/s2t-wav2vec2-large-en-de/resolve/main/vocab.json",
+    },
+    "tokenizer_config_file": {
+        "facebook/s2t-wav2vec2-large-en-de": "https://huggingface.co/facebook/s2t-wav2vec2-large-en-de/resolve/main/tokenizer_config.json",
+    },
+}
+
+# Speech2Text2 has no fixed maximum input length; 1024 matches the positional embedding size of the released checkpoint
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"facebook/s2t-wav2vec2-large-en-de": 1024}
+
+
+class Speech2Text2Tokenizer(PreTrainedTokenizer):
+    """
+    Constructs a Speech2Text2Tokenizer.
+
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains some of the main methods.
+    Users should refer to the superclass for more information regarding such methods.
+
+    Args:
+        vocab_file (:obj:`str`):
+            File containing the vocabulary.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sentence token.
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
+            The end of sentence token.
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
+            this token instead.
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
+            The token used for padding, for example when batching sequences of different lengths.
+
+        **kwargs
+            Additional keyword arguments passed along to :class:`~transformers.PreTrainedTokenizer`
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+
+    def __init__(self, vocab_file, bos_token="<s>", pad_token="<pad>", eos_token="</s>", unk_token="<unk>", **kwargs):
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
+        with open(vocab_file, encoding="utf-8") as vocab_handle:
+            self.encoder = json.load(vocab_handle)
+        self.decoder = {v: k for k, v in self.encoder.items()}
+
+    @property
+    def vocab_size(self) -> int:
+        return len(self.decoder)
+
+    def get_vocab(self) -> Dict:
+        return dict(self.encoder, **self.added_tokens_encoder)
+
+    def _tokenize(self, text, **kwargs):
+        raise NotImplementedError("Tokenization requires a bpe tokenization file, which is currently not available")
+
+    def _convert_token_to_id(self, token: str) -> int:
+        """Converts a token (str) to an index (integer) using the vocab."""
+        return self.encoder.get(token, self.encoder.get(self.unk_token))
+
+    def _convert_id_to_token(self, index: int) -> str:
+        """Converts an index (integer) to a token (str) using the vocab."""
+        result = self.decoder.get(index, self.unk_token)
+        return result
+
+    def convert_tokens_to_string(self, tokens: List[str]) -> str:
+        """
+        Converts a list of output tokens into a single string.
+ """ + # combine tokens + string = " ".join(tokens) + + # make sure @@ tokens are concatenated + string = "".join(string.split("@@ ")) + + return string + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + + with open(vocab_file, "w", encoding="utf-8") as f: + f.write(json.dumps(self.encoder, ensure_ascii=False)) + + return (vocab_file,) diff --git a/src/transformers/models/wav2vec2/configuration_wav2vec2.py b/src/transformers/models/wav2vec2/configuration_wav2vec2.py index 6df4a87064b..d82e6a6d345 100644 --- a/src/transformers/models/wav2vec2/configuration_wav2vec2.py +++ b/src/transformers/models/wav2vec2/configuration_wav2vec2.py @@ -91,7 +91,7 @@ class Wav2Vec2Config(PretrainedConfig): num_conv_pos_embedding_groups (:obj:`int`, `optional`, defaults to 16): Number of groups of 1D convolutional positional embeddings layer. do_stable_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`): - Whether do apply `stable` layer norm architecture of the Transformer encoder. ``do_stable_layer_norm is + Whether to apply `stable` layer norm architecture of the Transformer encoder. ``do_stable_layer_norm is True`` corresponds to applying layer norm before the attention layer, whereas ``do_stable_layer_norm is False`` corresponds to applying layer norm after the attention layer. apply_spec_augment (:obj:`bool`, `optional`, defaults to :obj:`True`): diff --git a/src/transformers/pipelines/__init__.py b/src/transformers/pipelines/__init__.py index ae460028ef9..942e3ec2059 100755 --- a/src/transformers/pipelines/__init__.py +++ b/src/transformers/pipelines/__init__.py @@ -446,8 +446,8 @@ def pipeline( model_config = model.config - load_tokenizer = type(model_config) in TOKENIZER_MAPPING - load_feature_extractor = type(model_config) in FEATURE_EXTRACTOR_MAPPING + load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None + load_feature_extractor = type(model_config) in FEATURE_EXTRACTOR_MAPPING or feature_extractor is not None if load_tokenizer: # Try to infer tokenizer from model or config name (if provided as str) diff --git a/src/transformers/pipelines/automatic_speech_recognition.py b/src/transformers/pipelines/automatic_speech_recognition.py index f3d050238d7..6104b1efb37 100644 --- a/src/transformers/pipelines/automatic_speech_recognition.py +++ b/src/transformers/pipelines/automatic_speech_recognition.py @@ -70,7 +70,7 @@ class AutomaticSpeechRecognitionPipeline(Pipeline): to support multiple audio formats """ - def __init__(self, feature_extractor: "SequenceFeatureExtractor", *args, **kwargs): + def __init__(self, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs): """ Arguments: feature_extractor (:obj:`~transformers.SequenceFeatureExtractor`): @@ -96,6 +96,7 @@ class AutomaticSpeechRecognitionPipeline(Pipeline): model on the associated CUDA device id. 
""" super().__init__(*args, **kwargs) + self.feature_extractor = feature_extractor if self.framework == "tf": @@ -139,9 +140,13 @@ class AutomaticSpeechRecognitionPipeline(Pipeline): processed = self.ensure_tensor_on_device(**processed) name = self.model.__class__.__name__ - if name.endswith("ForConditionalGeneration"): - input_ids = processed["input_features"] - tokens = self.model.generate(input_ids=input_ids) + if name.endswith("ForConditionalGeneration") or name.endswith("EncoderDecoderModel"): + encoder = self.model.get_encoder() + # we need to pass `processed.get("attention_mask")` here since audio encoder + # attention mask length is different from expected text decoder `encoder_attention_mask` length + tokens = self.model.generate( + encoder_outputs=encoder(**processed), attention_mask=processed.get("attention_mask") + ) tokens = tokens.squeeze(0) elif name.endswith("ForCTC"): outputs = self.model(**processed) diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py index b63e3abdaab..b8d94b8b36b 100644 --- a/src/transformers/utils/dummy_pt_objects.py +++ b/src/transformers/utils/dummy_pt_objects.py @@ -3152,6 +3152,15 @@ def load_tf_weights_in_roformer(*args, **kwargs): requires_backends(load_tf_weights_in_roformer, ["torch"]) +class SpeechEncoderDecoderModel: + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + @classmethod + def from_pretrained(cls, *args, **kwargs): + requires_backends(cls, ["torch"]) + + SPEECH_TO_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST = None @@ -3182,6 +3191,24 @@ class Speech2TextPreTrainedModel: requires_backends(cls, ["torch"]) +class Speech2Text2ForCausalLM: + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + @classmethod + def from_pretrained(cls, *args, **kwargs): + requires_backends(cls, ["torch"]) + + +class Speech2Text2PreTrainedModel: + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + @classmethod + def from_pretrained(cls, *args, **kwargs): + requires_backends(cls, ["torch"]) + + SPLINTER_PRETRAINED_MODEL_ARCHIVE_LIST = None diff --git a/tests/test_modeling_speech_encoder_decoder.py b/tests/test_modeling_speech_encoder_decoder.py new file mode 100644 index 00000000000..783d6b7ec0e --- /dev/null +++ b/tests/test_modeling_speech_encoder_decoder.py @@ -0,0 +1,527 @@ +# coding=utf-8 +# Copyright 2021 HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import tempfile +import unittest + +from transformers import is_torch_available +from transformers.testing_utils import require_torch, slow, torch_device + +from .test_modeling_bert import BertModelTester +from .test_modeling_common import ids_tensor +from .test_modeling_speech_to_text import Speech2TextModelTester +from .test_modeling_speech_to_text_2 import Speech2Text2StandaloneDecoderModelTester +from .test_modeling_wav2vec2 import Wav2Vec2ModelTester + + +if is_torch_available(): + import numpy as np + import torch + + from transformers import ( + BertLMHeadModel, + Speech2Text2ForCausalLM, + SpeechEncoderDecoderConfig, + SpeechEncoderDecoderModel, + Wav2Vec2Model, + ) + from transformers.modeling_outputs import BaseModelOutput + from transformers.models.speech_to_text.modeling_speech_to_text import Speech2TextEncoder + + +@require_torch +class EncoderDecoderMixin: + def get_encoder_decoder_model(self, config, decoder_config): + pass + + def prepare_config_and_inputs(self): + pass + + def get_pretrained_model(self): + pass + + def check_encoder_decoder_model_from_pretrained_configs( + self, + config, + attention_mask, + decoder_config, + decoder_input_ids, + decoder_attention_mask, + input_values=None, + input_features=None, + **kwargs + ): + encoder_decoder_config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(config, decoder_config) + self.assertTrue(encoder_decoder_config.decoder.is_decoder) + + enc_dec_model = SpeechEncoderDecoderModel(encoder_decoder_config) + enc_dec_model.to(torch_device) + enc_dec_model.eval() + + self.assertTrue(enc_dec_model.config.is_encoder_decoder) + + outputs_encoder_decoder = enc_dec_model( + input_values=input_values, + input_features=input_features, + decoder_input_ids=decoder_input_ids, + attention_mask=attention_mask, + decoder_attention_mask=decoder_attention_mask, + ) + + self.assertEqual( + outputs_encoder_decoder["logits"].shape, (decoder_input_ids.shape + (decoder_config.vocab_size,)) + ) + + def check_encoder_decoder_model( + self, + config, + attention_mask, + decoder_config, + decoder_input_ids, + decoder_attention_mask, + input_values=None, + input_features=None, + **kwargs + ): + encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config) + enc_dec_model = SpeechEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model) + self.assertTrue(enc_dec_model.config.decoder.is_decoder) + self.assertTrue(enc_dec_model.config.decoder.add_cross_attention) + self.assertTrue(enc_dec_model.config.is_encoder_decoder) + enc_dec_model.to(torch_device) + outputs_encoder_decoder = enc_dec_model( + input_values=input_values, + input_features=input_features, + decoder_input_ids=decoder_input_ids, + attention_mask=attention_mask, + decoder_attention_mask=decoder_attention_mask, + output_hidden_states=True, + ) + self.assertEqual( + outputs_encoder_decoder["logits"].shape, (decoder_input_ids.shape + (decoder_config.vocab_size,)) + ) + encoder_outputs = BaseModelOutput(last_hidden_state=outputs_encoder_decoder.encoder_hidden_states[-1]) + outputs_encoder_decoder = enc_dec_model( + encoder_outputs=encoder_outputs, + decoder_input_ids=decoder_input_ids, + attention_mask=attention_mask, + decoder_attention_mask=decoder_attention_mask, + ) + + self.assertEqual( + outputs_encoder_decoder["logits"].shape, (decoder_input_ids.shape + (decoder_config.vocab_size,)) + ) + + def check_encoder_decoder_model_from_pretrained( + self, + config, + attention_mask, + decoder_config, + decoder_input_ids, + decoder_attention_mask, + 
return_dict,
+        input_values=None,
+        input_features=None,
+        **kwargs
+    ):
+        encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config)
+        kwargs = {"encoder_model": encoder_model, "decoder_model": decoder_model, "return_dict": return_dict}
+        enc_dec_model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(**kwargs)
+        enc_dec_model.to(torch_device)
+        outputs_encoder_decoder = enc_dec_model(
+            input_values=input_values,
+            input_features=input_features,
+            decoder_input_ids=decoder_input_ids,
+            attention_mask=attention_mask,
+            decoder_attention_mask=decoder_attention_mask,
+            output_hidden_states=True,
+            return_dict=True,
+        )
+
+        self.assertEqual(
+            outputs_encoder_decoder["logits"].shape, (decoder_input_ids.shape + (decoder_config.vocab_size,))
+        )
+
+    def check_save_and_load(
+        self,
+        config,
+        attention_mask,
+        decoder_config,
+        decoder_input_ids,
+        decoder_attention_mask,
+        input_values=None,
+        input_features=None,
+        **kwargs
+    ):
+        encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config)
+        enc_dec_model = SpeechEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model)
+        enc_dec_model.to(torch_device)
+        enc_dec_model.eval()
+        with torch.no_grad():
+            outputs = enc_dec_model(
+                input_values=input_values,
+                input_features=input_features,
+                decoder_input_ids=decoder_input_ids,
+                attention_mask=attention_mask,
+                decoder_attention_mask=decoder_attention_mask,
+            )
+            out_2 = outputs[0].cpu().numpy()
+            out_2[np.isnan(out_2)] = 0
+
+            with tempfile.TemporaryDirectory() as tmpdirname:
+                enc_dec_model.save_pretrained(tmpdirname)
+                enc_dec_model = SpeechEncoderDecoderModel.from_pretrained(tmpdirname)
+
+                after_outputs = enc_dec_model(
+                    input_values=input_values,
+                    input_features=input_features,
+                    decoder_input_ids=decoder_input_ids,
+                    attention_mask=attention_mask,
+                    decoder_attention_mask=decoder_attention_mask,
+                )
+                out_1 = after_outputs[0].cpu().numpy()
+                out_1[np.isnan(out_1)] = 0
+                max_diff = np.amax(np.abs(out_1 - out_2))
+                self.assertLessEqual(max_diff, 1e-5)
+
+    def check_save_and_load_encoder_decoder_model(
+        self,
+        config,
+        attention_mask,
+        decoder_config,
+        decoder_input_ids,
+        decoder_attention_mask,
+        input_values=None,
+        input_features=None,
+        **kwargs
+    ):
+        encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config)
+        enc_dec_model = SpeechEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model)
+        enc_dec_model.to(torch_device)
+        enc_dec_model.eval()
+        with torch.no_grad():
+            outputs = enc_dec_model(
+                input_values=input_values,
+                input_features=input_features,
+                decoder_input_ids=decoder_input_ids,
+                attention_mask=attention_mask,
+                decoder_attention_mask=decoder_attention_mask,
+            )
+            out_2 = outputs[0].cpu().numpy()
+            out_2[np.isnan(out_2)] = 0
+
+            with tempfile.TemporaryDirectory() as encoder_tmp_dirname, tempfile.TemporaryDirectory() as decoder_tmp_dirname:
+                enc_dec_model.encoder.save_pretrained(encoder_tmp_dirname)
+                enc_dec_model.decoder.save_pretrained(decoder_tmp_dirname)
+                enc_dec_model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
+                    encoder_pretrained_model_name_or_path=encoder_tmp_dirname,
+                    decoder_pretrained_model_name_or_path=decoder_tmp_dirname,
+                )
+
+                after_outputs = enc_dec_model(
+                    input_values=input_values,
+                    input_features=input_features,
+                    decoder_input_ids=decoder_input_ids,
+                    attention_mask=attention_mask,
+                    decoder_attention_mask=decoder_attention_mask,
+                )
+                out_1 = after_outputs[0].cpu().numpy()
+                out_1[np.isnan(out_1)] = 0
+                max_diff = 
np.amax(np.abs(out_1 - out_2)) + self.assertLessEqual(max_diff, 1e-5) + + def check_encoder_decoder_model_output_attentions( + self, + config, + attention_mask, + decoder_config, + decoder_input_ids, + decoder_attention_mask, + labels=None, + input_values=None, + input_features=None, + **kwargs + ): + # make the decoder inputs a different shape from the encoder inputs to harden the test + decoder_input_ids = decoder_input_ids[:, :-1] + decoder_attention_mask = decoder_attention_mask[:, :-1] + encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config) + enc_dec_model = SpeechEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model) + enc_dec_model.to(torch_device) + outputs_encoder_decoder = enc_dec_model( + input_values=input_values, + input_features=input_features, + decoder_input_ids=decoder_input_ids, + attention_mask=attention_mask, + decoder_attention_mask=decoder_attention_mask, + output_attentions=True, + ) + + inputs = input_values if input_features is None else input_features + + encoder_attentions = outputs_encoder_decoder["encoder_attentions"] + self.assertEqual(len(encoder_attentions), config.num_hidden_layers) + + seq_len = enc_dec_model.encoder._get_feat_extract_output_lengths(inputs.shape[1]) + self.assertEqual(encoder_attentions[0].shape[-3:], (config.num_attention_heads, seq_len, seq_len)) + + decoder_attentions = outputs_encoder_decoder["decoder_attentions"] + num_decoder_layers = ( + decoder_config.num_decoder_layers + if hasattr(decoder_config, "num_decoder_layers") + else decoder_config.num_hidden_layers + ) + self.assertEqual(len(decoder_attentions), num_decoder_layers) + + self.assertEqual( + decoder_attentions[0].shape[-3:], + (decoder_config.num_attention_heads, decoder_input_ids.shape[-1], decoder_input_ids.shape[-1]), + ) + + cross_attentions = outputs_encoder_decoder["cross_attentions"] + self.assertEqual(len(cross_attentions), num_decoder_layers) + + cross_attention_input_seq_len = decoder_input_ids.shape[-1] + self.assertEqual( + cross_attentions[0].shape[-3:], + (decoder_config.num_attention_heads, cross_attention_input_seq_len, seq_len), + ) + + def check_encoder_decoder_model_generate( + self, config, decoder_config, input_values=None, input_features=None, **kwargs + ): + encoder_model, decoder_model = self.get_encoder_decoder_model(config, decoder_config) + enc_dec_model = SpeechEncoderDecoderModel(encoder=encoder_model, decoder=decoder_model) + enc_dec_model.to(torch_device) + + inputs = input_values if input_features is None else input_features + + # Bert does not have a bos token id, so use pad_token_id instead + generated_output = enc_dec_model.generate( + inputs, decoder_start_token_id=enc_dec_model.config.decoder.pad_token_id + ) + self.assertEqual(generated_output.shape, (inputs.shape[0],) + (decoder_config.max_length,)) + + def test_encoder_decoder_model(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_encoder_decoder_model(**input_ids_dict) + + def test_encoder_decoder_model_from_pretrained_configs(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_encoder_decoder_model_from_pretrained_configs(**input_ids_dict) + + def test_encoder_decoder_model_from_pretrained(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_encoder_decoder_model_from_pretrained(**input_ids_dict, return_dict=False) + + def test_encoder_decoder_model_from_pretrained_return_dict(self): + input_ids_dict = self.prepare_config_and_inputs() + 
self.check_encoder_decoder_model_from_pretrained(**input_ids_dict, return_dict=True) + + def test_save_and_load_from_pretrained(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_save_and_load(**input_ids_dict) + + def test_save_and_load_from_encoder_decoder_pretrained(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_save_and_load_encoder_decoder_model(**input_ids_dict) + + def test_encoder_decoder_model_output_attentions(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_encoder_decoder_model_output_attentions(**input_ids_dict) + + def test_encoder_decoder_model_generate(self): + input_ids_dict = self.prepare_config_and_inputs() + self.check_encoder_decoder_model_generate(**input_ids_dict) + + @slow + def test_real_model_save_load_from_pretrained(self): + model_2 = self.get_pretrained_model() + model_2.to(torch_device) + input_name, inputs = self.get_inputs() + decoder_input_ids = ids_tensor([13, 1], model_2.config.encoder.vocab_size) + attention_mask = ids_tensor([13, 5], vocab_size=2) + with torch.no_grad(): + outputs = model_2( + **{input_name: inputs}, + decoder_input_ids=decoder_input_ids, + attention_mask=attention_mask, + ) + out_2 = outputs[0].cpu().numpy() + out_2[np.isnan(out_2)] = 0 + + with tempfile.TemporaryDirectory() as tmp_dirname: + model_2.save_pretrained(tmp_dirname) + model_1 = SpeechEncoderDecoderModel.from_pretrained(tmp_dirname) + model_1.to(torch_device) + + after_outputs = model_1( + **{input_name: inputs}, + decoder_input_ids=decoder_input_ids, + attention_mask=attention_mask, + ) + out_1 = after_outputs[0].cpu().numpy() + out_1[np.isnan(out_1)] = 0 + max_diff = np.amax(np.abs(out_1 - out_2)) + self.assertLessEqual(max_diff, 1e-5) + + +@require_torch +class Wav2Vec2BertModelTest(EncoderDecoderMixin, unittest.TestCase): + def get_pretrained_model(self): + return SpeechEncoderDecoderModel.from_encoder_decoder_pretrained( + "facebook/wav2vec2-base-960h", "bert-base-cased" + ) + + def get_encoder_decoder_model(self, config, decoder_config): + encoder_model = Wav2Vec2Model(config) + decoder_model = BertLMHeadModel(decoder_config) + return encoder_model, decoder_model + + def prepare_config_and_inputs(self): + bert_model_tester = BertModelTester(self) + wav2vec2_model_tester = Wav2Vec2ModelTester(self) + encoder_config_and_inputs = wav2vec2_model_tester.prepare_config_and_inputs() + decoder_config_and_inputs = bert_model_tester.prepare_config_and_inputs_for_decoder() + ( + config, + input_values, + input_mask, + ) = encoder_config_and_inputs + ( + decoder_config, + decoder_input_ids, + decoder_token_type_ids, + decoder_input_mask, + decoder_sequence_labels, + decoder_token_labels, + decoder_choice_labels, + encoder_attention_mask, + _, + ) = decoder_config_and_inputs + + # make sure that cross attention layers are added + decoder_config.add_cross_attention = True + return { + "config": config, + "input_values": input_values, + "attention_mask": input_mask, + "decoder_config": decoder_config, + "decoder_input_ids": decoder_input_ids, + "decoder_token_type_ids": decoder_token_type_ids, + "decoder_attention_mask": decoder_input_mask, + "decoder_sequence_labels": decoder_sequence_labels, + "decoder_token_labels": decoder_token_labels, + "decoder_choice_labels": decoder_choice_labels, + "labels": decoder_token_labels, + } + + +@require_torch +class Speech2TextBertModelTest(EncoderDecoderMixin, unittest.TestCase): + def get_pretrained_model(self): + return 
SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
+            "facebook/s2t-small-librispeech-asr", "bert-base-cased"
+        )
+
+    def get_encoder_decoder_model(self, config, decoder_config):
+        encoder_model = Speech2TextEncoder(config)
+        decoder_model = BertLMHeadModel(decoder_config)
+        return encoder_model, decoder_model
+
+    def prepare_config_and_inputs(self):
+        bert_model_tester = BertModelTester(self)
+        speech2text_model_tester = Speech2TextModelTester(self)
+        encoder_config_and_inputs = speech2text_model_tester.prepare_config_and_inputs()
+        decoder_config_and_inputs = bert_model_tester.prepare_config_and_inputs_for_decoder()
+
+        config, inputs = encoder_config_and_inputs
+        input_features = inputs["input_features"]
+        input_mask = inputs["attention_mask"]
+
+        (
+            decoder_config,
+            decoder_input_ids,
+            decoder_token_type_ids,
+            decoder_input_mask,
+            decoder_sequence_labels,
+            decoder_token_labels,
+            decoder_choice_labels,
+            encoder_attention_mask,
+            _,
+        ) = decoder_config_and_inputs
+
+        # make sure that cross attention layers are added
+        decoder_config.add_cross_attention = True
+        return {
+            "config": config,
+            "input_features": input_features,
+            "attention_mask": input_mask,
+            "decoder_config": decoder_config,
+            "decoder_input_ids": decoder_input_ids,
+            "decoder_token_type_ids": decoder_token_type_ids,
+            "decoder_attention_mask": decoder_input_mask,
+            "decoder_sequence_labels": decoder_sequence_labels,
+            "decoder_token_labels": decoder_token_labels,
+            "decoder_choice_labels": decoder_choice_labels,
+            "labels": decoder_token_labels,
+        }
+
+    # can't save full model for now because Speech2TextModel != Speech2TextEncoder
+    def test_encoder_decoder_model_from_pretrained_configs(self):
+        pass
+
+    # can't save full model for now because Speech2TextModel != Speech2TextEncoder
+    def test_save_and_load_from_pretrained(self):
+        pass
+
+
+@require_torch
+class Wav2Vec2Speech2Text2(EncoderDecoderMixin, unittest.TestCase):
+    def get_encoder_decoder_model(self, config, decoder_config):
+        encoder_model = Wav2Vec2Model(config)
+        decoder_model = Speech2Text2ForCausalLM(decoder_config)
+        return encoder_model, decoder_model
+
+    def prepare_config_and_inputs(self):
+        model_tester_encoder = Wav2Vec2ModelTester(self, batch_size=13)
+        model_tester_decoder = Speech2Text2StandaloneDecoderModelTester(
+            self, batch_size=13, d_model=32, max_position_embeddings=512
+        )
+        encoder_config_and_inputs = model_tester_encoder.prepare_config_and_inputs()
+        decoder_config_and_inputs = model_tester_decoder.prepare_config_and_inputs()
+        (
+            config,
+            input_values,
+            input_mask,
+        ) = encoder_config_and_inputs
+        (decoder_config, decoder_input_ids, decoder_attention_mask, _) = decoder_config_and_inputs
+
+        # make sure that cross attention layers are added
+        decoder_config.add_cross_attention = True
+        # disable cache for now
+        decoder_config.use_cache = False
+        return {
+            "config": config,
+            "input_values": input_values,
+            "attention_mask": input_mask,
+            "decoder_config": decoder_config,
+            "decoder_input_ids": decoder_input_ids,
+            "decoder_attention_mask": decoder_attention_mask,
+        }
+
+    def get_pretrained_model(self):
+        # use a speech encoder checkpoint together with the Speech2Text2 decoder checkpoint
+        return SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
+            "facebook/wav2vec2-base-960h", "facebook/s2t-wav2vec2-large-en-de"
+        )
diff --git a/tests/test_modeling_speech_to_text.py b/tests/test_modeling_speech_to_text.py
index 2a56a583fa5..546e356c721 100644
--- a/tests/test_modeling_speech_to_text.py
+++ b/tests/test_modeling_speech_to_text.py
@@ -241,11 +241,15 @@ class Speech2TextModelTester:
             decoder.save_pretrained(tmpdirname)
decoder = Speech2TextDecoder.from_pretrained(tmpdirname).to(torch_device)
 
+            encoder_attention_mask = encoder._get_feature_vector_attention_mask(
+                encoder_last_hidden_state.shape[1], inputs_dict["attention_mask"]
+            )
+
             last_hidden_state_2 = decoder(
                 input_ids=inputs_dict["decoder_input_ids"],
                 attention_mask=inputs_dict["decoder_attention_mask"],
                 encoder_hidden_states=encoder_last_hidden_state,
-                encoder_attention_mask=inputs_dict["attention_mask"],
+                encoder_attention_mask=encoder_attention_mask,
             )[0]
 
             self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
@@ -288,6 +292,7 @@ class Speech2TextModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.Tes
         config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
         self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
 
+    # not implemented currently
     def test_inputs_embeds(self):
         pass
 
@@ -352,7 +357,7 @@ class Speech2TextModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.Tes
             else:
                 seq_length = self.model_tester.seq_length
 
-            subsampled_seq_length = model._get_subsampled_output_lengths(seq_length)
+            subsampled_seq_length = model._get_feat_extract_output_lengths(seq_length)
 
             self.assertListEqual(
                 list(hidden_states[0].shape[-2:]),
@@ -402,8 +407,8 @@ class Speech2TextModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.Tes
             model.to(torch_device)
             model.eval()
 
-            subsampled_encoder_seq_length = model._get_subsampled_output_lengths(encoder_seq_length)
-            subsampled_encoder_key_length = model._get_subsampled_output_lengths(encoder_key_length)
+            subsampled_encoder_seq_length = model._get_feat_extract_output_lengths(encoder_seq_length)
+            subsampled_encoder_key_length = model._get_feat_extract_output_lengths(encoder_key_length)
 
             with torch.no_grad():
                 outputs = model(**self._prepare_for_class(inputs_dict, model_class))
diff --git a/tests/test_modeling_speech_to_text_2.py b/tests/test_modeling_speech_to_text_2.py
new file mode 100644
index 00000000000..aaded0c8aa8
--- /dev/null
+++ b/tests/test_modeling_speech_to_text_2.py
@@ -0,0 +1,211 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch Speech2Text2 model.
""" + +import unittest + +from transformers import Speech2Text2Config +from transformers.testing_utils import is_torch_available, require_torch, torch_device + +from .test_configuration_common import ConfigTester +from .test_generation_utils import GenerationTesterMixin +from .test_modeling_common import ModelTesterMixin, ids_tensor + + +if is_torch_available(): + import torch + + from transformers.models.speech_to_text_2.modeling_speech_to_text_2 import ( + Speech2Text2Decoder, + Speech2Text2ForCausalLM, + ) + + +@require_torch +class Speech2Text2StandaloneDecoderModelTester: + def __init__( + self, + parent, + vocab_size=99, + batch_size=13, + d_model=16, + decoder_seq_length=7, + is_training=True, + is_decoder=True, + use_attention_mask=True, + use_cache=False, + use_labels=True, + decoder_start_token_id=2, + decoder_ffn_dim=32, + decoder_layers=4, + decoder_attention_heads=4, + max_position_embeddings=30, + pad_token_id=0, + bos_token_id=1, + eos_token_id=2, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.decoder_seq_length = decoder_seq_length + # For common tests + self.seq_length = self.decoder_seq_length + self.is_training = is_training + self.use_attention_mask = use_attention_mask + self.use_labels = use_labels + + self.vocab_size = vocab_size + self.d_model = d_model + self.hidden_size = d_model + self.num_hidden_layers = decoder_layers + self.decoder_layers = decoder_layers + self.decoder_ffn_dim = decoder_ffn_dim + self.decoder_attention_heads = decoder_attention_heads + self.num_attention_heads = decoder_attention_heads + self.eos_token_id = eos_token_id + self.bos_token_id = bos_token_id + self.pad_token_id = pad_token_id + self.decoder_start_token_id = decoder_start_token_id + self.use_cache = use_cache + self.max_position_embeddings = max_position_embeddings + + self.scope = None + self.decoder_key_length = decoder_seq_length + self.base_model_out_len = 2 + self.decoder_attention_idx = 1 + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size) + + attention_mask = None + if self.use_attention_mask: + attention_mask = ids_tensor([self.batch_size, self.decoder_seq_length], vocab_size=2) + + lm_labels = None + if self.use_labels: + lm_labels = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size) + + config = Speech2Text2Config( + vocab_size=self.vocab_size, + d_model=self.d_model, + decoder_layers=self.decoder_layers, + decoder_ffn_dim=self.decoder_ffn_dim, + decoder_attention_heads=self.decoder_attention_heads, + eos_token_id=self.eos_token_id, + bos_token_id=self.bos_token_id, + use_cache=self.use_cache, + pad_token_id=self.pad_token_id, + decoder_start_token_id=self.decoder_start_token_id, + max_position_embeddings=self.max_position_embeddings, + ) + + return ( + config, + input_ids, + attention_mask, + lm_labels, + ) + + def create_and_check_decoder_model_past( + self, + config, + input_ids, + attention_mask, + lm_labels, + ): + config.use_cache = True + model = Speech2Text2Decoder(config=config).to(torch_device).eval() + input_ids = input_ids[:2] + + input_ids[input_ids == 0] += 1 + # first forward pass + outputs = model(input_ids, use_cache=True) + outputs_use_cache_conf = model(input_ids) + outputs_no_past = model(input_ids, use_cache=False) + + self.parent.assertTrue(len(outputs) == len(outputs_use_cache_conf)) + self.parent.assertTrue(len(outputs) == len(outputs_no_past) + 1) + + past_key_values = outputs["past_key_values"] + + # create 
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        (
+            config,
+            input_ids,
+            attention_mask,
+            lm_labels,
+        ) = config_and_inputs
+
+        inputs_dict = {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+        }
+        return config, inputs_dict
+
+
+@require_torch
+class Speech2Text2StandaloneDecoderModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+    all_model_classes = (Speech2Text2Decoder, Speech2Text2ForCausalLM) if is_torch_available() else ()
+    all_generative_model_classes = (Speech2Text2ForCausalLM,) if is_torch_available() else ()
+    test_pruning = False
+
+    def setUp(
+        self,
+    ):
+        self.model_tester = Speech2Text2StandaloneDecoderModelTester(self, is_training=False)
+        self.config_tester = ConfigTester(self, config_class=Speech2Text2Config)
+
+    # not implemented currently
+    def test_inputs_embeds(self):
+        pass
+
+    # speech2text2 has no base model
+    def test_save_load_fast_init_from_base(self):
+        pass
+
+    # speech2text2 has no base model
+    def test_save_load_fast_init_to_base(self):
+        pass
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_decoder_model_past(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_decoder_model_past(*config_and_inputs)
+
+    # decoder cannot keep gradients
+    def test_retain_grad_hidden_states_attentions(self):
+        return
diff --git a/tests/test_pipelines_audio_classification.py b/tests/test_pipelines_audio_classification.py
index 4837b05c2a9..786de18da13 100644
--- a/tests/test_pipelines_audio_classification.py
+++ b/tests/test_pipelines_audio_classification.py
@@ -76,6 +76,8 @@ class AudioClassificationPipelineTests(unittest.TestCase, metaclass=PipelineTest
     @require_torch
     def test_small_model_pt(self):
         model = "anton-l/wav2vec2-random-tiny-classifier"
+
+        # hack: dummy tokenizer is required to prevent pipeline from failing
         tokenizer = PreTrainedTokenizer()

         audio_classifier = pipeline("audio-classification", model=model, tokenizer=tokenizer)
@@ -98,6 +100,8 @@ class AudioClassificationPipelineTests(unittest.TestCase, metaclass=PipelineTest
         import datasets

         model = "superb/wav2vec2-base-superb-ks"
+
+        # hack: dummy tokenizer is required to prevent pipeline from failing
         tokenizer = PreTrainedTokenizer()
         audio_classifier = pipeline("audio-classification", model=model, tokenizer=tokenizer)
         dataset = datasets.load_dataset("anton-l/superb_dummy", "ks", split="test")
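The `tokenizer=PreTrainedTokenizer()` lines above work around the pipeline factory insisting on a tokenizer even though audio classification only needs a feature extractor. From a user's perspective the call pattern looks roughly like this — a sketch reusing the checkpoint from the test; the silent waveform is a stand-in for real audio:

```python
import numpy as np
from transformers import PreTrainedTokenizer, pipeline

# dummy tokenizer sidesteps the pipeline's tokenizer requirement (see the hack above)
audio_classifier = pipeline(
    "audio-classification",
    model="anton-l/wav2vec2-random-tiny-classifier",
    tokenizer=PreTrainedTokenizer(),
)

# one second of silence at 16 kHz stands in for real audio
waveform = np.zeros(16000, dtype=np.float32)
predictions = audio_classifier(waveform, top_k=2)  # e.g. [{"label": ..., "score": ...}, ...]
```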
diff --git a/tests/test_pipelines_automatic_speech_recognition.py b/tests/test_pipelines_automatic_speech_recognition.py
index 6d391eee9a9..4ee03271f4a 100644
--- a/tests/test_pipelines_automatic_speech_recognition.py
+++ b/tests/test_pipelines_automatic_speech_recognition.py
@@ -14,6 +14,8 @@
 import unittest

+import pytest
+
 from transformers import AutoFeatureExtractor, AutoTokenizer, Speech2TextForConditionalGeneration, Wav2Vec2ForCTC
 from transformers.pipelines import AutomaticSpeechRecognitionPipeline, pipeline
 from transformers.testing_utils import is_pipeline_test, require_datasets, require_torch, require_torchaudio, slow
@@ -44,6 +46,16 @@ class AutomaticSpeechRecognitionPipelineTests(unittest.TestCase):
         output = speech_recognizer(waveform)
         self.assertEqual(output, {"text": "C'est ce que j'ai fait à ce moment-là."})

+    @require_torch
+    def test_torch_small_no_tokenizer_files(self):
+        # test that model without tokenizer file cannot be loaded
+        with pytest.raises(ValueError):
+            pipeline(
+                task="automatic-speech-recognition",
+                model="hf-internal-testing/tiny-random-wav2vec2",
+                framework="pt",
+            )
+
     @require_datasets
     @require_torch
     @slow
@@ -67,6 +79,24 @@ class AutomaticSpeechRecognitionPipelineTests(unittest.TestCase):
         output = speech_recognizer(filename)
         self.assertEqual(output, {"text": "A MAN SAID TO THE UNIVERSE SIR I EXIST"})

+    @require_datasets
+    @require_torch
+    @slow
+    def test_torch_speech_encoder_decoder(self):
+        speech_recognizer = pipeline(
+            task="automatic-speech-recognition",
+            model="facebook/s2t-wav2vec2-large-en-de",
+            feature_extractor="facebook/s2t-wav2vec2-large-en-de",
+            framework="pt",
+        )
+
+        from datasets import load_dataset
+
+        ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+        filename = ds[0]["file"]
+        output = speech_recognizer(filename)
+        self.assertEqual(output, {"text": 'Ein Mann sagte zum Universum : " Sir, ich existiert! "'})
+
     @slow
     @require_torch
     @require_datasets
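The new slow test exercises a `SpeechEncoderDecoder` checkpoint (a Wav2Vec2 encoder paired with a Speech2Text2 decoder) that translates English speech directly to German text. Outside the pipeline, the same checkpoint can be driven by hand — a hedged sketch following the usual Transformers processor/model pattern, not code copied from this PR; the exact `generate` call and decoded output may differ:

```python
import soundfile as sf
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

# checkpoint taken from the test above
model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
speech, sampling_rate = sf.read(ds[0]["file"])

inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])

# decodes German text, e.g. 'Ein Mann sagte zum Universum : " Sir, ich existiert! "'
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```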
diff --git a/tests/test_tokenization_speech_to_text_2.py b/tests/test_tokenization_speech_to_text_2.py
new file mode 100644
index 00000000000..1517555e72b
--- /dev/null
+++ b/tests/test_tokenization_speech_to_text_2.py
@@ -0,0 +1,155 @@
+# Copyright 2021 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+import tempfile
+import unittest
+
+from transformers.models.speech_to_text_2 import Speech2Text2Tokenizer
+from transformers.models.speech_to_text_2.tokenization_speech_to_text_2 import VOCAB_FILES_NAMES
+from transformers.testing_utils import is_pt_tf_cross_test
+
+from .test_tokenization_common import TokenizerTesterMixin
+
+
+class SpeechToTextTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
+    tokenizer_class = Speech2Text2Tokenizer
+    test_rust_tokenizer = False
+
+    def setUp(self):
+        super().setUp()
+
+        vocab = "<s> <pad> </s> <unk> here@@ a couple of@@ words for the vocab".split(" ")
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+
+        self.special_tokens_map = {"pad_token": "<pad>", "unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}
+
+        self.tmpdirname = tempfile.mkdtemp()
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens) + "\n")
+
+    def test_get_vocab(self):
+        vocab_keys = list(self.get_tokenizer().get_vocab().keys())
+
+        self.assertEqual(vocab_keys[0], "<s>")
+        self.assertEqual(vocab_keys[1], "<pad>")
+        self.assertEqual(vocab_keys[-1], "vocab")
+        self.assertEqual(len(vocab_keys), 12)
+
+    def test_vocab_size(self):
+        self.assertEqual(self.get_tokenizer().vocab_size, 12)
+
+    def test_tokenizer_decode(self):
+        tokenizer = Speech2Text2Tokenizer.from_pretrained(self.tmpdirname)
+
+        # make sure @@ is correctly concatenated
+        token_ids = [4, 6, 8, 7, 10]  # ["here@@", "couple", "words", "of@@", "the"]
+        output_string = tokenizer.decode(token_ids)
+
+        self.assertTrue(output_string == "herecouple words ofthe")
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_add_special_tokens(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_add_tokens_tokenizer(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_added_tokens_do_lower_case(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_batch_encode_plus_batch_sequence_length(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_batch_encode_plus_overflowing_tokens(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_batch_encode_plus_padding(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_call(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_encode_plus_with_padding(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_internal_consistency(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_maximum_encoding_length_pair_input(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_maximum_encoding_length_single_input(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_number_of_added_tokens(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_padding_to_max_length(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_padding_to_multiple_of(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_pickle_tokenizer(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_prepare_for_model(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_pretokenized_inputs(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_right_and_left_padding(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_save_and_load_tokenizer(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_special_tokens_mask(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_special_tokens_mask_input_pairs(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    def test_token_type_ids(self):
+        pass
+
+    # the tokenizer currently supports only decoding, not encoding
+    @is_pt_tf_cross_test
+    def test_batch_encode_plus_tensors(self):
+        pass
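The `test_tokenizer_decode` expectation above hinges on the BPE continuation marker: a token ending in `@@` is a word fragment that glues onto the following token, which is why `["here@@", "couple", "words", "of@@", "the"]` decodes to `"herecouple words ofthe"`. A minimal sketch of that merge rule — a simplified re-implementation for illustration, not the tokenizer's actual code:

```python
from typing import List


def bpe_decode(tokens: List[str]) -> str:
    # a token ending in "@@" is a word fragment glued to the next token
    words, current = [], ""
    for token in tokens:
        if token.endswith("@@"):
            current += token[: -len("@@")]  # strip the marker, keep accumulating
        else:
            words.append(current + token)
            current = ""
    if current:
        words.append(current)  # dangling fragment with no continuation
    return " ".join(words)


# mirrors the expectation in test_tokenizer_decode above
assert bpe_decode(["here@@", "couple", "words", "of@@", "the"]) == "herecouple words ofthe"
```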