Mirror of https://github.com/huggingface/transformers.git, synced 2025-07-03 21:00:08 +06:00
Add support for XLM-R XL and XXL models by modeling_xlm_roberta_xl.py (#13727)
* add xlm roberta xl
* add convert xlm xl fairseq checkpoint to pytorch
* fix init and documents for xlm-roberta-xl
* fix indention
* add test for XLM-R xl,xxl
* fix model hub name
* fix some stuff
* up
* correct init
* fix more
* fix as suggestions
* add torch_device
* fix default values of doc strings
* fix leftovers
* merge to master
* up
* correct hub names
* fix docs
* fix model
* up
* finalize
* last fix
* Apply suggestions from code review

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add copied from
* make style

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
parent 16d4acbfdb
commit e09473a817
@@ -323,6 +323,7 @@ AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Ch
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/master/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
1. **[XLS-R](https://huggingface.co/docs/master/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
@@ -301,6 +301,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/master/model_doc/xlm-roberta-xl)** (from Facebook AI) released with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1. **[XLS-R](https://huggingface.co/docs/master/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
@@ -325,6 +325,7 @@ conda install -c huggingface transformers
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (来自 Facebook AI), 伴随论文 [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) 由 Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov 发布。
1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/master/model_doc/xlm-roberta-xl)** (来自 Facebook AI) 伴随论文 [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) 由 Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau 发布。
1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (来自 Google/CMU) 伴随论文 [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) 由 Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le 发布。
1. **[XLS-R](https://huggingface.co/docs/master/transformers/model_doc/xls_r)** (来自 Facebook AI) 伴随论文 [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) 由 Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli 发布。
1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (来自 Facebook AI) 伴随论文 [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) 由 Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli 发布。
@@ -337,6 +337,7 @@ conda install -c huggingface transformers
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/master/model_doc/xlm-roberta-xl)** (from Facebook AI) released with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1. **[XLS-R](https://huggingface.co/docs/master/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
@@ -312,6 +312,8 @@
    title: XLM-ProphetNet
  - local: model_doc/xlm-roberta
    title: XLM-RoBERTa
  - local: model_doc/xlm-roberta-xl
    title: XLM-RoBERTa-XL
  - local: model_doc/xlnet
    title: XLNet
  - local: model_doc/xlsr_wav2vec2
@@ -146,6 +146,7 @@ conversion utilities for the following models.
1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[XLM-RoBERTa](model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1. **[XLM-RoBERTa-XL](model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
1. **[XLNet](model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1. **[XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
1. **[XLS-R](https://huggingface.co/docs/master/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
@@ -246,6 +247,7 @@ Flax), PyTorch, and/or TensorFlow.
| XGLM | ✅ | ✅ | ✅ | ❌ | ✅ |
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
| XLM-RoBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
| XLM-RoBERTa-XL | ❌ | ❌ | ✅ | ❌ | ❌ |
| XLMProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
| XLNet | ✅ | ✅ | ✅ | ✅ | ❌ |
| YOSO | ❌ | ❌ | ✅ | ❌ | ❌ |
docs/source/model_doc/xlm-roberta-xl.mdx (new file, 69 lines)
@@ -0,0 +1,69 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# XLM-RoBERTa-XL

## Overview

The XLM-RoBERTa-XL model was proposed in [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.

The abstract from the paper is the following:

*Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models dubbed XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests pretrained models with larger capacity may obtain both strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.*

Tips:

- XLM-RoBERTa-XL is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
  not require `lang` tensors to understand which language is used, and should be able to determine the correct
  language from the input ids.

This model was contributed by [Soonhwan-Kwon](https://github.com/Soonhwan-Kwon) and [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).


## XLMRobertaXLConfig

[[autodoc]] XLMRobertaXLConfig

## XLMRobertaXLModel

[[autodoc]] XLMRobertaXLModel
    - forward

## XLMRobertaXLForCausalLM

[[autodoc]] XLMRobertaXLForCausalLM
    - forward

## XLMRobertaXLForMaskedLM

[[autodoc]] XLMRobertaXLForMaskedLM
    - forward

## XLMRobertaXLForSequenceClassification

[[autodoc]] XLMRobertaXLForSequenceClassification
    - forward

## XLMRobertaXLForMultipleChoice

[[autodoc]] XLMRobertaXLForMultipleChoice
    - forward

## XLMRobertaXLForTokenClassification

[[autodoc]] XLMRobertaXLForTokenClassification
    - forward

## XLMRobertaXLForQuestionAnswering

[[autodoc]] XLMRobertaXLForQuestionAnswering
    - forward
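Not part of the diff: the new doc page stops at the autodoc directives, so here is a minimal, hedged usage sketch. It assumes the `facebook/xlm-roberta-xl` checkpoint named in the pretrained config map added further down in this PR, and that `AutoTokenizer` resolves to the regular XLM-RoBERTa tokenizer for that checkpoint (the model table above shows XLM-RoBERTa-XL adds no tokenizer of its own).

```python
# Minimal fill-mask sketch for the new model classes (assumed checkpoint name:
# "facebook/xlm-roberta-xl"; note the XL checkpoint has ~3.5B parameters, so this
# downloads a large model).
import torch
from transformers import AutoTokenizer, XLMRobertaXLForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = XLMRobertaXLForMaskedLM.from_pretrained("facebook/xlm-roberta-xl")

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# locate the masked position and take the highest-scoring token for it
mask_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```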
@@ -60,6 +60,7 @@ Ready-made configurations include the following architectures:
- RoBERTa
- T5
- XLM-RoBERTa
- XLM-RoBERTa-XL

The ONNX conversion is supported for the PyTorch versions of the models. If you
would like to be able to convert a TensorFlow model, please let us know by
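Not part of the diff: a small sketch of what the ready-made ONNX configuration amounts to. The `inputs` property is copied verbatim later in this PR; the `task` argument and `default_onnx_opset` attribute are assumed from the `OnnxConfig` base class of that era.

```python
# Inspect the ONNX config added for XLM-RoBERTa-XL (sketch, offline, random config).
from transformers import XLMRobertaXLConfig
from transformers.models.xlm_roberta_xl.configuration_xlm_roberta_xl import XLMRobertaXLOnnxConfig

config = XLMRobertaXLConfig()                    # defaults describe the 3.5B (XL) architecture
onnx_config = XLMRobertaXLOnnxConfig(config, task="default")

print(onnx_config.inputs)                        # OrderedDict with dynamic axes for input_ids / attention_mask
print(onnx_config.default_onnx_opset)            # opset inherited from the OnnxConfig base class
```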
@@ -332,6 +332,7 @@ _import_structure = {
    "models.xlm": ["XLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMConfig", "XLMTokenizer"],
    "models.xlm_prophetnet": ["XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMProphetNetConfig"],
    "models.xlm_roberta": ["XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMRobertaConfig"],
    "models.xlm_roberta_xl": ["XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMRobertaXLConfig"],
    "models.xlnet": ["XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLNetConfig"],
    "models.yoso": ["YOSO_PRETRAINED_CONFIG_ARCHIVE_MAP", "YosoConfig"],
    "onnx": [],
@@ -1507,6 +1508,19 @@ if is_torch_available():
            "XLMRobertaModel",
        ]
    )
    _import_structure["models.xlm_roberta_xl"].extend(
        [
            "XLM_ROBERTA_XL_PRETRAINED_MODEL_ARCHIVE_LIST",
            "XLMRobertaXLForCausalLM",
            "XLMRobertaXLForMaskedLM",
            "XLMRobertaXLForMultipleChoice",
            "XLMRobertaXLForQuestionAnswering",
            "XLMRobertaXLForSequenceClassification",
            "XLMRobertaXLForTokenClassification",
            "XLMRobertaXLModel",
            "XLMRobertaXLPreTrainedModel",
        ]
    )
    _import_structure["models.xlnet"].extend(
        [
            "XLNET_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -2485,6 +2499,7 @@ if TYPE_CHECKING:
    from .models.xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig, XLMTokenizer
    from .models.xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig
    from .models.xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
    from .models.xlm_roberta_xl import XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaXLConfig
    from .models.xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
    from .models.yoso import YOSO_PRETRAINED_CONFIG_ARCHIVE_MAP, YosoConfig

@@ -3455,6 +3470,17 @@ if TYPE_CHECKING:
        XLMRobertaForTokenClassification,
        XLMRobertaModel,
    )
    from .models.xlm_roberta_xl import (
        XLM_ROBERTA_XL_PRETRAINED_MODEL_ARCHIVE_LIST,
        XLMRobertaXLForCausalLM,
        XLMRobertaXLForMaskedLM,
        XLMRobertaXLForMultipleChoice,
        XLMRobertaXLForQuestionAnswering,
        XLMRobertaXLForSequenceClassification,
        XLMRobertaXLForTokenClassification,
        XLMRobertaXLModel,
        XLMRobertaXLPreTrainedModel,
    )
    from .models.xlnet import (
        XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,
        XLNetForMultipleChoice,
@@ -120,6 +120,7 @@ from . import (
    xlm,
    xlm_prophetnet,
    xlm_roberta,
    xlm_roberta_xl,
    xlnet,
    yoso,
)
@@ -76,6 +76,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("albert", "AlbertConfig"),
        ("bert-generation", "BertGenerationConfig"),
        ("camembert", "CamembertConfig"),
        ("xlm-roberta-xl", "XLMRobertaXLConfig"),
        ("xlm-roberta", "XLMRobertaConfig"),
        ("pegasus", "PegasusConfig"),
        ("marian", "MarianConfig"),
@@ -250,6 +251,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("bert-generation", "Bert Generation"),
        ("camembert", "CamemBERT"),
        ("xlm-roberta", "XLM-RoBERTa"),
        ("xlm-roberta-xl", "XLM-RoBERTa-XL"),
        ("pegasus", "Pegasus"),
        ("blenderbot", "Blenderbot"),
        ("marian", "Marian"),
@@ -75,6 +75,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("distilbert", "DistilBertModel"),
        ("albert", "AlbertModel"),
        ("camembert", "CamembertModel"),
        ("xlm-roberta-xl", "XLMRobertaXLModel"),
        ("xlm-roberta", "XLMRobertaModel"),
        ("bart", "BartModel"),
        ("longformer", "LongformerModel"),
@@ -123,6 +124,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
        ("distilbert", "DistilBertForMaskedLM"),
        ("albert", "AlbertForPreTraining"),
        ("camembert", "CamembertForMaskedLM"),
        ("xlm-roberta-xl", "XLMRobertaXLForMaskedLM"),
        ("xlm-roberta", "XLMRobertaForMaskedLM"),
        ("bart", "BartForConditionalGeneration"),
        ("fsmt", "FSMTForConditionalGeneration"),
@@ -178,6 +180,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
        ("distilbert", "DistilBertForMaskedLM"),
        ("albert", "AlbertForMaskedLM"),
        ("camembert", "CamembertForMaskedLM"),
        ("xlm-roberta-xl", "XLMRobertaXLForMaskedLM"),
        ("xlm-roberta", "XLMRobertaForMaskedLM"),
        ("marian", "MarianMTModel"),
        ("fsmt", "FSMTForConditionalGeneration"),
@@ -220,6 +223,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("gpt_neo", "GPTNeoForCausalLM"),
        ("big_bird", "BigBirdForCausalLM"),
        ("camembert", "CamembertForCausalLM"),
        ("xlm-roberta-xl", "XLMRobertaXLForCausalLM"),
        ("xlm-roberta", "XLMRobertaForCausalLM"),
        ("roberta", "RobertaForCausalLM"),
        ("bert", "BertLMHeadModel"),
@@ -304,6 +308,7 @@ MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
        ("bart", "BartForConditionalGeneration"),
        ("mbart", "MBartForConditionalGeneration"),
        ("camembert", "CamembertForMaskedLM"),
        ("xlm-roberta-xl", "XLMRobertaXLForMaskedLM"),
        ("xlm-roberta", "XLMRobertaForMaskedLM"),
        ("longformer", "LongformerForMaskedLM"),
        ("roberta", "RobertaForMaskedLM"),
@@ -379,6 +384,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
        ("distilbert", "DistilBertForSequenceClassification"),
        ("albert", "AlbertForSequenceClassification"),
        ("camembert", "CamembertForSequenceClassification"),
        ("xlm-roberta-xl", "XLMRobertaXLForSequenceClassification"),
        ("xlm-roberta", "XLMRobertaForSequenceClassification"),
        ("mbart", "MBartForSequenceClassification"),
        ("bart", "BartForSequenceClassification"),
@@ -430,6 +436,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict(
        ("bart", "BartForQuestionAnswering"),
        ("mbart", "MBartForQuestionAnswering"),
        ("longformer", "LongformerForQuestionAnswering"),
        ("xlm-roberta-xl", "XLMRobertaXLForQuestionAnswering"),
        ("xlm-roberta", "XLMRobertaForQuestionAnswering"),
        ("roberta", "RobertaForQuestionAnswering"),
        ("squeezebert", "SqueezeBertForQuestionAnswering"),
@@ -476,6 +483,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertForTokenClassification"),
        ("flaubert", "FlaubertForTokenClassification"),
        ("xlm", "XLMForTokenClassification"),
        ("xlm-roberta-xl", "XLMRobertaXLForTokenClassification"),
        ("xlm-roberta", "XLMRobertaForTokenClassification"),
        ("longformer", "LongformerForTokenClassification"),
        ("roberta", "RobertaForTokenClassification"),
@@ -509,6 +517,7 @@ MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES = OrderedDict(
        ("convbert", "ConvBertForMultipleChoice"),
        ("camembert", "CamembertForMultipleChoice"),
        ("electra", "ElectraForMultipleChoice"),
        ("xlm-roberta-xl", "XLMRobertaXLForMultipleChoice"),
        ("xlm-roberta", "XLMRobertaForMultipleChoice"),
        ("longformer", "LongformerForMultipleChoice"),
        ("roberta", "RobertaForMultipleChoice"),
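Not part of the diff: once the `"xlm-roberta-xl"` key is present in these auto mappings, the generic `Auto*` classes resolve it without any model-specific imports. A tiny hedged sketch (random weights, PyTorch installed, no download) of what that registration buys:

```python
# The "xlm-roberta-xl" model type is resolved through the auto mappings above.
from transformers import AutoConfig, AutoModel

# Build a deliberately tiny config so the example runs quickly offline.
config = AutoConfig.for_model(
    "xlm-roberta-xl",
    vocab_size=128,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = AutoModel.from_config(config)

print(type(config).__name__)  # XLMRobertaXLConfig, via CONFIG_MAPPING_NAMES
print(type(model).__name__)   # XLMRobertaXLModel, via MODEL_MAPPING_NAMES
```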
src/transformers/models/xlm_roberta_xl/__init__.py (new file, 68 lines)
@@ -0,0 +1,68 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...file_utils import _LazyModule, is_torch_available


_import_structure = {
    "configuration_xlm_roberta_xl": [
        "XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "XLMRobertaXLConfig",
        "XLMRobertaXLOnnxConfig",
    ],
}

if is_torch_available():
    _import_structure["modeling_xlm_roberta_xl"] = [
        "XLM_ROBERTA_XL_PRETRAINED_MODEL_ARCHIVE_LIST",
        "XLMRobertaXLForCausalLM",
        "XLMRobertaXLForMaskedLM",
        "XLMRobertaXLForMultipleChoice",
        "XLMRobertaXLForQuestionAnswering",
        "XLMRobertaXLForSequenceClassification",
        "XLMRobertaXLForTokenClassification",
        "XLMRobertaXLModel",
        "XLMRobertaXLPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_xlm_roberta_xl import (
        XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
        XLMRobertaXLConfig,
        XLMRobertaXLOnnxConfig,
    )

    if is_torch_available():
        from .modeling_xlm_roberta_xl import (
            XLM_ROBERTA_XL_PRETRAINED_MODEL_ARCHIVE_LIST,
            XLMRobertaXLForCausalLM,
            XLMRobertaXLForMaskedLM,
            XLMRobertaXLForMultipleChoice,
            XLMRobertaXLForQuestionAnswering,
            XLMRobertaXLForSequenceClassification,
            XLMRobertaXLForTokenClassification,
            XLMRobertaXLModel,
            XLMRobertaXLPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
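Not part of the diff: a short, hedged illustration of what the `_LazyModule` indirection above does at runtime. Importing the package does not eagerly pull in torch or the large modeling module; names are resolved on first attribute access.

```python
# Lazy-import behavior of the new subpackage (sketch; the exact defaults come from
# the configuration file added below in this PR).
from transformers.models import xlm_roberta_xl

print(type(xlm_roberta_xl).__name__)          # "_LazyModule": the module object was swapped in __init__.py

config = xlm_roberta_xl.XLMRobertaXLConfig()  # first access triggers the real import of configuration_xlm_roberta_xl
print(config.hidden_size, config.num_hidden_layers)  # 2560 36, the XLM-R XL defaults
```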
src/transformers/models/xlm_roberta_xl/configuration_xlm_roberta_xl.py (new file)
@@ -0,0 +1,151 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" XLM_ROBERTa_XL configuration"""

from collections import OrderedDict
from typing import Mapping

from ...configuration_utils import PretrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging


logger = logging.get_logger(__name__)

XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "xlm-roberta-xl": "https://huggingface.co/facebook/xlm-roberta-xl/resolve/main/config.json",
    "xlm-roberta-xxl": "https://huggingface.co/facebook/xlm-roberta-xxl/resolve/main/config.json",
    # See all XLM-RoBERTa-XL models at https://huggingface.co/models?filter=xlm-roberta-xl
}


class XLMRobertaXLConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`XLMRobertaXLModel`] or a [`TFXLMRobertaXLModel`].
    It is used to instantiate a XLM_ROBERTA_XL model according to the specified arguments, defining the model
    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
    XLM_ROBERTA_XL [bert-base-uncased](https://huggingface.co/bert-base-uncased) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 250880):
            Vocabulary size of the XLM_ROBERTA_XL model. Defines the number of different tokens that can be represented
            by the `inputs_ids` passed when calling [`XLMRobertaXLModel`].
        hidden_size (`int`, *optional*, defaults to 2560):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 36):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 10240):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 514):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 1):
            The vocabulary size of the `token_type_ids` passed when calling [`XLMRobertaXLModel`] or
            [`TFXLMRobertaXLModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        classifier_dropout (`float`, *optional*):
            The dropout ratio for the classification head.

    Examples:

    ```python
    >>> from transformers import XLMRobertaXLModel, XLMRobertaXLConfig

    >>> # Initializing a XLM_ROBERTA_XL bert-base-uncased style configuration
    >>> configuration = XLMRobertaXLConfig()

    >>> # Initializing a model from the bert-base-uncased style configuration
    >>> model = XLMRobertaXLModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "xlm-roberta-xl"

    def __init__(
        self,
        vocab_size=250880,
        hidden_size=2560,
        num_hidden_layers=36,
        num_attention_heads=32,
        intermediate_size=10240,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=514,
        type_vocab_size=1,
        initializer_range=0.02,
        layer_norm_eps=1e-05,
        pad_token_id=1,
        bos_token_id=0,
        eos_token_id=2,
        position_embedding_type="absolute",
        use_cache=True,
        classifier_dropout=None,
        **kwargs
    ):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.classifier_dropout = classifier_dropout


# Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->XLMRobertaXL
class XLMRobertaXLOnnxConfig(OnnxConfig):
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict(
            [
                ("input_ids", {0: "batch", 1: "sequence"}),
                ("attention_mask", {0: "batch", 1: "sequence"}),
            ]
        )
@@ -0,0 +1,183 @@
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert RoBERTa checkpoint."""


import argparse
import pathlib

import fairseq
import torch
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
from fairseq.modules import TransformerSentenceEncoderLayer
from packaging import version

from transformers import XLMRobertaConfig, XLMRobertaXLForMaskedLM, XLMRobertaXLForSequenceClassification
from transformers.models.bert.modeling_bert import (
    BertIntermediate,
    BertLayer,
    BertOutput,
    BertSelfAttention,
    BertSelfOutput,
)
from transformers.models.roberta.modeling_roberta import RobertaAttention
from transformers.utils import logging


if version.parse(fairseq.__version__) < version.parse("1.0.0a"):
    raise Exception("requires fairseq >= 1.0.0a")

logging.set_verbosity_info()
logger = logging.get_logger(__name__)

SAMPLE_TEXT = "Hello world! cécé herlolip"


def convert_xlm_roberta_xl_checkpoint_to_pytorch(
    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool
):
    """
    Copy/paste/tweak roberta's weights to our BERT structure.
    """
    roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
    roberta.eval()  # disable dropout
    roberta_sent_encoder = roberta.model.encoder.sentence_encoder
    config = XLMRobertaConfig(
        vocab_size=roberta_sent_encoder.embed_tokens.num_embeddings,
        hidden_size=roberta.cfg.model.encoder_embed_dim,
        num_hidden_layers=roberta.cfg.model.encoder_layers,
        num_attention_heads=roberta.cfg.model.encoder_attention_heads,
        intermediate_size=roberta.cfg.model.encoder_ffn_embed_dim,
        max_position_embeddings=514,
        type_vocab_size=1,
        layer_norm_eps=1e-5,  # PyTorch default used in fairseq
    )
    if classification_head:
        config.num_labels = roberta.model.classification_heads["mnli"].out_proj.weight.shape[0]

    print("Our RoBERTa config:", config)

    model = XLMRobertaXLForSequenceClassification(config) if classification_head else XLMRobertaXLForMaskedLM(config)
    model.eval()

    # Now let's copy all the weights.
    # Embeddings
    model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight
    model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight
    model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(
        model.roberta.embeddings.token_type_embeddings.weight
    )  # just zero them out b/c RoBERTa doesn't use them.

    model.roberta.encoder.LayerNorm.weight = roberta_sent_encoder.layer_norm.weight
    model.roberta.encoder.LayerNorm.bias = roberta_sent_encoder.layer_norm.bias

    for i in range(config.num_hidden_layers):
        # Encoder: start of layer
        layer: BertLayer = model.roberta.encoder.layer[i]
        roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i]

        attention: RobertaAttention = layer.attention
        attention.self_attn_layer_norm.weight = roberta_layer.self_attn_layer_norm.weight
        attention.self_attn_layer_norm.bias = roberta_layer.self_attn_layer_norm.bias

        # self attention
        self_attn: BertSelfAttention = layer.attention.self
        assert (
            roberta_layer.self_attn.k_proj.weight.data.shape
            == roberta_layer.self_attn.q_proj.weight.data.shape
            == roberta_layer.self_attn.v_proj.weight.data.shape
            == torch.Size((config.hidden_size, config.hidden_size))
        )

        self_attn.query.weight.data = roberta_layer.self_attn.q_proj.weight
        self_attn.query.bias.data = roberta_layer.self_attn.q_proj.bias
        self_attn.key.weight.data = roberta_layer.self_attn.k_proj.weight
        self_attn.key.bias.data = roberta_layer.self_attn.k_proj.bias
        self_attn.value.weight.data = roberta_layer.self_attn.v_proj.weight
        self_attn.value.bias.data = roberta_layer.self_attn.v_proj.bias

        # self-attention output
        self_output: BertSelfOutput = layer.attention.output
        assert self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape
        self_output.dense.weight = roberta_layer.self_attn.out_proj.weight
        self_output.dense.bias = roberta_layer.self_attn.out_proj.bias

        # this one is final layer norm
        layer.LayerNorm.weight = roberta_layer.final_layer_norm.weight
        layer.LayerNorm.bias = roberta_layer.final_layer_norm.bias

        # intermediate
        intermediate: BertIntermediate = layer.intermediate
        assert intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape
        intermediate.dense.weight = roberta_layer.fc1.weight
        intermediate.dense.bias = roberta_layer.fc1.bias

        # output
        bert_output: BertOutput = layer.output
        assert bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape
        bert_output.dense.weight = roberta_layer.fc2.weight
        bert_output.dense.bias = roberta_layer.fc2.bias
        # end of layer

    if classification_head:
        model.classifier.dense.weight = roberta.model.classification_heads["mnli"].dense.weight
        model.classifier.dense.bias = roberta.model.classification_heads["mnli"].dense.bias
        model.classifier.out_proj.weight = roberta.model.classification_heads["mnli"].out_proj.weight
        model.classifier.out_proj.bias = roberta.model.classification_heads["mnli"].out_proj.bias
    else:
        # LM Head
        model.lm_head.dense.weight = roberta.model.encoder.lm_head.dense.weight
        model.lm_head.dense.bias = roberta.model.encoder.lm_head.dense.bias
        model.lm_head.layer_norm.weight = roberta.model.encoder.lm_head.layer_norm.weight
        model.lm_head.layer_norm.bias = roberta.model.encoder.lm_head.layer_norm.bias
        model.lm_head.decoder.weight = roberta.model.encoder.lm_head.weight
        model.lm_head.decoder.bias = roberta.model.encoder.lm_head.bias

    # Let's check that we get the same results.
    input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0)  # batch of size 1

    our_output = model(input_ids)[0]
    if classification_head:
        their_output = roberta.model.classification_heads["mnli"](roberta.extract_features(input_ids))
    else:
        their_output = roberta.model(input_ids)[0]
    print(our_output.shape, their_output.shape)
    max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
    print(f"max_absolute_diff = {max_absolute_diff}")  # ~ 1e-7
    success = torch.allclose(our_output, their_output, atol=1e-3)
    print("Do both models output the same tensors?", "🔥" if success else "💩")
    if not success:
        raise Exception("Something went wRoNg")

    pathlib.Path(pytorch_dump_folder_path).mkdir(parents=True, exist_ok=True)
    print(f"Saving model to {pytorch_dump_folder_path}")
    model.save_pretrained(pytorch_dump_folder_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--roberta_checkpoint_path", default=None, type=str, required=True, help="Path the official PyTorch dump."
    )
    parser.add_argument(
        "--pytorch_dump_folder_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
    )
    parser.add_argument(
        "--classification_head", action="store_true", help="Whether to convert a final classification head."
    )
    args = parser.parse_args()
    convert_xlm_roberta_xl_checkpoint_to_pytorch(
        args.roberta_checkpoint_path, args.pytorch_dump_folder_path, args.classification_head
    )
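Not part of the diff: a hedged sketch of how the conversion entry point above is typically run. The script filename used here is a placeholder (the real module name is not shown in this rendering), and the checkpoint path stands in for a directory that fairseq's `RobertaModel.from_pretrained()` can load.

```python
# Assumes the script above has been saved locally as convert_xlm_roberta_xl.py
# (placeholder name). Equivalent CLI call, using the argparse flags defined above:
#   python convert_xlm_roberta_xl.py \
#       --roberta_checkpoint_path /path/to/fairseq/xlmr.xl \
#       --pytorch_dump_folder_path ./xlm-roberta-xl-converted
from convert_xlm_roberta_xl import convert_xlm_roberta_xl_checkpoint_to_pytorch

from transformers import XLMRobertaXLForMaskedLM

convert_xlm_roberta_xl_checkpoint_to_pytorch(
    roberta_checkpoint_path="/path/to/fairseq/xlmr.xl",   # placeholder fairseq checkpoint dir
    pytorch_dump_folder_path="./xlm-roberta-xl-converted",
    classification_head=False,  # keep the masked-LM head rather than an MNLI classifier
)

# The dump folder then loads like any other Transformers checkpoint.
model = XLMRobertaXLForMaskedLM.from_pretrained("./xlm-roberta-xl-converted")
```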
src/transformers/models/xlm_roberta_xl/modeling_xlm_roberta_xl.py (new file, 1546 lines)
File diff suppressed because it is too large
@@ -4021,6 +4021,65 @@ class XLMRobertaModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


XLM_ROBERTA_XL_PRETRAINED_MODEL_ARCHIVE_LIST = None


class XLMRobertaXLForCausalLM(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLForMaskedLM(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLForMultipleChoice(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLForQuestionAnswering(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLForSequenceClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLForTokenClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class XLMRobertaXLPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


XLNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
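Not part of the diff: a hedged sketch of what these generated dummies do. In an environment without PyTorch, `from transformers import XLMRobertaXLModel` resolves to the dummy class above, so the import itself never fails; only constructing the class raises, with a message explaining that torch is required.

```python
# Demonstrate the dummy-object fallback directly (sketch). With torch installed,
# requires_backends() is a no-op and this constructs a harmless dummy instance;
# without torch, it raises an ImportError explaining the missing backend.
from transformers.utils.dummy_pt_objects import XLMRobertaXLModel as DummyXLMRobertaXLModel

try:
    DummyXLMRobertaXLModel()      # __init__ calls requires_backends(self, ["torch"])
except ImportError as err:
    print(err)                    # explains that PyTorch must be installed to use this class
```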
509
tests/test_modeling_xlm_roberta_xl.py
Normal file
509
tests/test_modeling_xlm_roberta_xl.py
Normal file
@ -0,0 +1,509 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import XLMRobertaXLConfig, is_torch_available
|
||||
from transformers.testing_utils import require_torch, slow, torch_device
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_generation_utils import GenerationTesterMixin
|
||||
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import (
|
||||
XLMRobertaXLForCausalLM,
|
||||
XLMRobertaXLForMaskedLM,
|
||||
XLMRobertaXLForMultipleChoice,
|
||||
XLMRobertaXLForQuestionAnswering,
|
||||
XLMRobertaXLForSequenceClassification,
|
||||
XLMRobertaXLForTokenClassification,
|
||||
XLMRobertaXLModel,
|
||||
)
|
||||
from transformers.models.xlm_roberta_xl.modeling_xlm_roberta_xl import (
|
||||
XLMRobertaXLEmbeddings,
|
||||
create_position_ids_from_input_ids,
|
||||
)
|
||||
|
||||
|
||||
class XLMRobertaXLModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = 13
|
||||
self.seq_length = 7
|
||||
self.is_training = True
|
||||
self.use_input_mask = True
|
||||
self.use_token_type_ids = True
|
||||
self.use_labels = True
|
||||
self.vocab_size = 99
|
||||
self.hidden_size = 32
|
||||
self.num_hidden_layers = 5
|
||||
self.num_attention_heads = 4
|
||||
self.intermediate_size = 37
|
||||
self.hidden_act = "gelu"
|
||||
self.hidden_dropout_prob = 0.1
|
||||
self.attention_probs_dropout_prob = 0.1
|
||||
self.max_position_embeddings = 512
|
||||
self.type_vocab_size = 16
|
||||
self.type_sequence_label_size = 2
|
||||
self.initializer_range = 0.02
|
||||
self.num_labels = 3
|
||||
self.num_choices = 4
|
||||
self.scope = None
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||
|
||||
token_type_ids = None
|
||||
if self.use_token_type_ids:
|
||||
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||
|
||||
sequence_labels = None
|
||||
token_labels = None
|
||||
choice_labels = None
|
||||
if self.use_labels:
|
||||
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||
|
||||
def get_config(self):
|
||||
return XLMRobertaXLConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
hidden_act=self.hidden_act,
|
||||
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
type_vocab_size=self.type_vocab_size,
|
||||
initializer_range=self.initializer_range,
|
||||
)
|
||||
|
||||
def prepare_config_and_inputs_for_decoder(self):
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
) = self.prepare_config_and_inputs()
|
||||
|
||||
config.is_decoder = True
|
||||
encoder_hidden_states = floats_tensor([self.batch_size, self.seq_length, self.hidden_size])
|
||||
encoder_attention_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||
|
||||
return (
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
encoder_hidden_states,
|
||||
encoder_attention_mask,
|
||||
)
|
||||
|
||||
def create_and_check_model(
|
||||
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||
):
|
||||
model = XLMRobertaXLModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
|
||||
result = model(input_ids, token_type_ids=token_type_ids)
|
||||
result = model(input_ids)
|
||||
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def create_and_check_model_as_decoder(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
encoder_hidden_states,
|
||||
encoder_attention_mask,
|
||||
):
|
||||
config.add_cross_attention = True
|
||||
model = XLMRobertaXLModel(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(
|
||||
input_ids,
|
||||
attention_mask=input_mask,
|
||||
token_type_ids=token_type_ids,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
attention_mask=input_mask,
|
||||
token_type_ids=token_type_ids,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
)
|
||||
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def create_and_check_for_causal_lm(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
encoder_hidden_states,
|
||||
encoder_attention_mask,
|
||||
):
|
||||
model = XLMRobertaXLForCausalLM(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
|
||||
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||
|
||||
def create_and_check_decoder_model_past_large_inputs(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
encoder_hidden_states,
|
||||
encoder_attention_mask,
|
||||
):
|
||||
config.is_decoder = True
|
||||
config.add_cross_attention = True
|
||||
model = XLMRobertaXLForCausalLM(config=config).to(torch_device).eval()
|
||||
|
||||
# make sure that ids don't start with pad token
|
||||
mask = input_ids.ne(config.pad_token_id).long()
|
||||
input_ids = input_ids * mask
|
||||
|
||||
# first forward pass
|
||||
outputs = model(
|
||||
input_ids,
|
||||
attention_mask=input_mask,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
use_cache=True,
|
||||
)
|
||||
past_key_values = outputs.past_key_values
|
||||
|
||||
# create hypothetical multiple next token and extent to next_input_ids
|
||||
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
|
||||
|
||||
# make sure that ids don't start with pad token
|
||||
mask = next_tokens.ne(config.pad_token_id).long()
|
||||
next_tokens = next_tokens * mask
|
||||
next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)
|
||||
|
||||
# append to next input_ids and
|
||||
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
|
||||
next_attention_mask = torch.cat([input_mask, next_mask], dim=-1)
|
||||
|
||||
output_from_no_past = model(
|
||||
next_input_ids,
|
||||
attention_mask=next_attention_mask,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
output_hidden_states=True,
|
||||
)["hidden_states"][0]
|
||||
output_from_past = model(
|
||||
next_tokens,
|
||||
attention_mask=next_attention_mask,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
past_key_values=past_key_values,
|
||||
output_hidden_states=True,
|
||||
)["hidden_states"][0]
|
||||
|
||||
# select random slice
|
||||
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
|
||||
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
|
||||
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
|
||||
|
||||
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
|
||||
|
||||
# test that outputs are equal for slice
|
||||
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
|
||||
|
||||
    def create_and_check_for_masked_lm(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        model = XLMRobertaXLForMaskedLM(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))

    def create_and_check_for_token_classification(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        config.num_labels = self.num_labels
        model = XLMRobertaXLForTokenClassification(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.num_labels))

    def create_and_check_for_multiple_choice(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        config.num_choices = self.num_choices
        model = XLMRobertaXLForMultipleChoice(config=config)
        model.to(torch_device)
        model.eval()
        multiple_choice_inputs_ids = input_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
        multiple_choice_token_type_ids = token_type_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
        multiple_choice_input_mask = input_mask.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
        result = model(
            multiple_choice_inputs_ids,
            attention_mask=multiple_choice_input_mask,
            token_type_ids=multiple_choice_token_type_ids,
            labels=choice_labels,
        )
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_choices))

    def create_and_check_for_question_answering(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        model = XLMRobertaXLForQuestionAnswering(config=config)
        model.to(torch_device)
        model.eval()
        result = model(
            input_ids,
            attention_mask=input_mask,
            token_type_ids=token_type_ids,
            start_positions=sequence_labels,
            end_positions=sequence_labels,
        )
        self.parent.assertEqual(result.start_logits.shape, (self.batch_size, self.seq_length))
        self.parent.assertEqual(result.end_logits.shape, (self.batch_size, self.seq_length))

    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        (
            config,
            input_ids,
            token_type_ids,
            input_mask,
            sequence_labels,
            token_labels,
            choice_labels,
        ) = config_and_inputs
        inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
        return config, inputs_dict


@require_torch
class XLMRobertaXLModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):

    all_model_classes = (
        (
            XLMRobertaXLForCausalLM,
            XLMRobertaXLForMaskedLM,
            XLMRobertaXLModel,
            XLMRobertaXLForSequenceClassification,
            XLMRobertaXLForTokenClassification,
            XLMRobertaXLForMultipleChoice,
            XLMRobertaXLForQuestionAnswering,
        )
        if is_torch_available()
        else ()
    )
    all_generative_model_classes = (XLMRobertaXLForCausalLM,) if is_torch_available() else ()

    def setUp(self):
        self.model_tester = XLMRobertaXLModelTester(self)
        self.config_tester = ConfigTester(self, config_class=XLMRobertaXLConfig, hidden_size=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_model(*config_and_inputs)

    def test_model_various_embeddings(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        for type in ["absolute", "relative_key", "relative_key_query"]:
            config_and_inputs[0].position_embedding_type = type
            self.model_tester.create_and_check_model(*config_and_inputs)

    def test_model_as_decoder(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs_for_decoder()
        self.model_tester.create_and_check_model_as_decoder(*config_and_inputs)

    def test_model_as_decoder_with_default_input_mask(self):
        # This regression test was failing with PyTorch < 1.3
        (
            config,
            input_ids,
            token_type_ids,
            input_mask,
            sequence_labels,
            token_labels,
            choice_labels,
            encoder_hidden_states,
            encoder_attention_mask,
        ) = self.model_tester.prepare_config_and_inputs_for_decoder()

        input_mask = None

        self.model_tester.create_and_check_model_as_decoder(
            config,
            input_ids,
            token_type_ids,
            input_mask,
            sequence_labels,
            token_labels,
            choice_labels,
            encoder_hidden_states,
            encoder_attention_mask,
        )

    def test_for_causal_lm(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs_for_decoder()
        self.model_tester.create_and_check_for_causal_lm(*config_and_inputs)

    def test_decoder_model_past_with_large_inputs(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs_for_decoder()
        self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)

    def test_for_masked_lm(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)

    def test_for_token_classification(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_for_token_classification(*config_and_inputs)

    def test_for_multiple_choice(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_for_multiple_choice(*config_and_inputs)

    def test_for_question_answering(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_for_question_answering(*config_and_inputs)

    def test_create_position_ids_respects_padding_index(self):
        """Ensure that position ids are only assigned sequentially to non-padding tokens. This is a
        regression test for https://github.com/huggingface/transformers/issues/1761

        The position ids should be masked with the embedding object's padding index. Therefore, the
        first available non-padding position index is XLMRobertaXLEmbeddings.padding_idx + 1
        """
        config = self.model_tester.prepare_config_and_inputs()[0]
        model = XLMRobertaXLEmbeddings(config=config)

        input_ids = torch.as_tensor([[12, 31, 13, model.padding_idx]])
        expected_positions = torch.as_tensor(
            [[0 + model.padding_idx + 1, 1 + model.padding_idx + 1, 2 + model.padding_idx + 1, model.padding_idx]]
        )
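        # e.g. if padding_idx were 1, the expected position ids above would be [[2, 3, 4, 1]]:
        # real tokens count up from padding_idx + 1, while the padding position keeps padding_idx itself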

        position_ids = create_position_ids_from_input_ids(input_ids, model.padding_idx)
        self.assertEqual(position_ids.shape, expected_positions.shape)
        self.assertTrue(torch.all(torch.eq(position_ids, expected_positions)))

    def test_create_position_ids_from_inputs_embeds(self):
        """Ensure that position ids are only assigned sequentially to non-padding tokens. This is a
        regression test for https://github.com/huggingface/transformers/issues/1761

        The position ids should be masked with the embedding object's padding index. Therefore, the
        first available non-padding position index is XLMRobertaXLEmbeddings.padding_idx + 1
        """
        config = self.model_tester.prepare_config_and_inputs()[0]
        embeddings = XLMRobertaXLEmbeddings(config=config)

        inputs_embeds = torch.empty(2, 4, 30)
        expected_single_positions = [
            0 + embeddings.padding_idx + 1,
            1 + embeddings.padding_idx + 1,
            2 + embeddings.padding_idx + 1,
            3 + embeddings.padding_idx + 1,
        ]
        expected_positions = torch.as_tensor([expected_single_positions, expected_single_positions])
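        # with only inputs_embeds there are no token ids to infer padding from, so every row is
        # expected to get plain sequential positions starting at padding_idx + 1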
        position_ids = embeddings.create_position_ids_from_inputs_embeds(inputs_embeds)
        self.assertEqual(position_ids.shape, expected_positions.shape)
        self.assertTrue(torch.all(torch.eq(position_ids, expected_positions)))


@require_torch
class XLMRobertaModelXLIntegrationTest(unittest.TestCase):
    @slow
    def test_xlm_roberta_xl(self):
        model = XLMRobertaXLModel.from_pretrained("facebook/xlm-roberta-xl").to(torch_device)
        input_ids = torch.tensor(
            [[0, 581, 10269, 83, 99942, 136, 60742, 23, 70, 80583, 18276, 2]], device=torch_device
        )
        # The dog is cute and lives in the garden house

        expected_output_shape = torch.Size((1, 12, 2560))  # batch_size, sequence_length, embedding_vector_dim
        expected_output_values_last_dim = torch.tensor(
            [[0.0110, 0.0605, 0.0354, 0.0689, 0.0066, 0.0691, 0.0302, 0.0412, 0.0860, 0.0036, 0.0405, 0.0170]],
            device=torch_device,
        )

        output = model(input_ids)["last_hidden_state"].detach()
        self.assertEqual(output.shape, expected_output_shape)
        # compare the actual values for a slice of last dim
        self.assertTrue(torch.allclose(output[:, :, -1], expected_output_values_last_dim, atol=1e-3))

    @unittest.skip(reason="Model is too large to be tested on the CI")
    def test_xlm_roberta_xxl(self):
        model = XLMRobertaXLModel.from_pretrained("facebook/xlm-roberta-xxl").to(torch_device)
        input_ids = torch.tensor(
            [[0, 581, 10269, 83, 99942, 136, 60742, 23, 70, 80583, 18276, 2]], device=torch_device
        )
        # The dog is cute and lives in the garden house

        expected_output_shape = torch.Size((1, 12, 4096))  # batch_size, sequence_length, embedding_vector_dim
        expected_output_values_last_dim = torch.tensor(
            [[0.0046, 0.0146, 0.0227, 0.0126, 0.0219, 0.0175, -0.0101, 0.0006, 0.0124, 0.0209, -0.0063, 0.0096]],
            device=torch_device,
        )

        output = model(input_ids)["last_hidden_state"].detach()
        self.assertEqual(output.shape, expected_output_shape)
        # compare the actual values for a slice of last dim
        self.assertTrue(torch.allclose(output[:, :, -1], expected_output_values_last_dim, atol=1e-3))
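
For reference, a minimal usage sketch mirroring the integration test above (not part of the diff). It assumes the facebook/xlm-roberta-xl checkpoint name used in the tests and the XLMRobertaXLModel class added by this PR; the tokenizer is loaded through AutoTokenizer, which should resolve to the regular XLM-R sentencepiece tokenizer for this checkpoint:

import torch
from transformers import AutoTokenizer, XLMRobertaXLModel

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = XLMRobertaXLModel.from_pretrained("facebook/xlm-roberta-xl").eval()

# encode the same sentence the integration test uses
inputs = tokenizer("The dog is cute and lives in the garden house", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden size is 2560 for the XL checkpoint (4096 for XXL), matching the shapes asserted above
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 2560])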