From b61c47f5a555219bc6e29789263ea39d754109f4 Mon Sep 17 00:00:00 2001
From: Aashish Anand <84689683+AshAnand34@users.noreply.github.com>
Date: Mon, 9 Jun 2025 13:00:38 -0700
Subject: [PATCH] Created model card for xlm-roberta-xl (#38597)
* Created model card for xlm-roberta-xl
* Update XLM-RoBERTa-XL model card with improved descriptions and usage examples
* Minor option labeling fix
* Added MaskedLM version of XLM RoBERTa XL to model card
* Added quantization example for XLM RoBERTa XL model card
* minor fixes to xlm roberta xl model card
* Minor fixes to mask format in xlm roberta xl model card
---
docs/source/en/model_doc/xlm-roberta-xl.md | 118 +++++++++++++++++----
1 file changed, 97 insertions(+), 21 deletions(-)
diff --git a/docs/source/en/model_doc/xlm-roberta-xl.md b/docs/source/en/model_doc/xlm-roberta-xl.md
index 355869ad6e0..56306bcb4a6 100644
--- a/docs/source/en/model_doc/xlm-roberta-xl.md
+++ b/docs/source/en/model_doc/xlm-roberta-xl.md
@@ -14,37 +14,113 @@ rendered properly in your Markdown viewer.
-->
-# XLM-RoBERTa-XL
-## Overview
+# XLM-RoBERTa-XL
-The XLM-RoBERTa-XL model was proposed in [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
+[XLM-RoBERTa-XL](https://huggingface.co/papers/2105.00572) is a 3.5B parameter multilingual masked language model pretrained on 100 languages. It shows that scaling model capacity allows multilingual models to achieve strong performance on high-resource languages while greatly improving low-resource languages.
-The abstract from the paper is the following:
+You can find all the original XLM-RoBERTa-XL checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=xlm) organization.
-*Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models dubbed XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests pretrained models with larger capacity may obtain both strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.*
+> [!TIP]
+> Click on the XLM-RoBERTa-XL models in the right sidebar for more examples of how to apply XLM-RoBERTa-XL to different cross-lingual tasks like classification, translation, and question answering.
-This model was contributed by [Soonhwan-Kwon](https://github.com/Soonhwan-Kwon) and [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
+The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
-## Usage tips
+<hfoptions id="usage">
+<hfoption id="Pipeline">
-XLM-RoBERTa-XL is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
-not require `lang` tensors to understand which language is used, and should be able to determine the correct
-language from the input ids.
+```python
+import torch
+from transformers import pipeline
-## Resources
+pipeline = pipeline(
+    task="fill-mask",
+    model="facebook/xlm-roberta-xl",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Bonjour, je suis un modèle <mask>.")
+```
-- [Text classification task guide](../tasks/sequence_classification)
-- [Token classification task guide](../tasks/token_classification)
-- [Question answering task guide](../tasks/question_answering)
-- [Causal language modeling task guide](../tasks/language_modeling)
-- [Masked language modeling task guide](../tasks/masked_language_modeling)
-- [Multiple choice task guide](../tasks/multiple_choice)
+</hfoption>
+<hfoption id="AutoModel">
+
+```python
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(
+    "facebook/xlm-roberta-xl",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "facebook/xlm-roberta-xl",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+inputs = tokenizer("Bonjour, je suis un modèle <mask>.", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
+
+</hfoption>
+<hfoption id="transformers CLI">
+
+```bash
+echo -e "Plants create <mask> through a process known as photosynthesis." | transformers-cli run --task fill-mask --model facebook/xlm-roberta-xl --device 0
+```
+
+</hfoption>
+</hfoptions>
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
+
+```py
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer, TorchAoConfig
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+tokenizer = AutoTokenizer.from_pretrained(
+    "facebook/xlm-roberta-xl",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "facebook/xlm-roberta-xl",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa",
+    quantization_config=quantization_config
+)
+inputs = tokenizer("Bonjour, je suis un modèle <mask>.", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
+
+## Notes
+
+- Unlike some XLM models, XLM-RoBERTa-XL doesn't require `lang` tensors to understand which language is used. It automatically determines the language from the input ids.
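+
+  The snippet below is a minimal sketch of this behavior, reusing the `facebook/xlm-roberta-xl` checkpoint from the examples above; the sample sentences are only illustrative. The same tokenizer and model handle different languages with no language id passed anywhere.
+
+  ```py
+  import torch
+  from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+  tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
+  model = AutoModelForMaskedLM.from_pretrained("facebook/xlm-roberta-xl", torch_dtype=torch.float16, device_map="auto")
+
+  # No `lang` tensor is built for either sentence; the model infers the language from the input ids.
+  for text in ["Paris est la capitale de la <mask>.", "दिल्ली <mask> की राजधानी है।"]:
+      inputs = tokenizer(text, return_tensors="pt").to("cuda")
+      with torch.no_grad():
+          logits = model(**inputs).logits
+      masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
+      print(text, "->", tokenizer.decode(logits[0, masked_index].argmax(dim=-1)))
+  ```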
## XLMRobertaXLConfig