Refactor model summary (#21408)

* first draft of model summary
* restructure docs
* finish first draft
* ✨minor reviews and edits
* apply feedbacks
* save important info, create new page for attention
* add attention doc to toctree
* ✨ few more minor fixes
2025-07-31 02:02:21 +06:00 · 2023-02-15 10:35:14 -08:00 · 2023-02-15 10:35:14 -08:00 · 7a5533b2c3
commit 7a5533b2c3
parent a0e69a9375
30 changed files with 431 additions and 904 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@ -157,9 +157,11 @@
  - local: tasks_explained
    title: How 🤗 Transformers solve tasks
  - local: model_summary
-    title: Summary of the models
+    title: The Transformer model family
  - local: tokenizer_summary
    title: Summary of the tokenizers
+  - local: attention
+    title: Attention mechanisms
  - local: pad_truncation
    title: Padding and truncation
  - local: bertology
--- a/docs/source/en/attention.mdx
+++ b/docs/source/en/attention.mdx
@ -0,0 +1,57 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Attention mechanisms
+
+Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
+computational bottleneck when you have long texts. Longformer and Reformer are models that try to be more efficient and
+use a sparse version of the attention matrix to speed up training.
+
+## LSH attention
+
+[Reformer](#reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
+dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
+the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
+modified to mask the current token (except at the first position), because it will give a query and a key that are
+equal (so very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
+(determined by an n_rounds parameter) and their results are averaged together.
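
The bucketing idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Reformer's implementation: the sign-of-random-projection hash, the function names `lsh_bucket` and `candidate_keys`, and the `n_planes` parameter are all assumptions made for the example. It also takes the union of candidates across hash rounds instead of averaging attention outputs, and it does not apply the current-token masking mentioned in the text.

```python
import random


def lsh_bucket(vec, hyperplanes):
    """Hash a vector to an integer bucket from the signs of its projections."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


def candidate_keys(queries, keys, n_planes=2, n_rounds=4, seed=0):
    """For each query, collect the indices of keys sharing a bucket in any round."""
    rng = random.Random(seed)
    dim = len(queries[0])
    candidates = [set() for _ in queries]
    for _ in range(n_rounds):
        # A fresh set of random hyperplanes acts as one hash function.
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        key_buckets = [lsh_bucket(k, planes) for k in keys]
        for qi, q in enumerate(queries):
            q_bucket = lsh_bucket(q, planes)
            candidates[qi].update(
                ki for ki, kb in enumerate(key_buckets) if kb == q_bucket
            )
    return candidates


# Shared query/key vectors: the first two point in similar directions,
# the third points the opposite way.
vectors = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
candidates = candidate_keys(vectors, vectors)
```

Attention would then be computed only between each query and its candidate keys rather than over the full QK^t matrix.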
+
+## Local attention
+
+[Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
+left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
+window, the last layer will have a receptive field of more than just the tokens in the window, allowing the model to
+build a representation of the whole sentence.
+
+Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
+all tokens, and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
+their local window). This is shown in Figure 2d of the paper; see below for a sample attention mask:
+
+<div class="flex justify-center">
+    <img scale="50 %" align="center" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local_attention_mask.png"/>
+</div>
+
+Because only a small part of the attention matrix has to be computed, the model can handle inputs with a longer
+sequence length.
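
As a rough illustration of the mask described above, the following sketch builds a boolean attention mask that combines a sliding local window with symmetric global attention. The function name `local_attention_mask` and its parameters `window` and `global_positions` are hypothetical and are not part of the Longformer API.

```python
def local_attention_mask(seq_len, window, global_positions=()):
    """Return a seq_len x seq_len boolean mask; True means attention is allowed."""
    global_set = set(global_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            in_window = abs(i - j) <= window             # local context
            is_global = i in global_set or j in global_set  # symmetric global access
            mask[i][j] = in_window or is_global
    return mask


# 8 tokens, 2 neighbours on each side, token 0 given global attention.
mask = local_attention_mask(8, window=2, global_positions=[0])
allowed = sum(sum(row) for row in mask)  # 44 of the 64 entries are allowed
row_4 = [j for j, ok in enumerate(mask[4]) if ok]  # [0, 2, 3, 4, 5, 6]
```

For a fixed window, the number of allowed entries grows roughly linearly with the sequence length, whereas full attention grows quadratically.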
+
+## Other tricks
+
+### Axial positional encodings
+
+[Reformer](#reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
+E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
+hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
+that, axial positional encodings consist of factorizing that big matrix E into two smaller matrices E1 and E2, with
+dimensions \\(l_{1} \times d_{1}\\) and \\(l_{2} \times d_{2}\\), such that \\(l_{1} \times l_{2} = l\\) and
+\\(d_{1} + d_{2} = d\\) (with the product for the lengths, this ends up being way smaller). The embedding for time
+step \\(j\\) in E is obtained by concatenating the embeddings for time step \\(j \% l_{1}\\) in E1 and
+\\(j // l_{1}\\) in E2.
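
The lookup rule above can be checked with a small sketch. The sizes below (`l1 = 4`, `l2 = 3`, `d1 = 2`, `d2 = 5`) and the helper name `axial_position_embedding` are illustrative assumptions; the matrices are filled with random values only to have something to index.

```python
import random


def axial_position_embedding(j, E1, E2):
    """Embedding for position j: row (j % l1) of E1 concatenated with row (j // l1) of E2."""
    l1 = len(E1)
    return E1[j % l1] + E2[j // l1]  # list concatenation, length d1 + d2


l1, l2, d1, d2 = 4, 3, 2, 5
rng = random.Random(0)
E1 = [[rng.random() for _ in range(d1)] for _ in range(l1)]
E2 = [[rng.random() for _ in range(d2)] for _ in range(l2)]

full_params = (l1 * l2) * (d1 + d2)  # 12 * 7 = 84 values for the full matrix E
axial_params = l1 * d1 + l2 * d2     # 8 + 15 = 23 values for E1 and E2 together

embedding = axial_position_embedding(7, E1, E2)  # uses E1[3] and E2[1]
```

Every position `j` below `l1 * l2` maps to a distinct pair of rows, so no two positions share the same concatenated embedding as long as the rows within each matrix differ.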
--- a/docs/source/en/model_doc/albert.mdx
+++ b/docs/source/en/model_doc/albert.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # ALBERT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=albert">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/albert-base-v2">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
@ -39,6 +48,10 @@ Tips:
 - ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
  number of (repeating) layers.
+- Embedding size E is different from hidden size H. This is justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E < H, it has fewer parameters.
+- Layers are split in groups that share parameters (to save memory).
+- Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
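
To get a feel for the savings from choosing E < H, here is a back-of-the-envelope count. The sizes (V = 30,000, E = 128, H = 768) are illustrative assumptions, and the count assumes a linear projection from the embedding size to the hidden size, which adds an E x H term on top of the V x E matrix mentioned above.

```python
V, E, H = 30_000, 128, 768  # illustrative vocab, embedding and hidden sizes

tied = V * H                # embedding matrix when E is tied to H
factorized = V * E + E * H  # V x E embeddings plus an assumed E x H projection

print(tied, factorized)     # 23040000 3938304
```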
+

 This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by
 [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
--- a/docs/source/en/model_doc/bart.mdx
+++ b/docs/source/en/model_doc/bart.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # BART

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=bart">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/bart-large-mnli">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 **DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten

@ -36,6 +45,13 @@ Tips:

 - BART is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
+- Sequence-to-sequence model with an encoder and a decoder. The encoder is fed a corrupted version of the tokens, and the decoder is fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of the following transformations is applied on the pretraining tasks for the encoder:
+
+  * mask random tokens (like in BERT)
+  * delete random tokens
+  * mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
+  * permute sentences
+  * rotate the document to make it start at a specific token

 This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart).

--- a/docs/source/en/model_doc/bert.mdx
+++ b/docs/source/en/model_doc/bert.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # BERT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=bert">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/bert-base-uncased">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The BERT model was proposed in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a
@ -38,6 +47,15 @@ Tips:
  the left.
 - BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.
+- Corrupts the inputs by using random masking. More precisely, during pretraining, a given percentage of tokens (usually 15%) is masked by:
+
+    * a special mask token with probability 0.8
+    * a random token different from the one masked with probability 0.1
+    * the same token with probability 0.1
+    
+- The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50% they are not related. The model has to predict if the sentences are consecutive or not.
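
The masking rule in the tip above can be sketched as follows. This is a minimal illustration operating on string tokens, not the library's data collator; the function name `mask_tokens` and its parameters are hypothetical, and `vocab` is assumed to contain at least two distinct tokens.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Corrupt tokens with the 80/10/10 rule; labels hold the original token where one was selected."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Not selected: keep the token and give the model nothing to predict here.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        draw = rng.random()
        if draw < 0.8:
            corrupted.append(mask_token)  # special mask token
        elif draw < 0.9:
            # Random token different from the original one.
            corrupted.append(rng.choice([v for v in vocab if v != token]))
        else:
            corrupted.append(token)       # unchanged, but still predicted
    return corrupted, labels


tokens = "the cat sat on the mat".split() * 50
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab)
```

Positions whose label is `None` are left untouched and would not contribute to the masked language modeling loss.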
+
+

 This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/bert).

--- a/docs/source/en/model_doc/convbert.mdx
+++ b/docs/source/en/model_doc/convbert.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # ConvBERT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=convbert">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/conv-bert-base">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The ConvBERT model was proposed in [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
--- a/docs/source/en/model_doc/ctrl.mdx
+++ b/docs/source/en/model_doc/ctrl.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # CTRL

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=ctrl">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/tiny-ctrl">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
--- a/docs/source/en/model_doc/distilbert.mdx
+++ b/docs/source/en/model_doc/distilbert.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # DistilBERT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=distilbert">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/distilbert-base-uncased">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
@ -41,6 +50,11 @@ Tips:
  separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
 - DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if
  necessary though, just let us know if you need this option.
+- Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it’s been trained to predict the same probabilities as the larger model. The actual objective is a combination of:
+
+    * finding the same probabilities as the teacher model
+    * predicting the masked tokens correctly (but no next-sentence objective)
+    * a cosine similarity between the hidden states of the student and the teacher model

 This model was contributed by [victorsanh](https://huggingface.co/victorsanh). This model jax version was
 contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
--- a/docs/source/en/model_doc/dpr.mdx
+++ b/docs/source/en/model_doc/dpr.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # DPR

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=dpr">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/dpr-question_encoder-bert-base-multilingual">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
@ -30,6 +39,12 @@ benchmarks.*

 This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR).

+Tips:
+- DPR consists of three models:
+
+    * Question encoder: encodes questions as vectors
+    * Context encoder: encodes contexts as vectors
+    * Reader: extracts the answer to the question from the retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).

 ## DPRConfig

--- a/docs/source/en/model_doc/electra.mdx
+++ b/docs/source/en/model_doc/electra.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # ELECTRA

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=electra">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/electra_large_discriminator_squad2_512">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The ELECTRA model was proposed in the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
@ -44,6 +53,7 @@ Tips:
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
+- ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA has to predict which token is an original and which one has been replaced. Like for GAN training, the small language model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a traditional GAN setting) then the ELECTRA model is trained for a few steps.
 - The ELECTRA checkpoints saved using [Google Research's implementation](https://github.com/google-research/electra)
  contain both the generator and discriminator. The conversion script requires the user to name which model to export
  into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
--- a/docs/source/en/model_doc/flaubert.mdx
+++ b/docs/source/en/model_doc/flaubert.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # FlauBERT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=flaubert">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/flaubert_small_cased">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The FlauBERT model was proposed in the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le et al. It's a transformer model pretrained using a masked language
@ -34,6 +43,10 @@ community for further reproducible experiments in French NLP.*

 This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert).

+Tips:
+- Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
+
+

 ## FlaubertConfig

--- a/docs/source/en/model_doc/funnel.mdx
+++ b/docs/source/en/model_doc/funnel.mdx
@ -12,6 +12,16 @@ specific language governing permissions and limitations under the License.

 # Funnel Transformer

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=funnel">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/funnel-transformer-small">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
+
 ## Overview

 The Funnel Transformer model was proposed in the paper [Funnel-Transformer: Filtering out Sequential Redundancy for
@ -35,15 +45,16 @@ comprehension.*

 Tips:

- Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers.
+- Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers. This way, their length is divided by 2, which speeds up the computation of the next hidden states.
  The base model therefore has a final sequence length that is a quarter of the original one. This model can be used
  directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
  tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
  sequence length as the input.
+- For tasks such as classification, this is not a problem, but for tasks like masked language modeling or token classification, we need a hidden state with the same sequence length as the original input. In those cases, the final hidden states are upsampled to the input sequence length and go through two additional layers. That's why there are two versions of each checkpoint. The version suffixed with “-base” contains only the three blocks, while the version without that suffix contains the three blocks and the upsampling head with its additional layers.
 - The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
  used for [`FunnelModel`], [`FunnelForPreTraining`],
  [`FunnelForMaskedLM`], [`FunnelForTokenClassification`] and
-  class:*~transformers.FunnelForQuestionAnswering*. The second ones should be used for
+  [`FunnelForQuestionAnswering`]. The second ones should be used for
  [`FunnelBaseModel`], [`FunnelForSequenceClassification`] and
  [`FunnelForMultipleChoice`].

--- a/docs/source/en/model_doc/gpt2.mdx
+++ b/docs/source/en/model_doc/gpt2.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # OpenAI GPT2

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=gpt2">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/gpt2">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec
--- a/docs/source/en/model_doc/longformer.mdx
+++ b/docs/source/en/model_doc/longformer.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # Longformer

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=longformer">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/longformer-base-4096-finetuned-squadv1">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The Longformer model was presented in [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
@ -33,6 +42,7 @@ Tips:
 - Since the Longformer is based on RoBERTa, it doesn't have `token_type_ids`. You don't need to indicate which
  token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or
  `</s>`).
+- A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g., what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are still given global attention, but the attention matrix has far fewer non-zero entries, resulting in a speed-up. See the local attention section for more information.

 This model was contributed by [beltagy](https://huggingface.co/beltagy). The Authors' code can be found [here](https://github.com/allenai/longformer).

--- a/docs/source/en/model_doc/marian.mdx
+++ b/docs/source/en/model_doc/marian.mdx
@ -12,11 +12,24 @@ specific language governing permissions and limitations under the License.

 # MarianMT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=marian">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/opus-mt-zh-en">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 **Bugs:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
 and assign @patrickvonplaten.

 Translations should be similar, but not identical to output in the test set linked to in each model card.

+Tips:
+
+- A framework for translation models, using the same models as BART.
+
 ## Implementation Notes

 - Each model is about 298 MB on disk, there are more than 1,000 models.
--- a/docs/source/en/model_doc/mbart.mdx
+++ b/docs/source/en/model_doc/mbart.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # MBart and MBart-50

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=mbart">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/mbart-large-50-one-to-many-mmt">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 **DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten

--- a/docs/source/en/model_doc/mt5.mdx
+++ b/docs/source/en/model_doc/mt5.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # mT5

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=mt5">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mt5-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/mt5-small-finetuned-arxiv-cs-finetuned-arxiv-cs-full">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The mT5 model was presented in [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya
--- a/docs/source/en/model_doc/openai-gpt.mdx
+++ b/docs/source/en/model_doc/openai-gpt.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # OpenAI GPT

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=openai-gpt">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/openai-gpt">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 OpenAI GPT model was proposed in [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
--- a/docs/source/en/model_doc/pegasus.mdx
+++ b/docs/source/en/model_doc/pegasus.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # Pegasus

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=pegasus">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/pegasus_paraphrase">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 **DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
 and assign @patrickvonplaten.

@ -29,6 +38,12 @@ According to the abstract,

 This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/google-research/pegasus).

+Tips:
+
+- Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining objective, called Gap Sentence Generation (GSG).
+
+  * MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in BERT)
+  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal mask to hide the future words like a regular auto-regressive transformer decoder.

 ## Checkpoints

--- a/docs/source/en/model_doc/prophetnet.mdx
+++ b/docs/source/en/model_doc/prophetnet.mdx
@ -12,6 +12,16 @@ specific language governing permissions and limitations under the License.

 # ProphetNet

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=prophetnet">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-prophetnet-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/prophetnet-large-uncased">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
+
 **DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten

@ -39,6 +49,7 @@ Tips:

 - ProphetNet is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
+- The model architecture is based on the original Transformer, but replaces the “standard” self-attention mechanism in the decoder by a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.

 The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).

--- a/docs/source/en/model_doc/rag.mdx
+++ b/docs/source/en/model_doc/rag.mdx
@ -12,6 +12,12 @@ specific language governing permissions and limitations under the License.

 # RAG

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=rag">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
+</a>
+</div>
+
 ## Overview

 Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
@ -42,6 +48,8 @@ parametric-only seq2seq baseline.*

 This model was contributed by [ola13](https://huggingface.co/ola13).

+Tips:
+- Retrieval-augmented generation (“RAG”) models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.

 ## RagConfig

--- a/docs/source/en/model_doc/reformer.mdx
+++ b/docs/source/en/model_doc/reformer.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # Reformer

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=reformer">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/reformer-crime-and-punishment">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 **DISCLAIMER:** This model is still a work in progress, if you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).

 ## Overview
@ -31,9 +40,13 @@ while being much more memory-efficient and much faster on long sequences.*
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be
 found [here](https://github.com/google/trax/tree/master/trax/models/reformer).

-**Note**:
+Tips:

- Reformer does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035)
+- Reformer does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035).
+- Use Axial position encoding (see below for more details). It’s a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller matrices.
+- Replace traditional attention by LSH (locality-sensitive hashing) attention (see below for more details). It’s a technique to avoid computing the full product query-key in the attention layers.
+- Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them for results inside a given layer (less efficient than storing them but saves memory).
+- Compute the feedforward operations by chunks and not on the whole batch.
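
The reversible-layer tip above can be illustrated with a minimal sketch. It assumes the usual reversible residual formulation (`y1 = x1 + F(x2)`, `y2 = x2 + G(y1)`); the functions `f` and `g` are hypothetical toy stand-ins for the attention and feed-forward sub-layers, not Reformer's actual code, and integers are used so the round trip is exact.

```python
def f(x):
    # Hypothetical stand-in for the attention sub-layer.
    return [2 * v + 1 for v in x]


def g(x):
    # Hypothetical stand-in for the feed-forward sub-layer.
    return [v * v for v in x]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def reversible_forward(x1, x2):
    # Each half of the input receives a residual computed from the other half.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def reversible_inverse(y1, y2):
    # Subtracting the residuals from the outputs gives the inputs back,
    # so the inputs never need to be stored for the backward pass.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


x1, x2 = [1, 2, 3], [4, 5, 6]
y1, y2 = reversible_forward(x1, x2)
recovered = reversible_inverse(y1, y2)
```

Here `recovered` equals `(x1, x2)`, which is why only the outputs of the last layer have to be kept in memory.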

 ## Axial Positional Encodings

--- a/docs/source/en/model_doc/roberta.mdx
+++ b/docs/source/en/model_doc/roberta.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # RoBERTa

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=roberta">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/roberta-base">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
@ -39,6 +48,12 @@ Tips:
  different pretraining scheme.
 - RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just
  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
+- Same as BERT with better pretraining tricks:
+
+    * dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
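+    * no NSP (next sentence prediction) loss, and instead of putting just two sentences together, put a chunk of contiguous texts together to reach 512 tokens (so the sentences are in an order that may span several documents)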
+    * train with larger batches
+    * use BPE with bytes as a subunit and not characters (because of unicode characters)
 - [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
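
As a rough illustration of the dynamic masking tip, the sketch below contrasts a mask drawn once and reused with a mask redrawn at every epoch. It is a simplification under stated assumptions: the helper `mask_tokens` is hypothetical, whole words stand in for subword tokens, and every selected token is replaced by `<mask>` (the real recipe also leaves some selected tokens unchanged or swaps them for random ones).

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, rng, mask_prob=0.15):
    # Pick about mask_prob of the positions (at least one) and mask them.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the mask is drawn once during preprocessing and reused.
static = mask_tokens(tokens, random.Random(0))
static_epochs = [static for _ in range(3)]

# Dynamic masking: a fresh mask is drawn each time the sentence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]
```

With static masking the model sees the same corrupted sentence at every epoch, while dynamic masking can hide a different position each time.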

 This model was contributed by [julien-c](https://huggingface.co/julien-c). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
--- a/docs/source/en/model_doc/t5.mdx
+++ b/docs/source/en/model_doc/t5.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # T5

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=t5">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/t5-base">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
@ -35,6 +44,8 @@ Tips:
 each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
 different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
 for summarization: *summarize: ...*.
+- The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
+- Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.

 - T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
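
The self-supervised objective described above can be sketched as follows. The helper `span_corrupt` is hypothetical and the dropped positions are passed in by hand instead of being sampled at a 15% rate; the `<extra_id_N>` sentinel names follow the T5 tokenizer convention, and closing the target with a final sentinel is an assumption taken from the paper's example.

```python
def span_corrupt(tokens, drop):
    """Replace each run of consecutive dropped positions with one sentinel."""
    encoder_input, target = [], []
    sentinel = 0
    prev_dropped = False
    for i, tok in enumerate(tokens):
        if i in drop:
            if not prev_dropped:
                # A new run starts: the same sentinel marks it on both sides.
                marker = f"<extra_id_{sentinel}>"
                encoder_input.append(marker)
                target.append(marker)
                sentinel += 1
            target.append(tok)
            prev_dropped = True
        else:
            encoder_input.append(tok)
            prev_dropped = False
    # A final sentinel closes the target sequence.
    target.append(f"<extra_id_{sentinel}>")
    return encoder_input, target


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, drop={2, 3, 8})
```

In this example `corrupted` is `Thank you <extra_id_0> me to your party <extra_id_1> week` and `target` is `<extra_id_0> for inviting <extra_id_1> last <extra_id_2>`.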

--- a/docs/source/en/model_doc/transfo-xl.mdx
+++ b/docs/source/en/model_doc/transfo-xl.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # Transformer XL

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=transfo-xl">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/transfo-xl-wt103">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The Transformer-XL model was proposed in [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan
@ -37,6 +46,9 @@ Tips:
 - Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
  original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
 - Transformer-XL is one of the few models that has no sequence length limit.
+- Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that may span across multiple documents, and segments are fed in order to the model.
+- Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention scores. This allows the model to pay attention to information that was in the previous segment as well as the current one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
+- This changes the positional embeddings to relative positional embeddings (as the regular positional embeddings would give the same results in the current input and the current hidden state at a given position) and requires some adjustments in the way attention scores are computed.
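
The recurrence described in these tips can be pictured with a toy sketch. It is only an illustration under simplifying assumptions: the generator `segments_with_memory` is hypothetical, token ids stand in for the hidden states the real model caches, and the memory holds exactly one previous segment.

```python
def segments_with_memory(token_ids, segment_len):
    """Yield (current segment, attention context) pairs in order."""
    memory = []
    for start in range(0, len(token_ids), segment_len):
        current = token_ids[start:start + segment_len]
        # Queries come from the current segment only; keys and values
        # come from the cached previous segment plus the current one.
        context = memory + current
        yield current, context
        # Cache this segment for the next step.
        memory = current


steps = list(segments_with_memory(list(range(10)), segment_len=4))
```

Each context reaches one segment back; stacking several such layers is what extends the receptive field over multiple previous segments.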

 This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/kimiyoung/transformer-xl).

--- a/docs/source/en/model_doc/xlm-prophetnet.mdx
+++ b/docs/source/en/model_doc/xlm-prophetnet.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # XLM-ProphetNet

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=xprophetnet">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/xprophetnet-large-wiki100-cased-xglue-ntg">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 **DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten

@ -39,6 +48,10 @@ state-of-the-art results on all these datasets compared to the models using the

 The Authors' code can be found [here](https://github.com/microsoft/ProphetNet).

+Tips:
+
+- XLM-ProphetNet's model architecture and pretraining objective are the same as ProphetNet's, but XLM-ProphetNet was pre-trained on the cross-lingual dataset XGLUE.
+
 ## XLMProphetNetConfig

 [[autodoc]] XLMProphetNetConfig
--- a/docs/source/en/model_doc/xlm-roberta.mdx
+++ b/docs/source/en/model_doc/xlm-roberta.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # XLM-RoBERTa

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=xlm-roberta">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/xlm-roberta-base">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
@ -38,6 +47,7 @@ Tips:
 - XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
  not require `lang` tensors to understand which language is used, and should be able to determine the correct
  language from the input ids.
+- Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses masked language modeling on sentences coming from one language.
 - This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
  as well as the information relative to the inputs and outputs.

--- a/docs/source/en/model_doc/xlm.mdx
+++ b/docs/source/en/model_doc/xlm.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # XLM

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=xlm">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/xlm-mlm-en-2048">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The XLM model was proposed in [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by
@ -38,6 +47,11 @@ Tips:
 - XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
  select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
 - XLM has multilingual checkpoints which leverage a specific `lang` parameter. Check out the [multi-lingual](../multilingual) page for more information.
+- A transformer model trained on several languages. There are three different types of training for this model and the library provides checkpoints for all of them:
+
+    * Causal language modeling (CLM), which is the traditional autoregressive training. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span several documents in one of those languages.
+    * Masked language modeling (MLM), which is like RoBERTa. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span several documents in one of those languages, with dynamic masking of the tokens.
+    * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two different languages, with random masking. To predict one of the masked tokens, the model can use both the surrounding context in language 1 and the context given by language 2.

 This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/facebookresearch/XLM/).

--- a/docs/source/en/model_doc/xlnet.mdx
+++ b/docs/source/en/model_doc/xlnet.mdx
@ -12,6 +12,15 @@ specific language governing permissions and limitations under the License.

 # XLNet

+<div class="flex flex-wrap space-x-1">
+<a href="https://huggingface.co/models?filter=xlnet">
+<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
+</a>
+<a href="https://huggingface.co/spaces/docs-demos/xlnet-base-cased">
+<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
+</a>
+</div>
+
 ## Overview

 The XLNet model was proposed in [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov,
@ -40,6 +49,8 @@ Tips:
  `target_mapping` inputs to control the attention span and outputs (see examples in
  *examples/pytorch/text-generation/run_generation.py*)
 - XLNet is one of the few models that has no sequence length limit.
+- XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the tokens in the sentence, then allows the model to use the n tokens that come before token n+1 in that permutation to predict it. Since this is all done with a mask, the sentence is actually fed into the model in the right order, but instead of the usual causal mask, in which token n+1 only sees the first n tokens, XLNet uses a mask in which each token only sees the tokens that precede it in some given permutation of 1,…,sequence length.
+- XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
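
A small sketch may help picture the permutation mask. The function `permutation_mask` is hypothetical, and the convention that `True` means "may attend" is chosen for this illustration only; it is not taken from the library's inputs.

```python
def permutation_mask(order):
    """order[k] is the position predicted at step k of the factorization.

    Returns mask[i][j], True when position i may attend to position j.
    """
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


# Positions stay in their natural order 0..3; only the mask changes.
mask = permutation_mask([2, 0, 3, 1])
```

With the order `[2, 0, 3, 1]`, position 2 sees nothing, position 0 sees position 2, position 3 sees positions 2 and 0, and position 1 sees all three others. Passing `[0, 1, 2, 3]` gives back the ordinary left-to-right causal mask.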

 This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/zihangdai/xlnet/).

--- a/docs/source/en/model_summary.mdx
+++ b/docs/source/en/model_summary.mdx
@ -10,946 +10,94 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Summary of the models
+# The Transformer model family

-This is a summary of the most downloaded models in 🤗 Transformers. Click on the large outermost bubble of each pretrained model category (encoder, decoder, encoder-decoder) to zoom in and out to see the most popular models within a modality. The size of each bubble corresponds to the number of downloads of each model.
+Since its introduction in 2017, the [original Transformer](https://arxiv.org/abs/1706.03762) model has inspired many new and exciting models that extend beyond natural language processing (NLP) tasks. There are models for [predicting the folded structure of proteins](https://huggingface.co/blog/deep-learning-with-proteins), [training a cheetah to run](https://huggingface.co/blog/train-decision-transformers), and [time series forecasting](https://huggingface.co/blog/time-series-transformers). With so many Transformer variants available, it can be easy to miss the bigger picture. What all these models have in common is they're based on the original Transformer architecture. Some models only use the encoder or decoder, while others use both. This provides a useful taxonomy to categorize and examine the high-level differences within models in the Transformer family, and it'll help you understand Transformers you haven't encountered before.

-<iframe width="100%" height="900" frameborder="0"
-  src="https://observablehq.com/embed/eafbe39385aaf8f2?cells=chart"></iframe>
+If you aren't familiar with the original Transformer model or need a refresher, check out the [How do Transformers work](https://huggingface.co/course/chapter1/4?fw=pt) chapter from the Hugging Face course.

-It assumes you're familiar with the original [transformer
-model](https://arxiv.org/abs/1706.03762). For a gentle introduction check the [annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html). Here we focus on the high-level differences between the
-models. You can check them more in detail in their respective documentation. Also check out [the Model Hub](https://huggingface.co/models) where you can filter the checkpoints by model architecture.
-
-Each one of the models in the library falls into one of the following categories:
-
- [autoregressive-models](#autoregressive-models)
- [autoencoding-models](#autoencoding-models)
- [seq-to-seq-models](#seq-to-seq-models)
- [multimodal-models](#multimodal-models)
- [retrieval-based-models](#retrieval-based-models)
-
-<iframe width="560" height="315" src="https://www.youtube.com/embed/H39Z_720T5s" title="YouTube video player"
-frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
-picture-in-picture" allowfullscreen></iframe>
-
-Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
-previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
-sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
-models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
-typical example of such models is GPT.
-
-Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
-sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
-full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can
-be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is
-sentence classification or token classification. A typical example of such models is BERT.
-
-Note that the only difference between autoregressive models and autoencoding models is in the way the model is
-pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
-model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
-was first introduced.
-
-Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
-tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
-most natural applications are translation, summarization and question answering. The original transformer model is an
-example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
-
-Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
-
-<a id='autoregressive-models'></a>
-
-## Decoders or autoregressive models
-
-As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
-that at each position, the model can only look at the tokens before the attention heads.
-
-<Youtube id="d_ixlCubqQw"/>
-
-### Original GPT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=openai-gpt">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
-</a>
-<a href="model_doc/openai-gpt">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/openai-gpt">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
+<div align="center">
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/H39Z_720T5s" title="YouTube video player"
+    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
+    picture-in-picture" allowfullscreen></iframe>
 </div>

+## Computer vision

-[Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), Alec Radford et al.
+<iframe style="border: 1px solid rgba(0, 0, 0, 0.1);" width="1000" height="450" src="https://www.figma.com/embed?embed_host=share&url=https%3A%2F%2Fwww.figma.com%2Ffile%2FacQBpeFBVvrDUlzFlkejoz%2FModelscape-timeline%3Fnode-id%3D0%253A1%26t%3Dm0zJ7m2BQ9oe0WtO-1" allowfullscreen></iframe> 

-The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
+### Convolutional network

-The library provides versions of the model for language modeling and multitask language modeling/multiple choice
-classification.
+For a long time, convolutional networks (CNNs) were the dominant paradigm for computer vision tasks until the [Vision Transformer](https://arxiv.org/abs/2010.11929) demonstrated its scalability and efficiency. Even then, some of a CNN's best qualities, like translation invariance, are so powerful (especially for certain tasks) that some Transformers incorporate convolutions in their architecture. [ConvNeXt](model_doc/convnext) flipped this exchange around and incorporated design choices from Transformers to modernize a CNN. For example, ConvNeXt uses non-overlapping sliding windows to patchify an image and a larger kernel to increase its global receptive field. ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers!

-### GPT-2
+### Encoder[[cv-encoder]]

-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=gpt2">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
-</a>
-<a href="model_doc/gpt2">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/gpt2">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
+The [Vision Transformer (ViT)](model_doc/vit) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treated an image. It splits an image into fixed-size patches and uses them to create an embedding, just like how a sentence is split into tokens. ViT capitalized on the Transformers' efficient architecture to demonstrate competitive results with the CNNs at the time while requiring fewer resources to train. ViT was soon followed by other vision models that could also handle dense vision tasks like segmentation as well as detection.
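
To make the patch idea concrete, here is a minimal sketch that splits a single-channel image into flattened, fixed-size patches. The helper `patchify` is hypothetical and works on nested lists; a real model also handles color channels and projects every patch to an embedding.

```python
def patchify(image, patch_size):
    """Split an H x W image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    # Walk the image patch by patch, in row-major order.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4 x 4 toy image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
```

The toy image yields four patches of four pixels each. By the same arithmetic, a 224 x 224 image cut into 16 x 16 patches gives a sequence of 14 x 14 = 196 patches.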

+One of these models is the [Swin](model_doc/swin) Transformer. It builds hierarchical feature maps (like a CNN 👀 and unlike ViT) from smaller-sized patches and merges them with neighboring patches in deeper layers. Attention is only computed within a local window, and the window is shifted between attention layers to create connections to help the model learn better. Since the Swin Transformer can produce hierarchical feature maps, it is a good candidate for dense prediction tasks like segmentation and detection. The [SegFormer](model_doc/segformer) also uses a Transformer encoder to build hierarchical feature maps, but it adds a simple multilayer perceptron (MLP) decoder on top to combine all the feature maps and make a prediction.

-[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf),
-Alec Radford et al.
+Other vision models, like BEiT and ViTMAE, drew inspiration from BERT's pretraining objective. [BEiT](model_doc/beit) is pretrained by *masked image modeling (MIM)*; the image patches are randomly masked, and the image is also tokenized into visual tokens. BEiT is trained to predict the visual tokens corresponding to the masked patches. [ViTMAE](model_doc/vitmae) has a similar pretraining objective, except it must predict the pixels instead of visual tokens. What's unusual is that 75% of the image patches are masked! The decoder reconstructs the pixels from the masked tokens and encoded patches. After pretraining, the decoder is thrown away, and the encoder is ready to be used in downstream tasks.

-A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
-more).
+### Decoder[[cv-decoder]]

-The library provides versions of the model for language modeling and multitask language modeling/multiple choice
-classification.
+Decoder-only vision models are rare because most vision models rely on an encoder to learn an image representation. But for use cases like image generation, the decoder is a natural fit, as we've seen from text generation models like GPT-2. [ImageGPT](model_doc/imagegpt) uses the same architecture as GPT-2, but instead of predicting the next token in a sequence, it predicts the next pixel in an image. In addition to image generation, ImageGPT could also be finetuned for image classification.

-### CTRL
+### Encoder-decoder[[cv-encoder-decoder]]

-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=ctrl">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
-</a>
-<a href="model_doc/ctrl">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/tiny-ctrl">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
+Vision models commonly use an encoder (also known as a backbone) to extract important image features before passing them to a Transformer decoder. [DETR](model_doc/detr) has a pretrained backbone, but it also uses the complete Transformer encoder-decoder architecture for object detection. The encoder learns image representations and combines them with object queries (each object query is a learned embedding that focuses on a region or object in an image) in the decoder. DETR predicts the bounding box coordinates and class label for each object query.

+## Natural language processing

-[CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858),
-Nitish Shirish Keskar et al.
+<iframe style="border: 1px solid rgba(0, 0, 0, 0.1);" width="1000" height="450" src="https://www.figma.com/embed?embed_host=share&url=https%3A%2F%2Fwww.figma.com%2Ffile%2FUhbQAZDlpYW5XEpdFy6GoG%2Fnlp-model-timeline%3Fnode-id%3D0%253A1%26t%3D4mZMr4r1vDEYGJ50-1" allowfullscreen></iframe>

-Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (can be empty) and one (or
-several) of those control codes which are then used to influence the text generation: generate with the style of
-wikipedia article, a book or a movie review.
+### Encoder[[nlp-encoder]]

-The library provides a version of the model for language modeling only.
+[BERT](model_doc/bert) is an encoder-only Transformer that randomly masks certain tokens in the input so the model can't see them, which would otherwise allow it to "cheat". The pretraining objective is to predict the masked token based on the context. This allows BERT to fully use the left and right contexts to help it learn a deeper and richer representation of the inputs. However, there was still room for improvement in BERT's pretraining strategy. [RoBERTa](model_doc/roberta) improved upon this by introducing a new pretraining recipe that includes training for longer and on larger batches, randomly masking tokens at each epoch instead of just once during preprocessing, and removing the next-sentence prediction objective.

-### Transformer-XL
+The dominant strategy to improve performance is to increase the model size. But training large models is computationally expensive. One way to reduce computational costs is using a smaller model like [DistilBERT](model_doc/distilbert). DistilBERT uses [knowledge distillation](https://arxiv.org/abs/1503.02531) - a compression technique - to create a smaller version of BERT while keeping nearly all of its language understanding capabilities. 

-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=transfo-xl">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
-</a>
-<a href="model_doc/transfo-xl">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/transfo-xl-wt103">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
+However, most Transformer models continued to trend towards more parameters, leading to new models focused on improving training efficiency. [ALBERT](model_doc/albert) reduces memory consumption by lowering the number of parameters in two ways: separating the larger vocabulary embedding into two smaller matrices and allowing layers to share parameters. [DeBERTa](model_doc/deberta) added a disentangled attention mechanism where the word and its position are separately encoded in two vectors. The attention is computed from these separate vectors instead of a single vector containing the word and position embeddings. [Longformer](model_doc/longformer) also focused on making attention more efficient, especially for processing documents with longer sequence lengths. It uses a combination of local windowed attention (attention only calculated from fixed window size around each token) and global attention (only for specific task tokens like `[CLS]` for classification) to create a sparse attention matrix instead of a full attention matrix.
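
The sparse pattern described for Longformer can be sketched as a boolean mask. The function `sparse_attention_mask` and its parameters are hypothetical, and `True` meaning "may attend" is a convention chosen for this sketch; an efficient implementation never materializes the full matrix.

```python
def sparse_attention_mask(seq_len, window, global_positions):
    """mask[i][j] is True when query i may attend to key j."""
    half = window // 2
    # Local windowed attention: each token sees its close neighbors.
    mask = [
        [abs(i - j) <= half for j in range(seq_len)]
        for i in range(seq_len)
    ]
    # Global attention: chosen tokens see everything and are seen by all.
    for g in global_positions:
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True
    return mask


mask = sparse_attention_mask(seq_len=8, window=2, global_positions=[0])
allowed = sum(sum(row) for row in mask)
```

For this toy sequence only 34 of the 64 query-key pairs are computed, and the gap widens as the sequence grows because the local part scales linearly with length.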

+### Decoder[[nlp-decoder]]

-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860), Zihang
-Dai et al.
+[GPT-2](model_doc/gpt2) is a decoder-only Transformer that predicts the next word in the sequence. It masks tokens to the right so the model can't "cheat" by looking ahead. By pretraining on a massive body of text, GPT-2 became really good at generating text, even if the text is only sometimes accurate or true. But GPT-2 lacked the bidirectional context from BERT's pretraining, which made it unsuitable for certain tasks. [XLNet](model_doc/xlnet) combines the best of both BERT and GPT-2's pretraining objectives by using a permutation language modeling objective (PLM) that allows it to learn bidirectionally.

-Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
-RNNs with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
-may span across multiple documents, and segments are fed in order to the model.
+After GPT-2, language models grew even bigger and are now known as *large language models (LLMs)*. LLMs demonstrate few- or even zero-shot learning if pretrained on a large enough dataset. [GPT-J](model_doc/gptj) is an LLM with 6B parameters trained on 400B tokens. GPT-J was followed by [OPT](model_doc/opt), a family of decoder-only models, the largest of which has 175B parameters and is trained on 180B tokens. [BLOOM](model_doc/bloom) was released around the same time, and the largest model in the family has 176B parameters and is trained on 366B tokens in 46 languages and 13 programming languages.

-Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention
-scores. This allows the model to pay attention to information that was in the previous segment as well as the current
-one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
+### Encoder-decoder[[nlp-encoder-decoder]]

-This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would
-give the same results in the current input and the current hidden state at a given position) and needs to make some
-adjustments in the way attention scores are computed.
+[BART](model_doc/bart) keeps the original Transformer architecture, but it modifies the pretraining objective with *text infilling* corruption, where some text spans are replaced with a single `mask` token. The decoder predicts the uncorrupted tokens (future tokens are masked) and uses the encoder's hidden states to help it. [Pegasus](model_doc/pegasus) is similar to BART, but Pegasus masks entire sentences instead of text spans. In addition to masked language modeling, Pegasus is pretrained by gap sentence generation (GSG). The GSG objective masks whole sentences important to a document, replacing them with a `mask` token. The decoder must generate the output from the remaining sentences. [T5](model_doc/t5) is a more unusual model that casts all NLP tasks into a text-to-text problem using specific prefixes. For example, the prefix `summarize:` indicates a summarization task. T5 is pretrained by supervised (GLUE and SuperGLUE) training and self-supervised training (randomly sample and drop out 15% of tokens).
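
A minimal sketch of text infilling, assuming the spans are given by hand as `(start, length)` pairs instead of being sampled. The helper `text_infilling` and the literal `<mask>` string are illustrative choices, not the library's preprocessing code.

```python
def text_infilling(tokens, spans, mask_token="<mask>"):
    """Replace each (start, length) span with a single mask token.

    Spans are assumed to be sorted and non-overlapping.
    """
    corrupted = []
    cursor = 0
    for start, length in spans:
        # Copy the untouched tokens before the span, then one mask token.
        corrupted.extend(tokens[cursor:start])
        corrupted.append(mask_token)
        cursor = start + length
    corrupted.extend(tokens[cursor:])
    return corrupted


tokens = "the cat sat on the warm mat".split()
corrupted = text_infilling(tokens, spans=[(1, 2), (5, 1)])
```

Here `corrupted` is `the <mask> on the <mask> mat`. Because a two-token span and a one-token span both collapse to a single mask token, the model also has to work out how many tokens are missing.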

-The library provides a version of the model for language modeling only.
+## Audio

-<a id='reformer'></a>
+<iframe style="border: 1px solid rgba(0, 0, 0, 0.1);" width="1000" height="450" src="https://www.figma.com/embed?embed_host=share&url=https%3A%2F%2Fwww.figma.com%2Ffile%2Fvrchl8jDV9YwNVPWu2W0kK%2Fspeech-and-audio-model-timeline%3Fnode-id%3D0%253A1%26t%3DmM4H8pPMuK23rClL-1" allowfullscreen></iframe>

-### Reformer
+### Encoder[[audio-encoder]]

-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=reformer">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
-</a>
-<a href="model_doc/reformer">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/reformer-crime-and-punishment">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
+[Wav2Vec2](model_doc/wav2vec2) uses a Transformer encoder to learn speech representations directly from raw audio waveforms. It is pretrained with a contrastive task to determine the true speech representation from a set of false ones. [HuBERT](model_doc/hubert) is similar to Wav2Vec2 but has a different training process. Target labels are created by a clustering step in which segments of similar audio are assigned to a cluster, which becomes a hidden unit. The hidden unit is mapped to an embedding to make a prediction.
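
The clustering step described for HuBERT can be pictured as a nearest-centroid assignment. The sketch below assumes a k-means-style setup in which the centroids have already been fitted; the two-dimensional frame features, the centroid values, and the function names are all made up for illustration and are not taken from the model's actual pipeline.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    distances = [
        sum((f - c) ** 2 for f, c in zip(frame, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def assign_hidden_units(frames, centroids):
    """Map every audio frame to a cluster id, which serves as its target label."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy frame features and centroids (hypothetical values).
frames = [[0.1, 0.2], [0.9, 1.1], [0.0, 0.1], [1.0, 0.9]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

hidden_units = assign_hidden_units(frames, centroids)
print(hidden_units)  # [0, 1, 0, 1]
```

Frames that sound alike land in the same cluster, so the cluster id can act as a discrete label even though no transcript is available.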

+### Encoder-decoder[[audio-encoder-decoder]]

-[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451), Nikita Kitaev et al .
+[Speech2Text](model_doc/speech_to_text) is a speech model designed for automatic speech recognition (ASR) and speech translation. The model accepts log mel-filter bank features extracted from the audio waveform and is pretrained autoregressively to generate a transcript or translation. [Whisper](model_doc/whisper) is also an ASR model, but unlike many other speech models, it is pretrained on a massive amount of ✨ labeled ✨ audio transcription data for zero-shot performance. A large portion of the dataset is also in non-English languages, meaning Whisper can also be used for low-resource languages. Structurally, Whisper is similar to Speech2Text. The audio signal is converted to a log-mel spectrogram, which is encoded by the encoder. The decoder generates the transcript autoregressively from the encoder's hidden states and the previous tokens.

-An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
-include:
+## Multimodal

- Use [Axial position encoding](#axial-pos-encoding) (see below for more details). It’s a mechanism to avoid
-  having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
-  matrices.
- Replace traditional attention by [LSH (local-sensitive hashing) attention](#lsh-attention) (see below for more
-  details). It's a technique to avoid computing the full product query-key in the attention layers.
- Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
-  the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
-  for results inside a given layer (less efficient than storing them but saves memory).
- Compute the feedforward operations by chunks and not on the whole batch.
+<iframe style="border: 1px solid rgba(0, 0, 0, 0.1);" width="1000" height="450" src="https://www.figma.com/embed?embed_host=share&url=https%3A%2F%2Fwww.figma.com%2Ffile%2FcX125FQHXJS2gxeICiY93p%2Fmultimodal%3Fnode-id%3D0%253A1%26t%3DhPQwdx3HFPWJWnVf-1" allowfullscreen></iframe>

-With those tricks, the model can be fed much larger sentences than traditional transformer autoregressive models.
+### Encoder[[mm-encoder]]

-<Tip>
+[VisualBERT](model_doc/visual_bert) is a multimodal model for vision-language tasks released shortly after BERT. It combines BERT and a pretrained object detection system to extract image features into visual embeddings, which are passed alongside text embeddings to BERT. VisualBERT predicts the masked text based on the unmasked text and the visual embeddings, and it also has to predict whether the text is aligned with the image. When ViT was released, [ViLT](model_doc/vilt) adopted ViT in its architecture because it was easier to get the image embeddings this way. The image embeddings are jointly processed with the text embeddings. From there, ViLT is pretrained by image-text matching, masked language modeling, and whole word masking.

-This model could be very well be used in an autoencoding setting, there is no checkpoint for such a
-pretraining yet, though.
+[CLIP](model_doc/clip) takes a different approach and makes a pair prediction of (`image`, `text`). An image encoder (ViT) and a text encoder (Transformer) are jointly trained on a dataset of 400 million (`image`, `text`) pairs to maximize the similarity between the image and text embeddings of the (`image`, `text`) pairs. After pretraining, you can use natural language to instruct CLIP to predict the text given an image or vice versa. [OWL-ViT](model_doc/owlvit) builds on top of CLIP by using it as its backbone for zero-shot object detection. After pretraining, an object detection head is added to make a set prediction over the (`class`, `bounding box`) pairs.
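
The idea of maximizing similarity for matching (`image`, `text`) pairs can be made concrete with a small contrastive loss. This is a minimal sketch, assuming a symmetric cross-entropy over temperature-scaled cosine similarities; the fixed temperature of 0.07, the toy two-dimensional embeddings, and all function names are assumptions for illustration, and no real encoders are involved.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embeds, text_embeds, temperature=0.07):
    """Entry [i][j] is the scaled cosine similarity of image i and text j."""
    images = [normalize(v) for v in image_embeds]
    texts = [normalize(v) for v in text_embeds]
    return [
        [sum(a * b for a, b in zip(image, text)) / temperature for text in texts]
        for image in images
    ]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embeds, text_embeds):
    """Average of image-to-text and text-to-image losses; pair i matches pair i."""
    logits = similarity_matrix(image_embeds, text_embeds)
    n = len(logits)
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [list(column) for column in zip(*logits)]
    text_to_image = sum(cross_entropy(columns[i], i) for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy embeddings (hypothetical values): text i is close to image i.
image_embeds = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = contrastive_loss(image_embeds, text_embeds)
shuffled_loss = contrastive_loss(image_embeds, text_embeds[::-1])
print(aligned_loss < shuffled_loss)  # True
```

The loss is low when each image is most similar to its own caption and high when the captions are shuffled, which is the signal that pulls matching embeddings together during training.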

-</Tip>
+### Encoder-decoder[[mm-encoder-decoder]]

-The library provides a version of the model for language modeling only.
+Optical character recognition (OCR) is a long-standing text recognition task that typically involves several components to understand the image and generate the text. [TrOCR](model_doc/trocr) simplifies the process using an end-to-end Transformer. The encoder is a ViT-style model for image understanding and processes the image as fixed-size patches. The decoder accepts the encoder's hidden states and autoregressively generates text. [Donut](model_doc/donut) is a more general visual document understanding model that doesn't rely on OCR-based approaches. It uses a Swin Transformer as the encoder and multilingual BART as the decoder. Donut is pretrained to read text by predicting the next word based on the image and text annotations. The decoder generates a token sequence given a prompt. The prompt is represented by a special token for each downstream task. For example, document parsing has a special `parsing` token that is combined with the encoder hidden states to parse the document into a structured output format (JSON).

-### XLNet
+## Reinforcement learning

-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=xlnet">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
-</a>
-<a href="model_doc/xlnet">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/xlnet-base-cased">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
+<iframe style="border: 1px solid rgba(0, 0, 0, 0.1);" width="1000" height="450" src="https://www.figma.com/embed?embed_host=share&url=https%3A%2F%2Fwww.figma.com%2Ffile%2FiB3Y6RvWYki7ZuKO6tNgZq%2Freinforcement-learning%3Fnode-id%3D0%253A1%26t%3DhPQwdx3HFPWJWnVf-1" allowfullscreen></iframe>

+### Decoder[[rl-decoder]]

-[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237), Zhilin
-Yang et al.
-
-XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
-tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
-with a mask, the sentence is actually fed in the model in the right order, but instead of masking the first n tokens
-for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,...,sequence length.
-
-XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
-
-The library provides a version of the model for language modeling, token classification, sentence classification,
-multiple choice classification and question answering.
-
-<a id='autoencoding-models'></a>
-
-## Encoders or autoencoding models
-
-As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
-look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
-corrupted versions.
-
-<Youtube id="MUqNwgPjJvQ"/>
-
-### BERT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=bert">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
-</a>
-<a href="model_doc/bert">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/bert-base-uncased">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805),
-Jacob Devlin et al.
-
-Corrupts the inputs by using random masking, more precisely, during pretraining, a given percentage of tokens (usually
-15%) is masked by:
-
- a special mask token with probability 0.8
- a random token different from the one masked with probability 0.1
- the same token with probability 0.1
-
-The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
-separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50%
-they are not related. The model has to predict if the sentences are consecutive or not.
-
-The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
-token classification, sentence classification, multiple choice classification and question answering.
-
-### ALBERT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=albert">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
-</a>
-<a href="model_doc/albert">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/albert-base-v2">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942),
-Zhenzhong Lan et al.
-
-Same as BERT but with a few tweaks:
-
- Embedding size E is different from hidden size H justified because the embeddings are context independent (one
-  embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a
-  sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
-  being the vocab size). If E < H, it has less parameters.
- Layers are split in groups that share parameters (to save memory).
- Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
-  B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
-  been swapped or not.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-### RoBERTa
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=roberta">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
-</a>
-<a href="model_doc/roberta">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/roberta-base">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692), Yinhan Liu et al.
-
-Same as BERT with better pretraining tricks:
-
- dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
- no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
-  contiguous texts together to reach 512 tokens (so the sentences are in an order than may span several documents)
- train with larger batches
- use BPE with bytes as a subunit and not characters (because of unicode characters)
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-### DistilBERT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=distilbert">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
-</a>
-<a href="model_doc/distilbert">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/distilbert-base-uncased">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108),
-Victor Sanh et al.
-
-Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
-the same probabilities as the larger model. The actual objective is a combination of:
-
- finding the same probabilities as the teacher model
- predicting the masked tokens correctly (but no next-sentence objective)
- a cosine similarity between the hidden states of the student and the teacher model
-
-The library provides a version of the model for masked language modeling, token classification, sentence classification
-and question answering.
-
-### ConvBERT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=convbert">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
-</a>
-<a href="model_doc/convbert">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-convbert-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/conv-bert-base">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496), Zihang Jiang,
-Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
-
-Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural
-language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large
-memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
-generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
-which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
-replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
-rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
-learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
-ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
-fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
-using less than 1/4 training cost.
-
-The library provides a version of the model for masked language modeling, token classification, sentence classification
-and question answering.
-
-### XLM
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=xlm">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
-</a>
-<a href="model_doc/xlm">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/xlm-mlm-en-2048">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291), Guillaume Lample and Alexis Conneau
-
-A transformer model trained on several languages. There are three different type of training for this model and the
-library provides checkpoints for all of them:
-
- Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
-  previous section as well). One of the languages is selected for each training sample, and the model input is a
-  sentence of 256 tokens, that may span over several documents in one of those languages.
- Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
-  and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
-  with dynamic masking of the tokens.
- A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
-  different languages, with random masking. To predict one of the masked tokens, the model can use both, the
-  surrounding context in language 1 and the context given by language 2.
-
-Checkpoints refer to which method was used for pretraining by having *clm*, *mlm* or *mlm-tlm* in their names. On top
-of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
-indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.
-
-The library provides a version of the model for language modeling, token classification, sentence classification and
-question answering.
-
-### XLM-RoBERTa
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=xlm-roberta">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
-</a>
-<a href="model_doc/xlm-roberta">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/xlm-roberta-base">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116), Alexis Conneau et
-al.
-
-Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses
-masked language modeling on sentences coming from one language. However, the model is trained on many more languages
-(100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-### FlauBERT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=flaubert">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
-</a>
-<a href="model_doc/flaubert">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/flaubert_small_cased">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372), Hang Le et al.
-
-Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
-
-The library provides a version of the model for language modeling and sentence classification.
-
-### ELECTRA
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=electra">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
-</a>
-<a href="model_doc/electra">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/electra_large_discriminator_squad2_512">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555),
-Kevin Clark et al.
-
-ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
-corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
-has to predict which token is an original and which one has been replaced. Like for GAN training, the small language
-model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
-traditional GAN setting) then the ELECTRA model is trained for a few steps.
-
-The library provides a version of the model for masked language modeling, token classification and sentence
-classification.
-
-### Funnel Transformer
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=funnel">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
-</a>
-<a href="model_doc/funnel">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/funnel-transformer-small">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236), Zihang Dai et al.
-
-Funnel Transformer is a transformer model using pooling, a bit like a ResNet model: layers are grouped in blocks, and
-at the beginning of each block (except the first one), the hidden states are pooled among the sequence dimension. This
-way, their length is divided by 2, which speeds up the computation of the next hidden states. All pretrained models
-have three blocks, which means the final hidden state has a sequence length that is one fourth of the original sequence
-length.
-
-For tasks such as classification, this is not a problem, but for tasks like masked language modeling or token
-classification, we need a hidden state with the same sequence length as the original input. In those cases, the final
-hidden states are upsampled to the input sequence length and go through two additional layers. That's why there are two
-versions of each checkpoint. The version suffixed with "-base" contains only the three blocks, while the version
-without that suffix contains the three blocks and the upsampling head with its additional layers.
-
-The pretrained models available use the same pretraining objective as ELECTRA.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-<a id='longformer'></a>
-
-### Longformer
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=longformer">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
-</a>
-<a href="model_doc/longformer">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/longformer-base-4096-finetuned-squadv1">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150), Iz Beltagy et al.
-
-A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
-what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
-still given global attention, but the attention matrix has way less parameters, resulting in a speed-up. See the
-[local attention section](#local-attention) for more information.
-
-It is pretrained the same way a RoBERTa otherwise.
-
-<Tip>
-
-This model could be very well be used in an autoregressive setting, there is no checkpoint for such a
-pretraining yet, though.
-
-</Tip>
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-<a id='seq-to-seq-models'></a>
-
-## Sequence-to-sequence models
-
-As mentioned before, these models keep both the encoder and the decoder of the original transformer.
-
-<Youtube id="0_4KEb08xrE"/>
-
-### BART
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=bart">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
-</a>
-<a href="model_doc/bart">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/bart-large-mnli">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461), Mike Lewis et al.
-
-Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
-fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of
-the following transformations are applied on the pretraining tasks for the encoder:
-
- mask random tokens (like in BERT)
- delete random tokens
- mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
- permute sentences
- rotate the document to make it start at a specific token
-
-The library provides a version of this model for conditional generation and sequence classification.
-
-### Pegasus
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=pegasus">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
-</a>
-<a href="model_doc/pegasus">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/pegasus_paraphrase">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf), Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
-
-Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
-two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
-objective, called Gap Sentence Generation (GSG).
-
- MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
-  BERT)
- GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
-  causal mask to hide the future words like a regular auto-regressive transformer decoder.
-
-In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
-masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
-summary.
-
-The library provides a version of this model for conditional generation, which should be used for summarization.
-
-
-### MarianMT
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=marian">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
-</a>
-<a href="model_doc/marian">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/opus-mt-zh-en">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Marian: Fast Neural Machine Translation in C++](https://arxiv.org/abs/1804.00344), Marcin Junczys-Dowmunt et al.
-
-A framework for translation models, using the same models as BART
-
-The library provides a version of this model for conditional generation.
-
-
-### T5
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=t5">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
-</a>
-<a href="model_doc/t5">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/t5-base">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683), Colin Raffel et al.
-
-Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
-layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
-prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.
-
-The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
-tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
-
-Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
-individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
-single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
-sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
-
-For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and
-"cute", the encoder input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
-
-The library provides a version of this model for conditional generation.
-
-
-### MT5
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=mt5">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mt5-blueviolet">
-</a>
-<a href="model_doc/mt5">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mt5-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/mt5-small-finetuned-arxiv-cs-finetuned-arxiv-cs-full">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934), Linting Xue
-et al.
-
-The model architecture is the same as T5's. mT5's pretraining objective includes T5's self-supervised training, but not T5's
-supervised training. mT5 is trained on 101 languages.
-
-The library provides a version of this model for conditional generation.
-
-
-### MBart
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=mbart">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
-</a>
-<a href="model_doc/mbart">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/mbart-large-50-one-to-many-mmt">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu,
-Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-
-The model architecture and pretraining objective are the same as BART's, but MBart is trained on 25 languages and is intended
-for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
-sequence-to-sequence model by denoising full texts in multiple languages.
-
-The library provides a version of this model for conditional generation.
-
-The [mbart-large-en-ro checkpoint](https://huggingface.co/facebook/mbart-large-en-ro) can be used for English ->
-Romanian translation.
-
-The [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) checkpoint can be finetuned for other
-translation and summarization tasks, using code in `examples/pytorch/translation/`, but is not very useful without
-finetuning.
-
-
-### ProphetNet
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=prophetnet">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-prophetnet-blueviolet">
-</a>
-<a href="model_doc/prophetnet">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-prophetnet-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/prophetnet-large-uncased">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://arxiv.org/abs/2001.04063) by
-Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
-
-ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
-future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
-time step instead of just the single next token. The future n-gram prediction explicitly encourages the model
-to plan for the future tokens and prevents overfitting on strong local correlations. The model architecture is based on
-the original Transformer, but replaces the "standard" self-attention mechanism in the decoder with a main
-self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
-
-The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
-summarization.
-
-### XLM-ProphetNet
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=xprophetnet">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
-</a>
-<a href="model_doc/xlm-prophetnet">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xprophetnet-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/xprophetnet-large-wiki100-cased-xglue-ntg">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://arxiv.org/abs/2001.04063) by
-Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
-
-XLM-ProphetNet's model architecture and pretraining objective are the same as ProphetNet's, but XLM-ProphetNet was pre-trained
-on the cross-lingual dataset [XGLUE](https://arxiv.org/abs/2004.01401).
-
-The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
-versions for headline generation and question generation, respectively.
-
-<a id='multimodal-models'></a>
-
-## Multimodal models
-
-There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
-others.
-
-### MMBT
-
-[Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/abs/1909.02950), Douwe Kiela
-et al.
-
-A transformer model used in multimodal settings, combining a text and an image to make predictions. The transformer
-model takes as inputs the embeddings of the tokenized text and the final activations of a ResNet pretrained on images
-(after the pooling layer), which go through a linear layer (to go from the number of features at the end of the ResNet to
-the hidden state dimension of the transformer).
-
-The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
-model know which part of the input vector corresponds to the text and which to the image.
-
-The pretrained model only works for classification.
-
-<!--More information in this [model documentation](model_doc/mmbt). TODO: write this page
-->
-
-<a id='retrieval-based-models'></a>
-
-## Retrieval-based models
-
-Some models use document retrieval during (pre)training and inference, for example for open-domain question answering.
-
-
-### DPR
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=dpr">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
-</a>
-<a href="model_doc/dpr">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
-</a>
-<a href="https://huggingface.co/spaces/docs-demos/dpr-question_encoder-bert-base-multilingual">
-<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
-</a>
-</div>
-
-
-[Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906), Vladimir Karpukhin et
-al.
-
-Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
-research.
-
-
-DPR consists of three models:
-
- Question encoder: encode questions as vectors
- Context encoder: encode contexts as vectors
- Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the
-  inferred span actually answers the question).
-
-DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
-then it calls the reader with the question and the retrieved documents to get the answer.
-
-### RAG
-
-<div class="flex flex-wrap space-x-1">
-<a href="https://huggingface.co/models?filter=rag">
-<img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
-</a>
-<a href="model_doc/rag">
-<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
-</a>
-</div>
-
-[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401), Patrick Lewis,
-Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
-Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
-
-Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
-models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
-seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
-to adapt to downstream tasks.
-
-The two models RAG-Token and RAG-Sequence are available for generation.
-
-## More technical aspects
-
-### Full vs sparse attention
-
-Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
-computational bottleneck when you have long texts. Longformer and Reformer are models that try to be more efficient and
-use a sparse version of the attention matrix to speed up training.
-
-<a id='lsh-attention'></a>
-
-**LSH attention**
-
-[Reformer](#reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
-dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
-the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
-modified to mask the current token (except at the first position), because it will give a query and a key equal (so
-very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
-(determined by an n_rounds parameter) and their results are then averaged together.
-
-<a id='local-attention'></a>
-
-**Local attention**
-
-[Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
-left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
-window, the last layer will have a receptive field of more than just the tokens in the window, allowing it to build a
-representation of the whole sentence.
-
-Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
-all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
-their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
-
-<img scale="50 %" align="center" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local_attention_mask.png"/>
-
-Using those sparser attention matrices then allows the model to handle inputs with a longer sequence
-length.
-
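To make the combination of a local window and a few global tokens concrete, here is a minimal sketch in plain Python. The function name `local_attention_mask` and its parameters `window` and `global_tokens` are hypothetical and chosen for illustration; this is not the Longformer implementation, only a picture of which entries of the attention matrix are kept.

```python
def local_attention_mask(seq_len, window, global_tokens=()):
    """Return a boolean matrix where mask[i][j] is True if query i may attend to key j.

    Illustrative sketch: `window` is the number of tokens visible on each side,
    and `global_tokens` are positions that attend to, and are attended by, every token.
    """
    global_set = set(global_tokens)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = abs(i - j) <= window
            # Global attention is symmetric: it applies to the row and the column.
            is_global = i in global_set or j in global_set
            row.append(in_window or is_global)
        mask.append(row)
    return mask


mask = local_attention_mask(seq_len=8, window=1, global_tokens=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

allowed = sum(sum(row) for row in mask)
print(allowed, "of", 8 * 8, "entries are kept")
```

The printed pattern is a diagonal band plus a full first row and first column. The band grows linearly with the sequence length rather than quadratically, which is what makes longer inputs affordable.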
-### Other tricks
-
-<a id='axial-pos-encoding'></a>
-
-**Axial positional encodings**
-
-[Reformer](#reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
-E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
-hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
-that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
-dimensions \\(l_{1} \times d_{1}\\) and \\(l_{2} \times d_{2}\\), such that \\(l_{1} \times l_{2} = l\\) and
-\\(d_{1} + d_{2} = d\\) (with the product for the lengths, this ends up being way smaller). The embedding for time
-step \\(j\\) in E is obtained by concatenating the embeddings for time step \\(j \% l_{1}\\) in E1 and \\(j // l_{1}\\)
-in E2.
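
The lookup described above can be sketched with plain Python lists. The sizes and the helper name `axial_position_embedding` are assumptions chosen for illustration, and the random values stand in for learned parameters; this is not the Reformer code itself.

```python
import random


def axial_position_embedding(j, E1, E2):
    """Concatenate row j % l1 of E1 with row j // l1 of E2 (hypothetical helper)."""
    l1 = len(E1)
    return E1[j % l1] + E2[j // l1]  # list concatenation, giving d1 + d2 values


# Toy sizes (assumed): sequence length l = l1 * l2, hidden size d = d1 + d2.
l1, l2, d1, d2 = 4, 3, 2, 6
random.seed(0)
E1 = [[random.random() for _ in range(d1)] for _ in range(l1)]
E2 = [[random.random() for _ in range(d2)] for _ in range(l2)]

embeddings = [axial_position_embedding(j, E1, E2) for j in range(l1 * l2)]

full_size = (l1 * l2) * (d1 + d2)  # entries in the unfactorized matrix E
axial_size = l1 * d1 + l2 * d2     # entries stored in E1 and E2 together
print(len(embeddings), len(embeddings[0]), full_size, axial_size)
```

Every position from 0 to \\(l - 1\\) maps to a distinct pair of rows, so each one still gets its own embedding, while the number of stored values drops from \\(l \times d\\) to \\(l_{1} \times d_{1} + l_{2} \times d_{2}\\).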
+The Decision and Trajectory Transformers cast the state, action, and reward as a sequence modeling problem. The [Decision Transformer](model_doc/decision_transformer) generates a series of actions that lead to a future desired return based on returns-to-go, past states, and actions. For the last *K* timesteps, each of the three modalities is converted into token embeddings and processed by a GPT-like model to predict a future action token. [Trajectory Transformer](model_doc/trajectory_transformer) also tokenizes the states, actions, and rewards and processes them with a GPT architecture. Unlike the Decision Transformer, which is focused on reward conditioning, the Trajectory Transformer generates future actions with beam search.