Styling them all

Sylvain Gugger 2020-10-26 15:48:36 -04:00
parent 4d9d9b8c5b
commit 7d029395fd
267 changed files with 9303 additions and 9305 deletions

View File

@@ -3,21 +3,27 @@ Benchmarks
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.

A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.

How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` make it possible to flexibly benchmark 🤗 Transformer models. The benchmark classes allow us to measure the `peak memory usage` and `required time` for both `inference` and `training`.

.. note::

    Here, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and backward pass.

The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an object of type :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation. :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data classes that contain all relevant configurations for their corresponding benchmark class. The following example shows how a BERT model of type `bert-base-cased` can be benchmarked.
.. code-block::
@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`
>>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the `model hub <https://huggingface.co/models>`__. The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these, one can either directly consult the files ``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for TensorFlow). Alternatively, running the following shell commands from the root directory will print out a descriptive list of all configurable parameters for PyTorch and TensorFlow respectively.
.. code-block:: bash
@@ -65,7 +75,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -75,7 +85,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
@@ -98,7 +108,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
>>> ## TENSORFLOW CODE
>>> results = benchmark.run()
>>> print(results)
@@ -111,7 +121,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -121,7 +131,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_performance_state: 2
- use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above, the first two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc., is printed out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file by adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate `.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
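For instance, here is a minimal sketch of saving the results to CSV files. The CSV-path argument names below, such as ``inference_time_csv_file``, are assumptions that should be double-checked against ``src/transformers/benchmark/benchmark_args_utils.py``:

.. code-block:: python

    >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    >>> args = PyTorchBenchmarkArguments(
    ...     models=["bert-base-uncased"],
    ...     batch_sizes=[8],
    ...     sequence_lengths=[8, 32, 128, 512],
    ...     save_to_csv=True,
    ...     inference_time_csv_file="inference_time.csv",      # assumed argument name
    ...     inference_memory_csv_file="inference_memory.csv",  # assumed argument name
    ...     env_info_csv_file="env_info.csv",                  # assumed argument name
    ... )
    >>> benchmark = PyTorchBenchmark(args)
    >>> results = benchmark.run()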
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of configurations must be inserted with the benchmark args as follows.

.. code-block::
@@ -183,7 +196,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -201,7 +214,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1127
bert-6-lay 8 512 1359
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
@@ -252,7 +265,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.0011
bert-6-lay 8 512 0.074
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -270,7 +283,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1330
bert-6-lay 8 512 1540
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
- use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations of the :obj:`BertModel` class. This feature can especially be helpful when deciding for which configuration the model should be trained.
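In the same spirit, `training` can be benchmarked in addition to `inference`. The sketch below assumes a boolean ``training`` field on the benchmark argument data classes; double-check the exact name in ``benchmark_args_utils.py`` for your version:

.. code-block:: python

    >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    >>> args = PyTorchBenchmarkArguments(
    ...     models=["bert-base-uncased"],
    ...     batch_sizes=[8],
    ...     sequence_lengths=[128],
    ...     training=True,  # assumed flag: also measure a forward + backward pass
    ... )
    >>> results = PyTorchBenchmark(args).run()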
Benchmark best practices
@@ -304,19 +318,28 @@ Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.

- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate memory measurement, it is recommended to run each memory benchmark in a separate process by making sure :obj:`no_multi_processing` is left at :obj:`False` (see the sketch after this list).
- One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so benchmark results on their own are not very useful for the community.
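A minimal sketch covering the first two points above (the device index ``"0"`` is just an example; adapt it to your setup):

.. code-block:: python

    >>> import os
    >>> os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pin the benchmark to a single GPU before any CUDA work happens

    >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

    >>> args = PyTorchBenchmarkArguments(
    ...     models=["bert-base-uncased"],
    ...     batch_sizes=[8],
    ...     sequence_lengths=[128],
    ...     no_multi_processing=False,  # default; each memory benchmark runs in its own process
    ... )
    >>> benchmark = PyTorchBenchmark(args)
    >>> results = benchmark.run()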
Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Previously, all available core models (10 at the time) were benchmarked for `inference time`, across many different settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for TensorFlow XLA) and GPUs.

The approach is detailed in the `following blogpost <https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are available `here <https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.

With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here <https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.

View File

@@ -1,18 +1,26 @@
BERTology
-----------------------------------------------------------------------------------------------------------------------

There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:

* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):

* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving head output values and gradients to be able to compute head importance scores and prune heads, as explained in https://arxiv.org/abs/1905.10650 and sketched below.
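Here is a minimal sketch of accessing hidden states and attentions and of pruning heads. It assumes a library version recent enough to support ``return_dict=True``, and the pruned head indices are purely illustrative:

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("BERTology is fun", return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True, return_dict=True)
    hidden_states = outputs.hidden_states  # tuple: embedding output plus one tensor per layer
    attentions = outputs.attentions        # tuple: one attention tensor per layer

    # prune heads 0 and 2 of layer 0 and head 1 of layer 2 (illustrative indices only)
    model.prune_heads({0: [0, 2], 2: [1]})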
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_, which extracts information from and prunes a model pre-trained on GLUE.

View File

@@ -1,24 +1,40 @@
Converting Tensorflow Checkpoints
=======================================================================================================================

A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints into models that can be loaded using the ``from_pretrained`` methods of the library.

.. note::

    Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any transformers >= 2.3.0 installation.

    The documentation below reflects the **transformers-cli convert** command format.

BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
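If you want to sanity-check the conversion, a sketch along these lines should load the dump back into a model. The file names are hypothetical and mirror the ``$BERT_BASE_DIR`` layout used in the shell example above:

.. code-block:: python

    from transformers import BertConfig, BertForPreTraining

    # hypothetical local paths, mirroring $BERT_BASE_DIR from the shell example
    config = BertConfig.from_json_file("bert_config.json")
    model = BertForPreTraining.from_pretrained("pytorch_model.bin", config=config)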
ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py>`_ script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed.

Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
--config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ ):

.. code-block:: shell
@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )

.. code-block:: shell
@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )

.. code-block:: shell
@@ -130,4 +157,4 @@ Here is an example of the conversion process for a pre-trained XLM model:
--tf_checkpoint $XLM_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--config XLM_CONFIG] \
[--finetuning_task_name XLM_FINETUNED_TASK]

View File

@@ -3,15 +3,15 @@ Fine-tuning with custom datasets
.. note::

    The datasets used in this tutorial are available and can be more easily accessed using the `🤗 NLP library <https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here since this tutorial is meant to illustrate how to work with your own data. A brief introduction can be found at the end of the tutorial in the section ":ref:`nlplib`".

This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.

We include several examples, each of which demonstrates a different type of common downstream task:
@@ -28,13 +28,13 @@ Sequence Classification with IMDb Reviews
.. note::

    This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.

In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes the text of a review and requires the model to predict whether the sentiment of the review is positive or negative. Let's start by downloading the dataset from the `Large Movie Review Dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.

.. code-block:: bash
@@ -62,9 +62,8 @@ read this in.
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

We now have a train and test dataset, but let's also create a validation set which we can use for evaluation and tuning without touching our test set results. Sklearn has a convenient utility for creating such splits:

.. code-block:: python
@@ -80,8 +79,8 @@ pre-trained DistilBert, so let's use the DistilBert tokenizer.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model's maximum input length. This will allow us to feed batches of sequences into the model at the same time.

.. code-block:: python
@@ -90,9 +89,9 @@ input length. This will allow us to feed batches of sequences into the model at
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a ``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data can be easily batched such that each key in the batch encoding corresponds to a named parameter of the :meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.

.. code-block:: python
@@ -133,17 +132,17 @@ easily batched such that each key in the batch encoding corresponds to a named p
))

Now that our datasets are ready, we can fine-tune a model either with the 🤗 :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training <training>`.

.. _ft_trainer:

Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The steps above prepared the datasets in the way the trainer expects. Now all we need to do is create a model to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.

.. code-block:: python
@@ -248,15 +247,15 @@ Token Classification with W-NUT Emerging Entities
.. note::

    This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_), and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.

Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by token. We'll demonstrate how to do this with `Named Entity Recognition <http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities <http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of pre-tokenized documents where each token is assigned a tag.

Let's start by downloading the data.
@@ -264,10 +263,10 @@ Let's start by downloading the data.
wget http://noisy-text.github.io/2017/files/wnut17train.conll
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1) a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and ``token_tags`` which is a list of lists of tag strings.

.. code-block:: python
@@ -290,11 +289,11 @@ strings, and ``token_tags`` which is a list of lists of tag strings.
tags.append(tag)
token_docs.append(tokens)
tag_docs.append(tags)
return token_docs, tag_docs
texts, tags = read_wnut('wnut17train.conll')

Just to see what this data looks like, let's take a look at a segment of the first document.

.. code-block:: python
@@ -303,8 +302,8 @@ Just to see what this data looks like, let's take a look at a segment of the fir
['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']

``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to any entity.

Now that we've read the data in, let's create a train/validation split:
@@ -314,8 +313,8 @@ Now that we've read the data in, let's create a train/validation split:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
Next, let's create encodings for our tokens and tags. For the tags, we can start by just creating a simple mapping which we'll use in a moment:

.. code-block:: python
@@ -323,11 +322,11 @@ which we'll use in a moment:
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass ``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model to return information about the tokens which are split by the wordpiece tokenization process, which we will need in a moment.

.. code-block:: python
@@ -339,26 +338,26 @@ a moment.
Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert model below.

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@', 'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗 Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for ``@HuggingFace`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to ``[3, -100, -100]``.
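The value ``-100`` is not arbitrary: it is the default ``ignore_index`` of PyTorch's cross-entropy loss, so positions labeled ``-100`` are simply skipped when the loss is computed. A tiny sketch of that behavior:

.. code-block:: python

    import torch

    loss_fct = torch.nn.CrossEntropyLoss()     # ignore_index defaults to -100
    logits = torch.randn(3, 5)                 # 3 sub-tokens, 5 tag classes
    labels = torch.tensor([3, -100, -100])     # only the first sub-token is scored
    loss = loss_fct(logits, labels)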
Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a special token like ``[PAD]`` or ``[CLS]``.

.. note::

    Due to a recently fixed bug, -1 must be used instead of -100 when using TensorFlow in 🤗 Transformers <= 3.02.
@@ -379,7 +378,7 @@ be a special token like ``[PAD]`` or ``[CLS]``.
encoded_labels.append(doc_enc_labels.tolist())
return encoded_labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
@@ -447,8 +446,9 @@ Question Answering with SQuAD 2.0
.. note::

    This dataset can be explored in the Hugging Face model hub (`SQuAD V2 <https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("squad_v2")``.

Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that involves answering a question about a passage by highlighting the segment of the passage that answers the question.
@@ -464,8 +464,8 @@ We will start by downloading the data:
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json

Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since there are multiple questions per context):

.. code-block:: python
@@ -491,17 +491,17 @@ since there are multiple questions per context):
answers.append(answer)
return contexts, questions, answers
train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the answer begins and ends.

First, let's get the *character* position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

.. code-block:: python
@@ -510,7 +510,7 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will
gold_text = answer['text']
start_idx = answer['answer_start']
end_idx = start_idx + len(gold_text)
# sometimes squad answers are off by a character or two; fix this
if context[start_idx:end_idx] == gold_text:
answer['answer_end'] = end_idx
@@ -524,9 +524,9 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next, let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together as sequence pairs.

.. code-block:: python
@@ -536,8 +536,8 @@ them together as sequence pairs.
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers, we can use the built-in :func:`~transformers.BatchEncoding.char_to_token` method.

.. code-block:: python
@@ -557,9 +557,9 @@ Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_t
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)

Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the ``from_tensor_slices`` method.

.. code-block:: python
@@ -575,7 +575,7 @@ training. In PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pa
def __len__(self):
return len(self.encodings.input_ids)
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)
## TENSORFLOW CODE
@@ -668,12 +668,11 @@ Additional Resources
Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗 Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗 NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the `hub <https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we will show how to use the NLP library to download and prepare the IMDb dataset from the first example, :ref:`seq_imdb`.

Start by downloading the dataset:
@ -689,8 +688,8 @@ Each dataset has multiple columns corresponding to different features. Let's see
>>> print(train.column_names) >>> print(train.column_names)
['label', 'text'] ['label', 'text']
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
to ``labels`` to match the model's input arguments. ``labels`` to match the model's input arguments.
.. code-block:: python .. code-block:: python
@ -711,5 +710,5 @@ dataset elements.
>>> {key: val.shape for key, val in train[0].items()}) >>> {key: val.shape for key, val in train[0].items()})
{'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])} {'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for a
a more thorough introduction. more thorough introduction.

View File

@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the
>>> tokenized_sequence = tokenizer.tokenize(sequence) >>> tokenized_sequence = tokenizer.tokenize(sequence)
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
added for "RA" and "M": is added for "RA" and "M":
.. code-block:: .. code-block::
@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance. <https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block:: .. code-block::
@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
model which tokens should be attended to, and which should not. which tokens should be attended to, and which should not.
For example, consider these two sequences: For example, consider these two sequences:
@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"] >>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]] [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
the position of the padded indices so that the model does not attend to them. For the position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask": in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block:: .. code-block::
@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
tokens. For example, the BERT model builds its two sequence input as such: classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block:: .. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
not a list, like before) like this: arguments (and not a list, like before) like this:
.. code-block:: .. code-block::
@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP] [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models, This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
mask identifying the two types of sequence in the model. the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry: The tokenizer returns this mask as the "token_type_ids" entry:
@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them, Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens. each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
positional embeddings. absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
.. _labels: .. _labels:
@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
its predictions and the expected value (the label). predictions and the expected value (the label).
These labels are different according to the model head, for example: These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects - For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence. entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects - For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
individual token. token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects - For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
-100).
- For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`, - For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
Encoder-Decoder framework. the documentation of each model for more information on each specific model's labels.
See the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models, The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
simply outputting features. models, simply outputting features.
.. _decoder-input-ids: .. _decoder-input-ids:
Decoder input IDs Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
built in a way specific to each model. way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
In such models, passing the :obj:`labels` is the preferred way to handle training. such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training. Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
@ -270,18 +270,18 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers. In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
for ``bert-base-uncased``). ``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`` embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
mathematically **equivalent** result. **equivalent** result.
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done. complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.

View File

@ -47,7 +47,7 @@ The documentation is organized in five parts:
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in - **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
transformers model transformers model
- The three last section contain the documentation of each public class and function, grouped in: - The three last section contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library. - **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally. - **INTERNAL HELPERS** for the classes and functions we use internally.
@ -122,7 +122,7 @@ conversion utilities for the following models:
20. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by 20. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
Translator Team. Translator Team.
21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for 21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted 22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted

View File

@ -85,4 +85,4 @@ TensorFlow Helper Functions
.. autofunction:: transformers.modeling_tf_utils.keras_serializable .. autofunction:: transformers.modeling_tf_utils.keras_serializable
.. autofunction:: transformers.modeling_tf_utils.shape_list .. autofunction:: transformers.modeling_tf_utils.shape_list

View File

@ -25,6 +25,7 @@ SpecialTokensMixin
Enums and namedtuples Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum .. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy .. autoclass:: transformers.tokenization_utils_base.PaddingStrategy

View File

@ -24,4 +24,4 @@ Distributed Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.trainer_pt_utils.DistributedTensorGatherer .. autoclass:: transformers.trainer_pt_utils.DistributedTensorGatherer
:members: :members:

View File

@ -17,7 +17,7 @@ You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override
to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example: to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:
.. code-block:: bash .. code-block:: bash
TRANSFORMERS_VERBOSITY=error ./myprogram.py TRANSFORMERS_VERBOSITY=error ./myprogram.py
All the methods of this logging module are documented below, the main ones are All the methods of this logging module are documented below, the main ones are
@ -55,4 +55,4 @@ Other functions
.. autofunction:: transformers.logging.enable_explicit_format .. autofunction:: transformers.logging.enable_explicit_format
.. autofunction:: transformers.logging.reset_format .. autofunction:: transformers.logging.reset_format

View File

@ -52,4 +52,4 @@ Generative models
:members: :members:
.. autoclass:: transformers.generation_tf_utils.TFGenerationMixin .. autoclass:: transformers.generation_tf_utils.TFGenerationMixin
:members: :members:

View File

@ -1,8 +1,8 @@
Pipelines Pipelines
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use. :doc:`task summary <../task_summary>` for examples of use.
@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:
The pipeline abstraction The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
other pipeline but requires an additional argument which is the `task`. pipeline but requires an additional argument which is the `task`.
.. autofunction:: transformers.pipeline .. autofunction:: transformers.pipeline

View File

@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All processors follow the same architecture which is that of the All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list :class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
of :class:`~transformers.data.processors.utils.InputExample`. These :class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to :class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model. :class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
@ -28,15 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
GLUE GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates `General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
`GLUE: A multi-task benchmark and analysis platform for natural language understanding <https://openreview.net/pdf?id=rJ4km2R5t7>`__ multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI. QQP, QNLI, RTE and WNLI.
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor` - :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor` - :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
- :class:`~transformers.data.processors.utils.Sst2Processor` - :class:`~transformers.data.processors.utils.Sst2Processor`
@ -46,7 +47,7 @@ Those processors are:
- :class:`~transformers.data.processors.utils.RteProcessor` - :class:`~transformers.data.processors.utils.RteProcessor`
- :class:`~transformers.data.processors.utils.WnliProcessor` - :class:`~transformers.data.processors.utils.WnliProcessor`
Additionally, the following method can be used to load values from a data file and convert them to a list of Additionally, the following method can be used to load values from a data file and convert them to a list of
:class:`~transformers.data.processors.utils.InputExample`. :class:`~transformers.data.processors.utils.InputExample`.
.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features .. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
@ -54,36 +55,38 @@ Additionally, the following method can be used to load values from a data file
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script. An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
XNLI XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates `The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
the quality of cross-lingual text representations. quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on `MultiNLI
XNLI is crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment <http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment annotations for 15
annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili). different languages (including both high-resource language such as English and low-resource languages such as Swahili).
It was released together with the paper It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__ <https://arxiv.org/abs/1809.05053>`__
This library hosts the processor to load the XNLI data: This library hosts the processor to load the XNLI data:
- :class:`~transformers.data.processors.utils.XnliProcessor`
Please note that since the gold labels are available on the test set, evaluation is performed on the test set. Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the An example using these processors is given in the `run_xnli.py
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script. <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates `The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside (v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__. <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions: This library hosts a processor for each of the two versions:
@ -91,7 +94,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Those processors are: Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor` - :class:`~transformers.data.processors.utils.SquadV2Processor`
They both inherit from the abstract class :class:`~transformers.data.processors.utils.SquadProcessor` They both inherit from the abstract class :class:`~transformers.data.processors.utils.SquadProcessor`
@ -99,17 +102,18 @@ They both inherit from the abstract class :class:`~transformers.data.processors.
.. autoclass:: transformers.data.processors.squad.SquadProcessor .. autoclass:: transformers.data.processors.squad.SquadProcessor
:members: :members:
Additionally, the following method can be used to convert SQuAD examples into :class:`~transformers.data.processors.utils.SquadFeatures` Additionally, the following method can be used to convert SQuAD examples into
that can be used as model inputs. :class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.
.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features .. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementionned method can be used with files containing the data as well as with the `tensorflow_datasets` package. These processors as well as the aforementionned method can be used with files containing the data as well as with the
Examples are given below. `tensorflow_datasets` package. Examples are given below.
Example usage Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files: Here is an example using the processors as well as the conversion method using data files:
.. code-block:: .. code-block::
@ -149,5 +153,5 @@ Using `tensorflow_datasets` is as easy as using a data file:
) )
Another example using these processors is given in the Another example using these processors is given in the `run_squad.py
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script. <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.

View File

@ -29,11 +29,12 @@ methods for using all the tokenizers:
:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``, :class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
`tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition several advanced HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
alignment methods which can be used to map between the original string (character and words) and the token space (e.g., several advanced alignment methods which can be used to map between the original string (character and words) and the
getting the index of the token comprising a given character or the span of characters corresponding to a given token). token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
PreTrainedTokenizer PreTrainedTokenizer

View File

@ -4,7 +4,7 @@ Trainer
The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`. training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training. customization during training.

View File

@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.* SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips: Tips:
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
the right rather than the left. than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains - ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers. number of (repeating) layers.

View File

@ -2,9 +2,8 @@ AutoClasses
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method. are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
to the pretrained weights/config/vocabulary.
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance :class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance

View File

@ -29,10 +29,10 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
Implementation Notes Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` - Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
or :meth:`~transformers.BartTokenizer.encode` to get the proper splitting. :meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function - The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
:func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some :func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
other modeling APIs. other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the - Model predictions are intended to be identical to the original implementation. This only works, however, if the
string you pass to :func:`fairseq.encode` starts with a space. string you pass to :func:`fairseq.encode` starts with a space.

View File

@ -25,8 +25,8 @@ improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).*
Tips: Tips:
- BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on - BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the right rather than the left. the left.
- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is - BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.

View File

@ -25,14 +25,14 @@ Usage:
BERT checkpoints for subsequent fine-tuning. BERT checkpoints for subsequent fine-tuning.
:: code-block :: code-block
# leverage checkpoints for Bert2Bert model... # leverage checkpoints for Bert2Bert model...
# use BERT's cls token as BOS token and sep token as EOS token # use BERT's cls token as BOS token and sep token as EOS token encoder =
encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102) BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102) # add cross attention
# add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token layers and use BERT's cls token as BOS token and sep token as EOS token decoder =
decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102) BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True,
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder) bos_token_id=101, eos_token_id=102) bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# create tokenizer... # create tokenizer...
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased") tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
@ -40,8 +40,7 @@ Usage:
labels = tokenizer('This is a short summary', return_tensors="pt").input_ids labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
# train... # train...
loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels, return_dict=True).loss loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels, return_dict=True).loss loss.backward()
loss.backward()
- Pretrained :class:`~transformers.EncoderDecoderModel` are also directly available in the model hub, e.g., - Pretrained :class:`~transformers.EncoderDecoderModel` are also directly available in the model hub, e.g.,

View File

@ -1,16 +1,28 @@
Blenderbot Blenderbot
----------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ . **DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .
Overview Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot <https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020. The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
<https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The abstract of the paper is the following: The abstract of the paper is the following:
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.* *Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ . The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
@ -20,8 +32,11 @@ Implementation Notes
- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture. - Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
- It inherits completely from :class:`~transformers.BartForConditionalGeneration` - It inherits completely from :class:`~transformers.BartForConditionalGeneration`
- Even though blenderbot is one model, it uses two tokenizers :class:`~transformers.BlenderbotSmallTokenizer` for 90M checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints. - Even though blenderbot is one model, it uses two tokenizers :class:`~transformers.BlenderbotSmallTokenizer` for 90M
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`, regardless of checkpoint. To use the 3B parameter checkpoint, you must call :class:`~transformers.BlenderbotTokenizer` directly. checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints.
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`,
regardless of checkpoint. To use the 3B parameter checkpoint, you must call
:class:`~transformers.BlenderbotTokenizer` directly.
- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__. - Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
@ -56,6 +71,7 @@ Here is how you can check out config values:
BlenderbotConfig BlenderbotConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotConfig .. autoclass:: transformers.BlenderbotConfig
:members: :members:
@ -74,6 +90,7 @@ BlenderbotSmallTokenizer
BlenderbotForConditionalGeneration BlenderbotForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate` See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`
.. autoclass:: transformers.BlenderbotForConditionalGeneration .. autoclass:: transformers.BlenderbotForConditionalGeneration

View File

@@ -4,26 +4,26 @@ CamemBERT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__ by
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
trained on 138GB of French text.

The abstract from the paper is the following:

*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
models have either been trained on English data or on the concatenation of data in multiple languages. This makes
practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
downstream applications for French NLP.*

Tips:

- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
  as well as the information relative to the inputs and outputs.

The original code can be found `here <https://camembert-model.fr/>`__.
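Since the API is identical to RoBERTa's, a minimal masked-language-modeling sketch looks as follows (it assumes the
``camembert-base`` checkpoint; the French prompt is purely illustrative):

.. code-block::

    from transformers import CamembertForMaskedLM, CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertForMaskedLM.from_pretrained("camembert-base")

    # CamemBERT uses the same <mask> token convention as RoBERTa
    inputs = tokenizer("Le camembert est <mask> !", return_tensors="pt")
    outputs = model(**inputs)

    # the first element of the returned tuple holds the prediction scores over the vocabulary
    prediction_scores = outputs[0]
    masked_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    print(tokenizer.decode([prediction_scores[0, masked_index].argmax().item()]))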
@@ -130,4 +130,4 @@ TFCamembertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFCamembertForQuestionAnswering
    :members:
@@ -6,33 +6,33 @@ Overview

CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation
<https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

The abstract from the paper is the following:

*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
via model-based source attribution.*

Tips:

- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
  or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__ for
  more information.
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
  this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
  `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
  this argument.

The original code can be found `here <https://github.com/salesforce/ctrl>`__.
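A minimal generation sketch (assuming the ``ctrl`` checkpoint; the ``Links`` control code, the prompt and the
generation parameters are only illustrative):

.. code-block::

    from transformers import CTRLLMHeadModel, CTRLTokenizer

    tokenizer = CTRLTokenizer.from_pretrained("ctrl")
    model = CTRLLMHeadModel.from_pretrained("ctrl")

    # the first token acts as a control code ("Links" here) that steers the style of the generation
    input_ids = tokenizer("Links Hello, my name is", return_tensors="pt")["input_ids"]

    # generate() caches and re-uses the `past` key/value pairs between steps, so nothing is re-computed
    generated = model.generate(input_ids, max_length=50, repetition_penalty=1.2)
    print(tokenizer.decode(generated[0]))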
@@ -1,40 +1,43 @@

DeBERTa
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
<https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.

It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
RoBERTa.

The abstract from the paper is the following:

*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
position, respectively, and the attention weights among words are computed using disentangled matrices on their
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*

The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
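A minimal sketch of extracting last-layer hidden states (it assumes the ``microsoft/deberta-base`` checkpoint; the
input sentence is illustrative):

.. code-block::

    from transformers import DebertaModel, DebertaTokenizer

    tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
    model = DebertaModel.from_pretrained("microsoft/deberta-base")

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs)

    # the first element of the returned tuple is the sequence of last-layer hidden states
    last_hidden_state = outputs[0]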
DebertaConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaConfig
    :members:


DebertaTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,

@@ -42,21 +45,21 @@ DebertaTokenizer

DebertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaModel
    :members:


DebertaPreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaPreTrainedModel
    :members:


DebertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaForSequenceClassification
    :members:
@@ -4,36 +4,39 @@ DialoGPT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DialoGPT was proposed in `DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
<https://arxiv.org/abs/1911.00536>`_ by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
Reddit.

The abstract from the paper is the following:

*We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained
transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning
from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human
both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems
that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline
systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response
generation and the development of more intelligent open-domain dialogue systems.*

Tips:

- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
  at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card
  <https://huggingface.co/microsoft/DialoGPT-medium>`_.
Training:

In order to train or fine-tune DialoGPT, one can use causal language modeling training. To cite the official paper: *We
follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and frame the generation task as language
modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the
sequence length), ended by the end-of-text token.* For more information please refer to the original paper.

DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring
<https://huggingface.co/transformers/model_doc/gpt2.html>`_.

The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.
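As a rough sketch of that format (adapted from the model card linked above and assuming the
``microsoft/DialoGPT-medium`` checkpoint), a chat loop concatenates every turn, each ended by the end-of-text token:

.. code-block::

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

    chat_history_ids = None
    for user_input in ["Does money buy happiness?", "What do you think about it?"]:
        # every dialog turn is ended by the end-of-text token
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")

        # the whole dialogue session is modeled as one long text
        bot_input_ids = new_input_ids if chat_history_ids is None else torch.cat([chat_history_ids, new_input_ids], dim=-1)
        chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        print(tokenizer.decode(chat_history_ids[0, bot_input_ids.shape[-1]:], skip_special_tokens=True))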
@@ -4,13 +4,12 @@ DistilBERT

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DistilBERT model was proposed in the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__. DistilBERT is a
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than
`bert-base-uncased`, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language
understanding benchmark.

The abstract from the paper is the following:
@@ -18,13 +17,13 @@ The abstract from the paper is the following:

operating these large models in on-the-edge and/or under constrained computational training or inference budgets
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
study.*
Tips:

@@ -33,7 +32,8 @@ Tips:

- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
  necessary though, just let us know if you need this option.

The original code can be found `here
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.

DistilBertConfig
@@ -4,9 +4,9 @@ DPR

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.

The abstract from the paper is the following:
@@ -12,34 +12,28 @@ identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.*
Tips:

- ELECTRA is the pretraining approach, therefore there are nearly no changes made to the underlying model: BERT. The
  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
  contain both the generator and discriminator. The conversion script requires the user to name which model to export
  into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
@@ -13,7 +13,7 @@ any other models (see the examples for more information).

An application of this architecture could be to leverage two pretrained :class:`~transformers.BertModel` as the encoder
and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders
<https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata.
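For instance, a BERT-to-BERT model can be warm-started from two pretrained checkpoints roughly as follows (a sketch
assuming ``bert-base-uncased`` for both encoder and decoder; the article and summary strings are illustrative):

.. code-block::

    from transformers import BertTokenizer, EncoderDecoderModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # initialize a bert2bert model from two pretrained BERT checkpoints
    model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

    inputs = tokenizer("A long article that should be summarized.", return_tensors="pt")
    labels = tokenizer("A short summary.", return_tensors="pt")["input_ids"]

    # the tokenized summary serves both as decoder input and as language-modeling labels
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"],
                    decoder_input_ids=labels, labels=labels)
    loss = outputs[0]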
EncoderDecoderConfig
@@ -11,17 +11,17 @@ modeling (MLM) objective (like BERT).

The abstract from the paper is the following:

*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*

The original code can be found `here <https://github.com/getalp/Flaubert>`__.
@@ -58,4 +58,4 @@ FSMTForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FSMTForConditionalGeneration
    :members: forward
@@ -30,8 +30,8 @@ Tips:

  directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
  tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
  sequence length as the input.
- The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
  used for :class:`~transformers.FunnelModel`, :class:`~transformers.FunnelForPreTraining`,
  :class:`~transformers.FunnelForMaskedLM`, :class:`~transformers.FunnelForTokenClassification` and
  :class:`~transformers.FunnelForQuestionAnswering`. The second ones should be used for
  :class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and
@@ -6,44 +6,39 @@ Overview

OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training
<https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.

The abstract from the paper is the following:
*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
the state of the art in 9 out of the 12 tasks studied.*
Tips:

- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.

`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.

The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`__.
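A minimal generation sketch (assuming the ``openai-gpt`` checkpoint; the prompt and sampling parameters are
illustrative):

.. code-block::

    from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

    tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
    model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

    # the vocabulary is lowercased, so lowercase prompts work best
    input_ids = tokenizer("the weather today is", return_tensors="pt")["input_ids"]
    generated = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)
    print(tokenizer.decode(generated[0]))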
Note:

If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install ``ftfy``
and ``SpaCy``::

.. code-block:: bash

@@ -51,8 +46,7 @@ If you want to reproduce the original tokenization process of the `OpenAI GPT` paper

    python -m spacy download en

If you don't install ``ftfy`` and ``SpaCy``, the :class:`~transformers.OpenAIGPTTokenizer` will default to tokenize
using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).

OpenAIGPTConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -5,29 +5,29 @@ Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_ by Alec
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.

The abstract from the paper is the following:

*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
10X the amount of data.*

Tips:
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
  this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
  `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
  this argument. A minimal generation sketch follows these tips.
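A short sketch (assuming the ``gpt2`` checkpoint; ``generate`` takes care of passing the cached `past` key/value pairs
from one step to the next, so earlier positions are not re-computed):

.. code-block::

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    input_ids = tokenizer("Once upon a time,", return_tensors="pt")["input_ids"]

    # the cached key/value pairs (the `past`) are re-used internally at every generation step
    generated = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))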
`Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
@@ -1,55 +1,66 @@

LayoutLM
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding.

The abstract from the paper is the following:

*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
classification (from 93.07 to 94.42).*
Tips:

- LayoutLM has an extra input called :obj:`bbox`, which is the bounding boxes of the input tokens.
- The :obj:`bbox` input expects coordinates on a 0-1000 scale, which means you should normalize the bounding boxes
  before passing them into the model.

The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
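A minimal sketch of preparing the :obj:`bbox` input (it assumes the ``microsoft/layoutlm-base-uncased`` checkpoint; the
words, pixel boxes and page size below are made up for illustration):

.. code-block::

    import torch
    from transformers import LayoutLMModel, LayoutLMTokenizer

    tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
    model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

    words = ["Hello", "world"]
    boxes = [[48, 84, 156, 108], [160, 84, 255, 108]]  # hypothetical pixel boxes (x0, y0, x1, y1)
    width, height = 762, 1000  # hypothetical page size in pixels

    # normalize every box to the 0-1000 scale expected by LayoutLM
    normalized_boxes = [
        [int(1000 * x0 / width), int(1000 * y0 / height), int(1000 * x1 / width), int(1000 * y1 / height)]
        for x0, y0, x1, y1 in boxes
    ]

    # repeat each word's box for all of its sub-tokens, and add boxes for the [CLS] and [SEP] tokens
    token_boxes = []
    for word, box in zip(words, normalized_boxes):
        token_boxes.extend([box] * len(tokenizer.tokenize(word)))
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

    encoding = tokenizer(" ".join(words), return_tensors="pt")
    outputs = model(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        token_type_ids=encoding["token_type_ids"],
        bbox=torch.tensor([token_boxes]),
    )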
LayoutLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMConfig
    :members:


LayoutLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMTokenizer
    :members:


LayoutLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMModel
    :members:


LayoutLMForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForMaskedLM
    :members:


LayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForTokenClassification
    :members:
@@ -27,20 +27,20 @@ The Authors' code can be found `here <https://github.com/allenai/longformer>`__.

Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and
:math:`\frac{1}{2} w` succeeding tokens with :math:`w` being the window length as defined in
:obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as it is
conventionally done for all tokens in :obj:`BertSelfAttention`.
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally"
attending tokens so that global attention is *symmetric*.

The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
:obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
:obj:`global_attention_mask`:

- 0: the token attends "locally",
- 1: the token attends "globally".
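A minimal sketch of this convention (assuming the ``allenai/longformer-base-4096`` checkpoint; here only the first
token is given global attention, as task-specific models typically do for the leading :obj:`<s>` token):

.. code-block::

    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    input_ids = tokenizer("Hello, how are you?", return_tensors="pt")["input_ids"]

    # 0 -> local attention, 1 -> global attention
    global_attention_mask = torch.zeros(input_ids.shape, dtype=torch.long)
    global_attention_mask[:, 0] = 1  # give the first token global attention

    outputs = model(input_ids, global_attention_mask=global_attention_mask)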
@@ -8,9 +8,8 @@ The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations

<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.

The abstract from the paper is the following:
@@ -3,7 +3,7 @@ MarianMT

**Bugs:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.

Translations should be similar, but not identical to, output in the test set linked to in each model card.

@@ -12,14 +12,14 @@ Implementation Notes

- Each model is about 298 MB on disk, there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by `Jörg Tiedemann
  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:

  - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
  - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
  - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
  - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
@@ -29,17 +29,17 @@ Implementation Notes

Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
  code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
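A minimal translation sketch following this naming scheme (it assumes the ``Helsinki-NLP/opus-mt-en-de``
English-to-German checkpoint; the sentence is illustrative):

.. code-block::

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # {src}=en, {tgt}=de
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # tokenize, translate and decode a single sentence
    batch = tokenizer(["Hello, how are you?"], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))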
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:

- If :obj:`src` is in all caps, the model supports multiple input languages; you can figure out which ones by looking at the model card, or at the Group Members `mapping
@ -112,6 +112,7 @@ Code to see available pretrained models:
MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianConfig
    :members:
View File
@ -7,9 +7,10 @@ MBart
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
@ -21,12 +22,13 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is multilingual it expects the sequences in a different format: a special language id token is added in both the source and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.

The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` method handles this automatically and should be used to encode the sequences for sequence-to-sequence fine-tuning.
- Supervised training
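
  A minimal sketch of what that encoding step looks like (the English/Romanian sentence pair is only an illustration, and the exact keys returned by the method depend on the installed version):

  .. code-block::

      >>> from transformers import MBartTokenizer

      >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
      >>> src_texts = ["UN Chief Says There Is No Military Solution in Syria"]
      >>> tgt_texts = ["Şeful ONU declară că nu există o soluţie militară în Siria"]
      >>> batch = tokenizer.prepare_seq2seq_batch(
      ...     src_texts, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=tgt_texts, return_tensors="pt"
      ... )
      >>> # `batch` now holds the encoded source and target and can be fed to the model's forward pass for fine-tuning.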
@ -44,8 +46,8 @@ the sequences for sequence-to-sequence fine-tuning.
- Generation

  While generating the target text set the :obj:`decoder_start_token_id` to the target language id. The following example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.

.. code-block::
View File
@ -14,23 +14,23 @@ The abstract from the paper is the following:
*Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).*
Tips:

- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation (see the sketch below). Models trained with a causal language modeling (CLM) objective are better in that regard.
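
A quick way to see the MLM behaviour described above is the fill-mask pipeline; this is only a sketch and assumes the `google/mobilebert-uncased` checkpoint published by the authors:

.. code-block::

    >>> from transformers import pipeline

    >>> unmasker = pipeline("fill-mask", model="google/mobilebert-uncased")
    >>> unmasker("HuggingFace is creating a [MASK] that the community uses to solve NLP tasks.")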
The original code can be found `here <https://github.com/google-research/mobilebert>`__.
View File
@ -9,9 +9,8 @@ and assign @sshleifer.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

According to the abstract,
@ -26,7 +25,7 @@ The Authors' code can be found `here <https://github.com/google-research/pegasus
Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides `pegasus-large`, from which the other checkpoints are fine-tuned:

- Each checkpoint is 2.2 GB on disk and 568M parameters.
@ -44,7 +43,7 @@ Implementation Notes
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
- Some key configuration differences:

  - static, sinusoidal position embeddings
  - no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
  - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
  - more beams are used (:obj:`num_beams=8`)
@ -84,6 +83,7 @@ PegasusConfig
PegasusTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

warning: ``add_tokens`` does not work at the moment.

.. autoclass:: transformers.PegasusTokenizer
View File
@ -8,13 +8,24 @@ ProphetNet
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token.

The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
View File
@ -1,8 +1,8 @@
RAG
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate
@ -15,46 +15,40 @@ Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäs
The abstract from the paper is the following:

*Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.*
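
A minimal end-to-end sketch, assuming the `facebook/rag-token-nq` checkpoint and the small dummy retrieval index (the full index is much larger, and retrieval requires the ``datasets`` and ``faiss`` libraries):

.. code-block::

    >>> from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

    >>> tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    >>> retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
    >>> model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

    >>> input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
    >>> generated = model.generate(input_ids=input_dict["input_ids"])
    >>> tokenizer.batch_decode(generated, skip_special_tokens=True)[0]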
RagConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagConfig
    :members:

RagTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenizer
    :members: prepare_seq2seq_batch

Rag specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
    :members:
@ -63,28 +57,28 @@ Rag specific outputs
    :members:

RagRetriever
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagRetriever
    :members:

RagModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagModel
    :members: forward

RagSequenceForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagSequenceForGeneration
    :members: forward, generate

RagTokenForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenForGeneration
    :members: forward, generate
View File
@ -10,7 +10,7 @@ Overview
The Reformer model was proposed in the paper `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

The abstract from the paper is the following:

*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of
@ -36,12 +36,12 @@ would result in a position encoding matrix:
.. math::

    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]

which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:

.. math::

    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]

and

.. math::

    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]
@ -67,22 +67,23 @@ factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj
Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}` can drastically reduce the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.

In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` whose sum has to be equal to :obj:`config.hidden_size`, and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` whose product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence length` of the :obj:`input_ids`.
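
A rough sketch of these constraints (the numbers simply reuse the example above and are not a recommended configuration):

.. code-block::

    >>> from transformers import ReformerConfig

    >>> # 32 + 32 == hidden_size, and 512 * 1024 == the (padded) training sequence length
    >>> config = ReformerConfig(
    ...     hidden_size=64,
    ...     axial_pos_embds=True,
    ...     axial_pos_embds_dim=(32, 32),
    ...     axial_pos_shape=(512, 1024),
    ...     max_position_embeddings=512 * 1024,
    ... )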
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in `Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar" key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to the same bucket.
The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument :obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors
@ -92,11 +93,11 @@ neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post <https://www.pragmatic.ml/reformer-deep-dive/>`__.
Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.
When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
@ -128,7 +129,7 @@ multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length`
Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can easily be trained on sequences as long as 64000 tokens.

For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:

.. code-block::
View File
@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin
<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

The abstract from the paper is the following:
@ -17,15 +17,15 @@ The abstract from the paper is the following:
approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.*
Tips:

- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
- RoBERTa doesn't have :obj:`token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
View File
@ -4,38 +4,34 @@ SqueezeBERT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__ instead of fully-connected layers for the Q, K, V and FFN layers.

The abstract from the paper is the following:
*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.*
Tips:

- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the `squeezebert/squeezebert-mnli-headless` checkpoint (see the sketch below).
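
A minimal loading sketch for that recommendation (the number of labels is a placeholder for your own task):

.. code-block::

    >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

    >>> tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
    >>> model = AutoModelForSequenceClassification.from_pretrained(
    ...     "squeezebert/squeezebert-mnli-headless", num_labels=2  # placeholder: set to your task's number of labels
    ... )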
View File
@ -29,13 +29,12 @@ Tips:
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*, for summarization: *summarize: ...*. For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`__.
- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()`. This method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output (see the sketch below).
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
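
A small sketch of the prefix mechanism together with :obj:`generate()` (`t5-small` is simply the smallest released checkpoint; any T5 checkpoint works the same way):

.. code-block::

    >>> from transformers import T5Tokenizer, T5ForConditionalGeneration

    >>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
    >>> model = T5ForConditionalGeneration.from_pretrained("t5-small")

    >>> # The task is selected purely through the text prefix.
    >>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
    >>> outputs = model.generate(input_ids)
    >>> tokenizer.decode(outputs[0], skip_special_tokens=True)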
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
@ -51,14 +50,14 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
- Unsupervised denoising training

In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`, :obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in :class:`~transformers.T5Tokenizer`.
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:

.. code-block::
@ -69,10 +68,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
- Supervised training

In this setup the input sequence and output sequence are a standard sequence-to-sequence input-output mapping. In translation, for instance, with the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar.", the sentences should be processed as follows:

.. code-block::

    input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
View File
@ -14,19 +14,19 @@ The abstract from the paper is the following:
*Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.*
Tips:

- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
View File
@ -14,21 +14,21 @@ Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the
The abstract from the paper is the following:

*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
Tips:

- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual <../multilingual>` page for more information (a short sketch follows below).
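
A short sketch of the :obj:`lang` mechanism, assuming the `xlm-clm-enfr-1024` checkpoint (an English/French CLM model); the input sentence is arbitrary:

.. code-block::

    >>> import torch
    >>> from transformers import XLMTokenizer, XLMWithLMHeadModel

    >>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    >>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

    >>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
    >>> language_id = tokenizer.lang2id["en"]
    >>> langs = torch.tensor([language_id] * input_ids.shape[1]).view(1, -1)  # one language id per token
    >>> outputs = model(input_ids, langs=langs)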
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
View File
@ -9,13 +9,25 @@ XLM-ProphetNet
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multi-lingual "wiki100" Wikipedia dump.

The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*

The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
View File
@ -12,25 +12,25 @@ data.
The abstract from the paper is the following:

*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.*
Tips:

- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct language from the input ids (see the sketch below).
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples as well as the information relative to the inputs and outputs.
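
A small sketch of that first point (the French sentence is arbitrary; note that no language id is passed anywhere):

.. code-block::

    >>> import torch
    >>> from transformers import XLMRobertaTokenizer, XLMRobertaForMaskedLM

    >>> tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    >>> model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

    >>> # The same tokenizer and model handle any of the 100 training languages, without a `lang` tensor.
    >>> inputs = tokenizer("Bonjour, je suis un modèle <mask>.", return_tensors="pt")
    >>> with torch.no_grad():
    ...     outputs = model(**inputs)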
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
View File
@ -16,11 +16,11 @@ The abstract from the paper is the following:
better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips: Tips:
@ -15,8 +15,8 @@ Prepare your model for uploading
We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on the
`model hub <https://huggingface.co/models>`__.

Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -60,22 +60,20 @@ Make your model work on all frameworks
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
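As a preview of the steps detailed below, here is a sketch of what the conversion amounts to for a DistilBERT sequence classifier (the local folder name is a placeholder): the PyTorch checkpoint is loaded directly into the TensorFlow class and then saved again.

.. code-block:: python

    from transformers import TFDistilBertForSequenceClassification

    # `from_pt=True` converts the PyTorch weights on the fly.
    tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
    # This writes a `tf_model.h5` checkpoint next to the PyTorch one.
    tf_model.save_pretrained("path/to/awesome-name-you-picked")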
First check that your model class exists in the other framework, that is, try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to type
.. code-block::

from transformers import TFDistilBertForSequenceClassification

and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to type

.. code-block::
@ -112,7 +110,8 @@ Make sure there are no garbage files in the directory you'll upload. It should o
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.

Other files can safely be deleted.
@ -135,7 +134,8 @@ Then log in using the same credentials as on huggingface.co. To upload your mode
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.

By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple
models and need to script that process, you can add `-y` to bypass the prompt. For example:

.. code-block::
@ -179,15 +179,15 @@ Add a model card
To make sure everyone knows what your model can do, what its limitations, potential biases or ethical considerations
are, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be placed in a
subfolder with your username or organization, then another subfolder named like your model (`awesome-name-you-picked`).
Or just click on the "Create a model card on GitHub" button on the model page, it will get you directly to the right
location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model card template
(meta-suggestions are welcome).

If your model is fine-tuned from another model coming from the model hub (as all 🤗 Transformers pretrained models are),
don't forget to link to its model card so that people can fully trace how your model was built.

If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
to see the steps to follow.

.. Note::
@ -1,12 +1,12 @@
Summary of the models
=======================================================================================================================

This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original `transformer
model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction, check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
<https://huggingface.co/models>`_.

Each one of the models in the library falls into one of the following categories:
@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
typical example of such models is GPT.
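As a small illustration of that natural application (a sketch; the ``gpt2`` checkpoint name is an assumption, any causal model from the hub works), text generation is a one-liner with the :func:`~transformers.pipeline` API:

.. code-block:: python

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    # The autoregressive model continues the prompt token by token.
    print(generator("Autoregressive models are", max_length=30))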
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model
Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
was first introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
@ -60,8 +60,8 @@ Original GPT
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>

`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, Alec Radford et al.

The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
@ -80,7 +80,8 @@ GPT-2
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>

`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_, Alec Radford et al.

A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
@ -122,8 +123,8 @@ Transformer-XL
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>

`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, Zihang Dai et al.

Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
@ -153,8 +154,7 @@ Reformer
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>

`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, Nikita Kitaev et al.

An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include:
@ -188,8 +188,8 @@ XLNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>

`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, Zhilin Yang et al.

XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
@ -207,7 +207,8 @@ Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
BERT
-----------------------------------------------------------------------------------------------------------------------
@ -260,8 +261,8 @@ Same as BERT but with a few tweaks:
sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
being the vocab size). If E < H, it has fewer parameters.
* Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
(that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not.

The library provides a version of the model for masked language modeling, token classification, sentence
@ -279,8 +280,7 @@ RoBERTa
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>

`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, Yinhan Liu et al.

Same as BERT with better pretraining tricks:
@ -339,8 +339,8 @@ library provides checkpoints for all of them:
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens that may span over several documents in one of those languages.
* Masked language modeling (MLM), which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens that may span over several documents in one of those languages, with
dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both the surrounding
context in language 1 and the context given by language 2.
@ -523,20 +523,21 @@ Pegasus
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a>

`PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
objective, called Gap Sentence Generation (GSG).

* MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, which has a causal
mask to hide the future words, like a regular auto-regressive transformer decoder.

In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
summary.

The library provides a version of this model for conditional generation, which should be used for summarization.
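A minimal sketch of that use, assuming the ``google/pegasus-xsum`` checkpoint from the model hub:

.. code-block:: python

    from transformers import PegasusTokenizer, PegasusForConditionalGeneration

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

    text = "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions."
    inputs = tokenizer([text], return_tensors="pt", truncation=True)
    summary_ids = model.generate(inputs["input_ids"])
    print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))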
@ -571,20 +572,20 @@ T5
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>

`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.

Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.

The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
sentence and the target is then the dropped out tokens delimited by their sentinel tokens.

For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and
"cute", the encoder input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
@ -603,13 +604,12 @@ MBart
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>

`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

The model architecture and pre-training objective are the same as for BART, but MBart is trained on 25 languages and is
intended for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a
complete sequence-to-sequence model by denoising full texts in multiple languages.

The library provides a version of this model for conditional generation.
@ -636,11 +636,11 @@ ProphetNet
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.

ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
time step instead of just the single next token. The future n-gram prediction explicitly encourages the model to plan
for the future tokens and prevents overfitting on strong local correlations. The model architecture is based on the
original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a main self-attention
mechanism and a self and n-stream (predict) self-attention mechanism.

The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
summarization.
@ -682,8 +682,8 @@ et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images
(after the pooling layer) that go through a linear layer (to go from the number of features at the end of the resnet to
the hidden state dimension of the transformer).

The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image.
@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t
The pretrained model only works for classification.

..
More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page

.. _retrieval-based-models:
@ -714,19 +713,22 @@ DPR
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>

`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, Vladimir Karpukhin et al.

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain question-answering
research.

DPR consists of three models:

* Question encoder: encodes questions as vectors
* Context encoder: encodes contexts as vectors
* Reader: extracts the answer to the question from the retrieved contexts, along with a relevance score (high if the
inferred span actually answers the question).

DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
then it calls the reader with the question and the retrieved documents to get the answer.
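A minimal sketch of the question encoder (the checkpoint name is an assumption taken from the model hub; the context encoder and reader follow the same pattern):

.. code-block:: python

    from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

    tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

    inputs = tokenizer("Who wrote the original transformer paper?", return_tensors="pt")
    # The pooled output is the dense vector used for retrieval.
    question_embedding = model(**inputs)[0]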
RAG
-----------------------------------------------------------------------------------------------------------------------
@ -740,12 +742,14 @@ RAG
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>

`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, Patrick Lewis,
Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
to adapt to downstream tasks.

The two models RAG-Token and RAG-Sequence are available for generation.
@ -764,19 +768,19 @@ use a sparse version of the attention matrix to speed up training.
**LSH attention**

:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key that are
equal (so very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
(determined by an n_rounds parameter) and then averaged together.
.. _local-attention:

**Local attention**

:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence.

Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
@ -799,8 +803,9 @@ Other tricks
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
that, axial positional encodings consist of factorizing that big matrix E into two smaller matrices E1 and E2, with
dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and
:math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time
step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`j // l1`
in E2.
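For instance, with :math:`l = 16384` and :math:`d = 1024`, choosing :math:`l_{1} = l_{2} = 128` and
:math:`d_{1} = d_{2} = 512` satisfies both constraints and replaces the :math:`16384 \times 1024 \approx 16.8M` entries
of E with only :math:`2 \times 128 \times 512 \approx 0.13M` entries in E1 and E2.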
@ -1,9 +1,9 @@
Multi-lingual models
=======================================================================================================================

Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have a different mechanism than mono-lingual models. This page details the usage of these
models.

The two models that currently support multiple languages are BERT and XLM.
@ -28,8 +28,8 @@ This section concerns the following checkpoints:
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
the tokenizer.

Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
@ -78,8 +78,9 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs)

The example `run_generation.py
<https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
text using the CLM checkpoints from XLM, using the language embeddings.
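A condensed sketch of that workflow (the prompt and the single-sentence batch are illustrative):

.. code-block:: python

    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch of size 1
    language_id = tokenizer.lang2id["en"]  # resolve the id of the language embedding
    langs = torch.tensor([language_id] * input_ids.shape[1]).view(1, -1)  # same shape as input_ids

    outputs = model(input_ids, langs=langs)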
XLM without Language Embeddings
-----------------------------------------------------------------------------------------------------------------------
@ -89,8 +90,8 @@ This section concerns the following checkpoints:
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)

These checkpoints do not require language embeddings at inference time. Unlike the previously mentioned XLM
checkpoints, they are used to obtain generic sentence representations.
BERT
@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)

These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
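For example (a minimal sketch; the French sentence is just an illustration), the multilingual checkpoint is loaded like any other BERT model and can be fed text in any of its languages directly:

.. code-block:: python

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    # No language embedding or flag is needed; the model infers the language from the context.
    inputs = tokenizer("HuggingFace est une entreprise française.", return_tensors="pt")
    outputs = model(**inputs)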
XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.

Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
@ -1,86 +1,69 @@
Perplexity of fixed-length models
=======================================================================================================================

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
<model_summary>`).

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence :math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,

.. math::

    \text{PPL}(X)
    = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}

where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing
different models.

This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
.. image:: imgs/ppl_full.gif
    :width: 600
    :alt: Full decomposition of a sequence with unlimited context length
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
.. image:: imgs/ppl_chunked.gif
    :width: 600
    :alt: Suboptimal PPL not taking advantage of full available context
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
.. image:: imgs/ppl_sliding.gif
    :width: 600
    :alt: Sliding window PPL taking advantage of all available context
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -95,10 +78,9 @@ Let's demonstrate this process with GPT-2.
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.

.. code-block:: python
@ -106,16 +88,13 @@ entire dataset in memory.
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).

.. code-block:: python
@ -139,14 +118,11 @@ are 512 preceding tokens available to condition on).
ppl = torch.exp(torch.stack(lls).sum() / end_loc)
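For reference, one way the stride loop that fills ``lls`` above might look; treat this as a sketch rather than the exact code of the example (``max_length``, ``stride`` and the ``-100`` masking follow the description in the previous paragraphs, and ``model``, ``encodings`` and ``device`` come from the earlier snippets):

.. code-block:: python

    import torch

    max_length = model.config.n_positions  # 1024 for GPT-2
    stride = 512

    lls = []
    for i in range(0, encodings.input_ids.size(1), stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, encodings.input_ids.size(1))
        trg_len = end_loc - i  # may be shorter than stride on the last window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # only score the last trg_len tokens

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            log_likelihood = outputs[0] * trg_len  # the loss is an average, rescale it to a sum

        lls.append(log_likelihood)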
Running this with the stride length equal to the max input length is Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
equivalent to the suboptimal, non-sliding-window strategy we discussed above. strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
The smaller the stride, the more context the model will have in making each and the better the reported perplexity will typically be.
prediction, and the better the reported perplexity will typically be.
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our striding window
strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
View File
@ -12,15 +12,15 @@ The library was designed with two strong goals in mind:
- Be as easy and fast to use as possible:

  - We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions,
    just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
    :doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
  - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
    :obj:`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
    loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
    and models' weights) from a pretrained checkpoint provided on `Hugging Face Hub
    <https://huggingface.co/models>`__ or your own saved checkpoint.
  - On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
    using a model (plus its associated tokenizer and configuration) on a given task and
    :func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
  - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
    extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
@ -52,10 +52,10 @@ Main concepts
The library is built around three types of classes for each model:

- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models (`torch.nn.Module
  <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models (`tf.keras.Model
  <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained weights provided in the
  library.
- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
  a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
  without any modification, creating the model will automatically take care of instantiating the configuration (which
@ -66,8 +66,8 @@ The library is built around three types of classes for each model:
All these classes can be instantiated from pretrained instances and saved locally using two methods:

- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
  stored locally (or on a server) by the user,
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
  :obj:`from_pretrained()`, as sketched right after this list.
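As a minimal sketch of that round trip (the checkpoint name and the local directory are only examples):

.. code-block:: python

    from transformers import AutoModel, AutoTokenizer

    # Download (or load from the cache) a pretrained tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")

    # Save both locally...
    tokenizer.save_pretrained("./my-bert-base-cased")
    model.save_pretrained("./my-bert-base-cased")

    # ...and reload them later from the saved directory
    tokenizer = AutoTokenizer.from_pretrained("./my-bert-base-cased")
    model = AutoModel.from_pretrained("./my-bert-base-cased")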
View File
@ -17,7 +17,7 @@ work properly.
the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence
token to index (that we usually call a `vocab`) as during pretraining.

To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
:func:`~transformers.AutoTokenizer.from_pretrained` method:
.. code-block::
@ -39,10 +39,10 @@ is its ``__call__``: you just need to feed your sentence to your tokenizer objec
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary mapping strings to lists of ints. The `input_ids <glossary.html#input-ids>`__ are the indices
corresponding to each token in our sentence. We will see below what the `attention_mask
<glossary.html#attention-mask>`__ is used for and in :ref:`the next section <sentence-pairs>` the goal of
`token_type_ids <glossary.html#token-type-ids>`__.

The tokenizer can decode a list of token ids into a proper sentence:
@ -51,10 +51,10 @@ The tokenizer can decode a list of token ids in a proper sentence:
    >>> tokenizer.decode(encoded_input["input_ids"])
    "[CLS] Hello, I'm a single sentence! [SEP]"
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing ``add_special_tokens=False``.
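For instance, a quick sketch of that option (the exact token ids depend on the checkpoint, so they are omitted here):

.. code-block:: python

    >>> # No [CLS]/[SEP] tokens are added when add_special_tokens=False
    >>> encoded_input = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
    >>> tokenizer.decode(encoded_input["input_ids"])
    "Hello, I'm a single sentence!"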
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
@ -114,9 +114,9 @@ You can do all of this by using the following options when feeding your list of
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask
<glossary.html#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which
ones it should not (because they represent padding in this case).
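A call along these lines produces padded ``input_ids`` together with such an ``attention_mask`` (the sentences are only
illustrative):

.. code-block:: python

    batch = tokenizer(
        ["Hello, I'm a single sentence!", "And another sentence", "And the very very last one"],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )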
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
@ -127,9 +127,9 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`

You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
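For example (a sketch; the resulting ids depend on the tokenizer you loaded):

.. code-block:: python

    >>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")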
@ -146,8 +146,8 @@ This will once again return a dict string to list of ints:
This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
of the inputs corresponds to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
``return_input_ids`` or ``return_token_type_ids``.

If we decode the token ids we obtained, we will see that the special tokens have been properly added.
@ -215,7 +215,7 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`
  a single sequence).
- :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
  by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
  padding will still be applied to it.
- :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
  behavior.
@ -238,9 +238,9 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`
truncation/padding to :obj:`max_length` is deactivated.

Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences
in any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
:obj:`truncation='longest_first'` to control how both sequences in the pair are truncated as detailed before.
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| Truncation                           | Padding                           | Instruction                                                                                 |
View File
@ -3,7 +3,8 @@ Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.

For a list that includes community-uploaded models, refer to `https://huggingface.co/models
<https://huggingface.co/models>`__.
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Architecture       | Shortcut name                                              | Details of the model                                                                                                                  |
View File
@ -1,8 +1,8 @@
Quick tour
=======================================================================================================================
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating into another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
@ -29,8 +29,8 @@ provides the following tasks out of the box:
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.
Let's see how this works for sentiment analysis (the other tasks are all covered in the :doc:`task summary
</task_summary>`):

.. code-block::
@ -160,9 +160,10 @@ To apply these steps on a given text, we can just feed it to our tokenizer:
    >>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
This returns a dictionary mapping strings to lists of ints. It contains the `ids of the tokens
<glossary.html#input-ids>`__, as mentioned before, but also additional arguments that will be useful to the model. Here
for instance, we also have an `attention mask <glossary.html#attention-mask>`__ that the model will use to have a
better understanding of the sequence:

.. code-block::
@ -191,8 +192,8 @@ and get tensors back. You can specify all of that to the tokenizer:
    ...     return_tensors="tf"
    ... )
The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:

.. code-block::
@ -213,8 +214,8 @@ Using the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
directly to the model; for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.

.. code-block::
@ -223,8 +224,8 @@ dictionary keys directly to tensors, for a PyTorch model, you need to unpack the
    >>> ## TENSORFLOW CODE
    >>> tf_outputs = tf_model(tf_batch)
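For the PyTorch side, the same call unpacks the dictionary (a sketch assuming ``pt_model`` and ``pt_batch`` were built
in the preceding, elided steps):

.. code-block:: python

    >>> ## PYTORCH CODE
    >>> pt_outputs = pt_model(**pt_batch)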
In 🤗 Transformers, all outputs are tuples (with only one element potentially). Here, we get a tuple with just the final
activations of the model.

.. code-block::
@ -239,11 +240,10 @@ final activations of the model.
    [ 0.08181786, -0.04179301]], dtype=float32)>,)
The model can return more than just the final activations, which is why the output is a tuple. Here we only asked for
the final activations, so we get a tuple with one element.

.. note::

    All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
    function (like SoftMax) since this final activation function is often fused with the loss.
Let's apply the SoftMax activation to get predictions.
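A sketch of that step on the PyTorch outputs from above (for TensorFlow one would use ``tf.nn.softmax`` on
``tf_outputs[0]`` instead):

.. code-block:: python

    >>> import torch.nn.functional as F
    >>> pt_predictions = F.softmax(pt_outputs[0], dim=-1)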
@ -281,11 +281,11 @@ If you have labels, you can provide them to the model, it will return a tuple wi
    >>> import tensorflow as tf
    >>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if you are using
TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision,
etc.). See the :doc:`training tutorial <training>` for more details.

.. note::
@ -336,13 +336,13 @@ The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that wi
pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
code is easy to access and tweak if you need to.

In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the :doc:`DistilBERT </model_doc/distilbert>` architecture. As
:class:`~transformers.AutoModelForSequenceClassification` (or
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow) was used, the model
automatically created is then a :class:`~transformers.DistilBertForSequenceClassification`. You can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate model and tokenizer without the auto magic:

.. code-block::
View File
@ -5,16 +5,18 @@ Exporting transformers models
ONNX / ONNXRuntime
=======================================================================================================================
Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT)
<https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field to provide a
unified and community-driven format to store and, by extension, efficiently execute neural networks, leveraging a
variety of hardware and dedicated optimizations.
Starting from transformers v2.10.0 we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can have a look at the effort by looking at our joint blog post `Accelerate your NLP pipelines
using Hugging Face Transformers and ONNX Runtime
<https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.

Exporting a model is done through the script `convert_graph_to_onnx.py` at the root of the transformers sources. The
following command shows how easy it is to export a BERT model from the library; simply run:
.. code-block:: bash
@ -27,62 +29,66 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:
* The generated model can be correctly loaded through onnxruntime.

.. note::

    Currently, inputs and outputs are always exported with dynamic sequence axes preventing some optimizations on the
    ONNX Runtime. If you would like to see such support for fixed-length inputs/outputs, please open up an issue on
    transformers.
Also, the conversion tool supports different options which let you tune the behavior of the generated model:

* **Change the target opset version of the generated model.** (More recent opsets generally support more operators and
  enable faster inference)

* **Export pipeline-specific prediction heads.** (Allows exporting the model along with its task-specific prediction
  head(s))

* **Use the external data format (PyTorch only).** (Lets you export models whose size is above 2Gb (`More info
  <https://github.com/pytorch/pytorch/pull/33062>`_))
Optimizations
-----------------------------------------------------------------------------------------------------------------------

ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph. Below
are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):
* Constant folding
* Attention Layer fusing
* Skip connection LayerNormalization fusing
* FastGeLU approximation
Some of the optimizations performed by ONNX runtime can be hardware specific and thus lead to different performance if
used on another machine with a different hardware configuration than the one used for exporting the model. For this
reason, when using ``convert_graph_to_onnx.py`` optimizations are not enabled, ensuring the model can be easily
exported to various hardware. Optimizations can then be enabled when loading the model through ONNX runtime for
inference.
.. note::

    When quantization is enabled (see below), the ``convert_graph_to_onnx.py`` script will enable optimizations on the
    model because quantization would modify the underlying graph making it impossible for ONNX runtime to do the
    optimizations afterwards.

.. note::

    For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime Github
    <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_.
Quantization
-----------------------------------------------------------------------------------------------------------------------

ONNX exporter supports generating a quantized version of the model to allow efficient inference.

Quantization works by converting the memory representation of the parameters in the neural network to a compact integer
format. By default, weights of a neural network are stored as single-precision float (`float32`) which can express a
wide range of floating-point numbers with decent precision. These properties are especially interesting at training
where you want fine-grained representation.

On the other hand, after the training phase, it has been shown one can greatly reduce the range and the precision of
`float32` numbers without changing the performance of the neural network.

More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus
reducing the overall size of the model. Here, we are enabling `float32` mapping to `int8` values (a non-floating,
single byte, number representation) according to the following formula:
.. math::

    y_{float32} = scale * x_{int8} - zero\_point
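As a toy illustration of this mapping (a plain NumPy sketch, not the actual ONNX quantization code; the values and the
symmetric-scale choice are only for illustration):

.. code-block:: python

    import numpy as np

    # Quantize a small float32 tensor to int8 with a single scale and zero_point = 0
    weights = np.array([-0.51, 0.02, 0.33, 1.27], dtype=np.float32)
    scale = np.abs(weights).max() / 127.0
    zero_point = 0.0
    x_int8 = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)

    # Dequantize with the formula above: y_float32 = scale * x_int8 - zero_point
    y_float32 = scale * x_int8.astype(np.float32) - zero_point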
@ -96,9 +102,9 @@ Leveraging tiny-integers has numerous advantages when it comes to inference:
* Integer operations execute a magnitude faster on modern hardware
* Integer operations require less power to do the computations

In order to convert a transformers model to ONNX IR with quantized weights you just need to specify ``--quantize`` when
using ``convert_graph_to_onnx.py``. Also, you can have a look at the ``quantize()`` utility-method in this same script
file.

Example of quantized BERT model export:
@ -111,26 +117,27 @@ Example of quantized BERT model export:
.. note::

    When exporting a quantized model you will end up with two different ONNX files. The one specified at the end of the
    above command will contain the original ONNX model storing `float32` weights. The second one, with ``-quantized``
    suffix, will hold the quantized parameters.
TorchScript
=======================================================================================================================
.. note::

    This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
    variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
    with more code examples, a more flexible implementation, and benchmarks comparing python-based codes with compiled
    TorchScript.
According to Pytorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". Pytorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.

We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
in a different environment than a Pytorch-based python program. Here we explain how to export and use our models using
TorchScript.

Exporting a model requires two things:
@ -145,13 +152,14 @@ Implications
TorchScript flag and tied weights
-----------------------------------------------------------------------------------------------------------------------
This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied
weights, therefore it is necessary to untie and clone the weights beforehand.

This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding``
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.
This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.
@ -160,8 +168,8 @@ Dummy inputs and standard lengths
-----------------------------------------------------------------------------------------------------------------------
The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.
The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
@ -185,8 +193,8 @@ Below is an example, showing how to save, load models as well as how to use the
Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated according
to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``

.. code-block:: python
View File
@ -2,30 +2,30 @@ Summary of the tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage for
tasks such as question answering, sequence classification, named entity recognition and others.

These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint,
automatically selecting the correct model architecture. Please check the :class:`~transformers.AutoModel` documentation
for more information. Feel free to modify the code to be more specific and adapt it to your specific use-case.
In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These
checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the
following:

- Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage
  one of the `run_$TASK.py` scripts in the `examples
  <https://github.com/huggingface/transformers/tree/master/examples>`__ directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case and
  domain. As mentioned previously, you may leverage the `examples
  <https://github.com/huggingface/transformers/tree/master/examples>`__ scripts to fine-tune your model, or you may
  create your own training script.
In order to do an inference on a task, several mechanisms are made available by the library:

- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Direct model use: Less abstractions, but more flexibility and power via a direct access to a tokenizer
  (PyTorch/TensorFlow) and full inference capacity.

Both approaches are showcased here.
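As a taste of the pipeline route (a sketch using the sentiment-analysis pipeline; the example sentence is arbitrary):

.. code-block:: python

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    print(classifier("We are very happy to show you the 🤗 Transformers library."))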
@ -40,15 +40,17 @@ Both approaches are showcased here.
Sequence Classification
-----------------------------------------------------------------------------------------------------------------------

Sequence classification is the task of classifying sequences according to a given number of classes. An example of
sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
model on a GLUE sequence classification task, you may leverage the `run_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`__ and
`run_pl_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_pl_glue.py>`__ or
`run_tf_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`__ scripts.

Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
leverages a fine-tuned model on sst2, which is a GLUE task.

This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
@ -67,18 +69,16 @@ This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
    label: POSITIVE, with score: 0.9999
Here is an example of doing a sequence classification using a model to determine if two sequences are paraphrases of
each other. The process is the following (a condensed sketch follows the list):

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
   with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators token type ids and attention
   masks (:func:`~transformers.PreTrainedTokenizer.encode` and :func:`~transformers.PreTrainedTokenizer.__call__` take
   care of this).
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
   paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
5. Print the results.
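A condensed sketch of these steps (model outputs are tuples here, as in the library version this document describes;
the checkpoint and sentences are only examples):

.. code-block:: python

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

    sequence_a = "The company HuggingFace is based in New York City"
    sequence_b = "HuggingFace's headquarters are situated in Manhattan"

    # Step 2: build the pair with the model-specific special tokens, token type ids and attention mask
    paraphrase = tokenizer(sequence_a, sequence_b, return_tensors="pt")

    # Steps 3-4: classify and turn the logits into probabilities
    logits = model(**paraphrase)[0]
    probs = torch.softmax(logits, dim=1).tolist()[0]

    # Step 5: print the results (class 1 means "is a paraphrase")
    print(f"not paraphrase: {round(probs[0] * 100)}%, paraphrase: {round(probs[1] * 100)}%")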
@ -155,14 +155,15 @@ Extractive Question Answering
-----------------------------------------------------------------------------------------------------------------------

Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
model on a SQuAD task, you may leverage the `run_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_squad.py>`__ and
`run_tf_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_tf_squad.py>`__ scripts.

Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
leverages a fine-tuned model on SQuAD.

.. code-block::
@ -176,8 +177,8 @@ It leverages a fine-tuned model on SQuAD.
    ... a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
    ... """
This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the
positions of the extracted answer in the text.

.. code-block::
@ -192,16 +193,13 @@ are the positions of the extracted answer in the text.
Here is an example of question answering using a model and a tokenizer. The process is the following (a condensed
sketch follows the list):

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
   with the weights stored in the checkpoint.
2. Define a text and a few questions.
3. Iterate over the questions and build a sequence from the text and the current question, with the correct
   model-specific separators token type ids and attention masks.
4. Pass this sequence through the model. This outputs a range of scores across the entire sequence tokens (question and
   text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens.
6. Fetch the tokens from the identified start and stop values, convert those tokens to a string.
7. Print the results.
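A condensed sketch of the question-answering steps above (tuple outputs assumed; the text, question and checkpoint are
only examples):

.. code-block:: python

    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
    model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

    text = "🤗 Transformers provides general-purpose architectures for Natural Language Understanding and Generation."
    question = "What does 🤗 Transformers provide?"

    # Step 3: build the question/text pair with the model-specific special tokens and masks
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")

    # Step 4: scores for the start and end positions over the whole sequence
    start_scores, end_scores = model(**inputs)[:2]

    # Steps 6-7: take the most likely span and convert it back to a string
    answer_start = int(torch.argmax(start_scores))
    answer_end = int(torch.argmax(end_scores)) + 1
    input_ids = inputs["input_ids"][0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    print(tokenizer.convert_tokens_to_string(tokens))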
@ -299,22 +297,22 @@ Here is an example of question answering using a model and a tokenizer. The proc
Language Modeling
-----------------------------------------------------------------------------------------------------------------------

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.

Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
Masked Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
<https://arxiv.org/abs/1910.13461>`__, part 4.2).

Here is an example of using pipelines to replace a mask from a sequence:
@ -324,8 +322,7 @@ Here is an example of using pipelines to replace a mask from a sequence:
    >>> nlp = pipeline("fill-mask")

This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary:

.. code-block::
@ -359,14 +356,12 @@ vocabulary:
Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and
   loads it with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the :obj:`tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the
   values are the scores attributed to each token. The model gives higher scores to tokens it deems probable in that
   context.
5. Retrieve the top 5 tokens using the PyTorch :obj:`topk` or TensorFlow :obj:`top_k` methods.
6. Replace the mask token by the tokens and print the results.
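A minimal PyTorch sketch of these steps (the DistilBERT checkpoint name and the example sentence are assumptions; the
TensorFlow version is analogous):

.. code-block::

    >>> from transformers import AutoModelForMaskedLM, AutoTokenizer
    >>> import torch

    >>> # 1. a DistilBERT masked-LM checkpoint (assumed name)
    >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
    >>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

    >>> # 2. a sequence with the tokenizer's mask token in place of a word
    >>> sequence = f"Distilled models are smaller than the models they mimic. Using them would help {tokenizer.mask_token} our carbon footprint."

    >>> # 3. encode and locate the mask token
    >>> input_ids = tokenizer.encode(sequence, return_tensors="pt")
    >>> mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

    >>> # 4. scores over the vocabulary at the masked position
    >>> token_logits = model(input_ids)[0]
    >>> mask_token_logits = token_logits[0, mask_token_index, :]

    >>> # 5.-6. top 5 candidate tokens, substituted back into the sequence
    >>> top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
    >>> for token in top_5_tokens:
    ...     print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))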
@ -427,9 +422,12 @@ Causal language modeling is the task of predicting the token following a sequenc
model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
for generation tasks.

Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
input sequence.

Here is an example of using the tokenizer and model and leveraging the
:func:`~transformers.PreTrainedModel.top_k_top_p_filtering` method to sample the next token following an input sequence
of tokens.
.. code-block::
@ -490,12 +488,16 @@ This outputs a (hopefully) coherent next token following the original sequence,
    >>> print(resulting_string)
    Hugging Face is based in DUMBO, New York City, and has

In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to
generate multiple tokens up to a user-defined length.
Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In text generation (*a.k.a.* *open-ended text generation*) the goal is to create a coherent portion of text that is a
continuation of the given context. The following example shows how *GPT-2* can be used in pipelines to generate text.
By default, all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations
(see the `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).
.. code-block::
@ -507,8 +509,9 @@ In text generation (*a.k.a* *open-ended text generation*) the goal is to create
Here, the model generates a random text with a total maximal length of *50* tokens from the context *"As far as I am
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
pipeline, as is shown above for the ``max_length`` argument.
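For instance, a minimal sketch of such a pipeline call, overriding two of the ``generate()`` defaults (the arguments
shown are illustrative):

.. code-block::

    >>> from transformers import pipeline

    >>> text_generator = pipeline("text-generation")
    >>> # ``max_length`` and ``do_sample`` are forwarded to ``PreTrainedModel.generate()`` and override its defaults
    >>> print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))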
Here is an example of text generation using ``XLNet`` and its tokenizer.
@ -569,25 +572,30 @@ Here is an example of text generation using ``XLNet`` and its tokenzier.
    >>> print(generated)
    Today the weather is really nice and I am planning on anning on taking a nice...... of a great time!<eop>...............

Text generation is currently possible with *GPT-2*, *OpenAI-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in
PyTorch and for most models in TensorFlow as well. As can be seen in the example above, *XLNet* and *Transfo-XL* often
need to be padded to work well. GPT-2 is usually a good choice for *open-ended text generation* because it was trained
on millions of webpages with a causal language modeling objective.

For more information on how to apply different decoding strategies for text generation, please also refer to our text
generation blog post `here <https://huggingface.co/blog/how-to-generate>`__.
Named Entity Recognition
-----------------------------------------------------------------------------------------------------------------------

Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
`run_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_ner.py>`__ (PyTorch),
`run_pl_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_pl_ner.py>`__
(leveraging pytorch-lightning) or the
`run_tf_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_tf_ner.py>`__
(TensorFlow) scripts.

Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
belonging to one of 9 classes:

- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
@ -599,8 +607,8 @@ of 9 classes:
- B-LOC, Beginning of a location right after another location
- I-LOC, Location

It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https://github.com/stefan-it>`__ from
`dbmdz <https://github.com/dbmdz>`__.

.. code-block::
@ -612,8 +620,8 @@ It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https:
... "close to the Manhattan Bridge which is visible from the window." ... "close to the Manhattan Bridge which is visible from the window."
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above. Here are the This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
expected results: Here are the expected results:
.. code-block:: .. code-block::
@ -633,24 +641,21 @@ expected results:
        {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
    ]

Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.

Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
   with the weights stored in the checkpoint.
2. Define the label list the model was trained with.
3. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
4. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
   encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
5. Encode that sequence into IDs (special tokens are added automatically).
6. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
   distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
   each token.
7. Zip together each token with its prediction and print it.
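A minimal PyTorch sketch of these steps (the checkpoint name and the label ordering are assumptions based on a
CoNLL-2003 fine-tuned model):

.. code-block::

    >>> from transformers import AutoModelForTokenClassification, AutoTokenizer
    >>> import torch

    >>> # 1. a token-classification checkpoint fine-tuned on CoNLL-2003 (assumed name) and a matching cased tokenizer
    >>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    >>> # 2. the label list the model was trained with (assumed ordering)
    >>> label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

    >>> # 3. a sequence with known entities
    >>> sequence = "Hugging Face Inc. is a company based in New York City."

    >>> # 4.-5. tokenize (including special tokens) and encode
    >>> tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
    >>> inputs = tokenizer.encode(sequence, return_tensors="pt")

    >>> # 6. per-token class distribution, reduced with argmax
    >>> outputs = model(inputs)[0]
    >>> predictions = torch.argmax(outputs, dim=2)

    >>> # 7. pair each token with its predicted label
    >>> print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])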
.. code-block::
@ -713,9 +718,9 @@ Here is an example of doing named entity recognition, using a model and a tokeni
    >>> predictions = tf.argmax(outputs, axis=2)

This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every
token has a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that
token. The following array should be the output:

.. code-block::
@ -727,11 +732,13 @@ Summarization
Summarization is the task of summarizing a document or an article into a shorter text.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this
`document <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.

Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.

.. code-block::
@ -758,9 +765,9 @@ Here is an example of using the pipelines to do summarization. It leverages a Ba
    ... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
    ... """

Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown
below. This outputs the following summary:

.. code-block::
@ -769,12 +776,14 @@ This outputs the following summary:
Here is an example of doing summarization using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
   model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
4. Use the ``PreTrainedModel.generate()`` method to generate the summary.

In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
CNN / Daily Mail), it yields very good results.
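A minimal PyTorch sketch of these steps with T5 (``ARTICLE`` is a placeholder for the text to summarize; the generation
arguments are illustrative):

.. code-block::

    >>> from transformers import AutoModelWithLMHead, AutoTokenizer

    >>> # 1. T5 is an encoder-decoder model, so it can summarize out of the box
    >>> model = AutoModelWithLMHead.from_pretrained("t5-base")
    >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

    >>> # 2.-3. prepend the T5 summarization prefix to the article
    >>> ARTICLE = "New York (CNN) ..."  # placeholder for a long news article
    >>> inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)

    >>> # 4. generate the summary (generation arguments are illustrative)
    >>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    >>> print(tokenizer.decode(outputs[0]))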
.. code-block::
@ -802,14 +811,13 @@ Translation
Translation is the task of translating a text from one language to another.

An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this
`document <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.

Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet yields impressive translation results.

.. code-block::
@ -819,12 +827,13 @@ translation results.
    >>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
    [{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.

Here is an example of doing translation using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder
   model, such as ``Bart`` or ``T5``.
2. Define the text that should be translated.
3. Add the T5 specific prefix "translate English to German: ".
4. Use the ``PreTrainedModel.generate()`` method to perform the translation.
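A minimal PyTorch sketch of these steps with T5 (the generation arguments are illustrative):

.. code-block::

    >>> from transformers import AutoModelWithLMHead, AutoTokenizer

    >>> model = AutoModelWithLMHead.from_pretrained("t5-base")
    >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

    >>> # 2.-3. the text to translate, with the T5 translation prefix
    >>> inputs = tokenizer.encode(
    ...     "translate English to German: Hugging Face is a technology company based in New York and Paris",
    ...     return_tensors="pt",
    ... )

    >>> # 4. generate the translation
    >>> outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
    >>> print(tokenizer.decode(outputs[0]))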
View File
@ -12,17 +12,26 @@ There are 2 test suites in the repository:
How transformers are tested
-----------------------------------------------------------------------------------------------------------------------

1. Once a PR is submitted it gets tested with 9 CircleCI jobs. Every new commit to that PR gets retested. These jobs
   are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__,
   so that if needed you can reproduce the same environment on your machine.

   These CI jobs don't run ``@slow`` tests.

2. There are 3 jobs run by `github actions <https://github.com/huggingface/transformers/actions>`__:

   * `torch hub integration <https://github.com/huggingface/transformers/blob/master/.github/workflows/github-torch-hub.yml>`__:
     checks whether torch hub integration works.

   * `self-hosted (push) <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__:
     runs fast tests on GPU only on commits on ``master``. It only runs if a commit on ``master`` has updated the code
     in one of the following folders: ``src``, ``tests``, ``.github`` (to prevent running on added model cards,
     notebooks, etc.)

   * `self-hosted runner <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__:
     runs normal and slow tests on GPU in ``tests`` and ``examples``:
.. code-block:: bash
@ -43,7 +52,8 @@ Running tests
Choosing which tests to run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This document goes into many details of how tests can be run. If after reading everything, you need even more details
you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.

Here are some of the most useful ways of running tests.
@ -90,7 +100,7 @@ All tests of a given test file:
    pytest tests/test_optimization.py --collect-only -q

Run a specific test module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -99,12 +109,13 @@ To run an individual test module:
.. code-block:: bash

    pytest tests/test_logging.py

Run specific tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
class containing those tests. For example, it could be:

.. code-block:: bash
@ -131,7 +142,7 @@ As mentioned earlier you can see what tests are contained inside the ``Optimizat
    pytest tests/test_optimization.py::OptimizationTest --collect-only -q

You can run tests by keyword expressions.

To run only tests whose name contains ``adam``:
@ -158,7 +169,9 @@ And you can combine the two patterns in one:
Run only modified tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the tests related to the unstaged files or the current branch (according to Git) by using
`pytest-picked <https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing that your
changes didn't break anything, since it won't run the tests related to files you didn't touch.

.. code-block:: bash
@ -168,17 +181,14 @@ You can run the tests related to the unstaged files or the current branch (accor
    pytest --picked

All tests will be run from files and folders which are modified, but not yet committed.

Automatically rerun failed tests on source modification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a very useful feature of detecting all failed
tests, and then waiting for you to modify files and continuously re-running those failing tests until they pass while
you fix them, so that you don't need to restart pytest after you make the fix. This is repeated until all tests pass,
after which again a full run is performed.

.. code-block:: bash
@ -187,10 +197,9 @@ which again a full run is performed.
To enter the mode: ``pytest -f`` or ``pytest --looponfail``

File changes are detected by looking at ``looponfailroots`` root directories and all of their contents (recursively).
If the default for this value does not work for you, you can change it in your project by setting a configuration
option in ``setup.cfg``:

.. code-block:: ini
@ -204,17 +213,17 @@ or ``pytest.ini``/``tox.ini`` files:
    [pytest]
    looponfailroots = transformers tests

This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file's
directory.

`pytest-watch <https://github.com/joeyespo/pytest-watch>`__ is an alternative implementation of this functionality.

Skip a test module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to run all test modules, except a few, you can exclude them by giving an explicit list of tests to run. For
example, to run all except ``test_modeling_*.py`` tests:

.. code-block:: bash
@ -224,8 +233,7 @@ If you want to run all test modules, except a few you can exclude them by giving
Clearing state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On CI builds, and when isolation is important (at the expense of speed), the cache should be cleared:

.. code-block:: bash
@ -234,24 +242,23 @@ be cleared:
Running tests in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As mentioned earlier, ``make test`` runs tests in parallel via the ``pytest-xdist`` plugin (``-n X`` argument, e.g.
``-n 2`` to run 2 parallel jobs).

``pytest-xdist``'s ``--dist=`` option allows one to control how the tests are grouped. ``--dist=loadfile`` puts the
tests located in one file onto the same process.

Since the order of executed tests is different and unpredictable, if running the test suite with ``pytest-xdist``
produces failures (meaning we have some undetected coupled tests), use
`pytest-replay <https://github.com/ESSS/pytest-replay>`__ to replay the tests in the same order, which should help
reduce that failing sequence to a minimum.
Test order and repetition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
inter-dependency and state-related bugs (tear down). And straightforward multiple repetition is just good for detecting
some problems that get uncovered by the randomness of DL.

Repeat tests
@ -268,10 +275,10 @@ And then run every test multiple times (50 by default):
.. code-block:: bash

    pytest --flake-finder --flake-runs=5 tests/test_failing_test.py

.. note::
    This plugin doesn't work with ``-n`` flag from ``pytest-xdist``.

.. note::
    There is another plugin ``pytest-repeat``, but it doesn't work with ``unittest``.
@ -283,14 +290,11 @@ Run tests in a random order
    pip install pytest-random-order

Important: the presence of ``pytest-random-order`` will automatically randomize tests, no configuration change or
command line option is required.

As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When
``pytest-random-order`` is installed it will print the random seed it used for that session, e.g.:
.. code-block:: bash
@ -299,8 +303,7 @@ e.g:
    Using --random-order-bucket=module
    Using --random-order-seed=573663

So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:

.. code-block:: bash
@ -309,11 +312,9 @@ adding that exact seed, e.g.:
    Using --random-order-bucket=module
    Using --random-order-seed=573663

It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start
manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order
they failed and tell pytest to not randomize them instead, using ``--random-order-bucket=none``, e.g.:
.. code-block:: bash .. code-block:: bash
@ -325,12 +326,13 @@ To disable the shuffling for all tests:
    pytest --random-order-bucket=none

By default ``--random-order-bucket=module`` is implied, which will shuffle the files on the module levels. It can also
shuffle on ``class``, ``package``, ``global`` and ``none`` levels. For the complete details please see its
`documentation <https://github.com/jbasko/pytest-random-order>`__.

Another randomization alternative is `pytest-randomly <https://github.com/pytest-dev/pytest-randomly>`__. This module
has a very similar functionality/interface, but it doesn't have the bucket modes available in ``pytest-random-order``.
It has the same problem of imposing itself once installed.

Look and feel variations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -338,13 +340,11 @@ Look and feel variations
pytest-sugar
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a plugin that improves the look-n-feel, adds a
progressbar, and shows tests that fail and the assert instantly. It gets activated automatically upon installation.

.. code-block:: bash

    pip install pytest-sugar

To run tests without it, run:
@ -360,8 +360,7 @@ or uninstall it.
Report each sub-test name and its progress
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a single or a group of tests via ``pytest`` (after ``pip install pytest-pspec``):

.. code-block:: bash
@ -372,9 +371,8 @@ For a single or a group of tests via ``pytest`` (after
Instantly shows failed tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__ shows failures and errors instantly instead of
waiting until the end of the test session.

.. code-block:: bash
@ -390,18 +388,20 @@ To GPU or not to GPU
On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py

or if you have multiple gpus, you can specify which one is to be used by ``pytest``. For example, to use only the
second gpu if you have gpus ``0`` and ``1``, you can run:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py

This is handy when you want to run different tasks on different GPUs.

Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple GPUs. The following skip
decorators are used to set the requirements of tests CPU/GPU/TPU-wise:

* ``require_torch`` - this test will run only under torch
* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU
@ -423,7 +423,8 @@ If a test requires ``tensorflow`` use the ``require_tf`` decorator. For example:
    @require_tf
    def test_tf_thing_with_tensorflow():

These decorators can be stacked. For example, if a test is slow and requires at least one GPU under PyTorch, here is
how to set it up:

.. code-block:: python
@ -431,7 +432,8 @@ These decorators can be stacked. For example, if a test is slow and requires at
    @slow
    def test_example_slow_on_gpu():

Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed
last for them to work correctly. Here is an example of the correct usage:

.. code-block:: python
@ -439,7 +441,8 @@ Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_
    @require_torch_multigpu
    def test_integration_foo():

This order problem doesn't exist with ``@pytest.mark.parametrize``, you can put it first or last and it will still
work. But it only works with non-unittests.
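For instance, a hedged sketch of such a non-``unittest`` test combining both kinds of decorators (the import path and
the test body are assumptions made for illustration):

.. code-block:: python

    import pytest
    from transformers.testing_utils import require_torch_gpu  # assumed import path

    # with non-unittest tests the position of the parametrize mark is flexible
    @pytest.mark.parametrize("batch_size", [1, 8])
    @require_torch_gpu
    def test_batched_forward(batch_size):
        # placeholder body - a real test would run a GPU forward pass with this batch size
        assert batch_size > 0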
Inside tests:
@ -450,16 +453,22 @@ Inside tests:
    torch.cuda.device_count()

Distributed training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``pytest`` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right
thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one
spawns a normal process that then spawns off multiple workers and manages the IO pipes.

This is still under development but you can study 2 different tests that perform this successfully:

* `test_seq2seq_examples_multi_gpu.py <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_seq2seq_examples_multi_gpu.py>`__ -
  a ``pytorch-lightning``-running test (had to use PL's ``ddp`` spawning method which is the default)
* `test_finetune_trainer.py <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_finetune_trainer.py>`__ -
  a normal (non-PL) test

To jump right into the execution point, search for the ``execute_async_std`` function in those tests.
@ -474,12 +483,10 @@ You will need at least 2 GPUs to see these tests in action:
Output capture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During test execution any output sent to ``stdout`` and ``stderr`` is captured. If a test or a setup method fails, the
corresponding captured output will usually be shown along with the failure traceback.

To disable output capturing and to get the ``stdout`` and ``stderr`` normally, use ``-s`` or ``--capture=no``:

.. code-block:: bash
@ -512,9 +519,8 @@ Creating a URL for each test failure:
    pytest --pastebin=failed tests/test_logging.py

This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
tests as usual or add, for example, ``-x`` if you only want to send one particular failure.

Creating a URL for a whole test session log:
@ -527,18 +533,22 @@ Creating a URL for a whole test session log:
Writing tests
-----------------------------------------------------------------------------------------------------------------------

🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems
can be used.

You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which features are supported, but the important
thing to remember is that most ``pytest`` fixtures don't work. Neither does parametrization, but we use the module
``parameterized`` that works in a similar way.

Parametrization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
the test, but then there is no way of running that test for just one set of arguments.

.. code-block:: python

    # test_this1.py
    import unittest
    from parameterized import parameterized
@ -551,7 +561,8 @@ Often, there is a need to run the same test multiple times, but with different a
        def test_floor(self, name, input, expected):
            assert_equal(math.floor(input), expected)
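A complete, self-contained version of such a test might look like this (a sketch; ``self.assertEqual`` is used so no
extra assertion helper is needed, and the parameter sets are chosen to match the sub-test names listed further below):

.. code-block:: python

    # test_this1.py (minimal, self-contained sketch)
    import math
    import unittest

    from parameterized import parameterized


    class TestMathUnitTest(unittest.TestCase):
        @parameterized.expand(
            [
                ("negative", -1.5, -2.0),
                ("integer", 1, 1.0),
                ("large fraction", 1.6, 1),
            ]
        )
        def test_floor(self, name, input, expected):
            # each parameter tuple becomes its own sub-test
            self.assertEqual(math.floor(input), expected)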
Now, by default this test will be run 3 times, each time with the last 3 arguments of ``test_floor`` being assigned the
corresponding arguments in the parameter list.

and you could run just the ``negative`` and ``integer`` sets of params with:
@ -565,14 +576,15 @@ or all but ``negative`` sub-tests, with:
pytest -k "not negative" tests/test_mytest.py pytest -k "not negative" tests/test_mytest.py
Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any or all of them using their exact names. Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any
or all of them using their exact names.
.. code-block:: bash .. code-block:: bash
pytest test_this1.py --collect-only -q pytest test_this1.py --collect-only -q
and it will list: and it will list:
.. code-block:: bash .. code-block:: bash
test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_0_negative
@ -584,10 +596,12 @@ So now you can run just 2 specific sub-tests:
.. code-block:: bash

    pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer

The module `parameterized <https://pypi.org/project/parameterized/>`__, which is already in the developer dependencies
of ``transformers``, works for both ``unittest`` and ``pytest`` tests.

If, however, the test is not a ``unittest``, you may use ``pytest.mark.parametrize`` (or you may see it being used in
some existing tests, mostly under ``examples``).

Here is the same example, this time using ``pytest``'s ``parametrize`` marker:
@ -606,14 +620,16 @@ Here is the same example, this time using ``pytest``'s ``parametrize`` marker:
    def test_floor(name, input, expected):
        assert_equal(math.floor(input), expected)
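A complete, self-contained version of this pytest-style test might look like this (a sketch; the parameter values are
chosen to match the sub-test names listed below):

.. code-block:: python

    # test_this2.py (minimal, self-contained sketch)
    import math

    import pytest


    @pytest.mark.parametrize(
        "name, input, expected",
        [
            ("negative", -1.5, -2.0),
            ("integer", 1, 1.0),
            ("large fraction", 1.6, 1),
        ],
    )
    def test_floor(name, input, expected):
        # pytest builds the sub-test id from the parameter values, e.g. test_floor[integer-1-1.0]
        assert math.floor(input) == expected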
Same as with ``parameterized``, with ``pytest.mark.parametrize`` you can have fine control over which sub-tests are
run, if the ``-k`` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
names for the sub-tests. Here is what they look like:

.. code-block:: bash

    pytest test_this2.py --collect-only -q

and it will list:

.. code-block:: bash

    test_this2.py::test_floor[integer-1-1.0]
@ -628,16 +644,20 @@ So now you can run just the specific test:
as in the previous example.

Temporary files and directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite
each other's data. Also we want to get the temp files and directories removed at the end of each test that created
them. Therefore, using packages like ``tempfile``, which address these needs, is essential.

However, when debugging tests, you need to be able to see what goes into the temp file or directory and you want to
know its exact path and not have it randomized on every test re-run.

A helper class :obj:`transformers.testing_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
:obj:`unittest.TestCase`, so we can easily inherit from it in the test modules.

Here is an example of its usage:
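A minimal sketch (the class and test names are illustrative):

.. code-block:: python

    from transformers.testing_utils import TestCasePlus

    class ExamplesTests(TestCasePlus):
        def test_whatever_which_needs_tmp_dir(self):
            # a unique temporary directory, auto-removed at the end of the test
            tmp_dir = self.get_auto_remove_tmp_dir()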
@ -650,23 +670,27 @@ Here is an example of its usage:
This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location. This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location.
In this and all the following scenarios the temporary directory will be auto-removed at the end of the test, unless
``after=False`` is passed to the helper function.

* Create a temporary directory of my choice and delete it at the end - useful for debugging when you want to monitor a
  specific directory:

.. code-block:: python

   def test_whatever(self):
       tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test")

* Create a temporary directory of my choice and do not delete it at the end---useful for when you want to look at the
  temp results:

.. code-block:: python

   def test_whatever(self):
       tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", after=False)

* Create a temporary directory of my choice and ensure to delete it right away---useful for when you disabled deletion
  in the previous test run and want to make sure that the temporary directory is empty before the new test is run:

.. code-block:: python
@ -674,38 +698,33 @@ In this and all the following scenarios the temporary directory will be auto-rem
   tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", before=True)

.. note::
   In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if
   an explicit :obj:`tmp_dir` is used, so that by mistake no ``/tmp`` or similar important part of the filesystem will
   get nuked, i.e. please always pass paths that start with ``./``.

.. note::
   Each test can register multiple temporary directories and they all will get auto-removed, unless requested
   otherwise.
Skipping tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to
commit it to the main repository we need to make sure it's skipped during ``make test``.

Methods:

- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip
  running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping
  tests that depend on an external resource which is not available at the moment (for example a database).
- An **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet
  implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with
  ``pytest.mark.xfail``), it's an xpass and will be reported in the test summary.

One of the important differences between the two is that ``skip`` doesn't run the test, and ``xfail`` does. So if the
code that's buggy causes some bad state that will affect other tests, do not use ``xfail``.
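As a quick illustration of the two markers (a sketch; the conditions and reasons are made up):

.. code-block:: python

   import sys

   import pytest


   @pytest.mark.skipif(sys.platform != "win32", reason="windows-only test")
   def test_windows_feature():
       ...


   @pytest.mark.xfail(reason="known bug, not fixed yet")
   def test_known_broken_feature():
       ...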
Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -772,7 +791,7 @@ or:
   @unittest.skipIf(torch_device == "cpu", "Can't do half precision")
   def test_feature_x():

or skip the whole module:

.. code-block:: python
@ -786,7 +805,9 @@ More details, example and ways are `here <https://docs.pytest.org/en/latest/skip
Slow tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The library of tests is ever-growing, and some of the tests take minutes to run, so we can't afford to wait an hour
for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be marked
as in the example below:

.. code-block:: python
@ -799,8 +820,9 @@ Once a test is marked as ``@slow``, to run such tests set ``RUN_SLOW=1`` env var
.. code-block:: bash

   RUN_SLOW=1 pytest tests

Some decorators like ``@parameterized`` rewrite test names, therefore ``@slow`` and the rest of the skip decorators
``@require_*`` have to be listed last for them to work correctly. Here is an example of the correct usage:

.. code-block:: python
@ -808,39 +830,55 @@ Some decorators like ``@parameterized`` rewrite test names, therefore ``@slow``
   @slow
   def test_integration_foo():
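To make the ordering rule concrete, here is a hedged sketch of a parameterized slow test (the decorators come from
``transformers.testing_utils`` and the ``parameterized`` package as described in this document; the parameters are
illustrative):

.. code-block:: python

   import unittest

   from parameterized import parameterized

   from transformers.testing_utils import require_torch, slow


   class IntegrationTests(unittest.TestCase):
       @parameterized.expand([("case_a",), ("case_b",)])
       @require_torch
       @slow
       def test_integration_foo(self, name):
           ...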
As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PR CI
checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will
get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your
machine before submitting the PR.

Here is a rough decision-making mechanism for choosing which tests should be marked as slow:

If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files,
pipelines), then we should run that test in the non-slow test suite. If it's focused on another aspect of the library,
such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine
this approach we should have exceptions:
* All tests that need to download a heavy set of weights (e.g., model or tokenizer integration tests, pipeline
  integration tests) should be set to slow. If you're adding a new model, you should create and upload to the hub a
  tiny version of it (with random weights) for integration tests. This is discussed in the following paragraphs.

* All tests that need to do a training not specifically optimized to be fast should be set to slow.

* We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to
  ``@slow``. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked
  as ``@slow``.

* If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.

Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example,
a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models
have a very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the ``@slow`` tests can use large
slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:

.. code-block:: bash

   grep tiny tests examples
Here is an example of a `script
<https://github.com/huggingface/transformers/blob/master/scripts/fsmt/fsmt-make-tiny-model.py>`__ that created the tiny
model `stas/tiny-wmt19-en-de <https://huggingface.co/stas/tiny-wmt19-en-de>`__. You can easily adjust it to your
specific model's architecture.
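As an illustration of the idea (a sketch only, not the linked script; the configuration values are just examples of
"tiny" settings):

.. code-block:: python

   from transformers import BertConfig, BertModel

   # build a tiny, randomly initialized model: few layers, small hidden size, small vocab
   config = BertConfig(
       vocab_size=1000,
       hidden_size=32,
       num_hidden_layers=2,
       num_attention_heads=2,
       intermediate_size=64,
   )
   tiny_model = BertModel(config)
   tiny_model.save_pretrained("./tiny-bert-for-tests")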
It's easy to measure the run-time incorrectly if, for example, there is an overhead of downloading a huge model, but if
you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the
execution speed report in CI logs instead (the output of ``pytest --durations=0 tests``).

That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast.
If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
tests.

Testing the stdout/stderr output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to test functions that write to ``stdout`` and/or ``stderr``, the test can access those streams using
``pytest``'s `capsys system <https://docs.pytest.org/en/latest/capture.html>`__. Here is how this is accomplished:

.. code-block:: python
@ -859,8 +897,8 @@ this is accomplished:
       assert msg in out
       assert msg in err
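For context, a self-contained sketch of the same idea with ``capsys`` (the helper functions and the message are
illustrative) might read:

.. code-block:: python

   import sys


   def print_to_stdout(s):
       print(s)


   def print_to_stderr(s):
       sys.stderr.write(s)


   def test_result_and_stdout(capsys):
       msg = "Hello"
       print_to_stdout(msg)
       print_to_stderr(msg)
       out, err = capsys.readouterr()  # consume the captured output streams
       # optionally replay the consumed streams:
       sys.stdout.write(out)
       sys.stderr.write(err)
       # test:
       assert msg in out
       assert msg in err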
And, of course, most of the time, ``stderr`` will come as a part of an exception, so try/except has to be used in such
a case:

.. code-block:: python
@ -892,16 +930,13 @@ Another approach to capturing stdout is via ``contextlib.redirect_stdout``:
       # test:
       assert msg in out

An important potential issue with capturing stdout is that it may contain ``\r`` characters that in normal ``print``
reset everything that has been printed so far. There is no problem with ``pytest``, but with ``pytest -s`` these
characters get included in the buffer, so to be able to have the test run with and without ``-s``, you have to make an
extra cleanup to the captured output, using ``re.sub(r'~.*\r', '', buf, 0, re.M)``.

But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has
some ``\r``'s in it or not, so it's a simple:

.. code-block:: python
@ -921,8 +956,7 @@ Here is a full test example:
       print(msg + final)
   assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
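Pieced together, the full example is likely along these lines (a sketch; the helper is assumed to live in
``transformers.testing_utils``):

.. code-block:: python

   from transformers.testing_utils import CaptureStdout

   msg = "Secret message\r"
   final = "Hello World"
   with CaptureStdout() as cs:
       print(msg + final)
   assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}"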
If you'd like to capture ``stderr`` use the :obj:`CaptureStderr` class instead:

.. code-block:: python
@ -931,8 +965,7 @@ instead:
       function_that_writes_to_stderr()
   print(cs.err)

If you need to capture both streams at once, use the parent :obj:`CaptureStd` class:

.. code-block:: python
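
   # sketch of such a snippet (the helper is assumed to be importable from transformers.testing_utils)
   import sys

   from transformers.testing_utils import CaptureStd

   with CaptureStd() as cs:
       print("to stdout")
       print("to stderr", file=sys.stderr)
   assert "to stdout" in cs.out
   assert "to stderr" in cs.err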
@ -964,7 +997,8 @@ If you need to validate the output of a logger, you can use :obj:`CaptureLogger`
Testing with environment variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to test the impact of environment variables for a specific test you can use a helper decorator
``transformers.testing_utils.mockenv``:

.. code-block:: python
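
   # sketch of a possible example; the env var name and value are illustrative
   import os

   from transformers.testing_utils import mockenv


   @mockenv(TRANSFORMERS_VERBOSITY="error")
   def test_env_override():
       assert os.getenv("TRANSFORMERS_VERBOSITY") == "error"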
@ -978,8 +1012,8 @@ If you want to test the impact of environment variables for a specific test you
Getting reproducible results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some situations you may want to remove randomness for your tests. To get identical reproducible results, you will
need to fix the seed:

.. code-block:: python
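
   # sketch: fix the seeds of the relevant random number generators
   import random

   import numpy as np
   import torch

   seed = 42

   random.seed(seed)                 # python RNG
   np.random.seed(seed)              # numpy RNG
   torch.manual_seed(seed)           # pytorch RNG (CPU)
   torch.cuda.manual_seed_all(seed)  # pytorch RNGs (all GPUs)

   # transformers also ships a convenience helper for this: transformers.set_seed(seed)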

View File

@ -1,12 +1,12 @@
Tokenizer summary
-----------------------------------------------------------------------------------------------------------------------

In this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
part is pretty straightforward; here we will focus on the first part. More specifically, we will look at the three main
kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
those.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
@ -16,8 +16,8 @@ Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it.
For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
this text is just to split it by spaces, which would give:

.. code-block::
@ -46,9 +46,8 @@ rule-based tokenizers. On the text above, they'd output something like:
Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
@ -69,9 +68,8 @@ decomposed as "annoying" and "ly". This is especially useful in agglutinative la
form (almost) arbitrarily long complex words by stringing together some subwords.

This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:

.. code-block::

@ -81,8 +79,8 @@ this:

   ['i', 'have', 'a', 'new', 'gp', '##u', '!']

Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The
"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).

Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
@ -106,9 +104,9 @@ Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`Roberta <model_doc/roberta>` use this for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` uses
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses Spacy and
ftfy), and counts the frequency of each word in the training corpus.
@ -148,10 +146,10 @@ represented as
   ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)

If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but 'mug' would be
tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
(since the base corpus uses all of them), but to special characters like emojis.
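As a toy illustration of applying learned merge rules to new words (a sketch using the example vocabulary above, not
the library's implementation):

.. code-block:: python

   base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
   merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # merge rules, in the order they were learned


   def toy_bpe_tokenize(word):
       # start from characters, mapping unknown characters to <unk>
       symbols = [c if c in base_vocab else "<unk>" for c in word]
       # replay the merge rules in order
       for a, b in merges:
           i = 0
           while i < len(symbols) - 1:
               if symbols[i] == a and symbols[i + 1] == b:
                   symbols[i : i + 2] = [a + b]
               else:
                   i += 1
       return symbols


   print(toy_bpe_tokenize("bug"))  # ['b', 'ug']
   print(toy_bpe_tokenize("mug"))  # ['<unk>', 'ug']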
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
@ -161,24 +159,24 @@ Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To deal with the fact that the base vocabulary needs to get all base characters, which can be quite big if one allows
for all unicode characters, the `GPT-2 paper
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
deal with punctuation, this manages to tokenize every text without needing an unknown token. For instance, the
:doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a
special end-of-text token and the symbols learned with 50,000 merges.

.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
given number of merge rules; the difference is that it doesn't choose the pair that is the most frequent but the one
that will maximize the likelihood on the corpus once merged.
What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
@ -198,10 +196,10 @@ with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
for each subword, evaluates how much the loss would increase if the subword was removed from the vocabulary. It then
sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
like BPE or WordPiece).

Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
@ -217,9 +215,9 @@ training corpus. You can then give a probability to each tokenization (which is
tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
of the tokenizations according to their probabilities).

Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
x_{N}` and if for the word :math:`x_{i}` we denote by :math:`S(x_{i})` the set of all possible tokenizations of
:math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

   \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
@ -236,8 +234,8 @@ SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.

That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
of them together and replace '▁' with space.

All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.

View File

@ -1,18 +1,14 @@
Training and fine-tuning
=======================================================================================================================

Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
standard training tools available in either framework. We will also show how to use our included
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.

This guide assumes that you are already familiar with loading and using our models for inference; otherwise, see the
:doc:`task summary <task_summary>`. We also assume that you are familiar with training deep neural networks in either
PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.

Sections:
@ -26,25 +22,19 @@ Sections:
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model classes in 🤗 Transformers that don't begin with ``TF`` are `PyTorch Modules
<https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_, meaning that you can use them just as you would any
model in PyTorch for both inference and optimization.

Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.
When we instantiate a model with :func:`~transformers.PreTrainedModel.from_pretrained`, the model configuration and
pre-trained weights of the specified model are used to initialize the model. The library also includes a number of
task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`` will create a BERT model instance
with encoder weights copied from the ``bert-base-uncased`` model and a randomly initialized sequence classification
head on top of the encoder with an output size of 2. Models are initialized in ``eval`` mode by default. We can call
``model.train()`` to put it in train mode.

.. code-block:: python
@ -52,20 +42,17 @@ put it in train mode.
   model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
   model.train()

This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever
sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the
:func:`~transformers.AdamW` optimizer which implements gradient bias correction as well as weight decay.

.. code-block:: python

   from transformers import AdamW
   optimizer = AdamW(model.parameters(), lr=1e-5)

The optimizer allows us to apply different hyperparameters for specific parameter groups. For example, we can apply
weight decay to all parameters other than bias and layer normalization terms:

.. code-block:: python
@ -75,11 +62,9 @@ other than bias and layer normalization terms:
       {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
   ]
   optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

Now we can set up a simple dummy training batch using :func:`~transformers.PreTrainedTokenizer.__call__`. This returns
a :func:`~transformers.BatchEncoding` instance which prepares everything we might need to pass to the model.

.. code-block:: python
@ -90,10 +75,9 @@ prepares everything we might need to pass to the model.
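
   # a possible setup for the ``encoding`` used below (a sketch; the texts are illustrative)
   from transformers import BertTokenizer

   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
   text_batch = ["I love Pixar.", "I don't care for Pixar."]
   encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)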
   input_ids = encoding['input_ids']
   attention_mask = encoding['attention_mask']

When we call a classification model with the ``labels`` argument, the first returned element is the Cross Entropy loss
between the predictions and the passed labels. Having already set up our optimizer, we can then do a backwards pass and
update the weights:

.. code-block:: python
@ -103,8 +87,8 @@ backwards pass and update the weights:
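
   # a sketch of the forward pass that produces ``loss`` below (the labels are illustrative)
   import torch

   labels = torch.tensor([1, 0]).unsqueeze(0)
   outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
   loss = outputs.loss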
   loss.backward()
   optimizer.step()

Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous
example:

.. code-block:: python
@ -115,12 +99,10 @@ The following is equivalent to the previous example:
   loss.backward()
   optimizer.step()

Of course, you can train on GPU by calling ``to('cuda')`` on the model and inputs as usual.

We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for
``num_warmup_steps`` and then linearly decays to 0 by the end of training.

.. code-block:: python
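
   # a sketch of setting up such a scheduler; the step counts are illustrative
   from transformers import get_linear_schedule_with_warmup

   num_warmup_steps = 100
   num_train_steps = 1000
   scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps,
                                               num_training_steps=num_train_steps)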
@ -135,19 +117,16 @@ Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.
   optimizer.step()
   scheduler.step()

We highly recommend using :func:`~transformers.Trainer`, discussed below, which conveniently handles the moving parts
of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.

Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------

In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the
weights of the head layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on the encoder
parameters, which can be accessed with the ``base_model`` submodule on any task-specific model in the library:

.. code-block:: python
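
   # a sketch of the step described above: freeze every parameter of the pre-trained encoder
   for param in model.base_model.parameters():
       param.requires_grad = False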
@ -160,10 +139,8 @@ submodule on any task-specific model in the library:
Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with
:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of the encoder from a pretrained model.

.. code-block:: python

@ -171,11 +148,9 @@ the encoder from a pretrained model.

   model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

Let's use ``tensorflow_datasets`` to load in the `MRPC dataset
<https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We can then use our built-in
:func:`~transformers.data.processors.glue.glue_convert_examples_to_features` to tokenize MRPC and convert it to a
TensorFlow ``Dataset`` object. Note that tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to
the pretrained tokenizer name.

.. code-block:: python
@ -197,8 +172,8 @@ The model can then be compiled and trained as any Keras model:
   model.compile(optimizer=optimizer, loss=loss)
   model.fit(train_dataset, epochs=2, steps_per_epoch=115)

With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):

.. code-block:: python
@ -212,12 +187,9 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We also provide a simple but feature-complete training and evaluation interface through :func:`~transformers.Trainer`
and :func:`~transformers.TFTrainer`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range
of training options and with built-in features like logging, gradient accumulation, and mixed precision.

.. code-block:: python
@ -264,21 +236,16 @@ precision.
       eval_dataset=tfds_test_dataset  # tensorflow_datasets evaluation dataset
   )

Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to evaluate. You can use your own module as
well, but the first argument returned from ``forward`` must be the loss which you wish to optimize.

:func:`~transformers.Trainer` uses a built-in default function to collate batches and prepare them to be fed into the
model. If needed, you can also use the ``data_collator`` argument to pass your own collator function which takes in the
data in the format provided by your dataset and returns a batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset objects from ``tensorflow_datasets``.

To calculate additional metrics in addition to the loss, you can also define your own ``compute_metrics`` function and
pass it to the trainer.

.. code-block:: python
@ -296,8 +263,8 @@ your own ``compute_metrics`` function and pass it to the trainer.
           'recall': recall
       }

Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
``logging_dir`` directory.

.. _additional-resources:
@ -308,11 +275,12 @@ Additional resources
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
  which uses ``Trainer`` for IMDb sentiment classification.
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_ including scripts for
  training and fine-tuning on GLUE, SQuAD, and several other tasks.
- `How to train a language model
  <https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_, a detailed
  colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for
  training and using 🤗 Transformers on a variety of tasks.

View File

@ -14,18 +14,19 @@ def swish(x):
def _gelu_python(x):
    """
    Original Implementation of the gelu activation function in Google Bert repo when initially created. For
    information: OpenAI GPT's gelu is slightly different (and gives slightly different results): 0.5 * x * (1 +
    torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in
    torch.nn.functional Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


def gelu_new(x):
    """
    Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT). Also see
    https://arxiv.org/abs/1606.08415
    """
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

View File

@ -4,11 +4,11 @@ import tensorflow as tf
def gelu(x):
    """
    Gaussian Error Linear Unit. Original Implementation of the gelu activation function in Google Bert repo when
    initially created. For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) Also see
    https://arxiv.org/abs/1606.08415
    """
    x = tf.convert_to_tensor(x)
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
@ -17,11 +17,12 @@ def gelu(x):
def gelu_new(x):
    """
    Gaussian Error Linear Unit. This is a smoother version of the GELU. Original paper: https://arxiv.org/abs/1606.08415

    Args:
        x: float Tensor to perform activation

    Returns:
        `x` with the GELU activation applied.
    """
View File
@ -46,8 +46,9 @@ class PyTorchBenchmarkArguments(BenchmarkArguments):
] ]
def __init__(self, **kwargs): def __init__(self, **kwargs):
"""This __init__ is there for legacy code. When removing """
deprecated args completely, the class can simply be deleted This __init__ is there for legacy code. When removing deprecated args completely, the class can simply be
deleted
""" """
for deprecated_arg in self.deprecated_args: for deprecated_arg in self.deprecated_args:
if deprecated_arg in kwargs: if deprecated_arg in kwargs:
View File
@ -43,8 +43,9 @@ class TensorFlowBenchmarkArguments(BenchmarkArguments):
] ]
def __init__(self, **kwargs): def __init__(self, **kwargs):
"""This __init__ is there for legacy code. When removing """
deprecated args completely, the class can simply be deleted This __init__ is there for legacy code. When removing deprecated args completely, the class can simply be
deleted
""" """
for deprecated_arg in self.deprecated_args: for deprecated_arg in self.deprecated_args:
if deprecated_arg in kwargs: if deprecated_arg in kwargs:
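Both benchmark argument classes keep this loop so that legacy ``no_*`` flags keep working. A simplified, self-contained illustration of the pattern (argument names and messages are illustrative, not the exact benchmark code):

.. code-block:: python

    import warnings

    DEPRECATED_ARGS = ["no_inference", "no_cuda", "no_tpu", "no_speed", "no_memory"]

    def remap_deprecated_kwargs(kwargs):
        """Turn e.g. ``no_inference=True`` into ``inference=False`` with a warning."""
        for deprecated_arg in DEPRECATED_ARGS:
            if deprecated_arg in kwargs:
                positive_arg = deprecated_arg[3:]  # strip the leading "no_"
                kwargs[positive_arg] = not kwargs.pop(deprecated_arg)
                warnings.warn(f"`{deprecated_arg}` is deprecated, use `{positive_arg}` instead.")
        return kwargs

    print(remap_deprecated_kwargs({"no_inference": True, "batch_sizes": [8]}))
    # -> {'batch_sizes': [8], 'inference': False}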
View File
@ -1,147 +1,145 @@
# coding=utf-8 # coding=utf-8
# Copyright 2018 The HuggingFace Inc. team. # Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # You may obtain a copy of the License at
# #
# http://www.apache.org/licenses/LICENSE-2.0 # http://www.apache.org/licenses/LICENSE-2.0
# #
# Unless required by applicable law or agreed to in writing, software # Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, # distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import dataclasses import dataclasses
import json import json
from dataclasses import dataclass, field from dataclasses import dataclass, field
from time import time from time import time
from typing import List from typing import List
from ..utils import logging from ..utils import logging
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
def list_field(default=None, metadata=None): def list_field(default=None, metadata=None):
return field(default_factory=lambda: default, metadata=metadata) return field(default_factory=lambda: default, metadata=metadata)
@dataclass @dataclass
class BenchmarkArguments: class BenchmarkArguments:
""" """
BenchMarkArguments are arguments we use in our benchmark scripts BenchMarkArguments are arguments we use in our benchmark scripts **which relate to the training loop itself**.
**which relate to the training loop itself**.
Using `HfArgumentParser` we can turn this class into argparse arguments to be able to specify them on the command
Using `HfArgumentParser` we can turn this class line.
into argparse arguments to be able to specify them on """
the command line.
""" models: List[str] = list_field(
default=[],
models: List[str] = list_field( metadata={
default=[], "help": "Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models"
metadata={ },
"help": "Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models" )
},
) batch_sizes: List[int] = list_field(
default=[8], metadata={"help": "List of batch sizes for which memory and time performance will be evaluated"}
batch_sizes: List[int] = list_field( )
default=[8], metadata={"help": "List of batch sizes for which memory and time performance will be evaluated"}
) sequence_lengths: List[int] = list_field(
default=[8, 32, 128, 512],
sequence_lengths: List[int] = list_field( metadata={"help": "List of sequence lengths for which memory and time performance will be evaluated"},
default=[8, 32, 128, 512], )
metadata={"help": "List of sequence lengths for which memory and time performance will be evaluated"},
) inference: bool = field(
default=True,
inference: bool = field( metadata={"help": "Whether to benchmark inference of model. Inference can be disabled via --no-inference."},
default=True, )
metadata={"help": "Whether to benchmark inference of model. Inference can be disabled via --no-inference."}, cuda: bool = field(
) default=True,
cuda: bool = field( metadata={"help": "Whether to run on available cuda devices. Cuda can be disabled via --no-cuda."},
default=True, )
metadata={"help": "Whether to run on available cuda devices. Cuda can be disabled via --no-cuda."}, tpu: bool = field(
) default=True, metadata={"help": "Whether to run on available tpu devices. TPU can be disabled via --no-tpu."}
tpu: bool = field( )
default=True, metadata={"help": "Whether to run on available tpu devices. TPU can be disabled via --no-tpu."} fp16: bool = field(default=False, metadata={"help": "Use FP16 to accelerate inference."})
) training: bool = field(default=False, metadata={"help": "Benchmark training of model"})
fp16: bool = field(default=False, metadata={"help": "Use FP16 to accelerate inference."}) verbose: bool = field(default=False, metadata={"help": "Verbose memory tracing"})
training: bool = field(default=False, metadata={"help": "Benchmark training of model"}) speed: bool = field(
verbose: bool = field(default=False, metadata={"help": "Verbose memory tracing"}) default=True,
speed: bool = field( metadata={"help": "Whether to perform speed measurements. Speed measurements can be disabled via --no-speed."},
default=True, )
metadata={"help": "Whether to perform speed measurements. Speed measurements can be disabled via --no-speed."}, memory: bool = field(
) default=True,
memory: bool = field( metadata={
default=True, "help": "Whether to perform memory measurements. Memory measurements can be disabled via --no-memory"
metadata={ },
"help": "Whether to perform memory measurements. Memory measurements can be disabled via --no-memory" )
}, trace_memory_line_by_line: bool = field(default=False, metadata={"help": "Trace memory line by line"})
) save_to_csv: bool = field(default=False, metadata={"help": "Save result to a CSV file"})
trace_memory_line_by_line: bool = field(default=False, metadata={"help": "Trace memory line by line"}) log_print: bool = field(default=False, metadata={"help": "Save all print statements in a log file"})
save_to_csv: bool = field(default=False, metadata={"help": "Save result to a CSV file"}) env_print: bool = field(default=False, metadata={"help": "Whether to print environment information"})
log_print: bool = field(default=False, metadata={"help": "Save all print statements in a log file"}) multi_process: bool = field(
env_print: bool = field(default=False, metadata={"help": "Whether to print environment information"}) default=True,
multi_process: bool = field( metadata={
default=True, "help": "Whether to use multiprocessing for memory and speed measurement. It is highly recommended to use multiprocessing for accurate CPU and GPU memory measurements. This option should only be disabled for debugging / testing and on TPU."
metadata={ },
"help": "Whether to use multiprocessing for memory and speed measurement. It is highly recommended to use multiprocessing for accurate CPU and GPU memory measurements. This option should only be disabled for debugging / testing and on TPU." )
}, inference_time_csv_file: str = field(
) default=f"inference_time_{round(time())}.csv",
inference_time_csv_file: str = field( metadata={"help": "CSV filename used if saving time results to csv."},
default=f"inference_time_{round(time())}.csv", )
metadata={"help": "CSV filename used if saving time results to csv."}, inference_memory_csv_file: str = field(
) default=f"inference_memory_{round(time())}.csv",
inference_memory_csv_file: str = field( metadata={"help": "CSV filename used if saving memory results to csv."},
default=f"inference_memory_{round(time())}.csv", )
metadata={"help": "CSV filename used if saving memory results to csv."}, train_time_csv_file: str = field(
) default=f"train_time_{round(time())}.csv",
train_time_csv_file: str = field( metadata={"help": "CSV filename used if saving time results to csv for training."},
default=f"train_time_{round(time())}.csv", )
metadata={"help": "CSV filename used if saving time results to csv for training."}, train_memory_csv_file: str = field(
) default=f"train_memory_{round(time())}.csv",
train_memory_csv_file: str = field( metadata={"help": "CSV filename used if saving memory results to csv for training."},
default=f"train_memory_{round(time())}.csv", )
metadata={"help": "CSV filename used if saving memory results to csv for training."}, env_info_csv_file: str = field(
) default=f"env_info_{round(time())}.csv",
env_info_csv_file: str = field( metadata={"help": "CSV filename used if saving environment information."},
default=f"env_info_{round(time())}.csv", )
metadata={"help": "CSV filename used if saving environment information."}, log_filename: str = field(
) default=f"log_{round(time())}.csv",
log_filename: str = field( metadata={"help": "Log filename used if print statements are saved in log."},
default=f"log_{round(time())}.csv", )
metadata={"help": "Log filename used if print statements are saved in log."}, repeat: int = field(default=3, metadata={"help": "Times an experiment will be run."})
) only_pretrain_model: bool = field(
repeat: int = field(default=3, metadata={"help": "Times an experiment will be run."}) default=False,
only_pretrain_model: bool = field( metadata={
default=False, "help": "Instead of loading the model as defined in `config.architectures` if exists, just load the pretrain model weights."
metadata={ },
"help": "Instead of loading the model as defined in `config.architectures` if exists, just load the pretrain model weights." )
},
) def to_json_string(self):
"""
def to_json_string(self): Serializes this instance to a JSON string.
""" """
Serializes this instance to a JSON string. return json.dumps(dataclasses.asdict(self), indent=2)
"""
return json.dumps(dataclasses.asdict(self), indent=2) @property
def model_names(self):
@property assert (
def model_names(self): len(self.models) > 0
assert ( ), "Please make sure you provide at least one model name / model identifier, *e.g.* `--models bert-base-cased` or `args.models = ['bert-base-cased']."
len(self.models) > 0 return self.models
), "Please make sure you provide at least one model name / model identifier, *e.g.* `--models bert-base-cased` or `args.models = ['bert-base-cased']."
return self.models @property
def do_multi_processing(self):
@property if not self.multi_process:
def do_multi_processing(self): return False
if not self.multi_process: elif self.is_tpu:
return False logger.info("Multiprocessing is currently not possible on TPU.")
elif self.is_tpu: return False
logger.info("Multiprocessing is currently not possible on TPU.") else:
return False return True
else:
return True
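The class docstring above notes that ``HfArgumentParser`` can expose these fields on the command line. A minimal sketch of that workflow, assuming a small launcher script (the flag values in the comment are illustrative):

.. code-block:: python

    from transformers import HfArgumentParser, PyTorchBenchmarkArguments

    # Invoked as e.g.:
    #   python run_benchmark.py --models bert-base-cased --batch_sizes 8 --sequence_lengths 32 128
    parser = HfArgumentParser(PyTorchBenchmarkArguments)
    benchmark_args = parser.parse_args_into_dataclasses()[0]

    print(benchmark_args.model_names)          # asserts that at least one model was given
    print(benchmark_args.do_multi_processing)  # False on TPU or when multiprocessing is disabled
    print(benchmark_args.to_json_string())     # the full configuration serialized to JSON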
File diff suppressed because it is too large

View File
@ -16,9 +16,9 @@ def convert_command_factory(args: Namespace):
) )
IMPORT_ERROR_MESSAGE = """transformers can only be used from the commandline to convert TensorFlow models in PyTorch, IMPORT_ERROR_MESSAGE = """
In that case, it requires TensorFlow to be installed. Please see transformers can only be used from the commandline to convert TensorFlow models in PyTorch, In that case, it requires
https://www.tensorflow.org/install/ for installation instructions. TensorFlow to be installed. Please see https://www.tensorflow.org/install/ for installation instructions.
""" """
View File
@ -164,9 +164,9 @@ class ServeCommand(BaseTransformersCLICommand):
def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)): def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):
""" """
Tokenize the provided input and eventually returns corresponding tokens id: Tokenize the provided input and eventually returns corresponding tokens id: - **text_input**: String to
- **text_input**: String to tokenize tokenize - **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer
- **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer mapping. mapping.
""" """
try: try:
tokens_txt = self._pipeline.tokenizer.tokenize(text_input) tokens_txt = self._pipeline.tokenizer.tokenize(text_input)
@ -187,10 +187,9 @@ class ServeCommand(BaseTransformersCLICommand):
cleanup_tokenization_spaces: bool = Body(True, embed=True), cleanup_tokenization_spaces: bool = Body(True, embed=True),
): ):
""" """
Detokenize the provided tokens ids to readable text: Detokenize the provided tokens ids to readable text: - **tokens_ids**: List of tokens ids -
- **tokens_ids**: List of tokens ids **skip_special_tokens**: Flag indicating to not try to decode special tokens - **cleanup_tokenization_spaces**:
- **skip_special_tokens**: Flag indicating to not try to decode special tokens Flag indicating to remove all leading/trailing spaces and intermediate ones.
- **cleanup_tokenization_spaces**: Flag indicating to remove all leading/trailing spaces and intermediate ones.
""" """
try: try:
decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces) decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)
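Outside of the REST server, the same tokenize/detokenize round trip can be reproduced directly with a tokenizer; the checkpoint and input string below are placeholders.

.. code-block:: python

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    tokens_txt = tokenizer.tokenize("Hello, world!")
    tokens_ids = tokenizer.convert_tokens_to_ids(tokens_txt)

    # Rough equivalent of the /detokenize endpoint documented above.
    decoded_str = tokenizer.decode(tokens_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(tokens_txt, tokens_ids, decoded_str)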
View File
@ -37,9 +37,8 @@ class AlbertConfig(PretrainedConfig):
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture. configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30000): vocab_size (:obj:`int`, `optional`, defaults to 30000):
@ -61,15 +60,15 @@ class AlbertConfig(PretrainedConfig):
inner_group_num (:obj:`int`, `optional`, defaults to 1): inner_group_num (:obj:`int`, `optional`, defaults to 1):
The number of inner repetition of attention and ffn. The number of inner repetition of attention and ffn.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something The maximum sequence length that this model might ever be used with. Typically set this to something large
large (e.g., 512 or 1024 or 2048). (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or
:class:`~transformers.TFAlbertModel`. :class:`~transformers.TFAlbertModel`.
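As with the other configuration classes touched by this commit, the documented arguments map directly onto keyword arguments at instantiation time; a small hedged example with arbitrary, deliberately tiny sizes:

.. code-block:: python

    from transformers import AlbertConfig, AlbertModel

    # Tiny sizes keep the randomly initialized model cheap; real checkpoints use the documented defaults.
    config = AlbertConfig(
        hidden_size=256,
        num_attention_heads=4,
        intermediate_size=512,
        hidden_act="gelu_new",
        hidden_dropout_prob=0.1,
    )
    model = AlbertModel(config)
    print(model.config.hidden_act)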
View File
@ -258,8 +258,8 @@ class AutoConfig:
r""" r"""
Instantiate one of the configuration classes of the library from a pretrained model configuration. Instantiate one of the configuration classes of the library from a pretrained model configuration.
The configuration class to instantiate is selected based on the :obj:`model_type` property of the config The configuration class to instantiate is selected based on the :obj:`model_type` property of the config object
object that is loaded, or when it's missing, by falling back to using pattern matching on that is loaded, or when it's missing, by falling back to using pattern matching on
:obj:`pretrained_model_name_or_path`: :obj:`pretrained_model_name_or_path`:
List options List options
@ -287,9 +287,8 @@ class AutoConfig:
Whether or not to delete incompletely received files. Will attempt to resume the download if such a Whether or not to delete incompletely received files. Will attempt to resume the download if such a
file exists. file exists.
proxies (:obj:`Dict[str, str]`, `optional`): proxies (:obj:`Dict[str, str]`, `optional`):
A dictionary of proxy servers to use by protocol or endpoint, e.g., A dictionary of proxy servers to use by protocol or endpoint, e.g., :obj:`{'http': 'foo.bar:3128',
:obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
request.
return_unused_kwargs (:obj:`bool`, `optional`, defaults to :obj:`False`): return_unused_kwargs (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`False`, then this function returns just the final configuration object. If :obj:`False`, then this function returns just the final configuration object.
@ -298,8 +297,8 @@ class AutoConfig:
the part of ``kwargs`` which has not been used to update ``config`` and is otherwise ignored. the part of ``kwargs`` which has not been used to update ``config`` and is otherwise ignored.
kwargs(additional keyword arguments, `optional`): kwargs(additional keyword arguments, `optional`):
The values in kwargs of any keys which are configuration attributes will be used to override the loaded The values in kwargs of any keys which are configuration attributes will be used to override the loaded
values. Behavior concerning key/value pairs whose keys are *not* configuration attributes is values. Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
controlled by the ``return_unused_kwargs`` keyword parameter. by the ``return_unused_kwargs`` keyword parameter.
Examples:: Examples::
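A hedged illustration of the loading behaviour described above, including ``return_unused_kwargs`` (the checkpoint name and the extra ``foo`` keyword are chosen for illustration):

.. code-block:: python

    from transformers import AutoConfig

    # The concrete class is selected from the `model_type` stored in the downloaded configuration.
    config = AutoConfig.from_pretrained("bert-base-cased")

    # Keys that are not configuration attributes are handed back when return_unused_kwargs=True.
    config, unused_kwargs = AutoConfig.from_pretrained(
        "bert-base-cased", output_attentions=True, foo=False, return_unused_kwargs=True
    )
    print(type(config).__name__, unused_kwargs)  # BertConfig {'foo': False}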
View File
@ -36,9 +36,8 @@ class BartConfig(PretrainedConfig):
This is the configuration class to store the configuration of a :class:`~transformers.BartModel`. It is used to This is the configuration class to store the configuration of a :class:`~transformers.BartModel`. It is used to
instantiate a BART model according to the specified arguments, defining the model architecture. instantiate a BART model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 50265): vocab_size (:obj:`int`, `optional`, defaults to 50265):
@ -59,8 +58,8 @@ class BartConfig(PretrainedConfig):
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@ -70,8 +69,8 @@ class BartConfig(PretrainedConfig):
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier. The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024): max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`): add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
@ -95,11 +94,11 @@ class BartConfig(PretrainedConfig):
bos_token_id (:obj:`int`, `optional`, defaults to 0) bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id. Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper The LayerDrop probability for the encoder. See the `LayerDrop paper <see
<see https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper The LayerDrop probability for the decoder. See the `LayerDrop paper <see
<see https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`. How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`.
num_labels: (:obj:`int`, `optional`, defaults to 3): num_labels: (:obj:`int`, `optional`, defaults to 3):
@ -107,8 +106,8 @@ class BartConfig(PretrainedConfig):
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model. Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`): force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
only :obj:`True` for `bart-large-cnn`. :obj:`True` for `bart-large-cnn`.
""" """
model_type = "bart" model_type = "bart"
View File
@ -51,13 +51,12 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig): class BertConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
:class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified :class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified arguments,
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
configuration to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture. to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
@ -74,15 +73,15 @@ class BertConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`. :class:`~transformers.TFBertModel`.
View File
@ -23,9 +23,8 @@ class BertGenerationConfig(PretrainedConfig):
:class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according to :class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according to
the specified arguments, defining the model architecture. the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 50358): vocab_size (:obj:`int`, `optional`, defaults to 50358):
@ -40,15 +39,15 @@ class BertGenerationConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
View File
@ -14,7 +14,10 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# LICENSE file in the root directory of this source tree. # LICENSE file in the root directory of this source tree.
"""BlenderbotConfig has the same signature as BartConfig. We only rewrite the signature in order to document blenderbot-90M defaults.""" """
BlenderbotConfig has the same signature as BartConfig. We only rewrite the signature in order to document
blenderbot-90M defaults.
"""
from .configuration_bart import BartConfig from .configuration_bart import BartConfig
@ -26,12 +29,12 @@ BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BlenderbotConfig(BartConfig): class BlenderbotConfig(BartConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.BlenderbotForConditionalGeneration`. This is the configuration class to store the configuration of a
It inherits from :class:`~transformers.BartConfig` and has the same signature with different defaults. :class:`~transformers.BlenderbotForConditionalGeneration`. It inherits from :class:`~transformers.BartConfig` and
has the same signature with different defaults.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 54944): vocab_size (:obj:`int`, `optional`, defaults to 54944):
@ -52,8 +55,8 @@ class BlenderbotConfig(BartConfig):
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@ -63,8 +66,8 @@ class BlenderbotConfig(BartConfig):
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier. The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`): add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
@ -88,11 +91,11 @@ class BlenderbotConfig(BartConfig):
bos_token_id (:obj:`int`, `optional`, defaults to 0) bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id. Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper The LayerDrop probability for the encoder. See the `LayerDrop paper <see
<see https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper The LayerDrop probability for the decoder. See the `LayerDrop paper <see
<see https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`. How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
View File
@ -30,8 +30,8 @@ CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class CamembertConfig(RobertaConfig): class CamembertConfig(RobertaConfig):
""" """
This class overrides :class:`~transformers.RobertaConfig`. Please check the This class overrides :class:`~transformers.RobertaConfig`. Please check the superclass for the appropriate
superclass for the appropriate documentation alongside usage examples. documentation alongside usage examples.
""" """
model_type = "camembert" model_type = "camembert"
View File
@ -26,13 +26,12 @@ CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.h
class CTRLConfig(PretrainedConfig): class CTRLConfig(PretrainedConfig):
""" """
This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a
:class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified :class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified arguments,
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
configuration to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce. to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 246534): vocab_size (:obj:`int`, `optional`, defaults to 246534):
@ -40,8 +39,8 @@ class CTRLConfig(PretrainedConfig):
:obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or :obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or
:class:`~transformers.TFCTRLModel`. :class:`~transformers.TFCTRLModel`.
n_positions (:obj:`int`, `optional`, defaults to 256): n_positions (:obj:`int`, `optional`, defaults to 256):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, `optional`, defaults to 256): n_ctx (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the causal mask (usually same as n_positions). Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, `optional`, defaults to 1280): n_embd (:obj:`int`, `optional`, defaults to 1280):
View File
@ -45,16 +45,16 @@ class DebertaConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"`, :obj:`"gelu"`, :obj:`"tanh"`, :obj:`"gelu_fast"`, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"`, :obj:`"gelu"`, :obj:`"tanh"`, :obj:`"gelu_fast"`,
:obj:`"mish"`, :obj:`"linear"`, :obj:`"sigmoid"` and :obj:`"gelu_new"` are supported. :obj:`"mish"`, :obj:`"linear"`, :obj:`"sigmoid"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.DebertaModel` or The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.DebertaModel` or
:class:`~transformers.TFDebertaModel`. :class:`~transformers.TFDebertaModel`.
@ -65,15 +65,15 @@ class DebertaConfig(PretrainedConfig):
relative_attention (:obj:`bool`, `optional`, defaults to :obj:`False`): relative_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether use relative position encoding. Whether use relative position encoding.
max_relative_positions (:obj:`int`, `optional`, defaults to 1): max_relative_positions (:obj:`int`, `optional`, defaults to 1):
The range of relative positions :obj:`[-max_position_embeddings, max_position_embeddings]`. The range of relative positions :obj:`[-max_position_embeddings, max_position_embeddings]`. Use the same
Use the same value as :obj:`max_position_embeddings`. value as :obj:`max_position_embeddings`.
pad_token_id (:obj:`int`, `optional`, defaults to 0): pad_token_id (:obj:`int`, `optional`, defaults to 0):
The value used to pad input_ids. The value used to pad input_ids.
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`True`): position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether add absolute position embedding to content embedding. Whether add absolute position embedding to content embedding.
pos_att_type (:obj:`List[str]`, `optional`): pos_att_type (:obj:`List[str]`, `optional`):
The type of relative position attention, it can be a combination of :obj:`["p2c", "c2p", "p2p"]`, The type of relative position attention, it can be a combination of :obj:`["p2c", "c2p", "p2p"]`, e.g.
e.g. :obj:`["p2c"]`, :obj:`["p2c", "c2p"]`, :obj:`["p2c", "c2p", 'p2p"]`. :obj:`["p2c"]`, :obj:`["p2c", "c2p"]`, :obj:`["p2c", "c2p", 'p2p"]`.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
""" """
View File
@ -36,21 +36,20 @@ class DistilBertConfig(PretrainedConfig):
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a
:class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified :class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the DistilBERT configuration to that of the DistilBERT `distilbert-base-uncased
`distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture. <https://huggingface.co/distilbert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by
:obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or the :obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or
:class:`~transformers.TFDistilBertModel`. :class:`~transformers.TFDistilBertModel`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`): sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings. Whether to use sinusoidal positional embeddings.
n_layers (:obj:`int`, `optional`, defaults to 6): n_layers (:obj:`int`, `optional`, defaults to 6):
@ -66,8 +65,8 @@ class DistilBertConfig(PretrainedConfig):
attention_dropout (:obj:`float`, `optional`, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (:obj:`float`, `optional`, defaults to 0.1): qa_dropout (:obj:`float`, `optional`, defaults to 0.1):
View File
@ -32,20 +32,19 @@ DPR_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DPRConfig(PretrainedConfig): class DPRConfig(PretrainedConfig):
r""" r"""
:class:`~transformers.DPRConfig` is the configuration class to store the configuration of a :class:`~transformers.DPRConfig` is the configuration class to store the configuration of a `DPRModel`.
`DPRModel`.
This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`, This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`,
:class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the :class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the
components of the DPR model. components of the DPR model.
This class is a subclass of :class:`~transformers.BertConfig`. Please check the This class is a subclass of :class:`~transformers.BertConfig`. Please check the superclass for the documentation of
superclass for the documentation of all kwargs. all kwargs.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the DPR model. Defines the different tokens that Vocabulary size of the DPR model. Defines the different tokens that can be represented by the `inputs_ids`
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`. passed to the forward method of :class:`~transformers.BertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768): hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12): num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
@ -55,15 +54,15 @@ class DPRConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`. The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
@ -73,8 +72,8 @@ class DPRConfig(PretrainedConfig):
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, use gradient checkpointing to save memory at the expense of slower backward pass. If True, use gradient checkpointing to save memory at the expense of slower backward pass.
projection_dim (:obj:`int`, `optional`, defaults to 0): projection_dim (:obj:`int`, `optional`, defaults to 0):
Dimension of the projection for the context and question encoders. Dimension of the projection for the context and question encoders. If it is set to zero (default), then no
If it is set to zero (default), then no projection is done. projection is done.
""" """
model_type = "dpr" model_type = "dpr"
View File
@ -36,12 +36,11 @@ class ElectraConfig(PretrainedConfig):
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
:class:`~transformers.TFElectraModel`. It is used to instantiate a ELECTRA model according to the specified :class:`~transformers.TFElectraModel`. It is used to instantiate a ELECTRA model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ELECTRA configuration to that of the ELECTRA `google/electra-small-discriminator
`google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ architecture. <https://huggingface.co/google/electra-small-discriminator>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
@ -60,15 +59,15 @@ class ElectraConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 1024): intermediate_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`. :class:`~transformers.TFElectraModel`.
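As an illustration of how this configuration class is typically used (a minimal sketch, assuming the ELECTRA classes are available in the installed version of 🤗 Transformers):

.. code-block::

    >>> from transformers import ElectraConfig, ElectraModel

    >>> # The defaults give a configuration similar to google/electra-small-discriminator
    >>> configuration = ElectraConfig()
    >>> model = ElectraModel(configuration)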


@ -29,9 +29,8 @@ class EncoderDecoderConfig(PretrainedConfig):
:class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the :class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
specified arguments, defining the encoder and decoder configs. specified arguments, defining the encoder and decoder configs.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
kwargs (`optional`): kwargs (`optional`):
@ -93,7 +92,8 @@ class EncoderDecoderConfig(PretrainedConfig):
cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig, **kwargs cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig, **kwargs
) -> PretrainedConfig: ) -> PretrainedConfig:
r""" r"""
Instantiate a :class:`~transformers.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model configuration and decoder model configuration. Instantiate a :class:`~transformers.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model
configuration and decoder model configuration.
Returns: Returns:
:class:`EncoderDecoderConfig`: An instance of a configuration object :class:`EncoderDecoderConfig`: An instance of a configuration object
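A minimal sketch of the classmethod described above; the choice of :class:`~transformers.BertConfig` for both sides is purely illustrative:

.. code-block::

    >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

    >>> # Compose an encoder config and a decoder config into a single EncoderDecoderConfig
    >>> config_encoder = BertConfig()
    >>> config_decoder = BertConfig()
    >>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    >>> model = EncoderDecoderModel(config=config)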


@ -34,20 +34,19 @@ class FlaubertConfig(XLMConfig):
:class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified :class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
arguments, defining the model architecture. arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`): pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to apply the layer normalization before or after the feed forward layer following the Whether to apply the layer normalization before or after the feed forward layer following the attention in
attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018) each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)
layerdrop (:obj:`float`, `optional`, defaults to 0.0): layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand with
with Structured Dropout. ICLR 2020) Structured Dropout. ICLR 2020)
vocab_size (:obj:`int`, `optional`, defaults to 30145): vocab_size (:obj:`int`, `optional`, defaults to 30145):
Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by
:obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or the :obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
:class:`~transformers.TFFlaubertModel`. :class:`~transformers.TFFlaubertModel`.
emb_dim (:obj:`int`, `optional`, defaults to 2048): emb_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
@ -56,8 +55,7 @@ class FlaubertConfig(XLMConfig):
n_head (:obj:`int`, `optional`, defaults to 16): n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for the attention mechanism The dropout probability for the attention mechanism
gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`): gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
@ -65,28 +63,25 @@ class FlaubertConfig(XLMConfig):
sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`): sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings. Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`bool`, `optional`, defaults to :obj:`False`): causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should behave in a causal manner. Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in
Causal models use a triangular attention mask in order to only attend to the left-side context instead order to only attend to the left-side context instead of a bidirectional context.
of a bidirectional context.
asm (:obj:`bool`, `optional`, defaults to :obj:`False`): asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer. layer.
n_langs (:obj:`int`, `optional`, defaults to 1): n_langs (:obj:`int`, `optional`, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models. The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`) use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`)
Whether to use language embeddings. Some models use additional language embeddings, see Whether to use language embeddings. Some models use additional language embeddings, see `the multilingual
`the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__ models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__ for
for information on how to use them. information on how to use them.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might The maximum sequence length that this model might ever be used with. Typically set this to something large
ever be used with. Typically set this to something large just in case just in case (e.g., 512 or 1024 or 2048).
(e.g., 512 or 1024 or 2048).
embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5): embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
The standard deviation of the truncated_normal_initializer for The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.
initializing the embedding matrices.
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the
initializing all weight matrices except the embedding matrices. embedding matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
bos_index (:obj:`int`, `optional`, defaults to 0): bos_index (:obj:`int`, `optional`, defaults to 0):
@ -134,8 +129,7 @@ class FlaubertConfig(XLMConfig):
mask_token_id (:obj:`int`, `optional`, defaults to 0): mask_token_id (:obj:`int`, `optional`, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context. Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, `optional`, defaults to 1): lang_id (:obj:`int`, `optional`, defaults to 1):
The ID of the language used by the model. This parameter is used when generating The ID of the language used by the model. This parameter is used when generating text in a given language.
text in a given language.
""" """
model_type = "flaubert" model_type = "flaubert"


@ -28,8 +28,7 @@ FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
class DecoderConfig(PretrainedConfig): class DecoderConfig(PretrainedConfig):
r""" r"""
Configuration class for FSMT's decoder specific things. Configuration class for FSMT's decoder specific things. note: this is a private helper class
note: this is a private helper class
""" """
model_type = "fsmt_decoder" model_type = "fsmt_decoder"
@ -44,9 +43,8 @@ class FSMTConfig(PretrainedConfig):
This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to
instantiate a FSMT model according to the specified arguments, defining the model architecture. instantiate a FSMT model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
langs (:obj:`List[str]`): langs (:obj:`List[str]`):
@ -72,8 +70,8 @@ class FSMTConfig(PretrainedConfig):
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`): activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@ -81,8 +79,8 @@ class FSMTConfig(PretrainedConfig):
activation_dropout (:obj:`float`, `optional`, defaults to 0.0): activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer. The dropout ratio for activations inside the fully connected layer.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024): max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`): scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
@ -104,14 +102,13 @@ class FSMTConfig(PretrainedConfig):
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`): tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to tie input and output embeddings. Whether to tie input and output embeddings.
num_beams (:obj:`int`, `optional`, defaults to 5) num_beams (:obj:`int`, `optional`, defaults to 5)
Number of beams for beam search that will be used by default in the :obj:`generate` method Number of beams for beam search that will be used by default in the :obj:`generate` method of the model. 1
of the model. 1 means no beam search. means no beam search.
length_penalty (:obj:`float`, `optional`, defaults to 1) length_penalty (:obj:`float`, `optional`, defaults to 1)
Exponential penalty to the length that will be used by default in the :obj:`generate` method Exponential penalty to the length that will be used by default in the :obj:`generate` method of the model.
of the model.
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`) early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop the beam
the beam search when at least ``num_beams`` sentences are finished per batch or not. search when at least ``num_beams`` sentences are finished per batch or not.
Examples:: Examples::
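As an illustrative sketch of the generation defaults described above (distinct from the docstring's own example, which is not shown in this hunk), with the documented values written out explicitly:

.. code-block::

    >>> from transformers import FSMTConfig

    >>> # num_beams, length_penalty and early_stopping become the defaults of generate()
    >>> config = FSMTConfig(num_beams=5, length_penalty=1.0, early_stopping=False)
    >>> config.num_beams
    5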


@ -42,9 +42,8 @@ class FunnelConfig(PretrainedConfig):
configuration to that of the Funnel Transformer `funnel-transformer/small configuration to that of the Funnel Transformer `funnel-transformer/small
<https://huggingface.co/funnel-transformer/small>`__ architecture. <https://huggingface.co/funnel-transformer/small>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
@ -66,8 +65,8 @@ class FunnelConfig(PretrainedConfig):
d_inner (:obj:`int`, `optional`, defaults to 3072): d_inner (:obj:`int`, `optional`, defaults to 3072):
Inner dimension in the feed-forward blocks. Inner dimension in the feed-forward blocks.
hidden_act (:obj:`str` or :obj:`callable`, `optional`, defaults to :obj:`"gelu_new"`): hidden_act (:obj:`str` or :obj:`callable`, `optional`, defaults to :obj:`"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
@ -75,8 +74,8 @@ class FunnelConfig(PretrainedConfig):
activation_dropout (:obj:`float`, `optional`, defaults to 0.0): activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout probability used between the two layers of the feed-forward blocks. The dropout probability used between the two layers of the feed-forward blocks.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 3): type_vocab_size (:obj:`int`, `optional`, defaults to 3):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`. :class:`~transformers.TFFunnelModel`.
@ -90,19 +89,17 @@ class FunnelConfig(PretrainedConfig):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-9): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-9):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
pooling_type (:obj:`str`, `optional`, defaults to :obj:`"mean"`): pooling_type (:obj:`str`, `optional`, defaults to :obj:`"mean"`):
Possible values are ``"mean"`` or ``"max"``. The way pooling is performed at the beginning of each Possible values are ``"mean"`` or ``"max"``. The way pooling is performed at the beginning of each block.
block.
attention_type (:obj:`str`, `optional`, defaults to :obj:`"relative_shift"`): attention_type (:obj:`str`, `optional`, defaults to :obj:`"relative_shift"`):
Possible values are ``"relative_shift"`` or ``"factorized"``. The former is faster on CPU/GPU while Possible values are ``"relative_shift"`` or ``"factorized"``. The former is faster on CPU/GPU while the
the latter is faster on TPU. latter is faster on TPU.
separate_cls (:obj:`bool`, `optional`, defaults to :obj:`True`): separate_cls (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to separate the cls token when applying pooling. Whether or not to separate the cls token when applying pooling.
truncate_seq (:obj:`bool`, `optional`, defaults to :obj:`False`): truncate_seq (:obj:`bool`, `optional`, defaults to :obj:`False`):
When using ``separate_cls``, whether or not to truncate the last token when pooling, to avoid getting When using ``separate_cls``, whether or not to truncate the last token when pooling, to avoid getting a
a sequence length that is not a multiple of 2. sequence length that is not a multiple of 2.
pool_q_only (:obj:`bool`, `optional`, defaults to :obj:`False`): pool_q_only (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to apply the pooling only to the query or to query, key and values for the attention Whether or not to apply the pooling only to the query or to query, key and values for the attention layers.
layers.
""" """
model_type = "funnel" model_type = "funnel"


@ -33,13 +33,12 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPT2Config(PretrainedConfig): class GPT2Config(PretrainedConfig):
""" """
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a
:class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified :class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified arguments,
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
configuration to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture. to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
@ -48,8 +47,8 @@ class GPT2Config(PretrainedConfig):
:obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or :obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or
:class:`~transformers.TFGPT2Model`. :class:`~transformers.TFGPT2Model`.
n_positions (:obj:`int`, `optional`, defaults to 1024): n_positions (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, `optional`, defaults to 1024): n_ctx (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the causal mask (usually same as n_positions). Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, `optional`, defaults to 768): n_embd (:obj:`int`, `optional`, defaults to 768):
@ -73,8 +72,8 @@ class GPT2Config(PretrainedConfig):
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (:obj:`string`, `optional`, defaults to :obj:`"cls_index"`): summary_type (:obj:`string`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models Argument used when doing sequence summary, used in the models :class:`~transformers.GPT2DoubleHeadsModel`
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`. and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Has to be one of the following options: Has to be one of the following options:
@ -84,8 +83,8 @@ class GPT2Config(PretrainedConfig):
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2). - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention. - :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`): summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models Argument used when doing sequence summary, used in the models :class:`~transformers.GPT2DoubleHeadsModel`
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`. and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether or not to add a projection after the vector extraction. Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`): summary_activation (:obj:`str`, `optional`):
@ -94,13 +93,13 @@ class GPT2Config(PretrainedConfig):
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation. Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`): summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models Argument used when doing sequence summary, used in the models :class:`~transformers.GPT2DoubleHeadsModel`
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`. and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes. Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1): summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary, used in the models Argument used when doing sequence summary, used in the models :class:`~transformers.GPT2DoubleHeadsModel`
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`. and :class:`~transformers.TFGPT2DoubleHeadsModel`.
The dropout ratio to be used after the projection and activation. The dropout ratio to be used after the projection and activation.
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
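To illustrate how the sequence-summary arguments above fit together, a minimal sketch using the documented defaults (they only take effect in the double-heads models):

.. code-block::

    >>> from transformers import GPT2Config, GPT2DoubleHeadsModel

    >>> configuration = GPT2Config(summary_type="cls_index", summary_use_proj=True, summary_first_dropout=0.1)
    >>> model = GPT2DoubleHeadsModel(configuration)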


@ -29,20 +29,19 @@ LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LayoutLMConfig(BertConfig): class LayoutLMConfig(BertConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.LayoutLMModel`. This is the configuration class to store the configuration of a :class:`~transformers.LayoutLMModel`. It is used to
It is used to instantiate a LayoutLM model according to the specified arguments, defining the model instantiate a LayoutLM model according to the specified arguments, defining the model architecture. Instantiating a
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of configuration with the defaults will yield a similar configuration to that of the LayoutLM `layoutlm-base-uncased
the LayoutLM `layoutlm-base-uncased <https://huggingface.co/microsoft/layoutlm-base-uncased>`__ architecture. <https://huggingface.co/microsoft/layoutlm-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.BertConfig` and can be used Configuration objects inherit from :class:`~transformers.BertConfig` and can be used to control the model outputs.
to control the model outputs. Read the documentation from :class:`~transformers.BertConfig` Read the documentation from :class:`~transformers.BertConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the LayoutLM model. Defines the number of different tokens that Vocabulary size of the LayoutLM model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.LayoutLMModel`. `inputs_ids` passed to the forward method of :class:`~transformers.LayoutLMModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768): hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12): num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
@ -52,15 +51,15 @@ class LayoutLMConfig(BertConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed into :class:`~transformers.LayoutLMModel`. The vocabulary size of the :obj:`token_type_ids` passed into :class:`~transformers.LayoutLMModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
@ -70,8 +69,8 @@ class LayoutLMConfig(BertConfig):
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, use gradient checkpointing to save memory at the expense of a slower backward pass. If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
max_2d_position_embeddings (:obj:`int`, `optional`, defaults to 1024): max_2d_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum value that the 2D position embedding might ever be used with. The maximum value that the 2D position embedding might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 1024). just in case (e.g., 1024).
Examples:: Examples::
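A minimal sketch of the two position-embedding sizes described above (distinct from the docstring's own example, which is not shown in this hunk), using the documented defaults:

.. code-block::

    >>> from transformers import LayoutLMConfig, LayoutLMModel

    >>> # max_position_embeddings bounds the 1D sequence length,
    >>> # max_2d_position_embeddings bounds the bounding-box coordinates
    >>> configuration = LayoutLMConfig(max_position_embeddings=512, max_2d_position_embeddings=1024)
    >>> model = LayoutLMModel(configuration)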


@ -37,19 +37,19 @@ class LongformerConfig(RobertaConfig):
:class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified :class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified
arguments, defining the model architecture. arguments, defining the model architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`. This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`. It is used
It is used to instantiate a Longformer model according to the specified arguments, defining the model to instantiate a Longformer model according to the specified arguments, defining the model architecture.
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of Instantiating a configuration with the defaults will yield a similar configuration to that of the RoBERTa
the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length 4,096. `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length 4,096.
The :class:`~transformers.LongformerConfig` class directly inherits :class:`~transformers.RobertaConfig`. The :class:`~transformers.LongformerConfig` class directly inherits :class:`~transformers.RobertaConfig`. It reuses
It reuses the same defaults. Please check the parent class for more information. the same defaults. Please check the parent class for more information.
Args: Args:
attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512): attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512):
Size of an attention window around each token. If an :obj:`int`, use the same size for all layers. Size of an attention window around each token. If an :obj:`int`, use the same size for all layers. To
To specify a different window size for each layer, use a :obj:`List[int]` where specify a different window size for each layer, use a :obj:`List[int]` where ``len(attention_window) ==
``len(attention_window) == num_hidden_layers``. num_hidden_layers``.
Example:: Example::
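A minimal sketch of the ``attention_window`` argument in both of its accepted forms (distinct from the docstring's own example, which is not shown here); the window sizes below are arbitrary:

.. code-block::

    >>> from transformers import LongformerConfig

    >>> # One window size shared by every layer
    >>> config = LongformerConfig(attention_window=512)
    >>> # Or one window size per layer; the list length must equal num_hidden_layers
    >>> config = LongformerConfig(attention_window=[64] * 12, num_hidden_layers=12)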


@ -32,9 +32,8 @@ class LxmertConfig(PretrainedConfig):
:class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified :class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified
arguments, defining the model architecture. arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
@ -55,15 +54,15 @@ class LxmertConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`. The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
@ -71,15 +70,14 @@ class LxmertConfig(PretrainedConfig):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
visual_feat_dim (:obj:`int`, `optional`, defaults to 2048): visual_feat_dim (:obj:`int`, `optional`, defaults to 2048):
This represents the last dimension of the pooled-object features used as input for the model, This represents the last dimension of the pooled-object features used as input for the model, representing
representing the size of each object feature itself. the size of each object feature itself.
visual_pos_dim (:obj:`int`, `optional`, defaults to 4): visual_pos_dim (:obj:`int`, `optional`, defaults to 4):
This represents the number of spatial features that are mixed into the visual features. This represents the number of spatial features that are mixed into the visual features. The default is set
The default is set to 4 because most commonly this will represent the location of a bounding box. to 4 because most commonly this will represent the location of a bounding box. i.e., (x, y, width, height)
i.e., (x, y, width, height)
visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15): visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15):
This represents the scaling factor by which each visual loss is multiplied if, during pretraining, This represents the scaling factor by which each visual loss is multiplied if, during pretraining, one
one decides to train with multiple vision-based loss objectives. decides to train with multiple vision-based loss objectives.
num_qa_labels (:obj:`int`, `optional`, defaults to 9500): num_qa_labels (:obj:`int`, `optional`, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than This represents the total number of different question answering (QA) labels there are. If using more than
one dataset with QA, the user will need to account for the total number of labels that all of the datasets one dataset with QA, the user will need to account for the total number of labels that all of the datasets
@ -91,8 +89,8 @@ class LxmertConfig(PretrainedConfig):
This represents the total number of semantically unique attributes that lxmert will be able to classify a This represents the total number of semantically unique attributes that lxmert will be able to classify a
pooled-object feature as possessing. pooled-object feature as possessing.
task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`): task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task is used for sentence-image matching. If the sentence correctly describes the image, the label This task is used for sentence-image matching. If the sentence correctly describes the image, the label will
will be 1. If the sentence does not correctly describe the image, the label will be 0. be 1. If the sentence does not correctly describe the image, the label will be 0.
task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`): task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss
objective. objective.
@ -108,8 +106,8 @@ class LxmertConfig(PretrainedConfig):
visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`): visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the feature-regression loss objective Whether or not to calculate the feature-regression loss objective
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`): output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the attentions from the vision, language, and cross-modality Whether or not the model should return the attentions from the vision, language, and cross-modality
layers. layers.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`): output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the hidden states from the vision, language, and cross-modality Whether or not the model should return the hidden states from the vision, language, and cross-modality
layers. layers.
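As a minimal sketch of how the visual-feature arguments above are supplied (the values shown are the documented defaults):

.. code-block::

    >>> from transformers import LxmertConfig, LxmertModel

    >>> # visual_feat_dim and visual_pos_dim must match the pooled-object features fed to the model
    >>> configuration = LxmertConfig(visual_feat_dim=2048, visual_pos_dim=4, num_qa_labels=9500)
    >>> model = LxmertModel(configuration)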


@ -27,9 +27,8 @@ class MarianConfig(BartConfig):
This is the configuration class to store the configuration of a :class:`~transformers.MarianMTModel`. It is used to This is the configuration class to store the configuration of a :class:`~transformers.MarianMTModel`. It is used to
instantiate a Marian model according to the specified arguments, defining the model architecture. instantiate a Marian model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 58101): vocab_size (:obj:`int`, `optional`, defaults to 58101):
@ -50,8 +49,8 @@ class MarianConfig(BartConfig):
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@ -61,8 +60,8 @@ class MarianConfig(BartConfig):
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier. The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`): add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
@ -84,11 +83,11 @@ class MarianConfig(BartConfig):
bos_token_id (:obj:`int`, `optional`, defaults to 0) bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id. Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details. <https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details. <https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. How many extra learned positional embeddings to use.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
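A minimal sketch of the LayerDrop arguments described above (0.0, the default, disables LayerDrop; 0.1 is an arbitrary non-zero example):

.. code-block::

    >>> from transformers import MarianConfig

    >>> # Randomly skip roughly 10% of encoder and decoder layers during training
    >>> config = MarianConfig(encoder_layerdrop=0.1, decoder_layerdrop=0.1)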


@ -29,12 +29,11 @@ MBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MBartConfig(BartConfig): class MBartConfig(BartConfig):
""" """
This is the configuration class to store the configuration of a This is the configuration class to store the configuration of a
:class:`~transformers.MBartForConditionalGeneration`. It is used to :class:`~transformers.MBartForConditionalGeneration`. It is used to instantiate a BART model according to the
instantiate a BART model according to the specified arguments, defining the model architecture. specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 250027): vocab_size (:obj:`int`, `optional`, defaults to 250027):
@ -55,8 +54,8 @@ class MBartConfig(BartConfig):
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler. If string,
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@ -66,8 +65,8 @@ class MBartConfig(BartConfig):
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier. The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024): max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with. Typically set this to something large
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`): add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
@ -89,11 +88,11 @@ class MBartConfig(BartConfig):
bos_token_id (:obj:`int`, `optional`, defaults to 0) bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id. Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details. <https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details. <https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be equal to :obj:`pad_token_id+1`. How many extra learned positional embeddings to use. Should be equal to :obj:`pad_token_id+1`.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
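To make the constraint on ``extra_pos_embeddings`` explicit (it should equal ``pad_token_id + 1``), a minimal sketch that spells out the documented default relationship:

.. code-block::

    >>> from transformers import MBartConfig

    >>> config = MBartConfig(pad_token_id=1, extra_pos_embeddings=2)
    >>> assert config.extra_pos_embeddings == config.pad_token_id + 1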

View File

@ -29,9 +29,8 @@ class MobileBertConfig(PretrainedConfig):
:class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the specified :class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the specified
arguments, defining the model architecture. arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
@ -48,15 +47,15 @@ class MobileBertConfig(PretrainedConfig):
intermediate_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.MobileBertModel`
or :class:`~transformers.TFMobileBertModel`.
@ -88,18 +87,14 @@ class MobileBertConfig(PretrainedConfig):
>>> from transformers import MobileBertModel, MobileBertConfig
>>> # Initializing a MobileBERT configuration
>>> configuration = MobileBertConfig()
>>> # Initializing a model from the configuration above
>>> model = MobileBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Attributes:
pretrained_config_archive_map (Dict[str, str]): A dictionary containing all the available pre-trained
checkpoints.
"""
pretrained_config_archive_map = MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
model_type = "mobilebert"


@ -33,9 +33,8 @@ class OpenAIGPTConfig(PretrainedConfig):
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 40478):
@ -43,8 +42,8 @@ class OpenAIGPTConfig(PretrainedConfig):
:obj:`inputs_ids` passed when calling :class:`~transformers.OpenAIGPTModel` or
:class:`~transformers.TFOpenAIGPTModel`.
n_positions (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, `optional`, defaults to 768):
@ -54,8 +53,8 @@ class OpenAIGPTConfig(PretrainedConfig):
n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
afn (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
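The hunks above only list the arguments; a short sketch of how they would typically be passed, assuming the standard ``transformers`` import path, is shown below (values are illustrative only).

.. code-block::

    >>> from transformers import OpenAIGPTConfig

    >>> # Override a few of the documented defaults (illustrative values only)
    >>> configuration = OpenAIGPTConfig(n_positions=512, n_embd=768, n_head=12, afn="gelu", resid_pdrop=0.1)
    >>> configuration.n_ctx
    512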


@ -68,12 +68,11 @@ task_specific_params = {
class PegasusConfig(BartConfig):
"""
This is the configuration class to store the configuration of a
:class:`~transformers.PegasusForConditionalGeneration`. It is used to instantiate a Pegasus model according to the
specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 96103):
@ -94,8 +93,8 @@ class PegasusConfig(BartConfig):
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the encoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@ -105,8 +104,8 @@ class PegasusConfig(BartConfig):
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
@ -128,11 +127,11 @@ class PegasusConfig(BartConfig):
bos_token_id (:obj:`int`, `optional`, defaults to 0):
Beginning of stream token id.
encoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be ``pad_token_id+1`` for Bart.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
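Since :class:`~transformers.PegasusConfig` subclasses ``BartConfig``, the Bart-style arguments documented above can be passed directly at construction time; a minimal sketch with illustrative values follows.

.. code-block::

    >>> from transformers import PegasusConfig

    >>> # Values are illustrative; all keyword arguments appear in the docstring above
    >>> configuration = PegasusConfig(max_position_embeddings=1024, activation_function="gelu", encoder_layerdrop=0.0)
    >>> configuration.is_encoder_decoder
    True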


@ -28,22 +28,21 @@ PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ProphetNetConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ProphetNetModel`. It is used
to instantiate a ProphetNet model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
activation_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for activations inside the fully connected layer.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the ProphetNet model. Defines the number of different tokens that can be represented by
the :obj:`inputs_ids` passed when calling :class:`~transformers.ProphetNetModel`.
hidden_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
@ -63,8 +62,8 @@ class ProphetNetConfig(PretrainedConfig):
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_cross_attention (:obj:`bool`, `optional`, defaults to :obj:`True`):
@ -78,21 +77,19 @@ class ProphetNetConfig(PretrainedConfig):
eos_token_id (:obj:`int`, `optional`, defaults to 2):
End of stream token id.
ngram (:obj:`int`, `optional`, defaults to 2):
Number of future tokens to predict. Set to 1 to be the same as a traditional language model, predicting only
the next first token.
num_buckets (:obj:`int`, `optional`, defaults to 32):
The number of buckets to use for each attention layer. This is for relative position calculation. See the
`T5 paper <https://arxiv.org/abs/1910.10683>`__ for more details.
relative_max_distance (:obj:`int`, `optional`, defaults to 128):
Relative distances greater than this number will be put into the last same bucket. This is for relative
position calculation. See the `T5 paper <https://arxiv.org/abs/1910.10683>`__ for more details.
disable_ngram_loss (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether the model should be trained to predict only the next first token.
eps (:obj:`float`, `optional`, defaults to 0.0):
Controls the ``epsilon`` parameter value for label smoothing in the loss calculation. If set to 0, no label
smoothing is performed.
"""
model_type = "prophetnet"


@ -21,16 +21,17 @@ from .file_utils import add_start_docstrings
RAG_CONFIG_DOC = r"""
:class:`~transformers.RagConfig` stores the configuration of a `RagModel`. Configuration objects inherit from
:class:`~transformers.PretrainedConfig` and can be used to control the model outputs. Read the documentation from
:class:`~transformers.PretrainedConfig` for more information.
Args:
title_sep (:obj:`str`, `optional`, defaults to ``" / "``):
Separator inserted between the title and the text of the retrieved document when calling
:class:`~transformers.RagRetriever`.
doc_sep (:obj:`str`, `optional`, defaults to ``" // "``):
Separator inserted between the text of the retrieved document and the original input when calling
:class:`~transformers.RagRetriever`.
n_docs (:obj:`int`, `optional`, defaults to 5):
Number of documents to retrieve.
max_combined_length (:obj:`int`, `optional`, defaults to 300):
@ -41,8 +42,8 @@ RAG_CONFIG_DOC = r"""
Retrieval batch size, defined as the number of queries issued concurrently to the faiss index encapsulated
by :class:`~transformers.RagRetriever`.
dataset (:obj:`str`, `optional`, defaults to :obj:`"wiki_dpr"`):
A dataset identifier of the indexed dataset in HuggingFace Datasets (list all available datasets and ids
using :obj:`datasets.list_datasets()`).
dataset_split (:obj:`str`, `optional`, defaults to :obj:`"train"`):
Which split of the :obj:`dataset` to load.
index_name (:obj:`str`, `optional`, defaults to :obj:`"compressed"`):
@ -59,13 +60,13 @@ RAG_CONFIG_DOC = r"""
Only relevant if ``return_loss`` is set to :obj:`True`. Controls the ``epsilon`` parameter value for label
smoothing in the loss calculation. If set to 0, no label smoothing is performed.
do_marginalize (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, the logits are marginalized over all documents by making use of
``torch.nn.functional.log_softmax``.
reduce_loss (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to reduce the NLL loss using the ``torch.Tensor.sum`` operation.
do_deduplication (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to deduplicate the generations from different context documents for a given input. Has to be
set to :obj:`False` if used while training with a distributed backend.
exclude_bos_score (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to disregard the BOS token when computing the loss.
output_retrieved (:obj:`bool`, `optional`, defaults to :obj:`False`):
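Any of the retrieval options listed in this hunk can also be overridden when loading a saved configuration via :meth:`~transformers.PretrainedConfig.from_pretrained`. A minimal sketch follows; the checkpoint identifier ``facebook/rag-token-nq`` is an assumption used purely for illustration and does not come from this diff.

.. code-block::

    >>> from transformers import RagConfig

    >>> # "facebook/rag-token-nq" is an assumed checkpoint name; the keyword arguments override the stored values
    >>> configuration = RagConfig.from_pretrained("facebook/rag-token-nq", n_docs=10, do_deduplication=False)
    >>> configuration.n_docs
    10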
@ -160,7 +161,8 @@ class RagConfig(PretrainedConfig):
cls, question_encoder_config: PretrainedConfig, generator_config: PretrainedConfig, **kwargs
) -> PretrainedConfig:
r"""
Instantiate a :class:`~transformers.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model
configuration and decoder model configuration.
Returns:
:class:`EncoderDecoderConfig`: An instance of a configuration object
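The classmethod whose docstring is reformatted above (named ``RagConfig.from_question_encoder_generator_configs`` in the library, to the best of my knowledge) composes a RAG configuration from two sub-configurations. A minimal sketch, assuming DPR as the question encoder and BART as the generator:

.. code-block::

    >>> from transformers import BartConfig, DPRConfig, RagConfig

    >>> question_encoder_config = DPRConfig()
    >>> generator_config = BartConfig()
    >>> # Extra keyword arguments such as n_docs are forwarded to the resulting RagConfig
    >>> rag_config = RagConfig.from_question_encoder_generator_configs(
    ...     question_encoder_config, generator_config, n_docs=5
    ... )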
@ -169,7 +171,8 @@ class RagConfig(PretrainedConfig):
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Overrides the default
:meth:`~transformers.PretrainedConfig.to_dict`.
Returns:
:obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
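A short sketch of the overridden serialization described above; that the returned dictionary also carries the nested sub-configurations is my reading of the override, not something stated in this hunk.

.. code-block::

    >>> from transformers import BartConfig, DPRConfig, RagConfig

    >>> rag_config = RagConfig.from_question_encoder_generator_configs(DPRConfig(), BartConfig())
    >>> config_dict = rag_config.to_dict()
    >>> # Plain Python dict, presumably including the nested question encoder and generator configurations
    >>> isinstance(config_dict, dict)
    True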

Some files were not shown because too many files have changed in this diff.