Convert tutorials (#14665)

* Convert a few docs

* And another

* Last tutorials

* New syntax for colab links

Sylvain Gugger 2021-12-08 13:19:46 -05:00 committed by GitHub
parent 0f4e39c559
commit cf36f4d7a8
18 changed files with 3608 additions and 3836 deletions

docs/source/benchmarks.mdx (new file)
@@ -0,0 +1,347 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Benchmarks
[[open-in-colab]]
Let's take a look at how 🤗 Transformer models can be benchmarked, along with best practices and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found [here](https://github.com/huggingface/transformers/tree/master/notebooks/05-benchmark.ipynb).
## How to benchmark 🤗 Transformer models
The classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] allow you to flexibly benchmark 🤗 Transformer models. The benchmark classes let us measure the _peak memory usage_ and _required time_ for both _inference_ and _training_.
<Tip>
Here, _inference_ is defined as a single forward pass, and _training_ as a single forward pass and
backward pass.
</Tip>
The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an object of type [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`], respectively, for instantiation. [`PyTorchBenchmarkArguments`] and [`TensorFlowBenchmarkArguments`] are data classes and contain all relevant configurations for their corresponding benchmark class. The following example shows how a BERT model of type _bert-base-uncased_ can be benchmarked.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments
>>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = PyTorchBenchmark(args)
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments
>>> args = TensorFlowBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = TensorFlowBenchmark(args)
```
Here, three arguments are given to the benchmark argument data classes, namely `models`, `batch_sizes`, and
`sequence_lengths`. The argument `models` is required and expects a `list` of model identifiers from the
[model hub](https://huggingface.co/models). The `list` arguments `batch_sizes` and `sequence_lengths` define
the size of the `input_ids` on which the model is benchmarked. There are many more parameters that can be configured
via the benchmark argument data classes. For more detail on these, one can directly consult the files
`src/transformers/benchmark/benchmark_args_utils.py`, `src/transformers/benchmark/benchmark_args.py` (for PyTorch)
and `src/transformers/benchmark/benchmark_args_tf.py` (for TensorFlow). Alternatively, running the following shell
commands from the repository root will print out a descriptive list of all configurable parameters for PyTorch and TensorFlow,
respectively.
```bash
python examples/pytorch/benchmarking/run_benchmark.py --help
===PT-TF-SPLIT===
python examples/tensorflow/benchmarking/run_benchmark_tf.py --help
```
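For instance, beyond `models`, `batch_sizes`, and `sequence_lengths`, the argument data classes expose flags such as `training` and `fp16`. The snippet below is a sketch based on those data classes; confirm the exact parameter names with the `--help` output above for your version.

```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

>>> # Benchmark training (forward + backward pass) in half precision as well as inference;
>>> # `training` and `fp16` are assumed from the argument data classes and may vary by version.
>>> args = PyTorchBenchmarkArguments(
...     models=["bert-base-uncased"],
...     batch_sizes=[8],
...     sequence_lengths=[8, 32],
...     training=True,
...     fp16=True,
... )
>>> benchmark = PyTorchBenchmark(args)
```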
An instantiated benchmark object can then simply be run by calling `benchmark.run()`.
```py
>>> results = benchmark.run()
>>> print(results)
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-uncased 8 8 0.006
bert-base-uncased 8 32 0.006
bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base-uncased 8 8 1227
bert-base-uncased 8 32 1281
bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.4.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 08:58:43.371351
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
===PT-TF-SPLIT===
>>> results = benchmark.run()
>>> print(results)
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-uncased 8 8 0.005
bert-base-uncased 8 32 0.008
bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base-uncased 8 8 1330
bert-base-uncased 8 32 1330
bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:26:35.617317
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
```
By default, the _time_ and the _required memory_ for _inference_ are benchmarked. In the example output above, the first
two sections show the results corresponding to _inference time_ and _inference memory_. In addition, all relevant
information about the computing environment, _e.g._ the GPU type, the system, the library versions, etc., is printed
out in the third section under _ENVIRONMENT INFORMATION_. This information can optionally be saved in a _.csv_ file
by adding the argument `save_to_csv=True` to [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`] respectively. In this case, every section is saved in a separate
_.csv_ file. The path to each _.csv_ file can optionally be defined via the argument data classes.
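For example, to write the results to disk you might configure the paths as below. This is a sketch: `save_to_csv` is the documented flag, while the specific file-path arguments (`inference_time_csv_file`, `inference_memory_csv_file`, `env_info_csv_file`) are taken from the argument data classes and should be verified against `benchmark_args_utils.py` for your version.

```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

>>> # Save each result section to its own CSV file; the path argument names below are
>>> # assumptions based on the argument data classes and may differ between versions.
>>> args = PyTorchBenchmarkArguments(
...     models=["bert-base-uncased"],
...     batch_sizes=[8],
...     sequence_lengths=[8, 32],
...     save_to_csv=True,
...     inference_time_csv_file="inference_time.csv",
...     inference_memory_csv_file="inference_memory.csv",
...     env_info_csv_file="env_info.csv",
... )
>>> results = PyTorchBenchmark(args).run()
```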
Instead of benchmarking pre-trained models via their model identifier, _e.g._ `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a `list` of
configurations must be passed along with the benchmark args as follows.
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig
>>> args = PyTorchBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
>>> benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
>>> benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base 8 8 0.006
bert-base 8 32 0.006
bert-base 8 128 0.018
bert-base 8 512 0.088
bert-384-hid 8 8 0.006
bert-384-hid 8 32 0.006
bert-384-hid 8 128 0.011
bert-384-hid 8 512 0.054
bert-6-lay 8 8 0.003
bert-6-lay 8 32 0.004
bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base 8 8 1277
bert-base 8 32 1281
bert-base 8 128 1307
bert-base 8 512 1539
bert-384-hid 8 8 1005
bert-384-hid 8 32 1027
bert-384-hid 8 128 1035
bert-384-hid 8 512 1255
bert-6-lay 8 8 1097
bert-6-lay 8 32 1101
bert-6-lay 8 128 1127
bert-6-lay 8 512 1359
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.4.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:35:25.143267
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig
>>> args = TensorFlowBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
>>> benchmark = TensorFlowBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
>>> benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base 8 8 0.005
bert-base 8 32 0.008
bert-base 8 128 0.022
bert-base 8 512 0.106
bert-384-hid 8 8 0.005
bert-384-hid 8 32 0.007
bert-384-hid 8 128 0.018
bert-384-hid 8 512 0.064
bert-6-lay 8 8 0.002
bert-6-lay 8 32 0.003
bert-6-lay 8 128 0.011
bert-6-lay 8 512 0.074
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
--------------------------------------------------------------------------------
bert-base 8 8 1330
bert-base 8 32 1330
bert-base 8 128 1330
bert-base 8 512 1770
bert-384-hid 8 8 1330
bert-384-hid 8 32 1330
bert-384-hid 8 128 1330
bert-384-hid 8 512 1540
bert-6-lay 8 8 1330
bert-6-lay 8 32 1330
bert-6-lay 8 128 1330
bert-6-lay 8 512 1540
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-06-29
- time: 09:38:15.487125
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32088
- use_gpu: True
- num_gpus: 1
- gpu: TITAN RTX
- gpu_ram_mb: 24217
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
```
Again, _inference time_ and _required memory_ for _inference_ are measured, but this time for customized configurations
of the `BertModel` class. This feature can be especially helpful when deciding which configuration the model
should be trained with.
## Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
  specifies on which device the code should be run by setting the `CUDA_VISIBLE_DEVICES` environment variable in the
  shell, _e.g._ `export CUDA_VISIBLE_DEVICES=0`, before running the code (a Python alternative is sketched after this list).
- The option `no_multi_processing` should only be set to `True` for testing and debugging. To ensure accurate
  memory measurement, it is recommended to run each memory benchmark in a separate process, which is the case as long
  as `no_multi_processing` is left set to `False`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
  heavily between different GPU devices, library versions, etc., so benchmark results on their own are not very
  useful for the community.
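If you prefer to pin the device from Python instead of the shell, a common pattern is to set the environment variable before anything initializes CUDA. This is a general sketch, not part of the benchmark API itself:

```py
>>> import os

>>> # Must be set before torch/transformers initialize CUDA, otherwise it is ignored.
>>> os.environ["CUDA_VISIBLE_DEVICES"] = "0"

>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

>>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8])
>>> benchmark = PyTorchBenchmark(args)
```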
## Sharing your benchmark
Previously, all available core models (10 at the time) were benchmarked for _inference time_ across many different
settings: using PyTorch, with and without TorchScript, and using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2) and the results are
available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).
With the new _benchmark_ tools, it is easier than ever to share your benchmark results with the community:
- [PyTorch Benchmarking Results](https://github.com/huggingface/transformers/tree/master/examples/pytorch/benchmarking/README.md).
- [TensorFlow Benchmarking Results](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/benchmarking/README.md).

@@ -1,363 +0,0 @@ (deleted reStructuredText source file)

@@ -0,0 +1,681 @@ (new MDX file)
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# How to fine-tune a model for common downstream tasks
[[open-in-colab]]
This guide will show you how to fine-tune 🤗 Transformers models for common downstream tasks. You will use the 🤗
Datasets library to quickly load and preprocess the datasets, getting them ready for training with PyTorch and
TensorFlow.
Before you begin, make sure you have the 🤗 Datasets library installed. For more detailed installation instructions,
refer to the 🤗 Datasets [installation page](https://huggingface.co/docs/datasets/installation.html). All of the
examples in this guide will use 🤗 Datasets to load and preprocess a dataset.
```bash
pip install datasets
```
Learn how to fine-tune a model for:
- [Sequence classification with IMDb reviews](#seq_imdb)
- [Token classification with WNUT emerging entities](#tok_ner)
- [Question Answering with SQuAD](#qa_squad)
<a id='seq_imdb'></a>
## Sequence classification with IMDb reviews
Sequence classification refers to the task of classifying sequences of text according to a given number of classes. In
this example, learn how to fine-tune a model on the [IMDb dataset](https://huggingface.co/datasets/imdb) to determine
whether a review is positive or negative.
<Tip>
For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb).
</Tip>
### Load IMDb dataset
The 🤗 Datasets library makes it simple to load a dataset:
```python
from datasets import load_dataset
imdb = load_dataset("imdb")
```
This loads a `DatasetDict` object which you can index into to view an example:
```python
imdb["train"][0]
{'label': 1,
'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
}
```
### Preprocess
The next step is to tokenize the text into a format the model can understand. It is important to load the same tokenizer the
model was trained with to ensure the words are tokenized appropriately. Load the DistilBERT tokenizer with the
[`AutoTokenizer`] because we will eventually train a classifier using a pretrained [DistilBERT](https://huggingface.co/distilbert-base-uncased) model:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
Now that you have instantiated a tokenizer, create a function that will tokenize the text. You should also truncate
longer sequences in the text to be no longer than the model's maximum input length:
```python
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True)
```
Use 🤗 Datasets `map` function to apply the preprocessing function to the entire dataset. You can also set
`batched=True` to apply the preprocessing function to multiple elements of the dataset at once for faster
preprocessing:
```python
tokenized_imdb = imdb.map(preprocess_function, batched=True)
```
Lastly, pad your text so the sequences are a uniform length. While it is possible to pad your text in the `tokenizer` function
by setting `padding=True`, it is more efficient to only pad the text to the length of the longest element in its
batch. This is known as **dynamic padding**. You can do this with the `DataCollatorWithPadding` function:
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
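To see the effect of dynamic padding, you can optionally collate a handful of tokenized examples yourself. This check is an illustration added here, not part of the original tutorial, and assumes PyTorch is installed since the collator returns PyTorch tensors by default:

```python
# Collate four tokenized reviews; the collator pads them to the length of the
# longest sequence in this batch rather than to a fixed maximum length.
features = [
    {k: v for k, v in tokenized_imdb["train"][i].items() if k in ["input_ids", "attention_mask", "label"]}
    for i in range(4)
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (4, length of the longest of these four reviews)
```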
### Fine-tune with the Trainer API
Now load your model with the [`AutoModelForSequenceClassification`] class along with the number of expected labels:
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
At this point, only three steps remain:
1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to a [`Trainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`Trainer.train()`] to fine-tune your model.
```python
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir='./results',
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_imdb["train"],
eval_dataset=tokenized_imdb["test"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
```
### Fine-tune with TensorFlow
Fine-tuning with TensorFlow is just as easy, with only a few differences.
Start by batching the processed examples together with dynamic padding using the [`DataCollatorWithPadding`] function.
Make sure you set `return_tensors="tf"` to return `tf.Tensor` outputs instead of PyTorch tensors!
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
```
Next, convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`. Specify inputs and labels in the
`columns` argument:
```python
tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_dataset = tokenized_imdb["test"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
```
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
```python
from transformers import create_optimizer
import tensorflow as tf
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps
)
```
Load your model with the [`TFAutoModelForSequenceClassification`] class along with the number of expected labels:
```python
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
Compile the model:
```python
import tensorflow as tf
model.compile(optimizer=optimizer)
```
Finally, fine-tune the model by calling `model.fit`:
```python
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
)
```
<a id='tok_ner'></a>
## Token classification with WNUT emerging entities
Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this example, learn how to fine-tune a model on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities.
<Tip>
For a more in-depth example of how to fine-tune a model for token classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb).
</Tip>
### Load WNUT 17 dataset
Load the WNUT 17 dataset from the 🤗 Datasets library:
```python
from datasets import load_dataset
wnut = load_dataset("wnut_17")
```
A quick look at the dataset shows the labels associated with each word in the sentence:
```python
wnut["train"][0]
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
```
View the specific NER tags by:
```python
label_list = wnut["train"].features["ner_tags"].feature.names
label_list
['O',
'B-corporation',
'I-corporation',
'B-creative-work',
'I-creative-work',
'B-group',
'I-group',
'B-location',
'I-location',
'B-person',
'I-person',
'B-product',
'I-product'
]
```
A letter prefixes each NER tag, which can mean:
- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (e.g., the `State` token is a part of an entity like
`Empire State Building`).
- `O` indicates the token doesn't correspond to any entity (see the short example after this list).
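For instance, you can map the integer `ner_tags` of the first training example back to these names with `label_list` (a small illustration added here, not part of the original guide):

```python
# Convert each integer tag id in the first example to its string name
[label_list[tag] for tag in wnut["train"][0]["ner_tags"]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location', 'O', 'B-location', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```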
### Preprocess
Now you need to tokenize the text. Load the DistilBERT tokenizer with an [`AutoTokenizer`]:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
Since the input has already been split into words, set `is_split_into_words=True` to tokenize the words into
subwords:
```python
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```
The addition of the special tokens `[CLS]` and `[SEP]` and subword tokenization creates a mismatch between the
input and labels. Realign the labels and tokens by:
1. Mapping all tokens to their corresponding word with the `word_ids` method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so the PyTorch loss function ignores
them.
3. Only labeling the first token of a given word. Assign `-100` to the other subtokens from the same word.
Here is how you can create a function that will realign the labels and tokens:
```python
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # Set the special tokens to -100.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:  # Assign -100 to the other subtokens of the same word.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
Now tokenize and align the labels over the entire dataset with 🤗 Datasets `map` function:
```python
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
```
Finally, pad your text and labels, so they are a uniform length:
```python
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
```
### Fine-tune with the Trainer API
Load your model with the [`AutoModelForTokenClassification`] class along with the number of expected labels:
```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```
Gather your training arguments in [`TrainingArguments`]:
```python
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
```
Collect your model, training arguments, dataset, data collator, and tokenizer in [`Trainer`]:
```python
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_wnut["train"],
eval_dataset=tokenized_wnut["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
```
Fine-tune your model:
```python
trainer.train()
```
### Fine-tune with TensorFlow
Batch your examples together and pad your text and labels, so they are a uniform length:
```python
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
```
Convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`:
```python
tf_train_set = tokenized_wnut["train"].to_tf_dataset(
columns=["attention_mask", "input_ids", "labels"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
columns=["attention_mask", "input_ids", "labels"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
```
Load the model with the [`TFAutoModelForTokenClassification`] class along with the number of expected labels:
```python
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
```python
from transformers import create_optimizer
batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
init_lr=2e-5,
num_train_steps=num_train_steps,
weight_decay_rate=0.01,
num_warmup_steps=0,
)
```
Compile the model:
```python
import tensorflow as tf
model.compile(optimizer=optimizer)
```
Call `model.fit` to fine-tune your model:
```python
model.fit(
tf_train_set,
validation_data=tf_validation_set,
epochs=num_train_epochs,
)
```
<a id='qa_squad'></a>
## Question Answering with SQuAD
There are many types of question answering (QA) tasks. Extractive QA focuses on identifying the answer from the text
given a question. In this example, learn how to fine-tune a model on the [SQuAD](https://huggingface.co/datasets/squad) dataset.
<Tip>
For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering-tf.ipynb).
</Tip>
### Load SQuAD dataset
Load the SQuAD dataset from the 🤗 Datasets library:
```python
from datasets import load_dataset
squad = load_dataset("squad")
```
Take a look at an example from the dataset:
```python
squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'title': 'University_of_Notre_Dame'
}
```
### Preprocess
Load the DistilBERT tokenizer with an [`AutoTokenizer`]:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
There are a few things to be aware of when preprocessing text for question answering:
1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. You
   can deal with this by truncating only the `context`, setting `truncation="only_second"`.
2. Next, you need to map the start and end positions of the answer to the original context. Set
   `return_offsets_mapping=True` to handle this.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to
   find which part of the offsets corresponds to the question and which part corresponds to the context.
Assemble everything in a preprocessing function as shown below:
```python
def preprocess_function(examples):
questions = [q.strip() for q in examples["question"]]
inputs = tokenizer(
questions,
examples["context"],
max_length=384,
truncation="only_second",
return_offsets_mapping=True,
padding="max_length",
)
offset_mapping = inputs.pop("offset_mapping")
answers = examples["answers"]
start_positions = []
end_positions = []
for i, offset in enumerate(offset_mapping):
answer = answers[i]
start_char = answer["answer_start"][0]
end_char = answer["answer_start"][0] + len(answer["text"][0])
sequence_ids = inputs.sequence_ids(i)
# Find the start and end of the context
idx = 0
while sequence_ids[idx] != 1:
idx += 1
context_start = idx
while sequence_ids[idx] == 1:
idx += 1
context_end = idx - 1
# If the answer is not fully inside the context, label it (0, 0)
if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
start_positions.append(0)
end_positions.append(0)
else:
# Otherwise it's the start and end token positions
idx = context_start
while idx <= context_end and offset[idx][0] <= start_char:
idx += 1
start_positions.append(idx - 1)
idx = context_end
while idx >= context_start and offset[idx][1] >= end_char:
idx -= 1
end_positions.append(idx + 1)
inputs["start_positions"] = start_positions
inputs["end_positions"] = end_positions
return inputs
```
Apply the preprocessing function over the entire dataset with 🤗 Datasets `map` function:
```python
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
```
Batch the processed examples together:
```python
from transformers import default_data_collator
data_collator = default_data_collator
```
### Fine-tune with the Trainer API
Load your model with the [`AutoModelForQuestionAnswering`] class:
```python
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```
Gather your training arguments in [`TrainingArguments`]:
```python
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
```
Collect your model, training arguments, dataset, data collator, and tokenizer in [`Trainer`]:
```python
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_squad["train"],
eval_dataset=tokenized_squad["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
```
Fine-tune your model:
```python
trainer.train()
```
### Fine-tune with TensorFlow
Batch the processed examples together with a TensorFlow default data collator:
```python
from transformers.data.data_collator import tf_default_data_collator
data_collator = tf_default_data_collator
```
Convert your datasets to the `tf.data.Dataset` format with the `to_tf_dataset` function:
```python
tf_train_set = tokenized_squad["train"].to_tf_dataset(
columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
dummy_labels=True,
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
dummy_labels=True,
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
```
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
```python
from transformers import create_optimizer
batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps,
)
```
Load your model with the [`TFAutoModelForQuestionAnswering`] class:
```python
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```
Compile the model:
```python
import tensorflow as tf
model.compile(optimizer=optimizer)
```
Call `model.fit` to fine-tune the model:
```python
model.fit(
tf_train_set,
validation_data=tf_validation_set,
    epochs=num_epochs,
)
```

@@ -1,699 +0,0 @@ (deleted reStructuredText source file)
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
How to fine-tune a model for common downstream tasks
=======================================================================================================================
This guide will show you how to fine-tune 🤗 Transformers models for common downstream tasks. You will use the 🤗
Datasets library to quickly load and preprocess the datasets, getting them ready for training with PyTorch and
TensorFlow.
Before you begin, make sure you have the 🤗 Datasets library installed. For more detailed installation instructions,
refer to the 🤗 Datasets `installation page <https://huggingface.co/docs/datasets/installation.html>`_. All of the
examples in this guide will use 🤗 Datasets to load and preprocess a dataset.
.. code-block:: bash
pip install datasets
Learn how to fine-tune a model for:
- :ref:`seq_imdb`
- :ref:`tok_ner`
- :ref:`qa_squad`
.. _seq_imdb:
Sequence classification with IMDb reviews
-----------------------------------------------------------------------------------------------------------------------
Sequence classification refers to the task of classifying sequences of text according to a given number of classes. In
this example, learn how to fine-tune a model on the `IMDb dataset <https://huggingface.co/datasets/imdb>`_ to determine
whether a review is positive or negative.
.. note::
For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
`PyTorch notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb>`__
or `TensorFlow notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb>`__.
Load IMDb dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The 🤗 Datasets library makes it simple to load a dataset:
.. code-block:: python
from datasets import load_dataset
imdb = load_dataset("imdb")
This loads a ``DatasetDict`` object which you can index into to view an example:
.. code-block:: python
imdb["train"][0]
{'label': 1,
'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
}
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The next step is to tokenize the text into a readable format by the model. It is important to load the same tokenizer a
model was trained with to ensure appropriately tokenized words. Load the DistilBERT tokenizer with the
:class:`~transformers.AutoTokenizer` because we will eventually train a classifier using a pretrained `DistilBERT
<https://huggingface.co/distilbert-base-uncased>`_ model:
.. code-block:: python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Now that you have instantiated a tokenizer, create a function that will tokenize the text. You should also truncate
longer sequences in the text to be no longer than the model's maximum input length:
.. code-block:: python
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True)
Use 🤗 Datasets ``map`` function to apply the preprocessing function to the entire dataset. You can also set
``batched=True`` to apply the preprocessing function to multiple elements of the dataset at once for faster
preprocessing:
.. code-block:: python
tokenized_imdb = imdb.map(preprocess_function, batched=True)
Lastly, pad your text so they are a uniform length. While it is possible to pad your text in the ``tokenizer`` function
by setting ``padding=True``, it is more efficient to only pad the text to the length of the longest element in its
batch. This is known as **dynamic padding**. You can do this with the ``DataCollatorWithPadding`` function:
.. code-block:: python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now load your model with the :class:`~transformers.AutoModel` class along with the number of expected labels:
.. code-block:: python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
At this point, only three steps remain:
1. Define your training hyperparameters in :class:`~transformers.TrainingArguments`.
2. Pass the training arguments to a :class:`~transformers.Trainer` along with the model, dataset, tokenizer, and data
collator.
3. Call ``trainer.train`` to fine-tune your model.
.. code-block:: python
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir='./results',
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_imdb["train"],
eval_dataset=tokenized_imdb["test"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
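Once training finishes, you can, for example, compute the evaluation loss on the test split and save the checkpoint. This is only a minimal sketch; the output directory name is arbitrary, and ``evaluate`` only reports the loss unless you also pass a ``compute_metrics`` function to the ``Trainer``:

.. code-block:: python

    metrics = trainer.evaluate()              # runs evaluation on eval_dataset (the IMDb test split here)
    trainer.save_model("./imdb-distilbert")   # saves the model and tokenizer to this directory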
Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fine-tuning with TensorFlow is just as easy, with only a few differences.
Start by batching the processed examples together with dynamic padding using the ``DataCollatorWithPadding`` function.
Make sure you set ``return_tensors="tf"`` to return ``tf.Tensor`` outputs instead of PyTorch tensors!
.. code-block:: python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
Next, convert your datasets to the ``tf.data.Dataset`` format with ``to_tf_dataset``. Specify inputs and labels in the
``columns`` argument:
.. code-block:: python
tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_dataset = tokenized_imdb["test"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
.. code-block:: python
from transformers import create_optimizer
import tensorflow as tf
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps
)
Load your model with the :class:`~transformers.TFAutoModelForSequenceClassification` class along with the number of expected labels:
.. code-block:: python
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
Compile the model:
.. code-block:: python
import tensorflow as tf
model.compile(optimizer=optimizer)
Finally, fine-tune the model by calling ``model.fit``:
.. code-block:: python
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
)
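After training, you can score a new review with the fine-tuned model. This is only a rough sketch with a made-up example sentence:

.. code-block:: python

    import tensorflow as tf

    inputs = tokenizer("This movie was surprisingly good!", return_tensors="tf")
    logits = model(inputs).logits                               # shape (1, 2)
    predicted_class = int(tf.math.argmax(logits, axis=-1)[0])   # 0 = negative, 1 = positive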
.. _tok_ner:
Token classification with WNUT emerging entities
-----------------------------------------------------------------------------------------------------------------------
Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this example, learn how to fine-tune a model on the `WNUT 17
<https://huggingface.co/datasets/wnut_17>`_ dataset to detect new entities.
.. note::
For a more in-depth example of how to fine-tune a model for token classification, take a look at the corresponding
`PyTorch notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb>`__
or `TensorFlow notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb>`__.
Load WNUT 17 dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Load the WNUT 17 dataset from the 🤗 Datasets library:
.. code-block:: python
from datasets import load_dataset
wnut = load_dataset("wnut_17")
A quick look at the dataset shows the labels associated with each word in the sentence:
.. code-block:: python
wnut["train"][0]
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
View the specific NER tags by:
.. code-block:: python
label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
['O',
'B-corporation',
'I-corporation',
'B-creative-work',
'I-creative-work',
'B-group',
'I-group',
'B-location',
'I-location',
'B-person',
'I-person',
'B-product',
'I-product'
]
Each NER tag is prefixed with a letter that indicates the token's position in an entity (a short example follows this list):
* ``B-`` indicates the beginning of an entity.
* ``I-`` indicates a token is contained inside the same entity (e.g., the ``State`` token is a part of an entity like
``Empire State Building``).
* ``O`` indicates the token doesn't correspond to any entity.
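To see these prefixes in context, map the integer tags of the example above back to their names. This is only a small illustrative snippet, reusing the ``wnut`` dataset and ``label_list`` defined earlier:

.. code-block:: python

    # Convert the integer ner_tags of the first training example to their label names.
    [label_list[tag] for tag in wnut["train"][0]["ner_tags"]]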
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now you need to tokenize the text. Load the DistilBERT tokenizer with an :class:`~transformers.AutoTokenizer`:
.. code-block:: python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Since the input has already been split into words, set ``is_split_into_words=True`` to tokenize the words into
subwords:
.. code-block:: python
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
The addition of the special tokens ``[CLS]`` and ``[SEP]`` and subword tokenization creates a mismatch between the
input and labels. Realign the labels and tokens by:
1. Mapping all tokens to their corresponding word with the ``word_ids`` method.
2. Assigning the label ``-100`` to the special tokens ``[CLS]`` and ``[SEP]`` so the PyTorch loss function ignores
them.
3. Only labeling the first token of a given word. Assign ``-100`` to the other subtokens from the same word.
Here is how you can create a function that will realign the labels and tokens:
.. code-block:: python
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
labels = []
for i, label in enumerate(examples[f"ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
previous_word_idx = None
label_ids = []
for word_idx in word_ids: # Set the special tokens to -100.
if word_idx is None:
label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:  # Assign -100 to the other subtokens of the same word.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
Now tokenize and align the labels over the entire dataset with 🤗 Datasets ``map`` function:
.. code-block:: python
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
Finally, pad your text and labels, so they are a uniform length:
.. code-block:: python
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Load your model with the :class:`~transformers.AutoModelForTokenClassification` class along with the number of expected labels:
.. code-block:: python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
Gather your training arguments in :class:`~transformers.TrainingArguments`:
.. code-block:: python
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
Collect your model, training arguments, dataset, data collator, and tokenizer in :class:`~transformers.Trainer`:
.. code-block:: python
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_wnut["train"],
eval_dataset=tokenized_wnut["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
Fine-tune your model:
.. code-block:: python
trainer.train()
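As a quick check after training, you can predict on the test split and map the predicted ids back to tag names, skipping the positions labeled ``-100``. This is only a sketch of the usual pattern:

.. code-block:: python

    import numpy as np

    predictions, labels, _ = trainer.predict(tokenized_wnut["test"])
    predicted_ids = np.argmax(predictions, axis=-1)
    # Keep only the positions that were actually labeled (ignore the -100 entries).
    true_predictions = [
        [label_list[p] for (p, l) in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predicted_ids, labels)
    ]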
Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Batch your examples together and pad your text and labels, so they are a uniform length:
.. code-block:: python
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
Convert your datasets to the ``tf.data.Dataset`` format with ``to_tf_dataset``:
.. code-block:: python
tf_train_set = tokenized_wnut["train"].to_tf_dataset(
columns=["attention_mask", "input_ids", "labels"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
columns=["attention_mask", "input_ids", "labels"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
Load the model with the :class:`~transformers.TFAutoModelForTokenClassification` class along with the number of expected labels:
.. code-block:: python
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
.. code-block:: python
from transformers import create_optimizer
batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
init_lr=2e-5,
num_train_steps=num_train_steps,
weight_decay_rate=0.01,
num_warmup_steps=0,
)
Compile the model:
.. code-block:: python
import tensorflow as tf
model.compile(optimizer=optimizer)
Call ``model.fit`` to fine-tune your model:
.. code-block:: python
model.fit(
tf_train_set,
validation_data=tf_validation_set,
epochs=num_train_epochs,
)
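To try the fine-tuned TensorFlow model on a single pre-tokenized sentence, you can do something like the sketch below. Note that the predicted tags still include the special tokens, which you would filter out in practice:

.. code-block:: python

    import tensorflow as tf

    encoded = tokenizer(wnut["test"][0]["tokens"], is_split_into_words=True, return_tensors="tf")
    logits = model(encoded).logits
    predicted_ids = tf.math.argmax(logits, axis=-1)[0]
    predicted_tags = [label_list[int(i)] for i in predicted_ids]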
.. _qa_squad:
Question Answering with SQuAD
-----------------------------------------------------------------------------------------------------------------------
There are many types of question answering (QA) tasks. Extractive QA focuses on identifying the answer from the text
given a question. In this example, learn how to fine-tune a model on the `SQuAD
<https://huggingface.co/datasets/squad>`_ dataset.
.. note::
For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
`PyTorch notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb>`__
or `TensorFlow notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering-tf.ipynb>`__.
Load SQuAD dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Load the SQuAD dataset from the 🤗 Datasets library:
.. code-block:: python
from datasets import load_dataset
squad = load_dataset("squad")
Take a look at an example from the dataset:
.. code-block:: python
squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'title': 'University_of_Notre_Dame'
}
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Load the DistilBERT tokenizer with an :class:`~transformers.AutoTokenizer`:
.. code-block:: python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
There are a few things to be aware of when preprocessing text for question answering:
1. Some examples in a dataset may have a very long ``context`` that exceeds the maximum input length of the model. You
can deal with this by truncating only the ``context`` by setting ``truncation="only_second"``.
2. Next, you need to map the start and end positions of the answer to the original context. Set
``return_offsets_mapping=True`` to handle this.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the ``sequence_ids`` method to
find which part of the offset corresponds to the question, and which part of the offset corresponds to the context.
Assemble everything in a preprocessing function as shown below:
.. code-block:: python
def preprocess_function(examples):
questions = [q.strip() for q in examples["question"]]
inputs = tokenizer(
questions,
examples["context"],
max_length=384,
truncation="only_second",
return_offsets_mapping=True,
padding="max_length",
)
offset_mapping = inputs.pop("offset_mapping")
answers = examples["answers"]
start_positions = []
end_positions = []
for i, offset in enumerate(offset_mapping):
answer = answers[i]
start_char = answer["answer_start"][0]
end_char = answer["answer_start"][0] + len(answer["text"][0])
sequence_ids = inputs.sequence_ids(i)
# Find the start and end of the context
idx = 0
while sequence_ids[idx] != 1:
idx += 1
context_start = idx
while sequence_ids[idx] == 1:
idx += 1
context_end = idx - 1
# If the answer is not fully inside the context, label it (0, 0)
if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
start_positions.append(0)
end_positions.append(0)
else:
# Otherwise it's the start and end token positions
idx = context_start
while idx <= context_end and offset[idx][0] <= start_char:
idx += 1
start_positions.append(idx - 1)
idx = context_end
while idx >= context_start and offset[idx][1] >= end_char:
idx -= 1
end_positions.append(idx + 1)
inputs["start_positions"] = start_positions
inputs["end_positions"] = end_positions
return inputs
Apply the preprocessing function over the entire dataset with 🤗 Datasets ``map`` function:
.. code-block:: python
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
Batch the processed examples together:
.. code-block:: python
from transformers import default_data_collator
data_collator = default_data_collator
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Load your model with the :class:`~transformers.AutoModelForQuestionAnswering` class:
.. code-block:: python
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
Gather your training arguments in :class:`~transformers.TrainingArguments`:
.. code-block:: python
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
Collect your model, training arguments, dataset, data collator, and tokenizer in :class:`~transformers.Trainer`:
.. code-block:: python
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_squad["train"],
eval_dataset=tokenized_squad["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
Fine-tune your model:
.. code-block:: python
trainer.train()
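After training, you can extract an answer span by taking the most likely start and end positions from the model's logits. The sketch below reuses the first training example purely for illustration:

.. code-block:: python

    import torch

    example = squad["train"][0]
    inputs = tokenizer(example["question"], example["context"], return_tensors="pt")
    inputs = inputs.to(model.device)  # put the inputs on the same device as the model
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(torch.argmax(outputs.start_logits))
    end = int(torch.argmax(outputs.end_logits))
    answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])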
Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Batch the processed examples together with a TensorFlow default data collator:
.. code-block:: python
from transformers.data.data_collator import tf_default_data_collator
data_collator = tf_default_data_collator
Convert your datasets to the ``tf.data.Dataset`` format with the ``to_tf_dataset`` function:
.. code-block:: python
tf_train_set = tokenized_squad["train"].to_tf_dataset(
columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
dummy_labels=True,
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
dummy_labels=True,
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
.. code-block:: python
from transformers import create_optimizer
batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps,
)
Load your model with the :class:`~transformers.TFAutoModelForQuestionAnswering` class:
.. code-block:: python
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
Compile the model:
.. code-block:: python
import tensorflow as tf
model.compile(optimizer=optimizer)
Call ``model.fit`` to fine-tune the model:
.. code-block:: python
model.fit(
tf_train_set,
validation_data=tf_validation_set,
    epochs=num_epochs,
)
@ -0,0 +1,118 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Multi-lingual models
[[open-in-colab]]
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have different mechanisms than mono-lingual models. This page details the usage of these
models.
## XLM
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
be split into two categories: the checkpoints that make use of language embeddings and those that don't.
### XLM & Language Embeddings
This section concerns the following checkpoints:
- `xlm-mlm-ende-1024` (Masked language modeling, English-German)
- `xlm-mlm-enfr-1024` (Masked language modeling, English-French)
- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + Translation, XNLI languages)
- `xlm-clm-enfr-1024` (Causal language modeling, English-French)
- `xlm-clm-ende-1024` (Causal language modeling, English-German)
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the `lang2id` and `id2lang` attributes from
the tokenizer.
Here is an example using the `xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French):
```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel
>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
```
The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
`lang2id` attribute:
```py
>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}
```
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
```py
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
```
We should now define the language embedding by using the previously defined language id. We want to create a tensor
filled with the appropriate language ids, of the same size as `input_ids`. For English, the id is 0:
```py
>>> language_id = tokenizer.lang2id['en'] # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
```
You can then feed it all as input to your model:
```py
>>> outputs = model(input_ids, langs=langs)
```
The example [run_generation.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-generation/run_generation.py) can generate text
using the CLM checkpoints from XLM, using the language embeddings.
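As a minimal illustration of the same idea, you could greedily pick the most likely next token from the model output above (a sketch, not a full generation loop):

```py
>>> next_token_logits = outputs.logits[:, -1, :]  # logits for the last position
>>> next_token_id = int(torch.argmax(next_token_logits, dim=-1))
>>> tokenizer.decode([next_token_id])
```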
### XLM without Language Embeddings
This section concerns the following checkpoints:
- `xlm-mlm-17-1280` (Masked language modeling, 17 languages)
- `xlm-mlm-100-1280` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. Unlike the previously mentioned XLM checkpoints,
these models are used for generic sentence representations.
## BERT
BERT has two checkpoints that can be used for multi-lingual tasks:
- `bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages)
- `bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
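For example, you could probe one of these checkpoints on a French sentence with the `fill-mask` pipeline. This is only an illustrative sketch; the checkpoint is not fine-tuned for any downstream task:

```py
>>> from transformers import pipeline

>>> unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")
>>> unmasker("Paris est la [MASK] de la France.")
```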
## XLM-RoBERTa
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
- `xlm-roberta-base` (Masked language modeling, 100 languages)
- `xlm-roberta-large` (Masked language modeling, 100 languages)
@ -1,141 +0,0 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Multi-lingual models
=======================================================================================================================
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have different mechanisms than mono-lingual models. This page details the usage of these
models.
XLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
be split into two categories: the checkpoints that make use of language embeddings and those that don't.
XLM & Language Embeddings
-----------------------------------------------------------------------------------------------------------------------
This section concerns the following checkpoints:
- ``xlm-mlm-ende-1024`` (Masked language modeling, English-German)
- ``xlm-mlm-enfr-1024`` (Masked language modeling, English-French)
- ``xlm-mlm-enro-1024`` (Masked language modeling, English-Romanian)
- ``xlm-mlm-xnli15-1024`` (Masked language modeling, XNLI languages)
- ``xlm-mlm-tlm-xnli15-1024`` (Masked language modeling + Translation, XNLI languages)
- ``xlm-clm-enfr-1024`` (Causal language modeling, English-French)
- ``xlm-clm-ende-1024`` (Causal language modeling, English-German)
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
the tokenizer.
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
.. code-block::
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel
>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
``lang2id`` attribute:
.. code-block::
>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
.. code-block::
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
We should now define the language embedding by using the previously defined language id. We want to create a tensor
filled with the appropriate language ids, of the same size as ``input_ids``. For English, the id is 0:
.. code-block::
>>> language_id = tokenizer.lang2id['en'] # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
You can then feed it all as input to your model:
.. code-block::
>>> outputs = model(input_ids, langs=langs)
The example :prefix_link:`run_generation.py <examples/pytorch/text-generation/run_generation.py>` can generate text
using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings
-----------------------------------------------------------------------------------------------------------------------
This section concerns the following checkpoints:
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. Unlike the previously mentioned XLM checkpoints,
these models are used for generic sentence representations.
BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
- ``xlm-roberta-base`` (Masked language modeling, 100 languages)
- ``xlm-roberta-large`` (Masked language modeling, 100 languages)
mLUKE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mLUKE is based on XLM-RoBERTa and further trained on Wikipedia articles in 24 languages with masked language modeling
as well as masked entity prediction objective.
The model can be used in the same way as other models solely based on word-piece inputs, but also can be used with
entity representations to achieve further performance gain, with entity-related tasks such as relation extraction,
named entity recognition and question answering (see :doc:`LUKE <model_doc/luke>`).
Currently, one mLUKE checkpoint is available:
- ``studio-ousia/mluke-base`` (Masked language modeling + Masked entity prediction, 100 languages)
docs/source/perplexity.mdx Normal file
@ -0,0 +1,128 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Perplexity of fixed-length models
[[open-in-colab]]
Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)).
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is,
$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}$$
where \\(\log p_\theta (x_i|x_{<i})\\) is the log-likelihood of the ith token conditioned on the preceding tokens \\(x_{<i}\\) according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing different models.
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
[fantastic blog post on The Gradient](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/).
## Calculating PPL with fixed-length models
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="/imgs/ppl_full.gif"/>
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate \\(p_\theta(x_t|x_{<t})\\) directly when \\(t\\) is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is \\(k\\), we then approximate the likelihood of a token \\(x_t\\) by conditioning only on the
\\(k-1\\) tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="/imgs/ppl_chunked.gif"/>
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
<img width="600" alt="Sliding window PPL taking advantage of all available context" src="/imgs/ppl_sliding.gif"/>
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
## Example: Calculating perplexity with GPT-2 in 🤗 Transformers
Let's demonstrate this process with GPT-2.
```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
device = 'cuda'
model_id = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
```
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.
```python
from datasets import load_dataset
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
```
With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. The following
is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
```python
import torch
from tqdm import tqdm
max_length = model.config.n_positions
stride = 512
nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
begin_loc = max(i + stride - max_length, 0)
end_loc = min(i + stride, encodings.input_ids.size(1))
trg_len = end_loc - i # may be different from stride on last loop
input_ids = encodings.input_ids[:,begin_loc:end_loc].to(device)
target_ids = input_ids.clone()
target_ids[:,:-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids, labels=target_ids)
neg_log_likelihood = outputs[0] * trg_len
nlls.append(neg_log_likelihood)
ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
```
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same
as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window
strategy, this jumps down to `16.53`. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
@ -1,143 +0,0 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Perplexity of fixed-length models
=======================================================================================================================
Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
<model_summary>`).
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence :math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,
.. math::
\text{PPL}(X)
= \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}
where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing
different models.
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
.. image:: /imgs/ppl_full.gif
:width: 600
:alt: Full decomposition of a sequence with unlimited context length
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
.. image:: /imgs/ppl_chunked.gif
:width: 600
:alt: Suboptimal PPL not taking advantage of full available context
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
.. image:: /imgs/ppl_sliding.gif
:width: 600
:alt: Sliding window PPL taking advantage of all available context
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's demonstrate this process with GPT-2.
.. code-block:: python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
device = 'cuda'
model_id = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.
.. code-block:: python
from datasets import load_dataset
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average negative
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
.. code-block:: python
import torch
from tqdm import tqdm
max_length = model.config.n_positions
stride = 512
nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
begin_loc = max(i + stride - max_length, 0)
end_loc = min(i + stride, encodings.input_ids.size(1))
trg_len = end_loc - i # may be different from stride on last loop
input_ids = encodings.input_ids[:,begin_loc:end_loc].to(device)
target_ids = input_ids.clone()
target_ids[:,:-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids, labels=target_ids)
neg_log_likelihood = outputs[0] * trg_len
nlls.append(neg_log_likelihood)
ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our striding window
strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.
@ -0,0 +1,341 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Preprocessing data
[[open-in-colab]]
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a [tokenizer](main_classes/tokenizer). You can build one using the tokenizer class associated to the model
you would like to use, or directly with the [`AutoTokenizer`] class.
As we saw in the [quick tour](quicktour), the tokenizer will first split a given text into words (or parts of
words, punctuation symbols, etc.) usually called _tokens_. Then it will convert those _tokens_ into numbers, to be able
to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
to work properly.
<Tip>
If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
the text you give it into tokens the same way it did for the pretraining corpus, and it will use the same
token-to-index correspondence (which we usually call a _vocab_) as during pretraining.
</Tip>
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
[`AutoTokenizer.from_pretrained`] method:
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
```
## Base use
<Youtube id="Yffk5aydLzg"/>
A [`PreTrainedTokenizer`] has many methods, but the only one you need to remember for preprocessing
is its `__call__`: you just need to feed your sentence to your tokenizer object.
```py
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> print(encoded_input)
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This returns a dictionary string to list of ints. The [input_ids](glossary#input-ids) are the indices corresponding
to each token in our sentence. We will see below what the [attention_mask](glossary#attention-mask) is used for and
in [the next section](#preprocessing-pairs-of-sentences) the goal of [token_type_ids](glossary#token-type-ids).
The tokenizer can decode a list of token ids back into a proper sentence:
```py
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
```
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used _gpt2-medium_ instead of _bert-base-cased_ to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing `add_special_tokens=False`.
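For example, here is the same sentence encoded without the special tokens (again, only advised if you add them yourself):

```py
>>> encoded_no_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
>>> # encoded_no_special["input_ids"] no longer starts with the [CLS] id or ends with the [SEP] id
```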
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
```py
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[101, 1262, 1330, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
```
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
```py
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(batch)
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
===PT-TF-SPLIT===
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(batch)
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
```
It returns a dictionary with string keys and tensor values. We can now see what the [attention_mask](glossary#attention-mask) is all about: it points out which tokens the model should pay attention to and which ones
it should not (because they represent padding in this case).
Note that if your model does not have a maximum length associated with it, the command above will throw a warning. You
can safely ignore it. You can also pass `verbose=False` to stop the tokenizer from throwing those kinds of warnings.
<a id='sentence-pairs'></a>
## Preprocessing pairs of sentences
<Youtube id="0u3ioSwev3s"/>
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: `[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This will once again return a dict string to list of ints:
```py
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This shows us what the [token_type_ids](glossary#token-type-ids) are for: they indicate to the model which part of
the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
_token_type_ids_ are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
`return_input_ids` or `return_token_type_ids`.
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
```py
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
```
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
list of first sentences and the list of second sentences:
```py
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
... "And I should be encoded with the second sentence",
... "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```
As we can see, it returns a dictionary where each value is a list of lists of ints.
To double-check what is fed to the model, we can decode each list in _input_ids_ one by one:
```py
>>> for ids in encoded_inputs["input_ids"]:
>>> print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
```
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
length the model can accept and return tensors directly with the following:
```py
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
===PT-TF-SPLIT===
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
```
## Everything you always wanted to know about padding and truncation
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are `padding`, `truncation` and `max_length`.
- `padding` controls the padding. It can be a boolean or a string which should be:
- `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
a single sequence).
- `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted
by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence,
padding will still be applied to it.
- `False` or `'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
behavior.
- `truncation` controls the truncation. It can be a boolean or a string which should be:
    - `True` or `'longest_first'` to truncate to a maximum length specified by the `max_length` argument or
      the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will
      truncate token by token, removing a token from the longest sequence in the pair until the proper length is
      reached.
    - `'only_second'` to truncate to a maximum length specified by the `max_length` argument or the maximum
      length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
      the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
    - `'only_first'` to truncate to a maximum length specified by the `max_length` argument or the maximum
      length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
      the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
    - `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
      default behavior.
- `max_length` to control the length of the padding/truncation. It can be an integer or `None`, in which case
it will default to the maximum length the model can accept. If the model has no specific maximum input length,
truncation/padding to `max_length` is deactivated.
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace `truncation=True` with a `STRATEGY` selected from
`['only_first', 'only_second', 'longest_first']`, e.g. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.
| Truncation | Padding | Instruction |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation | no padding | `tokenizer(batch_sentences)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or |
| | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
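For instance, here is a small sketch (reusing the `tokenizer` and the two batches defined above, with an arbitrary `max_length` of 32) that pads every pair to the same length while only ever truncating the second sentence of each pair:
```py
>>> batch = tokenizer(
...     batch_sentences,
...     batch_of_second_sentences,
...     padding="max_length",
...     truncation="only_second",
...     max_length=32,
... )
>>> print([len(ids) for ids in batch["input_ids"]])
[32, 32, 32]
```
Since the longest pair above is only 21 tokens, nothing actually gets truncated here; every encoded pair is simply padded up to `max_length`.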
## Pre-tokenized inputs
The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in [named entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) or
[part-of-speech tagging (POS tagging)](https://en.wikipedia.org/wiki/Part-of-speech_tagging).
<Tip warning={true}>
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
</Tip>
If you want to use pre-tokenized inputs, just set `is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
```py
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
`add_special_tokens=False`.
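For instance, a quick sketch with the same pre-tokenized input as above returns the same ids minus the `[CLS]` and `[SEP]` ids:
```py
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True, add_special_tokens=False)
>>> print(encoded_input["input_ids"])
[8667, 146, 112, 182, 170, 1423, 5650]
```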
This works exactly as before for batches of sentences or batches of pairs of sentences. You can encode a batch of sentences
like this:
```py
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                   ["And", "another", "sentence"],
                   ["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
```
or a batch of pair sentences like this:
```py
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                             ["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
```
And you can add padding, truncation as well as directly return tensors like before:
```py
batch = tokenizer(batch_sentences,
                  batch_of_second_sentences,
                  is_split_into_words=True,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")
===PT-TF-SPLIT===
batch = tokenizer(batch_sentences,
                  batch_of_second_sentences,
                  is_split_into_words=True,
                  padding=True,
                  truncation=True,
                  return_tensors="tf")
```
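Since pre-tokenized inputs are mostly useful for word-level tasks like NER or POS tagging, you will often want to know which word each token comes from so you can align your labels. Assuming you are using a fast (Rust-backed) tokenizer, the `word_ids` method of the returned encoding gives you that mapping; here is a small sketch:
```py
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input.word_ids())
[None, 0, 1, 1, 1, 2, 3, 4, None]
```
`None` marks the special tokens, and repeated indices (here the three tokens coming from "I'm") tell you which tokens belong to the same original word.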
@ -1,365 +0,0 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Preprocessing data
=======================================================================================================================
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
As we saw in the :doc:`quick tour </quicktour>`, the tokenizer will first split a given text in words (or part of
words, punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able
to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
to work properly.
.. note::
If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence
token to index (that we usually call a `vocab`) as during pretraining.
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
:func:`~transformers.AutoTokenizer.from_pretrained` method:
.. code-block::
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Base use
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html
<iframe width="560" height="315" src="https://www.youtube.com/embed/Yffk5aydLzg" title="YouTube video player"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
.. code-block::
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> print(encoded_input)
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary string to list of ints. The `input_ids <glossary#input-ids>`__ are the indices corresponding
to each token in our sentence. We will see below what the `attention_mask <glossary#attention-mask>`__ is used for and
in :ref:`the next section <preprocessing-pairs-of-sentences>` the goal of `token_type_ids <glossary#token-type-ids>`__.
The tokenizer can decode a list of token ids in a proper sentence:
.. code-block::
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing ``add_special_tokens=False``.
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
.. code-block::
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[101, 1262, 1330, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
- To pad each sentence to the maximum length there is in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
.. code-block::
>>> ## PYTORCH CODE
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(batch)
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
>>> ## TENSORFLOW CODE
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
>>> print(batch)
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask
<glossary#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which ones
it should not (because they represent padding in this case).
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
.. _sentence-pairs:
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html
<iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This will once again return a dict string to list of ints:
.. code-block::
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This shows us what the `token_type_ids <glossary#token-type-ids>`__ are for: they indicate to the model which part of
the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
``return_input_ids`` or ``return_token_type_ids``.
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
.. code-block::
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
list of first sentences and the list of second sentences:
.. code-block::
>>> batch_sentences = ["Hello I'm a single sentence",
... "And another sentence",
... "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
... "And I should be encoded with the second sentence",
... "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
As we can see, it returns a dictionary where each value is a list of lists of ints.
To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
.. code-block::
>>> for ids in encoded_inputs["input_ids"]:
>>> print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
length the model can accept and return tensors directly with the following:
.. code-block::
## PYTORCH CODE
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
Everything you always wanted to know about padding and truncation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
- :obj:`padding` controls the padding. It can be a boolean or a string which should be:
- :obj:`True` or :obj:`'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
a single sequence).
- :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
padding will still be applied to it.
- :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
behavior.
- :obj:`truncation` controls the truncation. It can be a boolean or a string which should be:
- :obj:`True` or :obj:`'longest_first'` truncate to a maximum length specified by the :obj:`max_length` argument or
the maximum length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will
truncate token by token, removing a token from the longest sequence in the pair until the proper length is
reached.
- :obj:`'only_second'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will only truncate
the second sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
- :obj:`'only_first'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will only truncate
the first sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
- :obj:`False` or :obj:`'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
default behavior.
- :obj:`max_length` to control the length of the padding/truncation. It can be an integer or :obj:`None`, in which case
it will default to the maximum length the model can accept. If the model has no specific maximum input length,
truncation/padding to :obj:`max_length` is deactivated.
Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in
any of the following examples, you can replace :obj:`truncation=True` with a :obj:`STRATEGY` selected from
:obj:`['only_first', 'only_second', 'longest_first']`, e.g. :obj:`truncation='only_second'` or :obj:`truncation=
'longest_first'`, to control how both sequences in the pair are truncated as detailed before.
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| Truncation | Padding | Instruction |
+======================================+===================================+=============================================================================================+
| no truncation | no padding | :obj:`tokenizer(batch_sentences)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding='longest')` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length')` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | :obj:`tokenizer(batch_sentences, padding='max_length', max_length=42)` |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| truncation to max model input length | no padding | :obj:`tokenizer(batch_sentences, truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | Not possible |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| truncation to specific length | no padding | :obj:`tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max sequence in batch | :obj:`tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to max model input length | Not possible |
| +-----------------------------------+---------------------------------------------------------------------------------------------+
| | padding to specific length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
Pre-tokenized inputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
.. warning::
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
.. code-block::
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
``add_special_tokens=False``.
This works exactly as before for batches of sentences or batches of pairs of sentences. You can encode a batch of sentences
like this:
.. code-block::
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
["And", "another", "sentence"],
["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
or a batch of pair sentences like this:
.. code-block::
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
And you can add padding, truncation as well as directly return tensors like before:
.. code-block::
## PYTORCH CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="tf")
430
docs/source/quicktour.mdx Normal file
@ -0,0 +1,430 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Quick tour
[[open-in-colab]]
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating into another language.
First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.
<Tip>
All code examples presented in the documentation have a switch on the top left for PyTorch versus TensorFlow. If a
code sample has no such switch, it is expected to work for both backends without any change needed.
</Tip>
## Getting started on a task with a pipeline
The easiest way to use a pretrained model on a given task is to use [`~transformers.pipeline`].
<Youtube id="tiZFewofSLM"/>
🤗 Transformers provides the following tasks out of the box:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by `[MASK]`), fill the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.
Let's see how this works for sentiment analysis (the other tasks are all covered in the [task summary](task_summary)):
Install the following dependencies (if not already installed):
```bash
pip install torch
===PT-TF-SPLIT===
pip install tensorflow
```
```py
>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
```
When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
look at both later on, but as an introduction the tokenizer's job is to preprocess the text for the model, which is
then responsible for making predictions. The pipeline groups all of that together, and post-processes the predictions to
make them readable. For instance:
```py
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]
```
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning
a list of dictionaries like this one:
```py
>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.",
... "We hope you don't hate it."])
>>> for result in results:
... print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
```
To use a pipeline with a large dataset, look at [iterating over a pipeline](main_classes/pipelines).
You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
fairly neutral.
By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
look at its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) to get more
information about it. It uses the [DistilBERT architecture](model_doc/distilbert) and has been fine-tuned on a
dataset called SST-2 for the sentiment analysis task.
Let's say we want to use another model; for instance, one that has been trained on French data. We can search through
the [model hub](https://huggingface.co/models) that gathers models pretrained on a lot of data by research labs, but
also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
"French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
see how we can use it.
You can directly pass the name of the model to use to [`pipeline`]:
```py
>>> classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
```
This classifier can now deal with texts in English and French, but also Dutch, German, Italian and Spanish! You can also
replace that name by a local folder where you have saved a pretrained model (see below). You can also pass a model
object and its associated tokenizer.
We will need two classes for this. The first is [`AutoTokenizer`], which we will use to download the tokenizer associated with the model we picked and instantiate it. The second is [`AutoModelForSequenceClassification`] (or [`TFAutoModelForSequenceClassification`] if you are using TensorFlow), which we will use to download the model itself. Note that if we were using the library on another task, the class of the model would change. The [task summary](task_summary) tutorial summarizes which class is used for which task.
```py
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
```
Now, to download the models and tokenizer we found previously, we just have to use the
[`~transformers.AutoModelForSequenceClassification.from_pretrained`] method (feel free to replace `model_name` by
any other model from the model hub):
```py
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
===PT-TF-SPLIT===
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> # This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
```
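As a quick check, you could now run the classifier on a French sentence. This is just a hypothetical example: this particular model predicts a star rating (labels such as `'1 star'` to `'5 stars'`) rather than POSITIVE/NEGATIVE, and the exact score will depend on the checkpoint:
```py
result = classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
print(result)  # expected to look like [{'label': '5 stars', 'score': ...}]
```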
If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
pretrained model on your data. We provide [example scripts](examples) to do so. Once you're done, don't forget
to share your fine-tuned model on the hub with the community, using [this tutorial](model_sharing).
<a id='pretrained-model'></a>
## Under the hood: pretrained models
Let's now see what happens beneath the hood when using those pipelines.
<Youtube id="AhChOFRegn4"/>
As we saw, the model and tokenizer are created using the `from_pretrained` method:
```py
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
```
### Using the tokenizer
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
words (or part of words, punctuation symbols, etc.) usually called _tokens_. There are multiple rules that can govern
that process (you can learn more about them in the [tokenizer summary](tokenizer_summary)), which is why we need
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
pretrained.
The second step is to convert those _tokens_ into numbers, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a _vocab_, which is the part we download when we instantiate it with the
`from_pretrained` method, since we need to use the same _vocab_ as when the model was pretrained.
To apply these steps on a given text, we can just feed it to our tokenizer:
```py
>>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
```
This returns a dictionary string to list of ints. It contains the [ids of the tokens](glossary#input-ids), as
mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
[attention mask](glossary#attention-mask) that the model will use to have a better understanding of the sequence:
```py
>>> print(inputs)
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
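You can double-check what these ids correspond to by decoding them. As a rough sketch (with this uncased checkpoint, the text comes back lowercased and the emoji, which is not in the vocabulary, shows up as `[UNK]`):
```py
>>> tokenizer.decode(inputs["input_ids"])
"[CLS] we are very happy to show you the [UNK] transformers library. [SEP]"
```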
You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept
and get tensors back. You can specify all of that to the tokenizer:
```py
>>> pt_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
... truncation=True,
... max_length=512,
... return_tensors="pt"
... )
===PT-TF-SPLIT===
>>> tf_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
... truncation=True,
... max_length=512,
... return_tensors="tf"
... )
```
The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:
```py
>>> for key, value in pt_batch.items():
... print(f"{key}: {value.numpy().tolist()}")
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
===PT-TF-SPLIT===
>>> for key, value in tf_batch.items():
... print(f"{key}: {value.numpy().tolist()}")
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
```
You can learn more about tokenizers [here](preprocessing).
### Using the model
Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
directly to the model; for a PyTorch model, you need to unpack the dictionary by adding `**`.
```py
>>> pt_outputs = pt_model(**pt_batch)
===PT-TF-SPLIT===
>>> tf_outputs = tf_model(tf_batch)
```
In 🤗 Transformers, all outputs are objects that contain the model's final activations along with other metadata. These
objects are described in greater detail [here](main_classes/output). For now, let's inspect the output ourselves:
```py
>>> print(pt_outputs)
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
===PT-TF-SPLIT===
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 , 4.3364 ],
[ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
```
Notice how the output object has a `logits` attribute. You can use this to access the model's final activations.
<Tip>
All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
function (like SoftMax) since this final activation function is often fused with the loss.
</Tip>
Let's apply the SoftMax activation to get predictions.
```py
>>> from torch import nn
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
```
We can see we get the numbers from before:
```py
>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
[5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
===PT-TF-SPLIT===
>>> print(tf_predictions)
tf.Tensor(
[[2.2043e-04 9.9978e-01]
[5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
```
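To turn those probabilities into readable labels, like the pipeline did, you can take the argmax and look it up in the model's `id2label` mapping. This is a small sketch that assumes this checkpoint maps `0` to `NEGATIVE` and `1` to `POSITIVE`:
```py
>>> pt_class_ids = pt_predictions.argmax(dim=-1)
>>> print([pt_model.config.id2label[class_id] for class_id in pt_class_ids.tolist()])
['POSITIVE', 'NEGATIVE']
===PT-TF-SPLIT===
>>> tf_class_ids = tf.argmax(tf_predictions, axis=-1)
>>> print([tf_model.config.id2label[class_id] for class_id in tf_class_ids.numpy().tolist()])
['POSITIVE', 'NEGATIVE']
```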
If you provide the model with labels in addition to inputs, the model output object will also contain a `loss`
attribute:
```py
>>> import torch
>>> pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))
>>> print(pt_outputs)
SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> tf_outputs = tf_model(tf_batch, labels=tf.constant([1, 0]))
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051e-04, 6.3326e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 , 4.3364 ],
[ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
```
Models are standard [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) so you can use them in your usual training loop. 🤗 Transformers also provides a [`Trainer`] class to help with your training in PyTorch (taking care of things such as distributed training, mixed precision, etc.) whereas you can leverage the `fit()` method in Keras. See the [training tutorial](training) for more details.
<Tip>
PyTorch model outputs are special dataclasses so that you can get autocompletion for their attributes in an IDE.
They also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string) in which
case the attributes not set (that have `None` values) are ignored.
</Tip>
Once your model is fine-tuned, you can save it with its tokenizer in the following way:
```py
>>> pt_save_directory = './pt_save_pretrained'
>>> tokenizer.save_pretrained(pt_save_directory)
>>> pt_model.save_pretrained(pt_save_directory)
===PT-TF-SPLIT===
>>> tf_save_directory = './tf_save_pretrained'
>>> tokenizer.save_pretrained(tf_save_directory)
>>> tf_model.save_pretrained(tf_save_directory)
```
You can then load this model back using the [`AutoModel.from_pretrained`] method by passing the
directory name instead of the model name. One cool feature of 🤗 Transformers is that you can easily switch between
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow.
If you would like to load your saved model in the other framework, first make sure it is installed:
```bash
pip install tensorflow
===PT-TF-SPLIT===
pip install torch
```
Then, use the corresponding Auto class to load it like this:
```py
>>> from transformers import TFAutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
>>> tf_model = TFAutoModel.from_pretrained(pt_save_directory, from_pt=True)
===PT-TF-SPLIT===
>>> from transformers import AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModel.from_pretrained(tf_save_directory, from_tf=True)
```
Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
```py
>>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states = pt_outputs.hidden_states
>>> all_attentions = pt_outputs.attentions
===PT-TF-SPLIT===
>>> tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states = tf_outputs.hidden_states
>>> all_attentions = tf_outputs.attentions
```
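For instance, you can inspect their shapes. As a sketch, this DistilBERT checkpoint should return one hidden state per layer plus one for the embeddings (7 in total), each of shape `(batch_size, sequence_length, hidden_size)`:
```py
>>> print(len(all_hidden_states))
7
>>> print(all_hidden_states[-1].shape)
torch.Size([2, 14, 768])
===PT-TF-SPLIT===
>>> print(len(all_hidden_states))
7
>>> print(all_hidden_states[-1].shape)
(2, 14, 768)
```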
### Accessing the code
The [`AutoModel`] and [`AutoTokenizer`] classes are just shortcuts that will automatically work with any
pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
code is easy to access and tweak if you need to.
In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the [DistilBERT](model_doc/distilbert) architecture. As [`AutoModelForSequenceClassification`] (or [`TFAutoModelForSequenceClassification`] if you are using TensorFlow) was used, the model automatically created is then a [`DistilBertForSequenceClassification`]. You can look at its documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate model and tokenizer without the auto magic:
```py
>>> from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = DistilBertForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
```
### Customizing the model
If you want to change how the model itself is built, you can define a custom configuration class. Each architecture
comes with its own relevant configuration. For example, [`DistilBertConfig`] allows you to specify
parameters such as the hidden dimension, dropout rate, etc. for DistilBERT. If you do core modifications, like changing
the hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would
then instantiate the model directly from this configuration.
Below, we load a predefined vocabulary for a tokenizer with the
[`~DistilBertTokenizer.from_pretrained`] method. However, unlike the tokenizer, we wish to initialize
the model from scratch. Therefore, we instantiate the model from a configuration instead of using the
[`DistilBertForSequenceClassification.from_pretrained`] method.
```py
>>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
>>> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
>>> model = DistilBertForSequenceClassification(config)
===PT-TF-SPLIT===
>>> from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
>>> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
>>> model = TFDistilBertForSequenceClassification(config)
```
For something that only changes the head of the model (for instance, the number of labels), you can still use a
pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
Instead of creating a new configuration with all the default values just to change the number of labels, we can instead
pass any argument a configuration would take to the [`from_pretrained`] method and it will update the default
configuration appropriately:
```py
>>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased"
>>> model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased"
>>> model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
```
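You can check that the configuration reflects this change (a small sketch; the classification head now has 10 outputs while the rest of the architecture keeps its pretrained weights):
```py
>>> print(model.config.num_labels)
10
```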
@ -1,471 +0,0 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Quick tour
=======================================================================================================================
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating in another language.
First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.
.. note::
All code examples presented in the documentation have a switch on the top left for Pytorch versus TensorFlow. If
not, the code is expected to work for both backends without any change needed.
Getting started on a task with a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`.
.. raw:: html
<iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM" title="YouTube video player"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
🤗 Transformers provides the following tasks out of the box:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place,
etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by ``[MASK]``), fill the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text in another language.
- Feature extraction: return a tensor representation of the text.
Let's see how this works for sentiment analysis (the other tasks are all covered in the :doc:`task summary
</task_summary>`):
Install the following dependencies (if not already installed):
.. code-block:: bash
## PYTORCH CODE
pip install torch
## TENSORFLOW CODE
pip install tensorflow
.. code-block::
>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
look at both later on, but as an introduction the tokenizer's job is to preprocess the text for the model, which is
then responsible for making predictions. The pipeline groups all of that together, and post-processes the predictions to
make them readable. For instance:
.. code-block::
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning
a list of dictionaries like this one:
.. code-block::
>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.",
... "We hope you don't hate it."])
>>> for result in results:
... print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
To use with a large dataset, look at :doc:`iterating over a pipeline <./main_classes/pipelines>`
You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
fairly neutral.
By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
look at its `model page <https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english>`__ to get more
information about it. It uses the :doc:`DistilBERT architecture </model_doc/distilbert>` and has been fine-tuned on a
dataset called SST-2 for the sentiment analysis task.
Let's say we want to use another model; for instance, one that has been trained on French data. We can search through
the `model hub <https://huggingface.co/models>`__ that gathers models pretrained on a lot of data by research labs, but
also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
"French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
see how we can use it.
You can directly pass the name of the model to use to :func:`~transformers.pipeline`:
.. code-block::
>>> classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also
replace that name by a local folder where you have saved a pretrained model (see below). You can also pass a model
object and its associated tokenizer.
We will need two classes for this. The first is :class:`~transformers.AutoTokenizer`, which we will use to download the
tokenizer associated to the model we picked and instantiate it. The second is
:class:`~transformers.AutoModelForSequenceClassification` (or
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow), which we will use to download
the model itself. Note that if we were using the library on another task, the class of the model would change. The
:doc:`task summary </task_summary>` tutorial summarizes which class is used for which task.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> ## TENSORFLOW CODE
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
Now, to download the models and tokenizer we found previously, we just have to use the
:func:`~transformers.AutoModelForSequenceClassification.from_pretrained` method (feel free to replace ``model_name`` by
any other model from the model hub):
.. code-block::
>>> ## PYTORCH CODE
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
>>> ## TENSORFLOW CODE
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> # This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
pretrained model on your data. We provide :doc:`example scripts </examples>` to do so. Once you're done, don't forget
to share your fine-tuned model on the hub with the community, using :doc:`this tutorial </model_sharing>`.
.. _pretrained-model:
Under the hood: pretrained models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's now see what happens beneath the hood when using those pipelines.
.. raw:: html
<iframe width="560" height="315" src="https://www.youtube.com/embed/AhChOFRegn4" title="YouTube video player"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
As we saw, the model and tokenizer are created using the :obj:`from_pretrained` method:
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> ## TENSORFLOW CODE
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
Using the tokenizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`), which is why we need
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
pretrained.
The second step is to convert those `tokens` into numbers, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a `vocab`, which is the part we download when we instantiate it with the
:obj:`from_pretrained` method, since we need to use the same `vocab` as when the model was pretrained.
To apply these steps on a given text, we can just feed it to our tokenizer:
.. code-block::
>>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
This returns a dictionary string to list of ints. It contains the `ids of the tokens <glossary#input-ids>`__, as
mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
`attention mask <glossary#attention-mask>`__ that the model will use to have a better understanding of the sequence:
.. code-block::
>>> print(inputs)
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept
and get tensors back. You can specify all of that to the tokenizer:
.. code-block::
>>> ## PYTORCH CODE
>>> pt_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
... truncation=True,
... max_length=512,
... return_tensors="pt"
... )
>>> ## TENSORFLOW CODE
>>> tf_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
... truncation=True,
... max_length=512,
... return_tensors="tf"
... )
The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:
.. code-block::
>>> ## PYTORCH CODE
>>> for key, value in pt_batch.items():
... print(f"{key}: {value.numpy().tolist()}")
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
>>> ## TENSORFLOW CODE
>>> for key, value in tf_batch.items():
... print(f"{key}: {value.numpy().tolist()}")
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
You can learn more about tokenizers :doc:`here <preprocessing>`.
Using the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
directly to the model; for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.
.. code-block::
>>> ## PYTORCH CODE
>>> pt_outputs = pt_model(**pt_batch)
>>> ## TENSORFLOW CODE
>>> tf_outputs = tf_model(tf_batch)
In 🤗 Transformers, all outputs are objects that contain the model's final activations along with other metadata. These
objects are described in greater detail :doc:`here <main_classes/output>`. For now, let's inspect the output ourselves:
.. code-block::
>>> ## PYTORCH CODE
>>> print(pt_outputs)
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
>>> ## TENSORFLOW CODE
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 , 4.3364 ],
[ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
Notice how the output object has a ``logits`` attribute. You can use this to access the model's final activations.
.. note::
All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
function (like SoftMax) since this final activation function is often fused with the loss.
Let's apply the SoftMax activation to get predictions.
.. code-block::
>>> ## PYTORCH CODE
>>> from torch import nn
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
>>> ## TENSORFLOW CODE
>>> import tensorflow as tf
>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
We can see we get the numbers from before:
.. code-block::
>>> ## TENSORFLOW CODE
>>> print(tf_predictions)
tf.Tensor(
[[2.2043e-04 9.9978e-01]
[5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
>>> ## PYTORCH CODE
>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
[5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
If you provide the model with labels in addition to inputs, the model output object will also contain a ``loss``
attribute:
.. code-block::
>>> ## PYTORCH CODE
>>> import torch
>>> pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))
>>> print(pt_outputs)
SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
>>> ## TENSORFLOW CODE
>>> import tensorflow as tf
>>> tf_outputs = tf_model(tf_batch, labels=tf.constant([1, 0]))
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051e-04, 6.3326e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 , 4.3364 ],
[ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if you are using
TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision,
etc.). See the :doc:`training tutorial <training>` for more details.
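As a minimal sketch (PyTorch only, assuming you have already prepared a ``train_dataset``, for instance with the 🤗 Datasets library, and treating the ``./results`` output directory as a placeholder), training with the :class:`~transformers.Trainer` could look like this:
.. code-block::
>>> from transformers import Trainer, TrainingArguments
>>> # Placeholder output directory and a single epoch, just for illustration
>>> training_args = TrainingArguments(output_dir="./results", num_train_epochs=1)
>>> # `train_dataset` is assumed to have been prepared beforehand
>>> trainer = Trainer(model=pt_model, args=training_args, train_dataset=train_dataset)
>>> trainer.train()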
.. note::
PyTorch model outputs are special dataclasses so that you can get autocompletion for their attributes in an IDE.
They also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string) in which
case the attributes not set (that have :obj:`None` values) are ignored.
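For instance, for an output computed without labels (so its ``loss`` is :obj:`None`), the following expressions are all equivalent ways of accessing the logits. This is just a quick illustration reusing the objects defined above:
.. code-block::
>>> outputs = pt_model(**pt_batch)  # no labels passed, so `loss` is None
>>> outputs.logits      # attribute access
>>> outputs["logits"]   # dictionary-style access
>>> outputs[0]          # tuple-style access; attributes left at None (here the loss) are skipped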
Once your model is fine-tuned, you can save it with its tokenizer in the following way:
.. code-block::
>>> ## PYTORCH CODE
>>> pt_save_directory = './pt_save_pretrained'
>>> tokenizer.save_pretrained(pt_save_directory)
>>> pt_model.save_pretrained(pt_save_directory)
>>> ## TENSORFLOW CODE
>>> tf_save_directory = './tf_save_pretrained'
>>> tokenizer.save_pretrained(tf_save_directory)
>>> tf_model.save_pretrained(tf_save_directory)
You can then load this model back using the :func:`~transformers.AutoModel.from_pretrained` method by passing the
directory name instead of the model name. One cool feature of 🤗 Transformers is that you can easily switch between
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow.
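For instance, here is a short sketch of reloading the PyTorch model saved above in the same framework (reusing the ``pt_save_directory`` defined earlier):
.. code-block::
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(pt_save_directory)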
If you would like to load your saved model in the other framework, first make sure it is installed:
.. code-block:: bash
## PYTORCH CODE
pip install tensorflow
## TENSORFLOW CODE
pip install torch
Then, use the corresponding Auto class to load it like this:
.. code-block::
## PYTORCH CODE
>>> from transformers import TFAutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
>>> tf_model = TFAutoModel.from_pretrained(pt_save_directory, from_pt=True)
## TENSORFLOW CODE
>>> from transformers import AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModel.from_pretrained(tf_save_directory, from_tf=True)
Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
.. code-block::
>>> ## PYTORCH CODE
>>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states = pt_outputs.hidden_states
>>> all_attentions = pt_outputs.attentions
>>> ## TENSORFLOW CODE
>>> tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states = tf_outputs.hidden_states
>>> all_attentions = tf_outputs.attentions
Accessing the code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that will automatically work with any
pretrained model. Behind the scenes, the library has one model class per combination of architecture and task, so the
code is easy to access and tweak if you need to.
In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the :doc:`DistilBERT </model_doc/distilbert>` architecture. As
:class:`~transformers.AutoModelForSequenceClassification` (or
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow) was used, the model
automatically created is then a :class:`~transformers.DistilBertForSequenceClassification`. You can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate the model and tokenizer without the auto magic:
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = DistilBertForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
>>> ## TENSORFLOW CODE
>>> from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
Customizing the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you want to change how the model itself is built, you can define a custom configuration class. Each architecture
comes with its own relevant configuration. For example, :class:`~transformers.DistilBertConfig` allows you to specify
parameters such as the hidden dimension, dropout rate, etc. for DistilBERT. If you do core modifications, like changing
the hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would
then instantiate the model directly from this configuration.
Below, we load a predefined vocabulary for a tokenizer with the
:func:`~transformers.DistilBertTokenizer.from_pretrained` method. However, unlike the tokenizer, we wish to initialize
the model from scratch. Therefore, we instantiate the model from a configuration instead of using the
:func:`~transformers.DistilBertForSequenceClassification.from_pretrained` method.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
>>> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
>>> model = DistilBertForSequenceClassification(config)
>>> ## TENSORFLOW CODE
>>> from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
>>> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
>>> model = TFDistilBertForSequenceClassification(config)
For something that only changes the head of the model (for instance, the number of labels), you can still use a
pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
Instead of creating a new configuration with all the default values just to change the number of labels, we can instead
pass any argument a configuration would take to the :func:`from_pretrained` method and it will update the default
configuration appropriately:
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased"
>>> model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
>>> ## TENSORFLOW CODE
>>> from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased"
>>> model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
View File
@@ -0,0 +1,888 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Summary of the tasks
[[open-in-colab]]
This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The simplest ones are presented here, showcasing usage for
tasks such as question answering, sequence classification, named entity recognition and others.
These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint,
automatically selecting the correct model architecture. Please check the [`AutoModel`] documentation
for more information. Feel free to modify the code and adapt it to your specific use-case.
In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These
checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the
following:
- Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage
one of the *run_$TASK.py* scripts in the [examples](https://github.com/huggingface/transformers/tree/master/examples) directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case and
domain. As mentioned previously, you may leverage the [examples](https://github.com/huggingface/transformers/tree/master/examples) scripts to fine-tune your model, or you may
create your own training script.
In order to run inference on a task, several mechanisms are made available by the library:
- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Direct model use: fewer abstractions, but more flexibility and power via direct access to a tokenizer
(PyTorch/TensorFlow) and full inference capacity.
Both approaches are showcased here.
<Tip>
All tasks presented here leverage pre-trained checkpoints that were fine-tuned on specific tasks. Loading a
checkpoint that was not fine-tuned on a specific task would load only the base transformer layers and not the
additional head that is used for the task, initializing the weights of that head randomly.
This would produce random output.
</Tip>
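To make the warning above concrete, here is a small sketch (not one of the task examples below) of what happens when a task-specific class is loaded from a checkpoint that was only pretrained as a base model: the task head is randomly initialized, and 🤗 Transformers logs a warning about newly initialized weights.
```py
>>> from transformers import AutoModelForSequenceClassification

>>> # `bert-base-uncased` has no sequence classification head, so the head of this model
>>> # is randomly initialized and a warning about newly initialized weights is logged.
>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
```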
## Sequence Classification
Sequence classification is the task of classifying sequences according to a given number of classes. An example of
sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
model on a GLUE sequence classification task, you may leverage the [run_glue.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification/run_glue.py), [run_tf_glue.py](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification/run_tf_glue.py), [run_tf_text_classification.py](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification/run_tf_text_classification.py) or [run_xnli.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification/run_xnli.py) scripts.
Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
leverages a fine-tuned model on sst2, which is a GLUE task.
This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
```py
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis")
>>> result = classifier("I hate you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991
>>> result = classifier("I love you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
```
Here is an example of doing sequence classification using a model to determine if two sequences are paraphrases of
each other. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded
with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention
masks (which will be created automatically by the tokenizer).
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
5. Print the results.
```py
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> classes = ["not paraphrase", "is paraphrase"]
>>> sequence_0 = "The company HuggingFace is based in New York City"
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
>>> # the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
>>> paraphrase_classification_logits = model(**paraphrase).logits
>>> not_paraphrase_classification_logits = model(**not_paraphrase).logits
>>> paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
>>> not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
>>> # Should be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
not paraphrase: 10%
is paraphrase: 90%
>>> # Should not be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%
===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> classes = ["not paraphrase", "is paraphrase"]
>>> sequence_0 = "The company HuggingFace is based in New York City"
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
>>> # the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
>>> paraphrase_classification_logits = model(paraphrase).logits
>>> not_paraphrase_classification_logits = model(not_paraphrase).logits
>>> paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
>>> not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
>>> # Should be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
not paraphrase: 10%
is paraphrase: 90%
>>> # Should not be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%
```
## Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
model on a SQuAD task, you may leverage the [run_qa.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering/run_qa.py) and
[run_tf_squad.py](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/question-answering/run_tf_squad.py)
scripts.
Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
leverages a fine-tuned model on SQuAD.
```py
>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering")
>>> context = r"""
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
... """
```
This returns an answer extracted from the text, a confidence score, and "start" and "end" values, which are the
positions of the extracted answer in the text.
```py
>>> result = question_answerer(question="What is extractive question answering?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
>>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
```
Here is an example of question answering using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded
with the weights stored in the checkpoint.
2. Define a text and a few questions.
3. Iterate over the questions and build a sequence from the text and the current question, with the correct
model-specific separators, token type ids and attention masks.
4. Pass this sequence through the model. This outputs a range of scores across the entire sequence of tokens (question and
text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens.
6. Fetch the tokens from the identified start and stop values, convert those tokens to a string.
7. Print the results.
```py
>>> from transformers import AutoTokenizer, AutoModelForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> text = r"""
... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
... architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
... Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
... TensorFlow 2.0 and PyTorch.
... """
>>> questions = [
... "How many pretrained models are available in 🤗 Transformers?",
... "What does 🤗 Transformers provide?",
... "🤗 Transformers provides interoperability between which frameworks?",
... ]
>>> for question in questions:
... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
... input_ids = inputs["input_ids"].tolist()[0]
...
... outputs = model(**inputs)
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... # Get the most likely beginning of answer with the argmax of the score
... answer_start = torch.argmax(answer_start_scores)
... # Get the most likely end of answer with the argmax of the score
... answer_end = torch.argmax(answer_end_scores) + 1
...
... answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
...
... print(f"Question: {question}")
... print(f"Answer: {answer}")
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> text = r"""
... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
... architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
... Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
... TensorFlow 2.0 and PyTorch.
... """
>>> questions = [
... "How many pretrained models are available in 🤗 Transformers?",
... "What does 🤗 Transformers provide?",
... "🤗 Transformers provides interoperability between which frameworks?",
... ]
>>> for question in questions:
... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
... input_ids = inputs["input_ids"].numpy()[0]
...
... outputs = model(inputs)
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... # Get the most likely beginning of answer with the argmax of the score
... answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
... # Get the most likely end of answer with the argmax of the score
... answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1
...
... answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
...
... print(f"Question: {question}")
... print(f"Answer: {answer}")
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
```
## Language Modeling
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.
Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it on a news dataset or
on scientific papers e.g. [LysandreJik/arxiv-nlp](https://huggingface.co/lysandre/arxiv-nlp).
### Masked Language Modeling
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such training creates a strong basis for
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see [Lewis, Liu, Goyal et al.](https://arxiv.org/abs/1910.13461), part 4.2). If you would like to fine-tune a model on a masked language modeling
task, you may leverage the [run_mlm.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling/run_mlm.py) script.
Here is an example of using pipelines to replace a mask from a sequence:
```py
>>> from transformers import pipeline
>>> unmasker = pipeline("fill-mask")
```
This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary:
```py
>>> from pprint import pprint
>>> pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1793,
'sequence': 'HuggingFace is creating a tool that the community uses to solve '
'NLP tasks.',
'token': 3944,
'token_str': ' tool'},
{'score': 0.1135,
'sequence': 'HuggingFace is creating a framework that the community uses to '
'solve NLP tasks.',
'token': 7208,
'token_str': ' framework'},
{'score': 0.0524,
'sequence': 'HuggingFace is creating a library that the community uses to '
'solve NLP tasks.',
'token': 5560,
'token_str': ' library'},
{'score': 0.0349,
'sequence': 'HuggingFace is creating a database that the community uses to '
'solve NLP tasks.',
'token': 8503,
'token_str': ' database'},
{'score': 0.0286,
'sequence': 'HuggingFace is creating a prototype that the community uses to '
'solve NLP tasks.',
'token': 17715,
'token_str': ' prototype'}]
```
Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and
is loaded with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the `tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the
values are the scores attributed to each token. The model gives a higher score to tokens it deems probable in that
context.
5. Retrieve the top 5 tokens using the PyTorch `topk` or TensorFlow `top_k` methods.
6. Replace the mask token with each of the top tokens and print the results.
```py
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
>>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
... f"versions would help {tokenizer.mask_token} our carbon footprint."
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
>>> token_logits = model(**inputs).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]
>>> top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
>>> for token in top_5_tokens:
... print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForMaskedLM, AutoTokenizer
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
>>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
... f"versions would help {tokenizer.mask_token} our carbon footprint."
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
>>> token_logits = model(**inputs).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]
>>> top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()
>>> for token in top_5_tokens:
... print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
```
This prints five sequences, with the top 5 tokens predicted by the model.
### Causal Language Modeling
Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
model only attends to the left context (the tokens to the left of the position being predicted). Such training is particularly interesting
for generation tasks. If you would like to fine-tune a model on a causal language modeling task, you may leverage the
[run_clm.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling/run_clm.py) script.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
input sequence.
Here is an example of using the tokenizer and model and leveraging the
[`PreTrainedModel.top_k_top_p_filtering`] method to sample the next token following an input sequence
of tokens.
```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
>>> import torch
>>> from torch import nn
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and"
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> # get logits of last hidden state
>>> next_token_logits = model(**inputs).logits[:, -1, :]
>>> # filter
>>> filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
>>> # sample
>>> probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
>>> next_token = torch.multinomial(probs, num_samples=1)
>>> generated = torch.cat([input_ids, next_token], dim=-1)
>>> resulting_string = tokenizer.decode(generated.tolist()[0])
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...
===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = TFAutoModelForCausalLM.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and"
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> input_ids = inputs["input_ids"]
>>> # get logits of last hidden state
>>> next_token_logits = model(**inputs).logits[:, -1, :]
>>> # filter
>>> filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
>>> # sample
>>> next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)
>>> generated = tf.concat([input_ids, next_token], axis=1)
>>> resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...
```
This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word *is* or
*features*.
In the next section, we show how [`generation_utils.GenerationMixin.generate`] can be used to
generate multiple tokens up to a specified length instead of one token at a time.
### Text Generation
In text generation (*a.k.a.* *open-ended text generation*) the goal is to create a coherent portion of text that is a
continuation from the given context. The following example shows how *GPT-2* can be used in pipelines to generate text.
By default, all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations
(see the [gpt-2 config](https://huggingface.co/gpt2/blob/main/config.json) for example).
```py
>>> from transformers import pipeline
>>> text_generator = pipeline("text-generation")
>>> print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
```
Here, the model generates text with a total maximum length of *50* tokens from the context *"As far as I am
concerned, I will"*. Behind the scenes, the pipeline object calls the method
[`PreTrainedModel.generate`] to generate text. The default arguments for this method can be
overridden in the pipeline, as is shown above for the arguments `max_length` and `do_sample`.
Below is an example of text generation using `XLNet` and its tokenizer, which includes calling `generate()` directly:
```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
>>> PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
... (except for Alexei and Maria) are discovered.
... The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
... remainder of the story. 1883 Western Siberia,
... a young Grigori Rasputin is asked by his father and a group of men to perform magic.
... Rasputin has a vision and denounces one of the men as a horse thief. Although his
... father initially slaps him for making such an accusation, Rasputin watches as the
... man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
... the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
... with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
>>> prompt = "Today the weather is really nice and I am planning on "
>>> inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> prompt_length = len(tokenizer.decode(inputs[0]))
>>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]
>>> print(generated)
Today the weather is really nice and I am planning ...
===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer
>>> model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
>>> PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
... (except for Alexei and Maria) are discovered.
... The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
... remainder of the story. 1883 Western Siberia,
... a young Grigori Rasputin is asked by his father and a group of men to perform magic.
... Rasputin has a vision and denounces one of the men as a horse thief. Although his
... father initially slaps him for making such an accusation, Rasputin watches as the
... man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
... the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
... with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
>>> prompt = "Today the weather is really nice and I am planning on "
>>> inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")["input_ids"]
>>> prompt_length = len(tokenizer.decode(inputs[0]))
>>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]
>>> print(generated)
Today the weather is really nice and I am planning ...
```
Text generation is currently possible with *GPT-2*, *OpenAI GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in
PyTorch, and for most models in TensorFlow as well. As can be seen in the example above, *XLNet* and *Transfo-XL* often
need to be padded to work well. GPT-2 is usually a good choice for *open-ended text generation* because it was trained
on millions of webpages with a causal language modeling objective.
For more information on how to apply different decoding strategies for text generation, please also refer to our text
generation blog post [here](https://huggingface.co/blog/how-to-generate).
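As a rough sketch of how those strategies are selected in code (the hyperparameter values below are illustrative, not tuned recommendations), the same `generate()` call covers greedy search, beam search and sampling:
```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> inputs = tokenizer("Hugging Face is based in", return_tensors="pt")

>>> # Greedy search (the default when `do_sample=False` and `num_beams=1`)
>>> greedy_output = model.generate(inputs["input_ids"], max_length=30)
>>> # Beam search
>>> beam_output = model.generate(inputs["input_ids"], max_length=30, num_beams=5, early_stopping=True)
>>> # Top-p (nucleus) sampling
>>> sampled_output = model.generate(inputs["input_ids"], max_length=30, do_sample=True, top_p=0.92, top_k=0)
```
Each call returns token ids that can be turned back into text with `tokenizer.decode()`, as in the examples above.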
## Named Entity Recognition
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
[run_ner.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification/run_ner.py) script.
Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
belonging to one of 9 classes:
- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MIS, Miscellaneous entity
- B-PER, Beginning of a person's name right after another person's name
- I-PER, Person's name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by [@stefan-it](https://github.com/stefan-it) from [dbmdz](https://github.com/dbmdz).
```py
>>> from transformers import pipeline
>>> ner_pipe = pipeline("ner")
>>> sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
... therefore very close to the Manhattan Bridge which is visible from the window."""
```
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
Here are the expected results:
```py
>>> for entity in ner_pipe(sequence):
... print(entity)
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
```
Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.
Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded
with the weights stored in the checkpoint.
2. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
3. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
4. Encode that sequence into IDs (special tokens are added automatically).
5. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
each token.
6. Zip together each token with its prediction and print it.
```py
>>> from transformers import AutoModelForTokenClassification, AutoTokenizer
>>> import torch
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
... "therefore very close to the Manhattan Bridge."
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> tokens = inputs.tokens()
>>> outputs = model(**inputs).logits
>>> predictions = torch.argmax(outputs, dim=2)
===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
>>> import tensorflow as tf
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
... "therefore very close to the Manhattan Bridge."
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> tokens = inputs.tokens()
>>> outputs = model(**inputs)[0]
>>> predictions = tf.argmax(outputs, axis=2)
```
This outputs a list of each token mapped to its corresponding prediction. Unlike the pipeline, here every
token has a prediction, as we didn't remove the "O" class, which means that no particular entity was found on that
token.
In the above example, `predictions` contains, for each token, an integer that corresponds to the predicted class. We can use the
`model.config.id2label` property in order to recover the class name corresponding to the class number, which is
illustrated below:
```py
>>> for token, prediction in zip(tokens, predictions[0].numpy()):
... print((token, model.config.id2label[prediction]))
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
```
## Summarization
Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a
model on a summarization task, you may leverage the [run_summarization.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/run_summarization.py)
script.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this [document](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.
```py
>>> from transformers import pipeline
>>> summarizer = pipeline("summarization")
>>> ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
... A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
... Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
... In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
... Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
... 2010 marriage license application, according to court documents.
... Prosecutors said the marriages were part of an immigration scam.
... On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
... After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
... Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
... All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
... Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
... Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
... The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
... Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
... Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
... """
```
Because the summarization pipeline depends on the `PreTrainedModel.generate()` method, we can override the default
arguments of `PreTrainedModel.generate()` directly in the pipeline for `max_length` and `min_length` as shown
below. This outputs the following summary:
```py
>>> print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
```
Here is an example of doing summarization using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
model, such as `Bart` or `T5`.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
4. Use the `PreTrainedModel.generate()` method to generate the summary.
In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
CNN / Daily Mail), it yields very good results.
```py
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
>>> inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
>>> outputs = model.generate(
... inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
... )
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
>>> inputs = tokenizer("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
>>> outputs = model.generate(
... inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
... )
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.
```
## Translation
Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a
translation task, you may leverage the [run_translation.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation/run_translation.py) script.
An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this [document](https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation/README.md).
Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet it yields impressive translation results.
```py
>>> from transformers import pipeline
>>> translator = pipeline("translation_en_to_de")
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
```
Because the translation pipeline depends on the `PreTrainedModel.generate()` method, we can override the default
arguments of `PreTrainedModel.generate()` directly in the pipeline as is shown for `max_length` above.
Here is an example of doing translation using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder
model, such as `Bart` or `T5`.
2. Define the text that should be translated.
3. Add the T5 specific prefix "translate English to German: ".
4. Use the `PreTrainedModel.generate()` method to perform the translation.
```py
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer(
... "translate English to German: Hugging Face is a technology company based in New York and Paris",
... return_tensors="pt"
... )
>>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer(
... "translate English to German: Hugging Face is a technology company based in New York and Paris",
... return_tensors="tf"
... )
>>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
```
We get the same translation as with the pipeline example.

View File

@ -1,927 +0,0 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Summary of the tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage for
tasks such as question answering, sequence classification, named entity recognition and others.
These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint,
automatically selecting the correct model architecture. Please check the :class:`~transformers.AutoModel` documentation
for more information. Feel free to modify the code to be more specific and adapt it to your specific use-case.
In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These
checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the
following:
- Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage
one of the `run_$TASK.py` scripts in the `examples
<https://github.com/huggingface/transformers/tree/master/examples>`__ directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case and
domain. As mentioned previously, you may leverage the `examples
<https://github.com/huggingface/transformers/tree/master/examples>`__ scripts to fine-tune your model, or you may
create your own training script.
In order to do an inference on a task, several mechanisms are made available by the library:
- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Direct model use: Less abstractions, but more flexibility and power via a direct access to a tokenizer
(PyTorch/TensorFlow) and full inference capacity.
Both approaches are showcased here.
.. note::
All tasks presented here leverage pre-trained checkpoints that were fine-tuned on specific tasks. Loading a
checkpoint that was not fine-tuned on a specific task would load only the base transformer layers and not the
additional head that is used for the task, initializing the weights of that head randomly.
This would produce random output.
Sequence Classification
-----------------------------------------------------------------------------------------------------------------------
Sequence classification is the task of classifying sequences according to a given number of classes. An example of
sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
model on a GLUE sequence classification task, you may leverage the :prefix_link:`run_glue.py
<examples/pytorch/text-classification/run_glue.py>`, :prefix_link:`run_tf_glue.py
<examples/tensorflow/text-classification/run_tf_glue.py>`, :prefix_link:`run_tf_text_classification.py
<examples/tensorflow/text-classification/run_tf_text_classification.py>` or :prefix_link:`run_xnli.py
<examples/pytorch/text-classification/run_xnli.py>` scripts.
Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
leverages a fine-tuned model on sst2, which is a GLUE task.
This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
.. code-block::
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis")
>>> result = classifier("I hate you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991
>>> result = classifier("I love you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
Here is an example of doing a sequence classification using a model to determine if two sequences are paraphrases of
each other. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention
masks (which will be created automatically by the tokenizer).
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
5. Print the results.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> classes = ["not paraphrase", "is paraphrase"]
>>> sequence_0 = "The company HuggingFace is based in New York City"
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
>>> # the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
>>> paraphrase_classification_logits = model(**paraphrase).logits
>>> not_paraphrase_classification_logits = model(**not_paraphrase).logits
>>> paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
>>> not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
>>> # Should be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
not paraphrase: 10%
is paraphrase: 90%
>>> # Should not be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%
>>> ## TENSORFLOW CODE
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> classes = ["not paraphrase", "is paraphrase"]
>>> sequence_0 = "The company HuggingFace is based in New York City"
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
>>> # the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
>>> paraphrase_classification_logits = model(paraphrase).logits
>>> not_paraphrase_classification_logits = model(not_paraphrase).logits
>>> paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
>>> not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
>>> # Should be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
not paraphrase: 10%
is paraphrase: 90%
>>> # Should not be paraphrase
>>> for i in range(len(classes)):
... print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%
Extractive Question Answering
-----------------------------------------------------------------------------------------------------------------------
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
model on a SQuAD task, you may leverage the `run_qa.py
<https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering/run_qa.py>`__ and
`run_tf_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/tensorflow/question-answering/run_tf_squad.py>`__
scripts.
Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
leverages a fine-tuned model on SQuAD.
.. code-block::
>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering")
>>> context = r"""
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
... """
This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the
positions of the extracted answer in the text.
.. code-block::
>>> result = question_answerer(question="What is extractive question answering?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
>>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
Here is an example of question answering using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Define a text and a few questions.
3. Iterate over the questions and build a sequence from the text and the current question, with the correct
model-specific separators token type ids and attention masks.
4. Pass this sequence through the model. This outputs a range of scores across the entire sequence tokens (question and
text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens.
6. Fetch the tokens from the identified start and stop values, convert those tokens to a string.
7. Print the results.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoTokenizer, AutoModelForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> text = r"""
... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
... architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
... Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
... TensorFlow 2.0 and PyTorch.
... """
>>> questions = [
... "How many pretrained models are available in 🤗 Transformers?",
... "What does 🤗 Transformers provide?",
... "🤗 Transformers provides interoperability between which frameworks?",
... ]
>>> for question in questions:
... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
... input_ids = inputs["input_ids"].tolist()[0]
...
... outputs = model(**inputs)
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... # Get the most likely beginning of answer with the argmax of the score
... answer_start = torch.argmax(answer_start_scores)
... # Get the most likely end of answer with the argmax of the score
... answer_end = torch.argmax(answer_end_scores) + 1
...
... answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
...
... print(f"Question: {question}")
... print(f"Answer: {answer}")
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
>>> ## TENSORFLOW CODE
>>> from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> text = r"""
... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
... architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
... Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
... TensorFlow 2.0 and PyTorch.
... """
>>> questions = [
... "How many pretrained models are available in 🤗 Transformers?",
... "What does 🤗 Transformers provide?",
... "🤗 Transformers provides interoperability between which frameworks?",
... ]
>>> for question in questions:
... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
... input_ids = inputs["input_ids"].numpy()[0]
...
... outputs = model(inputs)
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... # Get the most likely beginning of answer with the argmax of the score
... answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
... # Get the most likely end of answer with the argmax of the score
... answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1
...
... answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
...
... print(f"Question: {question}")
... print(f"Answer: {answer}")
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
Language Modeling
-----------------------------------------------------------------------------------------------------------------------
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.
Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
Masked Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
<https://arxiv.org/abs/1910.13461>`__, part 4.2). If you would like to fine-tune a model on a masked language modeling
task, you may leverage the :prefix_link:`run_mlm.py <examples/pytorch/language-modeling/run_mlm.py>` script.
Here is an example of using pipelines to replace a mask from a sequence:
.. code-block::
>>> from transformers import pipeline
>>> unmasker = pipeline("fill-mask")
This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary:
.. code-block::
>>> from pprint import pprint
>>> pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1793,
'sequence': 'HuggingFace is creating a tool that the community uses to solve '
'NLP tasks.',
'token': 3944,
'token_str': ' tool'},
{'score': 0.1135,
'sequence': 'HuggingFace is creating a framework that the community uses to '
'solve NLP tasks.',
'token': 7208,
'token_str': ' framework'},
{'score': 0.0524,
'sequence': 'HuggingFace is creating a library that the community uses to '
'solve NLP tasks.',
'token': 5560,
'token_str': ' library'},
{'score': 0.0349,
'sequence': 'HuggingFace is creating a database that the community uses to '
'solve NLP tasks.',
'token': 8503,
'token_str': ' database'},
{'score': 0.0286,
'sequence': 'HuggingFace is creating a prototype that the community uses to '
'solve NLP tasks.',
'token': 17715,
'token_str': ' prototype'}]
Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and
loads it with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the :obj:`tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the
values are the scores attributed to each token. The model gives higher score to tokens it deems probable in that
context.
5. Retrieve the top 5 tokens using the PyTorch :obj:`topk` or TensorFlow :obj:`top_k` methods.
6. Replace the mask token by the tokens and print the results
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
>>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
... f"versions would help {tokenizer.mask_token} our carbon footprint."
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
>>> token_logits = model(**inputs).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]
>>> top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
>>> for token in top_5_tokens:
... print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForMaskedLM, AutoTokenizer
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
>>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
... f"versions would help {tokenizer.mask_token} our carbon footprint."
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
>>> token_logits = model(**inputs).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]
>>> top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()
>>> for token in top_5_tokens:
... print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
This prints five sequences, with the top 5 tokens predicted by the model.
Causal Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
for generation tasks. If you would like to fine-tune a model on a causal language modeling task, you may leverage the
:prefix_link:`run_clm.py <examples/pytorch/language-modeling/run_clm.py>` script.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
input sequence.
Here is an example of using the tokenizer and model and leveraging the
:func:`~transformers.PreTrainedModel.top_k_top_p_filtering` method to sample the next token following an input sequence
of tokens.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
>>> import torch
>>> from torch import nn
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and"
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> # get logits of last hidden state
>>> next_token_logits = model(**inputs).logits[:, -1, :]
>>> # filter
>>> filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
>>> # sample
>>> probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
>>> next_token = torch.multinomial(probs, num_samples=1)
>>> generated = torch.cat([input_ids, next_token], dim=-1)
>>> resulting_string = tokenizer.decode(generated.tolist()[0])
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = TFAutoModelForCausalLM.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and"
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> input_ids = inputs["input_ids"]
>>> # get logits of last hidden state
>>> next_token_logits = model(**inputs).logits[:, -1, :]
>>> # filter
>>> filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
>>> # sample
>>> next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)
>>> generated = tf.concat([input_ids, next_token], axis=1)
>>> resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...
This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word *is* or
*features*.
In the next section, we show how :func:`~transformers.generation_utils.GenerationMixin.generate` can be used to
generate multiple tokens up to a specified length instead of one token at a time.
Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a
continuation from the given context. The following example shows how *GPT-2* can be used in pipelines to generate text.
As a default all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations
(see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).
.. code-block::
>>> from transformers import pipeline
>>> text_generator = pipeline("text-generation")
>>> print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am
concerned, I will"*. Behind the scenes, the pipeline object calls the method
:func:`~transformers.PreTrainedModel.generate` to generate text. The default arguments for this method can be
overridden in the pipeline, as is shown above for the arguments ``max_length`` and ``do_sample``.
Below is an example of text generation using ``XLNet`` and its tokenizer, which includes calling ``generate`` directly:
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
>>> PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
... (except for Alexei and Maria) are discovered.
... The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
... remainder of the story. 1883 Western Siberia,
... a young Grigori Rasputin is asked by his father and a group of men to perform magic.
... Rasputin has a vision and denounces one of the men as a horse thief. Although his
... father initially slaps him for making such an accusation, Rasputin watches as the
... man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
... the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
... with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
>>> prompt = "Today the weather is really nice and I am planning on "
>>> inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> prompt_length = len(tokenizer.decode(inputs[0]))
>>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]
>>> print(generated)
Today the weather is really nice and I am planning ...
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer
>>> model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
>>> PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
... (except for Alexei and Maria) are discovered.
... The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
... remainder of the story. 1883 Western Siberia,
... a young Grigori Rasputin is asked by his father and a group of men to perform magic.
... Rasputin has a vision and denounces one of the men as a horse thief. Although his
... father initially slaps him for making such an accusation, Rasputin watches as the
... man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
... the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
... with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
>>> prompt = "Today the weather is really nice and I am planning on "
>>> inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")["input_ids"]
>>> prompt_length = len(tokenizer.decode(inputs[0]))
>>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]
>>> print(generated)
Today the weather is really nice and I am planning ...
Text generation is currently possible with *GPT-2*, *OpenAi-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in
PyTorch and for most models in Tensorflow as well. As can be seen in the example above *XLNet* and *Transfo-XL* often
need to be padded to work well. GPT-2 is usually a good choice for *open-ended text generation* because it was trained
on millions of webpages with a causal language modeling objective.
For more information on how to apply different decoding strategies for text generation, please also refer to our text
generation blog post `here <https://huggingface.co/blog/how-to-generate>`__.
Named Entity Recognition
-----------------------------------------------------------------------------------------------------------------------
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
:prefix_link:`run_ner.py <examples/pytorch/token-classification/run_ner.py>` script.
Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
belonging to one of 9 classes:
- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MIS, Miscellaneous entity
- B-PER, Beginning of a person's name right after another person's name
- I-PER, Person's name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https://github.com/stefan-it>`__ from `dbmdz
<https://github.com/dbmdz>`__.
.. code-block::
>>> from transformers import pipeline
>>> ner_pipe = pipeline("ner")
>>> sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
... therefore very close to the Manhattan Bridge which is visible from the window."""
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
Here are the expected results:
.. code-block::
>>> for entity in ner_pipe(sequence):
... print(entity)
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.
Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
3. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
4. Encode that sequence into IDs (special tokens are added automatically).
5. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
each token.
6. Zip together each token with its prediction and print it.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelForTokenClassification, AutoTokenizer
>>> import torch
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
... "therefore very close to the Manhattan Bridge."
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> tokens = inputs.tokens()
>>> outputs = model(**inputs).logits
>>> predictions = torch.argmax(outputs, dim=2)
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
>>> import tensorflow as tf
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
... "therefore very close to the Manhattan Bridge."
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> tokens = inputs.tokens()
>>> outputs = model(**inputs)[0]
>>> predictions = tf.argmax(outputs, axis=2)
This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every
token has a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that
token.
In the above example, ``predictions`` is an integer that corresponds to the predicted class. We can use the
``model.config.id2label`` property in order to recover the class name corresponding to the class number, which is
illustrated below:
.. code-block::
>>> for token, prediction in zip(tokens, predictions[0].numpy()):
... print((token, model.config.id2label[prediction]))
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
Summarization
-----------------------------------------------------------------------------------------------------------------------
Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a
model on a summarization task, you may leverage the `run_summarization.py
<https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/run_summarization.py>`__
script.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this :prefix_link:`document <examples/pytorch/summarization/README.md>`.
Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.
.. code-block::
>>> from transformers import pipeline
>>> summarizer = pipeline("summarization")
>>> ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
... A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
... Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
... In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
... Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
... 2010 marriage license application, according to court documents.
... Prosecutors said the marriages were part of an immigration scam.
... On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
... After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
... Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
... All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
... Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
... Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
... The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
... Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
... Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
... """
Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown
below. This outputs the following summary:
.. code-block::
>>> print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
Here is an example of doing summarization using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
4. Use the ``PreTrainedModel.generate()`` method to generate the summary.
In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
CNN / Daily Mail), it yields very good results.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
>>> inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
>>> outputs = model.generate(
... inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
... )
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
>>> inputs = tokenizer("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
>>> outputs = model.generate(
... inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
... )
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.
Translation
-----------------------------------------------------------------------------------------------------------------------
Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a
translation task, you may leverage the `run_translation.py
<https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation/run_translation.py>`__ script.
An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this :prefix_link:`document
<examples/pytorch/translation/README.md>`.
Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet, yielding impressive translation results.
.. code-block::
>>> from transformers import pipeline
>>> translator = pipeline("translation_en_to_de")
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Here is an example of doing translation using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "translate English to German: "
4. Use the ``PreTrainedModel.generate()`` method to perform the translation.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer(
... "translate English to German: Hugging Face is a technology company based in New York and Paris",
... return_tensors="pt"
... )
>>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer(
... "translate English to German: Hugging Face is a technology company based in New York and Paris",
... return_tensors="tf"
... )
>>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
We get the same translation as with the pipeline example.

View File

@ -0,0 +1,276 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Summary of the tokenizers
[[open-in-colab]]
On this page, we will have a closer look at tokenization.
<Youtube id="VFp38yj8h3A"/>
As we saw in [the preprocessing tutorial](preprocessing), tokenizing a text is splitting it into words or
subwords, which then are converted to ids through a look-up table. Converting words or subwords to ids is
straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text).
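As a quick illustration of those two steps (splitting into tokens, then looking each token up in a table of ids), here is a minimal sketch using a BERT tokenizer; the `bert-base-uncased` checkpoint is only an example choice, any pretrained tokenizer works the same way:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")  # split the text into (sub)word tokens
>>> ids = tokenizer.convert_tokens_to_ids(tokens)  # look up each token's id in the vocabulary
```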
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding
(BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples
of which tokenizer type is used by which model.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at [`BertTokenizer`], we can see
that the model uses [WordPiece](#wordpiece).
## Introduction
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
For instance, let's look at the sentence `"Don't you love 🤗 Transformers? We sure do."`
<Youtube id="nhJxYji1aho"/>
A simple way of tokenizing this text is to split it by spaces, which would give:
```
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
```
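This whitespace split can be reproduced with Python's built-in `str.split` (plain Python, no 🤗 Transformers API involved):

```py
>>> text = "Don't you love 🤗 Transformers? We sure do."
>>> text.split()
["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
```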
This is a sensible first step, but if we look at the tokens `"Transformers?"` and `"do."`, we notice that the
punctuation is attached to the words `"Transformers"` and `"do"`, which is suboptimal. We should take the
punctuation into account so that a model does not have to learn a different representation of a word and every possible
punctuation symbol that could follow it, which would explode the number of representations the model has to learn.
Taking punctuation into account, tokenizing our exemplary text would give:
```
["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
```
Better. However, the way the tokenization dealt with the word `"Don't"` is suboptimal. `"Don't"` stands for
`"do not"`, so it would be better tokenized as `["Do", "n't"]`. This is where things start getting complicated, and
part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a
different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an
input that was tokenized with the same rules that were used to tokenize its training data.
[spaCy](https://spacy.io/) and [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) are two popular
rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like:
```
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
```
As can be seen, space and punctuation tokenization as well as rule-based tokenization are used here. Space and
punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
greater than 50,000, especially if they are pretrained only on a single language.
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
<Youtube id="ssLq_EK2jLE"/>
While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder
for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent
representation for the letter `"t"` is much harder than learning a context-independent representation for the word
`"today"`. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of
both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword**
tokenization.
### Subword tokenization
<Youtube id="zHvTiHr506c"/>
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
subwords, but rare words should be decomposed into meaningful subwords. For instance `"annoyingly"` might be
considered a rare word and could be decomposed into `"annoying"` and `"ly"`. Both `"annoying"` and `"ly"` as
stand-alone subwords would appear more frequently while at the same time the meaning of `"annoyingly"` is kept by the
composite meaning of `"annoying"` and `"ly"`. This is especially useful in agglutinative languages such as Turkish,
where you can form (almost) arbitrarily long complex words by stringing together subwords.
Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful
context-independent representations. In addition, subword tokenization enables the model to process words it has never
seen before, by decomposing them into known subwords. For instance, the [`~transformers.BertTokenizer`] tokenizes
`"I have a new GPU!"` as follows:
```py
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("I have a new GPU!")
["i", "have", "a", "new", "gp", "##u", "!"]
```
Because we are considering the uncased model, the sentence was lowercased first. We can see that the words `["i", "have", "a", "new"]` are present in the tokenizer's vocabulary, but the word `"gpu"` is not. Consequently, the
tokenizer splits `"gpu"` into the known subwords: `["gp", "##u"]`. `"##"` means that the rest of the token should
be attached to the previous one, without space (for decoding or reversal of the tokenization).
As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously exemplary text as follows:
```py
>>> from transformers import XLNetTokenizer
>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
```
We'll get back to the meaning of those `"▁"` when we look at [SentencePiece](#sentencepiece). As one can see,
the rare word `"Transformers"` has been split into the more frequent subwords `"Transform"` and `"ers"`.
Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization
algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained
on.
<a id='byte-pair-encoding'></a>
## Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et
al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into
words. Pre-tokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [Roberta](model_doc/roberta). More advanced pre-tokenization includes rule-based tokenization, e.g. [XLM](model_doc/xlm),
[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses
Spacy and ftfy, to count the frequency of each word in the training corpus.
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
define before training the tokenizer.
As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been
determined:
```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```
Consequently, the base vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. Splitting all words into symbols of the
base vocabulary, we obtain:
```
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
the example above `"h"` followed by `"u"` is present _10 + 5 = 15_ times (10 times in the 10 occurrences of
`"hug"`, 5 times in the 5 occurrences of `"hugs"`). However, the most frequent symbol pair is `"u"` followed by
`"g"`, occurring _10 + 5 + 5 = 20_ times in total. Thus, the first merge rule the tokenizer learns is to group all
`"u"` symbols followed by a `"g"` symbol together. Next, `"ug"` is added to the vocabulary. The set of words then
becomes
```
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
```
BPE then identifies the next most common symbol pair. It's `"u"` followed by `"n"`, which occurs 16 times. `"u"`,
`"n"` is merged to `"un"` and added to the vocabulary. The next most frequent symbol pair is `"h"` followed by
`"ug"`, occurring 15 times. Again the pair is merged and `"hug"` can be added to the vocabulary.
At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words
is represented as
```
("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
```
Assuming that the Byte-Pair Encoding training stops at this point, the learned merge rules are then applied
to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance,
the word `"bug"` would be tokenized to `["b", "ug"]` but `"mug"` would be tokenized as `["<unk>", "ug"]` since
the symbol `"m"` is not in the base vocabulary. In general, single letters such as `"m"` are not replaced by the
`"<unk>"` symbol because the training data usually includes at least one occurrence of each letter, but it is likely
to happen for very special characters like emojis.
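To make the procedure above concrete, here is a minimal sketch of the merge training and of tokenizing new words on the toy corpus. This is only an illustration of the logic, not the implementation used by the library's tokenizers:
```python
from collections import Counter

# Toy corpus from the example above: word -> frequency
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
base_vocab = {"b", "g", "h", "n", "p", "s", "u"}

# Split every word into base-vocabulary symbols
splits = {word: list(word) for word in corpus}

def merge(symbols, pair):
    """Apply a single merge rule to a list of symbols, in place."""
    i = 0
    while i < len(symbols) - 1:
        if (symbols[i], symbols[i + 1]) == pair:
            symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
        else:
            i += 1

merges = []
for _ in range(3):  # learn 3 merge rules
    # Count every adjacent symbol pair, weighted by word frequency
    pair_counts = Counter()
    for word, freq in corpus.items():
        for pair in zip(splits[word], splits[word][1:]):
            pair_counts[pair] += freq
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    for symbols in splits.values():
        merge(symbols, best)

print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]

def tokenize(word):
    """Tokenize a new word with the learned merges, falling back to <unk> for unknown characters."""
    symbols = [c if c in base_vocab else "<unk>" for c in word]
    for pair in merges:
        merge(symbols, pair)
    return symbols

print(tokenize("bug"))  # ['b', 'ug']
print(tokenize("mug"))  # ['<unk>', 'ug']
```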
As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
to choose. For instance, [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop training after 40,000 merges.
### Byte-level BPE
A base vocabulary that includes all possible base characters can be quite large if *e.g.* all Unicode characters are
considered as base characters. To have a better base vocabulary, [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) uses bytes
as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that
every base character is included in the vocabulary. With some additional rules to deal with punctuation, GPT-2's
tokenizer can tokenize every text without the need for the `<unk>` symbol. [GPT-2](model_doc/gpt2) has a vocabulary
size of 50,257, which corresponds to the 256 byte base tokens, a special end-of-text token, and the symbols learned
with 50,000 merges.
<a id='wordpiece'></a>
## WordPiece
WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean
Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to
BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
equivalent to finding the symbol pair whose probability divided by the probabilities of its first symbol followed by
its second symbol is the greatest among all symbol pairs. *E.g.* `"u"`, followed by `"g"`, would only have been
merged if the probability of `"ug"` divided by the probabilities of `"u"` and `"g"` had been greater than for any other symbol
pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it _loses_ by merging two symbols
to ensure it's _worth it_.
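As a rough sketch of this selection criterion (again just an illustration, not WordPiece's actual training code), here is how the score of each pair could be computed on the toy corpus from the BPE section:
```python
from collections import Counter

# Toy corpus from the BPE section: word -> frequency
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Count individual symbols and adjacent symbol pairs, weighted by word frequency
symbol_counts, pair_counts = Counter(), Counter()
for word, freq in corpus.items():
    for symbol in word:
        symbol_counts[symbol] += freq
    for pair in zip(word, word[1:]):
        pair_counts[pair] += freq

def score(pair):
    # Proportional to p(first, second) / (p(first) * p(second))
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

# BPE would merge ("u", "g") first (the most frequent pair), but by this score
# ("g", "s") wins because "g" and "s" are individually much rarer than "u".
print(max(pair_counts, key=score))  # ('g', 's')
```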
<a id='unigram'></a>
## Unigram
Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural Network Translation
Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf). In contrast to BPE or
WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each
symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and
the most common substrings. Unigram is not used directly for any of the models in the transformers library, but it's used in
conjunction with [SentencePiece](#sentencepiece).
At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm
computes how much the overall loss would increase if the symbol were removed from the vocabulary. Unigram then
removes the p percent of symbols whose loss increase is the lowest (with p usually being 10% or 20%), *i.e.* those
symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has
reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.
Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:
```
["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],
```
`"hugs"` could be tokenized both as `["hug", "s"]`, `["h", "ug", "s"]` or `["h", "u", "g", "s"]`. So which one
to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that
the probability of each possible tokenization can be computed after training. In practice, the algorithm simply picks
the most likely tokenization, but it also offers the possibility of sampling a tokenization according to these
probabilities.
Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is
defined as \\(S(x_{i})\\), then the overall loss is defined as
$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$
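To see how this plays out when tokenizing, the sketch below enumerates the possible tokenizations of `"hugs"` for the vocabulary above and picks the most likely one. The token probabilities are made up for the sake of illustration; a real Unigram tokenizer learns them during training:
```python
# Hypothetical unigram probabilities for the saved vocabulary (made up for illustration)
probs = {"b": 0.04, "g": 0.05, "h": 0.06, "n": 0.07, "p": 0.08,
         "s": 0.09, "u": 0.1, "ug": 0.15, "un": 0.16, "hug": 0.2}

def segmentations(word):
    """Recursively enumerate every way of splitting `word` into tokens of the vocabulary."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        token = word[:end]
        if token in probs:
            for rest in segmentations(word[end:]):
                results.append([token] + rest)
    return results

def probability(tokens):
    p = 1.0
    for token in tokens:
        p *= probs[token]
    return p

candidates = segmentations("hugs")
print(candidates)                        # [['h', 'u', 'g', 's'], ['h', 'ug', 's'], ['hug', 's']]
print(max(candidates, key=probability))  # ['hug', 's'] -- the most likely tokenization
```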
<a id='sentencepiece'></a>
## SentencePiece
All tokenization algorithms described so far have the same problem: they assume that the input text uses spaces to
separate words. However, not all languages use spaces to separate words. One possible solution is to use a language-specific
pre-tokenizer, *e.g.* [XLM](model_doc/xlm) uses specific Chinese, Japanese, and Thai pre-tokenizers.
To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input
as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
algorithm to construct the appropriate vocabulary.
For example, [`XLNetTokenizer`] uses SentencePiece, which is also why in the example earlier the
`"▁"` character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be
concatenated and `"▁"` is replaced by a space.
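Roughly speaking (and ignoring special tokens), decoding therefore boils down to plain string operations:
```py
>>> tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
>>> "".join(tokens).replace("▁", " ").strip()
"Don't you love 🤗 Transformers? We sure do."
```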
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are [ALBERT](model_doc/albert), [XLNet](model_doc/xlnet), [Marian](model_doc/marian), and [T5](model_doc/t5).
docs/source/training.mdx Normal file
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Fine-tuning a pretrained model
[[open-in-colab]]
In this tutorial, we will show you how to fine-tune a pretrained model from the Transformers library. In TensorFlow,
models can be directly trained using Keras and the `fit` method. In PyTorch, there is no generic training loop, so
the 🤗 Transformers library provides an API with the [`Trainer`] class to let you easily fine-tune or train
a model from scratch. Then we will show you how to alternatively write the whole training loop in PyTorch.
Before we can fine-tune a model, we need a dataset. In this tutorial, we will show you how to fine-tune BERT on the
[IMDB dataset](https://www.imdb.com/interfaces/): the task is to classify whether movie reviews are positive or
negative. For examples of other tasks, refer to the [additional resources](#additional-resources) section!
<a id='data-processing'></a>
## Preparing the datasets
<Youtube id="_BZearw7f0w"/>
We will use the [🤗 Datasets](https://github.com/huggingface/datasets/) library to download and preprocess the IMDB
datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
to the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/) or the [preprocessing](preprocessing) tutorial for
more information.
First, we can use the `load_dataset` function to download and cache the dataset:
```python
from datasets import load_dataset
raw_datasets = load_dataset("imdb")
```
This works like the `from_pretrained` method we saw for the models and tokenizers (except the cache directory is
_~/.cache/huggingface/dataset_ by default).
The `raw_datasets` object is a dictionary with three keys: `"train"`, `"test"` and `"unsupervised"`
(which correspond to the three splits of that dataset). We will use the `"train"` split for training and the
`"test"` split for validation.
To preprocess our data, we will need a tokenizer:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
As we saw in [preprocessing](preprocessing), we can prepare the text inputs for the model with the following command (this is an
example, not a command you can execute):
```python
inputs = tokenizer(sentences, padding="max_length", truncation=True)
```
This will make all the samples have the maximum length the model can accept (here 512), either by padding or truncating
them.
However, we can instead apply these preprocessing steps to all the splits of our dataset at once by using the
`map` method:
```python
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```
You can learn more about the map method or the other ways to preprocess the data in the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/).
Next we will generate a small subset of the training and validation set, to enable faster training:
```python
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
```
In all the examples below, we will always use `small_train_dataset` and `small_eval_dataset`. Just replace
them by their _full_ equivalent to train or evaluate on the full dataset.
<a id='trainer'></a>
## Fine-tuning in PyTorch with the Trainer API
<Youtube id="nvBXf7s7vTI"/>
Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a [`Trainer`]
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
logging, gradient accumulation, and mixed precision.
First, let's define our model:
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
```
This will issue a warning about some of the pretrained weights not being used and some weights being randomly
initialized. That's because we are throwing away the pretraining head of the BERT model to replace it with a
classification head which is randomly initialized. We will fine-tune this model on our task, transferring the knowledge
of the pretrained model to it (which is why doing this is called transfer learning).
Then, to define our [`Trainer`], we will need to instantiate a
[`TrainingArguments`]. This class contains all the hyperparameters we can tune for the
[`Trainer`] or the flags to activate the different training options it supports. Let's begin by
using all the defaults; the only thing we then have to provide is a directory in which the checkpoints will be saved:
```python
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
```
Then we can instantiate a [`Trainer`] like this:
```python
from transformers import Trainer
trainer = Trainer(
model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
```
To fine-tune our model, we just need to call
```python
trainer.train()
```
which will start a training that you can follow with a progress bar, which should take a couple of minutes to complete
(as long as you have access to a GPU). However, it won't actually tell you anything useful about how well (or badly) your model
is performing because, by default, there is no evaluation during training, and we didn't tell the
[`Trainer`] to compute any metrics. Let's have a look at how to do that now!
To have the [`Trainer`] compute and report metrics, we need to give it a `compute_metrics`
function that takes predictions and labels (grouped in a namedtuple called [`EvalPrediction`]) and
returns a dictionary with string keys (the metric names) and float values (the metric values).
The 🤗 Datasets library provides an easy way to get the common metrics used in NLP with the `load_metric` function;
here we simply use accuracy. Then we define the `compute_metrics` function, which just converts logits to predictions
(remember that all 🤗 Transformers models return the logits) and feeds them to the `compute` method of this metric.
```python
import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
```
The compute function needs to receive a tuple (with logits and labels) and has to return a dictionary with string keys
(the name of the metric) and float values. It will be called at the end of each evaluation phase on the whole arrays of
predictions/labels.
To check if this works in practice, let's create a new [`Trainer`] with our fine-tuned model:
```python
trainer = Trainer(
model=model,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
trainer.evaluate()
```
which showed an accuracy of 87.5% in our case.
If you want to fine-tune your model and regularly report the evaluation metrics (for instance at the end of each
epoch), here is how you should define your training arguments:
```python
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
```
See the documentation of [`TrainingArguments`] for more options.
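For instance, the sketch below also sets the learning rate, batch sizes, number of epochs and weight decay; these values are only examples and are not tuned for this task:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    "test_trainer",
    evaluation_strategy="epoch",     # report metrics at the end of each epoch
    learning_rate=2e-5,              # initial learning rate for the optimizer
    per_device_train_batch_size=8,   # training batch size per GPU/CPU
    per_device_eval_batch_size=8,    # evaluation batch size per GPU/CPU
    num_train_epochs=3,              # total number of training epochs
    weight_decay=0.01,               # weight decay applied by the optimizer
)
```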
<a id='keras'></a>
## Fine-tuning with Keras
<Youtube id="rnTGBy2ax1c"/>
Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
```
Then we will need to convert our datasets from before into standard `tf.data.Dataset` objects. Since we have fixed shapes,
this can easily be done as follows. First we remove the _"text"_ column from our datasets and set them to TensorFlow
format:
```python
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")
```
Then we convert everything into big tensors and use the `tf.data.Dataset.from_tensor_slices` method:
```python
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)
eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(8)
```
With this done, the model can then be compiled and trained as any Keras model:
```python
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=tf.metrics.SparseCategoricalAccuracy(),
)
model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)
```
With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):
```python
from transformers import AutoModelForSequenceClassification
model.save_pretrained("my_imdb_model")
pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
```
<a id='pytorch_native'></a>
## Fine-tuning in native PyTorch
<Youtube id="Dh9CL8fyG80"/>
You might need to restart your notebook at this stage to free some memory, or execute the following code:
```python
import torch

# Free the memory used by the previous models and trainer
del model
del pytorch_model
del trainer
torch.cuda.empty_cache()
```
Let's now see how to achieve the same results as in the [Trainer section](#trainer) in native PyTorch. First we need to
define the dataloaders, which we will use to iterate over batches. Before doing that, we just need to apply a bit of
post-processing to our `tokenized_datasets`:
- remove the columns corresponding to values the model does not expect (here the `"text"` column)
- rename the column `"label"` to `"labels"` (because the model expects the argument to be named `labels`)
- set the format of the datasets so they return PyTorch Tensors instead of lists.
Our _tokenized_datasets_ has one method for each of those steps:
```python
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
```
Now that this is done, we can easily define our dataloaders:
```python
from torch.utils.data import DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
```
Next, we define our model:
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
```
We are almost ready to write our training loop; the only two things missing are an optimizer and a learning rate
scheduler. The default optimizer used by the [`Trainer`] is [`AdamW`]:
```python
from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)
```
Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5 here) to 0:
```python
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=num_training_steps
)
```
One last thing: we will want to use the GPU if we have access to one (otherwise training might take several hours
instead of a couple of minutes). To do this, we define a `device` on which we will put our model and our batches.
```python
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
```
We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of
training steps, using the _tqdm_ library.
```python
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
for batch in train_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
progress_bar.update(1)
```
Note that if you are used to freezing the body of your pretrained model (like in computer vision), the above may seem a
bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better
this way for Transformers models (so this is not an oversight on our side). If you're not familiar with what "freezing
the body" of the model means, forget you read this paragraph.
Now to check the results, we need to write the evaluation loop. Like in the [Trainer section](#trainer), we will
use a metric from the 🤗 Datasets library. Here we accumulate the predictions over each batch before computing the final
result when the loop is finished.
```python
metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
with torch.no_grad():
outputs = model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()
```
<a id='additional-resources'></a>
## Additional resources
To look at more fine-tuning examples you can refer to:
- [🤗 Transformers Examples](https://github.com/huggingface/transformers/tree/master/examples) which includes scripts
to train on all common NLP tasks in PyTorch and TensorFlow.
- [🤗 Transformers Notebooks](notebooks) which contains various notebooks and in particular one per task (look for
  the _how to fine-tune a model on xxx_ ones).