Examples reorg (#11350)

* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Sylvain Gugger 2021-04-21 11:11:20 -04:00 committed by GitHub
parent ca7ff64f5b
commit dabeb15292
105 changed files with 1062 additions and 560 deletions

@@ -306,12 +306,12 @@ jobs:
 - v0.4-{{ checksum "setup.py" }}
 - run: pip install --upgrade pip
 - run: pip install .[sklearn,torch,sentencepiece,testing]
-- run: pip install -r examples/_tests_requirements.txt
+- run: pip install -r examples/pytorch/_tests_requirements.txt
 - save_cache:
 key: v0.4-torch_examples-{{ checksum "setup.py" }}
 paths:
 - '~/.cache/pip'
-- run: TRANSFORMERS_IS_CI=1 python -m pytest -n 8 --dist=loadfile -s --make-reports=examples_torch ./examples/ | tee examples_output.txt
+- run: TRANSFORMERS_IS_CI=1 python -m pytest -n 8 --dist=loadfile -s --make-reports=examples_torch ./examples/pytorch/ | tee examples_output.txt
 - store_artifacts:
 path: ~/transformers/examples_output.txt
 - store_artifacts:

@@ -59,7 +59,7 @@ jobs:
 HF_HOME: /mnt/cache
 TRANSFORMERS_IS_CI: yes
 run: |
-pip install -r examples/_tests_requirements.txt
+pip install -r examples/pytorch/_tests_requirements.txt
 python -m pytest -n 1 --dist=loadfile --make-reports=examples_torch_gpu examples
 - name: Failure short reports

@@ -285,7 +285,7 @@ $ python -m pytest -n auto --dist=loadfile -s -v ./tests/
 and for the examples:
 ```bash
-$ pip install -r examples/requirements.txt # only needed the first time
+$ pip install -r examples/xxx/requirements.txt # only needed the first time
 $ python -m pytest -n auto --dist=loadfile -s -v ./examples/
 ```
 In fact, that's how `make test` and `make test-examples` are implemented (sans the `pip install` line)!

@@ -73,7 +73,7 @@ test:
 # Run tests for examples
 test-examples:
-python -m pytest -n auto --dist=loadfile -s -v ./examples/
+python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/
 # Run tests for SageMaker DLC release

@@ -53,7 +53,7 @@ RUN git clone https://github.com/huggingface/transformers.git && \
 git checkout CI && \
 cd .. && \
 pip install ./transformers && \
-pip install -r ./transformers/examples/requirements.txt && \
+pip install -r ./transformers/examples/pytorch/_test_requirements.txt && \
 pip install pytest
 RUN python -c "import torch_xla; print(torch_xla.__version__)"

@@ -27,7 +27,7 @@ local bertBaseCased = base.BaseTest {
 },
 command: utils.scriptCommand(
 |||
-python -m pytest -s transformers/examples/test_xla_examples.py -v
+python -m pytest -s transformers/examples/pytorch/test_xla_examples.py -v
 test_exit_code=$?
 echo "\nFinished running commands.\n"
 test $test_exit_code -eq 0

@@ -65,10 +65,10 @@ respectively.
 .. code-block:: bash
 ## PYTORCH CODE
-python examples/benchmarking/run_benchmark.py --help
+python examples/pytorch/benchmarking/run_benchmark.py --help
 ## TENSORFLOW CODE
-python examples/benchmarking/run_benchmark_tf.py --help
+python examples/tensorflow/benchmarking/run_benchmark_tf.py --help
 An instantiated benchmark object can then simply be run by calling ``benchmark.run()``.

@@ -33,8 +33,8 @@ You can convert any TensorFlow checkpoint for BERT (in particular `the pre-train
 This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
 configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
 from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
-can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , `run_glue.py
-<https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py>`_\ ).
+can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , :prefix_link:`run_glue.py
+<examples/pytorch/text-classification/run_glue.py>` \ ).
 You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
 checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\

@@ -168,13 +168,13 @@ Here is an example of how this can be used on a filesystem that is shared betwee
 On the instance with the normal network run your program which will download and cache models (and optionally datasets if you use 🤗 Datasets). For example:
 ```
-python examples/seq2seq/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
+python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```
 and then with the same filesystem you can now run the same program on a firewalled instance:
 ```
 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
-python examples/seq2seq/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
+python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```
 and it should succeed without any hanging waiting to timeout.

@@ -68,8 +68,8 @@ Additionally, the following method can be used to load values from a data file a
 Example usage
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-An example using these processors is given in the `run_glue.py
-<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
+An example using these processors is given in the :prefix_link:`run_glue.py
+<examples/legacy/text-classification/run_glue.py>` script.
 XNLI
@@ -89,8 +89,8 @@ This library hosts the processor to load the XNLI data:
 Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
-An example using these processors is given in the `run_xnli.py
-<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
+An example using these processors is given in the :prefix_link:`run_xnli.py
+<examples/legacy/text-classification/run_xnli.py>` script.
 SQuAD
@@ -169,4 +169,4 @@ Using `tensorflow_datasets` is as easy as using a data file:
 Another example using these processors is given in the :prefix_link:`run_squad.py
-<examples/question-answering/run_squad.py>` script.
+<examples/legacy/question-answering/run_squad.py>` script.

@@ -338,7 +338,7 @@ For example here is how you could use it for ``run_translation.py`` with 2 GPUs:
 .. code-block:: bash
-python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py \
+python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
 --model_name_or_path t5-small --per_device_train_batch_size 1 \
 --output_dir output_dir --overwrite_output_dir \
 --do_train --max_train_samples 500 --num_train_epochs 1 \
@@ -363,7 +363,7 @@ For example here is how you could use it for ``run_translation.py`` with 2 GPUs:
 .. code-block:: bash
-python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py \
+python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
 --model_name_or_path t5-small --per_device_train_batch_size 1 \
 --output_dir output_dir --overwrite_output_dir \
 --do_train --max_train_samples 500 --num_train_epochs 1 \
@@ -540,7 +540,7 @@ Here is an example of running ``run_translation.py`` under DeepSpeed deploying a
 .. code-block:: bash
-deepspeed examples/seq2seq/run_translation.py \
+deepspeed examples/pytorch/translation/run_translation.py \
 --deepspeed tests/deepspeed/ds_config.json \
 --model_name_or_path t5-small --per_device_train_batch_size 1 \
 --output_dir output_dir --overwrite_output_dir --fp16 \
@@ -565,7 +565,7 @@ To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` comma
 .. code-block:: bash
-deepspeed --num_gpus=1 examples/seq2seq/run_translation.py \
+deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
 --deepspeed tests/deepspeed/ds_config.json \
 --model_name_or_path t5-small --per_device_train_batch_size 1 \
 --output_dir output_dir --overwrite_output_dir --fp16 \
@@ -617,7 +617,7 @@ Notes:
 .. code-block:: bash
-deepspeed --include localhost:1 examples/seq2seq/run_translation.py ...
+deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...
 In this example, we tell DeepSpeed to use GPU 1 (second gpu).
@@ -711,7 +711,7 @@ shell from a cell. For example, to use ``run_translation.py`` you would launch i
 .. code-block::
 !git clone https://github.com/huggingface/transformers
-!cd transformers; deepspeed examples/seq2seq/run_translation.py ...
+!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
 or with ``%%bash`` magic, where you can write a multi-line code for the shell program to run:
@@ -721,7 +721,7 @@ or with ``%%bash`` magic, where you can write a multi-line code for the shell pr
 git clone https://github.com/huggingface/transformers
 cd transformers
-deepspeed examples/seq2seq/run_translation.py ...
+deepspeed examples/pytorch/translation/run_translation.py ...
 In such case you don't need any of the code presented at the beginning of this section.

@@ -43,7 +43,7 @@ Examples
 _______________________________________________________________________________________________________________________
 - Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
-:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
+:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
 - An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
 object can be found in this `forum discussion
 <https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.

@@ -43,7 +43,7 @@ Examples
 _______________________________________________________________________________________________________________________
 - BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
-:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
+:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
 BarthezTokenizer

@@ -44,8 +44,8 @@ Tips:
 - DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
 necessary though, just let us know if you need this option.
-This model was contributed by `victorsanh <https://huggingface.co/victorsanh>`__. The original code can be found `here
-<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
+This model was contributed by `victorsanh <https://huggingface.co/victorsanh>`__. The original code can be found
+:prefix_link:`here <examples/research-projects/distillation>`.
 DistilBertConfig

@@ -53,7 +53,8 @@ Examples
 _______________________________________________________________________________________________________________________
 - :prefix_link:`Script <examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh>` to fine-tune pegasus
-on the XSUM dataset. Data download instructions at :prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
+on the XSUM dataset. Data download instructions at :prefix_link:`examples/pytorch/summarization/
+<examples/pytorch/summarization/README.md>`.
 - FP16 is not supported (help/ideas on this appreciated!).
 - The adafactor optimizer is recommended for pegasus fine-tuning.

@@ -21,7 +21,7 @@ Question Answering <https://yjernite.github.io/lfqa.html>`__. RetriBERT is a sma
 pair of BERT encoders with lower-dimension projection for dense semantic indexing of text.
 This model was contributed by `yjernite <https://huggingface.co/yjernite>`__. Code to train and use the model can be
-found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
+found :prefix_link:`here <examples/research-projects/distillation>`.
 RetriBertConfig

@@ -41,7 +41,7 @@ Tips:
 using only a sub-set of the output tokens as target which are selected with the :obj:`target_mapping` input.
 - To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the :obj:`perm_mask` and
 :obj:`target_mapping` inputs to control the attention span and outputs (see examples in
-`examples/text-generation/run_generation.py`)
+`examples/pytorch/text-generation/run_generation.py`)
 - XLNet is one of the few models that has no sequence length limit.
 This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here

@@ -682,7 +682,8 @@ The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-e
 romanian translation.
 The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other
-translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
+translation and summarization tasks, using code in ```examples/pytorch/translation/``` , but is not very useful without
+finetuning.
 ProphetNet

@@ -90,8 +90,8 @@ You can then feed it all as input to your model:
 >>> outputs = model(input_ids, langs=langs)
-The example :prefix_link:`run_generation.py <examples/text-generation/run_generation.py>` can generate text using the
-CLM checkpoints from XLM, using the language embeddings.
+The example :prefix_link:`run_generation.py <examples/pytorch/text-generation/run_generation.py>` can generate text
+using the CLM checkpoints from XLM, using the language embeddings.
 XLM without Language Embeddings
 -----------------------------------------------------------------------------------------------------------------------

@@ -325,7 +325,7 @@ When you create a `HuggingFace` Estimator, you can specify a [training script th
 If you are using `git_config` to run the [🤗 Transformers examples scripts](https://github.com/huggingface/transformers/tree/master/examples) keep in mind that you need to configure the right `'branch'` for you `transformers_version`, e.g. if you use `transformers_version='4.4.2` you have to use `'branch':'v4.4.2'`.
-As an example to use `git_config` with an [example script from the transformers repository](https://github.com/huggingface/transformers/tree/master/examples/text-classification).
+As an example to use `git_config` with an [example script from the transformers repository](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification).
 _Tip: define `output_dir` as `/opt/ml/model` in the hyperparameter for the script to save your model to S3 after training._
@@ -338,7 +338,7 @@ git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch'
 # create the Estimator
 huggingface_estimator = HuggingFace(
 entry_point='run_glue.py',
-source_dir='./examples/text-classification',
+source_dir='./examples/pytorch/text-classification',
 git_config=git_config,
 instance_type='ml.p3.2xlarge',
 instance_count=1,

@@ -55,10 +55,10 @@ Sequence Classification
 Sequence classification is the task of classifying sequences according to a given number of classes. An example of
 sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
 model on a GLUE sequence classification task, you may leverage the :prefix_link:`run_glue.py
-<examples/text-classification/run_glue.py>`, :prefix_link:`run_tf_glue.py
-<examples/text-classification/run_tf_glue.py>`, :prefix_link:`run_tf_text_classification.py
-<examples/text-classification/run_tf_text_classification.py>` or :prefix_link:`run_xnli.py
-<examples/text-classification/run_xnli.py>` scripts.
+<examples/pytorch/text-classification/run_glue.py>`, :prefix_link:`run_tf_glue.py
+<examples/tensorflow/text-classification/run_tf_glue.py>`, :prefix_link:`run_tf_text_classification.py
+<examples/tensorflow/text-classification/run_tf_text_classification.py>` or :prefix_link:`run_xnli.py
+<examples/pytorch/text-classification/run_xnli.py>` scripts.
 Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
 leverages a fine-tuned model on sst2, which is a GLUE task.
@@ -168,8 +168,10 @@ Extractive Question Answering
 Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
 question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
 model on a SQuAD task, you may leverage the `run_qa.py
-<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_qa.py>`__ and `run_tf_squad.py
-<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_tf_squad.py>`__ scripts.
+<https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering/run_qa.py>`__ and
+`run_tf_squad.py
+<https://github.com/huggingface/transformers/tree/master/examples/tensorflow/question-answering/run_tf_squad.py>`__
+scripts.
 Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
@@ -184,7 +186,7 @@ leverages a fine-tuned model on SQuAD.
 >>> context = r"""
 ... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
 ... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
-... a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
+... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
 ... """
 This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the
@@ -325,8 +327,7 @@ fill that mask with an appropriate token. This allows the model to attend to bot
 right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
 downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
 <https://arxiv.org/abs/1910.13461>`__, part 4.2). If you would like to fine-tune a model on a masked language modeling
-task, you may leverage the `run_mlm.py
-<https://github.com/huggingface/transformers/tree/master/examples/language-modeling/run_mlm.py>`__ script.
+task, you may leverage the :prefix_link:`run_mlm.py <examples/pytorch/language-modeling/run_mlm.py>` script.
 Here is an example of using pipelines to replace a mask from a sequence:
@@ -435,7 +436,7 @@ Causal Language Modeling
 Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
 model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
 for generation tasks. If you would like to fine-tune a model on a causal language modeling task, you may leverage the
-`run_clm.py <https://github.com/huggingface/transformers/tree/master/examples/language-modeling/run_clm.py>`__ script.
+:prefix_link:`run_clm.py <examples/pytorch/language-modeling/run_clm.py>` script.
 Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
 input sequence.
@@ -602,8 +603,7 @@ Named Entity Recognition
 Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
 as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
 which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
-`run_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_ner.py>`__
-script.
+:prefix_link:`run_ner.py <examples/pytorch/token-classification/run_ner.py>` script.
 Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
 belonging to one of 9 classes:
@@ -743,11 +743,12 @@ Summarization
 Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a
 model on a summarization task, you may leverage the `run_summarization.py
-<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_summarization.py>`__ script.
+<https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/run_summarization.py>`__
+script.
 An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
 created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
-approaches are described in this :prefix_link:`document <examples/seq2seq/README.md>`.
+approaches are described in this :prefix_link:`document <examples/pytorch/summarization/README.md>`.
 Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
 / Daily Mail data set.
@@ -794,7 +795,7 @@ Here is an example of doing summarization using a model and a tokenizer. The pro
 3. Add the T5 specific prefix "summarize: ".
 4. Use the ``PreTrainedModel.generate()`` method to generate the summary.
-In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
+In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
 CNN / Daily Mail), it yields very good results.
 .. code-block::
@@ -823,11 +824,12 @@ Translation
 Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a
 translation task, you may leverage the `run_translation.py
-<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_translation.py>`__ script.
+<https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation/run_translation.py>`__ script.
 An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
 data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
-translation task, various approaches are described in this :prefix_link:`document <examples/seq2seq/README.md>`.
+translation task, various approaches are described in this :prefix_link:`document
+<examples/pytorch/translation/README.md>`.
 Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
 multi-task mixture dataset (including WMT), yet, yielding impressive translation results.

@@ -15,9 +15,9 @@ limitations under the License.
 # Examples
-This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects) or to the [legacy](https://github.com/huggingface/transformers/tree/master/examples/legacy) subfolder.
+This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to be in this folder, it may have moved to the corresponding framework subfolder (pytorch, tensorflow or flax), our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects) or to the [legacy](https://github.com/huggingface/transformers/tree/master/examples/legacy) subfolder.
-While we strive to present as many use cases as possible, the scripts in this folder are just examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, all the PyTorch versions of the examples fully expose the preprocessing of the data. This way, you can easily tweak them.
+While we strive to present as many use cases as possible, the scripts in this folder are just examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data. This way, you can easily tweak them.
 This is similar if you want the scripts to report another metric than the one they currently use: look at the `compute_metrics` function inside the script. It takes the full arrays of predictions and labels and has to return a dictionary of string keys and float values. Just change it to add (or replace) your own metric to the ones already reported.
@@ -42,7 +42,8 @@ To browse the examples corresponding to released versions of 🤗 Transformers,
 <details>
 <summary>Examples for older versions of 🤗 Transformers</summary>
+- [v4.5.1](https://github.com/huggingface/transformers/tree/v4.5.1/examples)
 - [v4.4.2](https://github.com/huggingface/transformers/tree/v4.4.2/examples)
 - [v4.3.3](https://github.com/huggingface/transformers/tree/v4.3.3/examples)
 - [v4.2.2](https://github.com/huggingface/transformers/tree/v4.2.2/examples)
 - [v4.1.1](https://github.com/huggingface/transformers/tree/v4.1.1/examples)
@@ -75,193 +76,3 @@ Alternatively, you can find switch your cloned 🤗 Transformers to a specific v
 git checkout tags/v3.5.1
 ```
 and run the example command as usual afterward.
## The Big Table of Tasks
Here is the list of all our examples:
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
just lack some features),
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
- links to **Colab notebooks** to walk through the scripts and run them easily,
<!--
Coming soon!
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
-->
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | WikiText-2 | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/multiple_choice.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | XSum | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/summarization.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/translation.ipynb)
## Running quick tests
Most examples are equipped with a mechanism to truncate the number of dataset samples to the desired length. This is useful for debugging purposes, for example to quickly check that all stages of the programs can complete, before running the same setup on the full dataset which may take hours to complete.
For example here is how to truncate all three splits to just 50 samples each:
```
examples/token-classification/run_ner.py \
--max_train_samples 50 \
--max_val_samples 50 \
--max_test_samples 50 \
[...]
```
Most example scripts should have the first two command line arguments and some have the third one. You can quickly check if a given example supports any of these by passing a `-h` option, e.g.:
```
examples/token-classification/run_ner.py -h
```
## Resuming training
You can resume training from a previous checkpoint like this:
1. Pass `--output_dir previous_output_dir` without `--overwrite_output_dir` to resume training from the latest checkpoint in `output_dir` (what you would use if the training was interrupted, for instance).
2. Pass `--model_name_or_path path_to_a_specific_checkpoint` to resume training from that checkpoint folder.
Should you want to turn an example into a notebook where you'd no longer have access to the command
line, 🤗 Trainer supports resuming from a checkpoint via `trainer.train(resume_from_checkpoint)`.
1. If `resume_from_checkpoint` is `True` it will look for the last checkpoint in the value of `output_dir` passed via `TrainingArguments`.
2. If `resume_from_checkpoint` is a path to a specific checkpoint it will use that saved checkpoint folder to resume the training from.
## Distributed training and mixed precision
All the PyTorch scripts mentioned above work out of the box with distributed training and mixed precision, thanks to
the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html). To launch one of them on _n_ GPUS,
use the following command:
```bash
python -m torch.distributed.launch \
--nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 GPUs:
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
If you have a GPU with mixed precision capabilities (architecture Pascal or more recent), you can use mixed precision
training with PyTorch 1.6.0 or latest, or by installing the [Apex](https://github.com/NVIDIA/apex) library for previous
versions. Just add the flag `--fp16` to your command launching one of the scripts mentioned above!
Using mixed precision training usually results in 2x-speedup for training with the same final results (as shown in
[this table](https://github.com/huggingface/transformers/tree/master/examples/text-classification#mixed-precision-training)
for text classification).
## Running on TPUs
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named
[xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our
example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, then your
regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for
`torch.distributed`):
```bash
python xla_spawn.py --num_cores num_tpu_you_have \
path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 TPUs:
```bash
python xla_spawn.py --num_cores 8 \
text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
## Logging & Experiment tracking
You can easily log and monitor your runs code. The following are currently supported:
* [TensorBoard](https://www.tensorflow.org/tensorboard)
* [Weights & Biases](https://docs.wandb.ai/integrations/huggingface)
* [Comet ML](https://www.comet.ml/docs/python-sdk/huggingface/)
### Weights & Biases
To use Weights & Biases, install the wandb package with:
```bash
pip install wandb
```
Then log in the command line:
```bash
wandb login
```
If you are in Jupyter or Colab, you should login with:
```python
import wandb
wandb.login()
```
To enable logging to W&B, include `"wandb"` in the `report_to` of your `TrainingArguments` or script. Or just pass along `--report_to all` if you have `wandb` installed.
Whenever you use `Trainer` or `TFTrainer` classes, your losses, evaluation metrics, model topology and gradients (for `Trainer` only) will automatically be logged.
Advanced configuration is possible by setting environment variables:
| Environment Variable | Value |
|---|---|
| WANDB_LOG_MODEL | Log the model as artifact (log the model as artifact at the end of training (`false` by default) |
| WANDB_WATCH | one of `gradients` (default) to log histograms of gradients, `all` to log histograms of both gradients and parameters, or `false` for no histogram logging |
| WANDB_PROJECT | Organize runs by project |
Set run names with `run_name` argument present in scripts or as part of `TrainingArguments`.
Additional configuration options are available through generic [wandb environment variables](https://docs.wandb.com/library/environment-variables).
Refer to related [documentation & examples](https://docs.wandb.ai/integrations/huggingface).
### Comet.ml
To use `comet_ml`, install the Python package with:
```bash
pip install comet_ml
```
or if in a Conda environment:
```bash
conda install -c comet_ml -c anaconda -c conda-forge comet_ml
```

examples/pytorch/README.md (new file)

@@ -0,0 +1,237 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Examples
This folder contains actively maintained examples of use of 🤗 Transformers using the PyTorch backend, organized along NLP tasks.
## The Big Table of Tasks
Here is the list of all our examples:
- with information on whether they are **built on top of `Trainer`** (if not, they still work, they might
just lack some features),
- whether or not they have a version using the [🤗 Accelerate](https://github.com/huggingface/accelerate) library.
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
- links to **Colab notebooks** to walk through the scripts and run them easily,
<!--
Coming soon!
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
-->
| Task | Example datasets | Trainer support | 🤗 Accelerate | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling) | WikiText-2 | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/multiple-choice) | SWAG | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/multiple_choice.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization) | XSum | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/summarization.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification) | GLUE | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-generation) | - | n/a | - | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification) | CoNLL NER | ✅ |✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation) | WMT | ✅ | ✅ |✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/translation.ipynb)
## Running quick tests
Most examples are equipped with a mechanism to truncate the number of dataset samples to the desired length. This is useful for debugging purposes, for example to quickly check that all stages of the programs can complete, before running the same setup on the full dataset which may take hours to complete.
For example here is how to truncate all three splits to just 50 samples each:
```
examples/pytorch/token-classification/run_ner.py \
--max_train_samples 50 \
--max_val_samples 50 \
--max_test_samples 50 \
[...]
```
Most example scripts should have the first two command line arguments and some have the third one. You can quickly check if a given example supports any of these by passing a `-h` option, e.g.:
```
examples/pytorch/token-classification/run_ner.py -h
```
## Resuming training
You can resume training from a previous checkpoint like this:
1. Pass `--output_dir previous_output_dir` without `--overwrite_output_dir` to resume training from the latest checkpoint in `output_dir` (what you would use if the training was interrupted, for instance).
2. Pass `--model_name_or_path path_to_a_specific_checkpoint` to resume training from that checkpoint folder.
Should you want to turn an example into a notebook where you'd no longer have access to the command
line, 🤗 Trainer supports resuming from a checkpoint via `trainer.train(resume_from_checkpoint)`.
1. If `resume_from_checkpoint` is `True` it will look for the last checkpoint in the value of `output_dir` passed via `TrainingArguments`.
2. If `resume_from_checkpoint` is a path to a specific checkpoint it will use that saved checkpoint folder to resume the training from.
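To make the command-line route above concrete, here is a minimal sketch, assuming an earlier run of the text classification example was interrupted while saving checkpoints to `/tmp/mnli_output/` (paths and hyper-parameters are only illustrative):

```bash
# Re-run the same command that was interrupted, keeping --output_dir pointed at
# the folder that already contains checkpoints and *not* passing
# --overwrite_output_dir, so training resumes from the latest checkpoint.
python text-classification/run_glue.py \
  --model_name_or_path bert-large-uncased-whole-word-masking \
  --task_name mnli \
  --do_train \
  --max_seq_length 128 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mnli_output/
```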
## Distributed training and mixed precision
All the PyTorch scripts mentioned above work out of the box with distributed training and mixed precision, thanks to
the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html). To launch one of them on _n_ GPUS,
use the following command:
```bash
python -m torch.distributed.launch \
--nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 GPUs:
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 pytorch/text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
If you have a GPU with mixed precision capabilities (architecture Pascal or more recent), you can use mixed precision
training with PyTorch 1.6.0 or latest, or by installing the [Apex](https://github.com/NVIDIA/apex) library for previous
versions. Just add the flag `--fp16` to your command launching one of the scripts mentioned above!
Using mixed precision training usually results in 2x-speedup for training with the same final results (as shown in
[this table](https://github.com/huggingface/transformers/tree/master/examples/text-classification#mixed-precision-training)
for text classification).
## Running on TPUs
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named
[xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our
example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, then your
regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for
`torch.distributed`):
```bash
python xla_spawn.py --num_cores num_tpu_you_have \
path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 TPUs (from this folder):
```bash
python xla_spawn.py --num_cores 8 \
text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
## Using Accelerate
Most PyTorch example scripts have a version using the [🤗 Accelerate](https://github.com/huggingface/accelerate) library
that exposes the training loop so it's easy for you to customize or tweak them to your needs. They all require you to
install `accelerate` with
```bash
pip install accelerate
```
Then you can easily launch any of the scripts by running
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
accelerate launch path_to_script.py --args_to_script
```
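As a hypothetical end-to-end illustration (the script name and arguments below are taken as an assumption from the text classification folder, not prescribed by Accelerate itself), launching the no-Trainer GLUE example could look like:

```bash
# Sketch: after `accelerate config` has been answered once on this machine,
# launch the Accelerate version of the GLUE script on MRPC.
accelerate launch text-classification/run_glue_no_trainer.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --max_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/mrpc_output/
```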
## Logging & Experiment tracking
You can easily log and monitor your runs code. The following are currently supported:
* [TensorBoard](https://www.tensorflow.org/tensorboard)
* [Weights & Biases](https://docs.wandb.ai/integrations/huggingface)
* [Comet ML](https://www.comet.ml/docs/python-sdk/huggingface/)
### Weights & Biases
To use Weights & Biases, install the wandb package with:
```bash
pip install wandb
```
Then log in from the command line:
```bash
wandb login
```
If you are in Jupyter or Colab, you should log in with:
```python
import wandb
wandb.login()
```
To enable logging to W&B, include `"wandb"` in the `report_to` argument of your `TrainingArguments` or script, or just pass along `--report_to all` if you have `wandb` installed.
Whenever you use the `Trainer` or `TFTrainer` classes, your losses, evaluation metrics, model topology and gradients (for `Trainer` only) will automatically be logged.
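For instance, here is a minimal sketch of enabling W&B (and TensorBoard) reporting when building `TrainingArguments` yourself; the output directory and run name below are placeholders:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/mnli_output",  # placeholder output directory
    report_to=["wandb", "tensorboard"],  # integrations to report logs to
    run_name="bert-mnli-baseline",  # shows up as the run name in W&B
    logging_steps=50,  # how often training metrics are logged
)
```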
Advanced configuration is possible by setting environment variables:
| Environment Variable | Value |
|---|---|
| WANDB_LOG_MODEL | Log the model as an artifact at the end of training (`false` by default) |
| WANDB_WATCH | one of `gradients` (default) to log histograms of gradients, `all` to log histograms of both gradients and parameters, or `false` for no histogram logging |
| WANDB_PROJECT | Organize runs by project |
Set run names with the `run_name` argument, present in the scripts and as part of `TrainingArguments`.
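Putting it together, here is a sketch of one way to combine these environment variables with a script launch; the project and run names are illustrative:
```bash
export WANDB_PROJECT=mnli-experiments
export WANDB_LOG_MODEL=true
export WANDB_WATCH=all
python pytorch/text-classification/run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
--run_name bert-base-mnli \
--output_dir /tmp/mnli_output/
```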
Additional configuration options are available through generic [wandb environment variables](https://docs.wandb.com/library/environment-variables).
Refer to related [documentation & examples](https://docs.wandb.ai/integrations/huggingface).
### Comet.ml
To use `comet_ml`, install the Python package with:
```bash
pip install comet_ml
```
or if in a Conda environment:
```bash
conda install -c comet_ml -c anaconda -c conda-forge comet_ml
```
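Logging to Comet is then picked up automatically by the `Trainer` during training; here is a minimal sketch of one way to set it up before launching a script, assuming you already have a Comet account (the API key and project name are placeholders, and the environment variables are the ones documented by Comet):
```bash
export COMET_API_KEY=your-api-key
export COMET_PROJECT_NAME=transformers-experiments
python pytorch/text-classification/run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mrpc \
--do_train \
--output_dir /tmp/mrpc_output/
```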

View File

@ -0,0 +1 @@
torch >= 1.3

View File

@ -16,9 +16,7 @@ limitations under the License.
# Multiple Choice
## Fine-tuning on SWAG with the Trainer
`run_swag` allows you to fine-tune any model from our [hub](https://huggingface.co/models) (as long as its architecture has a `ForMultipleChoice` version in the library) on the SWAG dataset or your own csv/jsonlines files as long as they are structured the same way. To make it work on another dataset, you will need to tweak the `preprocess_function` inside the script.
@ -41,9 +39,9 @@ eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
## With Accelerate
Based on the script [run_swag_no_trainer.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/multiple-choice/run_swag_no_trainer.py).
Like `run_swag.py`, this script allows you to fine-tune any of the models on the [hub](https://huggingface.co/models) (as long as its architecture has a `ForMultipleChoice` version in the library) on
the SWAG dataset or your own data in a csv or a JSON file. The main difference is that this
@ -108,24 +106,3 @@ This command is the same and will work for:
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
## Tensorflow
```bash
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/multiple-choice/run_tf_multiple_choice.py \
--task_name swag \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_device_train_batch_size=16 \
--logging-dir logs \
--gradient_accumulation_steps 2 \
--overwrite_output
```

View File

@ -1,2 +1,3 @@
sentencepiece != 0.1.92
protobuf
torch >= 1.3

View File

@ -14,9 +14,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# SQuAD
Based on the script [`run_qa.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_qa.py).
**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
@ -29,7 +29,9 @@ The old version of this script can be found [here](https://github.com/huggingfac
Note that if your dataset contains samples with no possible answers (like SQUAD version 2), you need to pass along the flag `--version_2_with_negative`.
## Trainer-based scripts
### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single tesla V100 16GB.
@ -57,7 +59,6 @@ exact_match = 81.22
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
@ -128,6 +129,71 @@ python run_qa_beam_search.py \
--save_steps 5000
```
## With Accelerate
Based on the scripts `run_qa_no_trainer.py` and `run_qa_beam_search_no_trainer.py`.
Like `run_qa.py` and `run_qa_beam_search.py`, these scripts allow you to fine-tune any of the supported models on
SQuAD or a similar dataset. The main difference is that they
expose the bare training loop, to allow you to quickly experiment and add any customization you would like.
They offer fewer options than the `Trainer` versions (but you can easily change the options for the optimizer
or the dataloaders directly in the script) and still run in a distributed setup or on TPU and support mixed precision through
the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the scripts normally
after installing it:
```bash
pip install accelerate
```
then
```bash
python run_qa_no_trainer.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ~/tmp/debug_squad
```
You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
accelerate launch run_qa_no_trainer.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ~/tmp/debug_squad
```
This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
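Whichever variant you use, a quick way to sanity-check the resulting model is the `question-answering` pipeline; here is a minimal sketch, assuming the output directory used in the commands above:
```python
import os

from transformers import pipeline

# Directory passed as --output_dir above (illustrative).
model_dir = os.path.expanduser("~/tmp/debug_squad")

qa = pipeline("question-answering", model=model_dir, tokenizer=model_dir)
result = qa(
    question="Which dataset was the model fine-tuned on?",
    context="The model in this example was fine-tuned on the SQuAD dataset.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```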
## Results
Larger batch size may improve the performance while costing more memory.
##### Results for SQuAD1.0 with the previously defined hyper-parameters:
@ -223,22 +289,3 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer
```
Training with the above command leads to the f1 score of 93.52, which is slightly better than the f1 score of 93.15 for
`bert-large-uncased-whole-word-masking`.
## SQuAD with the Tensorflow Trainer
```bash
python run_tf_squad.py \
--model_name_or_path bert-base-uncased \
--output_dir model \
--max_seq_length 384 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 16 \
--do_train \
--logging_dir logs \
--logging_steps 10 \
--learning_rate 3e-5 \
--doc_stride 128
```
For the moment evaluation is not available in the Tensorflow Trainer only the training.

View File

@ -1 +1,2 @@
datasets >= 1.4.0
torch >= 1.3.0

View File

@ -14,9 +14,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
## Summarization
This directory contains examples for finetuning and evaluating transformers on summarization tasks.
Please tag @patil-suraj with any issues/unexpected behaviors, or send a PR!
For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.com/huggingface/transformers/blob/master/examples/research_projects/bertabs/README.md).
For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2seq`](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq).
@ -30,16 +30,16 @@ For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2s
- `PegasusForConditionalGeneration`
- `T5ForConditionalGeneration`
`run_summarization.py` is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
and you also will find examples of these below.
## With Trainer
Here is an example on a summarization task:
```bash
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
@ -63,7 +63,7 @@ And here is how you would use it on your own files, after adjusting the values f
`--train_file`, `--validation_file`, `--text_column` and `--summary_column` to match your setup:
```bash
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
@ -134,115 +134,64 @@ And as with the CSV files, you can specify which values to select from the file,
--summary_column summary \
```
## With Accelerate
Based on the script [`run_summarization_no_trainer.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/summarization/run_summarization_no_trainer.py).
Like `run_summarization.py`, this script allows you to fine-tune any of the supported models on a
summarization task. The main difference is that it
exposes the bare training loop, to allow you to quickly experiment and add any customization you would like.
It offers fewer options than the `Trainer` version (but you can easily change the options for the optimizer
or the dataloaders directly in the script) and still runs in a distributed setup or on TPU and supports mixed precision through
the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally
after installing it:
```bash
pip install accelerate
```
then
```bash
python run_summarization_no_trainer.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir ~/tmp/tst-summarization
```
You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
accelerate launch run_summarization_no_trainer.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir ~/tmp/tst-summarization
```
This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.

View File

@ -0,0 +1,7 @@
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
rouge-score
nltk
py7zr
torch >= 1.3

View File

@ -36,7 +36,8 @@ SRC_DIRS = [
"language-modeling", "language-modeling",
"multiple-choice", "multiple-choice",
"question-answering", "question-answering",
"seq2seq", "summarization",
"translation",
] ]
] ]
sys.path.extend(SRC_DIRS) sys.path.extend(SRC_DIRS)

View File

@ -16,7 +16,7 @@ limitations under the License.
# Text classification examples
## GLUE tasks
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
@ -129,7 +129,7 @@ and reply to the questions asked. Then
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
export TASK_NAME=mrpc
@ -152,84 +152,3 @@ This command is the same and will work for:
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
## TensorFlow 2.0 version
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
## Run generic text classification script in TensorFlow
The script [run_tf_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_text_classification.py) allows users to run a text classification on their own CSV files. For now there are few restrictions, the CSV files must have a header corresponding to the column names and not more than three columns: one column for the id, one column for the text and another column for a second piece of text in case of an entailment classification for example.
To use the script, one as to run the following command line:
```bash
python run_tf_text_classification.py \
--train_file train.csv \ ### training dataset file location (mandatory if running with --do_train option)
--dev_file dev.csv \ ### development dataset file location (mandatory if running with --do_eval option)
--test_file test.csv \ ### test dataset file location (mandatory if running with --do_predict option)
--label_column_id 0 \ ### which column corresponds to the labels
--model_name_or_path bert-base-multilingual-uncased \
--output_dir model \
--num_train_epochs 4 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--do_train \
--do_eval \
--do_predict \
--logging_steps 10 \
--evaluation_strategy steps \
--save_steps 10 \
--overwrite_output_dir \
--max_seq_length 128
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins on a single tesla V100 16GB.
```bash
python run_xnli.py \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```

View File

@ -2,3 +2,4 @@ accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
torch >= 1.3

View File

@ -1,2 +1,3 @@
sentencepiece != 0.1.92
protobuf
torch >= 1.3

View File

@ -61,7 +61,7 @@ You can find the old version of the PyTorch script [here](https://github.com/hug
## Pytorch version, no Trainer
Based on the script [run_ner_no_trainer.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner_no_trainer.py).
Like `run_ner.py`, this script allows you to fine-tune any of the models on the [hub](https://huggingface.co/models) on a
token classification task, either NER, POS or CHUNKS tasks or your own data in a csv or a JSON file. The main difference is that this
@ -126,66 +126,3 @@ This command is the same and will work for:
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.
### TensorFlow version
The following examples are covered in this section:
* NER on the GermEval 2014 (German NER) dataset
* Emerging and Rare Entities task: WNUT17 (English NER) dataset
Details and results for the fine-tuning provided by @stefan-it.
### GermEval 2014 (German NER) dataset
#### Data (Download and pre-processing steps)
Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
Here are the commands for downloading and pre-processing train, dev and test datasets. The original data format has four (tab-separated) columns, in a pre-processing step only the two relevant columns (token and outer span NER annotation) are extracted:
```bash
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`.
One problem with these tokens is, that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s.
The `preprocess.py` script located in the `scripts` folder a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
Let's define some variables that we need for further pre-processing steps and training the model:
```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```
Run the pre-processing script on training, dev and test datasets:
```bash
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```
The GermEval 2014 dataset has much more labels than CoNLL-2002/2003 datasets, so an own set of labels must be used:
```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```
#### Prepare the run
Additional environment variables must be set:
```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```

View File

@ -1,2 +1,3 @@
seqeval
datasets >= 1.1.3
torch >= 1.3

View File

@ -0,0 +1,212 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Translation
This directory contains examples for finetuning and evaluating transformers on translation tasks.
Please tag @patil-suraj with any issues/unexpected behaviors, or send a PR!
For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.com/huggingface/transformers/blob/master/examples/research_projects/bertabs/README.md).
For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2seq`](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq).
### Supported Architectures
- `BartForConditionalGeneration`
- `FSMTForConditionalGeneration` (translation only)
- `MBartForConditionalGeneration`
- `MarianMTModel`
- `PegasusForConditionalGeneration`
- `T5ForConditionalGeneration`
`run_translation.py` is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
and you also will find examples of these below.
## With Trainer
Here is an example of a translation fine-tuning with a MarianMT model:
```bash
python examples/pytorch/seq2seq/run_translation.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
MBart and some T5 models require special handling.
T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:
```bash
python examples/pytorch/seq2seq/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
If you get a terrible BLEU score, make sure that you didn't forget to use the `--source_prefix` argument.
For the aforementioned group of T5 models it's important to remember that if you switch to a different language pair, make sure to adjust the source and target values in all 3 language-specific command line argument: `--source_lang`, `--target_lang` and `--source_prefix`.
MBart models require a different format for `--source_lang` and `--target_lang` values, e.g. instead of `en` it expects `en_XX`, for `ro` it expects `ro_RO`. The full MBart specification for language codes can be found [here](https://huggingface.co/facebook/mbart-large-cc25). For example:
```bash
python examples/pytorch/seq2seq/run_translation.py \
--model_name_or_path facebook/mbart-large-en-ro \
--do_train \
--do_eval \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--source_lang en_XX \
--target_lang ro_RO \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
And here is how you would use the translation finetuning on your own files, after adjusting the
values for the arguments `--train_file`, `--validation_file` to match your setup:
```bash
python examples/pytorch/seq2seq/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang ro \
--source_prefix "translate English to Romanian: " \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--train_file path_to_jsonlines_file \
--validation_file path_to_jsonlines_file \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
The task of translation supports only custom JSONLINES files, with each line being a dictionary with a key `"translation"` and its value another dictionary whose keys is the language pair. For example:
```json
{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă." } }
{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alții așteaptă implozia." } }
```
Here the languages are Romanian (`ro`) and English (`en`).
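If you need to produce such a file from existing parallel data, here is a small sketch using only the standard library; the file name and sentence pairs are placeholders:
```python
import json

# Illustrative English-Romanian pairs; replace with your own parallel data.
pairs = [
    ("Others have dismissed him as a joke.", "Alții l-au numit o glumă."),
    ("And some are holding out for an implosion.", "Iar alții așteaptă implozia."),
]

with open("train.json", "w", encoding="utf-8") as f:
    for en, ro in pairs:
        # One JSON object per line, under a single "translation" key.
        f.write(json.dumps({"translation": {"en": en, "ro": ro}}, ensure_ascii=False) + "\n")
```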
If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as following:
```bash
python examples/pytorch/seq2seq/run_translation.py \
--model_name_or_path t5-small \
--do_train \
--do_eval \
--source_lang en \
--target_lang de \
--source_prefix "translate English to German: " \
--dataset_name stas/wmt14-en-de-pre-processed \
--output_dir /tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
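Once fine-tuning has finished, a quick way to sanity-check the checkpoint is to load it back and translate a sentence. Here is a minimal sketch, assuming the `--output_dir` used above and a T5-style model (the task prefix only applies to the T5 checkpoints mentioned earlier):
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Directory passed as --output_dir in the fine-tuning commands above (illustrative).
checkpoint = "/tmp/tst-translation"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# T5 checkpoints expect the same task prefix that was used during fine-tuning.
inputs = tokenizer("translate English to Romanian: The weather is nice today.", return_tensors="pt")
generated = model.generate(**inputs, max_length=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```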
## With Accelerate
Based on the script [`run_translation_no_trainer.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation_no_trainer.py).
Like `run_translation.py`, this script allows you to fine-tune any of the supported models on a
translation task. The main difference is that it
exposes the bare training loop, to allow you to quickly experiment and add any customization you would like.
It offers fewer options than the `Trainer` version (but you can easily change the options for the optimizer
or the dataloaders directly in the script) and still runs in a distributed setup or on TPU and supports mixed precision through
the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally
after installing it:
```bash
pip install accelerate
```
then
```bash
python run_translation_no_trainer.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir ~/tmp/tst-translation
```
You can then use your usual launchers to run it in a distributed environment, but the easiest way is to run
```bash
accelerate config
```
and reply to the questions asked. Then
```bash
accelerate test
```
that will check everything is ready for training. Finally, you can launch training with
```bash
accelerate launch run_translation_no_trainer.py \
--model_name_or_path Helsinki-NLP/opus-mt-en-ro \
--source_lang en \
--target_lang ro \
--dataset_name wmt16 \
--dataset_config_name ro-en \
--output_dir ~/tmp/tst-translation
```
This command is the same and will work for:
- a CPU-only setup
- a setup with one GPU
- a distributed training with several GPUs (single or multi node)
- a training on TPUs
Note that this library is in alpha release so your feedback is more than welcome if you encounter any problem using it.

View File

@ -2,6 +2,5 @@ datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
sacrebleu >= 1.4.12
rouge-score
nltk
py7zr
torch >= 1.3

View File

@ -0,0 +1,42 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Examples
This folder contains actively maintained examples of use of 🤗 Transformers using the TensorFlow backend, organized along NLP tasks. It is under construction so we thank you for your patience!
## The Big Table of Tasks
Here is the list of all our examples:
- with information on whether they are **built on top of `Keras`** (if not, they still work, they might
just lack some features),
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library,
- links to **Colab notebooks** to walk through the scripts and run them easily.
<!--
Coming soon!
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
-->
| Task | Example datasets | Keras support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|
| **`language-modeling`** | WikiText-2 | - | - | -
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/multiple-choice) | SWAG | - | - | -
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/question-answering) | SQuAD | - | - | -
| **`summarization`** | XSum | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification) | GLUE | - | - | -
| **`text-generation`** | n/a | - | n/a | -
| **`token-classification`** | CoNLL NER | - | - | -
| **`translation`** | WMT | - | - | -

View File

@ -0,0 +1,26 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# 🤗 Benchmark results
Here, you can find a list of the different benchmark results created by the community.
If you would like to list benchmark results on your favorite models of the [model hub](https://huggingface.co/models) here, please open a Pull Request and add it below.
| Benchmark description | Results | Environment info | Author |
|:----------|:-------------|:-------------|------:|
| PyTorch Benchmark on inference for `bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Patrick von Platen](https://github.com/patrickvonplaten) |
| PyTorch Benchmark on inference for `bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Patrick von Platen](https://github.com/patrickvonplaten) |

View File

@ -0,0 +1,178 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import ScalarFormatter
from transformers import HfArgumentParser
def list_field(default=None, metadata=None):
return field(default_factory=lambda: default, metadata=metadata)
@dataclass
class PlotArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
"""
csv_file: str = field(
metadata={"help": "The csv file to plot."},
)
plot_along_batch: bool = field(
default=False,
metadata={"help": "Whether to plot along batch size or sequence length. Defaults to sequence length."},
)
is_time: bool = field(
default=False,
metadata={"help": "Whether the csv file has time results or memory results. Defaults to memory results."},
)
no_log_scale: bool = field(
default=False,
metadata={"help": "Disable logarithmic scale when plotting"},
)
is_train: bool = field(
default=False,
metadata={
"help": "Whether the csv file has training results or inference results. Defaults to inference results."
},
)
figure_png_file: Optional[str] = field(
default=None,
metadata={"help": "Filename under which the plot will be saved. If unused no plot is saved."},
)
short_model_names: Optional[List[str]] = list_field(
default=None, metadata={"help": "List of model names that are used instead of the ones in the csv file."}
)
def can_convert_to_int(string):
try:
int(string)
return True
except ValueError:
return False
def can_convert_to_float(string):
try:
float(string)
return True
except ValueError:
return False
class Plot:
def __init__(self, args):
self.args = args
self.result_dict = defaultdict(lambda: dict(bsz=[], seq_len=[], result={}))
with open(self.args.csv_file, newline="") as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
model_name = row["model"]
self.result_dict[model_name]["bsz"].append(int(row["batch_size"]))
self.result_dict[model_name]["seq_len"].append(int(row["sequence_length"]))
if can_convert_to_int(row["result"]):
# value is not None
self.result_dict[model_name]["result"][
(int(row["batch_size"]), int(row["sequence_length"]))
] = int(row["result"])
elif can_convert_to_float(row["result"]):
# value is not None
self.result_dict[model_name]["result"][
(int(row["batch_size"]), int(row["sequence_length"]))
] = float(row["result"])
def plot(self):
fig, ax = plt.subplots()
title_str = "Time usage" if self.args.is_time else "Memory usage"
title_str = title_str + " for training" if self.args.is_train else title_str + " for inference"
if not self.args.no_log_scale:
# set logarithm scales
ax.set_xscale("log")
ax.set_yscale("log")
for axis in [ax.xaxis, ax.yaxis]:
axis.set_major_formatter(ScalarFormatter())
for model_name_idx, model_name in enumerate(self.result_dict.keys()):
batch_sizes = sorted(list(set(self.result_dict[model_name]["bsz"])))
sequence_lengths = sorted(list(set(self.result_dict[model_name]["seq_len"])))
results = self.result_dict[model_name]["result"]
(x_axis_array, inner_loop_array) = (
(batch_sizes, sequence_lengths) if self.args.plot_along_batch else (sequence_lengths, batch_sizes)
)
label_model_name = (
model_name if self.args.short_model_names is None else self.args.short_model_names[model_name_idx]
)
for inner_loop_value in inner_loop_array:
if self.args.plot_along_batch:
y_axis_array = np.asarray(
[results[(x, inner_loop_value)] for x in x_axis_array if (x, inner_loop_value) in results],
dtype=np.int,
)
else:
y_axis_array = np.asarray(
[results[(inner_loop_value, x)] for x in x_axis_array if (inner_loop_value, x) in results],
dtype=np.float32,
)
(x_axis_label, inner_loop_label) = (
("batch_size", "len") if self.args.plot_along_batch else ("in #tokens", "bsz")
)
x_axis_array = np.asarray(x_axis_array, np.int)[: len(y_axis_array)]
plt.scatter(
x_axis_array, y_axis_array, label=f"{label_model_name} - {inner_loop_label}: {inner_loop_value}"
)
plt.plot(x_axis_array, y_axis_array, "--")
title_str += f" {label_model_name} vs."
title_str = title_str[:-4]
y_axis_label = "Time in s" if self.args.is_time else "Memory in MB"
# plot
plt.title(title_str)
plt.xlabel(x_axis_label)
plt.ylabel(y_axis_label)
plt.legend()
if self.args.figure_png_file is not None:
plt.savefig(self.args.figure_png_file)
else:
plt.show()
def main():
parser = HfArgumentParser(PlotArguments)
plot_args = parser.parse_args_into_dataclasses()[0]
plot = Plot(args=plot_args)
plot.plot()
if __name__ == "__main__":
main()
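For reference, here is a sketch of how this plotting script might be invoked on benchmark CSV files (assuming the file above is saved as `plot_csv_file.py`; the CSV and PNG file names are placeholders, and the flags come from the `PlotArguments` dataclass defined above):
```bash
# Plot inference time against sequence length, on a log scale, and save the figure.
python plot_csv_file.py --csv_file inference_time.csv --is_time --figure_png_file inference_time.png

# Plot memory usage against batch size instead of sequence length.
python plot_csv_file.py --csv_file inference_memory.csv --plot_along_batch --figure_png_file inference_memory.png
```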

View File

@ -0,0 +1 @@
tensorflow >= 2.3

View File

@ -0,0 +1,38 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multiple Choice
## Fine-tuning on SWAG
```bash
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/tensorflow/multiple-choice/run_tf_multiple_choice.py \
--task_name swag \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_device_train_batch_size=16 \
--logging_dir logs \
--gradient_accumulation_steps 2 \
--overwrite_output_dir
```

View File

@ -0,0 +1,3 @@
sentencepiece != 0.1.92
protobuf
tensorflow >= 2.3

View File

@ -0,0 +1,34 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## SQuAD with the Tensorflow Trainer
```bash
python run_tf_squad.py \
--model_name_or_path bert-base-uncased \
--output_dir model \
--max_seq_length 384 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 16 \
--do_train \
--logging_dir logs \
--logging_steps 10 \
--learning_rate 3e-5 \
--doc_stride 128
```
For the moment, evaluation is not available in the TensorFlow Trainer, only training.

View File

@ -0,0 +1,2 @@
datasets >= 1.4.0
tensorflow >= 2.3.0

View File

@ -0,0 +1,67 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Text classification examples
## GLUE tasks
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/tensorflow/text-classification/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
## Run generic text classification script in TensorFlow
The script [run_tf_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/tensorflow/text-classification/run_tf_text_classification.py) allows users to run text classification on their own CSV files. For now there are a few restrictions: the CSV files must have a header corresponding to the column names, and no more than three columns: one column for the id, one column for the text and another column for a second piece of text in case of an entailment classification, for example.
To use the script, one has to run the following command line (an example of the expected CSV layout follows the command):
```bash
python run_tf_text_classification.py \
--train_file train.csv \ ### training dataset file location (mandatory if running with --do_train option)
--dev_file dev.csv \ ### development dataset file location (mandatory if running with --do_eval option)
--test_file test.csv \ ### test dataset file location (mandatory if running with --do_predict option)
--label_column_id 0 \ ### which column corresponds to the labels
--model_name_or_path bert-base-multilingual-uncased \
--output_dir model \
--num_train_epochs 4 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--do_train \
--do_eval \
--do_predict \
--logging_steps 10 \
--evaluation_strategy steps \
--save_steps 10 \
--overwrite_output_dir \
--max_seq_length 128
```
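For reference, here is a sketch of what such a CSV file could look like for an entailment-style task, assuming the first column (index 0, matching `--label_column_id 0` above) holds the labels and the remaining columns hold the text; the column names are illustrative:
```csv
label,sentence1,sentence2
1,A man is playing a guitar.,A person is playing an instrument.
0,A dog is running in the park.,A cat is sleeping on the couch.
```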

View File

@ -0,0 +1,5 @@
accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
tensorflow >= 2.3

View File

@ -87,7 +87,7 @@ class GlueDataset(Dataset):
warnings.warn(
"This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py",
FutureWarning,
)
self.args = args

View File

@ -53,7 +53,7 @@ class TextDataset(Dataset):
):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@ -119,7 +119,7 @@ class LineByLineTextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@ -151,7 +151,7 @@ class LineByLineWithRefDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, ref_path: str):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm_wwm.py"
),
FutureWarning,
)
@ -193,7 +193,7 @@ class LineByLineWithSOPTextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_dir: str, block_size: int):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
@ -348,7 +348,7 @@ class TextDatasetForNextSentencePrediction(Dataset):
):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)

View File

@ -28,7 +28,7 @@ if is_sklearn_available():
DEPRECATION_WARNING = (
"This metric will be removed from the library soon, metrics should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py"
)

View File

@ -35,7 +35,7 @@ logger = logging.get_logger(__name__)
DEPRECATION_WARNING = (
"This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py"
)

View File

@ -536,7 +536,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
remove_args_str: str = None,
):
max_len = 32
data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
output_dir = self.get_auto_remove_tmp_dir()
args = f"""
--model_name_or_path {model_name}
@ -594,7 +594,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
args = [x for x in args if x not in remove_args]
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
script = [f"{self.examples_dir_str}/pytorch/translation/run_translation.py"]
launcher = self.get_launcher(distributed)
cmd = launcher + script + args + ds_args
@ -629,7 +629,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
""".split()
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]
launcher = self.get_launcher(distributed=True)
cmd = launcher + script + args + ds_args

View File

@ -35,7 +35,7 @@ from transformers.trainer_utils import set_seed
bindir = os.path.abspath(os.path.dirname(__file__))
with ExtendSysPath(f"{bindir}/../../examples/pytorch/translation"):
from run_translation import main  # noqa
@ -181,7 +181,7 @@ class TestTrainerExt(TestCasePlus):
extra_args_str: str = None,
predict_with_generate: bool = True,
):
data_dir = self.test_file_dir / "../fixtures/tests_samples/wmt_en_ro"
output_dir = self.get_auto_remove_tmp_dir()
args = f"""
--model_name_or_path {model_name}
@ -226,7 +226,7 @@ class TestTrainerExt(TestCasePlus):
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={n_gpu}
{self.examples_dir_str}/pytorch/translation/run_translation.py
""".split()
cmd = [sys.executable] + distributed_args + args
execute_subprocess_async(cmd, env=self.get_env())

Some files were not shown because too many files have changed in this diff.