BIG Reorganize examples (#4213)

* Created using Colaboratory

* [examples] reorganize files

* remove run_tpu_glue.py as superseded by TPU support in Trainer

* Bugfix: int, not tuple

* move files around
Julien Chaumond 2020-05-07 13:48:44 -04:00 committed by GitHub
parent cafa6a9e29
commit 0ae96ff8a7
65 changed files with 1355 additions and 1308 deletions

3
.gitignore vendored
View File

@ -132,7 +132,8 @@ proc_data
 runs
 /runs_old
 /wandb
-examples/runs
+/examples/runs
+/examples/**/*.args

 # data
 /data

View File

@ -331,7 +331,7 @@ pip install -r ./examples/requirements.txt
 export GLUE_DIR=/path/to/glue
 export TASK_NAME=MRPC

-python ./examples/run_glue.py \
+python ./examples/text-classification/run_glue.py \
   --model_name_or_path bert-base-uncased \
   --task_name $TASK_NAME \
   --do_train \
@ -357,7 +357,7 @@ Parallel training is a simple way to use several GPUs (but is slower and less fl
 ```shell
 export GLUE_DIR=/path/to/glue

-python ./examples/run_glue.py \
+python ./examples/text-classification/run_glue.py \
   --model_name_or_path xlnet-large-cased \
   --do_train \
   --do_eval \
@ -382,7 +382,7 @@ On this machine we thus have a batch size of 32, please increase `gradient_accum
 This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) corpus using distributed training on 8 V100 GPUs to reach a F1 > 92.

 ```bash
-python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
+python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classification/run_glue.py \
   --model_name_or_path bert-large-uncased-whole-word-masking \
   --task_name MRPC \
   --do_train \

View File

@ -1 +0,0 @@
../../examples/README.md

649
docs/source/examples.md Normal file
View File

@ -0,0 +1,649 @@
# Examples
This section puts together a few examples. All of these examples work for several models, making use of the very
similar API between the different models.
**Important**
To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
Execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
```
| Section | Description |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## TensorFlow 2.0 Bert models on GLUE
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using the `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
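For intuition, the two toggles roughly correspond to enabling XLA just-in-time compilation and TensorFlow's automatic mixed precision graph rewrite. A minimal sketch of what they typically enable, assuming TensorFlow 2.x (the variable names mirror the script, but the snippet itself is illustrative rather than copied from it):
```python
import tensorflow as tf

USE_XLA = False  # compile eligible ops with the XLA JIT compiler
USE_AMP = True   # run compute in float16 on Tensor Cores, keep variables in float32

if USE_XLA:
    tf.config.optimizer.set_jit(True)
if USE_AMP:
    # Automatic Mixed Precision via TensorFlow's graph rewrite
    tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
```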
## Running on TPUs
You can accelerate your workloads on Google's TPUs. For information on how to set up your TPU environment, refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Huggingface setup.
### GLUE
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
To run your GLUE task on the MNLI dataset, you can run something like the following:
```
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI
python run_glue_tpu.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME \
--overwrite_output_dir \
--logging_steps 50 \
--save_steps 200 \
--num_cores=8 \
--only_log_master
```
## Language model training
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.
### GPT-2/GPT and causal language modeling
The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
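Perplexity here is simply the exponential of the evaluation cross-entropy loss reported by the script, so the two numbers are easy to relate (illustrative values only):
```python
import math

eval_loss = 3.0             # example value; use the eval loss printed by the script
print(math.exp(eval_loss))  # ~20.1, i.e. a perplexity of roughly 20
```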
### RoBERTa/BERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.
In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may therefore converge
slightly more slowly (over-fitting takes more epochs).
We use the `--mlm` flag so that the script may change its loss function.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
```
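For intuition about the dynamic masking mentioned above: the positions to mask are re-sampled every time a sequence is fed to the model, rather than fixed once during preprocessing. A simplified sketch of the idea (not the library's exact implementation, which also avoids special tokens and sometimes substitutes random tokens or leaves the selected tokens unchanged):
```python
import torch

def dynamically_mask(input_ids, mask_token_id, mlm_probability=0.15):
    """Re-sample masked positions on every call, so each pass sees a fresh mask."""
    labels = input_ids.clone()
    # Choose ~15% of positions to predict
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100             # compute the MLM loss only on masked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id  # replace the chosen tokens with [MASK]
    return corrupted, labels
```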
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
Example usage:
```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```
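If you prefer to generate text directly from Python rather than via the script, a minimal sketch using the library's `generate()` method looks roughly like this (the prompt and sampling parameters are arbitrary choices for illustration):
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("In a shocking finding, ", return_tensors="pt")
# Sample a 50-token continuation; the top_k/top_p values here are illustrative
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```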
## GLUE
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with total train
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median over 5 runs (with different seeds) for each of the metrics.
| Task | Metric | Result |
|-------|------------------------------|-------------|
| CoLA | Matthew's corr | 49.23 |
| SST-2 | Accuracy | 91.97 |
| MRPC | F1/Accuracy | 89.47/85.29 |
| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
| QQP | Accuracy/F1 | 88.40/84.31 |
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI | Accuracy | 87.46 |
| RTE | Accuracy | 61.73 |
| WNLI | Accuracy | 45.07 |
Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
```
where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`.
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
The code has not been tested with half-precision training with Apex on any GLUE task apart from MRPC, MNLI,
CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. That being
said, there shouldn't be any issue running half-precision training on the remaining GLUE tasks either,
since the data processor for each task inherits from the base class `DataProcessor`.
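For reference, a task processor is a small class that mainly tells the script how to read the task's data files and which labels exist. A hypothetical processor for a made-up two-class sentence-pair task might look roughly like this (the class and method names follow the library's `DataProcessor`/`InputExample` interface at the time; everything task-specific here is invented for illustration):
```python
import os

from transformers import DataProcessor, InputExample


class MyPairTaskProcessor(DataProcessor):
    """Hypothetical processor for a two-class sentence-pair task."""

    def get_train_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        # Each TSV row is assumed to be: sentence_a <tab> sentence_b <tab> label
        return [
            InputExample(guid=f"{set_type}-{i}", text_a=row[0], text_b=row[1], label=row[2])
            for i, row in enumerate(lines)
        ]
```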
### MRPC
#### Fine-tuning example
The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
than 10 minutes on a single K80 and in 27 seconds (!) on a single Tesla V100 16GB with Apex installed.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Our tests, run on a few seeds with [the original implementation hyper-
parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation
results between 84% and 88%.
#### Using Apex and mixed-precision
Using Apex and 16-bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs. The model used is BERT whole-word-masking, and it
reaches F1 > 92 on MRPC.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Training with these hyper-parameters gave us the following results:
```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```
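Note that `acc_and_f1` is simply the mean of the accuracy and F1 values above, which is easy to verify:
```python
acc = 0.8823529411764706
f1 = 0.9210526315789473
print((acc + f1) / 2)  # 0.901702786377709, the reported acc_and_f1
```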
### MNLI
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir
```
The results are the following:
```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904
***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```
## Multiple Choice
Based on the script `run_multiple_choice.py`.
#### Fine-tuning on SWAG
Download the [SWAG](https://github.com/rowanz/swagaf/tree/master/data) data.
```bash
# Training on 4 Tesla V100 (16GB) GPUs
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
Training with the defined hyper-parameters yields the following results:
```
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
#### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single Tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
`$SQUAD_DIR` directory.
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
And for SQuAD2.0, you need to download:
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 88.52
exact_match = 81.22
```
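`exact_match` is the percentage of predictions that match a gold answer exactly, while `f1` measures token overlap between the predicted and gold answer spans. A simplified sketch of the token-overlap F1 (the official `evaluate-v1.1.py` additionally normalizes case, punctuation and articles, and takes the maximum over all gold answers):
```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer span and a gold answer."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```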
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs and the Bert Whole Word Masking uncased model to reach an F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 93.15
exact_match = 86.91
```
This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.
#### Fine-tuning XLNet on SQuAD
This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above for how to download the data for SQuAD.
##### Command for SQuAD1.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=4 \
--per_gpu_train_batch_size=4 \
--save_steps 5000
```
##### Command for SQuAD2.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=2 \
--per_gpu_train_batch_size=2 \
--save_steps 5000
```
A larger batch size may improve performance at the cost of more memory.
##### Results for SQuAD1.0 with the previously defined hyper-parameters:
```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```
##### Results for SQuAD2.0 with the previously defined hyper-parameters:
```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multilingual BERT) on the XNLI dataset. It runs in 106 mins
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should both be saved (and unzipped) in a
`$XNLI_DIR` directory.
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
```bash
export XNLI_DIR=/path/to/XNLI
python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```
## MM-IMDb
Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a multimodal dataset with around 26,000 movies, including images, plots and other metadata.
### Training on MM-IMDb
```
python run_mmimdb.py \
--data_dir /path/to/mmimdb/dataset/ \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir /path/to/save/dir/ \
--do_train \
--do_eval \
--max_seq_len 512 \
--gradient_accumulation_steps 20 \
--num_image_embeds 3 \
--num_train_epochs 100 \
--patience 5
```
## Adversarial evaluation of model performances
Here is an example of evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was graciously provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
Here is an example of using `test_hans.py`:
```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py
python examples/hans/test_hans.py \
--task_name hans \
--model_type $MODEL_TYPE \
--do_eval \
--data_dir $HANS_DIR \
--model_name_or_path $MODEL_PATH \
--max_seq_length 128 \
--output_dir $MODEL_PATH
```
This will create the `hans_predictions.txt` file in `MODEL_PATH`, which can then be evaluated using `hans/evaluate_heur_output.py` from the HANS dataset.
The results of the BERT-base model trained on MNLI using batch size 8 and random seed 42 on the HANS dataset are as follows:
```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962
Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```

View File

@ -54,7 +54,7 @@ Additionally, the following method can be used to load values from a data file
 Example usage
 ^^^^^^^^^^^^^^^^^^^^^^^^^

-An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py>`__ script.
+An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.

 XNLI

View File

@ -44,7 +44,7 @@ Sequence Classification
 Sequence classification is the task of classifying sequences according to a given number of classes. An example
 of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
 a model on a GLUE sequence classification task, you may leverage the
-`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_glue.py>`_ or
+`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`_ or
 `run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_tf_glue.py>`_ scripts.

 Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative.

View File

@ -15,7 +15,7 @@ pip install -r ./examples/requirements.txt
 ```

 | Section | Description |
-|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
+|----------------------------|-----------------------------------------------------
 | [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
 | [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
 | [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
@ -27,623 +27,3 @@ pip install -r ./examples/requirements.txt
 | [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
 | [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |

View File

@ -0,0 +1,38 @@

View File

@ -0,0 +1,23 @@

View File

@ -1,9 +0,0 @@
# GLUE Benchmark
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).
#### Run PyTorch version using PyTorch-Lightning
Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download and pre-process the data and run the specified models. Logs are saved in the `lightning_logs` directory.
Pass the `--n_gpu` flag to change the number of GPUs. The default uses 1. At the end, the expected results are: `TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}`

View File

@ -0,0 +1,63 @@

View File

@ -0,0 +1,31 @@

View File

@ -0,0 +1,159 @@
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
#### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
$SQUAD_DIR directory.
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
And for SQuAD2.0, you need to download:
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 88.52
exact_match = 81.22
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3 \
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 93.15
exact_match = 86.91
```
This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.
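As a quick sanity check, that released checkpoint can be loaded through the `pipeline` API (a minimal sketch; the question/context pair below is only an example):

```python
from transformers import pipeline

qa = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

result = qa(
    question="Which deep learning libraries back Transformers?",
    context="Transformers provides interoperability between PyTorch and TensorFlow 2.0.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```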
#### Fine-tuning XLNet on SQuAD
This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above for how to download the data for SQuAD.
##### Command for SQuAD1.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=4 \
--per_gpu_train_batch_size=4 \
--save_steps 5000
```
##### Command for SQuAD2.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=2 \
--per_gpu_train_batch_size=2 \
--save_steps 5000
```
A larger batch size may improve performance at the cost of more memory.
##### Results for SQuAD1.0 with the previously defined hyper-parameters:
```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```
##### Results for SQuAD2.0 with the previously defined hyper-parameters:
```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```

View File

@ -1,4 +1,3 @@
-tensorboardX
 tensorboard
 scikit-learn
 seqeval

View File

@ -1,611 +0,0 @@
# coding=utf-8
# Copyright 2019 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for sequence classification on GLUE (Bert, DistilBert, XLNet, RoBERTa)."""
from __future__ import absolute_import, division, print_function
import argparse
import glob
import logging
import os
import random
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
BertConfig,
BertForSequenceClassification,
BertTokenizer,
DistilBertConfig,
DistilBertForSequenceClassification,
DistilBertTokenizer,
RobertaConfig,
RobertaForSequenceClassification,
RobertaTokenizer,
XLMConfig,
XLMForSequenceClassification,
XLMTokenizer,
XLNetConfig,
XLNetForSequenceClassification,
XLNetTokenizer,
get_linear_schedule_with_warmup,
)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
try:
# Only tensorboardX supports writing directly to gs://
from tensorboardX import SummaryWriter
except ImportError:
from torch.utils.tensorboard import SummaryWriter
logger = logging.getLogger(__name__)
ALL_MODELS = sum(
(
tuple(conf.pretrained_config_archive_map.keys())
for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
),
(),
)
MODEL_CLASSES = {
"bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
"xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
"xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
"roberta": (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
"distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
}
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
def get_sampler(dataset):
if xm.xrt_world_size() <= 1:
return RandomSampler(dataset)
return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())
def train(args, train_dataset, model, tokenizer, disable_logging=False):
""" Train the model """
if xm.is_master_ordinal():
# Only master writes to Tensorboard
tb_writer = SummaryWriter(args.tensorboard_logdir)
train_sampler = get_sampler(train_dataset)
dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total,
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(dataloader) * args.train_batch_size)
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per TPU core = %d", args.train_batch_size)
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
(args.train_batch_size * args.gradient_accumulation_steps * xm.xrt_world_size()),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
loss = None
model.zero_grad()
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=disable_logging)
set_seed(args.seed) # Added here for reproductibility (even between python 2 and 3)
for epoch in train_iterator:
# tpu-comment: Get TPU parallel loader which sends data to TPU in background.
train_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)
epoch_iterator = tqdm(train_dataloader, desc="Iteration", total=len(dataloader), disable=disable_logging)
for step, batch in enumerate(epoch_iterator):
# Save model checkpoint.
if args.save_steps > 0 and global_step % args.save_steps == 0:
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
logger.info("Saving model checkpoint to %s", output_dir)
if xm.is_master_ordinal():
if not os.path.exists(output_dir):
os.makedirs(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
# Barrier to wait for saving checkpoint.
xm.rendezvous("mid_training_checkpoint")
# model.save_pretrained needs to be called by all ordinals
model.save_pretrained(output_dir)
model.train()
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
# XLM, DistilBERT and RoBERTa don't use segment_ids
inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None
outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
xm.optimizer_step(optimizer)
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.logging_steps > 0 and global_step % args.logging_steps == 0:
# Log metrics.
results = {}
if args.evaluate_during_training:
results = evaluate(args, model, tokenizer, disable_logging=disable_logging)
loss_scalar = loss.item()
logger.info(
"global_step: {global_step}, lr: {lr:.6f}, loss: {loss:.3f}".format(
global_step=global_step, lr=scheduler.get_lr()[0], loss=loss_scalar
)
)
if xm.is_master_ordinal():
# tpu-comment: All values must be in CPU and not on TPU device
for key, value in results.items():
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
tb_writer.add_scalar("loss", loss_scalar, global_step)
if args.max_steps > 0 and global_step > args.max_steps:
epoch_iterator.close()
break
if args.metrics_debug:
# tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
xm.master_print(met.metrics_report())
if args.max_steps > 0 and global_step > args.max_steps:
train_iterator.close()
break
if xm.is_master_ordinal():
tb_writer.close()
return global_step, loss.item()
def evaluate(args, model, tokenizer, prefix="", disable_logging=False):
"""Evaluate the model"""
if xm.is_master_ordinal():
# Only master writes to Tensorboard
tb_writer = SummaryWriter(args.tensorboard_logdir)
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)
results = {}
for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
eval_sampler = get_sampler(eval_dataset)
if not os.path.exists(eval_output_dir):
os.makedirs(eval_output_dir)
dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, shuffle=False)
eval_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(dataloader) * args.eval_batch_size)
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
for batch in tqdm(eval_dataloader, desc="Evaluating", disable=disable_logging):
model.eval()
with torch.no_grad():
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
# XLM, DistilBERT and RoBERTa don't use segment_ids
inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None
outputs = model(**inputs)
batch_eval_loss, logits = outputs[:2]
eval_loss += batch_eval_loss
nb_eval_steps += 1
if preds is None:
preds = logits.detach().cpu().numpy()
out_label_ids = inputs["labels"].detach().cpu().numpy()
else:
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
# tpu-comment: Get all predictions and labels from all worker shards of eval dataset
preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)
eval_loss = eval_loss / nb_eval_steps
if args.output_mode == "classification":
preds = np.argmax(preds, axis=1)
elif args.output_mode == "regression":
preds = np.squeeze(preds)
result = compute_metrics(eval_task, preds, out_label_ids)
results.update(result)
results["eval_loss"] = eval_loss.item()
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
if xm.is_master_ordinal():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(results.keys()):
logger.info(" %s = %s", key, str(results[key]))
writer.write("%s = %s\n" % (key, str(results[key])))
tb_writer.add_scalar(f"{eval_task}/{key}", results[key])
if args.metrics_debug:
# tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
xm.master_print(met.metrics_report())
if xm.is_master_ordinal():
tb_writer.close()
return results
def load_and_cache_examples(args, task, tokenizer, evaluate=False):
if not xm.is_master_ordinal():
xm.rendezvous("load_and_cache_examples")
processor = processors[task]()
output_mode = output_modes[task]
cached_features_file = os.path.join(
args.cache_dir,
"cached_{}_{}_{}_{}".format(
"dev" if evaluate else "train",
list(filter(None, args.model_name_or_path.split("/"))).pop(),
str(args.max_seq_length),
str(task),
),
)
# Load data features from cache or dataset file
if os.path.exists(cached_features_file) and not args.overwrite_cache:
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
else:
logger.info("Creating features from dataset file at %s", args.data_dir)
label_list = processor.get_labels()
if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta"]:
# HACK(label indices are swapped in RoBERTa pretrained model)
label_list[1], label_list[2] = label_list[2], label_list[1]
examples = (
processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
)
features = convert_examples_to_features(
examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
)
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(features, cached_features_file)
if xm.is_master_ordinal():
xm.rendezvous("load_and_cache_examples")
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
if output_mode == "classification":
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
elif output_mode == "regression":
all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
return dataset
def main(args):
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
(
"Output directory ({}) already exists and is not empty." " Use --overwrite_output_dir to overcome."
).format(args.output_dir)
)
# tpu-comment: Get TPU/XLA Device
args.device = xm.xla_device()
# Setup logging
logging.basicConfig(
format="[xla:{}] %(asctime)s - %(levelname)s - %(name)s - %(message)s".format(xm.get_ordinal()),
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
disable_logging = False
if not xm.is_master_ordinal() and args.only_log_master:
# Disable all non-master loggers below CRITICAL.
logging.disable(logging.CRITICAL)
disable_logging = True
logger.warning("Process rank: %s, device: %s, num_cores: %s", xm.get_ordinal(), args.device, args.num_cores)
# Set seed to have same initialization
set_seed(args.seed)
# Prepare GLUE task
args.task_name = args.task_name.lower()
if args.task_name not in processors:
raise ValueError("Task not found: %s" % (args.task_name))
processor = processors[args.task_name]()
args.output_mode = output_modes[args.task_name]
label_list = processor.get_labels()
num_labels = len(label_list)
if not xm.is_master_ordinal():
xm.rendezvous(
"download_only_once"
) # Make sure only the first process in distributed training will download model & vocab
# Load pretrained model and tokenizer
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
cache_dir=args.cache_dir if args.cache_dir else None,
xla_device=True,
)
tokenizer = tokenizer_class.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
)
model = model_class.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None,
)
if xm.is_master_ordinal():
xm.rendezvous("download_only_once")
# Send model to TPU/XLA device.
model.to(args.device)
logger.info("Training/evaluation parameters %s", args)
if args.do_train:
# Train the model.
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer, disable_logging=disable_logging)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
if xm.is_master_ordinal():
# Save trained model.
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
# Create output directory if needed
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained.
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
xm.rendezvous("post_training_checkpoint")
# model.save_pretrained needs to be called by all ordinals
model.save_pretrained(args.output_dir)
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(args.output_dir)
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval:
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
model = model_class.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, prefix=prefix, disable_logging=disable_logging)
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
results.update(result)
return results
def get_args():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
)
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
# TPU Parameters
parser.add_argument("--num_cores", default=8, type=int, help="Number of TPU cores to use (1 or 8).")
parser.add_argument("--metrics_debug", action="store_true", help="Whether to print debug metrics.")
# Other parameters
parser.add_argument(
"--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded and features file generated",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
)
parser.add_argument("--train_batch_size", default=8, type=int, help="Per core batch size for training.")
parser.add_argument("--eval_batch_size", default=8, type=int, help="Per core batch size for evaluation.")
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument("--tensorboard_logdir", default="./runs", type=str, help="Where to write tensorboard metrics.")
parser.add_argument("--logging_steps", type=int, default=50, help="Log every X update steps.")
parser.add_argument("--only_log_master", action="store_true", help="Whether to log only from each hosts master.")
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X update steps.")
parser.add_argument(
"--eval_all_checkpoints",
action="store_true",
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
)
parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
return parser.parse_args()
def _mp_fn(rank, args):
main(args)
def main_cli():
args = get_args()
xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)
if __name__ == "__main__":
main_cli()

View File

@ -7,7 +7,7 @@ import time
 import torch
 from torch.utils.data import DataLoader
-from transformer_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup
+from lightning_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup
 try:

View File

@ -5,7 +5,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR
-# Add parent directory to python path to access transformer_base.py
+# Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../../":"${PYTHONPATH}"
 python finetune.py \

View File

@ -12,7 +12,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR
-# Add parent directory to python path to access transformer_base.py and utils.py
+# Add parent directory to python path to access lightning_base.py and utils.py
 export PYTHONPATH="../../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=cnn_tiny/ \

View File

@ -16,10 +16,20 @@
 import argparse
 import logging
+import os
 import sys
 import unittest
 from unittest.mock import patch
+
+SRC_DIRS = [
+    os.path.join(os.path.dirname(__file__), dirname)
+    for dirname in ["text-generation", "text-classification", "language-modeling", "question-answering"]
+]
+sys.path.extend(SRC_DIRS)
+
+if SRC_DIRS is not None:
     import run_generation
     import run_glue
     import run_language_modeling
@ -43,24 +53,24 @@ class ExamplesTests(unittest.TestCase):
stream_handler = logging.StreamHandler(sys.stdout) stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler) logger.addHandler(stream_handler)
testargs = [ testargs = """
"run_glue.py", run_glue.py
"--data_dir=./examples/tests_samples/MRPC/", --model_name_or_path bert-base-uncased
"--task_name=mrpc", --data_dir ./tests/fixtures/tests_samples/MRPC/
"--do_train", --task_name mrpc
"--do_eval", --do_train
"--output_dir=./examples/tests_samples/temp_dir", --do_eval
"--per_gpu_train_batch_size=2", --output_dir ./tests/fixtures/tests_samples/temp_dir
"--per_gpu_eval_batch_size=1", --per_gpu_train_batch_size=2
"--learning_rate=1e-4", --per_gpu_eval_batch_size=1
"--max_steps=10", --learning_rate=1e-4
"--warmup_steps=2", --max_steps=10
"--overwrite_output_dir", --warmup_steps=2
"--seed=42", --overwrite_output_dir
"--max_seq_length=128", --seed=42
] --max_seq_length=128
model_name = "--model_name_or_path=bert-base-uncased" """.split()
with patch.object(sys, "argv", testargs + [model_name]): with patch.object(sys, "argv", testargs):
result = run_glue.main() result = run_glue.main()
del result["loss"] del result["loss"]
for value in result.values(): for value in result.values():
@ -78,7 +88,7 @@ class ExamplesTests(unittest.TestCase):
 --line_by_line
 --train_data_file ./tests/fixtures/sample_text.txt
 --eval_data_file ./tests/fixtures/sample_text.txt
---output_dir ./tests/fixtures
+--output_dir ./tests/fixtures/tests_samples/temp_dir
 --overwrite_output_dir
 --do_train
 --do_eval
@ -93,24 +103,25 @@ class ExamplesTests(unittest.TestCase):
stream_handler = logging.StreamHandler(sys.stdout) stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler) logger.addHandler(stream_handler)
testargs = [ testargs = """
"run_squad.py", run_squad.py
"--data_dir=./examples/tests_samples/SQUAD", --model_type=bert
"--model_name=bert-base-uncased", --model_name_or_path=bert-base-uncased
"--output_dir=./examples/tests_samples/temp_dir", --data_dir=./tests/fixtures/tests_samples/SQUAD
"--max_steps=10", --model_name=bert-base-uncased
"--warmup_steps=2", --output_dir=./tests/fixtures/tests_samples/temp_dir
"--do_train", --max_steps=10
"--do_eval", --warmup_steps=2
"--version_2_with_negative", --do_train
"--learning_rate=2e-4", --do_eval
"--per_gpu_train_batch_size=2", --version_2_with_negative
"--per_gpu_eval_batch_size=1", --learning_rate=2e-4
"--overwrite_output_dir", --per_gpu_train_batch_size=2
"--seed=42", --per_gpu_eval_batch_size=1
] --overwrite_output_dir
model_type, model_name = ("--model_type=bert", "--model_name_or_path=bert-base-uncased") --seed=42
with patch.object(sys, "argv", testargs + [model_type, model_name]): """.split()
with patch.object(sys, "argv", testargs):
result = run_squad.main() result = run_squad.main()
self.assertGreaterEqual(result["f1"], 30) self.assertGreaterEqual(result["f1"], 30)
self.assertGreaterEqual(result["exact"], 30) self.assertGreaterEqual(result["exact"], 30)

View File

@ -0,0 +1,299 @@
## GLUE Benchmark
### Run TensorFlow 2.0 version
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP), which runs models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, and an option for XLA, which uses the XLA compiler to reduce model runtime.
Both options are toggled with the `USE_XLA` and `USE_AMP` variables in the script; a sketch of what these toggles amount to follows the benchmark table below.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
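The sketch below shows what the `USE_XLA` / `USE_AMP` toggles typically amount to in TensorFlow 2; it is an approximation of the switches at the top of the script, not a verbatim excerpt.

```python
import tensorflow as tf

USE_XLA = False
USE_AMP = True

# XLA: compile the model's graph with the XLA JIT compiler.
tf.config.optimizer.set_jit(USE_XLA)
# AMP: run eligible ops in float16 on Tensor Cores while keeping float32 master weights.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
```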
### Run PyTorch version
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on a single V100 GPU with total train
batch sizes between 16 and 64. Some of these tasks have a small dataset, and training can lead to high variance in the results
between different runs. We report the median over 5 runs (with different seeds) for each of the metrics.
| Task | Metric | Result |
|-------|------------------------------|-------------|
| CoLA | Matthews corr. | 49.23 |
| SST-2 | Accuracy | 91.97 |
| MRPC | F1/Accuracy | 89.47/85.29 |
| STS-B | Pearson/Spearman corr. | 83.95/83.70 |
| QQP | Accuracy/F1 | 88.40/84.31 |
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI | Accuracy | 87.46 |
| RTE | Accuracy | 61.73 |
| WNLI | Accuracy | 45.07 |
Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
```
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class `DataProcessor` (a sketch of that interface follows).
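For reference, a task processor only has to implement a handful of methods. The sketch below is a made-up `MyTaskProcessor` (the class name, file names, and column positions are all illustrative), showing the interface that every GLUE processor inherits from `DataProcessor`:

```python
from transformers import DataProcessor, InputExample


class MyTaskProcessor(DataProcessor):
    """Hypothetical processor for a two-column TSV task (sentence pair + label)."""

    def get_labels(self):
        return ["0", "1"]

    def get_train_examples(self, data_dir):
        return self._create_examples(self._read_tsv(f"{data_dir}/train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(self._read_tsv(f"{data_dir}/dev.tsv"), "dev")

    def _create_examples(self, lines, set_type):
        # Each line is assumed to be [sentence_a, sentence_b, label].
        return [
            InputExample(guid=f"{set_type}-{i}", text_a=line[0], text_b=line[1], label=line[2])
            for i, line in enumerate(lines)
        ]
```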
## Running on TPUs
You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Huggingface setup.
For running your GLUE task on MNLI dataset you can run something like the following:
```
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI
python run_glue_tpu.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME \
--overwrite_output_dir \
--logging_steps 50 \
--save_steps 200 \
--num_cores=8 \
--only_log_master
```
### MRPC
#### Fine-tuning example
The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
than 10 minutes on a single K80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Our tests, run on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks),
gave evaluation results between 84% and 88%.
#### Using Apex and mixed-precision
Using apex and 16-bit precision, fine-tuning on MRPC takes only 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
```
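Under the hood, the `--fp16` flag wraps the model and optimizer with NVIDIA apex. A rough sketch of what that looks like (assuming apex is installed and a CUDA device is available; `"O1"` is the usual mixed-precision opt level):

```python
from apex import amp
from transformers import AdamW, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased").to("cuda")
optimizer = AdamW(model.parameters(), lr=2e-5)

# Patch the model and optimizer for automatic mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Inside the training loop, losses are scaled before the backward pass:
#   with amp.scale_loss(loss, optimizer) as scaled_loss:
#       scaled_loss.backward()
#   optimizer.step()
```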
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model,
and it reaches an F1 > 92 on MRPC.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Training with these hyper-parameters gave us the following results:
```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```
### MNLI
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir
```
The results are the following:
```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904
***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```
### Run PyTorch version using PyTorch-Lightning
Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that automatically downloads and pre-processes the data, then runs the specified models. Logs are saved in the `lightning_logs` directory.
Pass the `--n_gpu` flag to change the number of GPUs; the default is 1. At the end, the expected results are:
```
TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 minutes
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should both be saved (and unzipped) in a
`$XNLI_DIR` directory.
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
```bash
export XNLI_DIR=/path/to/XNLI
python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```

View File

@ -20,7 +20,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR
-# Add parent directory to python path to access transformer_base.py
+# Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../":"${PYTHONPATH}"
 python3 run_pl_glue.py --data_dir $DATA_DIR \

View File

@ -8,7 +8,7 @@ import numpy as np
 import torch
 from torch.utils.data import DataLoader, TensorDataset
-from transformer_base import BaseTransformer, add_generic_args, generic_train
+from lightning_base import BaseTransformer, add_generic_args, generic_train
 from transformers import glue_compute_metrics as compute_metrics
 from transformers import glue_convert_examples_to_features as convert_examples_to_features
 from transformers import glue_output_modes

View File

@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
-Adapted from `examples/run_glue.py`"""
+Adapted from `examples/text-classification/run_glue.py`"""
 import argparse

View File

@ -0,0 +1,15 @@
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
Example usage:
```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```
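The same kind of conditional sampling can also be reproduced directly with the model's `generate()` method; a minimal sketch (the prompt text is only an example):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("In a shocking finding, scientists discovered", return_tensors="pt")

# Top-k / nucleus sampling.
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```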

View File

(image file, 653 KiB before and after)

View File

(image file, 664 KiB before and after)

View File

@ -27,7 +27,7 @@ export CURRENT_DIR=${PWD}
 export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 mkdir -p $OUTPUT_DIR
-# Add parent directory to python path to access transformer_base.py
+# Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../":"${PYTHONPATH}"
 python3 run_pl_ner.py --data_dir ./ \

View File

@ -9,7 +9,7 @@ from seqeval.metrics import f1_score, precision_score, recall_score
 from torch.nn import CrossEntropyLoss
 from torch.utils.data import DataLoader, TensorDataset
-from transformer_base import BaseTransformer, add_generic_args, generic_train
+from lightning_base import BaseTransformer, add_generic_args, generic_train
 from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file

View File

@ -18,10 +18,10 @@ class ExamplesTests(unittest.TestCase):
testargs = """ testargs = """
--model_name distilbert-base-german-cased --model_name distilbert-base-german-cased
--output_dir ./examples/tests_samples/temp_dir --output_dir ./tests/fixtures/tests_samples/temp_dir
--overwrite_output_dir --overwrite_output_dir
--data_dir ./examples/tests_samples/GermEval --data_dir ./tests/fixtures/tests_samples/GermEval
--labels ./examples/tests_samples/GermEval/labels.txt --labels ./tests/fixtures/tests_samples/GermEval/labels.txt
--max_seq_length 128 --max_seq_length 128
--num_train_epochs 6 --num_train_epochs 6
--logging_steps 1 --logging_steps 1

View File

@ -4,7 +4,7 @@
 This model is a fine tuned RoBERTA model over STS-B.
 It was trained with these params:
-!python /content/transformers/examples/run_glue.py \
+!python /content/transformers/examples/text-classification/run_glue.py \
 --model_type roberta \
 --model_name_or_path roberta-large \
 --task_name STS-B \

View File

@ -333,7 +333,7 @@ class Trainer:
 total_train_batch_size = (
     self.args.train_batch_size
     * self.args.gradient_accumulation_steps
-    * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1),
+    * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)
 )
 logger.info("***** Running training *****")
 logger.info(" Num examples = %d", num_examples)

View File

Can't render this file because it contains an unexpected character in line 3 and column 155.

View File

Can't render this file because it contains an unexpected character in line 3 and column 155.

View File

@ -28,7 +28,7 @@ class DataCollatorIntegrationTest(unittest.TestCase):
MODEL_ID = "bert-base-cased-finetuned-mrpc" MODEL_ID = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
data_args = GlueDataTrainingArguments( data_args = GlueDataTrainingArguments(
task_name="mrpc", data_dir="./examples/tests_samples/MRPC", overwrite_cache=True task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC", overwrite_cache=True
) )
dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
data_collator = DefaultDataCollator() data_collator = DefaultDataCollator()
@ -39,7 +39,7 @@ class DataCollatorIntegrationTest(unittest.TestCase):
MODEL_ID = "distilroberta-base" MODEL_ID = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
data_args = GlueDataTrainingArguments( data_args = GlueDataTrainingArguments(
task_name="sts-b", data_dir="./examples/tests_samples/STS-B", overwrite_cache=True task_name="sts-b", data_dir="./tests/fixtures/tests_samples/STS-B", overwrite_cache=True
) )
dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
data_collator = DefaultDataCollator() data_collator = DefaultDataCollator()
@ -91,7 +91,7 @@ class TrainerIntegrationTest(unittest.TestCase):
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
 model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
 data_args = GlueDataTrainingArguments(
-    task_name="mrpc", data_dir="./examples/tests_samples/MRPC", overwrite_cache=True
+    task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC", overwrite_cache=True
 )
 eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)

View File

@ -1,13 +1,13 @@
 ---
 - step:
-    name: Execute python examples/run_glue.py
+    name: Execute python examples/text-classification/run_glue.py
     image: pytorch/pytorch:nightly-devel-cuda10.0-cudnn7
     command:
       - python /valohai/repository/utils/download_glue_data.py --data_dir=/glue_data
       - pip install -e .
       - pip install -r examples/requirements.txt
-      - python examples/run_glue.py --do_train --data_dir=/glue_data/{parameter-value:task_name} {parameters}
+      - python examples/text-classification/run_glue.py --do_train --data_dir=/glue_data/{parameter-value:task_name} {parameters}
     parameters:
       - name: model_type
         pass-as: --model_type={v}