diff --git a/.gitignore b/.gitignore index d366cdddd6e..9f6b5e79f44 100644 --- a/.gitignore +++ b/.gitignore @@ -132,7 +132,8 @@ proc_data runs /runs_old /wandb -examples/runs +/examples/runs +/examples/**/*.args # data /data diff --git a/README.md b/README.md index 136d243095a..80b0b772c9f 100644 --- a/README.md +++ b/README.md @@ -331,7 +331,7 @@ pip install -r ./examples/requirements.txt export GLUE_DIR=/path/to/glue export TASK_NAME=MRPC -python ./examples/run_glue.py \ +python ./examples/text-classification/run_glue.py \ --model_name_or_path bert-base-uncased \ --task_name $TASK_NAME \ --do_train \ @@ -357,7 +357,7 @@ Parallel training is a simple way to use several GPUs (but is slower and less fl ```shell export GLUE_DIR=/path/to/glue -python ./examples/run_glue.py \ +python ./examples/text-classification/run_glue.py \ --model_name_or_path xlnet-large-cased \ --do_train \ --do_eval \ @@ -382,7 +382,7 @@ On this machine we thus have a batch size of 32, please increase `gradient_accum This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) corpus using distributed training on 8 V100 GPUs to reach a F1 > 92. ```bash -python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \ +python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classification/run_glue.py \ --model_name_or_path bert-large-uncased-whole-word-masking \ --task_name MRPC \ --do_train \ diff --git a/docs/source/examples.md b/docs/source/examples.md deleted file mode 120000 index 6fa53604d90..00000000000 --- a/docs/source/examples.md +++ /dev/null @@ -1 +0,0 @@ -../../examples/README.md \ No newline at end of file diff --git a/docs/source/examples.md b/docs/source/examples.md new file mode 100644 index 00000000000..c5d0e79a499 --- /dev/null +++ b/docs/source/examples.md @@ -0,0 +1,649 @@ +# Examples + +In this section a few examples are put together. All of these examples work for several models, making use of the very +similar API between the different models. + +**Important** +To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples. +Execute the following steps in a new virtual environment: + +```bash +git clone https://github.com/huggingface/transformers +cd transformers +pip install . +pip install -r ./examples/requirements.txt +``` + +| Section | Description | +|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------ +| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. | +| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. | +| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. | +| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. | +| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. | +| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. 
| +| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. | +| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. | +| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. | +| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) | + +## TensorFlow 2.0 Bert models on GLUE + +Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py). + +Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/). + +This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime. +Options are toggled using `USE_XLA` or `USE_AMP` variables in the script. +These options and the below benchmark are provided by @tlkh. + +Quick benchmarks from the script (no other modifications): + +| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) | +| --------- | -------- | ----------------------- | ----------------------| +| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 | +| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 | +| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 | +| V100 | AMP | 22s | 0.8646/0.8385/0.8411 | +| 1080 Ti | FP32 | 55s | - | + +Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used). + +## Running on TPUs + +You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this +[README](https://github.com/pytorch/xla/blob/master/README.md). + +The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are +identical to your normal GPU + Huggingface setup. + +### GLUE + +Before running anyone of these GLUE tasks you should download the +[GLUE data](https://gluebenchmark.com/tasks) by running +[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) +and unpack it to some directory `$GLUE_DIR`. + +For running your GLUE task on MNLI dataset you can run something like the following: + +``` +export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" +export GLUE_DIR=/path/to/glue +export TASK_NAME=MNLI + +python run_glue_tpu.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --train_batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/$TASK_NAME \ + --overwrite_output_dir \ + --logging_steps 50 \ + --save_steps 200 \ + --num_cores=8 \ + --only_log_master +``` + + +## Language model training + +Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py). 
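As a quick orientation before the recipes that follow, here is a minimal sketch of the causal objective this script optimizes for GPT-2 (the input text and the use of `gpt2` are only illustrative): passing the input ids as `labels` makes the model return the next-token prediction loss, whose exponential is the perplexity reported below.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Language modeling predicts each token from the tokens before it."
input_ids = torch.tensor([tokenizer.encode(text)])

# With labels == input_ids, the model shifts the targets internally and
# returns the causal language modeling (CLM) loss as the first output.
with torch.no_grad():
    loss = model(input_ids, labels=input_ids)[0]
print("perplexity:", torch.exp(loss).item())
```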
+ +Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT +to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa +are fine-tuned using a masked language modeling (MLM) loss. + +Before running the following example, you should get a file that contains text on which the language model will be +trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/). + +We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains +text that will be used for evaluation. + +### GPT-2/GPT and causal language modeling + +The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before +the tokenization). The loss here is that of causal language modeling. + +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_language_modeling.py \ + --output_dir=output \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE +``` + +This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches +a score of ~20 perplexity once fine-tuned on the dataset. + +### RoBERTa/BERT and masked language modeling + +The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different +as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their +pre-training: masked language modeling. + +In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge +slightly slower (over-fitting takes more epochs). + +We use the `--mlm` flag so that the script may change its loss function. + +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_language_modeling.py \ + --output_dir=output \ + --model_type=roberta \ + --model_name_or_path=roberta-base \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --mlm +``` + +## Language generation + +Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py). + +Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL. +A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you +can try out the different models available in the library. + +Example usage: + +```bash +python run_generation.py \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 +``` + +## GLUE + +Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py). + +Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding +Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa. + +GLUE is made up of a total of 9 different tasks. 
We get the following results on the dev set of the benchmark with an +uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train +batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results +between different runs. We report the median on 5 runs (with different seeds) for each of the metrics. + +| Task | Metric | Result | +|-------|------------------------------|-------------| +| CoLA | Matthew's corr | 49.23 | +| SST-2 | Accuracy | 91.97 | +| MRPC | F1/Accuracy | 89.47/85.29 | +| STS-B | Person/Spearman corr. | 83.95/83.70 | +| QQP | Accuracy/F1 | 88.40/84.31 | +| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 | +| QNLI | Accuracy | 87.46 | +| RTE | Accuracy | 61.73 | +| WNLI | Accuracy | 45.07 | + +Some of these results are significantly different from the ones reported on the test set +of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite. + +Before running any one of these GLUE tasks you should download the +[GLUE data](https://gluebenchmark.com/tasks) by running +[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) +and unpack it to some directory `$GLUE_DIR`. + +```bash +export GLUE_DIR=/path/to/glue +export TASK_NAME=MRPC + +python run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/$TASK_NAME/ +``` + +where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI. + +The dev set results will be present within the text file `eval_results.txt` in the specified output_dir. +In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate +output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`. + +The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, +CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being +said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well, +since the data processor for each task inherits from the base class DataProcessor. + +### MRPC + +#### Fine-tuning example + +The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less +than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed. + +Before running any one of these GLUE tasks you should download the +[GLUE data](https://gluebenchmark.com/tasks) by running +[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) +and unpack it to some directory `$GLUE_DIR`. 
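If you prefer to script that download step, it can look roughly like the following; the helper name and flags are taken from the gist linked above, so double-check them against the current version of that script:

```bash
export GLUE_DIR=/path/to/glue

# download_glue_data.py is the helper from the gist linked above (flags assumed from that script)
python download_glue_data.py --data_dir "$GLUE_DIR" --tasks MRPC
```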
+ +```bash +export GLUE_DIR=/path/to/glue + +python run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ +``` + +Our test ran on a few seeds with [the original implementation hyper- +parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation +results between 84% and 88%. + +#### Using Apex and mixed-precision + +Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install +[apex](https://github.com/NVIDIA/apex), then run the following example: + +```bash +export GLUE_DIR=/path/to/glue + +python run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ \ + --fp16 +``` + +#### Distributed training + +Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it +reaches F1 > 92 on MRPC. + +```bash +export GLUE_DIR=/path/to/glue + +python -m torch.distributed.launch \ + --nproc_per_node 8 run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 8 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ +``` + +Training with these hyper-parameters gave us the following results: + +```bash +acc = 0.8823529411764706 +acc_and_f1 = 0.901702786377709 +eval_loss = 0.3418912578906332 +f1 = 0.9210526315789473 +global_step = 174 +loss = 0.07231863956341798 +``` + +### MNLI + +The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task. + +```bash +export GLUE_DIR=/path/to/glue + +python -m torch.distributed.launch \ + --nproc_per_node 8 run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name mnli \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MNLI/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 8 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir output_dir \ +``` + +The results are the following: + +```bash +***** Eval results ***** + acc = 0.8679706601466992 + eval_loss = 0.4911287787382479 + global_step = 18408 + loss = 0.04755385363816904 + +***** Eval results ***** + acc = 0.8747965825874695 + eval_loss = 0.45516540421714036 + global_step = 18408 + loss = 0.04755385363816904 +``` + +## Multiple Choice + +Based on the script [`run_multiple_choice.py`](). 
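Before the SWAG recipe, a short sketch of how a multiple-choice head consumes its inputs: every candidate ending is paired with the context and the model emits one score per choice. The model, text and padding below are illustrative only, and the classification head is randomly initialised until it is fine-tuned, so the printed prediction is not meaningful yet.

```python
import torch
from transformers import RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMultipleChoice.from_pretrained("roberta-base")

context = "She opened the fridge and"
endings = ["poured a glass of milk.", "boarded the airplane.", "painted the ceiling."]

# Each (context, ending) pair is encoded as one sequence; all sequences for an
# example are stacked into shape (batch_size, num_choices, seq_len).
encoded = [tokenizer.encode(context, ending, add_special_tokens=True) for ending in endings]
max_len = max(len(ids) for ids in encoded)
input_ids = torch.tensor(
    [ids + [tokenizer.pad_token_id] * (max_len - len(ids)) for ids in encoded]
).unsqueeze(0)  # (1, num_choices, seq_len); attention mask omitted for brevity

logits = model(input_ids)[0]  # (1, num_choices): one score per candidate ending
print(endings[logits.argmax(-1).item()])
```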
+ +#### Fine-tuning on SWAG +Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data + +```bash +#training on 4 tesla V100(16GB) GPUS +export SWAG_DIR=/path/to/swag_data_dir +python ./examples/run_multiple_choice.py \ +--task_name swag \ +--model_name_or_path roberta-base \ +--do_train \ +--do_eval \ +--data_dir $SWAG_DIR \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--max_seq_length 80 \ +--output_dir models_bert/swag_base \ +--per_gpu_eval_batch_size=16 \ +--per_gpu_train_batch_size=16 \ +--gradient_accumulation_steps 2 \ +--overwrite_output +``` +Training with the defined hyper-parameters yields the following results: +``` +***** Eval results ***** +eval_acc = 0.8338998300509847 +eval_loss = 0.44457291918821606 +``` + +## SQuAD + +Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py). + +#### Fine-tuning BERT on SQuAD1.0 + +This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) +on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a +$SQUAD_DIR directory. + +* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) +* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) +* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) + +And for SQuAD2.0, you need to download: + +- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) +- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) +- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) + +```bash +export SQUAD_DIR=/path/to/SQUAD + +python run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --do_train \ + --do_eval \ + --train_file $SQUAD_DIR/train-v1.1.json \ + --predict_file $SQUAD_DIR/dev-v1.1.json \ + --per_gpu_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2.0 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ +``` + +Training with the previously defined hyper-parameters yields the following results: + +```bash +f1 = 88.52 +exact_match = 81.22 +``` + +#### Distributed training + + +Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1: + +```bash +python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \ + --model_type bert \ + --model_name_or_path bert-large-uncased-whole-word-masking \ + --do_train \ + --do_eval \ + --train_file $SQUAD_DIR/train-v1.1.json \ + --predict_file $SQUAD_DIR/dev-v1.1.json \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \ + --per_gpu_eval_batch_size=3 \ + --per_gpu_train_batch_size=3 \ +``` + +Training with the previously defined hyper-parameters yields the following results: + +```bash +f1 = 93.15 +exact_match = 86.91 +``` + +This fine-tuned model is available as a checkpoint under the reference +`bert-large-uncased-whole-word-masking-finetuned-squad`. + +#### Fine-tuning XLNet on SQuAD + +This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD . 
+ +##### Command for SQuAD1.0: + +```bash +export SQUAD_DIR=/path/to/SQUAD + +python run_squad.py \ + --model_type xlnet \ + --model_name_or_path xlnet-large-cased \ + --do_train \ + --do_eval \ + --train_file $SQUAD_DIR/train-v1.1.json \ + --predict_file $SQUAD_DIR/dev-v1.1.json \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir ./wwm_cased_finetuned_squad/ \ + --per_gpu_eval_batch_size=4 \ + --per_gpu_train_batch_size=4 \ + --save_steps 5000 +``` + +##### Command for SQuAD2.0: + +```bash +export SQUAD_DIR=/path/to/SQUAD + +python run_squad.py \ + --model_type xlnet \ + --model_name_or_path xlnet-large-cased \ + --do_train \ + --do_eval \ + --version_2_with_negative \ + --train_file $SQUAD_DIR/train-v2.0.json \ + --predict_file $SQUAD_DIR/dev-v2.0.json \ + --learning_rate 3e-5 \ + --num_train_epochs 4 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir ./wwm_cased_finetuned_squad/ \ + --per_gpu_eval_batch_size=2 \ + --per_gpu_train_batch_size=2 \ + --save_steps 5000 +``` + +Larger batch size may improve the performance while costing more memory. + +##### Results for SQuAD1.0 with the previously defined hyper-parameters: + +```python +{ +"exact": 85.45884578997162, +"f1": 92.5974600601065, +"total": 10570, +"HasAns_exact": 85.45884578997162, +"HasAns_f1": 92.59746006010651, +"HasAns_total": 10570 +} +``` + +##### Results for SQuAD2.0 with the previously defined hyper-parameters: + +```python +{ +"exact": 80.4177545691906, +"f1": 84.07154997729623, +"total": 11873, +"HasAns_exact": 76.73751686909581, +"HasAns_f1": 84.05558584352873, +"HasAns_total": 5928, +"NoAns_exact": 84.0874684608915, +"NoAns_f1": 84.0874684608915, +"NoAns_total": 5945 +} +``` + + + + +## XNLI + +Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py). + +[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili). + +#### Fine-tuning on XNLI + +This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins +on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a +`$XNLI_DIR` directory. + +* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip) +* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip) + +```bash +export XNLI_DIR=/path/to/XNLI + +python run_xnli.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-cased \ + --language de \ + --train_language en \ + --do_train \ + --do_eval \ + --data_dir $XNLI_DIR \ + --per_gpu_train_batch_size 32 \ + --learning_rate 5e-5 \ + --num_train_epochs 2.0 \ + --max_seq_length 128 \ + --output_dir /tmp/debug_xnli/ \ + --save_steps -1 +``` + +Training with the previously defined hyper-parameters yields the following results on the **test** set: + +```bash +acc = 0.7093812375249501 +``` + +## MM-IMDb + +Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py). 
+ +[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata. + +### Training on MM-IMDb + +``` +python run_mmimdb.py \ + --data_dir /path/to/mmimdb/dataset/ \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --output_dir /path/to/save/dir/ \ + --do_train \ + --do_eval \ + --max_seq_len 512 \ + --gradient_accumulation_steps 20 \ + --num_image_embeds 3 \ + --num_train_epochs 100 \ + --patience 5 +``` + +## Adversarial evaluation of model performances + +Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi). + +The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans). + +This is an example of using test_hans.py: + +```bash +export HANS_DIR=path-to-hans +export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc +export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py + +python examples/hans/test_hans.py \ + --task_name hans \ + --model_type $MODEL_TYPE \ + --do_eval \ + --data_dir $HANS_DIR \ + --model_name_or_path $MODEL_PATH \ + --max_seq_length 128 \ + --output_dir $MODEL_PATH \ +``` + +This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset. + +The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows: + +```bash +Heuristic entailed results: +lexical_overlap: 0.9702 +subsequence: 0.9942 +constituent: 0.9962 + +Heuristic non-entailed results: +lexical_overlap: 0.199 +subsequence: 0.0396 +constituent: 0.118 +``` diff --git a/docs/source/main_classes/processors.rst b/docs/source/main_classes/processors.rst index 353a17810d4..6452c601bef 100644 --- a/docs/source/main_classes/processors.rst +++ b/docs/source/main_classes/processors.rst @@ -54,7 +54,7 @@ Additionally, the following method can be used to load values from a data file Example usage ^^^^^^^^^^^^^^^^^^^^^^^^^ -An example using these processors is given in the `run_glue.py `__ script. +An example using these processors is given in the `run_glue.py `__ script. XNLI diff --git a/docs/source/usage.rst b/docs/source/usage.rst index 5f7147e9c36..f3cdf56132d 100644 --- a/docs/source/usage.rst +++ b/docs/source/usage.rst @@ -44,7 +44,7 @@ Sequence Classification Sequence classification is the task of classifying sequences according to a given number of classes. An example of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a model on a GLUE sequence classification task, you may leverage the -`run_glue.py `_ or +`run_glue.py `_ or `run_tf_glue.py `_ scripts. Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative. 
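(The pipeline call that sentence introduces sits outside this hunk; for reference, a minimal sketch of it, where the task string selects whatever default model the library ships for sentiment analysis:)

```python
from transformers import pipeline

# "sentiment-analysis" picks the library's default fine-tuned classifier for this task.
nlp = pipeline("sentiment-analysis")
print(nlp("I really enjoyed this movie!"))
# -> one dict per input, e.g. [{'label': 'POSITIVE', 'score': ...}]
```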
diff --git a/examples/README.md b/examples/README.md index 69039263766..fbad5f0d305 100644 --- a/examples/README.md +++ b/examples/README.md @@ -15,7 +15,7 @@ pip install -r ./examples/requirements.txt ``` | Section | Description | -|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------ +|----------------------------|----------------------------------------------------- | [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. | | [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. | | [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. | @@ -27,623 +27,3 @@ pip install -r ./examples/requirements.txt | [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. | | [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) | -## TensorFlow 2.0 Bert models on GLUE - -Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py). - -Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/). - -This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime. -Options are toggled using `USE_XLA` or `USE_AMP` variables in the script. -These options and the below benchmark are provided by @tlkh. - -Quick benchmarks from the script (no other modifications): - -| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) | -| --------- | -------- | ----------------------- | ----------------------| -| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 | -| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 | -| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 | -| V100 | AMP | 22s | 0.8646/0.8385/0.8411 | -| 1080 Ti | FP32 | 55s | - | - -Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used). - -## Running on TPUs - -You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this -[README](https://github.com/pytorch/xla/blob/master/README.md). - -The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are -identical to your normal GPU + Huggingface setup. - -### GLUE - -Before running anyone of these GLUE tasks you should download the -[GLUE data](https://gluebenchmark.com/tasks) by running -[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) -and unpack it to some directory `$GLUE_DIR`. 
- -For running your GLUE task on MNLI dataset you can run something like the following: - -``` -export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" -export GLUE_DIR=/path/to/glue -export TASK_NAME=MNLI - -python run_glue_tpu.py \ - --model_type bert \ - --model_name_or_path bert-base-cased \ - --task_name $TASK_NAME \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/$TASK_NAME \ - --max_seq_length 128 \ - --train_batch_size 32 \ - --learning_rate 3e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/$TASK_NAME \ - --overwrite_output_dir \ - --logging_steps 50 \ - --save_steps 200 \ - --num_cores=8 \ - --only_log_master -``` - - -## Language model training - -Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py). - -Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT -to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa -are fine-tuned using a masked language modeling (MLM) loss. - -Before running the following example, you should get a file that contains text on which the language model will be -trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/). - -We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains -text that will be used for evaluation. - -### GPT-2/GPT and causal language modeling - -The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before -the tokenization). The loss here is that of causal language modeling. - -```bash -export TRAIN_FILE=/path/to/dataset/wiki.train.raw -export TEST_FILE=/path/to/dataset/wiki.test.raw - -python run_language_modeling.py \ - --output_dir=output \ - --model_type=gpt2 \ - --model_name_or_path=gpt2 \ - --do_train \ - --train_data_file=$TRAIN_FILE \ - --do_eval \ - --eval_data_file=$TEST_FILE -``` - -This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches -a score of ~20 perplexity once fine-tuned on the dataset. - -### RoBERTa/BERT and masked language modeling - -The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different -as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their -pre-training: masked language modeling. - -In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge -slightly slower (over-fitting takes more epochs). - -We use the `--mlm` flag so that the script may change its loss function. - -```bash -export TRAIN_FILE=/path/to/dataset/wiki.train.raw -export TEST_FILE=/path/to/dataset/wiki.test.raw - -python run_language_modeling.py \ - --output_dir=output \ - --model_type=roberta \ - --model_name_or_path=roberta-base \ - --do_train \ - --train_data_file=$TRAIN_FILE \ - --do_eval \ - --eval_data_file=$TEST_FILE \ - --mlm -``` - -## Language generation - -Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py). - -Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL. 
-A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you -can try out the different models available in the library. - -Example usage: - -```bash -python run_generation.py \ - --model_type=gpt2 \ - --model_name_or_path=gpt2 -``` - -## GLUE - -Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py). - -Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding -Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa. - -GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an -uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train -batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results -between different runs. We report the median on 5 runs (with different seeds) for each of the metrics. - -| Task | Metric | Result | -|-------|------------------------------|-------------| -| CoLA | Matthew's corr | 49.23 | -| SST-2 | Accuracy | 91.97 | -| MRPC | F1/Accuracy | 89.47/85.29 | -| STS-B | Person/Spearman corr. | 83.95/83.70 | -| QQP | Accuracy/F1 | 88.40/84.31 | -| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 | -| QNLI | Accuracy | 87.46 | -| RTE | Accuracy | 61.73 | -| WNLI | Accuracy | 45.07 | - -Some of these results are significantly different from the ones reported on the test set -of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite. - -Before running any one of these GLUE tasks you should download the -[GLUE data](https://gluebenchmark.com/tasks) by running -[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) -and unpack it to some directory `$GLUE_DIR`. - -```bash -export GLUE_DIR=/path/to/glue -export TASK_NAME=MRPC - -python run_glue.py \ - --model_type bert \ - --model_name_or_path bert-base-cased \ - --task_name $TASK_NAME \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/$TASK_NAME \ - --max_seq_length 128 \ - --per_gpu_train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/$TASK_NAME/ -``` - -where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI. - -The dev set results will be present within the text file `eval_results.txt` in the specified output_dir. -In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate -output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`. - -The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, -CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being -said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well, -since the data processor for each task inherits from the base class DataProcessor. - -### MRPC - -#### Fine-tuning example - -The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less -than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed. 
- -Before running any one of these GLUE tasks you should download the -[GLUE data](https://gluebenchmark.com/tasks) by running -[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) -and unpack it to some directory `$GLUE_DIR`. - -```bash -export GLUE_DIR=/path/to/glue - -python run_glue.py \ - --model_name_or_path bert-base-cased \ - --task_name MRPC \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/MRPC/ \ - --max_seq_length 128 \ - --per_gpu_train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ -``` - -Our test ran on a few seeds with [the original implementation hyper- -parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation -results between 84% and 88%. - -#### Using Apex and mixed-precision - -Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install -[apex](https://github.com/NVIDIA/apex), then run the following example: - -```bash -export GLUE_DIR=/path/to/glue - -python run_glue.py \ - --model_name_or_path bert-base-cased \ - --task_name MRPC \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/MRPC/ \ - --max_seq_length 128 \ - --per_gpu_train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ \ - --fp16 -``` - -#### Distributed training - -Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it -reaches F1 > 92 on MRPC. - -```bash -export GLUE_DIR=/path/to/glue - -python -m torch.distributed.launch \ - --nproc_per_node 8 run_glue.py \ - --model_name_or_path bert-base-cased \ - --task_name MRPC \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/MRPC/ \ - --max_seq_length 128 \ - --per_gpu_train_batch_size 8 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir /tmp/mrpc_output/ -``` - -Training with these hyper-parameters gave us the following results: - -```bash -acc = 0.8823529411764706 -acc_and_f1 = 0.901702786377709 -eval_loss = 0.3418912578906332 -f1 = 0.9210526315789473 -global_step = 174 -loss = 0.07231863956341798 -``` - -### MNLI - -The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task. - -```bash -export GLUE_DIR=/path/to/glue - -python -m torch.distributed.launch \ - --nproc_per_node 8 run_glue.py \ - --model_name_or_path bert-base-cased \ - --task_name mnli \ - --do_train \ - --do_eval \ - --data_dir $GLUE_DIR/MNLI/ \ - --max_seq_length 128 \ - --per_gpu_train_batch_size 8 \ - --learning_rate 2e-5 \ - --num_train_epochs 3.0 \ - --output_dir output_dir \ -``` - -The results are the following: - -```bash -***** Eval results ***** - acc = 0.8679706601466992 - eval_loss = 0.4911287787382479 - global_step = 18408 - loss = 0.04755385363816904 - -***** Eval results ***** - acc = 0.8747965825874695 - eval_loss = 0.45516540421714036 - global_step = 18408 - loss = 0.04755385363816904 -``` - -## Multiple Choice - -Based on the script [`run_multiple_choice.py`](). 
- -#### Fine-tuning on SWAG -Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data - -```bash -#training on 4 tesla V100(16GB) GPUS -export SWAG_DIR=/path/to/swag_data_dir -python ./examples/run_multiple_choice.py \ ---task_name swag \ ---model_name_or_path roberta-base \ ---do_train \ ---do_eval \ ---data_dir $SWAG_DIR \ ---learning_rate 5e-5 \ ---num_train_epochs 3 \ ---max_seq_length 80 \ ---output_dir models_bert/swag_base \ ---per_gpu_eval_batch_size=16 \ ---per_gpu_train_batch_size=16 \ ---gradient_accumulation_steps 2 \ ---overwrite_output -``` -Training with the defined hyper-parameters yields the following results: -``` -***** Eval results ***** -eval_acc = 0.8338998300509847 -eval_loss = 0.44457291918821606 -``` - -## SQuAD - -Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py). - -#### Fine-tuning BERT on SQuAD1.0 - -This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) -on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a -$SQUAD_DIR directory. - -* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) -* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) -* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) - -And for SQuAD2.0, you need to download: - -- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) -- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) -- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) - -```bash -export SQUAD_DIR=/path/to/SQUAD - -python run_squad.py \ - --model_type bert \ - --model_name_or_path bert-base-uncased \ - --do_train \ - --do_eval \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --per_gpu_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 2.0 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ -``` - -Training with the previously defined hyper-parameters yields the following results: - -```bash -f1 = 88.52 -exact_match = 81.22 -``` - -#### Distributed training - - -Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1: - -```bash -python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \ - --model_type bert \ - --model_name_or_path bert-large-uncased-whole-word-masking \ - --do_train \ - --do_eval \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \ - --per_gpu_eval_batch_size=3 \ - --per_gpu_train_batch_size=3 \ -``` - -Training with the previously defined hyper-parameters yields the following results: - -```bash -f1 = 93.15 -exact_match = 86.91 -``` - -This fine-tuned model is available as a checkpoint under the reference -`bert-large-uncased-whole-word-masking-finetuned-squad`. - -#### Fine-tuning XLNet on SQuAD - -This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD . 
- -##### Command for SQuAD1.0: - -```bash -export SQUAD_DIR=/path/to/SQUAD - -python run_squad.py \ - --model_type xlnet \ - --model_name_or_path xlnet-large-cased \ - --do_train \ - --do_eval \ - --train_file $SQUAD_DIR/train-v1.1.json \ - --predict_file $SQUAD_DIR/dev-v1.1.json \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir ./wwm_cased_finetuned_squad/ \ - --per_gpu_eval_batch_size=4 \ - --per_gpu_train_batch_size=4 \ - --save_steps 5000 -``` - -##### Command for SQuAD2.0: - -```bash -export SQUAD_DIR=/path/to/SQUAD - -python run_squad.py \ - --model_type xlnet \ - --model_name_or_path xlnet-large-cased \ - --do_train \ - --do_eval \ - --version_2_with_negative \ - --train_file $SQUAD_DIR/train-v2.0.json \ - --predict_file $SQUAD_DIR/dev-v2.0.json \ - --learning_rate 3e-5 \ - --num_train_epochs 4 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir ./wwm_cased_finetuned_squad/ \ - --per_gpu_eval_batch_size=2 \ - --per_gpu_train_batch_size=2 \ - --save_steps 5000 -``` - -Larger batch size may improve the performance while costing more memory. - -##### Results for SQuAD1.0 with the previously defined hyper-parameters: - -```python -{ -"exact": 85.45884578997162, -"f1": 92.5974600601065, -"total": 10570, -"HasAns_exact": 85.45884578997162, -"HasAns_f1": 92.59746006010651, -"HasAns_total": 10570 -} -``` - -##### Results for SQuAD2.0 with the previously defined hyper-parameters: - -```python -{ -"exact": 80.4177545691906, -"f1": 84.07154997729623, -"total": 11873, -"HasAns_exact": 76.73751686909581, -"HasAns_f1": 84.05558584352873, -"HasAns_total": 5928, -"NoAns_exact": 84.0874684608915, -"NoAns_f1": 84.0874684608915, -"NoAns_total": 5945 -} -``` - - - - -## XNLI - -Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py). - -[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili). - -#### Fine-tuning on XNLI - -This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins -on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a -`$XNLI_DIR` directory. - -* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip) -* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip) - -```bash -export XNLI_DIR=/path/to/XNLI - -python run_xnli.py \ - --model_type bert \ - --model_name_or_path bert-base-multilingual-cased \ - --language de \ - --train_language en \ - --do_train \ - --do_eval \ - --data_dir $XNLI_DIR \ - --per_gpu_train_batch_size 32 \ - --learning_rate 5e-5 \ - --num_train_epochs 2.0 \ - --max_seq_length 128 \ - --output_dir /tmp/debug_xnli/ \ - --save_steps -1 -``` - -Training with the previously defined hyper-parameters yields the following results on the **test** set: - -```bash -acc = 0.7093812375249501 -``` - -## MM-IMDb - -Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/mm-imdb/run_mmimdb.py). 
- -[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata. - -### Training on MM-IMDb - -``` -python run_mmimdb.py \ - --data_dir /path/to/mmimdb/dataset/ \ - --model_type bert \ - --model_name_or_path bert-base-uncased \ - --output_dir /path/to/save/dir/ \ - --do_train \ - --do_eval \ - --max_seq_len 512 \ - --gradient_accumulation_steps 20 \ - --num_image_embeds 3 \ - --num_train_epochs 100 \ - --patience 5 -``` - -## Adversarial evaluation of model performances - -Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi). - -The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans). - -This is an example of using test_hans.py: - -```bash -export HANS_DIR=path-to-hans -export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc -export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py - -python examples/hans/test_hans.py \ - --task_name hans \ - --model_type $MODEL_TYPE \ - --do_eval \ - --data_dir $HANS_DIR \ - --model_name_or_path $MODEL_PATH \ - --max_seq_length 128 \ - --output_dir $MODEL_PATH \ -``` - -This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset. - -The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows: - -```bash -Heuristic entailed results: -lexical_overlap: 0.9702 -subsequence: 0.9942 -constituent: 0.9962 - -Heuristic non-entailed results: -lexical_overlap: 0.199 -subsequence: 0.0396 -constituent: 0.118 -``` diff --git a/examples/adversarial/README.md b/examples/adversarial/README.md new file mode 100644 index 00000000000..824867fd267 --- /dev/null +++ b/examples/adversarial/README.md @@ -0,0 +1,38 @@ +## Adversarial evaluation of model performances + +Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi). + +The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans). + +This is an example of using test_hans.py: + +```bash +export HANS_DIR=path-to-hans +export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc +export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py + +python examples/hans/test_hans.py \ + --task_name hans \ + --model_type $MODEL_TYPE \ + --do_eval \ + --data_dir $HANS_DIR \ + --model_name_or_path $MODEL_PATH \ + --max_seq_length 128 \ + --output_dir $MODEL_PATH \ +``` + +This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset. 
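That evaluation step might look roughly like this; the exact invocation is an assumption, so check the HANS repository for the script's actual arguments:

```bash
# Assumed invocation of the HANS evaluation script -- verify against the HANS repository.
python evaluate_heur_output.py $MODEL_PATH/hans_predictions.txt
```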
+ +The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows: + +```bash +Heuristic entailed results: +lexical_overlap: 0.9702 +subsequence: 0.9942 +constituent: 0.9962 + +Heuristic non-entailed results: +lexical_overlap: 0.199 +subsequence: 0.0396 +constituent: 0.118 +``` diff --git a/examples/hans/hans_processors.py b/examples/adversarial/hans_processors.py similarity index 100% rename from examples/hans/hans_processors.py rename to examples/adversarial/hans_processors.py diff --git a/examples/hans/test_hans.py b/examples/adversarial/test_hans.py similarity index 100% rename from examples/hans/test_hans.py rename to examples/adversarial/test_hans.py diff --git a/examples/hans/utils_hans.py b/examples/adversarial/utils_hans.py similarity index 100% rename from examples/hans/utils_hans.py rename to examples/adversarial/utils_hans.py diff --git a/examples/run_bertology.py b/examples/bertology/run_bertology.py similarity index 100% rename from examples/run_bertology.py rename to examples/bertology/run_bertology.py diff --git a/examples/contrib/mm-imdb/README.md b/examples/contrib/mm-imdb/README.md new file mode 100644 index 00000000000..eeef3a2ccd7 --- /dev/null +++ b/examples/contrib/mm-imdb/README.md @@ -0,0 +1,23 @@ +## MM-IMDb + +Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py). + +[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata. + +### Training on MM-IMDb + +``` +python run_mmimdb.py \ + --data_dir /path/to/mmimdb/dataset/ \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --output_dir /path/to/save/dir/ \ + --do_train \ + --do_eval \ + --max_seq_len 512 \ + --gradient_accumulation_steps 20 \ + --num_image_embeds 3 \ + --num_train_epochs 100 \ + --patience 5 +``` + diff --git a/examples/mm-imdb/run_mmimdb.py b/examples/contrib/mm-imdb/run_mmimdb.py similarity index 100% rename from examples/mm-imdb/run_mmimdb.py rename to examples/contrib/mm-imdb/run_mmimdb.py diff --git a/examples/mm-imdb/utils_mmimdb.py b/examples/contrib/mm-imdb/utils_mmimdb.py similarity index 100% rename from examples/mm-imdb/utils_mmimdb.py rename to examples/contrib/mm-imdb/utils_mmimdb.py diff --git a/examples/glue/README.md b/examples/glue/README.md deleted file mode 100644 index 654e170a766..00000000000 --- a/examples/glue/README.md +++ /dev/null @@ -1,9 +0,0 @@ -# GLUE Benchmark - -Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py). - -#### Run PyTorch version using PyTorch-Lightning - -Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download, pre-process the data and run the specified models. Logs are saved in `lightning_logs` directory. - -Pass `--n_gpu` flag to change the number of GPUs. Default uses 1. 
At the end, the expected results are: `TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}` \ No newline at end of file diff --git a/examples/language-modeling/README.md b/examples/language-modeling/README.md new file mode 100644 index 00000000000..e2dc600e090 --- /dev/null +++ b/examples/language-modeling/README.md @@ -0,0 +1,63 @@ + +## Language model training + +Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py). + +Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT +to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa +are fine-tuned using a masked language modeling (MLM) loss. + +Before running the following example, you should get a file that contains text on which the language model will be +trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/). + +We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains +text that will be used for evaluation. + +### GPT-2/GPT and causal language modeling + +The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before +the tokenization). The loss here is that of causal language modeling. + +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_language_modeling.py \ + --output_dir=output \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE +``` + +This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches +a score of ~20 perplexity once fine-tuned on the dataset. + +### RoBERTa/BERT and masked language modeling + +The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different +as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their +pre-training: masked language modeling. + +In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge +slightly slower (over-fitting takes more epochs). + +We use the `--mlm` flag so that the script may change its loss function. 
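To make the dynamic-masking point concrete before the command below, a simplified sketch: the masked positions are re-drawn every time a batch is built, so each epoch sees a different mask. The real script additionally skips special tokens and uses BERT's 80/10/10 mask/keep/random split, both omitted here.

```python
import torch
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def dynamically_mask(input_ids, mlm_probability=0.15):
    """Re-sample the masked positions on every call, so each epoch sees a different mask."""
    labels = input_ids.clone()
    mask = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~mask] = -100                    # only masked positions contribute to the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = tokenizer.mask_token_id
    return masked_inputs, labels

ids = torch.tensor([tokenizer.encode("Dynamic masking re-samples the mask for every batch.")])
print(dynamically_mask(ids)[0])  # different positions are masked on every call
```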
+ +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_language_modeling.py \ + --output_dir=output \ + --model_type=roberta \ + --model_name_or_path=roberta-base \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --mlm +``` + diff --git a/examples/run_language_modeling.py b/examples/language-modeling/run_language_modeling.py similarity index 100% rename from examples/run_language_modeling.py rename to examples/language-modeling/run_language_modeling.py diff --git a/examples/transformer_base.py b/examples/lightning_base.py similarity index 100% rename from examples/transformer_base.py rename to examples/lightning_base.py diff --git a/examples/multiple-choice/README.md b/examples/multiple-choice/README.md new file mode 100644 index 00000000000..34721daf839 --- /dev/null +++ b/examples/multiple-choice/README.md @@ -0,0 +1,31 @@ +## Multiple Choice + +Based on the script [`run_multiple_choice.py`](). + +#### Fine-tuning on SWAG +Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data + +```bash +#training on 4 tesla V100(16GB) GPUS +export SWAG_DIR=/path/to/swag_data_dir +python ./examples/run_multiple_choice.py \ +--task_name swag \ +--model_name_or_path roberta-base \ +--do_train \ +--do_eval \ +--data_dir $SWAG_DIR \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--max_seq_length 80 \ +--output_dir models_bert/swag_base \ +--per_gpu_eval_batch_size=16 \ +--per_gpu_train_batch_size=16 \ +--gradient_accumulation_steps 2 \ +--overwrite_output +``` +Training with the defined hyper-parameters yields the following results: +``` +***** Eval results ***** +eval_acc = 0.8338998300509847 +eval_loss = 0.44457291918821606 +``` diff --git a/examples/run_multiple_choice.py b/examples/multiple-choice/run_multiple_choice.py similarity index 100% rename from examples/run_multiple_choice.py rename to examples/multiple-choice/run_multiple_choice.py diff --git a/examples/utils_multiple_choice.py b/examples/multiple-choice/utils_multiple_choice.py similarity index 100% rename from examples/utils_multiple_choice.py rename to examples/multiple-choice/utils_multiple_choice.py diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md new file mode 100644 index 00000000000..715a33caa5a --- /dev/null +++ b/examples/question-answering/README.md @@ -0,0 +1,159 @@ + + +## SQuAD + +Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py). + +#### Fine-tuning BERT on SQuAD1.0 + +This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) +on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a +$SQUAD_DIR directory. 
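The same files, whose links are listed just below, can also be fetched from the command line, for example:

```bash
mkdir -p "$SQUAD_DIR"
wget -P "$SQUAD_DIR" https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget -P "$SQUAD_DIR" https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
```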
+ +* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) +* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) +* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) + +And for SQuAD2.0, you need to download: + +- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) +- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) +- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) + +```bash +export SQUAD_DIR=/path/to/SQUAD + +python run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --do_train \ + --do_eval \ + --train_file $SQUAD_DIR/train-v1.1.json \ + --predict_file $SQUAD_DIR/dev-v1.1.json \ + --per_gpu_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2.0 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ +``` + +Training with the previously defined hyper-parameters yields the following results: + +```bash +f1 = 88.52 +exact_match = 81.22 +``` + +#### Distributed training + + +Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1: + +```bash +python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \ + --model_type bert \ + --model_name_or_path bert-large-uncased-whole-word-masking \ + --do_train \ + --do_eval \ + --train_file $SQUAD_DIR/train-v1.1.json \ + --predict_file $SQUAD_DIR/dev-v1.1.json \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \ + --per_gpu_eval_batch_size=3 \ + --per_gpu_train_batch_size=3 \ +``` + +Training with the previously defined hyper-parameters yields the following results: + +```bash +f1 = 93.15 +exact_match = 86.91 +``` + +This fine-tuned model is available as a checkpoint under the reference +`bert-large-uncased-whole-word-masking-finetuned-squad`. + +#### Fine-tuning XLNet on SQuAD + +This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD . + +##### Command for SQuAD1.0: + +```bash +export SQUAD_DIR=/path/to/SQUAD + +python run_squad.py \ + --model_type xlnet \ + --model_name_or_path xlnet-large-cased \ + --do_train \ + --do_eval \ + --train_file $SQUAD_DIR/train-v1.1.json \ + --predict_file $SQUAD_DIR/dev-v1.1.json \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir ./wwm_cased_finetuned_squad/ \ + --per_gpu_eval_batch_size=4 \ + --per_gpu_train_batch_size=4 \ + --save_steps 5000 +``` + +##### Command for SQuAD2.0: + +```bash +export SQUAD_DIR=/path/to/SQUAD + +python run_squad.py \ + --model_type xlnet \ + --model_name_or_path xlnet-large-cased \ + --do_train \ + --do_eval \ + --version_2_with_negative \ + --train_file $SQUAD_DIR/train-v2.0.json \ + --predict_file $SQUAD_DIR/dev-v2.0.json \ + --learning_rate 3e-5 \ + --num_train_epochs 4 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir ./wwm_cased_finetuned_squad/ \ + --per_gpu_eval_batch_size=2 \ + --per_gpu_train_batch_size=2 \ + --save_steps 5000 +``` + +Larger batch size may improve the performance while costing more memory. 
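+
+If a larger batch does not fit in memory, a comparable effective batch size can be obtained with gradient accumulation instead. For example, the SQuAD2.0 command above could be run with an effective batch size of 8 per GPU (2 x 4) by adding `--gradient_accumulation_steps` (these exact values are illustrative and were not benchmarked):
+
+```bash
+python run_squad.py \
+  --model_type xlnet \
+  --model_name_or_path xlnet-large-cased \
+  --do_train \
+  --do_eval \
+  --version_2_with_negative \
+  --train_file $SQUAD_DIR/train-v2.0.json \
+  --predict_file $SQUAD_DIR/dev-v2.0.json \
+  --learning_rate 3e-5 \
+  --num_train_epochs 4 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir ./wwm_cased_finetuned_squad/ \
+  --per_gpu_eval_batch_size=2 \
+  --per_gpu_train_batch_size=2 \
+  --gradient_accumulation_steps 4 \
+  --save_steps 5000
+```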
+ +##### Results for SQuAD1.0 with the previously defined hyper-parameters: + +```python +{ +"exact": 85.45884578997162, +"f1": 92.5974600601065, +"total": 10570, +"HasAns_exact": 85.45884578997162, +"HasAns_f1": 92.59746006010651, +"HasAns_total": 10570 +} +``` + +##### Results for SQuAD2.0 with the previously defined hyper-parameters: + +```python +{ +"exact": 80.4177545691906, +"f1": 84.07154997729623, +"total": 11873, +"HasAns_exact": 76.73751686909581, +"HasAns_f1": 84.05558584352873, +"HasAns_total": 5928, +"NoAns_exact": 84.0874684608915, +"NoAns_f1": 84.0874684608915, +"NoAns_total": 5945 +} +``` + diff --git a/examples/run_squad.py b/examples/question-answering/run_squad.py similarity index 100% rename from examples/run_squad.py rename to examples/question-answering/run_squad.py diff --git a/examples/requirements.txt b/examples/requirements.txt index 33eb2ace3c7..3e8717564e3 100644 --- a/examples/requirements.txt +++ b/examples/requirements.txt @@ -1,4 +1,3 @@ -tensorboardX tensorboard scikit-learn seqeval diff --git a/examples/run_tpu_glue.py b/examples/run_tpu_glue.py deleted file mode 100644 index 4f6e61bf533..00000000000 --- a/examples/run_tpu_glue.py +++ /dev/null @@ -1,611 +0,0 @@ -# coding=utf-8 -# Copyright 2019 The Google AI Language Team Authors and The HuggingFace Inc. team. -# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" Finetuning the library models for sequence classification on GLUE (Bert, DistilBert, XLNet, RoBERTa).""" - -from __future__ import absolute_import, division, print_function - -import argparse -import glob -import logging -import os -import random - -import numpy as np -import torch -import torch_xla.core.xla_model as xm -import torch_xla.debug.metrics as met -import torch_xla.distributed.parallel_loader as pl -import torch_xla.distributed.xla_multiprocessing as xmp -from torch.utils.data import DataLoader, RandomSampler, TensorDataset -from torch.utils.data.distributed import DistributedSampler -from tqdm import tqdm, trange - -from transformers import ( - WEIGHTS_NAME, - AdamW, - BertConfig, - BertForSequenceClassification, - BertTokenizer, - DistilBertConfig, - DistilBertForSequenceClassification, - DistilBertTokenizer, - RobertaConfig, - RobertaForSequenceClassification, - RobertaTokenizer, - XLMConfig, - XLMForSequenceClassification, - XLMTokenizer, - XLNetConfig, - XLNetForSequenceClassification, - XLNetTokenizer, - get_linear_schedule_with_warmup, -) -from transformers import glue_compute_metrics as compute_metrics -from transformers import glue_convert_examples_to_features as convert_examples_to_features -from transformers import glue_output_modes as output_modes -from transformers import glue_processors as processors - - -try: - # Only tensorboardX supports writing directly to gs:// - from tensorboardX import SummaryWriter -except ImportError: - from torch.utils.tensorboard import SummaryWriter - - -logger = logging.getLogger(__name__) - -ALL_MODELS = sum( - ( - tuple(conf.pretrained_config_archive_map.keys()) - for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig) - ), - (), -) - -MODEL_CLASSES = { - "bert": (BertConfig, BertForSequenceClassification, BertTokenizer), - "xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer), - "xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer), - "roberta": (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer), - "distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer), -} - - -def set_seed(seed): - random.seed(seed) - np.random.seed(seed) - torch.manual_seed(seed) - - -def get_sampler(dataset): - if xm.xrt_world_size() <= 1: - return RandomSampler(dataset) - return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()) - - -def train(args, train_dataset, model, tokenizer, disable_logging=False): - """ Train the model """ - if xm.is_master_ordinal(): - # Only master writes to Tensorboard - tb_writer = SummaryWriter(args.tensorboard_logdir) - - train_sampler = get_sampler(train_dataset) - dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) - - if args.max_steps > 0: - t_total = args.max_steps - args.num_train_epochs = args.max_steps // (len(dataloader) // args.gradient_accumulation_steps) + 1 - else: - t_total = len(dataloader) // args.gradient_accumulation_steps * args.num_train_epochs - - # Prepare optimizer and schedule (linear warmup and decay) - no_decay = ["bias", "LayerNorm.weight"] - optimizer_grouped_parameters = [ - { - "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], - "weight_decay": args.weight_decay, - }, - {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, - ] - optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) - 
scheduler = get_linear_schedule_with_warmup( - optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total, - ) - - # Train! - logger.info("***** Running training *****") - logger.info(" Num examples = %d", len(dataloader) * args.train_batch_size) - logger.info(" Num Epochs = %d", args.num_train_epochs) - logger.info(" Instantaneous batch size per TPU core = %d", args.train_batch_size) - logger.info( - " Total train batch size (w. parallel, distributed & accumulation) = %d", - (args.train_batch_size * args.gradient_accumulation_steps * xm.xrt_world_size()), - ) - logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) - logger.info(" Total optimization steps = %d", t_total) - - global_step = 0 - loss = None - model.zero_grad() - train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=disable_logging) - set_seed(args.seed) # Added here for reproductibility (even between python 2 and 3) - for epoch in train_iterator: - # tpu-comment: Get TPU parallel loader which sends data to TPU in background. - train_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device) - epoch_iterator = tqdm(train_dataloader, desc="Iteration", total=len(dataloader), disable=disable_logging) - for step, batch in enumerate(epoch_iterator): - - # Save model checkpoint. - if args.save_steps > 0 and global_step % args.save_steps == 0: - output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) - logger.info("Saving model checkpoint to %s", output_dir) - - if xm.is_master_ordinal(): - if not os.path.exists(output_dir): - os.makedirs(output_dir) - torch.save(args, os.path.join(output_dir, "training_args.bin")) - - # Barrier to wait for saving checkpoint. - xm.rendezvous("mid_training_checkpoint") - # model.save_pretrained needs to be called by all ordinals - model.save_pretrained(output_dir) - - model.train() - inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} - if args.model_type != "distilbert": - # XLM, DistilBERT and RoBERTa don't use segment_ids - inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None - outputs = model(**inputs) - loss = outputs[0] # model outputs are always tuple in transformers (see doc) - - if args.gradient_accumulation_steps > 1: - loss = loss / args.gradient_accumulation_steps - - loss.backward() - - if (step + 1) % args.gradient_accumulation_steps == 0: - torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) - - xm.optimizer_step(optimizer) - scheduler.step() # Update learning rate schedule - model.zero_grad() - global_step += 1 - - if args.logging_steps > 0 and global_step % args.logging_steps == 0: - # Log metrics. 
- results = {} - if args.evaluate_during_training: - results = evaluate(args, model, tokenizer, disable_logging=disable_logging) - loss_scalar = loss.item() - logger.info( - "global_step: {global_step}, lr: {lr:.6f}, loss: {loss:.3f}".format( - global_step=global_step, lr=scheduler.get_lr()[0], loss=loss_scalar - ) - ) - if xm.is_master_ordinal(): - # tpu-comment: All values must be in CPU and not on TPU device - for key, value in results.items(): - tb_writer.add_scalar("eval_{}".format(key), value, global_step) - tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step) - tb_writer.add_scalar("loss", loss_scalar, global_step) - - if args.max_steps > 0 and global_step > args.max_steps: - epoch_iterator.close() - break - if args.metrics_debug: - # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.) - xm.master_print(met.metrics_report()) - if args.max_steps > 0 and global_step > args.max_steps: - train_iterator.close() - break - - if xm.is_master_ordinal(): - tb_writer.close() - return global_step, loss.item() - - -def evaluate(args, model, tokenizer, prefix="", disable_logging=False): - """Evaluate the model""" - if xm.is_master_ordinal(): - # Only master writes to Tensorboard - tb_writer = SummaryWriter(args.tensorboard_logdir) - - # Loop to handle MNLI double evaluation (matched, mis-matched) - eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,) - eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,) - - results = {} - for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs): - eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True) - eval_sampler = get_sampler(eval_dataset) - - if not os.path.exists(eval_output_dir): - os.makedirs(eval_output_dir) - - dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, shuffle=False) - eval_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device) - - # Eval! 
- logger.info("***** Running evaluation {} *****".format(prefix)) - logger.info(" Num examples = %d", len(dataloader) * args.eval_batch_size) - logger.info(" Batch size = %d", args.eval_batch_size) - eval_loss = 0.0 - nb_eval_steps = 0 - preds = None - out_label_ids = None - for batch in tqdm(eval_dataloader, desc="Evaluating", disable=disable_logging): - model.eval() - - with torch.no_grad(): - inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} - if args.model_type != "distilbert": - # XLM, DistilBERT and RoBERTa don't use segment_ids - inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None - outputs = model(**inputs) - batch_eval_loss, logits = outputs[:2] - - eval_loss += batch_eval_loss - nb_eval_steps += 1 - if preds is None: - preds = logits.detach().cpu().numpy() - out_label_ids = inputs["labels"].detach().cpu().numpy() - else: - preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) - out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0) - - # tpu-comment: Get all predictions and labels from all worker shards of eval dataset - preds = xm.mesh_reduce("eval_preds", preds, np.concatenate) - out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate) - - eval_loss = eval_loss / nb_eval_steps - if args.output_mode == "classification": - preds = np.argmax(preds, axis=1) - elif args.output_mode == "regression": - preds = np.squeeze(preds) - result = compute_metrics(eval_task, preds, out_label_ids) - results.update(result) - results["eval_loss"] = eval_loss.item() - - output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt") - if xm.is_master_ordinal(): - with open(output_eval_file, "w") as writer: - logger.info("***** Eval results {} *****".format(prefix)) - for key in sorted(results.keys()): - logger.info(" %s = %s", key, str(results[key])) - writer.write("%s = %s\n" % (key, str(results[key]))) - tb_writer.add_scalar(f"{eval_task}/{key}", results[key]) - - if args.metrics_debug: - # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.) 
- xm.master_print(met.metrics_report()) - - if xm.is_master_ordinal(): - tb_writer.close() - - return results - - -def load_and_cache_examples(args, task, tokenizer, evaluate=False): - if not xm.is_master_ordinal(): - xm.rendezvous("load_and_cache_examples") - - processor = processors[task]() - output_mode = output_modes[task] - cached_features_file = os.path.join( - args.cache_dir, - "cached_{}_{}_{}_{}".format( - "dev" if evaluate else "train", - list(filter(None, args.model_name_or_path.split("/"))).pop(), - str(args.max_seq_length), - str(task), - ), - ) - - # Load data features from cache or dataset file - if os.path.exists(cached_features_file) and not args.overwrite_cache: - logger.info("Loading features from cached file %s", cached_features_file) - features = torch.load(cached_features_file) - else: - logger.info("Creating features from dataset file at %s", args.data_dir) - label_list = processor.get_labels() - if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta"]: - # HACK(label indices are swapped in RoBERTa pretrained model) - label_list[1], label_list[2] = label_list[2], label_list[1] - examples = ( - processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir) - ) - features = convert_examples_to_features( - examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode, - ) - logger.info("Saving features into cached file %s", cached_features_file) - torch.save(features, cached_features_file) - - if xm.is_master_ordinal(): - xm.rendezvous("load_and_cache_examples") - - # Convert to Tensors and build dataset - all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) - all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long) - all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long) - if output_mode == "classification": - all_labels = torch.tensor([f.label for f in features], dtype=torch.long) - elif output_mode == "regression": - all_labels = torch.tensor([f.label for f in features], dtype=torch.float) - - dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels) - return dataset - - -def main(args): - if ( - os.path.exists(args.output_dir) - and os.listdir(args.output_dir) - and args.do_train - and not args.overwrite_output_dir - ): - raise ValueError( - ( - "Output directory ({}) already exists and is not empty." " Use --overwrite_output_dir to overcome." - ).format(args.output_dir) - ) - - # tpu-comment: Get TPU/XLA Device - args.device = xm.xla_device() - - # Setup logging - logging.basicConfig( - format="[xla:{}] %(asctime)s - %(levelname)s - %(name)s - %(message)s".format(xm.get_ordinal()), - datefmt="%m/%d/%Y %H:%M:%S", - level=logging.INFO, - ) - disable_logging = False - if not xm.is_master_ordinal() and args.only_log_master: - # Disable all non-master loggers below CRITICAL. 
- logging.disable(logging.CRITICAL) - disable_logging = True - logger.warning("Process rank: %s, device: %s, num_cores: %s", xm.get_ordinal(), args.device, args.num_cores) - - # Set seed to have same initialization - set_seed(args.seed) - - # Prepare GLUE task - args.task_name = args.task_name.lower() - if args.task_name not in processors: - raise ValueError("Task not found: %s" % (args.task_name)) - processor = processors[args.task_name]() - args.output_mode = output_modes[args.task_name] - label_list = processor.get_labels() - num_labels = len(label_list) - - if not xm.is_master_ordinal(): - xm.rendezvous( - "download_only_once" - ) # Make sure only the first process in distributed training will download model & vocab - - # Load pretrained model and tokenizer - args.model_type = args.model_type.lower() - config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] - config = config_class.from_pretrained( - args.config_name if args.config_name else args.model_name_or_path, - num_labels=num_labels, - finetuning_task=args.task_name, - cache_dir=args.cache_dir if args.cache_dir else None, - xla_device=True, - ) - tokenizer = tokenizer_class.from_pretrained( - args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, - do_lower_case=args.do_lower_case, - cache_dir=args.cache_dir if args.cache_dir else None, - ) - model = model_class.from_pretrained( - args.model_name_or_path, - from_tf=bool(".ckpt" in args.model_name_or_path), - config=config, - cache_dir=args.cache_dir if args.cache_dir else None, - ) - - if xm.is_master_ordinal(): - xm.rendezvous("download_only_once") - - # Send model to TPU/XLA device. - model.to(args.device) - - logger.info("Training/evaluation parameters %s", args) - - if args.do_train: - # Train the model. - train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False) - global_step, tr_loss = train(args, train_dataset, model, tokenizer, disable_logging=disable_logging) - logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) - - if xm.is_master_ordinal(): - # Save trained model. - # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() - - # Create output directory if needed - if not os.path.exists(args.output_dir): - os.makedirs(args.output_dir) - - logger.info("Saving model checkpoint to %s", args.output_dir) - # Save a trained model, configuration and tokenizer using `save_pretrained()`. - # They can then be reloaded using `from_pretrained()` - tokenizer.save_pretrained(args.output_dir) - # Good practice: save your training arguments together with the trained. 
- torch.save(args, os.path.join(args.output_dir, "training_args.bin")) - - xm.rendezvous("post_training_checkpoint") - # model.save_pretrained needs to be called by all ordinals - model.save_pretrained(args.output_dir) - - # Load a trained model and vocabulary that you have fine-tuned - model = model_class.from_pretrained(args.output_dir) - tokenizer = tokenizer_class.from_pretrained(args.output_dir) - model.to(args.device) - - # Evaluation - results = {} - if args.do_eval: - tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) - checkpoints = [args.output_dir] - if args.eval_all_checkpoints: - checkpoints = list( - os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) - logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging - logger.info("Evaluate the following checkpoints: %s", checkpoints) - for checkpoint in checkpoints: - global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else "" - prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else "" - - model = model_class.from_pretrained(checkpoint) - model.to(args.device) - result = evaluate(args, model, tokenizer, prefix=prefix, disable_logging=disable_logging) - result = dict((k + "_{}".format(global_step), v) for k, v in result.items()) - results.update(result) - - return results - - -def get_args(): - parser = argparse.ArgumentParser() - - # Required parameters - parser.add_argument( - "--data_dir", - default=None, - type=str, - required=True, - help="The input data dir. Should contain the .tsv files (or other data files) for the task.", - ) - parser.add_argument( - "--model_type", - default=None, - type=str, - required=True, - help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), - ) - parser.add_argument( - "--model_name_or_path", - default=None, - type=str, - required=True, - help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS), - ) - parser.add_argument( - "--task_name", - default=None, - type=str, - required=True, - help="The name of the task to train selected in the list: " + ", ".join(processors.keys()), - ) - parser.add_argument( - "--output_dir", - default=None, - type=str, - required=True, - help="The output directory where the model predictions and checkpoints will be written.", - ) - - # TPU Parameters - parser.add_argument("--num_cores", default=8, type=int, help="Number of TPU cores to use (1 or 8).") - parser.add_argument("--metrics_debug", action="store_true", help="Whether to print debug metrics.") - - # Other parameters - parser.add_argument( - "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name" - ) - parser.add_argument( - "--tokenizer_name", - default="", - type=str, - help="Pretrained tokenizer name or path if not the same as model_name", - ) - parser.add_argument( - "--cache_dir", - default="", - type=str, - help="Where do you want to store the pre-trained models downloaded and features file generated", - ) - parser.add_argument( - "--max_seq_length", - default=128, - type=int, - help="The maximum total input sequence length after tokenization. 
Sequences longer " - "than this will be truncated, sequences shorter will be padded.", - ) - parser.add_argument("--do_train", action="store_true", help="Whether to run training.") - parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.") - parser.add_argument( - "--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step." - ) - parser.add_argument( - "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model." - ) - - parser.add_argument("--train_batch_size", default=8, type=int, help="Per core batch size for training.") - parser.add_argument("--eval_batch_size", default=8, type=int, help="Per core batch size for evaluation.") - parser.add_argument( - "--gradient_accumulation_steps", - type=int, - default=1, - help="Number of updates steps to accumulate before performing a backward/update pass.", - ) - parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.") - parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") - parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") - parser.add_argument( - "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform." - ) - parser.add_argument( - "--max_steps", - default=-1, - type=int, - help="If > 0: set total number of training steps to perform. Override num_train_epochs.", - ) - parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") - parser.add_argument("--tensorboard_logdir", default="./runs", type=str, help="Where to write tensorboard metrics.") - parser.add_argument("--logging_steps", type=int, default=50, help="Log every X update steps.") - parser.add_argument("--only_log_master", action="store_true", help="Whether to log only from each hosts master.") - parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X update steps.") - parser.add_argument( - "--eval_all_checkpoints", - action="store_true", - help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number", - ) - parser.add_argument( - "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory" - ) - parser.add_argument( - "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets" - ) - parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") - - return parser.parse_args() - - -def _mp_fn(rank, args): - main(args) - - -def main_cli(): - args = get_args() - xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores) - - -if __name__ == "__main__": - main_cli() diff --git a/examples/summarization/bart/finetune.py b/examples/summarization/bart/finetune.py index b916ea75440..078491f9181 100644 --- a/examples/summarization/bart/finetune.py +++ b/examples/summarization/bart/finetune.py @@ -7,7 +7,7 @@ import time import torch from torch.utils.data import DataLoader -from transformer_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup +from lightning_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup try: diff --git a/examples/summarization/bart/run_train.sh 
b/examples/summarization/bart/run_train.sh index 8ac009ef278..608047fca32 100755 --- a/examples/summarization/bart/run_train.sh +++ b/examples/summarization/bart/run_train.sh @@ -5,7 +5,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME} # Make output directory if it doesn't exist mkdir -p $OUTPUT_DIR -# Add parent directory to python path to access transformer_base.py +# Add parent directory to python path to access lightning_base.py export PYTHONPATH="../../":"${PYTHONPATH}" python finetune.py \ diff --git a/examples/summarization/bart/run_train_tiny.sh b/examples/summarization/bart/run_train_tiny.sh index 2aa9418f5b1..b04bf40264d 100755 --- a/examples/summarization/bart/run_train_tiny.sh +++ b/examples/summarization/bart/run_train_tiny.sh @@ -12,7 +12,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME} # Make output directory if it doesn't exist mkdir -p $OUTPUT_DIR -# Add parent directory to python path to access transformer_base.py and utils.py +# Add parent directory to python path to access lightning_base.py and utils.py export PYTHONPATH="../../":"${PYTHONPATH}" python finetune.py \ --data_dir=cnn_tiny/ \ diff --git a/examples/test_examples.py b/examples/test_examples.py index 5babf45eb28..84dad0a546f 100644 --- a/examples/test_examples.py +++ b/examples/test_examples.py @@ -16,14 +16,24 @@ import argparse import logging +import os import sys import unittest from unittest.mock import patch -import run_generation -import run_glue -import run_language_modeling -import run_squad + +SRC_DIRS = [ + os.path.join(os.path.dirname(__file__), dirname) + for dirname in ["text-generation", "text-classification", "language-modeling", "question-answering"] +] +sys.path.extend(SRC_DIRS) + + +if SRC_DIRS is not None: + import run_generation + import run_glue + import run_language_modeling + import run_squad logging.basicConfig(level=logging.DEBUG) @@ -43,24 +53,24 @@ class ExamplesTests(unittest.TestCase): stream_handler = logging.StreamHandler(sys.stdout) logger.addHandler(stream_handler) - testargs = [ - "run_glue.py", - "--data_dir=./examples/tests_samples/MRPC/", - "--task_name=mrpc", - "--do_train", - "--do_eval", - "--output_dir=./examples/tests_samples/temp_dir", - "--per_gpu_train_batch_size=2", - "--per_gpu_eval_batch_size=1", - "--learning_rate=1e-4", - "--max_steps=10", - "--warmup_steps=2", - "--overwrite_output_dir", - "--seed=42", - "--max_seq_length=128", - ] - model_name = "--model_name_or_path=bert-base-uncased" - with patch.object(sys, "argv", testargs + [model_name]): + testargs = """ + run_glue.py + --model_name_or_path bert-base-uncased + --data_dir ./tests/fixtures/tests_samples/MRPC/ + --task_name mrpc + --do_train + --do_eval + --output_dir ./tests/fixtures/tests_samples/temp_dir + --per_gpu_train_batch_size=2 + --per_gpu_eval_batch_size=1 + --learning_rate=1e-4 + --max_steps=10 + --warmup_steps=2 + --overwrite_output_dir + --seed=42 + --max_seq_length=128 + """.split() + with patch.object(sys, "argv", testargs): result = run_glue.main() del result["loss"] for value in result.values(): @@ -78,7 +88,7 @@ class ExamplesTests(unittest.TestCase): --line_by_line --train_data_file ./tests/fixtures/sample_text.txt --eval_data_file ./tests/fixtures/sample_text.txt - --output_dir ./tests/fixtures + --output_dir ./tests/fixtures/tests_samples/temp_dir --overwrite_output_dir --do_train --do_eval @@ -93,24 +103,25 @@ class ExamplesTests(unittest.TestCase): stream_handler = logging.StreamHandler(sys.stdout) logger.addHandler(stream_handler) - testargs = [ - "run_squad.py", - 
"--data_dir=./examples/tests_samples/SQUAD", - "--model_name=bert-base-uncased", - "--output_dir=./examples/tests_samples/temp_dir", - "--max_steps=10", - "--warmup_steps=2", - "--do_train", - "--do_eval", - "--version_2_with_negative", - "--learning_rate=2e-4", - "--per_gpu_train_batch_size=2", - "--per_gpu_eval_batch_size=1", - "--overwrite_output_dir", - "--seed=42", - ] - model_type, model_name = ("--model_type=bert", "--model_name_or_path=bert-base-uncased") - with patch.object(sys, "argv", testargs + [model_type, model_name]): + testargs = """ + run_squad.py + --model_type=bert + --model_name_or_path=bert-base-uncased + --data_dir=./tests/fixtures/tests_samples/SQUAD + --model_name=bert-base-uncased + --output_dir=./tests/fixtures/tests_samples/temp_dir + --max_steps=10 + --warmup_steps=2 + --do_train + --do_eval + --version_2_with_negative + --learning_rate=2e-4 + --per_gpu_train_batch_size=2 + --per_gpu_eval_batch_size=1 + --overwrite_output_dir + --seed=42 + """.split() + with patch.object(sys, "argv", testargs): result = run_squad.main() self.assertGreaterEqual(result["f1"], 30) self.assertGreaterEqual(result["exact"], 30) diff --git a/examples/text-classification/README.md b/examples/text-classification/README.md new file mode 100644 index 00000000000..d1f6704c869 --- /dev/null +++ b/examples/text-classification/README.md @@ -0,0 +1,299 @@ +## GLUE Benchmark + +# Run TensorFlow 2.0 version + +Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py). + +Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/). + +This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime. +Options are toggled using `USE_XLA` or `USE_AMP` variables in the script. +These options and the below benchmark are provided by @tlkh. + +Quick benchmarks from the script (no other modifications): + +| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) | +| --------- | -------- | ----------------------- | ----------------------| +| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 | +| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 | +| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 | +| V100 | AMP | 22s | 0.8646/0.8385/0.8411 | +| 1080 Ti | FP32 | 55s | - | + +Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used). + + + +# Run PyTorch version + +Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py). + +Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding +Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa. + +GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an +uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train +batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results +between different runs. We report the median on 5 runs (with different seeds) for each of the metrics. 
+ +| Task | Metric | Result | +|-------|------------------------------|-------------| +| CoLA | Matthew's corr | 49.23 | +| SST-2 | Accuracy | 91.97 | +| MRPC | F1/Accuracy | 89.47/85.29 | +| STS-B | Person/Spearman corr. | 83.95/83.70 | +| QQP | Accuracy/F1 | 88.40/84.31 | +| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 | +| QNLI | Accuracy | 87.46 | +| RTE | Accuracy | 61.73 | +| WNLI | Accuracy | 45.07 | + +Some of these results are significantly different from the ones reported on the test set +of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite. + +Before running any one of these GLUE tasks you should download the +[GLUE data](https://gluebenchmark.com/tasks) by running +[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) +and unpack it to some directory `$GLUE_DIR`. + +```bash +export GLUE_DIR=/path/to/glue +export TASK_NAME=MRPC + +python run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/$TASK_NAME/ +``` + +where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI. + +The dev set results will be present within the text file `eval_results.txt` in the specified output_dir. +In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate +output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`. + +The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, +CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being +said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well, +since the data processor for each task inherits from the base class DataProcessor. + +## Running on TPUs + +You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this +[README](https://github.com/pytorch/xla/blob/master/README.md). + +The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are +identical to your normal GPU + Huggingface setup. + +For running your GLUE task on MNLI dataset you can run something like the following: + +``` +export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" +export GLUE_DIR=/path/to/glue +export TASK_NAME=MNLI + +python run_glue_tpu.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --train_batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/$TASK_NAME \ + --overwrite_output_dir \ + --logging_steps 50 \ + --save_steps 200 \ + --num_cores=8 \ + --only_log_master +``` + +### MRPC + +#### Fine-tuning example + +The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less +than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed. 
+ +Before running any one of these GLUE tasks you should download the +[GLUE data](https://gluebenchmark.com/tasks) by running +[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) +and unpack it to some directory `$GLUE_DIR`. + +```bash +export GLUE_DIR=/path/to/glue + +python run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ +``` + +Our test ran on a few seeds with [the original implementation hyper- +parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation +results between 84% and 88%. + +#### Using Apex and mixed-precision + +Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install +[apex](https://github.com/NVIDIA/apex), then run the following example: + +```bash +export GLUE_DIR=/path/to/glue + +python run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ \ + --fp16 +``` + +#### Distributed training + +Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it +reaches F1 > 92 on MRPC. + +```bash +export GLUE_DIR=/path/to/glue + +python -m torch.distributed.launch \ + --nproc_per_node 8 run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 8 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ +``` + +Training with these hyper-parameters gave us the following results: + +```bash +acc = 0.8823529411764706 +acc_and_f1 = 0.901702786377709 +eval_loss = 0.3418912578906332 +f1 = 0.9210526315789473 +global_step = 174 +loss = 0.07231863956341798 +``` + +### MNLI + +The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task. + +```bash +export GLUE_DIR=/path/to/glue + +python -m torch.distributed.launch \ + --nproc_per_node 8 run_glue.py \ + --model_name_or_path bert-base-cased \ + --task_name mnli \ + --do_train \ + --do_eval \ + --data_dir $GLUE_DIR/MNLI/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 8 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir output_dir \ +``` + +The results are the following: + +```bash +***** Eval results ***** + acc = 0.8679706601466992 + eval_loss = 0.4911287787382479 + global_step = 18408 + loss = 0.04755385363816904 + +***** Eval results ***** + acc = 0.8747965825874695 + eval_loss = 0.45516540421714036 + global_step = 18408 + loss = 0.04755385363816904 +``` + +# Run PyTorch version using PyTorch-Lightning + +Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download, pre-process the data and run the specified models. Logs are saved in `lightning_logs` directory. + +Pass `--n_gpu` flag to change the number of GPUs. Default uses 1. 
At the end, the expected results are: + +``` +TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358} +``` + + +# XNLI + +Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py). + +[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili). + +#### Fine-tuning on XNLI + +This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins +on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a +`$XNLI_DIR` directory. + +* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip) +* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip) + +```bash +export XNLI_DIR=/path/to/XNLI + +python run_xnli.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-cased \ + --language de \ + --train_language en \ + --do_train \ + --do_eval \ + --data_dir $XNLI_DIR \ + --per_gpu_train_batch_size 32 \ + --learning_rate 5e-5 \ + --num_train_epochs 2.0 \ + --max_seq_length 128 \ + --output_dir /tmp/debug_xnli/ \ + --save_steps -1 +``` + +Training with the previously defined hyper-parameters yields the following results on the **test** set: + +```bash +acc = 0.7093812375249501 +``` + + + + diff --git a/examples/run_glue.py b/examples/text-classification/run_glue.py similarity index 100% rename from examples/run_glue.py rename to examples/text-classification/run_glue.py diff --git a/examples/glue/run_pl.sh b/examples/text-classification/run_pl.sh similarity index 93% rename from examples/glue/run_pl.sh rename to examples/text-classification/run_pl.sh index 3801fcaec5e..26a95404149 100755 --- a/examples/glue/run_pl.sh +++ b/examples/text-classification/run_pl.sh @@ -20,7 +20,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME} # Make output directory if it doesn't exist mkdir -p $OUTPUT_DIR -# Add parent directory to python path to access transformer_base.py +# Add parent directory to python path to access lightning_base.py export PYTHONPATH="../":"${PYTHONPATH}" python3 run_pl_glue.py --data_dir $DATA_DIR \ diff --git a/examples/glue/run_pl_glue.py b/examples/text-classification/run_pl_glue.py similarity index 99% rename from examples/glue/run_pl_glue.py rename to examples/text-classification/run_pl_glue.py index dc0dcb12122..88e5912cad8 100644 --- a/examples/glue/run_pl_glue.py +++ b/examples/text-classification/run_pl_glue.py @@ -8,7 +8,7 @@ import numpy as np import torch from torch.utils.data import DataLoader, TensorDataset -from transformer_base import BaseTransformer, add_generic_args, generic_train +from lightning_base import BaseTransformer, add_generic_args, generic_train from transformers import glue_compute_metrics as compute_metrics from transformers import glue_convert_examples_to_features as convert_examples_to_features from transformers import glue_output_modes diff --git a/examples/run_tf_glue.py b/examples/text-classification/run_tf_glue.py similarity index 100% rename from examples/run_tf_glue.py rename to examples/text-classification/run_tf_glue.py diff --git 
a/examples/run_xnli.py b/examples/text-classification/run_xnli.py similarity index 99% rename from examples/run_xnli.py rename to examples/text-classification/run_xnli.py index 7c2790f53d9..d902d22cd20 100644 --- a/examples/run_xnli.py +++ b/examples/text-classification/run_xnli.py @@ -14,7 +14,7 @@ # See the License for the specific language governing permissions and # limitations under the License. """ Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM). - Adapted from `examples/run_glue.py`""" + Adapted from `examples/text-classification/run_glue.py`""" import argparse diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md new file mode 100644 index 00000000000..4549e538b9d --- /dev/null +++ b/examples/text-generation/README.md @@ -0,0 +1,15 @@ +## Language generation + +Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py). + +Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL. +A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you +can try out the different models available in the library. + +Example usage: + +```bash +python run_generation.py \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 +``` diff --git a/examples/pplm/README.md b/examples/text-generation/pplm/README.md similarity index 100% rename from examples/pplm/README.md rename to examples/text-generation/pplm/README.md diff --git a/examples/pplm/imgs/headfigure.png b/examples/text-generation/pplm/imgs/headfigure.png similarity index 100% rename from examples/pplm/imgs/headfigure.png rename to examples/text-generation/pplm/imgs/headfigure.png diff --git a/examples/pplm/imgs/wooly.png b/examples/text-generation/pplm/imgs/wooly.png similarity index 100% rename from examples/pplm/imgs/wooly.png rename to examples/text-generation/pplm/imgs/wooly.png diff --git a/examples/pplm/pplm_classification_head.py b/examples/text-generation/pplm/pplm_classification_head.py similarity index 100% rename from examples/pplm/pplm_classification_head.py rename to examples/text-generation/pplm/pplm_classification_head.py diff --git a/examples/pplm/run_pplm.py b/examples/text-generation/pplm/run_pplm.py similarity index 100% rename from examples/pplm/run_pplm.py rename to examples/text-generation/pplm/run_pplm.py diff --git a/examples/pplm/run_pplm_discrim_train.py b/examples/text-generation/pplm/run_pplm_discrim_train.py similarity index 100% rename from examples/pplm/run_pplm_discrim_train.py rename to examples/text-generation/pplm/run_pplm_discrim_train.py diff --git a/examples/run_generation.py b/examples/text-generation/run_generation.py similarity index 100% rename from examples/run_generation.py rename to examples/text-generation/run_generation.py diff --git a/examples/ner/README.md b/examples/token-classification/README.md similarity index 100% rename from examples/ner/README.md rename to examples/token-classification/README.md diff --git a/examples/ner/run.sh b/examples/token-classification/run.sh similarity index 100% rename from examples/ner/run.sh rename to examples/token-classification/run.sh diff --git a/examples/ner/run_ner.py b/examples/token-classification/run_ner.py similarity index 100% rename from examples/ner/run_ner.py rename to examples/token-classification/run_ner.py diff --git a/examples/ner/run_pl.sh b/examples/token-classification/run_pl.sh similarity index 95% rename from 
examples/ner/run_pl.sh rename to examples/token-classification/run_pl.sh index 5a863be22de..9776ff87187 100755 --- a/examples/ner/run_pl.sh +++ b/examples/token-classification/run_pl.sh @@ -27,7 +27,7 @@ export CURRENT_DIR=${PWD} export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME} mkdir -p $OUTPUT_DIR -# Add parent directory to python path to access transformer_base.py +# Add parent directory to python path to access lightning_base.py export PYTHONPATH="../":"${PYTHONPATH}" python3 run_pl_ner.py --data_dir ./ \ diff --git a/examples/ner/run_pl_ner.py b/examples/token-classification/run_pl_ner.py similarity index 99% rename from examples/ner/run_pl_ner.py rename to examples/token-classification/run_pl_ner.py index c4a4d6eadab..f015dad947a 100644 --- a/examples/ner/run_pl_ner.py +++ b/examples/token-classification/run_pl_ner.py @@ -9,7 +9,7 @@ from seqeval.metrics import f1_score, precision_score, recall_score from torch.nn import CrossEntropyLoss from torch.utils.data import DataLoader, TensorDataset -from transformer_base import BaseTransformer, add_generic_args, generic_train +from lightning_base import BaseTransformer, add_generic_args, generic_train from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file diff --git a/examples/ner/run_tf_ner.py b/examples/token-classification/run_tf_ner.py similarity index 100% rename from examples/ner/run_tf_ner.py rename to examples/token-classification/run_tf_ner.py diff --git a/examples/ner/test_ner_examples.py b/examples/token-classification/test_ner_examples.py similarity index 79% rename from examples/ner/test_ner_examples.py rename to examples/token-classification/test_ner_examples.py index 336457698c2..7d36f0403a1 100644 --- a/examples/ner/test_ner_examples.py +++ b/examples/token-classification/test_ner_examples.py @@ -18,10 +18,10 @@ class ExamplesTests(unittest.TestCase): testargs = """ --model_name distilbert-base-german-cased - --output_dir ./examples/tests_samples/temp_dir + --output_dir ./tests/fixtures/tests_samples/temp_dir --overwrite_output_dir - --data_dir ./examples/tests_samples/GermEval - --labels ./examples/tests_samples/GermEval/labels.txt + --data_dir ./tests/fixtures/tests_samples/GermEval + --labels ./tests/fixtures/tests_samples/GermEval/labels.txt --max_seq_length 128 --num_train_epochs 6 --logging_steps 1 diff --git a/examples/ner/utils_ner.py b/examples/token-classification/utils_ner.py similarity index 100% rename from examples/ner/utils_ner.py rename to examples/token-classification/utils_ner.py diff --git a/model_cards/SparkBeyond/roberta-large-sts-b/README.md b/model_cards/SparkBeyond/roberta-large-sts-b/README.md index 13bef74e0ab..6fa2fd2a631 100644 --- a/model_cards/SparkBeyond/roberta-large-sts-b/README.md +++ b/model_cards/SparkBeyond/roberta-large-sts-b/README.md @@ -4,7 +4,7 @@ This model is a fine tuned RoBERTA model over STS-B. 
It was trained with these params: -!python /content/transformers/examples/run_glue.py \ +!python /content/transformers/examples/text-classification/run_glue.py \ --model_type roberta \ --model_name_or_path roberta-large \ --task_name STS-B \ diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py index 3ee9bdd843e..6d3e4f97b53 100644 --- a/src/transformers/trainer.py +++ b/src/transformers/trainer.py @@ -333,7 +333,7 @@ class Trainer: total_train_batch_size = ( self.args.train_batch_size * self.args.gradient_accumulation_steps - * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1), + * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1) ) logger.info("***** Running training *****") logger.info(" Num examples = %d", num_examples) diff --git a/examples/tests_samples/.gitignore b/tests/fixtures/tests_samples/.gitignore similarity index 100% rename from examples/tests_samples/.gitignore rename to tests/fixtures/tests_samples/.gitignore diff --git a/examples/tests_samples/GermEval/dev.txt b/tests/fixtures/tests_samples/GermEval/dev.txt similarity index 100% rename from examples/tests_samples/GermEval/dev.txt rename to tests/fixtures/tests_samples/GermEval/dev.txt diff --git a/examples/tests_samples/GermEval/labels.txt b/tests/fixtures/tests_samples/GermEval/labels.txt similarity index 100% rename from examples/tests_samples/GermEval/labels.txt rename to tests/fixtures/tests_samples/GermEval/labels.txt diff --git a/examples/tests_samples/GermEval/train.txt b/tests/fixtures/tests_samples/GermEval/train.txt similarity index 100% rename from examples/tests_samples/GermEval/train.txt rename to tests/fixtures/tests_samples/GermEval/train.txt diff --git a/examples/tests_samples/MRPC/dev.tsv b/tests/fixtures/tests_samples/MRPC/dev.tsv similarity index 100% rename from examples/tests_samples/MRPC/dev.tsv rename to tests/fixtures/tests_samples/MRPC/dev.tsv diff --git a/examples/tests_samples/MRPC/train.tsv b/tests/fixtures/tests_samples/MRPC/train.tsv similarity index 100% rename from examples/tests_samples/MRPC/train.tsv rename to tests/fixtures/tests_samples/MRPC/train.tsv diff --git a/examples/tests_samples/SQUAD/dev-v2.0.json b/tests/fixtures/tests_samples/SQUAD/dev-v2.0.json similarity index 100% rename from examples/tests_samples/SQUAD/dev-v2.0.json rename to tests/fixtures/tests_samples/SQUAD/dev-v2.0.json diff --git a/examples/tests_samples/SQUAD/train-v2.0.json b/tests/fixtures/tests_samples/SQUAD/train-v2.0.json similarity index 100% rename from examples/tests_samples/SQUAD/train-v2.0.json rename to tests/fixtures/tests_samples/SQUAD/train-v2.0.json diff --git a/examples/tests_samples/STS-B/dev.tsv b/tests/fixtures/tests_samples/STS-B/dev.tsv similarity index 100% rename from examples/tests_samples/STS-B/dev.tsv rename to tests/fixtures/tests_samples/STS-B/dev.tsv diff --git a/examples/tests_samples/STS-B/train.tsv b/tests/fixtures/tests_samples/STS-B/train.tsv similarity index 100% rename from examples/tests_samples/STS-B/train.tsv rename to tests/fixtures/tests_samples/STS-B/train.tsv diff --git a/tests/test_trainer.py b/tests/test_trainer.py index 66195c20eb8..f397c71b96f 100644 --- a/tests/test_trainer.py +++ b/tests/test_trainer.py @@ -28,7 +28,7 @@ class DataCollatorIntegrationTest(unittest.TestCase): MODEL_ID = "bert-base-cased-finetuned-mrpc" tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) data_args = GlueDataTrainingArguments( - task_name="mrpc", data_dir="./examples/tests_samples/MRPC", overwrite_cache=True + 
task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC", overwrite_cache=True ) dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) data_collator = DefaultDataCollator() @@ -39,7 +39,7 @@ class DataCollatorIntegrationTest(unittest.TestCase): MODEL_ID = "distilroberta-base" tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) data_args = GlueDataTrainingArguments( - task_name="sts-b", data_dir="./examples/tests_samples/STS-B", overwrite_cache=True + task_name="sts-b", data_dir="./tests/fixtures/tests_samples/STS-B", overwrite_cache=True ) dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) data_collator = DefaultDataCollator() @@ -91,7 +91,7 @@ class TrainerIntegrationTest(unittest.TestCase): tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID) data_args = GlueDataTrainingArguments( - task_name="mrpc", data_dir="./examples/tests_samples/MRPC", overwrite_cache=True + task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC", overwrite_cache=True ) eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) diff --git a/valohai.yaml b/valohai.yaml index 0e07c4e79cb..753549ecded 100644 --- a/valohai.yaml +++ b/valohai.yaml @@ -1,13 +1,13 @@ --- - step: - name: Execute python examples/run_glue.py + name: Execute python examples/text-classification/run_glue.py image: pytorch/pytorch:nightly-devel-cuda10.0-cudnn7 command: - python /valohai/repository/utils/download_glue_data.py --data_dir=/glue_data - pip install -e . - pip install -r examples/requirements.txt - - python examples/run_glue.py --do_train --data_dir=/glue_data/{parameter-value:task_name} {parameters} + - python examples/text-classification/run_glue.py --do_train --data_dir=/glue_data/{parameter-value:task_name} {parameters} parameters: - name: model_type pass-as: --model_type={v}