Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-04 05:10:06 +06:00)

BIG Reorganize examples (#4213)

* Created using Colaboratory
* [examples] reorganize files
* remove run_tpu_glue.py as superseded by TPU support in Trainer
* Bugfix: int, not tuple
* move files around

Parent: cafa6a9e29
Commit: 0ae96ff8a7
.gitignore (vendored, 3 changed lines)

@@ -132,7 +132,8 @@ proc_data
 runs
 /runs_old
 /wandb
-examples/runs
+/examples/runs
+/examples/**/*.args

 # data
 /data
@@ -331,7 +331,7 @@ pip install -r ./examples/requirements.txt
 export GLUE_DIR=/path/to/glue
 export TASK_NAME=MRPC

-python ./examples/run_glue.py \
+python ./examples/text-classification/run_glue.py \
 --model_name_or_path bert-base-uncased \
 --task_name $TASK_NAME \
 --do_train \

@@ -357,7 +357,7 @@ Parallel training is a simple way to use several GPUs (but is slower and less fl
 ```shell
 export GLUE_DIR=/path/to/glue

-python ./examples/run_glue.py \
+python ./examples/text-classification/run_glue.py \
 --model_name_or_path xlnet-large-cased \
 --do_train \
 --do_eval \

@@ -382,7 +382,7 @@ On this machine we thus have a batch size of 32, please increase `gradient_accum
 This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) corpus using distributed training on 8 V100 GPUs to reach a F1 > 92.

 ```bash
-python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
+python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classification/run_glue.py \
 --model_name_or_path bert-large-uncased-whole-word-masking \
 --task_name MRPC \
 --do_train \
@@ -1 +0,0 @@
-../../examples/README.md

docs/source/examples.md (new file, 649 lines)

@@ -0,0 +1,649 @@
# Examples

In this section a few examples are put together. All of these examples work for several models, making use of the very
similar API between the different models.

**Important**
To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
Execute the following steps in a new virtual environment:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
```

| Section | Description |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |

## TensorFlow 2.0 Bert models on GLUE

Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).

Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using the `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
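
For reference, here is a minimal sketch of how such toggles are usually wired up in TensorFlow 2. The flag names mirror the script, but treat the exact calls as an assumption rather than a copy of `run_tf_glue.py`:

```python
import tensorflow as tf

USE_XLA = False  # compile the model with the XLA JIT compiler
USE_AMP = False  # run matmuls/convolutions in float16 on Tensor Cores

# XLA: fuse ops into larger kernels to reduce runtime overhead.
tf.config.optimizer.set_jit(USE_XLA)
# AMP: let the graph optimizer insert float16 casts and loss scaling automatically.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
```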

Quick benchmarks from the script (no other modifications):

| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |

Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).

## Running on TPUs

You can accelerate your workloads on Google's TPUs. For information on how to set up your TPU environment refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).

The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Huggingface setup.

### GLUE

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

To run your GLUE task on the MNLI dataset, you can run something like the following:

```bash
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI

python run_glue_tpu.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME \
--overwrite_output_dir \
--logging_steps 50 \
--save_steps 200 \
--num_cores=8 \
--only_log_master
```

## Language model training

Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
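
To make the CLM side concrete, here is a minimal sketch (an illustration, not code from the script): you pass the input ids as `labels`, and the model shifts them internally so each position predicts the next token.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Transformers are strange, beautiful creatures.", return_tensors="pt")
# For causal LM the labels are simply the inputs; the model shifts them by one
# position internally so that token t is predicted from tokens < t.
outputs = model(input_ids, labels=input_ids)
loss = outputs[0]  # average next-token cross-entropy over the sequence
print(float(loss))
```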

Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).

We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.

### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
--output_dir=output \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE
```

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
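
Perplexity here is simply the exponential of the evaluation loss reported by the script, so you can convert the loss in the output yourself; a one-line sketch:

```python
import math

eval_loss = 3.0  # hypothetical value read from the evaluation output
print(math.exp(eval_loss))  # exp(3.0) ~= 20, i.e. a perplexity of about 20
```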

### RoBERTa/BERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).
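
Dynamic masking means the masked positions are re-sampled every time a batch is built, instead of being fixed once at preprocessing time. A minimal sketch of the usual BERT-style recipe (mask 15% of tokens; of those, 80% become the mask token, 10% a random token, 10% stay unchanged); this is an illustration, not the script's exact implementation, and it skips the usual exclusion of special tokens:

```python
import torch

def dynamically_mask(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Return (masked_inputs, labels) for one masked-LM batch; called anew for every batch."""
    labels = input_ids.clone()
    masked_inputs = input_ids.clone()

    # Sample which positions the model has to predict (15% by default).
    masked_indices = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked_indices] = -100  # positions ignored by the cross-entropy loss

    # 80% of the selected positions are replaced with the mask token.
    replace_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    masked_inputs[replace_mask] = mask_token_id

    # Half of the remaining 20% get a random token; the rest are left unchanged.
    random_mask = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replace_mask
    masked_inputs[random_mask] = torch.randint(vocab_size, labels.shape)[random_mask]
    return masked_inputs, labels
```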

We use the `--mlm` flag so that the script may change its loss function.

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
```

## Language generation

Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).

Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
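
Under the hood this boils down to the models' `generate()` method; here is a minimal hand-rolled sketch (assuming a recent version of the library, with sampling parameters chosen arbitrarily):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Condition the model on a prompt and sample a continuation.
input_ids = tokenizer.encode("In a shocking finding, scientists discovered", return_tensors="pt")
output_ids = model.generate(
    input_ids,
    max_length=60,    # total length, prompt included
    do_sample=True,   # sample instead of greedy decoding
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```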

Example usage:

```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```

## GLUE

Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).

Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.

GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with total train
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median of 5 runs (with different seeds) for each of the metrics.

| Task  | Metric                       | Result      |
|-------|------------------------------|-------------|
| CoLA  | Matthew's corr               | 49.23       |
| SST-2 | Accuracy                     | 91.97       |
| MRPC  | F1/Accuracy                  | 89.47/85.29 |
| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
| QQP   | Accuracy/F1                  | 88.40/84.31 |
| MNLI  | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI  | Accuracy                     | 87.46       |
| RTE   | Accuracy                     | 61.73       |
| WNLI  | Accuracy                     | 45.07       |

Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
```

where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`.
In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
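
That file is a plain list of `key = value` lines (as in the result blocks quoted below), so collecting scores across tasks is straightforward; a small sketch, assuming that format:

```python
from pathlib import Path

def read_eval_results(output_dir):
    """Parse an eval_results.txt of 'key = value' lines into a dict of floats."""
    results = {}
    for line in Path(output_dir, "eval_results.txt").read_text().splitlines():
        key, _, value = line.partition("=")
        if value.strip():
            results[key.strip()] = float(value.strip())
    return results

print(read_eval_results("/tmp/MRPC"))  # e.g. {'acc': 0.88, 'f1': 0.92, ...}
```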

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class `DataProcessor`.
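
For reference, adding support for a new dataset mostly means writing one more processor with the same handful of methods. The sketch below assumes the `DataProcessor`/`InputExample` utilities in `transformers.data.processors.utils`; the CSV layout is invented for illustration:

```python
import csv
from transformers.data.processors.utils import DataProcessor, InputExample

class MyPairProcessor(DataProcessor):
    """Reads a hypothetical 'sentence1,sentence2,label' CSV into InputExamples."""

    def get_train_examples(self, data_dir):
        return self._create_examples(f"{data_dir}/train.csv", "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(f"{data_dir}/dev.csv", "dev")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, path, set_type):
        examples = []
        with open(path, newline="") as f:
            for i, row in enumerate(csv.DictReader(f)):
                examples.append(InputExample(
                    guid=f"{set_type}-{i}",
                    text_a=row["sentence1"],
                    text_b=row["sentence2"],
                    label=row["label"],
                ))
        return examples
```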

### MRPC

#### Fine-tuning example

The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```bash
export GLUE_DIR=/path/to/glue

python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```

Our tests ran on a few seeds with
[the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks)
and gave evaluation results between 84% and 88%.

#### Using Apex and mixed-precision

Using Apex and 16-bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:

```bash
export GLUE_DIR=/path/to/glue

python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
```
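
The `--fp16` flag delegates mixed precision to apex's `amp` module. Roughly, the training loop is wrapped like the sketch below; this shows the general pattern, not the script's exact code, and uses a stand-in model:

```python
import torch
from apex import amp

model = torch.nn.Linear(768, 2).cuda()  # stand-in for the fine-tuned model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# opt_level "O1" patches selected ops to run in float16 while keeping a float32
# master copy of the weights (the script's default --fp16_opt_level).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(8, 768).cuda()
loss = model(inputs).pow(2).mean()  # dummy loss for illustration

# Scale the loss so float16 gradients do not underflow, then step as usual.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```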

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model and it
reaches F1 > 92 on MRPC.

```bash
export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```

Training with these hyper-parameters gave us the following results:

```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```

### MNLI

The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.

```bash
export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir
```

The results are the following:

```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904

***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```

## Multiple Choice

Based on the script [`run_multiple_choice.py`]().

#### Fine-tuning on SWAG
Download the [SWAG](https://github.com/rowanz/swagaf/tree/master/data) data.

```bash
# training on 4 Tesla V100 (16GB) GPUs
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
Training with the defined hyper-parameters yields the following results:
```bash
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
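
As background for adapting this to other multiple-choice datasets: the models expect inputs of shape `(batch_size, num_choices, seq_len)`, one encoded (context, candidate) pair per choice, and return one logit per choice. A minimal sketch with a recent version of the library (an illustration, not code from `run_multiple_choice.py`):

```python
import torch
from transformers import BertForMultipleChoice, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

context = "She opened the fridge and"
choices = ["poured herself a glass of milk.", "drove to the airport."]

# Encode one (context, choice) pair per candidate, then add a batch dimension:
# the final shape is (batch_size=1, num_choices=2, seq_len).
enc = tokenizer([context] * len(choices), choices, return_tensors="pt", padding=True)
inputs = {name: tensor.unsqueeze(0) for name, tensor in enc.items()}

with torch.no_grad():
    logits = model(**inputs)[0]       # shape (1, num_choices)
print(int(logits.argmax(dim=-1)))     # index of the highest-scoring choice
```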

## SQuAD

Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).

#### Fine-tuning BERT on SQuAD1.0

This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single Tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
`$SQUAD_DIR` directory (a short download sketch follows the lists below).

* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

And for SQuAD2.0, you need to download:

- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
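
A minimal sketch of fetching the JSON files into `$SQUAD_DIR` (the URLs are the ones listed above; the destination is whatever you export as `SQUAD_DIR`):

```python
import os
import urllib.request

SQUAD_DIR = os.environ.get("SQUAD_DIR", "./squad")
os.makedirs(SQUAD_DIR, exist_ok=True)

FILES = {
    "train-v1.1.json": "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json",
    "dev-v1.1.json": "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json",
    "train-v2.0.json": "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json",
    "dev-v2.0.json": "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json",
}

for name, url in FILES.items():
    # Skip files that are already present so the script can be re-run safely.
    target = os.path.join(SQUAD_DIR, name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
```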

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 88.52
exact_match = 81.22
```

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 93.15
exact_match = 86.91
```

This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.

#### Fine-tuning XLNet on SQuAD

This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above for how to download the SQuAD data.

##### Command for SQuAD1.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=4 \
--per_gpu_train_batch_size=4 \
--save_steps 5000
```

##### Command for SQuAD2.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=2 \
--per_gpu_train_batch_size=2 \
--save_steps 5000
```

A larger batch size may improve performance at the cost of more memory.

##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```

## XNLI

Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).

[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

#### Fine-tuning on XNLI

This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be saved (and unzipped) in a
`$XNLI_DIR` directory.

* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)

```bash
export XNLI_DIR=/path/to/XNLI

python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```

Training with the previously defined hyper-parameters yields the following results on the **test** set:

```bash
acc = 0.7093812375249501
```

## MM-IMDb

Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).

[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a multimodal dataset with around 26,000 movies including images, plots and other metadata.

### Training on MM-IMDb

```bash
python run_mmimdb.py \
--data_dir /path/to/mmimdb/dataset/ \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir /path/to/save/dir/ \
--do_train \
--do_eval \
--max_seq_len 512 \
--gradient_accumulation_steps 20 \
--num_image_embeds 3 \
--num_train_epochs 100 \
--patience 5
```

## Adversarial evaluation of model performances

Here is an example of evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was graciously provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).

The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).

This is an example of using `test_hans.py`:

```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py

python examples/hans/test_hans.py \
--task_name hans \
--model_type $MODEL_TYPE \
--do_eval \
--data_dir $HANS_DIR \
--model_name_or_path $MODEL_PATH \
--max_seq_length 128 \
--output_dir $MODEL_PATH
```

This will create the `hans_predictions.txt` file in `MODEL_PATH`, which can then be evaluated using `hans/evaluate_heur_output.py` from the HANS dataset.

The results of a BERT-base model trained on MNLI with batch size 8 and random seed 42, evaluated on the HANS dataset, are as follows:

```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962

Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```
@@ -54,7 +54,7 @@ Additionally, the following method can be used to load values from a data file
 Example usage
 ^^^^^^^^^^^^^^^^^^^^^^^^^

-An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py>`__ script.
+An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.


 XNLI
@@ -44,7 +44,7 @@ Sequence Classification
 Sequence classification is the task of classifying sequences according to a given number of classes. An example
 of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
 a model on a GLUE sequence classification task, you may leverage the
-`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_glue.py>`_ or
+`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`_ or
 `run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_tf_glue.py>`_ scripts.

 Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative.
@@ -15,7 +15,7 @@ pip install -r ./examples/requirements.txt
 ```

 | Section | Description |
-|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
+|----------------------------|-----------------------------------------------------
 | [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
 | [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
 | [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
@@ -27,623 +27,3 @@ pip install -r ./examples/requirements.txt
 | [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
 | [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |

## TensorFlow 2.0 Bert models on GLUE
|
|
||||||
|
|
||||||
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
|
|
||||||
|
|
||||||
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
|
|
||||||
|
|
||||||
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
|
|
||||||
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
|
|
||||||
These options and the below benchmark are provided by @tlkh.
|
|
||||||
|
|
||||||
Quick benchmarks from the script (no other modifications):
|
|
||||||
|
|
||||||
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
|
|
||||||
| --------- | -------- | ----------------------- | ----------------------|
|
|
||||||
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
|
|
||||||
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
|
|
||||||
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
|
|
||||||
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
|
|
||||||
| 1080 Ti | FP32 | 55s | - |
|
|
||||||
|
|
||||||
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
|
|
||||||
|
|
||||||
## Running on TPUs
|
|
||||||
|
|
||||||
You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
|
|
||||||
[README](https://github.com/pytorch/xla/blob/master/README.md).
|
|
||||||
|
|
||||||
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
|
|
||||||
identical to your normal GPU + Huggingface setup.
|
|
||||||
|
|
||||||
### GLUE
|
|
||||||
|
|
||||||
Before running anyone of these GLUE tasks you should download the
|
|
||||||
[GLUE data](https://gluebenchmark.com/tasks) by running
|
|
||||||
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
|
|
||||||
and unpack it to some directory `$GLUE_DIR`.
|
|
||||||
|
|
||||||
For running your GLUE task on MNLI dataset you can run something like the following:
|
|
||||||
|
|
||||||
```
|
|
||||||
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
export TASK_NAME=MNLI
|
|
||||||
|
|
||||||
python run_glue_tpu.py \
|
|
||||||
--model_type bert \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name $TASK_NAME \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $GLUE_DIR/$TASK_NAME \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--train_batch_size 32 \
|
|
||||||
--learning_rate 3e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/$TASK_NAME \
|
|
||||||
--overwrite_output_dir \
|
|
||||||
--logging_steps 50 \
|
|
||||||
--save_steps 200 \
|
|
||||||
--num_cores=8 \
|
|
||||||
--only_log_master
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
## Language model training
|
|
||||||
|
|
||||||
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
|
|
||||||
|
|
||||||
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
|
|
||||||
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
|
|
||||||
are fine-tuned using a masked language modeling (MLM) loss.
|
|
||||||
|
|
||||||
Before running the following example, you should get a file that contains text on which the language model will be
|
|
||||||
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
|
|
||||||
|
|
||||||
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
|
|
||||||
text that will be used for evaluation.
|
|
||||||
|
|
||||||
### GPT-2/GPT and causal language modeling
|
|
||||||
|
|
||||||
The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
|
|
||||||
the tokenization). The loss here is that of causal language modeling.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
|
|
||||||
export TEST_FILE=/path/to/dataset/wiki.test.raw
|
|
||||||
|
|
||||||
python run_language_modeling.py \
|
|
||||||
--output_dir=output \
|
|
||||||
--model_type=gpt2 \
|
|
||||||
--model_name_or_path=gpt2 \
|
|
||||||
--do_train \
|
|
||||||
--train_data_file=$TRAIN_FILE \
|
|
||||||
--do_eval \
|
|
||||||
--eval_data_file=$TEST_FILE
|
|
||||||
```
|
|
||||||
|
|
||||||
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
|
|
||||||
a score of ~20 perplexity once fine-tuned on the dataset.
|
|
||||||
|
|
||||||
### RoBERTa/BERT and masked language modeling
|
|
||||||
|
|
||||||
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
|
|
||||||
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
|
|
||||||
pre-training: masked language modeling.
|
|
||||||
|
|
||||||
In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
|
|
||||||
slightly slower (over-fitting takes more epochs).
|
|
||||||
|
|
||||||
We use the `--mlm` flag so that the script may change its loss function.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
|
|
||||||
export TEST_FILE=/path/to/dataset/wiki.test.raw
|
|
||||||
|
|
||||||
python run_language_modeling.py \
|
|
||||||
--output_dir=output \
|
|
||||||
--model_type=roberta \
|
|
||||||
--model_name_or_path=roberta-base \
|
|
||||||
--do_train \
|
|
||||||
--train_data_file=$TRAIN_FILE \
|
|
||||||
--do_eval \
|
|
||||||
--eval_data_file=$TEST_FILE \
|
|
||||||
--mlm
|
|
||||||
```
|
|
||||||
|
|
||||||
## Language generation
|
|
||||||
|
|
||||||
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
|
|
||||||
|
|
||||||
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
|
|
||||||
A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
|
|
||||||
can try out the different models available in the library.
|
|
||||||
|
|
||||||
Example usage:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python run_generation.py \
|
|
||||||
--model_type=gpt2 \
|
|
||||||
--model_name_or_path=gpt2
|
|
||||||
```
|
|
||||||
|
|
||||||
## GLUE
|
|
||||||
|
|
||||||
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).
|
|
||||||
|
|
||||||
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
|
|
||||||
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
|
|
||||||
|
|
||||||
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
|
|
||||||
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train
|
|
||||||
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
|
|
||||||
between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
|
|
||||||
|
|
||||||
| Task | Metric | Result |
|
|
||||||
|-------|------------------------------|-------------|
|
|
||||||
| CoLA | Matthew's corr | 49.23 |
|
|
||||||
| SST-2 | Accuracy | 91.97 |
|
|
||||||
| MRPC | F1/Accuracy | 89.47/85.29 |
|
|
||||||
| STS-B | Person/Spearman corr. | 83.95/83.70 |
|
|
||||||
| QQP | Accuracy/F1 | 88.40/84.31 |
|
|
||||||
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
|
|
||||||
| QNLI | Accuracy | 87.46 |
|
|
||||||
| RTE | Accuracy | 61.73 |
|
|
||||||
| WNLI | Accuracy | 45.07 |
|
|
||||||
|
|
||||||
Some of these results are significantly different from the ones reported on the test set
|
|
||||||
of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
|
|
||||||
|
|
||||||
Before running any one of these GLUE tasks you should download the
|
|
||||||
[GLUE data](https://gluebenchmark.com/tasks) by running
|
|
||||||
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
|
|
||||||
and unpack it to some directory `$GLUE_DIR`.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
export TASK_NAME=MRPC
|
|
||||||
|
|
||||||
python run_glue.py \
|
|
||||||
--model_type bert \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name $TASK_NAME \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $GLUE_DIR/$TASK_NAME \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_gpu_train_batch_size 32 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/$TASK_NAME/
|
|
||||||
```
|
|
||||||
|
|
||||||
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
|
|
||||||
|
|
||||||
The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
|
|
||||||
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
|
|
||||||
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
|
|
||||||
|
|
||||||
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
|
|
||||||
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
|
|
||||||
said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
|
|
||||||
since the data processor for each task inherits from the base class DataProcessor.
|
|
||||||
|
|
||||||
### MRPC
|
|
||||||
|
|
||||||
#### Fine-tuning example
|
|
||||||
|
|
||||||
The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
|
|
||||||
than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
|
|
||||||
|
|
||||||
Before running any one of these GLUE tasks you should download the
|
|
||||||
[GLUE data](https://gluebenchmark.com/tasks) by running
|
|
||||||
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
|
|
||||||
and unpack it to some directory `$GLUE_DIR`.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
|
|
||||||
python run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name MRPC \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $GLUE_DIR/MRPC/ \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_gpu_train_batch_size 32 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/mrpc_output/
|
|
||||||
```
|
|
||||||
|
|
||||||
Our test ran on a few seeds with [the original implementation hyper-
|
|
||||||
parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
|
|
||||||
results between 84% and 88%.
|
|
||||||
|
|
||||||
#### Using Apex and mixed-precision
|
|
||||||
|
|
||||||
Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
|
|
||||||
[apex](https://github.com/NVIDIA/apex), then run the following example:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
|
|
||||||
python run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name MRPC \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $GLUE_DIR/MRPC/ \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_gpu_train_batch_size 32 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/mrpc_output/ \
|
|
||||||
--fp16
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Distributed training
|
|
||||||
|
|
||||||
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
|
|
||||||
reaches F1 > 92 on MRPC.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
|
|
||||||
python -m torch.distributed.launch \
|
|
||||||
--nproc_per_node 8 run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name MRPC \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $GLUE_DIR/MRPC/ \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_gpu_train_batch_size 8 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/mrpc_output/
|
|
||||||
```
|
|
||||||
|
|
||||||
Training with these hyper-parameters gave us the following results:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
acc = 0.8823529411764706
|
|
||||||
acc_and_f1 = 0.901702786377709
|
|
||||||
eval_loss = 0.3418912578906332
|
|
||||||
f1 = 0.9210526315789473
|
|
||||||
global_step = 174
|
|
||||||
loss = 0.07231863956341798
|
|
||||||
```
|
|
||||||
|
|
||||||
### MNLI
|
|
||||||
|
|
||||||
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
|
|
||||||
python -m torch.distributed.launch \
|
|
||||||
--nproc_per_node 8 run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name mnli \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $GLUE_DIR/MNLI/ \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_gpu_train_batch_size 8 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir output_dir \
|
|
||||||
```
|
|
||||||
|
|
||||||
The results are the following:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
***** Eval results *****
|
|
||||||
acc = 0.8679706601466992
|
|
||||||
eval_loss = 0.4911287787382479
|
|
||||||
global_step = 18408
|
|
||||||
loss = 0.04755385363816904
|
|
||||||
|
|
||||||
***** Eval results *****
|
|
||||||
acc = 0.8747965825874695
|
|
||||||
eval_loss = 0.45516540421714036
|
|
||||||
global_step = 18408
|
|
||||||
loss = 0.04755385363816904
|
|
||||||
```
|
|
||||||
|
|
||||||
## Multiple Choice
|
|
||||||
|
|
||||||
Based on the script [`run_multiple_choice.py`]().
|
|
||||||
|
|
||||||
#### Fine-tuning on SWAG
|
|
||||||
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
|
|
||||||
|
|
||||||
```bash
|
|
||||||
#training on 4 tesla V100(16GB) GPUS
|
|
||||||
export SWAG_DIR=/path/to/swag_data_dir
|
|
||||||
python ./examples/run_multiple_choice.py \
|
|
||||||
--task_name swag \
|
|
||||||
--model_name_or_path roberta-base \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--data_dir $SWAG_DIR \
|
|
||||||
--learning_rate 5e-5 \
|
|
||||||
--num_train_epochs 3 \
|
|
||||||
--max_seq_length 80 \
|
|
||||||
--output_dir models_bert/swag_base \
|
|
||||||
--per_gpu_eval_batch_size=16 \
|
|
||||||
--per_gpu_train_batch_size=16 \
|
|
||||||
--gradient_accumulation_steps 2 \
|
|
||||||
--overwrite_output
|
|
||||||
```
|
|
||||||
Training with the defined hyper-parameters yields the following results:
|
|
||||||
```
|
|
||||||
***** Eval results *****
|
|
||||||
eval_acc = 0.8338998300509847
|
|
||||||
eval_loss = 0.44457291918821606
|
|
||||||
```
|
|
||||||
|
|
||||||
## SQuAD
|
|
||||||
|
|
||||||
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
|
|
||||||
|
|
||||||
#### Fine-tuning BERT on SQuAD1.0
|
|
||||||
|
|
||||||
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
|
|
||||||
on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
|
|
||||||
$SQUAD_DIR directory.
|
|
||||||
|
|
||||||
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
|
|
||||||
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
|
|
||||||
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
|
|
||||||
|
|
||||||
And for SQuAD2.0, you need to download:
|
|
||||||
|
|
||||||
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
|
|
||||||
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
|
|
||||||
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export SQUAD_DIR=/path/to/SQUAD
|
|
||||||
|
|
||||||
python run_squad.py \
|
|
||||||
--model_type bert \
|
|
||||||
--model_name_or_path bert-base-uncased \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--train_file $SQUAD_DIR/train-v1.1.json \
|
|
||||||
--predict_file $SQUAD_DIR/dev-v1.1.json \
|
|
||||||
--per_gpu_train_batch_size 12 \
|
|
||||||
--learning_rate 3e-5 \
|
|
||||||
--num_train_epochs 2.0 \
|
|
||||||
--max_seq_length 384 \
|
|
||||||
--doc_stride 128 \
|
|
||||||
--output_dir /tmp/debug_squad/
|
|
||||||
```
|
|
||||||
|
|
||||||
Training with the previously defined hyper-parameters yields the following results:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
f1 = 88.52
|
|
||||||
exact_match = 81.22
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Distributed training
|
|
||||||
|
|
||||||
|
|
||||||
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
|
|
||||||
--model_type bert \
|
|
||||||
--model_name_or_path bert-large-uncased-whole-word-masking \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--train_file $SQUAD_DIR/train-v1.1.json \
|
|
||||||
--predict_file $SQUAD_DIR/dev-v1.1.json \
|
|
||||||
--learning_rate 3e-5 \
|
|
||||||
--num_train_epochs 2 \
|
|
||||||
--max_seq_length 384 \
|
|
||||||
--doc_stride 128 \
|
|
||||||
--output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
|
|
||||||
--per_gpu_eval_batch_size=3 \
|
|
||||||
--per_gpu_train_batch_size=3 \
|
|
||||||
```
|
|
||||||
|
|
||||||
Training with the previously defined hyper-parameters yields the following results:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
f1 = 93.15
|
|
||||||
exact_match = 86.91
|
|
||||||
```
|
|
||||||
|
|
||||||
This fine-tuned model is available as a checkpoint under the reference
|
|
||||||
`bert-large-uncased-whole-word-masking-finetuned-squad`.
|
|
||||||
|
|
||||||
#### Fine-tuning XLNet on SQuAD

This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above for how to download the SQuAD data.

##### Command for SQuAD1.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type xlnet \
  --model_name_or_path xlnet-large-cased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./wwm_cased_finetuned_squad/ \
  --per_gpu_eval_batch_size=4 \
  --per_gpu_train_batch_size=4 \
  --save_steps 5000
```

##### Command for SQuAD2.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type xlnet \
  --model_name_or_path xlnet-large-cased \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --train_file $SQUAD_DIR/train-v2.0.json \
  --predict_file $SQUAD_DIR/dev-v2.0.json \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./wwm_cased_finetuned_squad/ \
  --per_gpu_eval_batch_size=2 \
  --per_gpu_train_batch_size=2 \
  --save_steps 5000
```

A larger batch size may improve performance at the cost of more memory.

##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
    "exact": 85.45884578997162,
    "f1": 92.5974600601065,
    "total": 10570,
    "HasAns_exact": 85.45884578997162,
    "HasAns_f1": 92.59746006010651,
    "HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
    "exact": 80.4177545691906,
    "f1": 84.07154997729623,
    "total": 11873,
    "HasAns_exact": 76.73751686909581,
    "HasAns_f1": 84.05558584352873,
    "HasAns_total": 5928,
    "NoAns_exact": 84.0874684608915,
    "NoAns_f1": 84.0874684608915,
    "NoAns_total": 5945
}
```

## XNLI

Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).

[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

#### Fine-tuning on XNLI

This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should both be saved (and unzipped) in a
`$XNLI_DIR` directory.

* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)

```bash
export XNLI_DIR=/path/to/XNLI

python run_xnli.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
  --do_eval \
  --data_dir $XNLI_DIR \
  --per_gpu_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 128 \
  --output_dir /tmp/debug_xnli/ \
  --save_steps -1
```

Training with the previously defined hyper-parameters yields the following results on the **test** set:

```bash
acc = 0.7093812375249501
```

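After training, the saved model can be used for inference on any of the XNLI languages. The sketch below is illustrative only: the example sentence pair is made up, and the label order shown is an assumption (it should match the label list returned by the XNLI processor used in `run_xnli.py`).

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

model_dir = "/tmp/debug_xnli/"  # the output_dir used in the command above
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir)
model.eval()

# Illustrative German premise/hypothesis pair.
premise = "Der Hund schläft auf dem Sofa."
hypothesis = "Ein Tier ruht sich aus."

inputs = tokenizer.encode_plus(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]

labels = ["contradiction", "entailment", "neutral"]  # assumed XNLI label order
print(labels[logits.argmax(dim=-1).item()])
```
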
## MM-IMDb

Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/mm-imdb/run_mmimdb.py).

[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a multimodal dataset with around 26,000 movies including images, plots and other metadata.

### Training on MM-IMDb

```bash
python run_mmimdb.py \
  --data_dir /path/to/mmimdb/dataset/ \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --output_dir /path/to/save/dir/ \
  --do_train \
  --do_eval \
  --max_seq_len 512 \
  --gradient_accumulation_steps 20 \
  --num_image_embeds 3 \
  --num_train_epochs 100 \
  --patience 5
```

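A note on the hyper-parameters: `--gradient_accumulation_steps 20` trades memory for a larger effective batch. As a rough sketch of the arithmetic (the per-GPU batch size of 8 is an assumed script default, not something fixed by the command above):

```python
per_gpu_batch_size = 8           # assumed default; override with --per_gpu_train_batch_size
gradient_accumulation_steps = 20
n_gpu = 1

# Number of examples seen per optimizer update.
effective_batch = per_gpu_batch_size * gradient_accumulation_steps * n_gpu
print(effective_batch)  # 160
```
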
## Adversarial evaluation of model performances

Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was kindly provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).

The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).

This is an example of using test_hans.py:

```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py

python examples/hans/test_hans.py \
  --task_name hans \
  --model_type $MODEL_TYPE \
  --do_eval \
  --data_dir $HANS_DIR \
  --model_name_or_path $MODEL_PATH \
  --max_seq_length 128 \
  --output_dir $MODEL_PATH
```

This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.

The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset are as follows:

```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962

Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```

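The per-heuristic numbers above are plain accuracies computed separately on the entailed and non-entailed halves of HANS. As an illustration only (the rows below are hypothetical and do not reflect the actual HANS file format or the `evaluate_heur_output.py` interface), the computation boils down to:

```python
from collections import defaultdict

# Hypothetical rows: (heuristic, gold_label, predicted_label).
rows = [
    ("lexical_overlap", "entailment", "entailment"),
    ("lexical_overlap", "non-entailment", "entailment"),
    ("subsequence", "non-entailment", "non-entailment"),
]

correct, total = defaultdict(int), defaultdict(int)
for heuristic, gold, pred in rows:
    key = (heuristic, gold)       # accuracy is reported per (heuristic, gold label) bucket
    total[key] += 1
    correct[key] += int(gold == pred)

for key in sorted(total):
    print(key, correct[key] / total[key])
```
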
38 examples/adversarial/README.md Normal file
@@ -0,0 +1,38 @@
## Adversarial evaluation of model performances

Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was kindly provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).

The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).

This is an example of using test_hans.py:

```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py

python examples/hans/test_hans.py \
  --task_name hans \
  --model_type $MODEL_TYPE \
  --do_eval \
  --data_dir $HANS_DIR \
  --model_name_or_path $MODEL_PATH \
  --max_seq_length 128 \
  --output_dir $MODEL_PATH
```

This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.

The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset are as follows:

```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962

Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```
23 examples/contrib/mm-imdb/README.md Normal file
@@ -0,0 +1,23 @@
## MM-IMDb

Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).

[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a multimodal dataset with around 26,000 movies including images, plots and other metadata.

### Training on MM-IMDb

```bash
python run_mmimdb.py \
  --data_dir /path/to/mmimdb/dataset/ \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --output_dir /path/to/save/dir/ \
  --do_train \
  --do_eval \
  --max_seq_len 512 \
  --gradient_accumulation_steps 20 \
  --num_image_embeds 3 \
  --num_train_epochs 100 \
  --patience 5
```
@@ -1,9 +0,0 @@
# GLUE Benchmark

Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).

#### Run PyTorch version using PyTorch-Lightning

Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download, pre-process the data and run the specified models. Logs are saved in `lightning_logs` directory.

Pass `--n_gpu` flag to change the number of GPUs. Default uses 1. At the end, the expected results are: `TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}`
63 examples/language-modeling/README.md Normal file
@@ -0,0 +1,63 @@
## Language model training

Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.

Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).

We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.

### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
  --output_dir=output \
  --model_type=gpt2 \
  --model_name_or_path=gpt2 \
  --do_train \
  --train_data_file=$TRAIN_FILE \
  --do_eval \
  --eval_data_file=$TEST_FILE
```

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.

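Once fine-tuning is done, the checkpoint written to `output` can be loaded back for generation. A minimal sketch (the prompt text is made up; `AutoModelWithLMHead` and `generate` are used here as they existed at the time, so check the API of the version you have installed):

```python
import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

# Load the fine-tuned GPT-2 checkpoint from the output_dir used above.
tokenizer = AutoTokenizer.from_pretrained("output")
model = AutoModelWithLMHead.from_pretrained("output")
model.eval()

input_ids = tokenizer.encode("The history of natural language processing", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
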
### RoBERTa/BERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).

We use the `--mlm` flag so that the script may change its loss function.

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
  --output_dir=output \
  --model_type=roberta \
  --model_name_or_path=roberta-base \
  --do_train \
  --train_data_file=$TRAIN_FILE \
  --do_eval \
  --eval_data_file=$TEST_FILE \
  --mlm
```
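For intuition, dynamic masking simply means that the 15% of tokens to corrupt are re-drawn every time a batch is built, rather than being fixed once during preprocessing. Below is a simplified sketch of the masking step following the usual BERT 80/10/10 recipe; the real implementation lives in the data collator used by `run_language_modeling.py` and also avoids masking special tokens, which this sketch ignores.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Return (masked_inputs, labels) for a masked language modeling step."""
    labels = input_ids.clone()
    # Draw a fresh 15% of positions each call; only those contribute to the loss.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # positions ignored by the cross-entropy loss

    # 80% of the chosen positions are replaced by the [MASK] token.
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id

    # 10% are replaced by a random token; the remaining 10% are left unchanged.
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape, dtype=torch.long)[random_tok]
    return input_ids, labels
```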
31 examples/multiple-choice/README.md Normal file
@@ -0,0 +1,31 @@
## Multiple Choice

Based on the script [`run_multiple_choice.py`]().

#### Fine-tuning on SWAG

Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data

```bash
# training on 4 Tesla V100 (16GB) GPUs
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
  --task_name swag \
  --model_name_or_path roberta-base \
  --do_train \
  --do_eval \
  --data_dir $SWAG_DIR \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --max_seq_length 80 \
  --output_dir models_bert/swag_base \
  --per_gpu_eval_batch_size=16 \
  --per_gpu_train_batch_size=16 \
  --gradient_accumulation_steps 2 \
  --overwrite_output
```

Training with the defined hyper-parameters yields the following results:

```
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
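At inference time a multiple-choice model scores each candidate ending jointly with its context and picks the highest logit. A minimal sketch with the fine-tuned SWAG checkpoint from above (the context and endings are invented; `RobertaForMultipleChoice` expects inputs of shape `(batch, num_choices, seq_len)`):

```python
import torch
from transformers import RobertaForMultipleChoice, RobertaTokenizer

model_dir = "models_bert/swag_base"  # the output_dir used in the command above
tokenizer = RobertaTokenizer.from_pretrained(model_dir)
model = RobertaForMultipleChoice.from_pretrained(model_dir)
model.eval()

context = "She puts the kettle on the stove."
endings = ["She waits for the water to boil.", "She rides a horse across the field."]

# Encode each (context, ending) pair and stack them into a single "choices" dimension.
encoded = [tokenizer.encode_plus(context, e, max_length=80, pad_to_max_length=True) for e in endings]
input_ids = torch.tensor([[e["input_ids"] for e in encoded]])          # (1, num_choices, seq_len)
attention_mask = torch.tensor([[e["attention_mask"] for e in encoded]])

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask)[0]  # (1, num_choices)
print("Best ending:", endings[logits.argmax(dim=-1).item()])
```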
159 examples/question-answering/README.md Normal file
@@ -0,0 +1,159 @@
## SQuAD

Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).

#### Fine-tuning BERT on SQuAD1.0

This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single Tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
`$SQUAD_DIR` directory.

* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

And for SQuAD2.0, you need to download:

- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 88.52
exact_match = 81.22
```

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model to reach an F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
  --model_type bert \
  --model_name_or_path bert-large-uncased-whole-word-masking \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
  --per_gpu_eval_batch_size=3 \
  --per_gpu_train_batch_size=3
```

Training with the previously defined hyper-parameters yields the following results:

```bash
f1 = 93.15
exact_match = 86.91
```

This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.

#### Fine-tuning XLNet on SQuAD

This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above for how to download the SQuAD data.

##### Command for SQuAD1.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type xlnet \
  --model_name_or_path xlnet-large-cased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./wwm_cased_finetuned_squad/ \
  --per_gpu_eval_batch_size=4 \
  --per_gpu_train_batch_size=4 \
  --save_steps 5000
```

##### Command for SQuAD2.0:

```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type xlnet \
  --model_name_or_path xlnet-large-cased \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --train_file $SQUAD_DIR/train-v2.0.json \
  --predict_file $SQUAD_DIR/dev-v2.0.json \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./wwm_cased_finetuned_squad/ \
  --per_gpu_eval_batch_size=2 \
  --per_gpu_train_batch_size=2 \
  --save_steps 5000
```

A larger batch size may improve performance at the cost of more memory.

##### Results for SQuAD1.0 with the previously defined hyper-parameters:

```python
{
    "exact": 85.45884578997162,
    "f1": 92.5974600601065,
    "total": 10570,
    "HasAns_exact": 85.45884578997162,
    "HasAns_f1": 92.59746006010651,
    "HasAns_total": 10570
}
```

##### Results for SQuAD2.0 with the previously defined hyper-parameters:

```python
{
    "exact": 80.4177545691906,
    "f1": 84.07154997729623,
    "total": 11873,
    "HasAns_exact": 76.73751686909581,
    "HasAns_f1": 84.05558584352873,
    "HasAns_total": 5928,
    "NoAns_exact": 84.0874684608915,
    "NoAns_f1": 84.0874684608915,
    "NoAns_total": 5945
}
```
@@ -1,4 +1,3 @@
-tensorboardX
 tensorboard
 scikit-learn
 seqeval
@@ -1,611 +0,0 @@
# coding=utf-8
# Copyright 2019 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for sequence classification on GLUE (Bert, DistilBert, XLNet, RoBERTa)."""

from __future__ import absolute_import, division, print_function

import argparse
import glob
import logging
import os
import random

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from transformers import (
    WEIGHTS_NAME,
    AdamW,
    BertConfig,
    BertForSequenceClassification,
    BertTokenizer,
    DistilBertConfig,
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    RobertaConfig,
    RobertaForSequenceClassification,
    RobertaTokenizer,
    XLMConfig,
    XLMForSequenceClassification,
    XLMTokenizer,
    XLNetConfig,
    XLNetForSequenceClassification,
    XLNetTokenizer,
    get_linear_schedule_with_warmup,
)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors


try:
    # Only tensorboardX supports writing directly to gs://
    from tensorboardX import SummaryWriter
except ImportError:
    from torch.utils.tensorboard import SummaryWriter


logger = logging.getLogger(__name__)

ALL_MODELS = sum(
    (
        tuple(conf.pretrained_config_archive_map.keys())
        for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
    ),
    (),
)

MODEL_CLASSES = {
    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
    "xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
    "xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
    "roberta": (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
}


def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def get_sampler(dataset):
    if xm.xrt_world_size() <= 1:
        return RandomSampler(dataset)
    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())


def train(args, train_dataset, model, tokenizer, disable_logging=False):
    """ Train the model """
    if xm.is_master_ordinal():
        # Only master writes to Tensorboard
        tb_writer = SummaryWriter(args.tensorboard_logdir)

    train_sampler = get_sampler(train_dataset)
    dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total,
    )

    # Train!
    logger.info("***** Running training *****")
    logger.info(" Num examples = %d", len(dataloader) * args.train_batch_size)
    logger.info(" Num Epochs = %d", args.num_train_epochs)
    logger.info(" Instantaneous batch size per TPU core = %d", args.train_batch_size)
    logger.info(
        " Total train batch size (w. parallel, distributed & accumulation) = %d",
        (args.train_batch_size * args.gradient_accumulation_steps * xm.xrt_world_size()),
    )
    logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info(" Total optimization steps = %d", t_total)

    global_step = 0
    loss = None
    model.zero_grad()
    train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=disable_logging)
    set_seed(args.seed)  # Added here for reproductibility (even between python 2 and 3)
    for epoch in train_iterator:
        # tpu-comment: Get TPU parallel loader which sends data to TPU in background.
        train_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", total=len(dataloader), disable=disable_logging)
        for step, batch in enumerate(epoch_iterator):

            # Save model checkpoint.
            if args.save_steps > 0 and global_step % args.save_steps == 0:
                output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
                logger.info("Saving model checkpoint to %s", output_dir)

                if xm.is_master_ordinal():
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    torch.save(args, os.path.join(output_dir, "training_args.bin"))

                # Barrier to wait for saving checkpoint.
                xm.rendezvous("mid_training_checkpoint")
                # model.save_pretrained needs to be called by all ordinals
                model.save_pretrained(output_dir)

            model.train()
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
            if args.model_type != "distilbert":
                # XLM, DistilBERT and RoBERTa don't use segment_ids
                inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None
            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            loss.backward()

            if (step + 1) % args.gradient_accumulation_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                xm.optimizer_step(optimizer)
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics.
                    results = {}
                    if args.evaluate_during_training:
                        results = evaluate(args, model, tokenizer, disable_logging=disable_logging)
                    loss_scalar = loss.item()
                    logger.info(
                        "global_step: {global_step}, lr: {lr:.6f}, loss: {loss:.3f}".format(
                            global_step=global_step, lr=scheduler.get_lr()[0], loss=loss_scalar
                        )
                    )
                    if xm.is_master_ordinal():
                        # tpu-comment: All values must be in CPU and not on TPU device
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                        tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                        tb_writer.add_scalar("loss", loss_scalar, global_step)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.metrics_debug:
            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
            xm.master_print(met.metrics_report())
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if xm.is_master_ordinal():
        tb_writer.close()
    return global_step, loss.item()


def evaluate(args, model, tokenizer, prefix="", disable_logging=False):
    """Evaluate the model"""
    if xm.is_master_ordinal():
        # Only master writes to Tensorboard
        tb_writer = SummaryWriter(args.tensorboard_logdir)

    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
        eval_sampler = get_sampler(eval_dataset)

        if not os.path.exists(eval_output_dir):
            os.makedirs(eval_output_dir)

        dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, shuffle=False)
        eval_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info(" Num examples = %d", len(dataloader) * args.eval_batch_size)
        logger.info(" Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating", disable=disable_logging):
            model.eval()

            with torch.no_grad():
                inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
                if args.model_type != "distilbert":
                    # XLM, DistilBERT and RoBERTa don't use segment_ids
                    inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None
                outputs = model(**inputs)
                batch_eval_loss, logits = outputs[:2]

                eval_loss += batch_eval_loss
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs["labels"].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)

        # tpu-comment: Get all predictions and labels from all worker shards of eval dataset
        preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
        out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)
        results["eval_loss"] = eval_loss.item()

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        if xm.is_master_ordinal():
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results {} *****".format(prefix))
                for key in sorted(results.keys()):
                    logger.info(" %s = %s", key, str(results[key]))
                    writer.write("%s = %s\n" % (key, str(results[key])))
                    tb_writer.add_scalar(f"{eval_task}/{key}", results[key])

        if args.metrics_debug:
            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
            xm.master_print(met.metrics_report())

    if xm.is_master_ordinal():
        tb_writer.close()

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if not xm.is_master_ordinal():
        xm.rendezvous("load_and_cache_examples")

    processor = processors[task]()
    output_mode = output_modes[task]
    cached_features_file = os.path.join(
        args.cache_dir,
        "cached_{}_{}_{}_{}".format(
            "dev" if evaluate else "train",
            list(filter(None, args.model_name_or_path.split("/"))).pop(),
            str(args.max_seq_length),
            str(task),
        ),
    )

    # Load data features from cache or dataset file
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta"]:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = (
            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        )
        features = convert_examples_to_features(
            examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
        )
        logger.info("Saving features into cached file %s", cached_features_file)
        torch.save(features, cached_features_file)

    if xm.is_master_ordinal():
        xm.rendezvous("load_and_cache_examples")

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset


def main(args):
    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
    ):
        raise ValueError(
            (
                "Output directory ({}) already exists and is not empty." " Use --overwrite_output_dir to overcome."
            ).format(args.output_dir)
        )

    # tpu-comment: Get TPU/XLA Device
    args.device = xm.xla_device()

    # Setup logging
    logging.basicConfig(
        format="[xla:{}] %(asctime)s - %(levelname)s - %(name)s - %(message)s".format(xm.get_ordinal()),
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )
    disable_logging = False
    if not xm.is_master_ordinal() and args.only_log_master:
        # Disable all non-master loggers below CRITICAL.
        logging.disable(logging.CRITICAL)
        disable_logging = True
    logger.warning("Process rank: %s, device: %s, num_cores: %s", xm.get_ordinal(), args.device, args.num_cores)

    # Set seed to have same initialization
    set_seed(args.seed)

    # Prepare GLUE task
    args.task_name = args.task_name.lower()
    if args.task_name not in processors:
        raise ValueError("Task not found: %s" % (args.task_name))
    processor = processors[args.task_name]()
    args.output_mode = output_modes[args.task_name]
    label_list = processor.get_labels()
    num_labels = len(label_list)

    if not xm.is_master_ordinal():
        xm.rendezvous(
            "download_only_once"
        )  # Make sure only the first process in distributed training will download model & vocab

    # Load pretrained model and tokenizer
    args.model_type = args.model_type.lower()
    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
    config = config_class.from_pretrained(
        args.config_name if args.config_name else args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=args.task_name,
        cache_dir=args.cache_dir if args.cache_dir else None,
        xla_device=True,
    )
    tokenizer = tokenizer_class.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )
    model = model_class.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )

    if xm.is_master_ordinal():
        xm.rendezvous("download_only_once")

    # Send model to TPU/XLA device.
    model.to(args.device)

    logger.info("Training/evaluation parameters %s", args)

    if args.do_train:
        # Train the model.
        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, disable_logging=disable_logging)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

        if xm.is_master_ordinal():
            # Save trained model.
            # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()

            # Create output directory if needed
            if not os.path.exists(args.output_dir):
                os.makedirs(args.output_dir)

            logger.info("Saving model checkpoint to %s", args.output_dir)
            # Save a trained model, configuration and tokenizer using `save_pretrained()`.
            # They can then be reloaded using `from_pretrained()`
            tokenizer.save_pretrained(args.output_dir)
            # Good practice: save your training arguments together with the trained.
            torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        xm.rendezvous("post_training_checkpoint")
        # model.save_pretrained needs to be called by all ordinals
        model.save_pretrained(args.output_dir)

        # Load a trained model and vocabulary that you have fine-tuned
        model = model_class.from_pretrained(args.output_dir)
        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval:
        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, prefix=prefix, disable_logging=disable_logging)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results


def get_args():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
    )
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        required=True,
        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
    )
    parser.add_argument(
        "--task_name",
        default=None,
        type=str,
        required=True,
        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        required=True,
        help="The output directory where the model predictions and checkpoints will be written.",
    )

    # TPU Parameters
    parser.add_argument("--num_cores", default=8, type=int, help="Number of TPU cores to use (1 or 8).")
    parser.add_argument("--metrics_debug", action="store_true", help="Whether to print debug metrics.")

    # Other parameters
    parser.add_argument(
        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
    )
    parser.add_argument(
        "--tokenizer_name",
        default="",
        type=str,
        help="Pretrained tokenizer name or path if not the same as model_name",
    )
    parser.add_argument(
        "--cache_dir",
        default="",
        type=str,
        help="Where do you want to store the pre-trained models downloaded and features file generated",
    )
    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.",
    )
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
    parser.add_argument(
        "--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
    )
    parser.add_argument(
        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
    )

    parser.add_argument("--train_batch_size", default=8, type=int, help="Per core batch size for training.")
    parser.add_argument("--eval_batch_size", default=8, type=int, help="Per core batch size for evaluation.")
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
    )
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
    parser.add_argument("--tensorboard_logdir", default="./runs", type=str, help="Where to write tensorboard metrics.")
    parser.add_argument("--logging_steps", type=int, default=50, help="Log every X update steps.")
    parser.add_argument("--only_log_master", action="store_true", help="Whether to log only from each hosts master.")
    parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X update steps.")
    parser.add_argument(
        "--eval_all_checkpoints",
        action="store_true",
        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
    )
    parser.add_argument(
        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
    )
    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")

    return parser.parse_args()


def _mp_fn(rank, args):
    main(args)


def main_cli():
    args = get_args()
    xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)


if __name__ == "__main__":
    main_cli()
@@ -7,7 +7,7 @@ import time
 import torch
 from torch.utils.data import DataLoader

-from transformer_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup
+from lightning_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup


 try:
@@ -5,7 +5,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR

-# Add parent directory to python path to access transformer_base.py
+# Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../../":"${PYTHONPATH}"

 python finetune.py \
@@ -12,7 +12,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR

-# Add parent directory to python path to access transformer_base.py and utils.py
+# Add parent directory to python path to access lightning_base.py and utils.py
 export PYTHONPATH="../../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=cnn_tiny/ \
@@ -16,14 +16,24 @@

 import argparse
 import logging
+import os
 import sys
 import unittest
 from unittest.mock import patch

-import run_generation
-import run_glue
-import run_language_modeling
-import run_squad
+SRC_DIRS = [
+    os.path.join(os.path.dirname(__file__), dirname)
+    for dirname in ["text-generation", "text-classification", "language-modeling", "question-answering"]
+]
+sys.path.extend(SRC_DIRS)
+
+
+if SRC_DIRS is not None:
+    import run_generation
+    import run_glue
+    import run_language_modeling
+    import run_squad


 logging.basicConfig(level=logging.DEBUG)
@@ -43,24 +53,24 @@ class ExamplesTests(unittest.TestCase):
         stream_handler = logging.StreamHandler(sys.stdout)
         logger.addHandler(stream_handler)

-        testargs = [
-            "run_glue.py",
-            "--data_dir=./examples/tests_samples/MRPC/",
-            "--task_name=mrpc",
-            "--do_train",
-            "--do_eval",
-            "--output_dir=./examples/tests_samples/temp_dir",
-            "--per_gpu_train_batch_size=2",
-            "--per_gpu_eval_batch_size=1",
-            "--learning_rate=1e-4",
-            "--max_steps=10",
-            "--warmup_steps=2",
-            "--overwrite_output_dir",
-            "--seed=42",
-            "--max_seq_length=128",
-        ]
-        model_name = "--model_name_or_path=bert-base-uncased"
-        with patch.object(sys, "argv", testargs + [model_name]):
+        testargs = """
+            run_glue.py
+            --model_name_or_path bert-base-uncased
+            --data_dir ./tests/fixtures/tests_samples/MRPC/
+            --task_name mrpc
+            --do_train
+            --do_eval
+            --output_dir ./tests/fixtures/tests_samples/temp_dir
+            --per_gpu_train_batch_size=2
+            --per_gpu_eval_batch_size=1
+            --learning_rate=1e-4
+            --max_steps=10
+            --warmup_steps=2
+            --overwrite_output_dir
+            --seed=42
+            --max_seq_length=128
+            """.split()
+        with patch.object(sys, "argv", testargs):
             result = run_glue.main()
             del result["loss"]
             for value in result.values():
@@ -78,7 +88,7 @@ class ExamplesTests(unittest.TestCase):
             --line_by_line
             --train_data_file ./tests/fixtures/sample_text.txt
             --eval_data_file ./tests/fixtures/sample_text.txt
-            --output_dir ./tests/fixtures
+            --output_dir ./tests/fixtures/tests_samples/temp_dir
             --overwrite_output_dir
             --do_train
             --do_eval
@@ -93,24 +103,25 @@ class ExamplesTests(unittest.TestCase):
         stream_handler = logging.StreamHandler(sys.stdout)
         logger.addHandler(stream_handler)

-        testargs = [
-            "run_squad.py",
-            "--data_dir=./examples/tests_samples/SQUAD",
-            "--model_name=bert-base-uncased",
-            "--output_dir=./examples/tests_samples/temp_dir",
-            "--max_steps=10",
-            "--warmup_steps=2",
-            "--do_train",
-            "--do_eval",
-            "--version_2_with_negative",
-            "--learning_rate=2e-4",
-            "--per_gpu_train_batch_size=2",
-            "--per_gpu_eval_batch_size=1",
-            "--overwrite_output_dir",
-            "--seed=42",
-        ]
-        model_type, model_name = ("--model_type=bert", "--model_name_or_path=bert-base-uncased")
-        with patch.object(sys, "argv", testargs + [model_type, model_name]):
+        testargs = """
+            run_squad.py
+            --model_type=bert
+            --model_name_or_path=bert-base-uncased
+            --data_dir=./tests/fixtures/tests_samples/SQUAD
+            --model_name=bert-base-uncased
+            --output_dir=./tests/fixtures/tests_samples/temp_dir
+            --max_steps=10
+            --warmup_steps=2
+            --do_train
+            --do_eval
+            --version_2_with_negative
+            --learning_rate=2e-4
+            --per_gpu_train_batch_size=2
+            --per_gpu_eval_batch_size=1
+            --overwrite_output_dir
+            --seed=42
+            """.split()
+        with patch.object(sys, "argv", testargs):
             result = run_squad.main()
             self.assertGreaterEqual(result["f1"], 30)
             self.assertGreaterEqual(result["exact"], 30)
299 examples/text-classification/README.md Normal file
@@ -0,0 +1,299 @@
|
|||||||
|
## GLUE Benchmark
|
||||||
|
|
||||||
|
# Run TensorFlow 2.0 version
|
||||||
|
|
||||||
|
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
|
||||||
|
|
||||||
|
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
|
||||||
|
|
||||||
|
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
|
||||||
|
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
|
||||||
|
These options and the below benchmark are provided by @tlkh.
|
||||||
|
|
||||||
|
Quick benchmarks from the script (no other modifications):
|
||||||
|
|
||||||
|
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
|
||||||
|
| --------- | -------- | ----------------------- | ----------------------|
|
||||||
|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
|
||||||
|
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
|
||||||
|
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
|
||||||
|
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
|
||||||
|
| 1080 Ti | FP32 | 55s | - |
|
||||||
|
|
||||||
|
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
|
||||||
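As a rough sketch of how `USE_XLA` / `USE_AMP` switches are typically wired up in TensorFlow 2.x — this is an illustration of the mechanism, not a copy of `run_tf_glue.py`, whose exact flag handling may differ:

```python
import tensorflow as tf

USE_XLA = False  # compile the model with the XLA JIT compiler
USE_AMP = False  # run compute-heavy ops in float16 on Tensor Cores

# Enable XLA JIT compilation globally.
tf.config.optimizer.set_jit(USE_XLA)

# Enable automatic mixed precision via the graph rewrite.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
```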
|
|
||||||
|
|
||||||
|
|
||||||
|
# Run PyTorch version
|
||||||
|
|
||||||
|
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
|
||||||
|
|
||||||
|
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
|
||||||
|
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
|
||||||
|
|
||||||
|
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
|
||||||
|
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with a total train
|
||||||
|
batch size between 16 and 64. Some of these tasks have a small dataset, and training can lead to high variance in the results
|
||||||
|
between different runs. We report the median of 5 runs (with different seeds) for each of the metrics.
|
||||||
|
|
||||||
|
| Task | Metric | Result |
|
||||||
|
|-------|------------------------------|-------------|
|
||||||
|
| CoLA | Matthew's corr | 49.23 |
|
||||||
|
| SST-2 | Accuracy | 91.97 |
|
||||||
|
| MRPC | F1/Accuracy | 89.47/85.29 |
|
||||||
|
| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
|
||||||
|
| QQP | Accuracy/F1 | 88.40/84.31 |
|
||||||
|
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
|
||||||
|
| QNLI | Accuracy | 87.46 |
|
||||||
|
| RTE | Accuracy | 61.73 |
|
||||||
|
| WNLI | Accuracy | 45.07 |
|
||||||
|
|
||||||
|
Some of these results are significantly different from the ones reported on the test set
|
||||||
|
of the GLUE benchmark on its website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
|
||||||
|
|
||||||
|
Before running any one of these GLUE tasks you should download the
|
||||||
|
[GLUE data](https://gluebenchmark.com/tasks) by running
|
||||||
|
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
|
||||||
|
and unpack it to some directory `$GLUE_DIR`.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export GLUE_DIR=/path/to/glue
|
||||||
|
export TASK_NAME=MRPC
|
||||||
|
|
||||||
|
python run_glue.py \
|
||||||
|
--model_type bert \
|
||||||
|
--model_name_or_path bert-base-cased \
|
||||||
|
--task_name $TASK_NAME \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $GLUE_DIR/$TASK_NAME \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--per_gpu_train_batch_size 32 \
|
||||||
|
--learning_rate 2e-5 \
|
||||||
|
--num_train_epochs 3.0 \
|
||||||
|
--output_dir /tmp/$TASK_NAME/
|
||||||
|
```
|
||||||
|
|
||||||
|
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
|
||||||
|
|
||||||
|
The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`.
|
||||||
|
In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
|
||||||
|
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
|
||||||
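Once a run has finished, the checkpoint saved in `output_dir` can be reloaded for inference. A minimal sketch, assuming the MRPC run above wrote its model to `/tmp/MRPC/` (the path and example sentences are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "/tmp/MRPC/"  # output_dir of the fine-tuning run above (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

# MRPC is a sentence-pair task: does sentence B paraphrase sentence A?
inputs = tokenizer.encode_plus(
    "The company reported strong earnings.",
    "Earnings at the company were strong.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs)[0]
print(logits.argmax(dim=-1).item())  # predicted label id (1 = paraphrase for MRPC)
```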
|
|
||||||
|
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
|
||||||
|
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
|
||||||
|
said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
|
||||||
|
since the data processor for each task inherits from the base class DataProcessor.
|
||||||
|
|
||||||
|
## Running on TPUs
|
||||||
|
|
||||||
|
You can accelerate your workloads on Google's TPUs. For information on how to set up your TPU environment, refer to this
|
||||||
|
[README](https://github.com/pytorch/xla/blob/master/README.md).
|
||||||
|
|
||||||
|
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
|
||||||
|
identical to your normal GPU + Huggingface setup.
|
||||||
|
|
||||||
|
For running your GLUE task on the MNLI dataset, you can run something like the following:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
|
||||||
|
export GLUE_DIR=/path/to/glue
|
||||||
|
export TASK_NAME=MNLI
|
||||||
|
|
||||||
|
python run_glue_tpu.py \
|
||||||
|
--model_type bert \
|
||||||
|
--model_name_or_path bert-base-cased \
|
||||||
|
--task_name $TASK_NAME \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $GLUE_DIR/$TASK_NAME \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--train_batch_size 32 \
|
||||||
|
--learning_rate 3e-5 \
|
||||||
|
--num_train_epochs 3.0 \
|
||||||
|
--output_dir /tmp/$TASK_NAME \
|
||||||
|
--overwrite_output_dir \
|
||||||
|
--logging_steps 50 \
|
||||||
|
--save_steps 200 \
|
||||||
|
--num_cores=8 \
|
||||||
|
--only_log_master
|
||||||
|
```
|
||||||
|
|
||||||
|
### MRPC
|
||||||
|
|
||||||
|
#### Fine-tuning example
|
||||||
|
|
||||||
|
The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
|
||||||
|
than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
|
||||||
|
|
||||||
|
Before running any one of these GLUE tasks you should download the
|
||||||
|
[GLUE data](https://gluebenchmark.com/tasks) by running
|
||||||
|
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
|
||||||
|
and unpack it to some directory `$GLUE_DIR`.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export GLUE_DIR=/path/to/glue
|
||||||
|
|
||||||
|
python run_glue.py \
|
||||||
|
--model_name_or_path bert-base-cased \
|
||||||
|
--task_name MRPC \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $GLUE_DIR/MRPC/ \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--per_gpu_train_batch_size 32 \
|
||||||
|
--learning_rate 2e-5 \
|
||||||
|
--num_train_epochs 3.0 \
|
||||||
|
--output_dir /tmp/mrpc_output/
|
||||||
|
```
|
||||||
|
|
||||||
|
Our test ran on a few seeds with [the original implementation hyper-
|
||||||
|
parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) and gave evaluation
|
||||||
|
results between 84% and 88%.
|
||||||
|
|
||||||
|
#### Using Apex and mixed-precision
|
||||||
|
|
||||||
|
Using Apex and 16-bit precision, fine-tuning on MRPC takes only 27 seconds. First install
|
||||||
|
[apex](https://github.com/NVIDIA/apex), then run the following example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export GLUE_DIR=/path/to/glue
|
||||||
|
|
||||||
|
python run_glue.py \
|
||||||
|
--model_name_or_path bert-base-cased \
|
||||||
|
--task_name MRPC \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $GLUE_DIR/MRPC/ \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--per_gpu_train_batch_size 32 \
|
||||||
|
--learning_rate 2e-5 \
|
||||||
|
--num_train_epochs 3.0 \
|
||||||
|
--output_dir /tmp/mrpc_output/ \
|
||||||
|
--fp16
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Distributed training
|
||||||
|
|
||||||
|
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model and it
|
||||||
|
reaches F1 > 92 on MRPC.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export GLUE_DIR=/path/to/glue
|
||||||
|
|
||||||
|
python -m torch.distributed.launch \
|
||||||
|
--nproc_per_node 8 run_glue.py \
|
||||||
|
--model_name_or_path bert-base-cased \
|
||||||
|
--task_name MRPC \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $GLUE_DIR/MRPC/ \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--per_gpu_train_batch_size 8 \
|
||||||
|
--learning_rate 2e-5 \
|
||||||
|
--num_train_epochs 3.0 \
|
||||||
|
--output_dir /tmp/mrpc_output/
|
||||||
|
```
|
||||||
|
|
||||||
|
Training with these hyper-parameters gave us the following results:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
acc = 0.8823529411764706
|
||||||
|
acc_and_f1 = 0.901702786377709
|
||||||
|
eval_loss = 0.3418912578906332
|
||||||
|
f1 = 0.9210526315789473
|
||||||
|
global_step = 174
|
||||||
|
loss = 0.07231863956341798
|
||||||
|
```
|
||||||
|
|
||||||
|
### MNLI
|
||||||
|
|
||||||
|
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export GLUE_DIR=/path/to/glue
|
||||||
|
|
||||||
|
python -m torch.distributed.launch \
|
||||||
|
--nproc_per_node 8 run_glue.py \
|
||||||
|
    --model_name_or_path bert-large-uncased-whole-word-masking \
|
||||||
|
--task_name mnli \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $GLUE_DIR/MNLI/ \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--per_gpu_train_batch_size 8 \
|
||||||
|
--learning_rate 2e-5 \
|
||||||
|
--num_train_epochs 3.0 \
|
||||||
|
    --output_dir output_dir
|
||||||
|
```
|
||||||
|
|
||||||
|
The results are the following:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
***** Eval results *****
|
||||||
|
acc = 0.8679706601466992
|
||||||
|
eval_loss = 0.4911287787382479
|
||||||
|
global_step = 18408
|
||||||
|
loss = 0.04755385363816904
|
||||||
|
|
||||||
|
***** Eval results *****
|
||||||
|
acc = 0.8747965825874695
|
||||||
|
eval_loss = 0.45516540421714036
|
||||||
|
global_step = 18408
|
||||||
|
loss = 0.04755385363816904
|
||||||
|
```
|
||||||
|
|
||||||
|
# Run PyTorch version using PyTorch-Lightning
|
||||||
|
|
||||||
|
Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that will automatically download and pre-process the data and run the specified models. Logs are saved in the `lightning_logs` directory.
|
||||||
|
|
||||||
|
Pass the `--n_gpu` flag to change the number of GPUs; by default it uses 1. At the end, the expected results are:
|
||||||
|
|
||||||
|
```
|
||||||
|
TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
# XNLI
|
||||||
|
|
||||||
|
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
|
||||||
|
|
||||||
|
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
|
||||||
|
|
||||||
|
#### Fine-tuning on XNLI
|
||||||
|
|
||||||
|
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
|
||||||
|
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be saved (and unzipped) in a
|
||||||
|
`$XNLI_DIR` directory.
|
||||||
|
|
||||||
|
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
|
||||||
|
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export XNLI_DIR=/path/to/XNLI
|
||||||
|
|
||||||
|
python run_xnli.py \
|
||||||
|
--model_type bert \
|
||||||
|
--model_name_or_path bert-base-multilingual-cased \
|
||||||
|
--language de \
|
||||||
|
--train_language en \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $XNLI_DIR \
|
||||||
|
--per_gpu_train_batch_size 32 \
|
||||||
|
--learning_rate 5e-5 \
|
||||||
|
--num_train_epochs 2.0 \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--output_dir /tmp/debug_xnli/ \
|
||||||
|
--save_steps -1
|
||||||
|
```
|
||||||
|
|
||||||
|
Training with the previously defined hyper-parameters yields the following results on the **test** set:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
acc = 0.7093812375249501
|
||||||
|
```
|
||||||
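As with the GLUE runs, the checkpoint written to `--output_dir` can be reloaded for inference in the target language. A minimal sketch, assuming the run above saved to `/tmp/debug_xnli/`; the label order below follows the library's XNLI processor and should be treated as an assumption:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "/tmp/debug_xnli/"  # --output_dir of the XNLI run above (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

# German premise/hypothesis pair, scored by the English-trained model.
inputs = tokenizer.encode_plus(
    "Der Hund schläft auf dem Sofa.",
    "Ein Tier ruht sich aus.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs)[0]
labels = ["contradiction", "entailment", "neutral"]  # assumed XNLI label order
print(labels[logits.argmax(dim=-1).item()])
```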
|
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -20,7 +20,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
|
|||||||
|
|
||||||
# Make output directory if it doesn't exist
|
# Make output directory if it doesn't exist
|
||||||
mkdir -p $OUTPUT_DIR
|
mkdir -p $OUTPUT_DIR
|
||||||
# Add parent directory to python path to access transformer_base.py
|
# Add parent directory to python path to access lightning_base.py
|
||||||
export PYTHONPATH="../":"${PYTHONPATH}"
|
export PYTHONPATH="../":"${PYTHONPATH}"
|
||||||
|
|
||||||
python3 run_pl_glue.py --data_dir $DATA_DIR \
|
python3 run_pl_glue.py --data_dir $DATA_DIR \
|
@ -8,7 +8,7 @@ import numpy as np
|
|||||||
import torch
|
import torch
|
||||||
from torch.utils.data import DataLoader, TensorDataset
|
from torch.utils.data import DataLoader, TensorDataset
|
||||||
|
|
||||||
from transformer_base import BaseTransformer, add_generic_args, generic_train
|
from lightning_base import BaseTransformer, add_generic_args, generic_train
|
||||||
from transformers import glue_compute_metrics as compute_metrics
|
from transformers import glue_compute_metrics as compute_metrics
|
||||||
from transformers import glue_convert_examples_to_features as convert_examples_to_features
|
from transformers import glue_convert_examples_to_features as convert_examples_to_features
|
||||||
from transformers import glue_output_modes
|
from transformers import glue_output_modes
|
@ -14,7 +14,7 @@
|
|||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
""" Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
|
""" Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
|
||||||
Adapted from `examples/run_glue.py`"""
|
Adapted from `examples/text-classification/run_glue.py`"""
|
||||||
|
|
||||||
|
|
||||||
import argparse
|
import argparse
|
15
examples/text-generation/README.md
Normal file
@ -0,0 +1,15 @@
|
|||||||
|
## Language generation
|
||||||
|
|
||||||
|
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
|
||||||
|
|
||||||
|
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
|
||||||
|
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
|
||||||
|
can try out the different models available in the library.
|
||||||
|
|
||||||
|
Example usage:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python run_generation.py \
|
||||||
|
--model_type=gpt2 \
|
||||||
|
--model_name_or_path=gpt2
|
||||||
|
```
|
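For a programmatic equivalent of the command above, the same models can be driven through the library's `generate()` API. A minimal sketch — the prompt and sampling parameters are illustrative, not the script's defaults:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation; top-k/top-p values here are illustrative.
output_ids = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```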
@ -27,7 +27,7 @@ export CURRENT_DIR=${PWD}
|
|||||||
export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
|
export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
|
||||||
mkdir -p $OUTPUT_DIR
|
mkdir -p $OUTPUT_DIR
|
||||||
|
|
||||||
# Add parent directory to python path to access transformer_base.py
|
# Add parent directory to python path to access lightning_base.py
|
||||||
export PYTHONPATH="../":"${PYTHONPATH}"
|
export PYTHONPATH="../":"${PYTHONPATH}"
|
||||||
|
|
||||||
python3 run_pl_ner.py --data_dir ./ \
|
python3 run_pl_ner.py --data_dir ./ \
|
@ -9,7 +9,7 @@ from seqeval.metrics import f1_score, precision_score, recall_score
|
|||||||
from torch.nn import CrossEntropyLoss
|
from torch.nn import CrossEntropyLoss
|
||||||
from torch.utils.data import DataLoader, TensorDataset
|
from torch.utils.data import DataLoader, TensorDataset
|
||||||
|
|
||||||
from transformer_base import BaseTransformer, add_generic_args, generic_train
|
from lightning_base import BaseTransformer, add_generic_args, generic_train
|
||||||
from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
|
from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
|
||||||
|
|
||||||
|
|
@ -18,10 +18,10 @@ class ExamplesTests(unittest.TestCase):
|
|||||||
|
|
||||||
testargs = """
|
testargs = """
|
||||||
--model_name distilbert-base-german-cased
|
--model_name distilbert-base-german-cased
|
||||||
--output_dir ./examples/tests_samples/temp_dir
|
--output_dir ./tests/fixtures/tests_samples/temp_dir
|
||||||
--overwrite_output_dir
|
--overwrite_output_dir
|
||||||
--data_dir ./examples/tests_samples/GermEval
|
--data_dir ./tests/fixtures/tests_samples/GermEval
|
||||||
--labels ./examples/tests_samples/GermEval/labels.txt
|
--labels ./tests/fixtures/tests_samples/GermEval/labels.txt
|
||||||
--max_seq_length 128
|
--max_seq_length 128
|
||||||
--num_train_epochs 6
|
--num_train_epochs 6
|
||||||
--logging_steps 1
|
--logging_steps 1
|
@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
This model is a fine-tuned RoBERTa model over STS-B.
|
This model is a fine-tuned RoBERTa model over STS-B.
|
||||||
It was trained with these params:
|
It was trained with these params:
|
||||||
!python /content/transformers/examples/run_glue.py \
|
!python /content/transformers/examples/text-classification/run_glue.py \
|
||||||
--model_type roberta \
|
--model_type roberta \
|
||||||
--model_name_or_path roberta-large \
|
--model_name_or_path roberta-large \
|
||||||
--task_name STS-B \
|
--task_name STS-B \
|
||||||
|
@ -333,7 +333,7 @@ class Trainer:
|
|||||||
total_train_batch_size = (
|
total_train_batch_size = (
|
||||||
self.args.train_batch_size
|
self.args.train_batch_size
|
||||||
* self.args.gradient_accumulation_steps
|
* self.args.gradient_accumulation_steps
|
||||||
* (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1),
|
* (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)
|
||||||
)
|
)
|
||||||
logger.info("***** Running training *****")
|
logger.info("***** Running training *****")
|
||||||
logger.info(" Num examples = %d", num_examples)
|
logger.info(" Num examples = %d", num_examples)
|
||||||
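The hunk above drops a trailing comma inside the parentheses, so `total_train_batch_size` evaluates to an int rather than a one-element tuple. A minimal illustration of the pitfall:

```python
# With a trailing comma, the parenthesised expression is a 1-tuple.
batch_size_tuple = (
    4
    * 2
    * 1,
)
# Without it, the same expression is a plain int.
batch_size_int = (
    4
    * 2
    * 1
)
assert batch_size_tuple == (8,)
assert batch_size_int == 8
```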
@ -28,7 +28,7 @@ class DataCollatorIntegrationTest(unittest.TestCase):
|
|||||||
MODEL_ID = "bert-base-cased-finetuned-mrpc"
|
MODEL_ID = "bert-base-cased-finetuned-mrpc"
|
||||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||||
data_args = GlueDataTrainingArguments(
|
data_args = GlueDataTrainingArguments(
|
||||||
task_name="mrpc", data_dir="./examples/tests_samples/MRPC", overwrite_cache=True
|
task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC", overwrite_cache=True
|
||||||
)
|
)
|
||||||
dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
|
dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
|
||||||
data_collator = DefaultDataCollator()
|
data_collator = DefaultDataCollator()
|
||||||
@ -39,7 +39,7 @@ class DataCollatorIntegrationTest(unittest.TestCase):
|
|||||||
MODEL_ID = "distilroberta-base"
|
MODEL_ID = "distilroberta-base"
|
||||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||||
data_args = GlueDataTrainingArguments(
|
data_args = GlueDataTrainingArguments(
|
||||||
task_name="sts-b", data_dir="./examples/tests_samples/STS-B", overwrite_cache=True
|
task_name="sts-b", data_dir="./tests/fixtures/tests_samples/STS-B", overwrite_cache=True
|
||||||
)
|
)
|
||||||
dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
|
dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
|
||||||
data_collator = DefaultDataCollator()
|
data_collator = DefaultDataCollator()
|
||||||
@ -91,7 +91,7 @@ class TrainerIntegrationTest(unittest.TestCase):
|
|||||||
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
|
||||||
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
|
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
|
||||||
data_args = GlueDataTrainingArguments(
|
data_args = GlueDataTrainingArguments(
|
||||||
task_name="mrpc", data_dir="./examples/tests_samples/MRPC", overwrite_cache=True
|
task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC", overwrite_cache=True
|
||||||
)
|
)
|
||||||
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
|
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)
|
||||||
|
|
||||||
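The updated test paths above exercise the `GlueDataset` / `DefaultDataCollator` / `Trainer` combination. A rough sketch of how those pieces fit together outside the tests — the data directory, output directory, and eval-only setup are placeholders, not the tests' exact configuration:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DefaultDataCollator,
    GlueDataset,
    GlueDataTrainingArguments,
    Trainer,
    TrainingArguments,
)

model_id = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Points at a local copy of the MRPC data (placeholder path).
data_args = GlueDataTrainingArguments(task_name="mrpc", data_dir="./tests/fixtures/tests_samples/MRPC")
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./mrpc_eval"),
    data_collator=DefaultDataCollator(),
    eval_dataset=eval_dataset,
)
print(trainer.evaluate())  # reports at least the eval loss
```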
|
@ -1,13 +1,13 @@
|
|||||||
---
|
---
|
||||||
|
|
||||||
- step:
|
- step:
|
||||||
name: Execute python examples/run_glue.py
|
name: Execute python examples/text-classification/run_glue.py
|
||||||
image: pytorch/pytorch:nightly-devel-cuda10.0-cudnn7
|
image: pytorch/pytorch:nightly-devel-cuda10.0-cudnn7
|
||||||
command:
|
command:
|
||||||
- python /valohai/repository/utils/download_glue_data.py --data_dir=/glue_data
|
- python /valohai/repository/utils/download_glue_data.py --data_dir=/glue_data
|
||||||
- pip install -e .
|
- pip install -e .
|
||||||
- pip install -r examples/requirements.txt
|
- pip install -r examples/requirements.txt
|
||||||
- python examples/run_glue.py --do_train --data_dir=/glue_data/{parameter-value:task_name} {parameters}
|
- python examples/text-classification/run_glue.py --do_train --data_dir=/glue_data/{parameter-value:task_name} {parameters}
|
||||||
parameters:
|
parameters:
|
||||||
- name: model_type
|
- name: model_type
|
||||||
pass-as: --model_type={v}
|
pass-as: --model_type={v}
|
||||||
|