## GLUE Benchmark

# Run TensorFlow 2.0 version

Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).

Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, and an option for XLA, which uses the XLA compiler to reduce model runtime.

Options are toggled using the `USE_XLA` or `USE_AMP` variables in the script.
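
As a quick sketch (assuming the toggles sit near the top of `run_tf_glue.py` and that the script takes no additional command-line arguments), one would edit the variables and then launch the script directly:

```bash
# Hypothetical usage: after setting USE_AMP = True (and/or USE_XLA = True)
# inside run_tf_glue.py, the script is run as-is, without extra arguments.
python run_tf_glue.py
```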

These options and the benchmark below are provided by @tlkh.

Quick benchmarks from the script (no other modifications):

| GPU     | Mode | Time (2nd epoch) | Val Acc (3 runs)     |
| ------- | ---- | ---------------- | -------------------- |
| Titan V | FP32 | 41s              | 0.8438/0.8281/0.8333 |
| Titan V | AMP  | 26s              | 0.8281/0.8568/0.8411 |
| V100    | FP32 | 35s              | 0.8646/0.8359/0.8464 |
| V100    | AMP  | 22s              | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s              | -                    |

Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).

## Run generic text classification script in TensorFlow

The script [run_tf_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_text_classification.py) allows users to run text classification on their own CSV files. For now there are a few restrictions. The CSV files must have a header corresponding to the column names and no more than three columns: one column for the id, one column for the text, and another column for a second piece of text in the case of an entailment classification, for example.
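
As an illustration (the column names `label`, `sentence1` and `sentence2` below are arbitrary examples, not names required by the script), a toy `train.csv` consistent with these constraints could be created as follows; with this layout, `--label_column_id 0` points at the first column:

```bash
# Create a tiny illustrative training file (hypothetical column names).
cat > train.csv <<'EOF'
label,sentence1,sentence2
1,"A man is playing a guitar.","Someone is playing an instrument."
0,"A man is playing a guitar.","Nobody is making any music."
EOF
```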

To use the script, one has to run the following command line:
```bash
# --train_file:      training dataset file location (mandatory if running with the --do_train option)
# --dev_file:        development dataset file location (mandatory if running with the --do_eval option)
# --test_file:       test dataset file location (mandatory if running with the --do_predict option)
# --label_column_id: which column corresponds to the labels
python run_tf_text_classification.py \
  --train_file train.csv \
  --dev_file dev.csv \
  --test_file test.csv \
  --label_column_id 0 \
  --model_name_or_path bert-base-multilingual-uncased \
  --output_dir model \
  --num_train_epochs 4 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 32 \
  --do_train \
  --do_eval \
  --do_predict \
  --logging_steps 10 \
  --evaluation_strategy steps \
  --save_steps 10 \
  --overwrite_output_dir \
  --max_seq_length 128
```

# Run PyTorch version

Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).

Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.

GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on a single V100 GPU with total train
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the
results between different runs. We report the median of 5 runs (with different seeds) for each of the metrics.

| Task  | Metric                       | Result      |
|-------|------------------------------|-------------|
| CoLA  | Matthews corr.               | 49.23       |
| SST-2 | Accuracy                     | 91.97       |
| MRPC  | F1/Accuracy                  | 89.47/85.29 |
| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
| QQP   | Accuracy/F1                  | 88.40/84.31 |
| MNLI  | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI  | Accuracy                     | 87.46       |
| RTE   | Accuracy                     | 61.73       |
| WNLI  | Accuracy                     | 45.07       |

Some of these results are significantly different from the ones reported on the test set of the GLUE benchmark on the
website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.

```bash
export TASK_NAME=MRPC

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
```

where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
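
To sweep over several tasks, the same command can be wrapped in a small loop. This is only a sketch: it assumes the task names listed above are accepted verbatim by `--task_name`, and it simply reuses the MRPC hyper-parameters for every task:

```bash
# Hypothetical sweep over all GLUE tasks with the MRPC hyper-parameters.
for TASK_NAME in CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI; do
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
done
```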

The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`.
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
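
For instance, after the MRPC run above, the dev metrics can be inspected directly:

```bash
# eval_results.txt is written into the output_dir (/tmp/MRPC/ in the example above).
cat /tmp/MRPC/eval_results.txt
```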

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class DataProcessor.

## Running on TPUs in PyTorch

Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on
how to set up your TPU environment refer to the
[pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).

For running your GLUE task (MRPC in the example below) you can run something like the following from the root of the
transformers repo:

```bash
python examples/xla_spawn.py \
  --num_cores=8 \
  transformers/examples/text-classification/run_glue.py \
  --do_train \
  --do_eval \
  --task_name=mrpc \
  --num_train_epochs=3 \
  --max_seq_length=128 \
  --learning_rate=5e-5 \
  --output_dir=/tmp/mrpc \
  --overwrite_output_dir \
  --logging_steps=5 \
  --save_steps=5 \
  --tpu_metrics_debug \
  --model_name_or_path=bert-base-cased \
  --per_device_train_batch_size=64 \
  --per_device_eval_batch_size=64
```

#### Using Apex and mixed-precision

Using Apex and 16-bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:

```bash
python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
  --fp16
```

#### Distributed training

Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model and it
reaches F1 > 92 on MRPC.

```bash
python -m torch.distributed.launch \
  --nproc_per_node 8 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
```

Training with these hyper-parameters gave us the following results:

```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```

### MNLI

The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.

```bash
export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
  --nproc_per_node 8 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mnli \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir output_dir
```

The results are the following:

```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904

***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```

# Run PyTorch version using PyTorch-Lightning

Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in
`examples/requirements.txt`. It is a shell pipeline that will automatically download and preprocess the data, then run the
specified models. Logs are saved in the `lightning_logs` directory.
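
For example (only a sketch: it assumes `run_pl.sh` forwards extra flags such as `--gpus`, described below, to the underlying training script):

```bash
# Run from the glue directory; flag forwarding is an assumption about run_pl.sh.
bash run_pl.sh --gpus 2
```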

Pass the `--gpus` flag to change the number of GPUs. By default, 1 GPU is used. At the end, the expected results are:

```
TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}
```

# XNLI

Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).

[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

#### Fine-tuning on XNLI

This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
on a single Tesla V100 16GB. The data for XNLI can be downloaded with the following links and should both be saved
(and unzipped) in a `$XNLI_DIR` directory.

* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip)
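
For instance, both archives can be fetched and unpacked into `$XNLI_DIR` along the following lines (a sketch only; adjust the path to your setup). The fine-tuning command itself follows below.

```bash
export XNLI_DIR=/path/to/XNLI
mkdir -p $XNLI_DIR

# Download both archives and unzip them in the same directory.
wget https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip -P $XNLI_DIR
wget https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip -P $XNLI_DIR
unzip $XNLI_DIR/XNLI-1.0.zip -d $XNLI_DIR
unzip $XNLI_DIR/XNLI-MT-1.0.zip -d $XNLI_DIR
```
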
```bash
export XNLI_DIR=/path/to/XNLI

python run_xnli.py \
  --model_name_or_path bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
  --do_eval \
  --data_dir $XNLI_DIR \
  --per_device_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 128 \
  --output_dir /tmp/debug_xnli/ \
  --save_steps -1
```

Training with the previously defined hyper-parameters yields the following results on the **test** set:

```bash
acc = 0.7093812375249501
```