<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Fine-tuning a pretrained model

[[open-in-colab]]

In this tutorial, we will show you how to fine-tune a pretrained model from the Transformers library. In TensorFlow,
models can be trained directly with Keras and the `fit` method. In PyTorch, there is no generic training loop, so
the 🤗 Transformers library provides an API with the [`Trainer`] class to let you fine-tune or train
a model from scratch easily. We will then also show you how to write the whole training loop in PyTorch.

Before we can fine-tune a model, we need a dataset. In this tutorial, we will show you how to fine-tune BERT on the
[IMDB dataset](https://www.imdb.com/interfaces/): the task is to classify whether movie reviews are positive or
negative. For examples of other tasks, refer to the [additional resources](#additional-resources) section!

<a id='data-processing'></a>

## Preparing the datasets

<Youtube id="_BZearw7f0w"/>

We will use the [🤗 Datasets](https://github.com/huggingface/datasets/) library to download and preprocess the IMDB
dataset. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
to the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/) or the [preprocessing](preprocessing) tutorial for
more information.

First, we can use the `load_dataset` function to download and cache the dataset:

```python
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
```

This works like the `from_pretrained` method we saw for the models and tokenizers (except the cache directory is
_~/.cache/huggingface/dataset_ by default).

The `raw_datasets` object is a dictionary with three keys: `"train"`, `"test"` and `"unsupervised"`
(which correspond to the three splits of that dataset). We will use the `"train"` split for training and the
`"test"` split for validation.

To preprocess our data, we will need a tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

As we saw in [preprocessing](preprocessing), we can prepare the text inputs for the model with the following command (this is an
example, not a command you can execute):

```python
inputs = tokenizer(sentences, padding="max_length", truncation=True)
```

This will make all the samples have the maximum length the model can accept (here 512), either by padding or truncating
them.
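
To see concretely what the tokenizer produces, you can run it on a single made-up sentence; a small sketch (the exact
token ids depend on the vocabulary):

```python
# Tokenizing one sentence returns a dict-like object with lists of integers
encoded = tokenizer("This movie was surprisingly good!", padding="max_length", truncation=True)

print(encoded.keys())  # for BERT: input_ids, token_type_ids and attention_mask
print(len(encoded["input_ids"]))  # 512, the maximum length of bert-base-cased
```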

However, we can instead apply these preprocessing steps to all the splits of our dataset at once by using the
`map` method:

```python
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```

You can learn more about the `map` method or the other ways to preprocess the data in the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/).
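
After the call to `map`, each split keeps its original columns and gains the tokenizer outputs as new columns. A quick
check, assuming the code above ran as-is:

```python
# The tokenized splits now carry the tokenizer outputs alongside the original columns
print(tokenized_datasets["train"].column_names)
# e.g. ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
```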

Next, we will generate a small subset of the training and validation set, to enable faster training:

```python
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
```

In all the examples below, we will always use `small_train_dataset` and `small_eval_dataset`. Just replace
them with their _full_ equivalent to train or evaluate on the full dataset.

<a id='trainer'></a>

## Fine-tuning in PyTorch with the Trainer API

<Youtube id="nvBXf7s7vTI"/>

Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a [`Trainer`]
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
logging, gradient accumulation, and mixed precision.

First, let's define our model:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
```

This will issue a warning about some of the pretrained weights not being used and some weights being randomly
initialized. That's because we are throwing away the pretraining head of the BERT model and replacing it with a
classification head which is randomly initialized. We will fine-tune this model on our task, transferring the knowledge
of the pretrained model to it (which is why doing this is called transfer learning).

Then, to define our [`Trainer`], we will need to instantiate a
[`TrainingArguments`]. This class contains all the hyperparameters we can tune for the
[`Trainer`], as well as the flags to activate the different training options it supports. Let's begin by
using all the defaults; the only thing we then have to provide is a directory in which the checkpoints will be saved:

```python
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")
```

Then we can instantiate a [`Trainer`] like this:

```python
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
```

To fine-tune our model, we just need to call

```python
trainer.train()
```

which will start a training run that you can follow with a progress bar; it should take a couple of minutes to complete
(as long as you have access to a GPU). It won't actually tell you anything useful about how well (or badly) your model
is performing, however, since by default there is no evaluation during training and we didn't tell the
[`Trainer`] to compute any metrics. Let's have a look at how to do that now!

To have the [`Trainer`] compute and report metrics, we need to give it a `compute_metrics`
function that takes predictions and labels (grouped in a namedtuple called [`EvalPrediction`]) and
returns a dictionary with string keys (the metric names) and float values (the metric values).

The 🤗 Datasets library provides an easy way to get the common metrics used in NLP with the `load_metric` function.
Here we simply use accuracy. Then we define the `compute_metrics` function, which just converts logits to predictions
(remember that all 🤗 Transformers models return the logits) and feeds them to the `compute` method of this metric.

```python
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```

The compute function needs to receive a tuple (with logits and labels) and has to return a dictionary with string keys
(the names of the metrics) and float values. It will be called at the end of each evaluation phase on the whole arrays of
predictions/labels.
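
As a quick sanity check, you can call `compute_metrics` by hand on a tiny fake batch; a minimal sketch with made-up
logits:

```python
# Two fake examples: the first is predicted as class 1, the second as class 0
fake_logits = np.array([[0.1, 0.9], [2.0, -1.0]])
fake_labels = np.array([1, 1])

# Only one of the two predictions matches its label, so accuracy is 0.5
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.5}
```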

To check if this works in practice, let's create a new [`Trainer`] with our fine-tuned model:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.evaluate()
```

which showed an accuracy of 87.5% in our case.

If you want to fine-tune your model and regularly report the evaluation metrics (for instance at the end of each
epoch), here is how you should define your training arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
```

See the documentation of [`TrainingArguments`] for more options.
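
In practice you will often want to set several of these options at once. As an illustration only (the values below are
common starting points, not recommendations from this tutorial), a more complete `TrainingArguments` could look like
this:

```python
from transformers import TrainingArguments

# Illustrative values; adjust to your hardware and dataset
training_args = TrainingArguments(
    "test_trainer",
    evaluation_strategy="epoch",      # evaluate at the end of every epoch
    per_device_train_batch_size=8,    # training batch size per GPU/CPU
    per_device_eval_batch_size=8,     # evaluation batch size per GPU/CPU
    num_train_epochs=3,               # total number of training epochs
    learning_rate=5e-5,               # initial learning rate
    weight_decay=0.01,                # weight decay applied by the optimizer
)
```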

<a id='keras'></a>

## Fine-tuning with Keras

<Youtube id="rnTGBy2ax1c"/>

Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
```

Then we will need to convert our datasets from before into standard `tf.data.Dataset` objects. Since we have fixed shapes,
this can easily be done as follows. First we remove the _"text"_ column from our datasets and set them in TensorFlow
format:

```python
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")
```

Then we convert everything into big tensors and use the `tf.data.Dataset.from_tensor_slices` method:

```python
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(8)
```

With this done, the model can then be compiled and trained like any Keras model:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.metrics.SparseCategoricalAccuracy()],
)

model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)
```
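
Once training is finished, the built-in Keras `evaluate` method gives you the loss and the accuracy on the validation
set; a short sketch reusing the dataset defined above:

```python
# Returns the loss plus the metrics passed to compile(), here the sparse categorical accuracy
eval_loss, eval_accuracy = model.evaluate(eval_tf_dataset)
print(f"eval loss: {eval_loss:.4f}, eval accuracy: {eval_accuracy:.4f}")
```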

With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):

```python
from transformers import AutoModelForSequenceClassification

model.save_pretrained("my_imdb_model")
pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
```
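
The conversion also works in the other direction: a checkpoint saved from a PyTorch model with `save_pretrained` can be
reloaded as a TensorFlow model by passing `from_pt=True`. A minimal sketch (the directory name here is just a
placeholder for wherever you saved a PyTorch checkpoint):

```python
from transformers import TFAutoModelForSequenceClassification

# Reload a PyTorch checkpoint as a TensorFlow model
tf_model = TFAutoModelForSequenceClassification.from_pretrained("my_imdb_model_pt", from_pt=True)
```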

<a id='pytorch_native'></a>

## Fine-tuning in native PyTorch

<Youtube id="Dh9CL8fyG80"/>

You might need to restart your notebook at this stage to free some memory, or execute the following code:

```python
import torch

del model
del pytorch_model
del trainer
torch.cuda.empty_cache()
```

Let's now see how to achieve the same results as in the [trainer section](#trainer) in native PyTorch. First we need to
define the dataloaders, which we will use to iterate over batches. We just need to apply a bit of post-processing to
our `tokenized_datasets` before doing so, to:

- remove the columns corresponding to values the model does not expect (here the `"text"` column)
- rename the column `"label"` to `"labels"` (because the model expects the argument to be named `labels`)
- set the format of the datasets so they return PyTorch tensors instead of lists.

Our _tokenized_datasets_ has one method for each of those steps:

```python
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
```

Now that this is done, we can easily define our dataloaders:

```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
```
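
To verify that the post-processing worked, you can pull a single batch out of the dataloader and look at its shapes; a
quick sketch:

```python
# Every value in the batch should now be a PyTorch tensor
batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})
# e.g. labels: (8,), input_ids / token_type_ids / attention_mask: (8, 512)
```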

Next, we define our model:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
```

We are almost ready to write our training loop; the only two things missing are an optimizer and a learning rate
scheduler. The default optimizer used by the [`Trainer`] is [`AdamW`]:

```python
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
```

Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5 here) to 0:

```python
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
```

One last thing: we will want to use the GPU if we have access to one (otherwise training might take several hours
instead of a couple of minutes). To do this, we define a `device` we will put our model and our batches on.

```python
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
```

We are now ready to train! To get some sense of when it will be finished, we add a progress bar over our number of
training steps, using the _tqdm_ library.

```python
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```

Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a
bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better
this way for Transformers models (so this is not an oversight on our side). If you're not familiar with what "freezing
the body" of the model means, forget you read this paragraph.

Now to check the results, we need to write the evaluation loop. Like in the [trainer section](#trainer), we will
use a metric from the 🤗 Datasets library. Here we accumulate the predictions at each batch before computing the final
result when the loop is finished.

```python
metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
```
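
Once you are happy with the results, you will usually want to save the fine-tuned model (and its tokenizer) so you can
reload it later with `from_pretrained`; a short sketch, with a placeholder directory name:

```python
# Save the fine-tuned weights, config and tokenizer to a local directory
model.save_pretrained("my_imdb_model_pytorch")
tokenizer.save_pretrained("my_imdb_model_pytorch")

# Later, reload them with:
# model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model_pytorch")
# tokenizer = AutoTokenizer.from_pretrained("my_imdb_model_pytorch")
```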

<a id='additional-resources'></a>

## Additional resources

To look at more fine-tuning examples you can refer to:

- [🤗 Transformers Examples](https://github.com/huggingface/transformers/tree/master/examples) which includes scripts
  to train on all common NLP tasks in PyTorch and TensorFlow.

- [🤗 Transformers Notebooks](notebooks) which contains various notebooks and in particular one per task (look for
  the _how to finetune a model on xxx_).