<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Translation
<Youtube id="1JvfrvZgi6c"/>

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks.

This guide will show you how to fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.

<Tip>

See the translation [task page](https://huggingface.co/tasks/translation) for more information about its associated models, datasets, and metrics.

</Tip>
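
Before fine-tuning, you can get a feel for what the task produces by running a pretrained checkpoint through the [`pipeline`] API (a quick, optional sketch using the same `t5-small` checkpoint this guide fine-tunes):

```py
>>> from transformers import pipeline

>>> # "translation_en_to_fr" is one of T5's built-in tasks, so no fine-tuning is needed for a first try
>>> translator = pipeline("translation_en_to_fr", model="t5-small")
>>> translator("Legumes share resources with nitrogen-fixing bacteria.")
```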
## Load OPUS Books dataset

Load the OPUS Books dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset

>>> books = load_dataset("opus_books", "en-fr")
```

Split this dataset into a train and test set:

```py
>>> books = books["train"].train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered our element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
```

The `translation` field is a dictionary containing the English and French translations of the text.
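
For example, you can pull out just one side of a pair by indexing into that dictionary (a small illustrative snippet, not a required step):

```py
>>> # Access only the English half of the first training pair
>>> books["train"][0]["translation"]["en"]
'But this lofty plateau measured only a few fathoms, and soon we reentered our element.'
```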
## Preprocess

<Youtube id="XAR8jnZZuUs"/>

Load the T5 tokenizer to process the language pairs:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
```

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Tokenize the input (English) and target (French) separately. Tokenize the targets inside the `as_target_tokenizer` context manager so the tokenizer prepares them as labels rather than as model inputs.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

```py
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "


>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, max_length=128, truncation=True)

...     with tokenizer.as_target_tokenizer():
...         labels = tokenizer(targets, max_length=128, truncation=True)

...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
```
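
To sanity-check the function before applying it to the whole dataset, you can call it on a small slice (an optional check):

```py
>>> # A batched slice looks just like what `map` passes with `batched=True`
>>> batch = preprocess_function(books["train"][:2])
>>> list(batch.keys())
['input_ids', 'attention_mask', 'labels']
```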
Use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
```

Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient. Note that the collator also receives the `model` you will load in the [Train](#train) section so it can prepare the `decoder_input_ids` from the labels.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
</tf>
</frameworkcontent>
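
If you want to see the dynamic padding in action, you can batch two tokenized examples by hand and inspect the result (an optional sketch using the PyTorch collator; it assumes the `model` loaded in the next section already exists so the collator can be created):

```py
>>> # Keep only the columns the collator expects
>>> features = [
...     {k: tokenized_books["train"][i][k] for k in ["input_ids", "attention_mask", "labels"]}
...     for i in range(2)
... ]
>>> batch = data_collator(features)
>>> batch["input_ids"].shape  # both rows are padded to the longest example in this batch
```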
## Train

<frameworkcontent>
<pt>
Load T5 with [`AutoModelForSeq2SeqLM`]:

```py
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

<Tip>

If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`].
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()
```
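
Once training finishes, you can translate new text with your fine-tuned model (a minimal sketch; it reuses the task `prefix` defined in the preprocessing step):

```py
>>> text = prefix + "The quality of the translation depends on the data."
>>> inputs = tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model.generate(inputs, max_length=128)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
```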
</pt>
<tf>
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify the inputs and labels in `columns`, whether to shuffle the dataset order, the batch size, and the data collator:

```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```
<Tip>

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](../training#finetune-with-keras)!

</Tip>

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

```py
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```

Load T5 with [`TFAutoModelForSeq2SeqLM`]:

```py
>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

```py
>>> model.compile(optimizer=optimizer)
```

Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
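
As with the PyTorch version, you can then translate new text with the fine-tuned model (a minimal sketch reusing the task `prefix` from preprocessing):

```py
>>> text = prefix + "The quality of the translation depends on the data."
>>> inputs = tokenizer(text, return_tensors="tf").input_ids
>>> outputs = model.generate(inputs, max_length=128)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
```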
</tf>
</frameworkcontent>

<Tip>

For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).

</Tip>