# Translation
Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks.
This guide will show you how to fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.
See the translation [task page](https://huggingface.co/tasks/translation) for more information about its associated models, datasets, and metrics.
## Load OPUS Books dataset
Load the OPUS Books dataset from the 🤗 Datasets library:
```py
>>> from datasets import load_dataset
>>> books = load_dataset("opus_books", "en-fr")
```
Split this dataset into a train and test set:
```py
>>> books = books["train"].train_test_split(test_size=0.2)
```
Then take a look at an example:
```py
>>> books["train"][0]
{'id': '90560',
'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
```
The `translation` field is a dictionary containing the English and French translations of the text.
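Each pair is keyed by its language code, so you can index into the nested dictionary directly. For example:

```py
>>> example = books["train"][0]["translation"]
>>> english, french = example["en"], example["fr"]
```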
## Preprocess
Load the T5 tokenizer to process the language pairs:
```py
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
```
The preprocessing function needs to:
1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Tokenize the input (English) and target (French) separately, because the targets can't be prepared with the tokenizer's default source-language settings. Use the `as_target_tokenizer` context manager to switch the tokenizer into target mode before tokenizing the French text.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.
```py
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "
>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, max_length=128, truncation=True)
...     with tokenizer.as_target_tokenizer():
...         labels = tokenizer(targets, max_length=128, truncation=True)
...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
```
Use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:
```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
```
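If you want to sanity-check the preprocessing, one option (not part of the original recipe) is to decode a processed example back into text; the input should start with the task prefix, and the labels should correspond to the French sentence:

```py
>>> sample = tokenized_books["train"][0]
>>> print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))  # begins with the "translate English to French: " prefix
>>> print(tokenizer.decode(sample["labels"], skip_special_tokens=True))  # the French target text
```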
Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient. The `model` argument refers to the model you'll load in the next section; the collator uses it to prepare the decoder inputs from the labels.

For PyTorch:

```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```

For TensorFlow, also set `return_tensors="tf"`:

```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
## Fine-tune with Trainer
Load T5 with [`AutoModelForSeq2SeqLM`]:
```py
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
At this point, only three steps remain:
1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`].
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.
```py
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )
>>> trainer.train()
```
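After training, you can try the fine-tuned model on new text. The snippet below is a minimal sketch: it reuses the same task prefix from preprocessing, and the example sentence is just an illustration:

```py
>>> text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."
>>> inputs = tokenizer(text, return_tensors="pt").to(model.device)
>>> outputs = model.generate(**inputs, max_length=128)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```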
## Fine-tune with TensorFlow
Fine-tuning a model in TensorFlow is just as easy, with only a few differences.
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](../training#finetune-with-keras)!
Convert your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify the inputs and labels in `columns`, whether to shuffle the dataset order, the batch size, and the data collator:
```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```
Set up an optimizer and some training hyperparameters:
```py
>>> from transformers import AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```
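If you also want a learning rate schedule with warmup, you could instead build the optimizer with [`create_optimizer`], which returns both the optimizer and the schedule. The step count below is an assumption based on the batch size and epoch count used later in this guide:

```py
>>> from transformers import create_optimizer

>>> batch_size = 16  # matches the batch size passed to to_tf_dataset below
>>> num_epochs = 3  # matches the epochs passed to fit below
>>> num_train_steps = (len(tokenized_books["train"]) // batch_size) * num_epochs
>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=2e-5,
...     num_warmup_steps=0,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=0.01,
... )
```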
Load T5 with [`TFAutoModelForSeq2SeqLM`]:
```py
>>> from transformers import TFAutoModelForSeq2SeqLM
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
```py
>>> model.compile(optimizer=optimizer)
```
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
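As with the PyTorch model, you can then generate a translation to sanity-check the fine-tuned TensorFlow model. This is a minimal sketch with an illustrative input sentence and the same task prefix:

```py
>>> text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."
>>> inputs = tokenizer(text, return_tensors="tf")
>>> outputs = model.generate(**inputs, max_length=128)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```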
For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/translation-tf.ipynb).