# Translation

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that also extends to vision and audio tasks.

This guide will show you how to fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.

See the translation [task page](https://huggingface.co/tasks/translation) for more information about its associated models, datasets, and metrics.

## Load OPUS Books dataset

Load the OPUS Books dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset

>>> books = load_dataset("opus_books", "en-fr")
```

Split the dataset into a train and test set:

```py
>>> books = books["train"].train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
```

The `translation` field is a dictionary containing the English and French translations of the text.

## Preprocess

Load the T5 tokenizer to process the English-French language pairs:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
```

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Tokenize the input (English) and target (French) separately, because a tokenizer set up for the source language may not handle the target language correctly. Use the `tokenizer.as_target_tokenizer()` context manager to put the tokenizer into target mode before tokenizing the French text.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

```py
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "


>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, max_length=128, truncation=True)

...     with tokenizer.as_target_tokenizer():
...         labels = tokenizer(targets, max_length=128, truncation=True)

...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
```

Use the 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
```
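Before building batches, it can help to sanity-check what the preprocessing produced. The snippet below is an optional sketch that decodes the first tokenized training example; the decoded input should start with the task prefix and the decoded labels should match the French sentence (special tokens such as `</s>` will also appear in the output):

```py
>>> sample = tokenized_books["train"][0]

>>> # The input ids should decode back to the prefixed English sentence
>>> print(tokenizer.decode(sample["input_ids"]))

>>> # The labels should decode back to the French translation
>>> print(tokenizer.decode(sample["labels"]))
```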
Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient. The collator takes a reference to the model (loaded in the [Train](#train) section below) so it can prepare the `decoder_input_ids`.

For PyTorch:

```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```

For TensorFlow, also set `return_tensors="tf"`:

```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```

## Train

To fine-tune a model in PyTorch, load T5 with [`AutoModelForSeq2SeqLM`]:

```py
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

At this point, only three steps remain:

1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`].
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()
```

To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify the inputs and labels in `columns`, whether to shuffle the dataset order, the batch size, and the data collator (use the one created with `return_tensors="tf"` above):

```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](../training#finetune-with-keras)!

Set up an optimizer and some training hyperparameters:

```py
>>> from transformers import AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```

Load T5 with [`TFAutoModelForSeq2SeqLM`]:

```py
>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

```py
>>> model.compile(optimizer=optimizer)
```

Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```

For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).
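After fine-tuning, you can spot-check the model on a new sentence. The snippet below is a minimal PyTorch sketch, assuming the `model` and `tokenizer` from the [`Seq2SeqTrainer`] run above are still in memory; the prompt prefix must match the one used during preprocessing, and `max_length=40` is only an illustrative generation setting:

```py
>>> # Prefix the sentence the same way as during preprocessing
>>> text = "translate English to French: The sea was calm that night."

>>> # Tokenize and move the tensors to the same device as the model
>>> inputs = tokenizer(text, return_tensors="pt").to(model.device)

>>> # Generate a translation and decode it back to text
>>> outputs = model.generate(**inputs, max_length=40)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```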