transformers/docs/source/en/preprocessing.md
Nilay Bhatnagar eedd21b9e7
Fixed Majority of the Typos in transformers[en] Documentation (#33350)
* Fixed typo: insted to instead

* Fixed typo: relase to release

* Fixed typo: nighlty to nightly

* Fixed typos: versatible, benchamarks, becnhmark to versatile, benchmark, benchmarks

* Fixed typo in comment: quantizd to quantized

* Fixed typo: architecutre to architecture

* Fixed typo: contibution to contribution

* Fixed typo: Presequities to Prerequisites

* Fixed typo: faste to faster

* Fixed typo: extendeding to extending

* Fixed typo: segmetantion_maps to segmentation_maps

* Fixed typo: Alternativelly to Alternatively

* Fixed incorrectly defined variable: output to output_disabled

* Fixed typo in library name: tranformers.onnx to transformers.onnx

* Fixed missing import: import tensorflow as tf

* Fixed incorrectly defined variable: token_tensor to tokens_tensor

* Fixed missing import: import torch

* Fixed incorrectly defined variable and typo: uromaize to uromanize

* Fixed incorrectly defined variable and typo: uromaize to uromanize

* Fixed typo in function args: numpy.ndarry to numpy.ndarray

* Fixed Inconsistent Library Name: Torchscript to TorchScript

* Fixed Inconsistent Class Name: OneformerProcessor to OneFormerProcessor

* Fixed Inconsistent Class Named Typo: TFLNetForMultipleChoice to TFXLNetForMultipleChoice

* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch

* Fixed Inconsistent Function Name Typo: captureWarning to captureWarnings

* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch

* Fixed Inconsistent Class Name Typo: TrainingArgument to TrainingArguments

* Fixed Inconsistent Model Name Typo: Swin2R to Swin2SR

* Fixed Inconsistent Model Name Typo: EART to BERT

* Fixed Inconsistent Library Name Typo: TensorFLow to TensorFlow

* Fixed Broken Link for Speech Emotion Classification with Wav2Vec2

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed Punctuation: Two commas

* Fixed Punctuation: No Space between XLM-R and is

* Fixed Punctuation: No Space between [~accelerate.Accelerator.backward] and method

* Added backticks to display model.fit() in codeblock

* Added backticks to display openai-community/gpt2 in codeblock

* Fixed Minor Typo: will to with

* Fixed Minor Typo: is to are

* Fixed Minor Typo: in to on

* Fixed Minor Typo: inhibits to exhibits

* Fixed Minor Typo: they need to it needs

* Fixed Minor Typo: cast the load the checkpoints To load the checkpoints

* Fixed Inconsistent Class Name Typo: TFCamembertForCasualLM to TFCamembertForCausalLM

* Fixed typo in attribute name: outputs.last_hidden_states to outputs.last_hidden_state

* Added missing verbosity level: fatal

* Fixed Minor Typo: take To takes

* Fixed Minor Typo: heuristic To heuristics

* Fixed Minor Typo: setting To settings

* Fixed Minor Typo: Content To Contents

* Fixed Minor Typo: millions To million

* Fixed Minor Typo: difference To differences

* Fixed Minor Typo: while extract To which extracts

* Fixed Minor Typo: Hereby To Here

* Fixed Minor Typo: addition To additional

* Fixed Minor Typo: supports To supported

* Fixed Minor Typo: so that benchmark results TO as a consequence, benchmark

* Fixed Minor Typo: a To an

* Fixed Minor Typo: a To an

* Fixed Minor Typo: Chain-of-though To Chain-of-thought
2024-09-09 10:47:24 +02:00

24 KiB
Raw Blame History

Preprocess

open-in-colab

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

  • Text, use a Tokenizer to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
  • Speech and audio, use a Feature extractor to extract sequential features from audio waveforms and convert them into tensors.
  • Image inputs use a ImageProcessor to convert images into tensors.
  • Multimodal inputs, use a Processor to combine a tokenizer and a feature extractor or image processor.

AutoProcessor always works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.

Before you begin, install 🤗 Datasets so you can load some datasets to experiment with:

pip install datasets

Natural Language Processing

The main tool for preprocessing textual data is a tokenizer. A tokenizer splits text into tokens according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index (usually referred to as the vocab) during pretraining.

Get started by loading a pretrained tokenizer with the [AutoTokenizer.from_pretrained] method. This downloads the vocab a model was pretrained with:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

Then pass your text to the tokenizer:

>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
>>> print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer returns a dictionary with three important items:

  • input_ids are the indices corresponding to each token in the sentence.
  • attention_mask indicates whether a token should be attended to or not.
  • token_type_ids identifies which sequence a token belongs to when there is more than one sequence.

Return your input by decoding the input_ids:

>>> tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

As you can see, the tokenizer added two special tokens - CLS and SEP (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you.

If there are several sentences you want to preprocess, pass them as a list to the tokenizer:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}

Pad

Sentences aren't always the same length which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences.

Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

The first and third sentences are now padded with 0's because they are shorter.

Truncation

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
>>> print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Check out the Padding and truncation concept guide to learn more different padding and truncation arguments.

Build tensors

Finally, you want the tokenizer to return the actual tensors that get fed to the model.

Set the return_tensors parameter to either pt for PyTorch, or tf for TensorFlow:

>>> batch_sentences = [
...     "But what about second breakfast?",
...     "Don't think he knows about second breakfast, Pip.",
...     "What about elevensies?",
... ]
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
>>> print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```py >>> batch_sentences = [ ... "But what about second breakfast?", ... "Don't think he knows about second breakfast, Pip.", ... "What about elevensies?", ... ] >>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf") >>> print(encoded_input) {'input_ids': , 'token_type_ids': , 'attention_mask': } ``` Different pipelines support tokenizer arguments in their `__call__()` differently. `text-2-text-generation` pipelines support (i.e. pass on) only `truncation`. `text-generation` pipelines support `max_length`, `truncation`, `padding` and `add_special_tokens`. In `fill-mask` pipelines, tokenizer arguments can be passed in the `tokenizer_kwargs` argument (dictionary).

Audio

For audio tasks, you'll need a feature extractor to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors.

Load the MInDS-14 dataset (see the 🤗 Datasets tutorial for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets:

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

Access the first element of the audio column to take a look at the input. Calling the audio column automatically loads and resamples the audio file:

>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 8000}

This returns three items:

  • array is the speech signal loaded - and potentially resampled - as a 1D array.
  • path points to the location of the audio file.
  • sampling_rate refers to how many data points in the speech signal are measured per second.

For this tutorial, you'll use the Wav2Vec2 model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data.

  1. Use 🤗 Datasets' [~datasets.Dataset.cast_column] method to upsample the sampling rate to 16kHz:
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
  1. Call the audio column again to resample the audio file:
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}

Next, load a feature extractor to normalize and pad the input. When padding textual data, a 0 is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a 0 - interpreted as silence - to array.

Load the feature extractor with [AutoFeatureExtractor.from_pretrained]:

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

Pass the audio array to the feature extractor. We also recommend adding the sampling_rate argument in the feature extractor in order to better debug any silent errors that may occur.

>>> audio_input = [dataset[0]["audio"]["array"]]
>>> feature_extractor(audio_input, sampling_rate=16000)
{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,
        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}

Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples:

>>> dataset[0]["audio"]["array"].shape
(173398,)

>>> dataset[1]["audio"]["array"].shape
(106496,)

Create a function to preprocess the dataset so the audio samples are the same lengths. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs

Apply the preprocess_function to the first few examples in the dataset:

>>> processed_dataset = preprocess_function(dataset[:5])

The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now!

>>> processed_dataset["input_values"][0].shape
(100000,)

>>> processed_dataset["input_values"][1].shape
(100000,)

Computer vision

For computer vision tasks, you'll need an image processor to prepare your dataset for the model. Image preprocessing consists of several steps that convert images into the input expected by the model. These steps include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors.

Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation transform image data, but they serve different purposes:

  • Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations.
  • Image preprocessing guarantees that the images match the models expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained.

You can use any library you like for image augmentation. For image preprocessing, use the ImageProcessor associated with the model.

Load the food101 dataset (see the 🤗 Datasets tutorial for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

Use 🤗 Datasets split parameter to only load a small sample from the training split since the dataset is quite large!

>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")

Next, take a look at the image with 🤗 Datasets Image feature:

>>> dataset[0]["image"]

Load the image processor with [AutoImageProcessor.from_pretrained]:

>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's transforms module. If you're interested in using another data augmentation library, learn how in the Albumentations or Kornia notebooks.

  1. Here we use Compose to chain together a couple of transforms - RandomResizedCrop and ColorJitter. Note that for resizing, we can get the image size requirements from the image_processor. For some models, an exact height and width are expected, for others only the shortest_edge is defined.
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose

>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )

>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
  1. The model accepts pixel_values as its input. ImageProcessor can take care of normalizing the images, and generating appropriate tensors. Create a function that combines image augmentation and image preprocessing for a batch of images and generates pixel_values:
>>> def transforms(examples):
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples

In the example above we set do_resize=False because we have already resized the images in the image augmentation transformation, and leveraged the size attribute from the appropriate image_processor. If you do not resize images during image augmentation, leave this parameter out. By default, ImageProcessor will handle the resizing.

If you wish to normalize images as a part of the augmentation transformation, use the image_processor.image_mean, and image_processor.image_std values.

  1. Then use 🤗 Datasets[~datasets.Dataset.set_transform] to apply the transforms on the fly:
>>> dataset.set_transform(transforms)
  1. Now when you access the image, you'll notice the image processor has added pixel_values. You can pass your processed dataset to the model now!
>>> dataset[0].keys()

Here is what the image looks like after the transforms are applied. The image has been randomly cropped and it's color properties are different.

>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> img = dataset[0]["pixel_values"]
>>> plt.imshow(img.permute(1, 2, 0))

For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, ImageProcessor offers post processing methods. These methods convert model's raw outputs into meaningful predictions such as bounding boxes, or segmentation maps.

Pad

In some cases, for instance, when fine-tuning DETR, the model applies scale augmentation at training time. This may cause images to be different sizes in a batch. You can use [DetrImageProcessor.pad] from [DetrImageProcessor] and define a custom collate_fn to batch images together.

>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch

Multimodal

For tasks involving multimodal inputs, you'll need a processor to prepare your dataset for the model. A processor couples together two processing objects such as tokenizer and feature extractor.

Load the LJ Speech dataset (see the 🤗 Datasets tutorial for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

>>> from datasets import load_dataset

>>> lj_speech = load_dataset("lj_speech", split="train")

For ASR, you're mainly focused on audio and text so you can remove the other columns:

>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])

Now take a look at the audio and text columns:

>>> lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

>>> lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'

Remember you should always resample your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!

>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))

Load a processor with [AutoProcessor.from_pretrained]:

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
  1. Create a function to process the audio data contained in array to input_values, and tokenize text to labels. These are the inputs to the model:
>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))

...     return example
  1. Apply the prepare_dataset function to a sample:
>>> prepare_dataset(lj_speech[0])

The processor has now added input_values and labels, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now!