mirror of
https://github.com/huggingface/transformers.git
synced 2025-08-01 02:31:11 +06:00
undo
This commit is contained in:
parent
ebad6b5d1e
commit
5d38d76875
@ -7,6 +7,4 @@
|
||||
- sections:
|
||||
- local: pipeline_tutorial
|
||||
title: Pipeline per l'inferenza
|
||||
- local: preprocessing
|
||||
title: Preprocess
|
||||
title: Esercitazione
|
||||
|
@ -1,503 +0,0 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Preprocess
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
Prima di poter usare i dati in un modello, bisogna processarli in un formato accettabile per quest'ultimo. Un modello non comprende il testo piano, le immagini o l'audio. Bisogna convertire questi input in numeri e assemblarli all'interno di tensori. In questa esercitazione, tu potrai:
|
||||
|
||||
* Preprocessare dati testuali con un tokenizer.
|
||||
* Preprocessare immagini o dati audio con un estrattore di caratteristiche.
|
||||
* Preprocessare dati per attività multimodali mediante un processore.
|
||||
|
||||
## NLP
|
||||
|
||||
<Youtube id="Yffk5aydLzg"/>
|
||||
|
||||
Lo strumento principale per processare dati testuali è un [tokenizer](main_classes/tokenizer). Un tokenizer inizia separando il testo in *tokens* secondo una serie di regole. I tokens sono convertiti in numeri, questi vengono utilizzati per costruire i tensori di input del modello. Anche altri input addizionali se richiesti dal modello vengono aggiunti dal tokenizer.
|
||||
|
||||
<Tip>
|
||||
|
||||
Se stai pensando si utilizzare un modello preaddestrato, è importante utilizzare il tokenizer preaddestrato associato. Questo assicura che il testo sia separato allo stesso modo che nel corpus usato per l'addestramento, e venga usata la stessa mappatura tokens-to-index (solitamente indicato come il *vocabolario*) come nel preaddestramento.
|
||||
|
||||
</Tip>
|
||||
|
||||
Iniziamo subito caricando un tokenizer preaddestrato con la classe [`AutoTokenizer`]. Questo scarica il *vocabolario* usato quando il modello è stato preaddestrato.
|
||||
|
||||
### Tokenize
|
||||
|
||||
Carica un tokenizer preaddestrato con [`AutoTokenizer.from_pretrained`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
|
||||
```
|
||||
|
||||
Poi inserisci le tue frasi nel tokenizer:
|
||||
|
||||
```py
|
||||
>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
|
||||
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
||||
```
|
||||
|
||||
Il tokenizer restituisce un dizionario contenente tre oggetti importanti:
|
||||
|
||||
* [input_ids](glossary#input-ids) sono gli indici che corrispondono ad ogni token nella frase.
|
||||
* [attention_mask](glossary#attention-mask) indicata se un token deve essere elaborato o no.
|
||||
* [token_type_ids](glossary#token-type-ids) identifica a quale sequenza appartiene un token se è presente più di una sequenza.
|
||||
|
||||
Si possono decodificare gli `input_ids` per farsi restituire l'input originale:
|
||||
|
||||
```py
|
||||
>>> tokenizer.decode(encoded_input["input_ids"])
|
||||
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
|
||||
```
|
||||
|
||||
Come si può vedere, il tokenizer aggiunge due token speciali - `CLS` and `SEP` (classifier and separator) - alla frase. Non tutti i modelli hanno bisogno dei token speciali, ma se servono, il tokenizer li aggiungerà automaticamente.
|
||||
|
||||
Se ci sono più frasi che vuoi processare, passale come una lista al tokenizer:
|
||||
|
||||
```py
|
||||
>>> batch_sentences = [
|
||||
... "But what about second breakfast?",
|
||||
... "Don't think he knows about second breakfast, Pip.",
|
||||
... "What about elevensies?",
|
||||
... ]
|
||||
>>> encoded_inputs = tokenizer(batch_sentences)
|
||||
>>> print(encoded_inputs)
|
||||
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
|
||||
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
|
||||
[101, 1327, 1164, 5450, 23434, 136, 102]],
|
||||
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0]],
|
||||
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1, 1]]}
|
||||
```
|
||||
|
||||
### Pad
|
||||
|
||||
Questo è un argomento importante. Quando processi un insieme di frasi potrebbero non avere tutte la stessa lunghezza. Questo è un problema perchè i tensori, in input del modello, devono avere dimensioni uniformi. Il padding è una strategia per assicurarsi che i tensori siano rettangolari aggiungendo uno speciale *padding token* alle frasi più corte.
|
||||
|
||||
Imposta il parametro `padding` a `True` per imbottire le frasi più corte nel gruppo in modo che combacino con la massima lunghezza presente:
|
||||
|
||||
```py
|
||||
>>> batch_sentences = [
|
||||
... "But what about second breakfast?",
|
||||
... "Don't think he knows about second breakfast, Pip.",
|
||||
... "What about elevensies?",
|
||||
... ]
|
||||
>>> encoded_input = tokenizer(batch_sentences, padding=True)
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
|
||||
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
|
||||
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
|
||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
|
||||
```
|
||||
|
||||
Nota che il tokenizer aggiunge alle sequenze degli `0` perchè sono troppo corte!
|
||||
|
||||
### Truncation
|
||||
|
||||
On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.
|
||||
|
||||
Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:
|
||||
|
||||
```py
|
||||
>>> batch_sentences = [
|
||||
... "But what about second breakfast?",
|
||||
... "Don't think he knows about second breakfast, Pip.",
|
||||
... "What about elevensies?",
|
||||
... ]
|
||||
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
|
||||
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
|
||||
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
|
||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
|
||||
```
|
||||
|
||||
### Build tensors
|
||||
|
||||
Finally, you want the tokenizer to return the actual tensors that are fed to the model.
|
||||
|
||||
Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
```py
|
||||
>>> batch_sentences = [
|
||||
... "But what about second breakfast?",
|
||||
... "Don't think he knows about second breakfast, Pip.",
|
||||
... "What about elevensies?",
|
||||
... ]
|
||||
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
|
||||
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
|
||||
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
|
||||
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
|
||||
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
|
||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
```py
|
||||
>>> batch_sentences = [
|
||||
... "But what about second breakfast?",
|
||||
... "Don't think he knows about second breakfast, Pip.",
|
||||
... "What about elevensies?",
|
||||
... ]
|
||||
>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
|
||||
array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
|
||||
[101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
|
||||
[101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||
dtype=int32)>,
|
||||
'token_type_ids': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
|
||||
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
|
||||
'attention_mask': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
|
||||
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
|
||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Audio
|
||||
|
||||
Audio inputs are preprocessed differently than textual inputs, but the end goal remains the same: create numerical sequences the model can understand. A [feature extractor](main_classes/feature_extractor) is designed for the express purpose of extracting features from raw image or audio data and converting them into tensors. Before you begin, install 🤗 Datasets to load an audio dataset to experiment with:
|
||||
|
||||
```bash
|
||||
pip install datasets
|
||||
```
|
||||
|
||||
Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset):
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset, Audio
|
||||
|
||||
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
|
||||
```
|
||||
|
||||
Access the first element of the `audio` column to take a look at the input. Calling the `audio` column will automatically load and resample the audio file:
|
||||
|
||||
```py
|
||||
>>> dataset[0]["audio"]
|
||||
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
|
||||
0. , 0. ], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
|
||||
'sampling_rate': 8000}
|
||||
```
|
||||
|
||||
This returns three items:
|
||||
|
||||
* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
|
||||
* `path` points to the location of the audio file.
|
||||
* `sampling_rate` refers to how many data points in the speech signal are measured per second.
|
||||
|
||||
### Resample
|
||||
|
||||
For this tutorial, you will use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. As you can see from the model card, the Wav2Vec2 model is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your audio data.
|
||||
|
||||
For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8000kHz. In order to use the Wav2Vec2 model with this dataset, upsample the sampling rate to 16kHz:
|
||||
|
||||
```py
|
||||
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
|
||||
>>> dataset[0]["audio"]
|
||||
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
|
||||
0. , 0. ], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
|
||||
'sampling_rate': 8000}
|
||||
```
|
||||
|
||||
1. Use 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to upsample the sampling rate to 16kHz:
|
||||
|
||||
```py
|
||||
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
|
||||
```
|
||||
|
||||
2. Load the audio file:
|
||||
|
||||
```py
|
||||
>>> dataset[0]["audio"]
|
||||
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
|
||||
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
|
||||
'sampling_rate': 16000}
|
||||
```
|
||||
|
||||
As you can see, the `sampling_rate` is now 16kHz!
|
||||
|
||||
### Feature extractor
|
||||
|
||||
The next step is to load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data, and the audio feature extractor will add a `0` - interpreted as silence - to `array`.
|
||||
|
||||
Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
|
||||
```
|
||||
|
||||
Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur.
|
||||
|
||||
```py
|
||||
>>> audio_input = [dataset[0]["audio"]["array"]]
|
||||
>>> feature_extractor(audio_input, sampling_rate=16000)
|
||||
{'input_values': [array([ 3.8106556e-04, 2.7506407e-03, 2.8015103e-03, ...,
|
||||
5.6335266e-04, 4.6588284e-06, -1.7142107e-04], dtype=float32)]}
|
||||
```
|
||||
|
||||
### Pad and truncate
|
||||
|
||||
Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples:
|
||||
|
||||
```py
|
||||
>>> dataset[0]["audio"]["array"].shape
|
||||
(173398,)
|
||||
|
||||
>>> dataset[1]["audio"]["array"].shape
|
||||
(106496,)
|
||||
```
|
||||
|
||||
As you can see, the first sample has a longer sequence than the second sample. Let's create a function that will preprocess the dataset. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:
|
||||
|
||||
```py
|
||||
>>> def preprocess_function(examples):
|
||||
... audio_arrays = [x["array"] for x in examples["audio"]]
|
||||
... inputs = feature_extractor(
|
||||
... audio_arrays,
|
||||
... sampling_rate=16000,
|
||||
... padding=True,
|
||||
... max_length=100000,
|
||||
... truncation=True,
|
||||
... )
|
||||
... return inputs
|
||||
```
|
||||
|
||||
Apply the function to the the first few examples in the dataset:
|
||||
|
||||
```py
|
||||
>>> processed_dataset = preprocess_function(dataset[:5])
|
||||
```
|
||||
|
||||
Now take another look at the processed sample lengths:
|
||||
|
||||
```py
|
||||
>>> processed_dataset["input_values"][0].shape
|
||||
(100000,)
|
||||
|
||||
>>> processed_dataset["input_values"][1].shape
|
||||
(100000,)
|
||||
```
|
||||
|
||||
The lengths of the first two samples now match the maximum length you specified.
|
||||
|
||||
## Vision
|
||||
|
||||
A feature extractor is also used to process images for vision tasks. Once again, the goal is to convert the raw image into a batch of tensors as input.
|
||||
|
||||
Let's load the [food101](https://huggingface.co/datasets/food101) dataset for this tutorial. Use 🤗 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> dataset = load_dataset("food101", split="train[:100]")
|
||||
```
|
||||
|
||||
Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:
|
||||
|
||||
```py
|
||||
>>> dataset[0]["image"]
|
||||
```
|
||||
|
||||

|
||||
|
||||
### Feature extractor
|
||||
|
||||
Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
|
||||
```
|
||||
|
||||
### Data augmentation
|
||||
|
||||
For vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you will use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module.
|
||||
|
||||
1. Normalize the image and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
|
||||
|
||||
```py
|
||||
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
|
||||
|
||||
>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
|
||||
>>> _transforms = Compose(
|
||||
... [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
|
||||
... )
|
||||
```
|
||||
|
||||
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as it's input. This value is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:
|
||||
|
||||
```py
|
||||
>>> def transforms(examples):
|
||||
... examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
|
||||
... return examples
|
||||
```
|
||||
|
||||
3. Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on-the-fly:
|
||||
|
||||
```py
|
||||
>>> dataset.set_transform(transforms)
|
||||
```
|
||||
|
||||
4. Now when you access the image, you will notice the feature extractor has added the model input `pixel_values`:
|
||||
|
||||
```py
|
||||
>>> dataset[0]["image"]
|
||||
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F1A7B0630D0>,
|
||||
'label': 6,
|
||||
'pixel_values': tensor([[[ 0.0353, 0.0745, 0.1216, ..., -0.9922, -0.9922, -0.9922],
|
||||
[-0.0196, 0.0667, 0.1294, ..., -0.9765, -0.9843, -0.9922],
|
||||
[ 0.0196, 0.0824, 0.1137, ..., -0.9765, -0.9686, -0.8667],
|
||||
...,
|
||||
[ 0.0275, 0.0745, 0.0510, ..., -0.1137, -0.1216, -0.0824],
|
||||
[ 0.0667, 0.0824, 0.0667, ..., -0.0588, -0.0745, -0.0980],
|
||||
[ 0.0353, 0.0353, 0.0431, ..., -0.0039, -0.0039, -0.0588]],
|
||||
|
||||
[[ 0.2078, 0.2471, 0.2863, ..., -0.9451, -0.9373, -0.9451],
|
||||
[ 0.1608, 0.2471, 0.3098, ..., -0.9373, -0.9451, -0.9373],
|
||||
[ 0.2078, 0.2706, 0.3020, ..., -0.9608, -0.9373, -0.8275],
|
||||
...,
|
||||
[-0.0353, 0.0118, -0.0039, ..., -0.2392, -0.2471, -0.2078],
|
||||
[ 0.0196, 0.0353, 0.0196, ..., -0.1843, -0.2000, -0.2235],
|
||||
[-0.0118, -0.0039, -0.0039, ..., -0.0980, -0.0980, -0.1529]],
|
||||
|
||||
[[ 0.3961, 0.4431, 0.4980, ..., -0.9216, -0.9137, -0.9216],
|
||||
[ 0.3569, 0.4510, 0.5216, ..., -0.9059, -0.9137, -0.9137],
|
||||
[ 0.4118, 0.4745, 0.5216, ..., -0.9137, -0.8902, -0.7804],
|
||||
...,
|
||||
[-0.2314, -0.1922, -0.2078, ..., -0.4196, -0.4275, -0.3882],
|
||||
[-0.1843, -0.1686, -0.2000, ..., -0.3647, -0.3804, -0.4039],
|
||||
[-0.1922, -0.1922, -0.1922, ..., -0.2941, -0.2863, -0.3412]]])}
|
||||
```
|
||||
|
||||
Here is what the image looks like after you preprocess it. Just as you'd expect from the applied transforms, the image has been randomly cropped and it's color properties are different.
|
||||
|
||||
```py
|
||||
>>> import numpy as np
|
||||
>>> import matplotlib.pyplot as plt
|
||||
|
||||
>>> img = dataset[0]["pixel_values"]
|
||||
>>> plt.imshow(img.permute(1, 2, 0))
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Multimodal
|
||||
|
||||
For multimodal tasks. you will use a combination of everything you've learned so far and apply your skills to a automatic speech recognition (ASR) task. This means you will need a:
|
||||
|
||||
* Feature extractor to preprocess the audio data.
|
||||
* Tokenizer to process the text.
|
||||
|
||||
Let's return to the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> lj_speech = load_dataset("lj_speech", split="train")
|
||||
```
|
||||
|
||||
Since you are mainly interested in the `audio` and `text` column, remove the other columns:
|
||||
|
||||
```py
|
||||
>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
|
||||
```
|
||||
|
||||
Now take a look at the `audio` and `text` columns:
|
||||
|
||||
```py
|
||||
>>> lj_speech[0]["audio"]
|
||||
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
|
||||
7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
|
||||
'sampling_rate': 22050}
|
||||
|
||||
>>> lj_speech[0]["text"]
|
||||
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
|
||||
```
|
||||
|
||||
Remember from the earlier section on processing audio data, you should always [resample](preprocessing#audio) your audio data's sampling rate to match the sampling rate of the dataset used to pretrain a model:
|
||||
|
||||
```py
|
||||
>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
|
||||
```
|
||||
|
||||
### Processor
|
||||
|
||||
A processor combines a feature extractor and tokenizer. Load a processor with [`AutoProcessor.from_pretrained]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoProcessor
|
||||
|
||||
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
|
||||
```
|
||||
|
||||
1. Create a function to process the audio data to `input_values`, and tokenizes the text to `labels`. These are your inputs to the model:
|
||||
|
||||
```py
|
||||
>>> def prepare_dataset(example):
|
||||
... audio = example["audio"]
|
||||
|
||||
... example["input_values"] = processor(audio["array"], sampling_rate=16000)
|
||||
|
||||
... with processor.as_target_processor():
|
||||
... example["labels"] = processor(example["text"]).input_ids
|
||||
... return example
|
||||
```
|
||||
|
||||
2. Apply the `prepare_dataset` function to a sample:
|
||||
|
||||
```py
|
||||
>>> prepare_dataset(lj_speech[0])
|
||||
```
|
||||
|
||||
Notice the processor has added `input_values` and `labels`. The sampling rate has also been correctly downsampled to 16kHz.
|
||||
|
||||
Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.
|
Loading…
Reference in New Issue
Block a user