
# Processors
Multimodal models require a preprocessor capable of handling inputs that combine more than one modality. Depending on the input modality, a processor converts text into an array of tensors, images into pixel values, and audio into tensors with the correct sampling rate.
For example, PaliGemma is a vision-language model that uses the SigLIP image processor and the Llama tokenizer. A [`ProcessorMixin`] class wraps both of these preprocessor types, providing a single, unified processor class for a multimodal model.
Call [`~ProcessorMixin.from_pretrained`] to load a processor. Pass the text and image inputs to the processor to generate the expected model inputs, `input_ids` and `pixel_values`.
```py
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")
prompt = "answer en Where is the cow standing?"
url = "https://huggingface.co/gv-hf/PaliGemma-test-224px-hf/resolve/main/cow_beach_1.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs
```
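Since the model class is already imported above, here is a short sketch of feeding these inputs to the model (not part of the original example, and it assumes enough memory to load the checkpoint):

```py
model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")
generated_ids = model.generate(**inputs, max_new_tokens=20)
# Strip the prompt tokens so only the generated answer is decoded.
answer_ids = generated_ids[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True))
```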
This guide describes the processor class and how to preprocess multimodal inputs.
## Processor classes
All processors inherit from the [`ProcessorMixin`] class, which provides methods like [`~ProcessorMixin.from_pretrained`], [`~ProcessorMixin.save_pretrained`], and [`~ProcessorMixin.push_to_hub`] for loading and saving processors, and for sharing them on the Hub.
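The same mixin also makes it straightforward to persist and share a processor. A minimal sketch, where the local directory and repository id are placeholder names:

```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
# Save the processor files (tokenizer and feature extractor configs) to a local directory.
processor.save_pretrained("./my-whisper-processor")
# Optionally upload it to the Hub; "my-username/my-whisper-processor" is a placeholder repo id.
# processor.push_to_hub("my-username/my-whisper-processor")
```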
There are two ways to load a processor: with the [`AutoProcessor`] class or with a model-specific processor class.
The AutoClass API provides a simple interface to load a processor without directly specifying the model class it belongs to. Use [`~AutoProcessor.from_pretrained`] to load a processor.
```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")
```
Processors are also associated with a specific pretrained multimodal model class. You can load a processor directly from the model-specific processor class with [`~ProcessorMixin.from_pretrained`].
```py
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
```
You could also load the two preprocessor types separately, [`WhisperTokenizerFast`] and [`WhisperFeatureExtractor`], and compose them into a processor.
```py
from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, WhisperProcessor

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```
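The composed processor behaves like one loaded in a single call. As a quick check, the sketch below runs a synthetic one-second clip of silence through it; the `speech` array is a stand-in for real 16kHz audio:

```py
import numpy as np

# One second of silence at 16kHz, standing in for a real recording.
speech = np.zeros(16000, dtype=np.float32)
inputs = processor(audio=speech, sampling_rate=16000, return_tensors="pt")
print(inputs.input_features.shape)  # Whisper log-mel spectrogram features
```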
## Preprocess
Processors preprocess multimodal inputs into the expected Transformers format. A processor can handle several combinations of input modalities, such as text and audio or text and image.
Automatic speech recognition (ASR) tasks require a processor that can handle text and audio inputs. Load a dataset and take a look at the `audio` and `text` columns (you can remove the other columns, which aren't needed).
```py
from datasets import load_dataset

dataset = load_dataset("lj_speech", split="train")
dataset = dataset.map(remove_columns=["file", "id", "normalized_text"])

dataset[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

dataset[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```
Remember to resample the audio to match the pretrained model's required sampling rate; Whisper expects 16kHz audio.
```py
from datasets import Audio

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```
Load a processor and pass the audio `array` and the `text` column to it.
```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

def prepare_dataset(example):
    audio = example["audio"]
    example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
    return example
```
Apply the `prepare_dataset` function to preprocess the dataset. The processor returns `input_features` for the `audio` column and `labels` for the `text` column.
```py
prepare_dataset(dataset[0])
```
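To preprocess the whole dataset rather than a single example, the same function can be applied with [`~datasets.Dataset.map`]. A minimal sketch; `remove_columns` here drops the raw columns the model no longer needs:

```py
# Apply prepare_dataset to every example and drop the raw columns afterwards.
dataset = dataset.map(prepare_dataset, remove_columns=["audio", "text"])
```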