# Processors

Multimodal models require a preprocessor capable of handling inputs that combine more than one modality. Depending on the input modalities, a processor converts text into a tensor of input ids, images into pixel values, and audio into a tensor with the correct sampling rate. For example, [PaliGemma](./model_doc/paligemma) is a vision-language model that uses the [SigLIP](./model_doc/siglip) image processor and the [Llama](./model_doc/llama) tokenizer. A [`ProcessorMixin`] class wraps both of these preprocessor types, providing a single, unified processor class for a multimodal model.

Call [`~ProcessorMixin.from_pretrained`] to load a processor. Pass the text and image inputs to the processor to generate the expected model inputs, input ids and pixel values.

```py
from transformers import AutoProcessor
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")

prompt = "answer en Where is the cat standing?"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs
```

This guide describes the processor class and how to preprocess multimodal inputs.

## Processor classes

All processors inherit from the [`ProcessorMixin`] class, which provides methods like [`~ProcessorMixin.from_pretrained`], [`~ProcessorMixin.save_pretrained`], and [`~ProcessorMixin.push_to_hub`] for loading and saving processors, and for sharing them to the Hub.

There are two ways to load a processor: with [`AutoProcessor`] or with a model-specific processor class.

The [AutoClass](./model_doc/auto) API provides a simple interface to load processors without specifying the model class they belong to. Use [`~AutoProcessor.from_pretrained`] to load a processor.

```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-pt-224")
```

Each pretrained multimodal model also has a dedicated processor class. Load it directly with that class's [`~ProcessorMixin.from_pretrained`].

```py
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
```

You can also load the two preprocessor types separately, [`WhisperTokenizerFast`] and [`WhisperFeatureExtractor`], and compose them into a processor.

```py
from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, WhisperProcessor

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```
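Because the composed processor inherits from [`ProcessorMixin`], it supports the same save and load workflow as any other processor. A minimal sketch (the local directory name here is just an illustration):

```py
# Save the composed processor to a local directory (hypothetical path)
processor.save_pretrained("./whisper-tiny-processor")

# Reload it later, either with the model-specific class or with AutoProcessor
processor = WhisperProcessor.from_pretrained("./whisper-tiny-processor")
```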
## Preprocess

Processors preprocess multimodal inputs into the expected Transformers format. There are several combinations of input modalities that a processor can handle, such as text and audio or text and image.

Automatic speech recognition (ASR) tasks require a processor that can handle text and audio inputs. Load a dataset and take a look at the `audio` and `text` columns (you can remove the other columns, which aren't needed).

```py
from datasets import load_dataset

dataset = load_dataset("lj_speech", split="train")
dataset = dataset.remove_columns(["file", "id", "normalized_text"])

dataset[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
        7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
 'sampling_rate': 22050}

dataset[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
```

Remember to resample the audio to match the pretrained model's required sampling rate. LJ Speech is recorded at 22050Hz, while Whisper expects 16000Hz.

```py
from datasets import Audio

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```

Load a processor and pass the audio `array` and `text` columns to it.

```py
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

def prepare_dataset(example):
    audio = example["audio"]
    example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000))
    return example
```

Apply the `prepare_dataset` function to preprocess an example from the dataset. The processor returns `input_features` for the `audio` column and `labels` for the `text` column.

```py
prepare_dataset(dataset[0])
```
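To preprocess the whole dataset rather than a single example, you could map `prepare_dataset` over every row with [`~datasets.Dataset.map`]. A minimal sketch, assuming you no longer need the raw `audio` and `text` columns after preprocessing:

```py
# Apply prepare_dataset to every example and drop the raw columns,
# keeping only the model-ready input_features and labels
dataset = dataset.map(prepare_dataset, remove_columns=["audio", "text"])
```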