*Supported frameworks: PyTorch, TensorFlow, Flax. Supported attention backends: FlashAttention, SDPA.*
# Whisper

[Whisper](https://hf.co/papers/2212.04356) is an encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoder's learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.

You can find all the original Whisper checkpoints under the [Whisper](https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013) collection.

> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementations other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

> [!TIP]
> Click on the Whisper models in the right sidebar for more examples of how to apply Whisper to different audio tasks.

The examples below demonstrate how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device=0
)
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
```

```py
# pip install datasets
import torch
from datasets import load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained(
    "openai/whisper-large-v3-turbo",
)
# device_map="auto" already places the model, so a trailing .to("cuda") is not needed
# and would error on a dispatched model
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = ds[0]["audio"]

input_features = processor(
    audio_sample["array"],
    sampling_rate=audio_sample["sampling_rate"],
    return_tensors="pt"
).input_features
input_features = input_features.to("cuda", dtype=torch.float16)

predicted_ids = model.generate(input_features, cache_implementation="static")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
transcription[0]
```

## Notes

- Whisper relies on a custom [`generate`] for inference; make sure to check the docs below. A minimal sketch of its task-specific arguments follows this list.
- The [`WhisperProcessor`] can be used for preparing audio and decoding predicted ids back into text.
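The snippet below is a minimal sketch of how Whisper's custom [`generate`] differs from the standard one: `language`, `task`, and `return_timestamps` are documented parameters of [`WhisperForConditionalGeneration.generate`] that control the decoder prompt. It assumes a CPU run in the default dtype for simplicity.

```py
from datasets import load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3-turbo")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Force the decoder to transcribe English audio (instead of auto-detecting the
# language) and to emit segment-level timestamp tokens.
predicted_ids = model.generate(
    input_features,
    language="english",
    task="transcribe",
    return_timestamps=True,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

Setting `task="translate"` instead makes the model translate non-English audio into English text; omitting `language` lets Whisper detect it from the first 30 seconds of audio.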
## WhisperConfig

[[autodoc]] WhisperConfig

## WhisperTokenizer

[[autodoc]] WhisperTokenizer
    - set_prefix_tokens
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
    - batch_decode
    - decode
    - basic_normalize
    - normalize

## WhisperTokenizerFast

[[autodoc]] WhisperTokenizerFast
    - set_prefix_tokens
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
    - batch_decode
    - decode
    - basic_normalize
    - normalize

## WhisperFeatureExtractor

[[autodoc]] WhisperFeatureExtractor
    - __call__

## WhisperProcessor

[[autodoc]] WhisperProcessor
    - __call__
    - from_pretrained
    - save_pretrained
    - batch_decode
    - decode

## WhisperModel

[[autodoc]] WhisperModel
    - forward
    - _mask_input_features

## WhisperForConditionalGeneration

[[autodoc]] WhisperForConditionalGeneration
    - forward
    - generate

## WhisperForCausalLM

[[autodoc]] WhisperForCausalLM
    - forward

## WhisperForAudioClassification

[[autodoc]] WhisperForAudioClassification
    - forward

## TFWhisperModel

[[autodoc]] TFWhisperModel
    - call

## TFWhisperForConditionalGeneration

[[autodoc]] TFWhisperForConditionalGeneration
    - call

## FlaxWhisperModel

[[autodoc]] FlaxWhisperModel
    - __call__

## FlaxWhisperForConditionalGeneration

[[autodoc]] FlaxWhisperForConditionalGeneration
    - __call__

## FlaxWhisperForAudioClassification

[[autodoc]] FlaxWhisperForAudioClassification
    - __call__