
PyTorch TensorFlow Flax FlashAttention SDPA

Whisper

Whisper is an encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoder's learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.

You can find all the original Whisper checkpoints under the Whisper collection.

Note

The head_mask argument is ignored when using any attention implementation other than "eager". If you have a head_mask and want it to have an effect, load the model with XXXModel.from_pretrained(model_id, attn_implementation="eager").
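A minimal sketch of what this looks like for Whisper, using random input features and an all-ones mask purely as placeholders (one mask value per encoder layer and attention head; 1.0 keeps a head, 0.0 masks it):

import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3-turbo",
    attn_implementation="eager"  # required for head_mask to take effect
)

# shape (encoder_layers, encoder_attention_heads); all ones keeps every head
head_mask = torch.ones(model.config.encoder_layers, model.config.encoder_attention_heads)

# placeholder inputs: a 30-second log-mel spectrogram and the decoder start token
input_features = torch.randn(1, model.config.num_mel_bins, 3000)
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

outputs = model(input_features, decoder_input_ids=decoder_input_ids, head_mask=head_mask)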

Tip

Click on the Whisper models in the right sidebar for more examples of how to apply Whisper to different audio tasks.

The example below demonstrates how to automatically transcribe speech into text with [Pipeline] or the [AutoModel] class.

import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device=0
)
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
# pip install datasets
import torch
from datasets import load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained(
    "openai/whisper-large-v3-turbo",
)
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = ds[0]["audio"]

input_features = processor(
    audio_sample["array"],
    sampling_rate=audio_sample["sampling_rate"],
    return_tensors="pt"
).input_features
input_features = input_features.to("cuda", dtype=torch.float16)

predicted_ids = model.generate(input_features, cache_implementation="static")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
transcription[0]

Notes

  • Whisper relies on a custom [generate] for inference, so make sure to check the docs below (see the sketch after this list).
  • The [WhisperProcessor] can be used for preparing audio and decoding the predicted ids back into text.
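For instance, building on the AutoModel example above (reusing model, processor, and input_features), a minimal sketch of Whisper's custom [generate]: the language, task, and return_timestamps arguments are handled by Whisper-specific generation logic, and [WhisperProcessor] decodes the predicted ids (including timestamp tokens) back into text. The argument values below are illustrative.

# reuses `model`, `processor`, and `input_features` from the example above
predicted_ids = model.generate(
    input_features,
    language="english",      # language of the output text; use task="translate" to translate into English
    task="transcribe",
    return_timestamps=True,  # predict segment-level timestamp tokens
)
transcription = processor.batch_decode(
    predicted_ids, skip_special_tokens=True, decode_with_timestamps=True
)
transcription[0]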

WhisperConfig

autodoc WhisperConfig

WhisperTokenizer

autodoc WhisperTokenizer - set_prefix_tokens - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary - batch_decode - decode - basic_normalize - normalize

WhisperTokenizerFast

autodoc WhisperTokenizerFast - set_prefix_tokens - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary - batch_decode - decode - basic_normalize - normalize

WhisperFeatureExtractor

autodoc WhisperFeatureExtractor - call

WhisperProcessor

autodoc WhisperProcessor - call - from_pretrained - save_pretrained - batch_decode - decode

WhisperModel

autodoc WhisperModel - forward - _mask_input_features

WhisperForConditionalGeneration

autodoc WhisperForConditionalGeneration - forward - generate

WhisperForCausalLM

autodoc WhisperForCausalLM - forward

WhisperForAudioClassification

autodoc WhisperForAudioClassification - forward

TFWhisperModel

autodoc TFWhisperModel - call

TFWhisperForConditionalGeneration

autodoc TFWhisperForConditionalGeneration - call

FlaxWhisperModel

autodoc FlaxWhisperModel - call

FlaxWhisperForConditionalGeneration

autodoc FlaxWhisperForConditionalGeneration - call

FlaxWhisperForAudioClassification

autodoc FlaxWhisperForAudioClassification - call