mirror of
https://github.com/huggingface/transformers.git
synced 2025-07-03 21:00:08 +06:00
191 lines
8.5 KiB
Markdown
191 lines
8.5 KiB
Markdown
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations under the License.
|
|
|
|
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
|
rendered properly in your Markdown viewer.
|
|
|
|
-->
|
|
|
|
|
|
<div style="float: right;">
|
|
<div class="flex flex-wrap space-x-1">
|
|
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
|
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
|
|
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=
|
|
">
|
|
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
|
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
|
</div>
|
|
</div>
|
|
|
|
# Whisper
|
|
|
|
[Whisper](https://hf.co/papers/2212.04356) is a encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoders learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.
|
|
|
|
You can find all the original Whisper checkpoints under the [Whisper](https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013) collection.
|
|
|
|
> [!NOTE]
|
|
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
|
|
|
|
> [!TIP]
|
|
> Click on the Whisper models in the right sidebar for more examples of how to apply Whisper to different audio tasks.
|
|
|
|
The example below demonstrates how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.
|
|
|
|
<hfoptions id="usage">
|
|
<hfoption id="Pipeline">
|
|
|
|
```py
|
|
import torch
|
|
from transformers import pipeline
|
|
|
|
pipeline = pipeline(
|
|
task="automatic-speech-recognition",
|
|
model="openai/whisper-large-v3-turbo",
|
|
torch_dtype=torch.float16,
|
|
device=0
|
|
)
|
|
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
|
|
```
|
|
|
|
</hfoption>
|
|
<hfoption id="AutoModel">
|
|
|
|
```py
|
|
# pip install datasets
|
|
import torch
|
|
from datasets import load_dataset
|
|
from transformers import AutoProcessor, WhisperForConditionalGeneration
|
|
|
|
processor = AutoProcessor.from_pretrained(
|
|
"openai/whisper-large-v3-turbo",
|
|
)
|
|
model = WhisperForConditionalGeneration.from_pretrained(
|
|
"openai/whisper-large-v3-turbo",
|
|
torch_dtype=torch.float16,
|
|
device_map="auto",
|
|
attn_implementation="sdpa"
|
|
).to("cuda")
|
|
|
|
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
|
audio_sample = ds[0]["audio"]
|
|
|
|
input_features = processor(
|
|
audio_sample["array"],
|
|
sampling_rate=audio_sample["sampling_rate"],
|
|
return_tensors="pt"
|
|
).input_features
|
|
input_features = input_features.to("cuda", dtype=torch.float16)
|
|
|
|
predicted_ids = model.generate(input_features, cache_implementation="static")
|
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
|
|
transcription[0]
|
|
```
|
|
|
|
</hfoption>
|
|
</hfoptions>
|
|
|
|
## Notes
|
|
|
|
- Whisper relies on [`~GenerationMixin.generate`] for inference.
|
|
- The [`WhisperProcessor`] can be used for preparing audio and decoding predicted ids back into text.
|
|
|
|
## WhisperConfig
|
|
|
|
[[autodoc]] WhisperConfig
|
|
|
|
## WhisperTokenizer
|
|
|
|
[[autodoc]] WhisperTokenizer
|
|
- set_prefix_tokens
|
|
- build_inputs_with_special_tokens
|
|
- get_special_tokens_mask
|
|
- create_token_type_ids_from_sequences
|
|
- save_vocabulary
|
|
- batch_decode
|
|
- decode
|
|
- basic_normalize
|
|
- normalize
|
|
|
|
## WhisperTokenizerFast
|
|
|
|
[[autodoc]] WhisperTokenizerFast
|
|
- set_prefix_tokens
|
|
- build_inputs_with_special_tokens
|
|
- get_special_tokens_mask
|
|
- create_token_type_ids_from_sequences
|
|
- save_vocabulary
|
|
- batch_decode
|
|
- decode
|
|
- basic_normalize
|
|
- normalize
|
|
|
|
## WhisperFeatureExtractor
|
|
|
|
[[autodoc]] WhisperFeatureExtractor
|
|
- __call__
|
|
|
|
## WhisperProcessor
|
|
|
|
[[autodoc]] WhisperProcessor
|
|
- __call__
|
|
- from_pretrained
|
|
- save_pretrained
|
|
- batch_decode
|
|
- decode
|
|
|
|
## WhisperModel
|
|
|
|
[[autodoc]] WhisperModel
|
|
- forward
|
|
- _mask_input_features
|
|
|
|
## WhisperForConditionalGeneration
|
|
|
|
[[autodoc]] WhisperForConditionalGeneration
|
|
- forward
|
|
- generate
|
|
|
|
## WhisperForCausalLM
|
|
|
|
[[autodoc]] WhisperForCausalLM
|
|
- forward
|
|
|
|
## WhisperForAudioClassification
|
|
|
|
[[autodoc]] WhisperForAudioClassification
|
|
- forward
|
|
|
|
## TFWhisperModel
|
|
|
|
[[autodoc]] TFWhisperModel
|
|
- call
|
|
|
|
## TFWhisperForConditionalGeneration
|
|
|
|
[[autodoc]] TFWhisperForConditionalGeneration
|
|
- call
|
|
|
|
## FlaxWhisperModel
|
|
|
|
[[autodoc]] FlaxWhisperModel
|
|
- __call__
|
|
|
|
## FlaxWhisperForConditionalGeneration
|
|
|
|
[[autodoc]] FlaxWhisperForConditionalGeneration
|
|
- __call__
|
|
|
|
## FlaxWhisperForAudioClassification
|
|
|
|
[[autodoc]] FlaxWhisperForAudioClassification
|
|
- __call__
|