<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Quick tour

[[open-in-colab]]

Get up and running with 🤗 Transformers! Start using the [`pipeline`] for rapid inference, and quickly load a pretrained model and tokenizer with an [AutoClass](./model_doc/auto) to solve your text, vision or audio task.

<Tip>

All code examples presented in the documentation have a toggle on the top left for PyTorch and TensorFlow. If
not, the code is expected to work for both backends without any change.

</Tip>
## Pipeline
[`pipeline`] is the easiest way to use a pretrained model for a given task.

<Youtube id="tiZFewofSLM"/>

The [`pipeline`] supports many common tasks out-of-the-box:

**Text**:
* Sentiment analysis: classify the polarity of a given text.
* Text generation (in English): generate text from a given input.
* Named entity recognition (NER): label each word with the entity it represents (person, date, location, etc.).
* Question answering: extract the answer from the context, given some context and a question.
* Fill-mask: fill in the blank given a text with masked words.
* Summarization: generate a summary of a long sequence of text or document.
* Translation: translate text into another language.
* Feature extraction: create a tensor representation of the text.

**Image**:
* Image classification: classify an image.
* Image segmentation: classify every pixel in an image.
* Object detection: detect objects within an image.

**Audio**:
* Audio classification: assign a label to a given segment of audio.
* Automatic speech recognition (ASR): transcribe audio data into text.

<Tip>

For more details about the [`pipeline`] and associated tasks, refer to the documentation [here](./main_classes/pipelines).

</Tip>
### Pipeline usage
In the following example, you will use the [`pipeline`] for sentiment analysis.

Install the following dependencies if you haven't already:

```bash
pip install torch
===PT-TF-SPLIT===
pip install tensorflow
```

Import [`pipeline`] and specify the task you want to complete:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("sentiment-analysis")
```

The pipeline downloads and caches a default [pretrained model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) and tokenizer for sentiment analysis. Now you can use the `classifier` on your target text:

```py
>>> classifier("We are very happy to show you the 🤗 Transformers library.")
[{'label': 'POSITIVE', 'score': 0.9998}]
```

For more than one sentence, pass a list of sentences to the [`pipeline`] which returns a list of dictionaries:

```py
>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
>>> for result in results:
...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
```

The [`pipeline`] can also iterate over an entire dataset. Start by installing the [🤗 Datasets](https://huggingface.co/docs/datasets/) library:

```bash
pip install datasets
```

Create a [`pipeline`] with the task you want to solve for and the model you want to use. Set the `device` parameter to `0` to place the tensors on a CUDA device:

```py
>>> from transformers import pipeline

>>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
```

Next, load a dataset (see the 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) for more details) you'd like to iterate over. For example, let's load the [SUPERB](https://huggingface.co/datasets/superb) dataset:

```py
>>> import datasets

>>> dataset = datasets.load_dataset("superb", name="asr", split="test")  # doctest: +IGNORE_RESULT
```

You can pass a whole dataset to the pipeline:

```py
>>> files = dataset["file"]
>>> speech_recognizer(files[:4])
[{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'},
 {'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'},
 {'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'},
 {'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}]
```

For a larger dataset where the inputs are big (like in speech or vision), you will want to pass a generator instead of a list so that all the inputs are not loaded into memory at once. See the [pipeline documentation](./main_classes/pipelines) for more information.
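As a minimal sketch of that pattern (not part of the original example; the `data` helper below is a hypothetical name), you could wrap the file paths in a generator and let the pipeline consume them lazily:

```py
>>> def data():
...     # Hypothetical helper: yield one file path at a time instead of
...     # materializing the whole list in memory.
...     for path in dataset["file"]:
...         yield path

>>> for prediction in speech_recognizer(data()):  # doctest: +SKIP
...     print(prediction["text"])
```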
### Use another model and tokenizer in the pipeline
The [`pipeline`] can accommodate any model from the [Model Hub](https://huggingface.co/models), making it easy to adapt the [`pipeline`] for other use-cases. For example, if you'd like a model capable of handling French text, use the tags on the Model Hub to filter for an appropriate model. The top filtered result returns a multilingual [BERT model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) fine-tuned for sentiment analysis. Great, let's use this model!

```py
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
```

Use the [`AutoModelForSequenceClassification`] and [`AutoTokenizer`] to load the pretrained model and its associated tokenizer (more on an `AutoClass` below):

```py
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Then you can specify the model and tokenizer in the [`pipeline`], and apply the `classifier` on your target text:

```py
>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
[{'label': '5 stars', 'score': 0.7273}]
```

If you can't find a model for your use-case, you will need to fine-tune a pretrained model on your data. Take a look at our [fine-tuning tutorial](./training) to learn how. Finally, after you've fine-tuned your pretrained model, please consider sharing it (see tutorial [here](./model_sharing)) with the community on the Model Hub to democratize NLP for everyone! 🤗
## AutoClass
<Youtube id="AhChOFRegn4"/>

Under the hood, the [`AutoModelForSequenceClassification`] and [`AutoTokenizer`] classes work together to power the [`pipeline`]. An [AutoClass](./model_doc/auto) is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate `AutoClass` for your task and its associated tokenizer with [`AutoTokenizer`].

Let's return to our example and see how you can use the `AutoClass` to replicate the results of the [`pipeline`].
### AutoTokenizer
A tokenizer is responsible for preprocessing text into a format that is understandable to the model. First, the tokenizer will split the text into words called *tokens*. There are multiple rules that govern the tokenization process, including how to split a word and at what level (learn more about tokenization [here](./tokenizer_summary)). The most important thing to remember though is you need to instantiate the tokenizer with the same model name to ensure you're using the same tokenization rules a model was pretrained with.

Load a tokenizer with [`AutoTokenizer`]:

```py
>>> from transformers import AutoTokenizer

>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Next, the tokenizer converts the tokens into numbers in order to construct a tensor as input to the model. This mapping is known as the model's *vocabulary*.

Pass your text to the tokenizer:

```py
>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
>>> print(encoding)
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The tokenizer will return a dictionary containing:

* [input_ids](./glossary#input-ids): numerical representations of your tokens.
* [attention_mask](./glossary#attention-mask): indicates which tokens should be attended to.
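If you want to see what those ids correspond to, you can map them back to a readable string with `tokenizer.decode` (a small illustrative sketch using the `encoding` from above):

```py
>>> # Map the ids back to text; special tokens like [CLS] and [SEP] added by the
>>> # tokenizer will appear in the decoded string.
>>> tokenizer.decode(encoding["input_ids"])  # doctest: +SKIP
```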
Just like the [`pipeline`], the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:
```py
>>> pt_batch = tokenizer(
...     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
...     padding=True,
...     truncation=True,
...     max_length=512,
...     return_tensors="pt",
... )
>>> # ===PT-TF-SPLIT===
>>> tf_batch = tokenizer(
...     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
...     padding=True,
...     truncation=True,
...     max_length=512,
...     return_tensors="tf",
... )
```

Read the [preprocessing](./preprocessing) tutorial for more details about tokenization.
### AutoModel
🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load an [`AutoModel`] like you would load an [`AutoTokenizer`]. The only difference is selecting the correct [`AutoModel`] for the task. Since you are doing text (or sequence) classification, load [`AutoModelForSequenceClassification`]. The TensorFlow equivalent is simply [`TFAutoModelForSequenceClassification`]:

```py
>>> from transformers import AutoModelForSequenceClassification

>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSequenceClassification

>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
```

<Tip>

See the [task summary](./task_summary) for which [`AutoModel`] class to use for which task.

</Tip>

Now you can pass your preprocessed batch of inputs directly to the model. If you are using a PyTorch model, unpack the dictionary by adding `**`. For TensorFlow models, pass the dictionary directly to the model:

```py
>>> pt_outputs = pt_model(**pt_batch)
>>> # ===PT-TF-SPLIT===
>>> tf_outputs = tf_model(tf_batch)
```

The model outputs the final activations in the `logits` attribute. Apply the softmax function to the `logits` to retrieve the probabilities:

```py
>>> from torch import nn

>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
>>> print(pt_predictions)
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)

>>> # ===PT-TF-SPLIT===
>>> import tensorflow as tf

>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
>>> print(tf_predictions)
tf.Tensor(
[[0.00206 0.00177 0.01155 0.21209 0.77253]
 [0.20842 0.18262 0.19693 0.1755  0.23652]], shape=(2, 5), dtype=float32)
```

<Tip>

All 🤗 Transformers models (PyTorch or TensorFlow) output the tensors *before* the final activation
function (like softmax) because the final activation function is often fused with the loss.

</Tip>
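As an illustration of why that is convenient (a sketch using hypothetical variable names and arbitrary label values for the 5-class model loaded above), you can pass `labels` and the model will compute the loss from the raw logits internally:

```py
>>> import torch

>>> # Arbitrary placeholder labels for the two sentences in the batch.
>>> pt_outputs_with_loss = pt_model(**pt_batch, labels=torch.tensor([4, 0]))
>>> pt_outputs_with_loss.loss  # doctest: +SKIP
>>> # ===PT-TF-SPLIT===
>>> import tensorflow as tf

>>> tf_outputs_with_loss = tf_model(tf_batch, labels=tf.constant([4, 0]))
>>> tf_outputs_with_loss.loss  # doctest: +SKIP
```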
Models are a standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or a [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) so you can use them in your usual training loop. However, to make things easier, 🤗 Transformers provides a [`Trainer`] class for PyTorch that adds functionality for distributed training, mixed precision, and more. For TensorFlow, you can use the `fit` method from [Keras](https://keras.io/). Refer to the [training tutorial](./training) for more details.
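To give a rough idea of the shape of that API, here is a minimal, hypothetical sketch for PyTorch; the `train_dataset` placeholder is not defined in this tour, so see the [training tutorial](./training) for a complete example:

```py
>>> from transformers import Trainer, TrainingArguments

>>> training_args = TrainingArguments(output_dir="my_finetuned_model")  # hypothetical output directory
>>> trainer = Trainer(
...     model=pt_model,
...     args=training_args,
...     train_dataset=train_dataset,  # hypothetical tokenized training dataset
... )  # doctest: +SKIP
>>> trainer.train()  # doctest: +SKIP
```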
<Tip>

🤗 Transformers model outputs are special dataclasses so their attributes are autocompleted in an IDE.
The model outputs also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string), in which case the attributes that are `None` are ignored.

</Tip>
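For instance, with the `pt_outputs` computed above, the following are equivalent ways to get the logits (a small illustrative sketch):

```py
>>> logits = pt_outputs.logits  # attribute access
>>> logits = pt_outputs["logits"]  # dictionary-style access
>>> logits = pt_outputs[0]  # tuple-style access; attributes that are None are skipped
```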
### Save a model
Once your model is fine-tuned, you can save it with its tokenizer using [`PreTrainedModel.save_pretrained`]:

```py
>>> pt_save_directory = "./pt_save_pretrained"
>>> tokenizer.save_pretrained(pt_save_directory)  # doctest: +IGNORE_RESULT
>>> pt_model.save_pretrained(pt_save_directory)
>>> # ===PT-TF-SPLIT===
>>> tf_save_directory = "./tf_save_pretrained"
>>> tokenizer.save_pretrained(tf_save_directory)  # doctest: +IGNORE_RESULT
>>> tf_model.save_pretrained(tf_save_directory)
```

When you are ready to use the model again, reload it with [`PreTrainedModel.from_pretrained`]:

```py
>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
>>> # ===PT-TF-SPLIT===
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")
```

One particularly cool 🤗 Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The `from_pt` or `from_tf` parameter can convert the model from one framework to the other:

```py
>>> from transformers import AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True)
```