commit 7f3b9a31bf
Joao Gante, 2025-07-02 22:09:15 +02:00, committed by GitHub
10 changed files with 147 additions and 565 deletions

View File

@@ -100,8 +100,6 @@
title: Distributed inference
- local: perf_infer_cpu
title: CPU
- local: tf_xla
title: XLA
title: Optimization
- local: agents
title: Agents
@@ -141,8 +139,6 @@
title: GPU
- local: perf_train_cpu
title: CPU
- local: perf_train_tpu_tf
title: TPU
- local: perf_train_special
title: Apple Silicon
- local: perf_train_gaudi
@@ -1144,4 +1140,3 @@
title: Environment Variables
title: Reference
title: API

View File

@@ -16,3 +16,5 @@ rendered properly in your Markdown viewer.
> [!WARNING]
> Agents and tools were spun out into the standalone [smolagents](https://huggingface.co/docs/smolagents/index) library. They were removed from `transformers` in v4.52.
(deprecated)

View File

@@ -25,10 +25,7 @@ Check model leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_
This guide shows you how to quickly start chatting with Transformers from the command line, how to build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
## chat CLI
### Interactive chat session
After you've [installed Transformers](./installation.md), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session.
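For example, the following launches an interactive session (the model name is just an illustration — any chat-tuned model on the Hub, or a local path, should work):

```bash
transformers chat Qwen/Qwen2.5-0.5B-Instruct
```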
@@ -52,68 +49,7 @@ For a full list of options, run the command below.
transformers chat -h
```
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)).
### Serving a model and using MCP tools
> [!WARNING]
> This section is experimental and subject to changes in future versions
Powering the `chat` interface, we have a server that takes user messages and returns completions. The server has a chat completion API compatible with the OpenAI SDK, so you can also quickly experiment with `transformers` models in existing applications. To launch a server separately, use the `transformers serve` CLI:
```bash
transformers serve Menlo/Jan-nano
```
Under the hood, the `chat` CLI launches and uses `transformers serve`. This server is also an MCP client, which can receive information from available MCP servers (i.e. tools), massage their information into the model prompt, and prepare calls to these tools when the model requests them. Naturally, this requires a model that is trained to use tools.
At the moment, MCP tool usage in `transformers` has the following constraints:
- `chat` can't handle tools, but the [`tiny-agents`](https://huggingface.co/blog/python-tiny-agents) CLI can;
- Only the `qwen` family of models is supported.
The first step to use MCP tools is to let the model know which tools are available. As an example, let's consider a `tiny-agents` configuration file with a reference to an [image generation MCP server](https://evalstate-flux1-schnell.hf.space/).
> [!TIP]
> Many Hugging Face Spaces can be used as MCP servers. You can find all compatible Spaces [here](https://huggingface.co/spaces?filter=mcp-server).
```json
{
"model": "http://localhost:8000",
"provider": "local",
"servers": [
{
"type": "sse",
"config": {
"url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
}
}
]
}
```
You can then launch your `tiny-agents` chat interface with the following command.
```bash
tiny-agents run path/to/your/config.json
```
If you have a server (from `transformers serve`) running in the background, you're ready to use MCP tools from a local model! For instance, here's an example of a chat session:
```bash
Agent loaded with 1 tools:
• flux1_schnell_infer
» Generate an image of a cat on the moon
<Tool req_0_tool_call>flux1_schnell_infer {"prompt": "a cat on the moon", "seed": 42, "randomize_seed": true, "width": 1024, "height": 1024, "num_inference_steps": 4}
Tool req_0_tool_call
[Binary Content: Image image/webp, 57732 bytes]
The task is complete and the content accessible to the User
Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/3dbddc0e53b5a865ed56a4e3dbdd30f3f61cf3b8aabf1b456f43e5241bd968b8/image.webp
380576952
I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance!
```
## TextGenerationPipeline

View File

@@ -1,355 +0,0 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# TPU
TPU (Tensor Processing Unit) is a type of hardware designed to accelerate tensor computations for training and inference. TPUs are generally accessed through Google Cloud services, but smaller TPUs are also available for free from [Google Colab](https://colab.research.google.com/notebooks/tpu.ipynb) or [Kaggle](https://www.kaggle.com/docs/tpu).
This guide focuses on training a Keras model for sequence classification on a TPU from Google Colab. Make sure the TPU runtime is enabled by going to **Runtime > Change runtime type** and selecting a TPU.
Run the command below to install the latest version of Transformers and [Datasets](https://huggingface.co/docs/datasets).
```py
!pip install -U transformers datasets
```
Create an instance of [tf.distribute.cluster_resolver.TPUClusterResolver](https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/TPUClusterResolver), and then connect to the remote cluster and initialize the TPUs.
```py
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
```
There are various distribution strategies for running your model on multiple TPUs. The [tf.distribute.TPUStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/TPUStrategy) offers synchronized distributed training.
```py
strategy = tf.distribute.TPUStrategy(resolver)
```
Load and tokenize a dataset - this example uses [CoLA](https://huggingface.co/datasets/nyu-mll/glue/viewer/cola) from the GLUE benchmark - and pad all samples to the maximum length so it is easier to load as an array and to avoid [XLA compilation issues](#xla).
```py
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np
dataset = load_dataset("glue", "cola")["train"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
train_data = tokenizer(
dataset["sentence"],
padding="max_length",
truncation=True,
max_length=128,
return_tensors="np",
)
train_data = dict(train_data)
train_labels = np.array(dataset["label"])
```
The model **must** be created inside [Strategy.scope](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy#scope) in order to replicate the model layers on each TPU device.
```py
from transformers import TFAutoModelForSequenceClassification

model_checkpoint = "distilbert-base-cased"
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)
    model.compile(optimizer="adam")
Unlike the Keras [fit](https://keras.io/api/models/model_training_apis/#fit-method) method, which accepts a broader range of inputs, TPUs only accept [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) inputs.
```py
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
tf_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
tf_dataset = tf_dataset.shuffle(len(tf_dataset))
tf_dataset = tf_dataset.batch(BATCH_SIZE, drop_remainder=True)
```
Finally, call [fit](https://keras.io/api/models/model_training_apis/#fit-method) to start training.
```py
model.fit(tf_dataset)
```
## Large datasets
The dataset created above pads every sample to the maximum length and loads the whole dataset into memory. This may not be possible if you're working with larger datasets. When training on large datasets, you may want to create a [tf.TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) or stream the data.
### tf.TFRecord
[tf.TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) is the standard [tf.data](https://www.tensorflow.org/guide/data) format for storing training data. For very large training jobs, it's worth preprocessing your data and storing it in the `tf.TFRecord` format and building a `tf.data` pipeline on top. Refer to the table below to help you decide whether `tf.TFRecord` is helpful for you.
| pros | cons |
|---|---|
| works on all TPU instances | costs associated with cloud storage |
| supports huge datasets and massive throughput | some data types (images) can take a lot of space to store |
| suitable for training on entire TPU pods | |
| preprocessing is done in advance, maximizing training speed | |
Preprocess and tokenize the dataset before writing it to a `tf.TFRecord` to avoid writing every time the data is loaded.
An exception is made for *train-time augmentations*, because augmentations baked into the `tf.TFRecord` at write time result in the same augmentation for each epoch. Instead, apply augmentations in the `tf.data` pipeline that loads the data, as sketched below.
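A minimal sketch of the latter, assuming an image-style `tf.data` pipeline (the `image_tf_dataset` name and the `"image"` key are purely illustrative; the tokenized text pipeline in this guide doesn't need augmentation):

```py
def augment(sample):
    # Applied at load time, so every epoch sees a different random variant of each image.
    sample["image"] = tf.image.random_flip_left_right(sample["image"])
    return sample

image_tf_dataset = image_tf_dataset.map(augment)
```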
> [!TIP]
> In practice, you probably won't be able to load the entire dataset in memory. Load a chunk of the dataset at a time and convert it to `TFRecord`, and repeat until the entire dataset is in the `TFRecord` format. Then you can use a list of all the files to create a `TFRecordDataset`. The example below demonstrates a single file for simplicity.
```py
tokenized_data = tokenizer(
dataset["sentence"],
padding="max_length",
truncation=True,
max_length=128,
return_tensors="np",
)
labels = dataset["label"]
with tf.io.TFRecordWriter("dataset.tfrecords") as file_writer:
    for i in range(len(labels)):
        features = {
            "input_ids": tf.train.Feature(
                int64_list=tf.train.Int64List(value=tokenized_data["input_ids"][i])
            ),
            "attention_mask": tf.train.Feature(
                int64_list=tf.train.Int64List(value=tokenized_data["attention_mask"][i])
            ),
            "labels": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[labels[i]])
            ),
        }
        features = tf.train.Features(feature=features)
        example = tf.train.Example(features=features)
        record_bytes = example.SerializeToString()
        file_writer.write(record_bytes)
```
Build a [TFRecordDataset](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset) using the saved filename to load it.
```py
def decode_fn(sample):
    features = {
        "input_ids": tf.io.FixedLenFeature((128,), dtype=tf.int64),
        "attention_mask": tf.io.FixedLenFeature((128,), dtype=tf.int64),
        "labels": tf.io.FixedLenFeature((1,), dtype=tf.int64),
    }
    return tf.io.parse_example(sample, features)

# TFRecordDataset can handle gs:// paths
tf_dataset = tf.data.TFRecordDataset(["gs://matt-tf-tpu-tutorial-datasets/cola/dataset.tfrecords"])
tf_dataset = tf_dataset.map(decode_fn)
tf_dataset = tf_dataset.shuffle(len(dataset)).batch(BATCH_SIZE, drop_remainder=True)
tf_dataset = tf_dataset.apply(
    tf.data.experimental.assert_cardinality(len(labels) // BATCH_SIZE)
)
```
The dataset can now be passed to the [fit](https://keras.io/api/models/model_training_apis/#fit-method) method.
```py
model.fit(tf_dataset)
```
### Stream from raw data
Data can be stored in its native format and preprocessed in a [tf.data](https://www.tensorflow.org/guide/data) pipeline as the data is loaded. This approach isn't supported for many models with complex tokenization schemes, but some models like BERT are supported because their tokenization can be compiled. Refer to the table below to help you decide whether this approach is helpful for you.
| pros | cons |
|---|---|
| suitable for highly compressed big data in native format (images, audio) | requires writing a full preprocessing pipeline |
| convenient if raw data is available in a public cloud bucket | complex preprocessing on-the-fly can hurt throughput |
| works on all TPU instances if data is stored in Google Cloud | must place data in cloud storage if not already there |
| | not as suitable for text data because writing a tokenization pipeline is hard (use `TFRecord` for text) |
The example below demonstrates streaming data for an image model.
Load an image dataset and get a list of the underlying image file paths and labels.
```py
from datasets import load_dataset
image_dataset = load_dataset("beans", split="train")
filenames = image_dataset["image_file_path"]
labels = image_dataset["labels"]
```
Convert the local filenames in the dataset into `gs://` paths in Google Cloud Storage.
```py
# strip everything but the category directory and filenames
base_filenames = ['/'.join(filename.split('/')[-2:]) for filename in filenames]
# prepend the Google Cloud base path to everything instead
gs_paths = ["gs://matt-tf-tpu-tutorial-datasets/beans/"+filename for filename in base_filenames]
# create tf_dataset
tf_dataset = tf.data.Dataset.from_tensor_slices(
{"filename": gs_paths, "labels": labels}
)
tf_dataset = tf_dataset.shuffle(len(tf_dataset))
```
Transformers preprocessing classes like [`AutoImageProcessor`] are framework-agnostic and can't be compiled into a pipeline by `tf.data`. To get around this, get the normalization values (`mean` and `std`) from the [`AutoImageProcessor`] and use them in the `tf.data` pipeline.
```py
from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
image_size = (processor.size["height"], processor.size["width"])
image_mean = processor.image_mean
image_std = processor.image_std
```
Use these normalization values to create a function to load and preprocess the images.
```py
BATCH_SIZE = 8 * strategy.num_replicas_in_sync

def decode_fn(sample):
    image_data = tf.io.read_file(sample["filename"])
    image = tf.io.decode_jpeg(image_data, channels=3)
    image = tf.image.resize(image, image_size)
    array = tf.cast(image, tf.float32)
    array /= 255.0
    array = (array - image_mean) / image_std
    array = tf.transpose(array, perm=[2, 0, 1])
    return {"pixel_values": array, "labels": sample["labels"]}

tf_dataset = tf_dataset.map(decode_fn)
tf_dataset = tf_dataset.batch(BATCH_SIZE, drop_remainder=True)
print(tf_dataset.element_spec)
```
The dataset can now be passed to the [fit](https://keras.io/api/models/model_training_apis/#fit-method) method.
```py
from transformers import TFAutoModelForImageClassification

image_model_checkpoint = "google/vit-base-patch16-224"
with strategy.scope():
    model = TFAutoModelForImageClassification.from_pretrained(image_model_checkpoint)
    model.compile(optimizer="adam")

model.fit(tf_dataset)
```
### Stream with prepare_tf_dataset
[`~TFPreTrainedModel.prepare_tf_dataset`] creates a `tf.data` pipeline that loads samples from a Hugging Face [`~datasets.Dataset`]. The pipeline uses [tf.numpy_function](https://www.tensorflow.org/api_docs/python/tf/numpy_function) or [`~datasets.Dataset.from_generator`], which can't be compiled by TensorFlow, to access the underlying dataset. It also won't work on a Colab TPU or TPU Nodes because the pipeline streams data from a local disk. Refer to the table below to help you decide whether this approach is helpful for you.
| pros | cons |
|---|---|
| simple code | only works on TPU VM |
| same approach on TPU/GPU | data must be available as a Hugging Face Dataset |
| dataset doesn't have to fit in memory | data must fit on local storage |
| supports variable padding | data loading may be a bottleneck on a big TPU pod slice |
[`~TFPreTrainedModel.prepare_tf_dataset`] only works on a [TPU VM](#tpu-types). Add the tokenizer output as new columns in the dataset; since the dataset is stored on disk, it can handle data larger than the available memory. Use [`~TFPreTrainedModel.prepare_tf_dataset`] to stream data from the dataset by wrapping it with a `tf.data` pipeline.
```py
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"], padding="max_length", truncation=True, max_length=128
    )

# add the tokenizer output to the dataset as new columns
dataset = dataset.map(tokenize_function)
# prepare_tf_dataset() chooses columns that match the model's input names
tf_dataset = model.prepare_tf_dataset(
    dataset, batch_size=BATCH_SIZE, shuffle=True, tokenizer=tokenizer
)
```
The dataset can now be passed to the [fit](https://keras.io/api/models/model_training_apis/#fit-method) method.
```py
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)
    model.compile(optimizer="adam")

model.fit(tf_dataset)
```
## TPU types
There are two types of TPUs, TPU Nodes and TPU VMs.
A TPU Node indirectly accesses a remote TPU. It requires a separate VM to initialize your network and data pipeline, and then forwards it to the remote node. Google Colab TPUs are an example of a TPU Node. You can't use local data because the TPU is remotely located, and data must be stored in Google Cloud Storage where the data pipeline can access it.
TPU VMs are connected directly to the machine the TPU is located on, and they are generally easier to work with, especially when it comes to your data pipeline.
> [!TIP]
> We recommend avoiding TPU Nodes if possible because they are more difficult to debug than TPU VMs. TPU Nodes may also be unsupported in the future and become a legacy access method.
A single TPU (v2-8, v3-8, v4-8) runs 8 replicas. TPUs can exist in **pods** which run hundreds or even thousands of replicas simultaneously. When you only use a portion of a pod, it is referred to as a **pod slice**. On Google Colab, you'll typically get a single v2-8 TPU.
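A quick way to confirm what you're running on is to check the replica count from the strategy created earlier (a small sanity-check sketch):

```py
# 8 for a single v2-8/v3-8 TPU; much larger on a pod slice.
print(f"Number of replicas: {strategy.num_replicas_in_sync}")
```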
## XLA
[XLA](https://openxla.org/xla) is a linear algebra compiler for high-performance execution and it is used by default to improve performance on TPUs.
Before executing your code on a TPU, it's a good idea to try it first on a CPU or GPU because it is easier to debug. You can train for a few steps to make sure the model and data pipeline work as expected. Set `jit_compile=True` in the [compile](https://keras.io/api/models/model_training_apis/#compile-method) method to enable XLA compilation (but remember to remove this line of code before running on a TPU).
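As a sketch of that debugging workflow (remember to drop `jit_compile=True` again before moving to the TPU, where XLA is already used):

```py
# Compile with XLA on CPU/GPU to surface XLA-compatibility issues early,
# then run only a handful of steps to verify the model and data pipeline.
model.compile(optimizer="adam", jit_compile=True)
model.fit(tf_dataset, steps_per_epoch=5, epochs=1)
```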
The section below outlines three rules for making your code XLA-compatible. Transformers enforce the first two rules for models and loss functions by default, but don't forget about them if you're writing your own models and loss functions.
### Data dependent conditionals
Any `if` statements cannot depend on values inside a [tf.Tensor](https://www.tensorflow.org/api_docs/python/tf/Tensor). The code below can't be compiled by XLA.
```py
if tf.reduce_sum(tensor) > 10:
    tensor = tensor / 2.0
```
To compile with XLA, use [tf.cond](https://www.tensorflow.org/api_docs/python/tf/cond) or remove the conditional and use indicator variables instead as shown below.
```py
sum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32)
tensor = tensor / (1.0 + sum_over_10)
```
### Data dependent shapes
The shape of a [tf.Tensor](https://www.tensorflow.org/api_docs/python/tf/Tensor) cannot depend on its values. For example, [tf.unique](https://www.tensorflow.org/api_docs/python/tf/unique) can't be compiled because it returns a tensor containing an instance of each unique value in the input. The shape of this output depends on how repetitive the input [tf.Tensor](https://www.tensorflow.org/api_docs/python/tf/Tensor) is, as the example below shows.
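A tiny illustration of the problem (output shapes shown as comments):

```py
import tensorflow as tf

# The length of the output depends on the *values* of the input, not just its shape.
values, _ = tf.unique(tf.constant([1, 2, 2, 3]))  # values has shape (3,)
values, _ = tf.unique(tf.constant([1, 1, 1, 1]))  # values has shape (1,)
```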
This is an issue during **label masking**, where labels are set to a negative value to indicate they should be ignored when computing the loss. The code below can't be compiled by XLA because the shapes of `masked_outputs` and `masked_labels` depend on how many positions are masked.
```py
label_mask = labels >= 0
masked_outputs = outputs[label_mask]
masked_labels = labels[label_mask]
loss = compute_loss(masked_outputs, masked_labels)
mean_loss = tf.reduce_mean(loss)
```
To compile with XLA, avoid the data-dependent shapes by computing the loss for every position and zeroing out the masked positions in both the numerator and denominator when calculating the mean. Convert `tf.bool` to `tf.float32` as an indicator variable to make your code XLA-compatible.
```py
label_mask = tf.cast(labels >= 0, tf.float32)
loss = compute_loss(outputs, labels)
loss = loss * label_mask
mean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask)
```
### Recompile different input shapes
XLA recompiles your model if input shapes are variable, which creates huge performance problems. It is especially common in text models because input texts have variable lengths after tokenization.
> [!WARNING]
> Excessive padding can also severely slow down training because it requires more compute and memory to process.
To avoid different shapes, use padding to pad all your inputs to the same length and use an `attention_mask`. Try padding batches of samples to a multiple of 32 or 64 tokens. Use the parameters `padding="max_length"`, `padding="longest"`, or `pad_to_multiple_of` to help with padding. This often increases the number of tokens by a small amount, but it significantly reduces the number of unique input shapes because every input shape is a multiple of 32 or 64. Fewer unique input shapes require fewer recompilations.
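As a sketch, the tokenizer parameters above combine like this (`sentences` stands in for any batch of raw strings):

```py
# Pad each batch to the longest sample, rounded up to a multiple of 64 tokens,
# so XLA only ever sees a handful of distinct input shapes.
batch = tokenizer(
    sentences,
    padding="longest",
    pad_to_multiple_of=64,
    truncation=True,
    max_length=512,
    return_tensors="tf",
)
```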

View File

@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Serving
Transformer models can be served for inference natively with the `transformers serve` CLI, or with specialized libraries such as Text Generation Inference (TGI) and vLLM. These libraries are specifically designed to optimize performance with LLMs and include many unique optimization features that may not be included in Transformers. If you're looking for experimental features instead, serving natively might be better suited to you.
## TGI
@@ -62,3 +62,117 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct \
--model-impl transformers \
--trust-remote-code
```
## Serve CLI
> [!WARNING]
> This section is experimental and subject to change in future versions
<!-- TODO: LLMs -> models, after we add audio/image input/output support -->
You can serve `transformers`-compatible LLMs with `transformers serve`. The server has a chat completion API compatible with the OpenAI SDK, so you can also quickly experiment with `transformers` models in existing applications. To launch a server, use the `transformers serve` CLI:
```shell
transformers serve
```
<!-- TODO: either fully align the two APIs, or link to the `transformers` version instead -->
This server takes an extended version of the [`ChatCompletionInput`](https://huggingface.co/docs/huggingface_hub/v0.33.1/en/package_reference/inference_types#huggingface_hub.ChatCompletionInput), accepting a serialized `GenerationConfig` in its `extra_body` field for full `generate` parameterization. The CLI will dynamically load a new model as needed, following the `model` field in the request.
The simplest way to interact with the server is by sending an HTTP request with `cURL`, e.g.
```shell
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
```
from which you'll receive multiple chunks
```shell
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
(...)
```
This server is also an MCP client, which can receive information from available MCP servers (i.e. tools), massage their information into the model prompt, and prepare calls to these tools when the model requests them. Naturally, this requires a model that is trained to use tools.
> [!TIP]
> At the moment, MCP tool usage in `transformers` is limited to the `qwen` family of models.
<!-- TODO: example with a minimal python example -->
### Example 1: `tiny-agents` and MCP Tools
This subsection showcases how to use `transformers serve` as a local LLM server to the [`tiny-agents`](https://huggingface.co/blog/python-tiny-agents) CLI, and how to configure MCP tools.
The first step to use MCP tools is to let the model know which tools are available. As an example, let's consider a `tiny-agents` configuration file with a reference to an [image generation MCP server](https://evalstate-flux1-schnell.hf.space/).
```json
{
"model": "Menlo/Jan-nano",
"endpointUrl": "http://localhost:8000",
"servers": [
{
"type": "sse",
"config": {
"url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
}
}
]
}
```
> [!TIP]
> Many Hugging Face Spaces can be used as MCP servers. You can find all compatible Spaces [here](https://huggingface.co/spaces?filter=mcp-server).
You can then launch your `tiny-agents` chat interface with the following command.
```bash
tiny-agents run path/to/your/config.json
```
If you have `transformers serve` running in the background, you're ready to use MCP tools from a local model! For instance, here's an example of a chat session with `tiny-agents`:
```bash
Agent loaded with 1 tools:
• flux1_schnell_infer
» Generate an image of a cat on the moon
<Tool req_0_tool_call>flux1_schnell_infer {"prompt": "a cat on the moon", "seed": 42, "randomize_seed": true, "width": 1024, "height": 1024, "num_inference_steps": 4}
Tool req_0_tool_call
[Binary Content: Image image/webp, 57732 bytes]
The task is complete and the content accessible to the User
Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/3dbddc0e53b5a865ed56a4e3dbdd30f3f61cf3b8aabf1b456f43e5241bd968b8/image.webp
380576952
I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance!
```
### Example 2: Jan and serving in a different machine
This subsection showcases how to use `transformers serve` as a local LLM server within the [Jan](https://jan.ai/) app. Jan is a ChatGPT-alternative graphical interface, with a focus on local models.
To connect `transformers serve` with Jan, you'll need to set up a new model provider ("Settings" > "Model Providers"). Click on "Add Provider", and set a new name. In your new model provider page, all you need to set is the "Base URL" to the following pattern:
```
http://[host]:[port]/v1
```
where `host` and `port` are the `transformers serve` CLI parameters (`localhost:8000` by default). After setting this up, you should be able to see some models in the "Models" section after hitting "Refresh". Make sure you also add a Hugging Face Hub API key. Your custom model provider page should look like this:
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_jan_model_providers.png"/>
</h3>
You are now ready to chat!
> [!TIP]
> You can add any `transformers`-compatible model to Jan through `transformers serve`. In the custom model provider you created, click on the "+" button in the "Models" section and add its Hub repository name.
To conclude this example, let's look into a more advanced use case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have `ssh` access from your Jan machine into your server, this can be accomplished by typing the following in your Jan machine's terminal:
```
ssh -N -f -L localhost:8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
```
Port forwarding is not Jan-specific: you can use it to connect `transformers serve` running on a different machine with an app of your choice.

View File

@@ -1,129 +0,0 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# XLA
[[open-in-colab]]
[Accelerated Linear Algebra (XLA)](https://openxla.org/xla) is a linear algebra compiler that optimizes model runtime across different hardware and frameworks.
This guide will look specifically at how to accelerate *TensorFlow* models with XLA.
## TensorFlow
XLA can potentially accelerate a TensorFlow model without making any source code changes. It is already packaged with the TensorFlow library, and it is triggered with `jit_compile` in any graph-creating function such as [tf.function](https://www.tensorflow.org/api_docs/python/tf/function).
If you're using Keras methods like [fit](https://keras.io/api/models/model_training_apis/#fit-method) and [predict](https://keras.io/api/models/model_training_apis/#predict-method), enable XLA by passing `jit_compile=True` to [compile](https://keras.io/api/models/model_training_apis/#compile-method).
```py
model.compile(jit_compile=True)
```
XLA can be used to accelerate any arbitrary [tf.function](https://www.tensorflow.org/api_docs/python/tf/function).
Models with a TensorFlow implementation like [GPT2](./model_doc/gpt2), [T5](./model_doc/t5), [OPT](./model_doc/opt), and [Whisper](./model_doc/whisper) are XLA compatible. The speedup depends on the model, but in general, TensorFlow models in Transformers get a ~100x speedup.
### Functions
A typical forward pass in a TensorFlow model is shown below. To run a forward pass with XLA, wrap the model with [tf.function](https://www.tensorflow.org/api_docs/python/tf/function) and set `jit_compile=True`.
```diff
import tensorflow as tf
model = tf.keras.Sequential(
[tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")]
)
# Generate random inputs for the model.
batch_size = 16
input_vector_dim = 10
random_inputs = tf.random.normal((batch_size, input_vector_dim))
# Run a forward pass.
- _ = model(random_inputs)
+ xla_fn = tf.function(model, jit_compile=True)
+ _ = xla_fn(random_inputs)
```
The default `call` function of the model is used to compile the XLA graph. But if there's any other model function you want to compile with XLA, wrap it with [tf.function](https://www.tensorflow.org/api_docs/python/tf/function).
```py
my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True)
```
### Text generation
You could also compile other model functions with XLA. For example, enable XLA for text generation by wrapping [`~TFGenerationMixin.generate`] with [tf.function](https://www.tensorflow.org/api_docs/python/tf/function).
```py
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM
# Will error if the minimum version of Transformers is not installed.
from transformers.utils import check_min_version
check_min_version("4.21.0")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]
xla_generate = tf.function(model.generate, jit_compile=True)
tokenized_input = tokenizer(input_string, return_tensors="tf")
generated_tokens = xla_generate(**tokenized_input, num_beams=2)
decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
"Generated -- TensorFlow is an open-source, open-source, distributed-source application framework for the"
```
## Tracing
When executing an XLA-enabled function for the first time, it tries to infer the computation graph in a process known as *tracing*. This is a time-consuming step, but any consecutive calls to the function will be much faster because it won't have to trace the computation graph again.
To ensure a function is only traced once, the inputs must have the same shape as when the graph was built. This usually isn't an issue for fixed input shapes like images, but it can be an issue for inputs with variable shapes like text.
One way to handle this is to pad your text so it always has the same shape. Configure padding options such as [pad_to_multiple_of](https://hf.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.pad.pad_to_multiple_of) in the tokenizer.
```py
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]
xla_generate = tf.function(model.generate, jit_compile=True)
# Call tokenizer with padding options.
tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf")
generated_tokens = xla_generate(**tokenized_input, num_beams=2)
decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
```
In addition to the input shape, any changes to the generation options at any point also trigger tracing, as shown below.
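Reusing the `xla_generate` function from above, a sketch of what does and doesn't retrace:

```py
# First call with these options: traces the graph (slow).
xla_generate(**tokenized_input, num_beams=2)
# Same shapes and same options: reuses the traced graph (fast).
xla_generate(**tokenized_input, num_beams=2)
# Changing a generation option such as num_beams forces a new trace (slow again).
xla_generate(**tokenized_input, num_beams=4)
```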
## Resources
Learn more about XLA with the following resources.
- A [notebook](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/91_tf_xla_generate.ipynb) demonstrating XLA-compatible encoder-decoder and decoder-only text generation models.
- The [Faster Text Generation with TensorFlow and XLA](https://hf.co/blog/tf-xla-generate) blog post compares benchmarks for XLA-compatible models and provides a friendly introduction to XLA in TensorFlow.
- The [How Hugging Face improved Text Generation performance with XLA](https://blog.tensorflow.org/2022/11/how-hugging-face-improved-text-generation-performance-with-xla.html) blog post discusses the design philosophy behind adding XLA to TensorFlow models in Transformers.
- The [Introduction to graphs and tf.function](https://www.tensorflow.org/guide/intro_to_graphs) guide.
- The [Better performance with tf.function](https://www.tensorflow.org/guide/function) guide.
- The [XLA](https://openxla.org/xla) documentation.

View File

@@ -16,3 +16,5 @@ rendered properly in your Markdown viewer.
> [!WARNING]
> Agents and tools were spun out into the standalone [smolagents](https://huggingface.co/docs/smolagents/index) library. They were removed from `transformers` in v4.52.
(deprecated)

View File

@@ -133,15 +133,17 @@ def create_generation_config_from_req(req: "ChatCompletionInput") -> "GenerationConfig":
    generation_config = GenerationConfig()
    if req.frequency_penalty is not None:
        generation_config.repetition_penalty = float(req.frequency_penalty)
    if req.logit_bias is not None:
        generation_config.sequence_bias = req.logit_bias
    if req.stop is not None:
        generation_config.stop_strings = req.stop
    if req.temperature is not None:
        generation_config.temperature = float(req.temperature)
        if float(req.temperature) == 0.0:
            generation_config.do_sample = False
    if req.top_p is not None:
        generation_config.top_p = float(req.top_p)
    if req.seed is not None:
        torch.manual_seed(req.seed)
@@ -202,7 +204,7 @@ class ServeArguments:
    use_bnb_nested_quant: bool = field(default=False, metadata={"help": "Whether to use nested quantization."})

    # Serving settings
    host: str = field(default="localhost", metadata={"help": "Interface the server will listen to."})
    port: int = field(default=8000, metadata={"help": "Port the server will listen to."})

    # Other settings
@@ -256,6 +258,9 @@ class ServeCommand(BaseTransformersCLICommand):
        finish_reason: Optional[str] = None,
        tool_calls: Optional[list[ChatCompletionStreamOutputDeltaToolCall]] = None,
    ) -> str:
        print("role", role)
        print("content", content)
        print("finish_reason", finish_reason)
        payload = {
            "object": "chat.completion.chunk",
            "id": request_id,
@@ -280,6 +285,16 @@ class ServeCommand(BaseTransformersCLICommand):
    def run(self):
        app = FastAPI()

        from fastapi.middleware.cors import CORSMiddleware

        app.add_middleware(
            CORSMiddleware,
            allow_origins=['*'],
            allow_credentials=True,
            allow_methods=["*"],
            allow_headers=["*"],
        )

        if self.use_continuous_batching:
            self.continuous_batching(app)
        else:
@@ -401,8 +416,8 @@ class ServeCommand(BaseTransformersCLICommand):
        # No cached messages: this is a new request
        if self.last_messages is None:
            req_continues_last_messages = False
        # The new request has no new rounds of conversation: this is a new request
        elif len(self.last_messages) >= len(req.messages):
            req_continues_last_messages = False
        # Otherwise, check that the last messages are a subset of the new request
        else:

View File

@@ -1776,6 +1776,10 @@ class GenerationMixin(ContinuousMixin):
            ):
                modified_values[key] = model_gen_config_value
                setattr(generation_config, key, model_gen_config_value)

        # edge case: we may set `temperature=0.0` and `do_sample=False`, but the model defaults to
        # `do_sample=True`
        if generation_config.temperature == 0.0:
            generation_config.do_sample = False

        if use_model_defaults is None and len(modified_values) > 0:
            logger.warning_once(
                f"`generation_config` default values have been modified to match model-specific defaults: "

View File

@@ -278,7 +278,6 @@ docs/source/en/perf_train_cpu_many.md
docs/source/en/perf_train_gpu_many.md
docs/source/en/perf_train_gpu_one.md
docs/source/en/perf_train_special.md
docs/source/en/perf_train_tpu_tf.md
docs/source/en/perplexity.md
docs/source/en/philosophy.md
docs/source/en/pipeline_webserver.md
@@ -307,7 +306,6 @@ docs/source/en/tasks/video_classification.md
docs/source/en/tasks/visual_question_answering.md
docs/source/en/tasks/zero_shot_image_classification.md
docs/source/en/tasks/zero_shot_object_detection.md
docs/source/en/tf_xla.md
docs/source/en/tflite.md
docs/source/en/tokenizer_summary.md
docs/source/en/torchscript.md