
Serving

Transformer models can be served for inference natively with the transformers serve CLI, or with specialized libraries such as Text Generation Inference (TGI) and vLLM. These libraries are specifically designed to optimize LLM performance and include many unique optimization features that may not be available in Transformers. If you're looking for experimental features instead, serving natively may be a better fit.

TGI

TGI can serve models that aren't natively implemented by falling back on the Transformers implementation of the model. Some of TGI's high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.

Tip

Refer to the Non-core model serving guide for more details.

Serve a Transformers implementation the same way you'd serve a TGI model.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2

Add --trust-remote-code to the command to serve a custom Transformers model.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
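
Once the container is running, you can query it over HTTP. Below is a minimal sketch using the requests library against TGI's /generate endpoint, assuming the port mapping from the commands above (host port 8080); adjust the prompt and parameters to your model.

import requests

# The docker commands above map the container's port 80 to port 8080 on the host.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 32},
    },
)
response.raise_for_status()
print(response.json()["generated_text"])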

vLLM

vLLM can also serve a Transformers implementation of a model if it isn't natively implemented in vLLM.

Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.

Tip

Refer to the Transformers fallback section for more details.

By default, vLLM serves the native implementation of a model, and falls back to the Transformers implementation if a native one doesn't exist. You can also set --model-impl transformers to explicitly use the Transformers model implementation.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers

Add the --trust-remote-code parameter to enable loading a model with custom code.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers \
    --trust-remote-code
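
You can then query the served model through vLLM's OpenAI-compatible API. Below is a minimal sketch with the openai Python SDK, assuming vLLM's default server address of http://localhost:8000/v1.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the API key is a placeholder unless the
# server was started with one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)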

Serve CLI

Warning

This section is experimental and subject to change in future versions.

You can serve transformers-compatible LLMs with transformers serve. The server has a chat completion API compatible with the OpenAI SDK, so you can quickly experiment with transformers models from existing applications. To launch a server, use the transformers serve CLI:

transformers serve

This server takes an extended version of the ChatCompletionInput, accepting a serialized GenerationConfig in its extra_body field for full generate parameterization. The CLI will dynamically load a new model as needed, following the model field in the request.
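
For example, generation parameters can be tuned per request through extra_body. The sketch below uses the openai Python SDK and assumes transformers serve is listening on localhost:8000; the exact key name and serialization of the GenerationConfig inside extra_body are assumptions for illustration.

import json

from openai import OpenAI

# transformers serve exposes a chat completion API compatible with the OpenAI SDK.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Menlo/Jan-nano",  # the server loads the requested model dynamically
    messages=[{"role": "user", "content": "Hello!"}],
    # Assumption: the serialized GenerationConfig travels under this key; the
    # field name and JSON encoding below are illustrative, not the exact schema.
    extra_body={"generation_config": json.dumps({"temperature": 0.7, "max_new_tokens": 64})},
)
print(completion.choices[0].message.content)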

This server is also an MCP client: it can receive information from available MCP servers (i.e. tools), format that information into the model prompt, and prepare calls to those tools when the model decides to use them. Naturally, this requires a model that is trained to use tools.

Tip

At the moment, MCP tool usage in transformers is limited to the Qwen family of models.

Example 1: tiny-agents and MCP Tools

This subsection showcases how to use transformers serve as a local LLM server for the tiny-agents CLI, and how to configure MCP tools.

The first step to use MCP tools is to let the model know which tools are available. As an example, let's consider a tiny-agents configuration file with a reference to an image generation MCP server.

{
    "model": "Menlo/Jan-nano",
    "endpointUrl": "http://localhost:8000",
    "servers": [
        {
            "type": "sse",
            "config": {
                "url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
            }
        }
    ]
}
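
In this configuration, model and endpointUrl should match the model and address served by your transformers serve instance (localhost:8000 by default), while the servers list registers the MCP tool endpoints the agent can call.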

Tip

Many Hugging Face Spaces can be used as MCP servers. You can find all compatible Spaces here.

You can then launch your tiny-agents chat interface with the following command.

tiny-agents run path/to/your/config.json

If you have transformers serve running in the background, you're ready to use MCP tools from a local model! For instance, here's an example of a chat session with tiny-agents:

Agent loaded with 1 tools:
 • flux1_schnell_infer
»  Generate an image of a cat on the moon
<Tool req_0_tool_call>flux1_schnell_infer {"prompt": "a cat on the moon", "seed": 42, "randomize_seed": true, "width": 1024, "height": 1024, "num_inference_steps": 4}

Tool req_0_tool_call
[Binary Content: Image image/webp, 57732 bytes]
The task is complete and the content accessible to the User
Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/3dbddc0e53b5a865ed56a4e3dbdd30f3f61cf3b8aabf1b456f43e5241bd968b8/image.webp
380576952

I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance!

Example 2: Jan and serving from a different machine

This subsection showcases how to use transformers serve as a local LLM server within the Jan app. Jan is a graphical ChatGPT alternative with a focus on local models.

To connect transformers serve with Jan, you'll need to set up a new model provider ("Settings" > "Model Providers"). Click on "Add Provider" and set a new name. In your new model provider page, all you need to set is the "Base URL", following this pattern:

http://[host]:[port]/v1

where host and port are the transformers serve CLI parameters (localhost:8000 by default). After setting this up, you should be able to see some models in the "Models" section after hitting "Refresh". Make sure you also add a Hugging Face Hub API key.

You are now ready to chat!

Tip

You can add any transformers-compatible model to Jan through transformers serve. In the custom model provider you created, click on the "+" button in the "Models" section and add its Hub repository name.

To conclude this example, let's look at a more advanced use case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have SSH access from your Jan machine into your server, this can be accomplished by typing the following in your Jan machine's terminal:

ssh -N -f -L localhost:8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
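
Once the tunnel is up, you can sanity-check it from the Jan machine before configuring the provider. Below is a minimal sketch, assuming transformers serve exposes an OpenAI-compatible model listing at /v1/models:

import requests

# With the SSH tunnel active, the remote transformers serve instance is reachable
# through localhost:8000 on the Jan machine.
response = requests.get("http://localhost:8000/v1/models")
response.raise_for_status()
print(response.json())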

Port forwarding is not Jan-specific: you can use it to connect transformers serve running on a different machine to an app of your choice.