transformers/docs/source/en/chat_templating.md
Steven Liu c0f8d055ce
[docs] Redesign (#31757)
* toctree

* not-doctested.txt

* collapse sections

* feedback

* update

* rewrite get started sections

* fixes

* fix

* loading models

* fix

* customize models

* share

* fix link

* contribute part 1

* contribute pt 2

* fix toctree

* tokenization pt 1

* Add new model (#32615)

* v1 - working version

* fix

* fix

* fix

* fix

* rename to correct name

* fix title

* fixup

* rename files

* fix

* add copied from on tests

* rename to `FalconMamba` everywhere and fix bugs

* fix quantization + accelerate

* fix copies

* add `torch.compile` support

* fix tests

* fix tests and add slow tests

* copies on config

* merge the latest changes

* fix tests

* add few lines about instruct

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix

* fix tests

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* "to be not" -> "not to be" (#32636)

* "to be not" -> "not to be"

* Update sam.md

* Update trainer.py

* Update modeling_utils.py

* Update test_modeling_utils.py

* Update test_modeling_utils.py

* fix hfoption tag

* tokenization pt. 2

* image processor

* fix toctree

* backbones

* feature extractor

* fix file name

* processor

* update not-doctested

* update

* make style

* fix toctree

* revision

* make fixup

* fix toctree

* fix

* make style

* fix hfoption tag

* pipeline

* pipeline gradio

* pipeline web server

* add pipeline

* fix toctree

* not-doctested

* prompting

* llm optims

* fix toctree

* fixes

* cache

* text generation

* fix

* chat pipeline

* chat stuff

* xla

* torch.compile

* cpu inference

* toctree

* gpu inference

* agents and tools

* gguf/tiktoken

* finetune

* toctree

* trainer

* trainer pt 2

* optims

* optimizers

* accelerate

* parallelism

* fsdp

* update

* distributed cpu

* hardware training

* gpu training

* gpu training 2

* peft

* distrib debug

* deepspeed 1

* deepspeed 2

* chat toctree

* quant pt 1

* quant pt 2

* fix toctree

* fix

* fix

* quant pt 3

* quant pt 4

* serialization

* torchscript

* scripts

* tpu

* review

* model addition timeline

* modular

* more reviews

* reviews

* fix toctree

* reviews reviews

* continue reviews

* more reviews

* modular transformers

* more review

* zamba2

* fix

* all frameworks

* pytorch

* supported model frameworks

* flashattention

* rm check_table

* not-doctested.txt

* rm check_support_list.py

* feedback

* updates/feedback

* review

* feedback

* fix

* update

* feedback

* updates

* update

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-03 10:33:46 -08:00

13 KiB
Raw Blame History

Templates

The chat pipeline guide introduced [TextGenerationPipeline] and the concept of a chat prompt or chat template for conversing with a model. Underlying this high-level pipeline is the [apply_chat_template] method. A chat template is a part of the tokenizer and it specifies how to convert conversations into a single tokenizable string in the expected model format.

In the example below, Mistral-7B-Instruct and Zephyr-7B are finetuned from the same base model but theyre trained with different chat formats. Without chat templates, you have to manually write formatting code for each model and even minor errors can hurt performance. Chat templates offer a universal way to format chat inputs to any model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
<|user|>\nHello, how are you?</s>\n<|assistant|>\nI'm doing great. How can I help you today?</s>\n<|user|>\nI'd like to show off how chat templating works!</s>\n

This guide explores [apply_chat_template] and chat templates in more detail.

apply_chat_template

Chats should be structured as a list of dictionaries with role and content keys. The role key specifies the speaker (usually between you and the system), and the content key contains your message. For the system, the content is a high-level description of how the model should behave and respond when youre chatting with it.

Pass your messages to [apply_chat_template] to tokenize and format them. You can set add_generation_prompt to True to indicate the start of a message.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>

Now pass the tokenized chat to [~GenerationMixin.generate] to generate a response.

outputs = model.generate(tokenized_chat, max_new_tokens=128) 
print(tokenizer.decode(outputs[0]))
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.

add_generation_prompt

The add_generation_prompt parameter adds tokens that indicate the start of a response. This ensures the chat model generates a system response instead of continuing a users message.

Not all models require generation prompts, and some models, like Llama, dont have any special tokens before the system response. In this case, add_generation_prompt has no effect.

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
tokenized_chat
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>

continue_final_message

The continue_final_message parameter controls whether the final message in the chat should be continued or not instead of starting a new one. It removes end of sequence tokens so that the model continues generation from the final message.

This is useful for “prefilling” a model response. In the example below, the model generates text that continues the JSON string rather than starting a new message. It can be very useful for improving the accuracy for instruction following when you know how to start its replies.

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

formatted_chat = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True, continue_final_message=True)
model.generate(**formatted_chat)

Warning

You shouldnt use add_generation_prompt and continue_final_message together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.

[TextGenerationPipeline] sets add_generation_prompt to True by default to start a new message. However, if the final message in the chat has the “assistant” role, it assumes the message is a prefill and switches to continue_final_message=True. This is because most models dont support multiple consecutive assistant messages. To override this behavior, explicitly pass the continue_final_message to the pipeline.

Multiple templates

A model may have several different templates for different use cases. For example, a model may have a template for regular chat, tool use, and RAG.

When there are multiple templates, the chat template is a dictionary. Each key corresponds to the name of a template. [apply_chat_template] handles multiple templates based on their name. It looks for a template named default in most cases and if it cant find one, it raises an error.

For a tool calling template, if a user passes a tools parameter and a tool_use template exists, the tool calling template is used instead of default.

To access templates with other names, pass the template name to the chat_template parameter in [apply_chat_template]. For example, if youre using a RAG template then set chat_template="rag".

It can be confusing to manage multiple templates though, so we recommend using a single template for all use cases. Use Jinja statements like if tools is defined and {% macro %} definitions to wrap multiple code paths in a single template.

Template selection

It is important to set a chat template format that matches the template format a model was pretrained on, otherwise performance may suffer. Even if youre training the model further, performance is best if the chat tokens are kept constant.

But if youre training a model from scratch or finetuning a model for chat, you have more options to select a template. For example, ChatML is a popular format that is flexbile enough to handle many use cases. It even includes support for generation prompts, but it doesnt add beginning-of-string (BOS) or end-of-string (EOS) tokens. If your model expects BOS and EOS tokens, set add_special_tokens=True and make sure to add them to your template.

{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}

Set the template with the following logic to support generation prompts. The template wraps each message with <|im_start|> and <|im_end|> tokens and writes the role as a string. This allows you to easily customize the roles you want to train with.

tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

The user, system and assistant roles are standard roles in chat templates. We recommend using these roles when it makes sense, especially if youre using your model with the [TextGenerationPipeline].

<|im_start|>system
You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I'm doing great!<|im_end|>

Model training

Training a model with a chat template is a good way to ensure a chat template matches the tokens a model is trained on. Apply the chat template as a preprocessing step to your dataset. Set add_generation_prompt=False because the additional tokens to prompt an assistant response arent helpful during training.

An example of preprocessing a dataset with a chat template is shown below.

from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
<|user|>
Which is bigger, the moon or the sun?</s>
<|assistant|>
The sun.</s>

After this step, you can continue following the training recipe for causal language models using the formatted_chat column.

Some tokenizers add special <bos> and <eos> tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with apply_chat_template(tokenize=False), make sure you set add_special_tokens=False as well to avoid duplicating them.

apply_chat_template(messages, tokenize=False, add_special_tokens=False)

This isnt an issue if apply_chat_template(tokenize=True).