Mirror of https://github.com/huggingface/transformers.git
🚨 No more default chat templates (#31733)
* No more default chat templates
* Add the template to the GPT-SW3 tests since it's not available by default now
* Fix GPT2 test
* Fix Bloom test
* Fix Bloom test
* Remove default templates again
This commit is contained in:
parent 1c122a46dc
commit edd68f4ed8
@@ -580,7 +580,7 @@ default template for that model class is used instead. Let's take a look at the
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
>>> tokenizer.chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
```
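
The template above can be exercised directly with `apply_chat_template`; a quick sketch (the output string is illustrative rather than a captured run):

```python
>>> chat = [
...     {"role": "user", "content": "Hello!"},
...     {"role": "assistant", "content": "Nice to meet you."},
... ]
>>> tokenizer.apply_chat_template(chat, tokenize=False)
" Hello! Nice to meet you.</s>"
```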

@@ -704,23 +704,6 @@ with other names, pass the name of the template you want to the `chat_template`
We find that this can be a bit confusing for users, though - so if you're writing a template yourself, we recommend
trying to put it all in a single template where possible!
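
For a tokenizer whose `chat_template` is a dict of named templates, the name can be passed instead of a full template string. A sketch (the `"tool_use"` name is illustrative; available names depend on the tokenizer, and some named templates expect extra variables such as `tools`):

```python
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    chat_template="tool_use",  # looked up by name in the tokenizer's template dict
    tokenize=False,
    add_generation_prompt=True,
)
```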

### What are "default" templates?

Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards
compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a
model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline`
class and methods like `apply_chat_template` will use the class template instead. You can find out what the default
template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute.
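
In effect, the legacy lookup order was the following (a rough sketch of the behaviour described above, not the actual library code):

```python
# Rough sketch of the legacy fallback described above -- not the real
# implementation inside apply_chat_template():
def effective_chat_template(tokenizer):
    if tokenizer.chat_template is not None:  # template saved with the checkpoint
        return tokenizer.chat_template
    return tokenizer.default_chat_template  # deprecated class-level fallback
```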

This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when
the class template is appropriate for your model, we strongly recommend overriding the default template by
setting the `chat_template` attribute explicitly to make it clear to users that your model has been correctly configured
for chat.

Now that actual chat templates have been adopted more widely, default templates have been deprecated and will be
removed in a future release. We strongly recommend setting the `chat_template` attribute for any tokenizers that
still depend on them!
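
A minimal migration sketch (the template string is just the Blenderbot one shown earlier; substitute whatever format your model was actually trained on, and any output path you like):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

# Pin the template explicitly instead of relying on the deprecated class default
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
    "{{ message['content'] }}"
    "{% if not loop.last %}{{ ' ' }}{% endif %}"
    "{% endfor %}{{ eos_token }}"
)
# save_pretrained() persists chat_template in tokenizer_config.json
tokenizer.save_pretrained("blenderbot-with-template")
```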

### What template should I use?

When setting the template for a model that's already been trained for chat, you should ensure that the template
@@ -220,7 +220,7 @@ The chat template for a model is stored in the `tokenizer.chat_template` attribute
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
>>> tokenizer.chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
```

@@ -307,12 +307,6 @@ If you're fine-tuning a model for chat, in addition to setting a template…

</Tip>

### What are "default" templates?

Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline` class and methods like `apply_chat_template` will use the class template instead. You can find out what the default template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute.

This is something we do purely for backwards compatibility reasons, to avoid breaking any existing workflows. Even when the class template is appropriate for your model, we strongly recommend overriding the default template by explicitly setting the `chat_template` attribute, to make it clear to users that your model has been configured correctly for chat, and to be future-proof in case the default templates are ever altered or removed.

### What template should I use?

When setting the template for a model that has already been trained for chat, you should make sure that the template exactly matches the message format the model saw during training, or you will likely see performance degradation. This is true even if you are training the model further; you will probably get the best performance if you keep the chat tokens constant. This is very analogous to tokenization: you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training.
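
A quick way to sanity-check the match (a sketch; compare the rendered string against a formatted sample from the model's actual training data):

```python
chat = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello!"},
]
# Render without tokenizing and compare, token for token, with the training format
print(tokenizer.apply_chat_template(chat, tokenize=False))
```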

@@ -85,7 +85,7 @@ One increasingly common use case for LLMs (Language Models) is "chat…
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
>>> tokenizer.chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
```

@@ -228,7 +228,7 @@ The sun.</s>
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
>>> tokenizer.chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}"
```

@@ -405,17 +405,3 @@ class BlenderbotTokenizer(PreTrainedTokenizer):
            `List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        return token_ids_0 + [self.eos_token_id]

    @property
    def default_chat_template(self):
        """
        A very simple chat template that just adds whitespace between messages.
        """
        return (
            "{% for message in messages %}"
            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
            "{{ message['content'] }}"
            "{% if not loop.last %}{{ ' ' }}{% endif %}"
            "{% endfor %}"
            "{{ eos_token }}"
        )

@@ -287,18 +287,3 @@ class BlenderbotTokenizerFast(PreTrainedTokenizerFast):
            `List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        return token_ids_0 + [self.eos_token_id]

    @property
    # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template
    def default_chat_template(self):
        """
        A very simple chat template that just adds whitespace between messages.
        """
        return (
            "{% for message in messages %}"
            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
            "{{ message['content'] }}"
            "{% if not loop.last %}{{ ' ' }}{% endif %}"
            "{% endfor %}"
            "{{ eos_token }}"
        )

@@ -217,18 +217,3 @@ class BlenderbotSmallTokenizer(PreTrainedTokenizer):
                index += 1

        return vocab_file, merge_file

    @property
    # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template
    def default_chat_template(self):
        """
        A very simple chat template that just adds whitespace between messages.
        """
        return (
            "{% for message in messages %}"
            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
            "{{ message['content'] }}"
            "{% if not loop.last %}{{ ' ' }}{% endif %}"
            "{% endfor %}"
            "{{ eos_token }}"
        )

@@ -98,18 +98,3 @@ class BlenderbotSmallTokenizerFast(PreTrainedTokenizerFast):
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

    @property
    # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template
    def default_chat_template(self):
        """
        A very simple chat template that just adds whitespace between messages.
        """
        return (
            "{% for message in messages %}"
            "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
            "{{ message['content'] }}"
            "{% if not loop.last %}{{ ' ' }}{% endif %}"
            "{% endfor %}"
            "{{ eos_token }}"
        )

@@ -147,11 +147,3 @@ class BloomTokenizerFast(PreTrainedTokenizerFast):
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        files = self._tokenizer.model.save(save_directory, name=filename_prefix)
        return tuple(files)

    @property
    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template
    def default_chat_template(self):
        """
        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
        """
        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

@@ -437,61 +437,6 @@ class CodeLlamaTokenizer(PreTrainedTokenizer):

        return output

    @property
    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
    def default_chat_template(self):
        """
        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
        to fine-tune a model with more flexible role ordering!

        The output should look something like:

        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
        <bos>[INST] Prompt [/INST]

        The reference for this chat template is [this code
        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
        in the original repository.
        """
        template = (
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
            "{% else %}"
            "{% set content = message['content'] %}"
            "{% endif %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ ' ' + content.strip() + ' ' + eos_token }}"
            "{% endif %}"
            "{% endfor %}"
        )
        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        return template

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None

@@ -349,61 +349,6 @@ class CodeLlamaTokenizerFast(PreTrainedTokenizerFast):

        return (out_vocab_file,)

    @property
    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
    def default_chat_template(self):
        """
        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
        to fine-tune a model with more flexible role ordering!

        The output should look something like:

        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
        <bos>[INST] Prompt [/INST]

        The reference for this chat template is [this code
        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
        in the original repository.
        """
        template = (
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
            "{% else %}"
            "{% set content = message['content'] %}"
            "{% endif %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ ' ' + content.strip() + ' ' + eos_token }}"
            "{% endif %}"
            "{% endfor %}"
        )
        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        return template

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:

@@ -228,188 +228,6 @@ class CohereTokenizerFast(PreTrainedTokenizerFast):
        self._add_bos_token = value
        self.update_post_processor()

    @property
    def default_chat_template(self):
        """
        The Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
        Additionally, to indicate the source of the message, it uses <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
        for user, assistant and system messages respectively.

        The output should look something like:
        <|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>

        Use add_generation_prompt to add a prompt for the model to generate a response:
        >>> from transformers import AutoTokenizer
        >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
        >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
        >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'

        """
        default_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% if system_message != false %}"  # Start with system message
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_template = default_template.replace(
            "USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false"
        )
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        default_template = default_template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        tool_use_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
            "{% for tool in tools %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n\n'}}"
            "{% endif %}"
            "{{'```python\ndef ' + tool.name + '('}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ ', '}}"
            "{% endif %}"
            "{{param_name}}: "
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + '] = None'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{% endfor %}"
            '{{ \') -> List[Dict]:\n    """\'}}'
            "{{ tool.description }}"
            "{% if tool.parameter_definitions|length != 0 %}"
            "{{ '\n\n    Args:\n        '}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n        ' }}"
            "{% endif %}"
            "{{ param_name + ' ('}}"
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + ']'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{{ '): ' + param_fields.description }}"
            "{% endfor %}"
            "{% endif %}"
            '{{ \'\n    """\n    pass\n```\' }}'
            "{% endfor %}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_tool_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        tool_use_template = tool_use_template.replace("DEFAULT_SYSTEM_MESSAGE", default_tool_message)

        rag_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
            "{{ '<results>' }}"
            "{% for document in documents %}"  # Loop over all documents
            "{{ '\nDocument: ' }}"
            "{{ loop.index0 }}\n"
            "{% for key, value in document.items() %}"
            "{{ key }}: {{value}}\n"
            "{% endfor %}"
            "{% endfor %}"
            "{{ '</results>'}}"
            "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
            "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
            "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
            "{% if citation_mode=='accurate' %}"
            "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
            "{% endif %}"
            "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
            "{{ '<|END_OF_TURN_TOKEN|>' }}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message)

        return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template}

    def apply_tool_use_template(
        self,
        conversation: Union[List[Dict[str, str]]],

@@ -236,19 +236,6 @@ class GPTSanJapaneseTokenizer(PreTrainedTokenizer):
        text = "".join(words)
        return text

    @property
    def default_chat_template(self):
        """
        A simple chat template that adds standard BOS, SEP and EOS tokens between messages while discarding role
        information.
        """
        return (
            "{% for message in messages %}"
            "{% if not loop.first %}{{ bos_token}}{% endif %}"
            "{{ sep_token }}{{ message.content }} {{ eos_token }}"
            "{% endfor %}"
        )

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        index = 0
        if os.path.isdir(save_directory):

@@ -329,10 +329,3 @@ class GPT2Tokenizer(PreTrainedTokenizer):
        if is_split_into_words or add_prefix_space:
            text = " " + text
        return (text, kwargs)

    @property
    def default_chat_template(self):
        """
        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
        """
        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

@@ -139,12 +139,3 @@ class GPT2TokenizerFast(PreTrainedTokenizerFast):
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        files = self._tokenizer.model.save(save_directory, name=filename_prefix)
        return tuple(files)

    @property
    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template
    def default_chat_template(self):
        """
        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
        """

        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

@@ -228,11 +228,3 @@ class GPTNeoXTokenizerFast(PreTrainedTokenizerFast):
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        files = self._tokenizer.model.save(save_directory, name=filename_prefix)
        return tuple(files)

    @property
    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template
    def default_chat_template(self):
        """
        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
        """
        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

@@ -161,18 +161,6 @@ class GPTNeoXJapaneseTokenizer(PreTrainedTokenizer):
        out_string = "".join(tokens).strip()
        return out_string

    @property
    def default_chat_template(self):
        """
        A simple chat template that just adds BOS/EOS tokens around messages while discarding role information.
        """
        return (
            "{% for message in messages %}"
            "{{ bos_token + eos_token + message.content + eos_token }}"
            "{% endfor %}"
            "{% if add_generation_prompt %} {{ bos_token + eos_token }} {% endif %}"
        )

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        index = 0
        if os.path.isdir(save_directory):

@@ -294,19 +294,3 @@ class GPTSw3Tokenizer(PreTrainedTokenizer):
        """

        return self.sp_model.decode(token_ids)

    @property
    def default_chat_template(self):
        """
        This chat template formats messages like an instant messenger chat log, with "User:" and "Bot:" strings
        preceding messages. BOS tokens are added between all messages.
        """
        return (
            "{{ eos_token }}{{ bos_token }}"
            "{% for message in messages %}"
            "{% if message['role'] == 'user' %}{{ 'User: ' + message['content']}}"
            "{% else %}{{ 'Bot: ' + message['content']}}{% endif %}"
            "{{ message['text'] }}{{ bos_token }}"
            "{% endfor %}"
            "Bot:"
        )

@@ -251,60 +251,3 @@ class Idefics2Processor(ProcessorMixin):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

    @property
    def default_chat_template(self):
        """
        This template formats inputs in the form of a chat history. For each message in the chat history:
        * the template will output the role of the speaker followed by the content of the message.
        * content can be a single string or a list of strings and images.
        * If the content element is an image, the template will output a sequence of <image> tokens and <fake_token_around_image> token before and after each image
        * The template will output an <end_of_utterance> token at the end of each message.

        Example:

        ```python
        messages = [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {"type": "image"},
                {"type": "image"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground."},]
        }]
        ```

        Will create outputs like:
        ```
        User: What’s in this image?<image><image><end_of_utterance>
        Assistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>
        ```
        """
        # fmt: off
        return (
            "{% for message in messages %}"
            "{{message['role'].capitalize()}}"
            "{% if message['content'][0]['type'] == 'image' %}"
            "{{':'}}"
            "{% else %}"
            "{{': '}}"
            "{% endif %}"
            "{% for line in message['content'] %}"
            "{% if line['type'] == 'text' %}"
            "{{line['text']}}"
            "{% elif line['type'] == 'image' %}"
            "{{ '<image>' }}"
            "{% endif %}"
            "{% endfor %}"
            "<end_of_utterance>\n"
            "{% endfor %}"

            "{% if add_generation_prompt %}"
            "{{ 'Assistant:' }}"
            "{% endif %}"
        )
        # fmt: on

@@ -411,57 +411,3 @@ class LlamaTokenizer(PreTrainedTokenizer):
            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)

        return output

    @property
    def default_chat_template(self):
        """
        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
        to fine-tune a model with more flexible role ordering!

        The output should look something like:

        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
        <bos>[INST] Prompt [/INST]

        The reference for this chat template is [this code
        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
        in the original repository.
        """
        template = (
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
            "{% else %}"
            "{% set content = message['content'] %}"
            "{% endif %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ ' ' + content.strip() + ' ' + eos_token }}"
            "{% endif %}"
            "{% endfor %}"
        )
        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        return template

@@ -241,61 +241,6 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):

        return (out_vocab_file,)

    @property
    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
    def default_chat_template(self):
        """
        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
        to fine-tune a model with more flexible role ordering!

        The output should look something like:

        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
        <bos>[INST] Prompt [/INST]

        The reference for this chat template is [this code
        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
        in the original repository.
        """
        template = (
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
            "{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}"
            "{% else %}"
            "{% set content = message['content'] %}"
            "{% endif %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ ' ' + content.strip() + ' ' + eos_token }}"
            "{% endif %}"
            "{% endfor %}"
        )
        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        return template

    # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers
    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):

@@ -159,63 +159,3 @@ class LlavaNextVideoProcessor(ProcessorMixin):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

    @property
    def default_chat_template(self):
        """
        This default vicuna template formats inputs in the form of a chat history. For each message in the chat history:
        * the template will output the role of the speaker followed by the content of the message.
        * content is a list of strings and images.
        * If the content element is an image, the template will output a sequence of <image> or <video> tokens

        Example:

        ```python
        messages = [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s the content of this video?"},
                {"type": "video"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "This picture shows a red stop sign."},]
        }]
        ```

        Will create outputs like:
        ```
        USER: <video>\nWhat’s the content of this video?
        ASSISTANT: This picture shows a red stop sign
        ```
        """
        # fmt: off
        return (
            "{% for message in messages %}"
            "{% if message['role'] == 'system' %}"
            "{{ message['content'][0]['text'] }}"
            "{% else %}"
            "{{ message['role'].upper() + ': '}}"
            "{% endif %}"

            "{# Render all images first #}"
            "{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}"
            "{{ '<image>\n' }}"
            "{% endfor %}"

            "{# Render all videos next #}"
            "{% for content in message['content'] | selectattr('type', 'equalto', 'video') %}"
            "{{ '<video>\n' }}"
            "{% endfor %}"

            "{# Render all text finally #}"
            "{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}"
            "{{ content['text'] + ' '}}"
            "{% endfor %}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ 'ASSISTANT:' }}"
            "{% endif %}"
        )
        # fmt: on

@@ -810,14 +810,6 @@ class WhisperTokenizer(PreTrainedTokenizer):
            text = " " + text
        return (text, kwargs)

    @property
    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template
    def default_chat_template(self):
        """
        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
        """
        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        self.set_prefix_tokens(task=task, language=language, predict_timestamps=not no_timestamps)
        # prefix tokens are of the form: <|startoftranscript|> <|lang_id|> <|task|> <|notimestamps|>

@@ -539,14 +539,6 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
            return prefix_ones + ([0] * len(token_ids_0)) + suffix_ones
        return prefix_ones + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + suffix_ones

    @property
    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template
    def default_chat_template(self):
        """
        A simple chat template that ignores role information and just concatenates messages with EOS tokens.
        """
        return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"

    # Copied from transformers.models.whisper.tokenization_whisper.WhisperTokenizer.get_decoder_prompt_ids
    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        self.set_prefix_tokens(task=task, language=language, predict_timestamps=not no_timestamps)

@@ -971,8 +971,8 @@ class ProcessorMixin(PushToHubMixin):
            conversation (`List[Dict, str, str]`):
                The conversation to format.
            chat_template (`Optional[str]`, *optional*):
                The Jinja template to use for formatting the conversation. If not provided, the default chat template
                is used.
                The Jinja template to use for formatting the conversation. If not provided, the tokenizer's
                chat template is used.
            tokenize (`bool`, *optional*, defaults to `False`):
                Whether to tokenize the output or not.
            **kwargs:

@@ -982,15 +982,6 @@ class ProcessorMixin(PushToHubMixin):
        if chat_template is None:
            if self.chat_template is not None:
                chat_template = self.chat_template
            elif getattr(self, "default_chat_template", None) is not None:
                logger.warning_once(
                    "No chat template is set for this processor, falling back to a default class-level template. This is "
                    "very error-prone, because models are often trained with templates different from the class default! "
                    "Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which "
                    "point any code depending on them will stop working. We recommend setting a valid chat template before "
                    "then to ensure that this model continues working without issues."
                )
                chat_template = self.default_chat_template
            else:
                raise ValueError(
                    "No chat template is set for this processor. Please either set the `chat_template` attribute, "

@@ -1704,8 +1704,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
        """
        Converts a list of dictionaries with `"role"` and `"content"` keys to a list of token
        ids. This method is intended for use with chat models, and will read the tokenizer's chat_template attribute to
        determine the format and control tokens to use when converting. When chat_template is None, it will fall back
        to the default_chat_template specified at the class level.
        determine the format and control tokens to use when converting.

        Args:
            conversation (Union[List[Dict[str, str]], List[List[Dict[str, str]]]]): A list of dicts

@@ -1986,22 +1985,12 @@ class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
        Returns:
            `str`: The chat template string.
        """
        using_default_template = False
        # First, handle the cases when the model has a dict of multiple templates
        if isinstance(self.chat_template, dict) or (
            self.chat_template is None and isinstance(self.default_chat_template, dict)
        ):
            if self.chat_template is not None:
                template_dict = self.chat_template
                using_default_dict = False
            else:
                template_dict = self.default_chat_template
                using_default_dict = True
        if isinstance(self.chat_template, dict):
            template_dict = self.chat_template
            if chat_template is not None and chat_template in template_dict:
                # The user can pass the name of a template to the chat template argument instead of an entire template
                chat_template = template_dict[chat_template]
                if using_default_dict:
                    using_default_template = True
            elif chat_template is None:
                if tools is not None and "tool_use" in template_dict:
                    chat_template = template_dict["tool_use"]

@@ -2013,44 +2002,23 @@
                    "template or the name of the template you wish to use to the `chat_template` argument. Available "
                    f"template names are {sorted(template_dict.keys())}."
                )
            if using_default_dict:
                using_default_template = True

        elif chat_template is None:
            # These are the cases when the model has a single template
            # priority: `chat_template` argument > `tokenizer.chat_template` > `tokenizer.default_chat_template`
            # priority: `chat_template` argument > `tokenizer.chat_template`
            if self.chat_template is not None:
                chat_template = self.chat_template
            else:
                chat_template = self.default_chat_template
                using_default_template = True

        if using_default_template:
            logger.warning_once(
                "No chat template is set for this tokenizer, falling back to a default class-level template. This is "
                "very error-prone, because models are often trained with templates different from the class default! "
                "Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which "
                "point any code depending on them will stop working. We recommend setting a valid chat template before "
                "then to ensure that this model continues working without issues."
            )
            else:
                raise ValueError(
                    "Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template "
                    "argument was passed! For information about writing templates and setting the "
                    "tokenizer.chat_template attribute, please see the documentation at "
                    "https://huggingface.co/docs/transformers/main/en/chat_templating"
                )

        return chat_template

    @property
    def default_chat_template(self):
        """
        This template formats inputs in the standard ChatML format. See
        https://github.com/openai/openai-python/blob/main/chatml.md
        """
        return (
            "{% for message in messages %}"
            "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ '<|im_start|>assistant\n' }}"
            "{% endif %}"
        )

    @classmethod
    def from_pretrained(
        cls,

@@ -135,6 +135,7 @@ class BloomTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    @require_jinja
    def test_tokenization_for_chat(self):
        tokenizer = self.get_rust_tokenizer()
        tokenizer.chat_template = "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}"
        test_chats = [
            [{"role": "system", "content": "You are a helpful chatbot."}, {"role": "user", "content": "Hello!"}],
            [

@@ -280,6 +280,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    @require_jinja
    def test_tokenization_for_chat(self):
        tokenizer = GPT2Tokenizer.from_pretrained(self.tmpdirname)
        tokenizer.chat_template = "{% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}"
        test_chats = [
            [{"role": "system", "content": "You are a helpful chatbot."}, {"role": "user", "content": "Hello!"}],
            [

@@ -131,6 +131,15 @@ class GPTSw3TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    @require_jinja
    def test_tokenization_for_chat(self):
        tokenizer = GPTSw3Tokenizer(SAMPLE_VOCAB)
        tokenizer.chat_template = (
            "{{ eos_token }}{{ bos_token }}"
            "{% for message in messages %}"
            "{% if message['role'] == 'user' %}{{ 'User: ' + message['content']}}"
            "{% else %}{{ 'Bot: ' + message['content']}}{% endif %}"
            "{{ message['text'] }}{{ bos_token }}"
            "{% endfor %}"
            "Bot:"
        )
        # This is in English, but it's just here to make sure the chat control tokens are being added properly
        test_chats = [
            [{"role": "system", "content": "You are a helpful chatbot."}, {"role": "user", "content": "Hello!"}],