From e786844425b6b1112c76513d66217ce2fe6aea41 Mon Sep 17 00:00:00 2001
From: Matt
Date: Fri, 5 Jul 2024 15:30:24 +0100
Subject: [PATCH] Repeating an important warning in the chat template docs
 (#31796)

* Repeating an important warning in the chat template docs

* Update docs/source/en/chat_templating.md

Co-authored-by: Lysandre Debut

* Reword for clarity

* Reword for clarity

---------

Co-authored-by: Lysandre Debut
---
 docs/source/en/chat_templating.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/chat_templating.md b/docs/source/en/chat_templating.md
index df1843743e2..d840caaf660 100644
--- a/docs/source/en/chat_templating.md
+++ b/docs/source/en/chat_templating.md
@@ -199,7 +199,8 @@ effect that `add_generation_prompt` has will depend on the template being used.
 
 ## Can I use chat templates in training?
 
-Yes! We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you
+Yes! This is a good way to ensure that the chat template matches the tokens the model sees during training.
+We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you
 can simply continue like any other language model training task. When training, you should usually set
 `add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful
 during training. Let's see an example:
@@ -233,6 +234,16 @@ The sun.</s>
 
 From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column.
+<Tip>
+If you format text with `apply_chat_template(tokenize=False)` and then tokenize it in a separate step, you should set the argument
+`add_special_tokens=False`. If you use `apply_chat_template(tokenize=True)`, you don't need to worry about this!
+
+By default, some tokenizers add special tokens like `<bos>` and `<eos>` to text they tokenize. Chat templates should
+always include all of the special tokens they need, and so adding extra special tokens with
+the default `add_special_tokens=True` can result in incorrect or duplicated special tokens, which will hurt model
+performance.
+
+</Tip>
 
 ## Advanced: Extra inputs to chat templates
 
 The only argument that `apply_chat_template` requires is `messages`. However, you can pass any keyword
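
The pitfall the added warning describes can be sketched with a self-contained toy. Note that `apply_chat_template`, `tokenize`, and the `<bos>`/`<eos>` strings below are simplified stand-ins, not the real `transformers` API: the point is only that a chat template's output already contains its special tokens, so tokenizing that output with the default `add_special_tokens=True` duplicates them.

```python
# Toy illustration of the duplicated-special-token pitfall. BOS/EOS and both
# functions are hypothetical stand-ins for a real tokenizer and chat template.
BOS, EOS = "<bos>", "<eos>"

def apply_chat_template(messages):
    # Like real chat templates, this already emits the special tokens it needs.
    return "".join(f"{BOS}{m['role']}: {m['content']}{EOS}" for m in messages)

def tokenize(text, add_special_tokens=True):
    # Stand-in for a tokenizer that wraps its input in BOS/EOS by default.
    if add_special_tokens:
        text = BOS + text + EOS
    return text.replace(EOS, f" {EOS} ").replace(BOS, f" {BOS} ").split()

messages = [{"role": "user", "content": "hi"}]
formatted = apply_chat_template(messages)

bad = tokenize(formatted)                             # default: duplicates BOS/EOS
good = tokenize(formatted, add_special_tokens=False)  # matches the template

print(bad)   # starts with two <bos> tokens in a row
print(good)  # ['<bos>', 'user:', 'hi', '<eos>']
```

With the default setting the sequence begins with two `<bos>` tokens and ends with two `<eos>` tokens, which is exactly the mismatch the patch warns will hurt model performance during training.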