[LlamaTokenizerFast] Update documentation (#24132)

* Update documentation

* nits
Arthur 2023-06-09 16:30:20 +02:00 committed by GitHub
parent 62fe753325
commit 5af3a1aa48
2 changed files with 10 additions and 0 deletions


@@ -65,6 +65,7 @@ This model was contributed by [zphang](https://huggingface.co/zphang) with contr
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- update_post_processor
- save_vocabulary
## LlamaModel


@@ -48,6 +48,12 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
>>> [1, 15043, 445, 338, 263, 1243]
```
If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the tokenizer, or
call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
values of the first token and final token of an encoded sequence will not be correct). For more details, check out the
[post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
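As a rough usage sketch of the two options described above (not part of this patch; the
`hf-internal-testing/llama-tokenizer` checkpoint name is only an illustrative assumption):
```python
from transformers import LlamaTokenizerFast

# Option 1: pass the special tokens when initializing the tokenizer.
# The checkpoint name below is an assumption used only for illustration.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "hf-internal-testing/llama-tokenizer", bos_token="<s>", eos_token="</s>"
)

# Option 2: change a special token on an existing tokenizer, then rebuild the
# post processor so encoded sequences pick up the new token ids.
tokenizer.eos_token = "</s>"
tokenizer.update_post_processor()

print(tokenizer.encode("Hello this is a test"))
```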
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
@@ -108,6 +114,9 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
self.can_save_slow_tokenizer = False if not self.vocab_file else True
def update_post_processor(self):
"""
Updates the underlying post processor with the current `bos_token` and `eos_token`.
"""
bos = self.bos_token
bos_token_id = self.bos_token_id