[LlamaTokenizerFast] Update documentation (#24132)

* Update documentation

* nits
Arthur 2023-06-09 16:30:20 +02:00 committed by GitHub
parent 62fe753325
commit 5af3a1aa48
2 changed files with 10 additions and 0 deletions


@@ -65,6 +65,7 @@ This model was contributed by [zphang](https://huggingface.co/zphang) with contr
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- update_post_processor
- save_vocabulary
## LlamaModel


@@ -48,6 +48,12 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
>>> [1, 15043, 445, 338, 263, 1243]
```
If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the tokenizer, or
call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
values of the first token and final token of an encoded sequence will not be correct). For more details, check out the
[post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
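As a rough usage sketch of the two options described above (not part of this patch; the
`hf-internal-testing/llama-tokenizer` checkpoint name is only an illustrative assumption):
```python
from transformers import LlamaTokenizerFast

# Option 1: pass the special tokens when initializing the tokenizer.
# The checkpoint name below is an assumption used only for illustration.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "hf-internal-testing/llama-tokenizer", bos_token="<s>", eos_token="</s>"
)

# Option 2: change a special token on an existing tokenizer, then rebuild the
# post processor so encoded sequences pick up the new token ids.
tokenizer.eos_token = "</s>"
tokenizer.update_post_processor()

print(tokenizer.encode("Hello this is a test"))
```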
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
@@ -108,6 +114,9 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
self.can_save_slow_tokenizer = False if not self.vocab_file else True
def update_post_processor(self):
"""
Updates the underlying post processor with the current `bos_token` and `eos_token`.
"""
bos = self.bos_token
bos_token_id = self.bos_token_id