transformers/docs/source/en/tiktoken.md
Viktor Scherbakov 95c10fedb3
Updated documentation and added conversion utility (#34319)
2024-11-25 18:44:09 +01:00

Tiktoken and interaction with Transformers

🤗 transformers supports tiktoken model files. When you call from_pretrained on a Hub repository whose tokenizer.model is a tiktoken file, the file is automatically converted into our fast tokenizer.

Known models that were released with a tiktoken.model:

- gpt2
- llama3

Example usage

To load a tiktoken file in transformers, make sure that the tokenizer.model file is a tiktoken file; from_pretrained then loads it automatically. The following example loads a tokenizer from the original subfolder of the Llama 3 Instruct repository:

from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original") 
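
If you are unsure whether a given tokenizer.model is a tiktoken file, a quick format check can help. The sketch below assumes the plain-text layout used for tiktoken BPE files, where each line holds a base64-encoded token followed by its integer rank. The helper names parse_tiktoken_lines and looks_like_tiktoken are hypothetical and are not part of transformers or tiktoken.

```python
import base64
import binascii


def parse_tiktoken_lines(text: str) -> dict[bytes, int]:
    """Parse tiktoken-style BPE text: one "<base64 token> <rank>" pair per line."""
    ranks: dict[bytes, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Unpacking raises ValueError unless the line has exactly two fields.
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64, validate=True)] = int(rank)
    return ranks


def looks_like_tiktoken(text: str) -> bool:
    """Heuristic: True if every non-empty line parses as base64 token plus integer rank."""
    try:
        return len(parse_tiktoken_lines(text)) > 0
    except (ValueError, binascii.Error):
        return False


# Two-line sample: b"Hello" with rank 0 and b" world" with rank 1.
sample = "SGVsbG8= 0\nIHdvcmxk 1\n"
print(parse_tiktoken_lines(sample))
print(looks_like_tiktoken(sample))
print(looks_like_tiktoken("not a tiktoken file"))
```

This is only a heuristic: it confirms that the text has the expected shape, not that the ranks form a valid vocabulary. For a real file, read its contents as text and pass them to looks_like_tiktoken.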

Create tiktoken tokenizer

The tokenizer.model file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to tokenizer.json, the appropriate format for [PreTrainedTokenizerFast].
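
To make the missing information concrete, here is a minimal sketch of what writing such a file involves, assuming the tiktoken BPE layout of one base64-encoded token and its rank per line. ToyEncoding and dump_ranks are hypothetical stand-ins written for this illustration; they are not the tiktoken classes or functions.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class ToyEncoding:
    """Hypothetical stand-in for a tiktoken encoding; not the real tiktoken class."""

    mergeable_ranks: dict[bytes, int]
    pat_str: str
    special_tokens: dict[str, int] = field(default_factory=dict)


def dump_ranks(encoding: ToyEncoding) -> str:
    """Serialize only the token ranks, one "<base64 token> <rank>" pair per line."""
    lines = []
    for token, rank in sorted(encoding.mergeable_ranks.items(), key=lambda kv: kv[1]):
        lines.append(f"{base64.b64encode(token).decode('ascii')} {rank}")
    return "\n".join(lines) + "\n"


toy = ToyEncoding(
    mergeable_ranks={b"a": 0, b"b": 1, b"ab": 2},
    pat_str=r"\w+|\s+",
    special_tokens={"<|endoftext|>": 3},
)
dumped = dump_ranks(toy)
print(dumped)

# Neither the pattern string nor the special token survives in the dump.
print(toy.pat_str in dumped, "<|endoftext|>" in dumped)
```

Only the ranks reach the file, so anything that reads it back has to obtain the pattern string and special tokens from somewhere else.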

Load an encoding with tiktoken.get_encoding and then convert it to tokenizer.json with [convert_tiktoken_to_fast].


from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")

The resulting tokenizer.json file is saved to the specified directory and can be loaded with [PreTrainedTokenizerFast].

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
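
To check that the additional tokens and the pre-tokenization pattern made it into the converted file, you can inspect tokenizer.json directly. The sketch below uses only the standard library and assumes the usual top-level keys of the tokenizers serialization format (added_tokens, pre_tokenizer, model). The helper summarize_tokenizer_json is hypothetical, and the demo file is a tiny hand-written stand-in rather than real converter output.

```python
import json
import tempfile
from pathlib import Path


def summarize_tokenizer_json(path: str) -> dict:
    """Summarize a tokenizer.json file, assuming the usual top-level keys."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    model = data.get("model") or {}
    pre_tokenizer = data.get("pre_tokenizer") or {}
    return {
        "model_type": model.get("type"),
        "vocab_size": len(model.get("vocab") or {}),
        "pre_tokenizer_type": pre_tokenizer.get("type"),
        "added_tokens": [t.get("content") for t in data.get("added_tokens") or []],
    }


# Tiny hand-written stand-in for a real tokenizer.json, used only for this demo.
demo = {
    "added_tokens": [{"id": 2, "content": "<|endoftext|>", "special": True}],
    "pre_tokenizer": {"type": "Sequence"},
    "model": {"type": "BPE", "vocab": {"a": 0, "b": 1}, "merges": []},
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "tokenizer.json"
    demo_path.write_text(json.dumps(demo), encoding="utf-8")
    summary = summarize_tokenizer_json(str(demo_path))
print(summary)
```

On a converted tokenizer, you would point the helper at the tokenizer.json inside the save directory, for example config/save/dir/tokenizer.json.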