
# Tiktoken and interaction with Transformers
Support for tiktoken model files is seamlessly integrated in 🤗 transformers when loading models with `from_pretrained`, provided a `tokenizer.model` tiktoken file is present on the Hub; it is automatically converted into our fast tokenizer.
Known models that were released with a `tiktoken.model` file:

- gpt2
- llama3
## Example usage
In order to load `tiktoken` files in `transformers`, ensure that the `tokenizer.model` file is a tiktoken file; it will then be loaded automatically by `from_pretrained`. Here is how one would load a tokenizer and a model, which can be loaded from the exact same file:
```py
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
```
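Once loaded, the tokenizer behaves like any other fast tokenizer. A minimal sketch of encoding a prompt and decoding it back (the sample text is illustrative):

```py
# Encode a prompt into token IDs, then decode the IDs back to text
ids = tokenizer("The quick brown fox")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```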
## Create tiktoken tokenizer
The `tokenizer.model` file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to `tokenizer.json`, the appropriate format for [`PreTrainedTokenizerFast`].
Generate the `tokenizer.model` file with `tiktoken.get_encoding` and then convert it to `tokenizer.json` with [`convert_tiktoken_to_fast`]:
```py
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")
```
The resulting `tokenizer.json` file is saved to the specified directory and can be loaded with [`PreTrainedTokenizerFast`]:
```py
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
```
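As a quick sanity check, one might verify that the converted tokenizer agrees with the original tiktoken encoding. This is a sketch assuming the `gpt2` encoding and save directory from the example above:

```py
from tiktoken import get_encoding
from transformers import PreTrainedTokenizerFast

encoding = get_encoding("gpt2")
tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")

# A faithful conversion should yield the same token IDs for plain text;
# special tokens are disabled so the comparison is like-for-like
text = "tiktoken and transformers interoperate"
assert tokenizer.encode(text, add_special_tokens=False) == encoding.encode(text)
```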