Tokenizer
-----------------------------------------------------------------------------------------------------------------------

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most
of the tokenizers are available in two flavors: a full Python implementation and a "Fast" implementation based on the
Rust library `tokenizers <https://github.com/huggingface/tokenizers>`__. The "Fast" implementations allow:

1. a significant speed-up, in particular when doing batched tokenization, and
2. additional methods to map between the original string (characters and words) and the token space (e.g., getting the
   index of the token containing a given character or the span of characters corresponding to a given token).

Currently no "Fast" implementation is available for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT,
XLM-RoBERTa and XLNet models).

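As a rough illustration of the second point, the sketch below records character offsets while tokenizing and uses them
to map in both directions. It is plain Python rather than the Rust-backed implementation, it splits on whitespace
instead of into sub-words, and ``toy_tokenize``, ``char_to_token`` and ``token_to_chars`` are hypothetical helpers
written for this example only.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def toy_tokenize(text: str) -> Tuple[List[str], List[Span]]:
    """Split on whitespace and remember where each token came from."""
    tokens, spans = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        spans.append(match.span())
    return tokens, spans


def char_to_token(spans: List[Span], char_index: int) -> Optional[int]:
    """Index of the token containing the character, or None (e.g. for a space)."""
    for token_index, (start, end) in enumerate(spans):
        if start <= char_index < end:
            return token_index
    return None


def token_to_chars(spans: List[Span], token_index: int) -> Span:
    """Character span covered by the token."""
    return spans[token_index]


text = "Hello tokenizer world"
tokens, spans = toy_tokenize(text)
```

Here ``char_to_token(spans, 8)`` gives ``1`` because character 8 falls inside ``"tokenizer"``, and
``token_to_chars(spans, 2)`` gives ``(16, 21)``, the span of ``"world"``.
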
The base classes :class:`~transformers.PreTrainedTokenizer` and :class:`~transformers.PreTrainedTokenizerFast`
implement the common methods for encoding string inputs into model inputs (see below) and for instantiating/saving
Python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the
library (downloaded from HuggingFace's AWS S3 repository). They both rely on
:class:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase`, which contains the common methods, and
:class:`~transformers.tokenization_utils_base.SpecialTokensMixin`.

:class:`~transformers.PreTrainedTokenizer` and :class:`~transformers.PreTrainedTokenizerFast` thus implement the main
methods for using all the tokenizers:

- Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and
  encoding/decoding (i.e., tokenizing and converting to integers).
- Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...).
- Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the
  tokenizer for easy access and making sure they are not split during tokenization.

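The three responsibilities listed above can be sketched with a toy, word-level tokenizer. This is an illustration
under simplifying assumptions, not the library's code: ``ToyTokenizer`` is a hypothetical class, it splits on words and
punctuation rather than sub-words, and its method names are only chosen to echo the operations described in the list.

```python
import re
from typing import Dict, List


class ToyTokenizer:
    """Toy word-level tokenizer; real tokenizers split into sub-words."""

    def __init__(self, vocab: List[str], unk_token: str = "[UNK]", mask_token: str = "[MASK]"):
        # Special tokens are stored as attributes for easy access.
        self.unk_token = unk_token
        self.mask_token = mask_token
        self.special_tokens = [unk_token, mask_token]
        self.token_to_id: Dict[str, int] = {}
        self.id_to_token: Dict[int, str] = {}
        self.add_tokens(self.special_tokens + vocab)

    def add_tokens(self, new_tokens: List[str]) -> int:
        """Append unseen tokens to the vocabulary and return how many were added."""
        added = 0
        for token in new_tokens:
            if token not in self.token_to_id:
                index = len(self.token_to_id)
                self.token_to_id[token] = index
                self.id_to_token[index] = token
                added += 1
        return added

    def tokenize(self, text: str) -> List[str]:
        # Match special tokens first so they are never split, then words, then punctuation.
        specials = "|".join(re.escape(token) for token in self.special_tokens)
        return re.findall(rf"{specials}|\w+|[^\w\s]", text)

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(token, unk_id) for token in tokens]

    def convert_ids_to_tokens(self, ids: List[int]) -> List[str]:
        return [self.id_to_token[index] for index in ids]

    def encode(self, text: str) -> List[int]:
        # Encoding is tokenizing followed by conversion to integers.
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.convert_ids_to_tokens(ids))


tokenizer = ToyTokenizer(vocab=["the", "cat", "sat"])
ids_before = tokenizer.encode("the cat [MASK] purrs")  # "purrs" is unknown here
num_added = tokenizer.add_tokens(["purrs"])
ids_after = tokenizer.encode("the cat [MASK] purrs")  # "purrs" now has its own id
```

In this sketch ``[MASK]`` survives tokenization as a single token, and ``purrs`` maps to the unknown id until it is
added to the vocabulary.
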
:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure Python
tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by
these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by the
HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class additionally provides
several advanced alignment methods which can be used to map between the original string (characters and words) and the
token space (e.g., getting the index of the token containing a given character or the span of characters corresponding
to a given token).

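The idea of a dictionary subclass that gains alignment methods only when offset information is available can be
sketched as follows. ``ToyBatchEncoding`` is a hypothetical stand-in written for this example; it handles a single
sequence and makes no claim to match the real class's constructor or error messages.

```python
from typing import Dict, List, Optional, Tuple


class ToyBatchEncoding(dict):
    """Dictionary of model inputs, optionally carrying character offsets."""

    def __init__(self, data: Dict[str, List[int]], offsets: Optional[List[Tuple[int, int]]] = None):
        super().__init__(data)
        # Offsets are only provided by the (toy) "Fast" code path.
        self._offsets = offsets

    @property
    def is_fast(self) -> bool:
        return self._offsets is not None

    def char_to_token(self, char_index: int) -> Optional[int]:
        """Index of the token containing the character, or None if there is none."""
        if not self.is_fast:
            raise ValueError("alignment methods need offsets from a 'Fast' tokenizer")
        for token_index, (start, end) in enumerate(self._offsets):
            if start <= char_index < end:
                return token_index
        return None


inputs = {"input_ids": [2, 3], "attention_mask": [1, 1]}
slow_encoding = ToyBatchEncoding(inputs)
fast_encoding = ToyBatchEncoding(inputs, offsets=[(0, 3), (4, 7)])
```

Both objects support ordinary dictionary access such as ``slow_encoding["input_ids"]``, but only ``fast_encoding`` can
answer ``char_to_token``; the other raises a ``ValueError``.
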
PreTrainedTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PreTrainedTokenizer
    :special-members: __call__
    :members:


PreTrainedTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PreTrainedTokenizerFast
    :special-members: __call__
    :members:


BatchEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BatchEncoding
    :members: