..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Summary of the tokenizers
-----------------------------------------------------------------------------------------------------------------------

On this page, we will have a closer look at tokenization.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

As we saw in :doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or
subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is
straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text).
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding
(BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`, and :ref:`SentencePiece <sentencepiece>`, and show examples
of which tokenizer type is used by which model.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
that the model uses :ref:`WordPiece <wordpiece>`.

Introduction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."``

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/nhJxYji1aho" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

A simple way of tokenizing this text is to split it by spaces, which would give:

.. code-block::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

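For reference, this naive whitespace split is nothing more than Python's built-in :obj:`str.split`, shown here only to
make the rule explicit:

.. code-block::

    >>> "Don't you love 🤗 Transformers? We sure do.".split(" ")
    ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
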
This is a sensible first step, but if we look at the tokens ``"Transformers?"`` and ``"do."``, we notice that the
punctuation is attached to the words ``"Transformer"`` and ``"do"``, which is suboptimal. We should take the
punctuation into account so that a model does not have to learn a different representation of a word and every possible
punctuation symbol that could follow it, which would explode the number of representations the model has to learn.
Taking punctuation into account, tokenizing our exemplary text would give:

.. code-block::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

Better. However, it is disadvantageous how the tokenization dealt with the word ``"Don't"``. ``"Don't"`` stands for
``"do not"``, so it would be better tokenized as ``["Do", "n't"]``. This is where things start getting complicated, and
part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a
different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an
input that was tokenized with the same rules that were used to tokenize its training data.

`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like:

.. code-block::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

As can be seen, space and punctuation tokenization, as well as rule-based tokenization, are used here. Space and
punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, :doc:`Transformer XL
<model_doc/transformerxl>` uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
greater than 50,000, especially if they are pretrained only on a single language.

So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/ssLq_EK2jLE" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much
harder for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent
representation for the letter ``"t"`` is much harder than learning a context-independent representation for the word
``"today"``. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of
both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword**
tokenization.

Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/zHvTiHr506c" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
subwords, but rare words should be decomposed into meaningful subwords. For instance, ``"annoyingly"`` might be
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
stand-alone subwords would appear more frequently while at the same time the meaning of ``"annoyingly"`` is kept by the
composite meaning of ``"annoying"`` and ``"ly"``. This is especially useful in agglutinative languages such as Turkish,
where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful
context-independent representations. In addition, subword tokenization enables the model to process words it has never
seen before, by decomposing them into known subwords. For instance, the :class:`~transformers.BertTokenizer` tokenizes
``"I have a new GPU!"`` as follows:

.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenizer.tokenize("I have a new GPU!")
    ["i", "have", "a", "new", "gp", "##u", "!"]

Because we are considering the uncased model, the sentence was lowercased first. We can see that the words ``["i",
"have", "a", "new"]`` are present in the tokenizer's vocabulary, but the word ``"gpu"`` is not. Consequently, the
tokenizer splits ``"gpu"`` into known subwords: ``["gp", "##u"]``. The ``"##"`` means that the rest of the token should
be attached to the previous one, without space (for decoding or reversal of the tokenization).

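As a rough illustration (a minimal sketch, not the tokenizer's actual decoding code; the function name is made up for
this example), the ``"##"`` markers can be undone by gluing each continuation token onto the token before it:

.. code-block::

    >>> def merge_wordpiece_tokens(tokens):
    ...     words = []
    ...     for token in tokens:
    ...         if token.startswith("##") and words:
    ...             words[-1] += token[2:]  # attach the continuation to the previous token
    ...         else:
    ...             words.append(token)
    ...     return " ".join(words)
    >>> merge_wordpiece_tokens(["i", "have", "a", "new", "gp", "##u", "!"])
    'i have a new gpu !'

In practice, the tokenizer's :obj:`convert_tokens_to_string` method performs this reversal for you.
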
As another example, :class:`~transformers.XLNetTokenizer` tokenizes our exemplary text from earlier as follows:

.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]

We'll get back to the meaning of those ``"▁"`` characters when we look at :ref:`SentencePiece <sentencepiece>`. As one
can see, the rare word ``"Transformers"`` has been split into the more frequent subwords ``"Transform"`` and ``"ers"``.

Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization
algorithms rely on some form of training, which is usually done on the corpus the corresponding model will be trained
on.

.. _byte-pair-encoding:

Byte-Pair Encoding (BPE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding (BPE) was introduced in `Neural Machine Translation of Rare Words with Subword Units (Sennrich et
al., 2015) <https://arxiv.org/abs/1508.07909>`__. BPE relies on a pre-tokenizer that splits the training data into
words. Pre-tokenization can be as simple as space tokenization, e.g. :doc:`GPT-2 <model_doc/gpt2>`, :doc:`Roberta
<model_doc/roberta>`. More advanced pre-tokenization includes rule-based tokenization, e.g. :doc:`XLM <model_doc/xlm>`
and :doc:`FlauBERT <model_doc/flaubert>`, which use Moses for most languages, or :doc:`GPT <model_doc/gpt>`, which uses
spaCy and ftfy to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
define before training the tokenizer.

As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been
determined:

.. code-block::

    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Consequently, the base vocabulary is ``["b", "g", "h", "n", "p", "s", "u"]``. Splitting all words into symbols of the
base vocabulary, we obtain:

.. code-block::

    ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
the example above, ``"h"`` followed by ``"u"`` is present `10 + 5 = 15` times (10 times in the 10 occurrences of
``"hug"``, 5 times in the 5 occurrences of ``"hugs"``). However, the most frequent symbol pair is ``"u"`` followed by
``"g"``, occurring `10 + 5 + 5 = 20` times in total. Thus, the first merge rule the tokenizer learns is to group all
``"u"`` symbols followed by a ``"g"`` symbol together. Next, ``"ug"`` is added to the vocabulary. The set of words then
becomes

.. code-block::

    ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

BPE then identifies the next most common symbol pair. It's ``"u"`` followed by ``"n"``, which occurs 16 times. ``"u"``,
``"n"`` is merged to ``"un"`` and added to the vocabulary. The next most frequent symbol pair is ``"h"`` followed by
``"ug"``, occurring 15 times. Again the pair is merged and ``"hug"`` can be added to the vocabulary.

At this stage, the vocabulary is ``["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`` and our set of unique words
is represented as

.. code-block::

    ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)

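To make the procedure concrete, here is a small, self-contained sketch of the training loop just described. It is a
didactic re-implementation with made-up names (``train_bpe`` and its arguments), not the 🤗 Tokenizers code:

.. code-block::

    from collections import Counter

    def train_bpe(word_freqs, num_merges):
        # Each word starts as a tuple of base-vocabulary symbols.
        corpus = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            # Count every adjacent symbol pair, weighted by the word frequencies.
            pair_freqs = Counter()
            for symbols, freq in corpus.items():
                for pair in zip(symbols, symbols[1:]):
                    pair_freqs[pair] += freq
            if not pair_freqs:
                break
            best = max(pair_freqs, key=pair_freqs.get)
            merges.append(best)
            # Apply the new merge rule to every word in the corpus.
            new_corpus = {}
            for symbols, freq in corpus.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_corpus[tuple(merged)] = freq
            corpus = new_corpus
        return merges, corpus

    merges, corpus = train_bpe({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, num_merges=3)
    print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]

Running it on the toy corpus reproduces exactly the three merges derived by hand above.
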
Assuming that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied
to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance,
the word ``"bug"`` would be tokenized to ``["b", "ug"]`` but ``"mug"`` would be tokenized as ``["<unk>", "ug"]`` since
the symbol ``"m"`` is not in the base vocabulary. In general, single letters such as ``"m"`` are not replaced by the
``"<unk>"`` symbol because the training data usually includes at least one occurrence of each letter, but it is likely
to happen for very special characters like emojis.

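Continuing the sketch above (again illustrative, with a made-up ``bpe_tokenize`` helper rather than the library's
code), applying the learned merges in order reproduces this behaviour, with unknown base symbols mapped to
``"<unk>"``:

.. code-block::

    def bpe_tokenize(word, merges, base_vocab):
        # Start from base symbols, replacing anything outside the base vocabulary.
        symbols = [c if c in base_vocab else "<unk>" for c in word]
        # Apply each merge rule in the order it was learned.
        for left, right in merges:
            i, merged = 0, []
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (left, right):
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return symbols

    base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
    merges = [("u", "g"), ("u", "n"), ("h", "ug")]
    print(bpe_tokenize("bug", merges, base_vocab))  # ['b', 'ug']
    print(bpe_tokenize("mug", merges, base_vocab))  # ['<unk>', 'ug']
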
As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
to choose. For instance, :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop training after 40,000 merges.

Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are
considered as base characters. To have a better base vocabulary, `GPT-2
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ uses bytes
as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that
every base character is included in the vocabulary. With some additional rules to deal with punctuation, GPT-2's
tokenizer can tokenize every text without the need for the ``<unk>`` symbol. :doc:`GPT-2 <model_doc/gpt2>` has a
vocabulary size of 50,257, which corresponds to the 256 base byte tokens, a special end-of-text token, and the symbols
learned with 50,000 merges.

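As a quick sketch of what this looks like in practice, loading the pretrained GPT-2 tokenizer shows both the
vocabulary size and the fact that byte-level tokens round-trip arbitrary input text, including the emoji, back to the
original string:

.. code-block::

    >>> from transformers import GPT2Tokenizer
    >>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    >>> len(tokenizer)  # 256 byte tokens + 50,000 merges + 1 end-of-text token
    50257
    >>> tokens = tokenizer.tokenize("Don't you love 🤗 Transformers?")
    >>> tokenizer.convert_tokens_to_string(tokens)
    "Don't you love 🤗 Transformers?"
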
.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
Voice Search (Schuster et al., 2012)
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
equivalent to finding the symbol pair whose probability, divided by the probabilities of its first symbol followed by
its second symbol, is the greatest among all symbol pairs. *E.g.* ``"u"``, followed by ``"g"``, would only have been
merged if the probability of ``"ug"`` divided by the probabilities of ``"u"`` and ``"g"`` had been greater than for
any other symbol pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it `loses` by
merging two symbols to ensure it's `worth it`.

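Written as a formula, this criterion amounts to picking, at each step, the symbol pair :math:`(a, b)` with the highest
score

.. math::

    \text{score}(a, b) = \frac{p(ab)}{p(a) \, p(b)}

which is exactly the ratio described above: a pair is only attractive if the merged symbol occurs more often than the
frequencies of its parts alone would predict.
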
.. _unigram:

Unigram
=======================================================================================================================

Unigram is a subword tokenization algorithm introduced in `Subword Regularization: Improving Neural Network Translation
Models with Multiple Subword Candidates (Kudo, 2018) <https://arxiv.org/pdf/1804.10959.pdf>`__. In contrast to BPE or
WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down this
vocabulary to obtain a smaller one. The base vocabulary could for instance correspond to all pre-tokenized words and
the most common substrings. Unigram is not used directly for any of the models in 🤗 Transformers, but it's used in
conjunction with :ref:`SentencePiece <sentencepiece>`.

At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm
computes how much the overall loss would increase if the symbol were to be removed from the vocabulary. Unigram then
removes the p percent of symbols whose loss increase is the lowest (with p usually being 10% or 20%), *i.e.* those
symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has
reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.

Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:

.. code-block::

    ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]

``"hugs"`` could be tokenized as ``["hug", "s"]``, ``["h", "ug", "s"]``, or ``["h", "u", "g", "s"]``. So which one to
choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that the
probability of each possible tokenization can be computed after training. The algorithm simply picks the most likely
tokenization in practice, but also offers the possibility to sample a possible tokenization according to its
probability.

Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
the words :math:`x_{1}, \dots, x_{N}` and that the set of all possible tokenizations for a word :math:`x_{i}` is
defined as :math:`S(x_{i})`, then the overall loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )

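As a small sketch of how such a unigram model scores the competing tokenizations of ``"hugs"``, consider the following
snippet; the token probabilities below are made up for illustration and are not taken from a real trained tokenizer:

.. code-block::

    import math

    token_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05, "ug": 0.1, "hug": 0.15}

    def all_tokenizations(word):
        # Enumerate every way of splitting `word` into tokens from the vocabulary.
        if not word:
            return [[]]
        results = []
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            if prefix in token_probs:
                results += [[prefix] + rest for rest in all_tokenizations(word[end:])]
        return results

    def log_prob(tokens):
        # A unigram model treats tokens as independent, so log-probabilities add up.
        return sum(math.log(token_probs[t]) for t in tokens)

    candidates = all_tokenizations("hugs")
    print(candidates)                      # [['h', 'u', 'g', 's'], ['h', 'ug', 's'], ['hug', 's']]
    print(max(candidates, key=log_prob))   # ['hug', 's'] -- the most likely tokenization under this toy model
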
.. _sentencepiece:

SentencePiece
=======================================================================================================================

All tokenization algorithms described so far have the same problem: it is assumed that the input text uses spaces to
separate words. However, not all languages use spaces to separate words. One possible solution is to use
language-specific pre-tokenizers (*e.g.* :doc:`XLM <model_doc/xlm>` uses a specific Chinese, Japanese, and Thai
pre-tokenizer). To solve this problem more generally, `SentencePiece: A simple and language independent subword
tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) <https://arxiv.org/pdf/1808.06226.pdf>`__
treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE
or unigram algorithm to construct the appropriate vocabulary.

The :class:`~transformers.XLNetTokenizer`, for example, uses SentencePiece, which is also why the ``"▁"`` character was
included in the vocabulary in the earlier example. Decoding with SentencePiece is very easy since all tokens can just
be concatenated and ``"▁"`` is replaced by a space.

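A minimal sketch of that decoding step, using part of the XLNet tokenization from earlier (plain Python, not the
SentencePiece library itself):

.. code-block::

    >>> tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?"]
    >>> "".join(tokens).replace("▁", " ").strip()
    "Don't you love 🤗 Transformers?"
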
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>`, :doc:`Marian
<model_doc/marian>`, and :doc:`T5 <model_doc/t5>`.