Minor typo fixes to the tokenizer summary (#8045)

Samuel 2020-10-26 12:08:33 +00:00 committed by GitHub
parent 829b9f8cc3
commit 9aa2826687


@@ -81,7 +81,7 @@ this:
['i', 'have', 'a', 'new', 'gp', '##u', '!']
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
-vocabulary of the tokenizer, except for "gpu", so the tokenizer split it in subwords it knows: "gp" and "##u". The "##"
+vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).
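
For reference, a minimal way to reproduce the example above, assuming the transformers library is installed and the
pretrained vocabulary can be downloaded:

.. code-block:: python

    from transformers import BertTokenizer

    # The uncased BERT tokenizer the text refers to.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    tokens = tokenizer.tokenize("I have a new GPU!")
    print(tokens)
    # ['i', 'have', 'a', 'new', 'gp', '##u', '!']

    # Decoding reattaches "##u" to "gp" without a space, reversing the tokenization.
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))
    # i have a new gpu!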
@@ -112,7 +112,7 @@ splitting the training data into words, which can be a simple space tokenization
:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
-It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
+It then begins from the list of all characters and will learn merge rules to form a new token from two symbols in the
vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
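
As a rough sketch of the merge-learning loop just described (the word frequencies below are purely illustrative and
are not the ones the document goes on to list):

.. code-block:: python

    from collections import Counter

    # Illustrative word frequencies from a hypothetical pre-tokenized corpus.
    word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

    # Start from the list of all characters: each word is split into its symbols.
    splits = {word: list(word) for word in word_freqs}
    vocab = {ch for word in word_freqs for ch in word}
    merges = []

    def most_frequent_pair():
        """Count every adjacent pair of symbols, weighted by word frequency."""
        pairs = Counter()
        for word, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += word_freqs[word]
        return pairs.most_common(1)[0][0] if pairs else None

    desired_size = 15  # the vocabulary size hyperparameter mentioned in the text
    while len(vocab) < desired_size:
        pair = most_frequent_pair()
        if pair is None:
            break
        # Learn a new merge rule and add the merged token to the vocabulary.
        merges.append(pair)
        vocab.add(pair[0] + pair[1])
        # Apply the new rule to every word.
        for symbols in splits.values():
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == pair:
                    symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
                else:
                    i += 1

    print(merges)  # the merge rules, in the order they were learned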
@@ -197,8 +197,8 @@ progressively. It's not used directly for any of the pretrained models in the li
with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
-for each subword, evaluate how much the loss would augment if the subword was removed from the vocabulary. It then
-sorts the subwords by this quantity (that represents how worse the loss becomes if the token is removed) and removes
+for each subword, evaluate how much the loss would increase if the subword was removed from the vocabulary. It then
+sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and removes
all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
BPE or WordPiece).
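
A sketch of that pruning step, assuming a hypothetical ``corpus_loss`` helper that returns the unigram loss of the
corpus for a given vocabulary (the real algorithm also re-fits the token probabilities at each step, which is omitted
here):

.. code-block:: python

    def prune_vocabulary(vocab, base_chars, corpus_loss, p=0.1):
        """Remove the p fraction of subwords whose removal increases the loss the least.

        `corpus_loss` is a hypothetical callable mapping a vocabulary to the loss on the corpus.
        Base characters are always kept, so any word can still be tokenized.
        """
        current_loss = corpus_loss(vocab)

        # For each removable subword, measure how much worse the loss becomes
        # when it is dropped from the vocabulary.
        loss_increase = {}
        for token in vocab:
            if token in base_chars:
                continue
            reduced = [t for t in vocab if t != token]
            loss_increase[token] = corpus_loss(reduced) - current_loss

        # Sort by that quantity and drop the p tokens whose removal hurts the loss the least.
        n_remove = int(len(loss_increase) * p)
        to_remove = set(sorted(loss_increase, key=loss_increase.get)[:n_remove])
        return [t for t in vocab if t not in to_remove]

Repeating this on the returned vocabulary gives the loop described above, stopping once the desired size is reached.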