<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Summary of the tokenizers

[[open-in-colab]]

On this page, we will take a closer look at tokenization.

<Youtube id="VFp38yj8h3A"/>

As we saw in [the preprocessing tutorial](preprocessing), tokenizing a text is splitting it into words or subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward, so in this summary we will focus on splitting a text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples of which model uses which tokenizer type.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer type was used by the pretrained model. For instance, if we look at [`BertTokenizer`], we can see that the model uses [WordPiece](#wordpiece).

## Introduction

Splitting a text into smaller chunks is a harder task than it looks, and there are multiple ways of doing so. For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do."

<Youtube id="nhJxYji1aho"/>

A simple way of tokenizing this text is to split it by spaces, which would give:

```
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
```
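Concretely, this naive split is just Python's built-in `str.split`:

```python
text = "Don't you love 🤗 Transformers? We sure do."

# Splitting on whitespace keeps punctuation attached to the words
tokens = text.split()
print(tokens)
# ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
```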

This is a sensible first step, but if we look at the tokens "Transformers?" and "do.", we notice that the punctuation is attached to the words "Transformer" and "do", which is suboptimal. We should take the punctuation into account so that a model does not have to learn a different representation of a word and every possible punctuation symbol that could follow it, which would explode the number of representations the model has to learn. Taking punctuation into account, tokenizing our exemplary text would give:

```
["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
```

Better. However, it is disadvantageous how the tokenization dealt with the word "Don't". "Don't" stands for "do not", so it would be better tokenized as ["Do", "n't"]. This is where things start getting complicated, and part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.

[spaCy](https://spacy.io/) and [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) are two popular rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like:

```
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
```

As can be seen, space and punctuation tokenization, as well as rule-based tokenization, is used here. Space and punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained only on a single language.

So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?

<Youtube id="ssLq_EK2jLE"/>

While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations. *E.g.*, learning a meaningful context-independent representation for the letter "t" is much harder than learning a context-independent representation for the word "today". Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.

## Subword tokenization

<Youtube id="zHvTiHr506c"/>

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations. In addition, subword tokenization enables the model to process words it has never seen before, by decomposing them into known subwords. For instance, [`~transformers.BertTokenizer`] tokenizes `"I have a new GPU!"` as follows:

```py
>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("I have a new GPU!")
["i", "have", "a", "new", "gp", "##u", "!"]
```

Since we are considering the uncased model, the sentence was lowercased first. We can see that the words ["i", "have", "a", "new"] are present in the tokenizer's vocabulary, but the word "gpu" is not. Consequently, the tokenizer splits "gpu" into known subwords: ["gp", "##u"]. The "##" means that the rest of the token should be attached to the previous one, without space (for decoding or reversal of the tokenization).

As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously exemplary text as follows:

```py
>>> from transformers import XLNetTokenizer

>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
```

We will get back to the meaning of those `"▁"` when we look at [SentencePiece](#sentencepiece). As one can see, the rare word "Transformers" has been split into the more frequent subwords "Transform" and "ers".

Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization algorithms rely on some form of training, which is usually done on the corpus the corresponding model will be trained on.

<a id='byte-pair-encoding'></a>

### Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2) and [RoBERTa](model_doc/roberta). More advanced pre-tokenization includes rule-based tokenization, e.g. [XLM](model_doc/xlm) and [FlauBERT](model_doc/flaubert), which use Moses for most languages, or [GPT](model_doc/gpt), which uses Spacy and ftfy, to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.

As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been determined:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

Consequently, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:

```
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the example above, "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u" followed by "g", occurring 10 + 5 + 5 = 20 times in total (in "hug", "pug", and "hugs"). Thus, the first merge rule the tokenizer learns is to group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes:

```
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
```

BPE then identifies the next most common symbol pair. It's "u" followed by "n", which occurs 12 + 4 = 16 times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed by "ug", occurring 10 + 5 = 15 times. Again the pair is merged and "hug" can be added to the vocabulary.

At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words is represented as:

```
("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
```

Assuming that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized to ["b", "ug"], but "mug" would be tokenized as ["<unk>", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not replaced by the "<unk>" symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.
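The merge procedure walked through above can be sketched in a few lines of Python. This is a toy illustration only — the name `train_bpe` and the data layout are made up for this sketch, and the real 🤗 Tokenizers implementation is far more efficient:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a dict of {word: frequency}."""
    # Represent each word as a tuple of its current symbols
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the new merge rule to every word
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges, words

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, words = train_bpe(corpus, 3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Running it on the toy corpus reproduces the three merges derived above: `"ug"`, then `"un"`, then `"hug"`.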

As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter to choose. For instance, [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.

#### Byte-level BPE

A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are considered as base characters. To have a better base vocabulary, [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) uses bytes as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT2 tokenizer can tokenize every text without the need for the <unk> symbol. [GPT-2](model_doc/gpt) has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token, and the symbols learned with 50,000 merges.
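To see why a 256-symbol byte vocabulary is guaranteed to cover any text, note that every Unicode character decomposes into UTF-8 bytes in the range 0–255:

```python
# The emoji is far outside any reasonable character vocabulary, but its
# UTF-8 encoding is just four bytes, each representable by a base token.
print(list("🤗".encode("utf-8")))  # [240, 159, 164, 151]
```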
### WordPiece

WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to BPE. WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability divided by the probabilities of its first symbol followed by its second symbol is the greatest among all symbol pairs. *E.g.*, "u" followed by "g" would only have been merged if the probability of "ug" divided by the probabilities of "u" and "g" had been greater than for any other symbol pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to make sure it's worth it.
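As a sketch of that scoring rule (not the actual BERT training code; the counts are taken from the toy BPE corpus earlier, and the function name is made up here):

```python
def wordpiece_score(pair_count, count_a, count_b):
    # freq(ab) / (freq(a) * freq(b)). Using raw counts instead of
    # probabilities rescales every score by the same corpus-size factor,
    # so the ranking of candidate merges is unchanged.
    return pair_count / (count_a * count_b)

# "ug" occurs 20 times, "u" 36 times and "g" 20 times in the toy corpus
print(wordpiece_score(20, 36, 20))  # ≈ 0.0278
```

The pair with the highest score is merged at each step, which is what "maximizes the likelihood of the training data" amounts to for a single merge.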
### Unigram

Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf). In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and the most common substrings. Unigram is not used directly for any of the models in transformers, but it's used in conjunction with [SentencePiece](#sentencepiece).

At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is lowest, *i.e.* those symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.

Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:

```
["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],
```

`"hugs"` could be tokenized both as `["hug", "s"]`, `["h", "ug", "s"]`, or `["h", "u", "g", "s"]`. So which one to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary, so that the probability of each possible tokenization can be computed after training. The algorithm simply picks the most likely tokenization in practice, but also offers the possibility to sample a possible tokenization according to their probabilities.
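A brute-force sketch of that choice (the token probabilities below are hypothetical numbers invented for illustration; real implementations use the Viterbi algorithm rather than full enumeration):

```python
import math

def segmentations(word, vocab):
    # Recursively enumerate every way of splitting `word` into vocab tokens
    if not word:
        yield []
    for i in range(1, len(word) + 1):
        if word[:i] in vocab:
            for rest in segmentations(word[i:], vocab):
                yield [word[:i]] + rest

# Hypothetical unigram probabilities for the vocabulary above
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05, "ug": 0.1, "hug": 0.15}

# Pick the segmentation whose token probabilities multiply to the largest value
best = max(segmentations("hugs", probs),
           key=lambda toks: math.prod(probs[t] for t in toks))
print(best)  # ['hug', 's']
```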

Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is defined as \\(S(x_{i})\\), then the overall loss is defined as

$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$

<a id='sentencepiece'></a>

### SentencePiece

All tokenization algorithms described so far have the same problem: it is assumed that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language-specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer. To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.

The [`XLNetTokenizer`] uses SentencePiece, for example, which is also why the `"▁"` character was included in the vocabulary in the earlier example. Decoding with SentencePiece is very easy since all tokens can simply be concatenated and `"▁"` is replaced by a space.
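For instance, decoding the XLNet tokens from earlier is literally concatenation plus replacing `"▁"` with a space:

```python
tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?"]

# Concatenate, turn each "▁" back into a space, and drop the leading space
decoded = "".join(tokens).replace("▁", " ").strip()
print(decoded)  # Don't you love 🤗 Transformers?
```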

All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are [ALBERT](model_doc/albert), [XLNet](model_doc/xlnet), [Marian](model_doc/marian), and [T5](model_doc/t5).