mirror of https://github.com/huggingface/transformers.git synced 2025-07-03 12:50:06 +06:00

Translate en/model_doc to JP (#27264 )

* Add `model_docs`

* Add

* Update Model adoc

* Update docs/source/ja/model_doc/bark.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/beit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/bit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/blenderbot.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/blenderbot-small.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update reiew-1

* Update toctree.yml

* translating docs and fixes of PR #27401

* Update docs/source/ja/model_doc/bert.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update the model docs

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

2023-11-27 13:19:04 -08:00

2.9 KiB

BertJapanese

Overview

The BERT models trained on Japanese text.

Models are available with two different tokenization methods:

Tokenize with MeCab and WordPiece. This requires an extra dependency, fugashi, which is a wrapper around MeCab.
Tokenize into characters.
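The difference between the two methods can be sketched in plain Python. This is a toy illustration only, not the transformers implementation: the `wordpiece` and `char_tokenize` helpers, the hard-coded word list (standing in for MeCab's segmentation), and the tiny vocabulary are all assumptions made up for the example. The real cl-tohoku vocabulary may split words differently.

```python
# Toy sketch: word segmentation + greedy WordPiece vs. character tokenization.
# Nothing here calls MeCab or transformers; all names are hypothetical.

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def char_tokenize(text):
    """Split text into individual characters."""
    return list(text)


# Hypothetical segmentation, standing in for what a morphological analyzer returns.
words = ["吾輩", "は", "猫", "で", "ある", "。"]
# Toy vocabulary chosen so that one word is split into subword pieces.
toy_vocab = {"吾", "##輩", "は", "猫", "で", "ある", "。"}

word_tokens = [piece for word in words for piece in wordpiece(word, toy_vocab)]
char_tokens = char_tokenize("吾輩は猫である。")

print(word_tokens)
print(char_tokens)
```

With the word-based method, the sequence length depends on the segmenter and the vocabulary; with the character-based method, it always equals the number of characters.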

To use MecabTokenizer, you should pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install the dependencies.
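Before loading a MeCab-based tokenizer, it can help to confirm that the extra dependency is importable. The following is a minimal sketch using only the standard library; the helper name `missing_japanese_deps` is hypothetical, and it checks only for fugashi, the one dependency named above (the "ja" extra may install more).

```python
import importlib.util


def missing_japanese_deps(modules=("fugashi",)):
    """Return the names in `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_japanese_deps()
if missing:
    # Suggest the install command given in the text above.
    print("Missing:", ", ".join(missing), '(try: pip install transformers["ja"])')
else:
    print("MeCab tokenization dependencies found.")
```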

See the cl-tohoku repository for details.

Example of using a model with MeCab and WordPiece tokenization:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)

Example of using a model with character tokenization:

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)

This implementation is the same as BERT, except for the tokenization method. Refer to the BERT documentation for more usage examples.

This model was contributed by cl-tohoku.

BertJapaneseTokenizer

autodoc BertJapaneseTokenizer
