Commit Graph

5 Commits

Author SHA1 Message Date
Suraj Patil
2e60276b38
[M2M100Tokenizer] fix _build_translation_inputs (#14382)
* add return_tensors paramter

* fix test

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-11-13 20:57:12 +05:30
Philip May
fcad801825
Add regression tests for slow sentencepiece tokenizers. (#11737)
* add test_vocab_size for sentencepiece tok.

* add test_get_vocab for sentencepiece tok.

* add test_convert_token_and_id for sentencepiece tok.

* add test_tokenize_and_convert_tokens_to_string for all tok.

* improve test_tokenize_and_convert_tokens_to_string for sp. tok.

* add common tokenizer integration tests
- for albert
- for barthez

* add tokenizer integration tests to bert gen.

* add most tokenizer integration tests

* fix camembert tokenizer integration test

* add tokenizer integration test to marian

* add tokenizer integration test to reformer

* add typing and doc to tokenizer_integration_test_util

* fix tokenizer integration test of reformer

* improve test_sentencepiece_tokenize_and_convert_tokens_to_string

* empty commit to trigger CI

* fix tokenizer integration test of reformer

* remove code not needed anymore

* empty commit to trigger CI

* empty commit to trigger CI
2021-06-01 09:24:39 -04:00
Philip May
37ed3ab719
Enable option for subword regularization in more tokenizers. (#11417)
* improve slow class tok usage at xlm rob

* add subword regularization for barthez

* improve barthez tok. test

* fix tokenizer tests

* add subword regularization for camembert

* add subword regularization for deberta v2 tokenizer

* add more doc to deberta v2 tokenizer

* add subword regularization for speech to text tok.

* fix sp_model_kwargs type in speech 2 text tok.

* add subword regularization for M2M100 tok.

* add more concrete type hints

* fix tests for m2m100 and s2t tok.

* add missing Any import

* fix syntax error in m2m100 tok.

* fix unpickle of m2m100 and s2t tok.

* fix test of m2m100 and s2t tok.

* improve unpickle of deberta v2 tok.

* add test for pickle of barthez & camembert

* fix pickle of barthez & camembert

* add test for deberta v2 tok. pickle

* fix m2m100 tok. pickle

* fix s2t tok. pickle

* add subword regularization to albert tok.

* refactor subword reg. test into TokenizerTesterMixin

improve albert tok. test

remove sample argument form albert tok.

check subword reg. using TokenizerTesterMixin

improve tok. tests

improve xlm roberta tok. tests

improve xlm roberta tok. tests

* add subword regularization for big bird t.

* improve xlm roberta tok. test

* add subword regularization for mbart50 tok.

* add subword regularization for pegasus tok.

* add subword regularization for reformer tok.

* add subword regularization for T5 tok.

* fix t5 tok. test formatting

* add subword regularization for xlm_proph. tok.

* add subword regularization for xlnet tok.

* add subword regularization for gert_gen tok.

* add typing to tokenizers

* add typing to xlm rob. tok

* add subword regularization for marian tok.

* add reverse tok. test

* fix marian tok test

* fix marian tok test

* fix casing in tok. tests

* fix style of tok. common test

* fix deberta v2 tok test

* add type annotations to tok. tests

* add type annotations to tok. __init__

* add typing to kokenizer

* add type annotations to tok. __init__

* don't specify the default when it's None

* fix barthez tok. doc

* move sentencepiece tok. tests to TokenizerTesterMixin

* fix unused imports

* fix albert tok. test

* add comment to sentencepiece test options

* fix Any import at big bird tok.

* fix Any import at xlm prophetnet tok.

* empty commit to trigger CI
2021-05-13 02:44:55 -04:00
Nicolas Patry
92970c0cb9
Enabling multilingual models for translation pipelines. (#10536)
* [WIP] Enabling multilingual models for translation pipelines.

* decoder_input_ids -> forced_bos_token_id

* Improve docstring.

* Rebase

* Fixing 2 bugs

- Type token_ids coming from `_parse_and_tokenize`
- Wrong index from tgt_lang.

* Fixing black version.

* Adding tests for _build_translation_inputs and add them for all
tokenizers.

* Mbart actually puts the lang code at the end.

* Fixing m2m100.

* Adding TF support to `deep_round`.

* Update src/transformers/pipelines/text2text_generation.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding one line comment.

* Fixing M2M100 `_build_translation_input_ids`, and fix the call site.

* Fixing tests + deep_round -> nested_simplify

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-16 11:31:35 +02:00
Suraj Patil
f6e74a63ca
Add m2m100 (#10236)
* m2m_100

* no layernorm_embedding

* sinusoidal positional embeddings

* update pos embeddings

* add default config values

* tokenizer

* add conversion script

* fix config

* fix pos embed

* remove _float_tensor

* update tokenizer

* update lang codes

* handle lang codes

* fix pos embeds

* fix spm key

* put embedding weights on device

* remove qa and seq classification heads

* fix convert script

* lang codes pn one line

* fix embeds

* fix tokenizer

* fix tokenizer

* add fast tokenizer

* style

* M2M100MT => M2M100

* fix copyright, style

* tokenizer converter

* vocab file

* remove fast tokenizer

* fix embeds

* fix tokenizer

* fix tests

* add tokenizer tests

* add integration test

* quality

* fix model name

* fix test

* doc

* doc

* fix doc

* add copied from statements

* fix tokenizer tests

* apply review suggestions

* fix urls

* fix shift_tokens_right

* apply review suggestions

* fix

* fix doc

* add lang code to id

* remove unused function

* update checkpoint names

* fix copy

* fix tokenizer

* fix checkpoint names

* fix merge issue

* style
2021-03-06 22:14:16 +05:30