* add warning to let the user know that the method is slower that for a fast tokenizer
* user warnings
* fix layoutlmv2
* fix layout*
* change warnings into logger.warning
* Draft new cached_file
* Initial draft for config and model
* Small fixes
* Fix first batch of tests
* Look in cache when internet is down
* Fix last tests
* Bad black, not fixing all quality errors
* Make diff less
* Implement change for TF and Flax models
* Add tokenizer and feature extractor
* For compatibility with main
* Add utils to move the cache and auto-do it at first use.
* Quality
* Deal with empty commit shas
* Deal with empty etag
* Address review comments
* Prepare CI for v0.8.0
* pin hfh (revert before merge)
* Revert "pin hfh (revert before merge)"
This reverts commit a0103140e1.
* Test rc3
* Test latest rc
* Unpin to the RC
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
* [Json dump] Make json prettier
* correct more tokenizeirs
* more patterns
* add aggressive test
* the aggressive test was actually useful :-)
* more tests
* Apply suggestions from code review
* Fix setters of *_token_id properties of SpecialTokensMixin
* Test setters of common tokens ids
* Move to a separate test checks of setters of tokens ids
* Add independent test for ByT5
* Add Canine test
* Test speech to text
* Make Transformers use cache files when hf.co is down
* Fix tests
* Was there a random circleCI failure?
* Isolate patches
* Style
* Comment out the failure since it doesn't fail anymore
* Better comment
* add new test
* update test
* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
* add `tokenizer_file` for the fast only tokenizer
* change global variables layoutxml
* remove `"tokenizer_file"` from DPR tokenizer's Global variables
* remove `tokenizer_file` from herbert slow tokenizer init
* `"tokenizer_file"` from LED tokenizer's Global variables
* remove `tokenizer_file` from mbart slow tokenizer init
* remove `tokenizer_file` from slow tokenizer template
* adapt to versioning
* adapt the `test_tokenizer_mismatch_warning` test
* clean test
* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
* Revert "remove `tokenizer_file` from mbart slow tokenizer init"
This reverts commit 0dbb723fa9.
* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
This reverts commit 5a3f879bdd.
* Revert "remove `tokenizer_file` from herbert slow tokenizer init"
This reverts commit f5e10007b7.
* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
This reverts commit da0895330b.
* set `tokenizer_file` in super `__init__` of mbart
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
* add test
* fix kwargs
* reformat test
* format
* format
* fix typo to render the documentation
* add new test
* add a feature to same the sentencepiece tokenizer model when the init file was deleted
* update marian
* update m2m_100
* fix marian
* update speech to text
* override test for layoutxlm
* fix saving bartpho
* remove harcoded values bartpho
* special token string version
* finish bartpho
* override layoutxml test
* add mbart
* move special tokens list
* format
* Revert "format"
This reverts commit 37a40df379.
* simplify list of string of special tokens
* Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
* Fix special tokens not correctly tokenized
* Add testing
* Fix
* Fix
* Use user workflows instead of directly assigning variables
* Enable test of fast tokenizers
* Update test of canine tokenizer
* Moving slow tokenizer to the Trie world.
* Adding more docstrings to the Trie.
* Fixing doctest (incompatible wiht our format? )
* Update src/transformers/tokenization_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Adding a lot more comment into the internals of this algorithm.
* Cleaner doc.
* Fixing the namings.
* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* quality.
* Fixing longest first match.
* Small improvements to cuts + more test + canine resistant test.
* Fixing fast test.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* correct order of overflowing_tokens for slow tokenizer (issue fix#13148)
* python 3.9 requires sentencepiece version 0.1.94 or above
* slicing of ids fixed in truncated_sequence()
* Update setup.py
* Correct order of overflowing tokens for pair of sentences
* code reformatted
* Update tokenization_utils_base.py
* reformatting file
* test to check single_input added
* missing function restored
* test to check pair_input overflowing tokens order
* test to check pair_input overflowing tokens order
* test to check pair_input overflowing tokens order
* added an error message for pair of seq and longest_first strategy
* test for pair_input modified
* variable name corrected
* fixed a typo in error message
* requested changes implemented
* required test added
* Corrected the message to match test message
* added error message for Luke Tokenizer
* lost test recovered
* docstring for truncate_sequences and prepare_for_model updated
* docstring for luke tokenizer updated
* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
* aligned text and fixed puncuatations
* improved style and quality of code
* fixed error_msg in truncate_sequences
* replaced encode_plus method with regular call method
* clean up
* rephrased the docstring
* add test in trainer and test tokenizer saving wi
th trainer
* quality
* reverse trainer changes
* replace test in test_trainer by a test for all the tokenizers
* format
* add can_save_slow_tokenizer attribute to all tokenizers
* fix Herbert
* format
* Change comment in error
* add comments and a new assert
* Update src/transformers/models/albert/tokenization_albert_fast.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change ValueError barthez
* change ValueError BigBird
* change ValueError Camembert
* change ValueError Mbart50
* change ValueError Pegasus
* change ValueError ReFormer
* change ValueError T5
* change ValueError RoBERTa
* XLNET fast
* Update tests/test_tokenization_common.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change `assert` into `self.assertIn`
* format
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add test
* add change in PretrainedTokenizerBase
* change Luke
* deactivate
* add the possibility to add additional special tokens for M2M100
* format
* add special test for canine
* proposed changes for mbart
* proposed changes for mbart50
* proposed changes for byt5
* proposed changes for canine
* proposed changes for t5
* test fast and slow
* remove comment
* remove comment
* add fast version for all tests
* replace break by continue
* add more comments
* add check to avoid duplicates
* remove comment
* format
* proposed change for wave2vec2
* reverse changes mbart
* uncomment
* format
* preserve type of `additional_special_tokens` in `special_token_map`
* format
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* [WIP] Easily train a new fast tokenizer from a given one
* Fix test
* Roll out to other tokenizers and add tests
* Fix bug with unk id and add emoji to test
* Really use something different in test
* Implement special tokens map
* Map special tokens in the Transformers tokenizers
* Fix test
* Make test more robust
* Fix test for BPE
* More robust map and test
Co-authored-by SaulLu
* Test file
* Stronger tests
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
* Map unk token for Wordpiece and address review comment
* Fix lowercase test and address review comment
* Fix all tests
* Simplify test
* Fix tests for realsies
* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)
* Propose change in tests regarding lower case
* add new test for special tokens types
* put back the test part about decoding
* add feature: the AddedToken is re-build with the different mapped content
* Address review comment: simplify AddedToken building
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* Clean push to hub API
* Create working dir if it does not exist
* Different tweak
* New API + all models + test Flax
* Adds the Trainer clean up
* Update src/transformers/file_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments
* (nit) output types
* No need to set clone_from when folder exists
* Update src/transformers/trainer.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Add generated_from_trainer tag
* Update to new version
* Fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* feature for tokenizer without slow/legacy version
* format
* modify common test
* add tests
* add PreTrainedTokenizerFast to AutoTokenizer
* format
* change tokenizer common test in order to be able to run test without a slow version
* update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class`
* add autokenizer test
* replace `if self.tokenizer_class is not None` with ` if self.tokenizer_class is None`
* remove obsolete change in comment
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change `get_main_tokenizer` into `get_tokenizers`
* clarify `get_tokenizers` method
* homogenize with `test_slow_tokenizer` and `test_rust_tokenizer`
* add `test_rust_tokenizer = False` to tokenizer which don't define a fast version
* `test_rust_tokenizer = False` for BertJapaneseTokenizer
* `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add test_vocab_size for sentencepiece tok.
* add test_get_vocab for sentencepiece tok.
* add test_convert_token_and_id for sentencepiece tok.
* add test_tokenize_and_convert_tokens_to_string for all tok.
* improve test_tokenize_and_convert_tokens_to_string for sp. tok.
* add common tokenizer integration tests
- for albert
- for barthez
* add tokenizer integration tests to bert gen.
* add most tokenizer integration tests
* fix camembert tokenizer integration test
* add tokenizer integration test to marian
* add tokenizer integration test to reformer
* add typing and doc to tokenizer_integration_test_util
* fix tokenizer integration test of reformer
* improve test_sentencepiece_tokenize_and_convert_tokens_to_string
* empty commit to trigger CI
* fix tokenizer integration test of reformer
* remove code not needed anymore
* empty commit to trigger CI
* empty commit to trigger CI
* improve slow class tok usage at xlm rob
* add subword regularization for barthez
* improve barthez tok. test
* fix tokenizer tests
* add subword regularization for camembert
* add subword regularization for deberta v2 tokenizer
* add more doc to deberta v2 tokenizer
* add subword regularization for speech to text tok.
* fix sp_model_kwargs type in speech 2 text tok.
* add subword regularization for M2M100 tok.
* add more concrete type hints
* fix tests for m2m100 and s2t tok.
* add missing Any import
* fix syntax error in m2m100 tok.
* fix unpickle of m2m100 and s2t tok.
* fix test of m2m100 and s2t tok.
* improve unpickle of deberta v2 tok.
* add test for pickle of barthez & camembert
* fix pickle of barthez & camembert
* add test for deberta v2 tok. pickle
* fix m2m100 tok. pickle
* fix s2t tok. pickle
* add subword regularization to albert tok.
* refactor subword reg. test into TokenizerTesterMixin
improve albert tok. test
remove sample argument form albert tok.
check subword reg. using TokenizerTesterMixin
improve tok. tests
improve xlm roberta tok. tests
improve xlm roberta tok. tests
* add subword regularization for big bird t.
* improve xlm roberta tok. test
* add subword regularization for mbart50 tok.
* add subword regularization for pegasus tok.
* add subword regularization for reformer tok.
* add subword regularization for T5 tok.
* fix t5 tok. test formatting
* add subword regularization for xlm_proph. tok.
* add subword regularization for xlnet tok.
* add subword regularization for gert_gen tok.
* add typing to tokenizers
* add typing to xlm rob. tok
* add subword regularization for marian tok.
* add reverse tok. test
* fix marian tok test
* fix marian tok test
* fix casing in tok. tests
* fix style of tok. common test
* fix deberta v2 tok test
* add type annotations to tok. tests
* add type annotations to tok. __init__
* add typing to kokenizer
* add type annotations to tok. __init__
* don't specify the default when it's None
* fix barthez tok. doc
* move sentencepiece tok. tests to TokenizerTesterMixin
* fix unused imports
* fix albert tok. test
* add comment to sentencepiece test options
* fix Any import at big bird tok.
* fix Any import at xlm prophetnet tok.
* empty commit to trigger CI
* Initial support for upload to hub
* push -> upload
* Fixes + examples
* Fix torchhub test
* Torchhub test I hate you
* push_model_to_hub -> push_to_hub
* Apply mixin to other pretrained models
* Remove ABC inheritance
* Add tests
* Typo
* Run tests
* Install git-lfs
* Change approach
* Add push_to_hub to all
* Staging test suite
* Typo
* Maybe like this?
* More deps
* Cache
* Adapt name
* Quality
* MOAR tests
* Put it in testing_utils
* Docs + torchhub last hope
* Styling
* Wrong method
* Typos
* Update src/transformers/file_utils.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>