* Result of black 23.1
* Update target to Python 3.7
* Switch flake8 to ruff
* Configure isort
* Configure isort
* Apply isort with line limit
* Put the right black version
* adapt black in check copies
* Fix copies
* Add test for SentencePiece not adding special tokens to strings
* Add SentencePieceStringConversionMixin to fix issue 15003
* Fix conversion from tokens to string for most SentencePiece tokenizers
Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer
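A minimal sketch of the shared conversion pattern these fixes converge on (illustrative only; the real implementations live in each tokenizer class): special tokens are kept verbatim while the surrounding pieces are decoded by the SentencePiece model.
```python
# Sketch, assuming a slow SentencePiece tokenizer with `self.sp_model` and
# `self.all_special_tokens`; exact wording differs per tokenizer.
def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (strings) into a single string."""
    current_sub_tokens = []
    out_string = ""
    prev_is_special = False
    for token in tokens:
        # Special tokens must not be decoded by sentencepiece: keep them verbatim.
        if token in self.all_special_tokens:
            if not prev_is_special:
                out_string += " "
            out_string += self.sp_model.decode(current_sub_tokens) + token
            prev_is_special = True
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
            prev_is_special = False
    out_string += self.sp_model.decode(current_sub_tokens)
    return out_string.strip()
```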
* Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
* Fix DebertaV2Tokenizer
* Ignore LayoutXLMTokenizer in SentencePiece string conversion test
* Run 'make style' and 'make quality'
* Clean convert_tokens_to_string test
Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.
* Remove commented out code
* Improve robustness of convert_tokens_to_string test
Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.
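Roughly, the strengthened check looks like this (a sketch of the idea, not the exact test code):
```python
# Converting all special tokens back to a string must yield a string that
# still contains every special token.
special_tokens = tokenizer.all_special_tokens
reconstructed = tokenizer.convert_tokens_to_string(special_tokens)
for special_token in special_tokens:
    assert special_token in reconstructed
```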
* Inline and remove SentencePieceStringConversionMixin
The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.
* Run 'make style' and 'make quality'
* Revert removal of space in convert_tokens_to_string
* Remove redundant import
* Revert test text to original
* Uncomment the lowercasing of the reverse_text variable
* Mimic Rust tokenizer behavior for tokenizers
- Albert
- Barthez
- Camembert
- MBart50
- T5
* Fix accidentally skipping test in wrong tokenizer
* Add test for equivalent Rust and slow tokenizer behavior
* Override _decode in BigBirdTokenizer to mimic Rust behavior
* Override _decode in FNetTokenizer to mimic Rust behavior
* Override _decode in XLNetTokenizer to mimic Rust behavior
* Remove unused 're' import
* Update DebertaV2Tokenizer to mimic Rust tokenizer
* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
* Ignore problematic tests in Deberta V2
* Add comment on why the Deberta V2 tests are skipped
* add warning to let the user know that the method is slower than for a fast tokenizer
* user warnings
* fix layoutlmv2
* fix layout*
* change warnings into logger.warning
* Draft new cached_file
* Initial draft for config and model
* Small fixes
* Fix first batch of tests
* Look in cache when internet is down
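A hypothetical usage sketch of the new helper; the `transformers.utils.cached_file` entry point and the offline fallback shown here are assumptions about the final API.
```python
from transformers.utils import cached_file

# Resolve a file from the Hub, downloading it into the local cache if needed.
config_path = cached_file("bert-base-uncased", "config.json")

# When hf.co is unreachable, the same call is expected to fall back to the
# cached copy (the behaviour that `local_files_only=True` forces explicitly).
offline_path = cached_file("bert-base-uncased", "config.json", local_files_only=True)
```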
* Fix last tests
* Bad black, not fixing all quality errors
* Make the diff smaller
* Implement change for TF and Flax models
* Add tokenizer and feature extractor
* For compatibility with main
* Add utils to move the cache and auto-do it at first use.
* Quality
* Deal with empty commit shas
* Deal with empty etag
* Address review comments
* Prepare CI for v0.8.0
* pin hfh (revert before merge)
* Revert "pin hfh (revert before merge)"
This reverts commit a0103140e1.
* Test rc3
* Test latest rc
* Unpin to the RC
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
* [Json dump] Make json prettier
* correct more tokenizers
* more patterns
* add aggressive test
* the aggressive test was actually useful :-)
* more tests
* Apply suggestions from code review
* Fix setters of *_token_id properties of SpecialTokensMixin
* Test setters of common tokens ids
* Move to a separate test checks of setters of tokens ids
* Add independent test for ByT5
* Add Canine test
* Test speech to text
* Make Transformers use cache files when hf.co is down
* Fix tests
* Was there a random circleCI failure?
* Isolate patches
* Style
* Comment out the failure since it doesn't fail anymore
* Better comment
* add new test
* update test
* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
* add `tokenizer_file` for the fast only tokenizer
* change global variables layoutxlm
* remove `"tokenizer_file"` from DPR tokenizer's Global variables
* remove `tokenizer_file` from herbert slow tokenizer init
* `"tokenizer_file"` from LED tokenizer's Global variables
* remove `tokenizer_file` from mbart slow tokenizer init
* remove `tokenizer_file` from slow tokenizer template
* adapt to versioning
* adapt the `test_tokenizer_mismatch_warning` test
* clean test
* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
* Revert "remove `tokenizer_file` from mbart slow tokenizer init"
This reverts commit 0dbb723fa9.
* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
This reverts commit 5a3f879bdd.
* Revert "remove `tokenizer_file` from herbert slow tokenizer init"
This reverts commit f5e10007b7.
* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
This reverts commit da0895330b.
* set `tokenizer_file` in super `__init__` of mbart
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
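The replacement is a plain `ValueError` instead of an `assert`, along these lines (a sketch; exact wording may differ):
```python
# Sketch of the check in `PreTrainedTokenizerBase.__init__`.
padding_side = kwargs.pop("padding_side", "right")
if padding_side not in ["right", "left"]:
    raise ValueError(
        f"Padding side should be selected between 'right' and 'left', current value: {padding_side}"
    )
```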
* add test
* fix kwargs
* reformat test
* format
* format
* fix typo to render the documentation
* add new test
* add a feature to save the sentencepiece tokenizer model when the init file was deleted
* update marian
* update m2m_100
* fix marian
* update speech to text
* override test for layoutxlm
* fix saving bartpho
* remove hardcoded values bartpho
* special token string version
* finish bartpho
* override layoutxlm test
* add mbart
* move special tokens list
* format
* Revert "format"
This reverts commit 37a40df379.
* simplify list of string of special tokens
* Re-write `self.fairseq_tokens_to_ids` initialization logic with special tokens
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
* Fix special tokens not correctly tokenized
* Add testing
* Fix
* Fix
* Use user workflows instead of directly assigning variables
* Enable test of fast tokenizers
* Update test of canine tokenizer
* Moving slow tokenizer to the Trie world.
* Adding more docstrings to the Trie.
* Fixing doctest (incompatible with our format?)
* Update src/transformers/tokenization_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Adding a lot more comments on the internals of this algorithm.
* Cleaner doc.
* Fixing the namings.
* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* quality.
* Fixing longest first match.
* Small improvements to cuts + more test + canine resistant test.
* Fixing fast test.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* correct order of overflowing_tokens for slow tokenizer (fix #13148)
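For reference, a hedged example of the behaviour being fixed (checkpoint and lengths are illustrative): with a slow tokenizer, `return_overflowing_tokens=True` should now return the truncated ids in their original reading order.
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # slow tokenizer
encoded = tokenizer(
    "This is a deliberately long sentence that will not fit into the length budget.",
    truncation=True,
    max_length=8,
    stride=2,
    return_overflowing_tokens=True,
)
# After the fix, the overflowing ids follow the original token order instead of being reversed.
print(tokenizer.convert_ids_to_tokens(encoded["overflowing_tokens"]))
```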
* python 3.9 requires sentencepiece version 0.1.94 or above
* slicing of ids fixed in truncate_sequences()
* Update setup.py
* Correct order of overflowing tokens for pair of sentences
* code reformatted
* Update tokenization_utils_base.py
* reformatting file
* test to check single_input added
* missing function restored
* test to check pair_input overflowing tokens order
* added an error message for pair of seq and longest_first strategy
* test for pair_input modified
* variable name corrected
* fixed a typo in error message
* requested changes implemented
* required test added
* Corrected the message to match test message
* added error message for Luke Tokenizer
* lost test recovered
* docstring for truncate_sequences and prepare_for_model updated
* docstring for luke tokenizer updated
* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
* aligned text and fixed punctuation
* improved style and quality of code
* fixed error_msg in truncate_sequences
* replaced encode_plus method with regular call method
* clean up
* rephrased the docstring
* add test in trainer and test tokenizer saving with trainer
* quality
* reverse trainer changes
* replace test in test_trainer by a test for all the tokenizers
* format
* add can_save_slow_tokenizer attribute to all tokenizers
* fix Herbert
* format
* Change comment in error
* add comments and a new assert
* Update src/transformers/models/albert/tokenization_albert_fast.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change ValueError barthez
* change ValueError BigBird
* change ValueError Camembert
* change ValueError Mbart50
* change ValueError Pegasus
* change ValueError ReFormer
* change ValueError T5
* change ValueError RoBERTa
* XLNET fast
* Update tests/test_tokenization_common.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change `assert` into `self.assertIn`
* format
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add test
* add change in PretrainedTokenizerBase
* change Luke
* deactivate
* add the possibility to add additional special tokens for M2M100
* format
* add special test for canine
* proposed changes for mbart
* proposed changes for mbart50
* proposed changes for byt5
* proposed changes for canine
* proposed changes for t5
* test fast and slow
* remove comment
* remove comment
* add fast version for all tests
* replace break by continue
* add more comments
* add check to avoid duplicates
* remove comment
* format
* proposed change for wav2vec2
* reverse changes mbart
* uncomment
* format
* preserve type of `additional_special_tokens` in `special_tokens_map`
* format
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* [WIP] Easily train a new fast tokenizer from a given one
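The feature presumably lands as a `train_new_from_iterator` method on fast tokenizers; a sketch of the intended usage (corpus and vocab size are illustrative):
```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer

# Any iterator over batches of raw text works as the training corpus.
corpus = (["some domain-specific text", "more text from the new domain"] for _ in range(10))

new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=32000)
new_tokenizer.save_pretrained("my-new-tokenizer")
```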
* Fix test
* Roll out to other tokenizers and add tests
* Fix bug with unk id and add emoji to test
* Really use something different in test
* Implement special tokens map
* Map special tokens in the Transformers tokenizers
* Fix test
* Make test more robust
* Fix test for BPE
* More robust map and test
Co-authored-by: SaulLu
* Test file
* Stronger tests
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
* Map unk token for Wordpiece and address review comment
* Fix lowercase test and address review comment
* Fix all tests
* Simplify test
* Fix tests for realsies
* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)
* Propose change in tests regarding lower case
* add new test for special tokens types
* put back the test part about decoding
* add feature: the AddedToken is re-build with the different mapped content
* Address review comment: simplify AddedToken building
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* Clean push to hub API
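A short usage sketch of the cleaned-up API (the repository name is illustrative):
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# `push_to_hub` creates (or reuses) the repo and uploads the serialized files.
model.push_to_hub("username/my-finetuned-model")
tokenizer.push_to_hub("username/my-finetuned-model")
```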
* Create working dir if it does not exist
* Different tweak
* New API + all models + test Flax
* Adds the Trainer clean up
* Update src/transformers/file_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments
* (nit) output types
* No need to set clone_from when folder exists
* Update src/transformers/trainer.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Add generated_from_trainer tag
* Update to new version
* Fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* feature for tokenizer without slow/legacy version
* format
* modify common test
* add tests
* add PreTrainedTokenizerFast to AutoTokenizer
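A sketch of what a fast-only tokenizer (one without a slow/legacy counterpart) looks like from the user side; file names and paths are illustrative:
```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# A tokenizer that only ships a `tokenizer.json` can be instantiated directly ...
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", unk_token="[UNK]")

# ... and, with this change, also reloaded through AutoTokenizer from a saved directory.
tokenizer.save_pretrained("./fast-only-tokenizer")
reloaded = AutoTokenizer.from_pretrained("./fast-only-tokenizer")
```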
* format
* change tokenizer common test in order to be able to run test without a slow version
* update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class`
* add AutoTokenizer test
* replace `if self.tokenizer_class is not None` with `if self.tokenizer_class is None`
* remove obsolete change in comment
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change `get_main_tokenizer` into `get_tokenizers`
* clarify `get_tokenizers` method
* homogenize with `test_slow_tokenizer` and `test_rust_tokenizer`
* add `test_rust_tokenizer = False` to tokenizer which don't define a fast version
* `test_rust_tokenizer = False` for BertJapaneseTokenizer
* `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add test_vocab_size for sentencepiece tok.
* add test_get_vocab for sentencepiece tok.
* add test_convert_token_and_id for sentencepiece tok.
* add test_tokenize_and_convert_tokens_to_string for all tok.
* improve test_tokenize_and_convert_tokens_to_string for sp. tok.
* add common tokenizer integration tests
- for albert
- for barthez
* add tokenizer integration tests to bert gen.
* add most tokenizer integration tests
* fix camembert tokenizer integration test
* add tokenizer integration test to marian
* add tokenizer integration test to reformer
* add typing and doc to tokenizer_integration_test_util
* fix tokenizer integration test of reformer
* improve test_sentencepiece_tokenize_and_convert_tokens_to_string
* empty commit to trigger CI
* fix tokenizer integration test of reformer
* remove code not needed anymore
* empty commit to trigger CI
* empty commit to trigger CI
* improve slow class tok usage at xlm rob
* add subword regularization for barthez
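Subword regularization is exposed through an `sp_model_kwargs` argument forwarded to the underlying `SentencePieceProcessor`; a sketch with illustrative sampling parameters:
```python
from transformers import BarthezTokenizer

tokenizer = BarthezTokenizer.from_pretrained(
    "moussaKam/barthez",  # illustrative checkpoint
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)

# With sampling enabled, repeated calls may segment the same text differently.
print(tokenizer.tokenize("La régularisation de sous-mots rend l'entraînement plus robuste."))
print(tokenizer.tokenize("La régularisation de sous-mots rend l'entraînement plus robuste."))
```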
* improve barthez tok. test
* fix tokenizer tests
* add subword regularization for camembert
* add subword regularization for deberta v2 tokenizer
* add more doc to deberta v2 tokenizer
* add subword regularization for speech to text tok.
* fix sp_model_kwargs type in speech 2 text tok.
* add subword regularization for M2M100 tok.
* add more concrete type hints
* fix tests for m2m100 and s2t tok.
* add missing Any import
* fix syntax error in m2m100 tok.
* fix unpickle of m2m100 and s2t tok.
* fix test of m2m100 and s2t tok.
* improve unpickle of deberta v2 tok.
* add test for pickle of barthez & camembert
* fix pickle of barthez & camembert
* add test for deberta v2 tok. pickle
* fix m2m100 tok. pickle
* fix s2t tok. pickle
* add subword regularization to albert tok.
* refactor subword reg. test into TokenizerTesterMixin
improve albert tok. test
remove sample argument from albert tok.
check subword reg. using TokenizerTesterMixin
improve tok. tests
improve xlm roberta tok. tests
improve xlm roberta tok. tests
* add subword regularization for big bird t.
* improve xlm roberta tok. test
* add subword regularization for mbart50 tok.
* add subword regularization for pegasus tok.
* add subword regularization for reformer tok.
* add subword regularization for T5 tok.
* fix t5 tok. test formatting
* add subword regularization for xlm_proph. tok.
* add subword regularization for xlnet tok.
* add subword regularization for bert_gen tok.
* add typing to tokenizers
* add typing to xlm rob. tok
* add subword regularization for marian tok.
* add reverse tok. test
* fix marian tok test
* fix marian tok test
* fix casing in tok. tests
* fix style of tok. common test
* fix deberta v2 tok test
* add type annotations to tok. tests
* add type annotations to tok. __init__
* add typing to tokenizer
* add type annotations to tok. __init__
* don't specify the default when it's None
* fix barthez tok. doc
* move sentencepiece tok. tests to TokenizerTesterMixin
* fix unused imports
* fix albert tok. test
* add comment to sentencepiece test options
* fix Any import at big bird tok.
* fix Any import at xlm prophetnet tok.
* empty commit to trigger CI
* Initial support for upload to hub
* push -> upload
* Fixes + examples
* Fix torchhub test
* Torchhub test I hate you
* push_model_to_hub -> push_to_hub
* Apply mixin to other pretrained models
* Remove ABC inheritance
* Add tests
* Typo
* Run tests
* Install git-lfs
* Change approach
* Add push_to_hub to all
* Staging test suite
* Typo
* Maybe like this?
* More deps
* Cache
* Adapt name
* Quality
* MOAR tests
* Put it in testing_utils
* Docs + torchhub last hope
* Styling
* Wrong method
* Typos
* Update src/transformers/file_utils.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* change tokenizer requirement
* split line
* Correct typo from list to str
* improve style
* make other function pretty as well
* add comment
* correct typo
* add new test
* pass tests for tok without padding token
* Apply suggestions from code review
* Add target contextmanager and rework prepare_seq2seq_batch
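The target context manager is presumably `as_target_tokenizer`; a sketch of the reworked seq2seq preparation (checkpoint and texts are illustrative):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

inputs = tokenizer(["Hello world"], return_tensors="pt")
# Inside the context manager the tokenizer switches to target-language mode,
# replacing the old `prepare_seq2seq_batch` helper.
with tokenizer.as_target_tokenizer():
    labels = tokenizer(["Hallo Welt"], return_tensors="pt")
inputs["labels"] = labels["input_ids"]
```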
* Fix tests, treat BART and Barthez
* Add last tokenizers
* Fix test
* Set src token before calling the superclass
* Remove special behavior for T5
* Remove needless imports
* Remove needless asserts
* First commit: adding all files from tapas_v3
* Fix multiple bugs including soft dependency and new structure of the library
* Improve testing by adding torch_device to inputs and adding dependency on scatter
* Use Python 3 inheritance rather than Python 2
* First draft model cards of base sized models
* Remove model cards as they are already on the hub
* Fix multiple bugs with integration tests
* All model integration tests pass
* Remove print statement
* Add test for convert_logits_to_predictions method of TapasTokenizer
* Incorporate suggestions by Google authors
* Fix remaining tests
* Change position embeddings sizes to 512 instead of 1024
* Comment out positional embedding sizes
* Update PRETRAINED_VOCAB_FILES_MAP and PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
* Added more model names
* Fix truncation when no max length is specified
* Disable torchscript test
* Make style & make quality
* Quality
* Address CI needs
* Test the Masked LM model
* Fix the masked LM model
* Truncate when overflowing
* More much needed docs improvements
* Fix some URLs
* Some more docs improvements
* Test PyTorch scatter
* Set to slow + minify
* Calm flake8 down
* Add add_pooling_layer argument to TapasModel
Fix comments by @sgugger and @patrickvonplaten
* Fix issue in docs + fix style and quality
* Clean up conversion script and add task parameter to TapasConfig
* Revert the task parameter of TapasConfig
Some minor fixes
* Improve conversion script and add test for absolute position embeddings
* Improve conversion script and add test for absolute position embeddings
* Fix bug with reset_position_index_per_cell arg of the conversion cli
* Add notebooks to the examples directory and fix style and quality
* Apply suggestions from code review
* Move from `nielsr/` to `google/` namespace
* Apply Sylvain's comments
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Rogge Niels <niels.rogge@howest.be>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
* Warning about too long input for fast tokenizers too
If truncation is not set in tokenizers, but the tokenization is too long
for the model (`model_max_length`), we used to trigger a warning that
the input would probably fail (which it most likely will).
This PR re-enables the warning for fast tokenizers too and uses common
code for the trigger to make sure it stays consistent across tokenizers.
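A rough sketch of the shared trigger (names are illustrative; the real helper lives in `tokenization_utils_base.py`):
```python
# Sketch of the common check shared by slow and fast tokenizers.
def _warn_about_too_long_sequence(self, ids, max_length, verbose):
    if max_length is None and len(ids) > self.model_max_length and verbose:
        logger.warning(
            "Token indices sequence length is longer than the specified maximum sequence "
            f"length for this model ({len(ids)} > {self.model_max_length}). Running this "
            "sequence through the model will result in indexing errors."
        )
```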
* Checking for pair of inputs too.
* Making the function private and adding its doc.
* Remove formatting ?? in odd place.
* Missed uppercase.
* Tokenizers should be framework agnostic
* Run the slow tests
* Not testing
* Fix documentation
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Fixing roberta for slow-fast tests
* WIP getting equivalence on pipelines
* slow-to-fast equivalence - working on question-answering pipeline
* optional FAISS tests
* Pipeline Q&A
* Move pipeline tests to their own test job again
* update tokenizer to add sequence id methods
* update to tokenizers 0.9.4
* set sentencepiece as optional
* clean up squad
* clean up pipelines to use sequence_ids
* style/quality
* wording
* Switch to use_fast = True by default
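In practice (a small illustrative example):
```python
from transformers import AutoTokenizer

# With the new default, AutoTokenizer returns the Rust-backed fast tokenizer when one exists ...
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(fast_tokenizer.is_fast)  # True

# ... and the Python implementation stays available by opting out explicitly.
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
print(slow_tokenizer.is_fast)  # False
```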
* update tests for use_fast at True by default
* fix rag tokenizer test
* removing protobuf from required dependencies
* fix NER test for use_fast = True by default
* fixing example tests (Q&A examples use slow tokenizers for now)
* protobuf in main deps extras["sentencepiece"] and example deps
* fix protobuf install test
* try to fix seq2seq by switching to slow tokenizers for now
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* WIP refactoring pipeline tests - switching to fast tokenizers
* fix dialog pipeline and fill-mask
* refactoring pipeline tests backbone
* make large tests slow
* fix tests (tf Bart inactive for now)
* fix doc...
* clean up for merge
* fixing tests - remove bart from summarization until there is TF
* fix quality and RAG
* Add new translation pipeline tests - fix JAX tests
* only slow for dialog
* Fixing the missing TF-BART imports in modeling_tf_auto
* spin out pipeline tests in separate CI job
* adding pipeline test to CI YAML
* add slow pipeline tests
* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* Update src/transformers/pipelines.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/pipelines.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add require_torch and require_tf in is_pt_tf_cross_test
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* splitting fast and slow tokenizers [WIP]
* [WIP] splitting sentencepiece and tokenizers dependencies
* update dummy objects
* add name_or_path to models and tokenizers
* prefix added to file names
* prefix
* styling + quality
* splitting all the tokenizer files - sorting sentencepiece based ones
* update tokenizer version up to 0.9.0
* remove hard dependency on sentencepiece 🎉
* and removed hard dependency on tokenizers 🎉
* update conversion script
* update missing models
* fixing tests
* move test_tokenization_fast to main tokenization tests - fix bugs
* bump up tokenizers
* fix bert_generation
* update and fix several tokenizers
* keep sentencepiece in deps for now
* fix funnel and deberta tests
* fix fsmt
* fix marian tests
* fix layoutlm
* fix squeezebert and gpt2
* fix T5 tokenization
* fix xlnet tests
* style
* fix mbart
* bump up tokenizers to 0.9.2
* fix model tests
* fix tf models
* fix seq2seq examples
* fix tests without sentencepiece
* fix slow => fast conversion without sentencepiece
* update auto and bert generation tests
* fix mbart tests
* fix auto and common test without tokenizers
* fix tests without tokenizers
* clean up and lighten up tests when tokenizers + sentencepiece are both off
* style quality and tests fixing
* add sentencepiece to doc/examples reqs
* leave sentencepiece on for now
* style, quality, split herbert and fix pegasus
* WIP Herbert fast
* add sample_text_no_unicode and fix herbert tokenization
* skip FSMT example test for now
* fix style
* fix fsmt in example tests
* update following Lysandre and Sylvain's comments
* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* [WIP] SP tokenizers
* fixing tests for T5
* WIP tokenizers
* serialization
* update T5
* WIP T5 tokenization
* slow to fast conversion script
* Refactoring to move tokenizer implementations inside transformers
* Adding gpt - refactoring - quality
* WIP adding several tokenizers to the fast world
* WIP Roberta - moving implementations
* update to dev4; switch file loading to in-memory loading
* Updating and fixing
* advancing on the tokenizers - updating do_lower_case
* style and quality
* moving forward with tokenizers conversion and tests
* MBart, T5
* dumping the fast version of transformer XL
* Adding to autotokenizers + style/quality
* update init and space_between_special_tokens
* style and quality
* bump up tokenizers version
* add protobuf
* fix pickle Bert JP with Mecab
* fix newly added tokenizers
* style and quality
* fix bert japanese
* fix funnel
* limit tokenizer warning to one occurrence
* clean up file
* fix new tokenizers
* fast tokenizers deep tests
* WIP adding all the special fast tests on the new fast tokenizers
* quick fix
* adding more fast tokenizers in the fast tests
* all tokenizers in fast version tested
* Adding BertGenerationFast
* bump up setup.py for CI
* remove BertGenerationFast (too early)
* bump up tokenizers version
* Clean old docstrings
* Typo
* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
* Exposing prepare_for_model for both slow & fast tokenizers
* Update method signature
* The traditional style commit
* Hide the warnings behind the verbose flag
* update default truncation strategy and prepare_for_model
* fix tests and prepare_for_models methods
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* Add new parameter `pad_to_multiple_of` on tokenizers.
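Illustrative usage of the new parameter (useful e.g. to keep padded lengths tensor-core friendly):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["a short sentence", "a noticeably longer sentence to force padding"],
    padding=True,
    pad_to_multiple_of=8,  # pad the batch length up to the next multiple of 8
    return_tensors="pt",
)
print(batch["input_ids"].shape[-1] % 8)  # 0
```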
* unittest for pad_to_multiple_of
* Add .name when logging enum.
* Fix missing .items() on dict in tests.
* Add special check + warning if the tokenizer doesn't have proper pad_token.
* Use the correct logger format specifier.
* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.
* Skip test if tokenizer doesn't have pad_token
* Fix RobertaTokenizer on empty input
* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* fix and updating to simpler API
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* fix-5181
Padding to max sequence length while truncating to another length was broken on slow tokenizers
* clean up and fix #5155
* fix XLM test
* Fix tests for Transfo-XL
* logging only above WARNING in tests
* switch slow tokenizers tests in @slow
* fix Marian truncation tokenization test
* style and quality
* make the test a lot faster by limiting the sequence length used in tests
* Add return lengths
* make pad a bit more flexible so it can be used as collate_fn
* check all kwargs sent to encoding method are known
* fixing kwargs in encodings
* New AddedToken class in python
This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and RoBERTa, to control how whitespace is stripped around special tokens.
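For instance, a hedged sketch of the kind of control this enables, written against the `tokenizers.AddedToken` class that later commits switch to:
```python
from tokenizers import AddedToken
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# Strip the space to the left of <mask> so "<mask>" matches whether or not it is
# preceded by whitespace, without touching the text to its right.
tokenizer.add_special_tokens({"mask_token": AddedToken("<mask>", lstrip=True, rstrip=False)})
```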
* style and quality
* switched to huggingface tokenizers library for AddedTokens
* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
* style and quality
* do not raise an error on additional or unused kwargs for tokenize() but only a warning
* transfo-xl pretrained model requires torch
* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* fix #5081 and improve backward compatibility (slightly)
* add nlp to setup.cfg - style and quality
* align default to previous default
* remove test that doesn't generalize
* Refactor tensor creation in tokenizers.
* Make sure to convert string to TensorType
* Refactor convert_to_tensors_
* Introduce numpy tensor creation
* Format
* Add unittest for TensorType creation from str
* sorting imports
* Added unittests for numpy tensor conversion.
* Do not use in-place version for squeeze as numpy doesn't provide such feature.
* Added extra parameter prepend_batch_axis: bool on prepare_for_model.
* Ensure test_np_encode_plus_sent_to_model is not executed if encoder/decoder model.
* style.
* numpy tests require_torch for now while flax not merged.
* Hopefully will make flake8 happy.
* One more time 🎶
* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Make fast tokenizers unittests work on Windows.
* Entirely refactored unittest for tokenizers fast.
* Remove ABC class for CommonFastTokenizerTest
* Added embeded_special_tokens tests from allenai @dirkgr
* Make embeded_special_tokens tests from allenai more generic
* Uniformize vocab_size as a property for both Fast and normal tokenizers
* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
* Fix invalid input for assert_padding when testing batch_encode_plus
* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
* Ensure tokenize() correctly forwards add_special_tokens to Rust.
* Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.
* unittests ensure tokenize() also throws a ValueError if provided None
* Added add_special_tokens unittest for all supported models.
* Style
* Make sure TransfoXL test run only if PyTorch is provided.
* Split up tokenizers tests for each model type.
* Fix invalid unittest with new tokenizers API.
* Filter out Roberta openai detector models from unittests.
* Introduce BatchEncoding on fast tokenizers path.
This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.
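A sketch of what `BatchEncoding` exposes on the fast path (checkpoint and text are illustrative):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer
encoding = tokenizer("Hello world!", return_offsets_mapping=True)

# BatchEncoding still behaves like a dict for the model forward ...
print(encoding["input_ids"])
# ... while also surfacing the Rust-side mappings.
print(encoding.tokens())
print(encoding.word_ids())
print(encoding["offset_mapping"])
```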
* Introduce BatchEncoding on slow tokenizers path.
Backward compatibility.
* Improve error message on BatchEncoding for slow path
* Make add_prefix_space True by default on Roberta fast to match the Python tokenizer in the majority of cases.
* Style and format.
* Added typing on all methods for PretrainedTokenizerFast
* Style and format
* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
* Style and format
* encode_plus now supports pretokenized inputs.
* Remove user warning about add_special_tokens when working on pretokenized inputs.
* Always go through the post processor.
* Added support for pretokenized input pairs on encode_plus
* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
* Added pretokenized inputs support on batch_encode_plus
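Illustrative example of the pretokenized path; the flag was named `is_pretokenized` in these commits (later releases rename it `is_split_into_words`):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["Hello", "world", "!"]
# Feed already-split words instead of raw text; flag name as used at the time
# of these commits (newer versions: is_split_into_words).
encoding = tokenizer.encode_plus(words, is_pretokenized=True, add_special_tokens=True)
print(encoding["input_ids"])
```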
* Update BatchEncoding methods name to match Encoding.
* Bump setup.py tokenizers dependency to 0.7.0rc1
* Remove unused parameters in BertTokenizerFast
* Make sure Roberta returns token_type_ids for unittests.
* Added missing typings
* Update add_tokens prototype to match tokenizers side and allow AddedToken
* Bumping tokenizers to 0.7.0rc2
* Added documentation for BatchEncoding
* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
* Fix text-classification pipeline using the wrong tokenizer
* Make pipelines works with BatchEncoding
* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correct message for batch_encode_plus none input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix invalid list comprehension for offset_mapping overriding content every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* SpecialTokensMixin can use slots for faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove update_special_tokens from fast tokenizers.
* Ensure TransfoXL unittests are run only when torch is available.
* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style
* Style 🙏🙏
* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
* Remove Roberta warning on __init__.
* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>