Mirror of https://github.com/huggingface/transformers.git, synced 2025-07-17 19:48:23 +06:00
8 Commits (at revision 35df911485)

35df911485
Fix convert_token_type_ids_from_sequences for fast tokenizers (#4503)
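
For context on the fix above: token type ids are the segment markers returned alongside input_ids when a pair of sequences is encoded. A minimal sketch of what a fast tokenizer is expected to produce, assuming the `bert-base-uncased` checkpoint is available (the exact ids depend on the tokenization):

```python
from transformers import BertTokenizerFast

# Illustration only, not part of the commit: encode a sequence pair and
# inspect the token type ids that separate segment A from segment B.
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok.encode_plus("hello world", "how are you")

# For BERT-style models, segment 0 covers "[CLS] hello world [SEP]"
# and segment 1 covers "how are you [SEP]".
print(enc["token_type_ids"])
```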

827d6d6ef0
Cleanup fast tokenizers integration (#3706)
* First pass on utility classes and python tokenizers
* Finishing cleanup pass
* Style and quality
* Fix tests
* Updating following @mfuntowicz comment
* Style and quality
* Fix Roberta
* Fix batch_size/seq_length in BatchEncoding
* Add alignment methods + tests
* Fix OpenAI and Transfo-XL tokenizers
* Adding trim_offsets=True default for GPT2 and RoBERTa
* Style and quality
* Fix tests
* add_prefix_space in roberta
* Bump up tokenizers to rc7
* Style
* Unfortunately tensorflow does not like these - removing shape/seq_len for now
* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
* Adding doc and docstrings
* Making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>
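
The alignment methods mentioned above live on BatchEncoding whenever a fast tokenizer produced it. A minimal sketch, assuming the `gpt2` checkpoint (the exact spans depend on the vocabulary):

```python
from transformers import GPT2TokenizerFast

text = "Transformers are great"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
enc = tok.encode_plus(text, return_offsets_mapping=True)

# token_to_chars maps a token index back to the character span it covers in `text`.
span = enc.token_to_chars(0)
print(span, repr(text[span.start:span.end]))

# char_to_token maps a character position to the index of the token containing it.
print(enc.char_to_token(len(text) - 1))
```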

96ab75b8dd
Tokenizers v3.0.0 (#3185)
* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Cherry-Pick: Partially fix space-only input without special tokens added to the output #3091
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added property is_fast on PreTrainedTokenizer and PreTrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Make fast tokenizers unittests work on Windows.
* Entirely refactored unittest for tokenizers fast.
* Remove ABC class for CommonFastTokenizerTest
* Added embeded_special_tokens tests from allenai @dirkgr
* Make embeded_special_tokens tests from allenai more generic
* Uniformize vocab_size as a property for both Fast and normal tokenizers
* Move special tokens handling out of PreTrainedTokenizer (SpecialTokensMixin)
* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
* Fix invalid input for assert_padding when testing batch_encode_plus
* Move add_special_tokens from the constructor to a parameter of the tokenize/encode/[batch_]encode_plus methods.
* Ensure tokenize() correctly forwards add_special_tokens to Rust.
* Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast. Avoid stripping on None values.
* Unittests ensure tokenize() also throws a ValueError if provided None
* Added add_special_tokens unittest for all supported models.
* Style
* Make sure TransfoXL tests run only if PyTorch is provided.
* Split up tokenizers tests for each model type.
* Fix invalid unittest with new tokenizers API.
* Filter out Roberta openai detector models from unittests.
* Introduce BatchEncoding on fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward.
* Introduce BatchEncoding on slow tokenizers path. Backward compatibility.
* Improve error message on BatchEncoding for slow path
* Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.
* Style and format.
* Added typing on all methods for PreTrainedTokenizerFast
* Style and format
* Added path for feeding pretokenized (List[str]) input to PreTrainedTokenizerFast.
* Style and format
* encode_plus now supports pretokenized inputs.
* Remove user warning about add_special_tokens when working on pretokenized inputs.
* Always go through the post processor.
* Added support for pretokenized input pairs on encode_plus
* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
* Added pretokenized inputs support on batch_encode_plus
* Update BatchEncoding method names to match Encoding.
* Bump setup.py tokenizers dependency to 0.7.0rc1
* Remove unused parameters in BertTokenizerFast
* Make sure Roberta returns token_type_ids for unittests.
* Added missing typings
* Update add_tokens prototype to match tokenizers side and allow AddedToken
* Bumping tokenizers to 0.7.0rc2
* Added documentation for BatchEncoding
* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
* Fix text-classification pipeline using the wrong tokenizer
* Make pipelines work with BatchEncoding
* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correct message for batch_encode_plus None input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix invalid list comprehension for offset_mapping overriding content every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* SpecialTokensMixin can use slots for faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove update_special_tokens from fast tokenizers.
* Ensure TransfoXL unittests are run only when torch is available.
* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style
* Style 🙏🙏
* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
* Remove Roberta warning on __init__.
* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
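
Two of the renames above (num_special_tokens_to_add, is_fast) are easy to see from user code. A minimal sketch, assuming `bert-base-uncased` for both the Python and the Rust-backed tokenizer:

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
fast = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# is_fast distinguishes the Rust-backed implementation from the pure Python one.
print(slow.is_fast, fast.is_fast)  # False True

# num_special_tokens_to_add (renamed from num_added_tokens) reports how many
# special tokens wrap a single sequence vs. a pair; for BERT: 2 and 3.
print(fast.num_special_tokens_to_add(pair=False))
print(fast.num_special_tokens_to_add(pair=True))
```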

4cd9c0971c
Fix for fast tokenizers save_pretrained compatibility with Python. (#2933)
* Renamed file generated by tokenizers when calling save_pretrained to match Python.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added save_vocabulary tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove Python quick and dirty fix for clean Rust impl.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.5.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added some save_pretrained / from_pretrained unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update tokenizers to 0.5.2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* flake8
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Making sure there is really a bug in unittest
* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
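
The compatibility being fixed above is the round trip where files written by a fast tokenizer can be read back by the Python one. A minimal sketch, assuming `bert-base-uncased`:

```python
import tempfile
from transformers import BertTokenizer, BertTokenizerFast

fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
with tempfile.TemporaryDirectory() as tmp:
    # The fast tokenizer writes its vocabulary and config files under `tmp`...
    fast.save_pretrained(tmp)
    # ...and the Python tokenizer should be able to load them back unchanged.
    slow = BertTokenizer.from_pretrained(tmp)
    print(slow.tokenize("compatibility check"))
```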

cc6775cdf5
Fix max_length not taken into account when using pad_to_max_length on fast tokenizers (#2961)
* enable_padding should pad up to max_length if set.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added more testing on padding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
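
The behaviour being fixed: when padding is requested together with max_length, the output should be padded exactly to that length. A minimal sketch, assuming `bert-base-uncased`; current versions spell the commit-era pad_to_max_length=True flag as padding="max_length":

```python
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok.encode_plus(
    "a short sentence",
    max_length=16,
    padding="max_length",  # modern spelling of pad_to_max_length=True
    truncation=True,
)
print(len(enc["input_ids"]))  # expected to be exactly 16 once max_length is honoured
```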

d490b5d500
Fast Tokenizers save_pretrained should return the list of generated file paths. (#2918)
* Correctly return the tuple of generated file(s) when calling save_pretrained
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
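
A minimal sketch of the behaviour described above: save_pretrained returns the paths of the files it wrote, so callers can inspect or ship them. Assumes `bert-base-uncased`:

```python
import tempfile
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
with tempfile.TemporaryDirectory() as tmp:
    saved_files = tok.save_pretrained(tmp)  # tuple of paths to every file written
    for path in saved_files:
        print(path)
```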

e676764241
Override build_inputs_with_special_tokens for fast tokenizers (#2912)
* Override build_inputs_with_special_tokens for fast impl + unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality + format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
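
build_inputs_with_special_tokens takes bare token ids and wraps them with the model's special tokens. A minimal sketch of the overridden behaviour, assuming `bert-base-uncased`:

```python
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
ids_a = tok.encode("hello", add_special_tokens=False)
ids_b = tok.encode("world", add_special_tokens=False)

with_special = tok.build_inputs_with_special_tokens(ids_a, ids_b)
print(tok.convert_ids_to_tokens(with_special))
# For BERT this is expected to look like: ['[CLS]', 'hello', '[SEP]', 'world', '[SEP]']
```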

3f3fa7f7da
Integrate fast tokenizers library inside transformers (#2674)
* Implemented fast version of tokenizers
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bumped tokenizers version requirements to latest 0.2.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added matching tests
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Matching OpenAI GPT tokenization!
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Matching GPT2 on tokenizers
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Expose add_prefix_space as constructor parameter for GPT2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Matching Roberta tokenization !
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Removed fast implementation of CTRL.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Binding TransformerXL tokenizers to Rust.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Updating tests accordingly.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added tokenizers as top-level modules.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Black & isort.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Rename LookupTable to WordLevel to match Rust side.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Black.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Use "fast" suffix instead of "ru" for rust tokenizers implementations.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Introduce tokenize() method on fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* encode_plus dispatches to batch_encode_plus
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* batch_encode_plus now dispatches to encode if there is only one input element.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bind all the encode_plus parameters to the forwarded batch_encode_plus call.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.3.0
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Formatting.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix tokenization_auto with support for new (python, fast) mapping schema.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Give correct fixtures path in test_tokenization_fast.py for the CLI.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Expose max_len_ properties on BertTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* _convert_encoding should keep the batch axis tensor if there is only one sample in the batch.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add warning message for RobertaTokenizerFast if used for MLM.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained().
This makes it easy to enable/disable the Rust-based tokenizer instantiation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Let tokenizers handle all the truncation and padding stuff.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Allow to provide tokenizer arguments during pipeline creation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update test_fill_mask pipeline to not use fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix too many parameters for convert_encoding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* When enabling padding, max_length should be set to None.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Avoid returning nested tensors of length 1 when calling encode_plus
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Ensure output is padded when return_tensor is not None.
Tensor creation requires the initial list inputs to be of the exact same size.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Disable transfoxl unittest if pytorch is not available (required to load the model)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* encode_plus should not remove the leading batch axis if return_tensor is set
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Temporary disable fast tokenizers on QA pipelines.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix formatting issues.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update tokenizers to 0.4.0
* Update style
* Enable truncation + stride unit test on fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add unittest ensuring special_tokens set match between Python and Rust.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Ensure special_tokens are correctly set during construction.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Give more warning feedback to the user in case of padding without pad_token.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* quality & format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added possibility to add a single token as str
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added unittest for add_tokens and add_special_tokens on fast tokenizers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix rebase mismatch on pipelines qa default model.
QA requires cased input while the tokenizers would be uncased.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Using offset mapping relative to the original string + unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: save_vocabulary requires folder and file name
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Simplify import for Bert.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: truncate_and_pad disables padding according to the same heuristic as the one enabling padding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Remove private member access in tokenize()
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Bump tokenizers dependency to 0.4.2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* format & quality.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Use named arguments when applicable.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Relax type checking to include tuple and list object.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing review comment: Document the truncate_and_pad manager behavior.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Raise an exception if return_offsets_mapping is not available with the current tokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* On PyTorch we need to stack tensors to get a proper new axis.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Generalize tests to different frameworks, removing hard-coded return_tensors="..."
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizer dependency for num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Improved error message for padding strategy without pad token.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bumping tokenizers dependency to 0.5.0 for release.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Optimizing convert_encoding around 4x improvement. 🚀
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Format & quality.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove perfect alignment constraint for Roberta (allowing 1% difference max)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Triggering final CI
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
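
Two user-visible pieces from the integration above are the use_fast switch on AutoTokenizer.from_pretrained() and the return_offsets_mapping option that only fast tokenizers support. A minimal sketch, assuming `bert-base-uncased`:

```python
from transformers import AutoTokenizer

# use_fast toggles between the Python and the Rust-backed ("fast") implementation.
tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

enc = tok.encode_plus(
    "Fast tokenizers expose character offsets for every token.",
    return_offsets_mapping=True,  # raises on tokenizers that cannot provide offsets
)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    print(token, (start, end))  # special tokens map to the empty span (0, 0)
```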