* Renamed file generate by tokenizers when calling save_pretrained to match python.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added save_vocabulary tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove python quick and dirty fix for clean Rust impl.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.5.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added some save_pretrained / from_pretrained unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Update tokenizers to 0.5.2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* flake8
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Making sure there is really a bug in unittest
* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* add preprocessing to add space before punctuation for transfo_xl
* improve warning messages
* make style
* compile regex at instantination of tokenizer object
* Add disable_outgoing to pretrained items
Setting disable_outgoing=True disables outgonig traffic:
- etags are not looked up
- models are not downloaded
* parameter name change
* Remove forgotten print
* Testing that encode_plus and batch_encode_plus behave the same way
Spoiler alert: they don't
* Testing rest of arguments in batch_encode_plus
* Test tensor return in batch_encode_plus
* Addressing Sam's comments
* flake8
* Simplified with `num_added_tokens`
* Added support for Albert in NER pipeline
* Added command-line options to examples/ner/run_ner.py to better control tokenization
* Added class AlbertForTokenClassification
* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens
- I added an example using the model with pipelines to show that we have set```{"use_fast": False}``` in the tokenizer.
- I added a Colab to play with the model and pipelines
- I added a Colab to discover Huggingface pipelines at the end of the document
* enable_padding should pad up to max_length if set.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added more testing on padding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* improving generation
* finalized special token behaviour for no_beam_search generation
* solved modeling_utils merge conflict
* solve merge conflicts in modeling_utils.py
* add run_generation improvements from PR #2749
* adapted language generation to not use hardcoded -1 if no padding token is available
* remove the -1 removal as hard coded -1`s are not necessary anymore
* add lightweight language generation testing for randomely initialized models - just checking whether no errors are thrown
* add slow language generation tests for pretrained models using hardcoded output with pytorch seed
* delete ipdb
* check that all generated tokens are valid
* renaming
* renaming Generation -> Generate
* make style
* updated so that generate_beam_search has same token behavior than generate_no_beam_search
* consistent return format for run_generation.py
* deleted pretrain lm generate tests -> will be added in another PR
* cleaning of unused if statements and renaming
* run_generate will always return an iterable
* make style
* consistent renaming
* improve naming, make sure generate function always returns the same tensor, add docstring
* add slow tests for all lmhead models
* make style and improve example comments modeling_utils
* better naming and refactoring in modeling_utils
* improving generation
* finalized special token behaviour for no_beam_search generation
* solved modeling_utils merge conflict
* solve merge conflicts in modeling_utils.py
* add run_generation improvements from PR #2749
* adapted language generation to not use hardcoded -1 if no padding token is available
* remove the -1 removal as hard coded -1`s are not necessary anymore
* add lightweight language generation testing for randomely initialized models - just checking whether no errors are thrown
* add slow language generation tests for pretrained models using hardcoded output with pytorch seed
* delete ipdb
* check that all generated tokens are valid
* renaming
* renaming Generation -> Generate
* make style
* updated so that generate_beam_search has same token behavior than generate_no_beam_search
* consistent return format for run_generation.py
* deleted pretrain lm generate tests -> will be added in another PR
* cleaning of unused if statements and renaming
* run_generate will always return an iterable
* make style
* consistent renaming
* improve naming, make sure generate function always returns the same tensor, add docstring
* add slow tests for all lmhead models
* make style and improve example comments modeling_utils
* better naming and refactoring in modeling_utils
* changed fast random lm generation testing design to more general one
* delete in old testing design in gpt2
* correct old variable name
* temporary fix for encoder_decoder lm generation tests - has to be updated when t5 is fixed
* adapted all fast random generate tests to new design
* better warning description in modeling_utils
* better comment
* better comment and error message
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* Remove warning when pad_to_max_length is not set.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move RoberTa warning to RoberTa and not GPT2 base tokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correctly return the tuple of generated file(s) when calling save_pretrained
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>