transformers

mirror of https://github.com/huggingface/transformers.git synced 2025-07-30 01:32:23 +06:00

Author	SHA1	Message	Date
Iz Beltagy	8f1d047148	Longformer (#4352 ) * first commit * bug fixes * better examples * undo padding * remove wrong VOCAB_FILES_NAMES * License * make style * make isort happy * unit tests * integration test * make `black` happy by undoing `isort` changes!! * lint * no need for the padding value * batch_size not bsz * remove unused type casting * seqlen not seq_len * staticmethod * `bert` selfattention instead of `n2` * uint8 instead of bool + lints * pad inputs_embeds using embeddings not a constant * black * unit test with padding * fix unit tests * remove redundant unit test * upload model weights * resolve todo * simpler _mask_invalid_locations without lru_cache + backward compatible masked_fill_ * increase unittest coverage	2020-05-19 16:04:43 +02:00
Julien Chaumond	5e7fe8b585	Distributed eval: SequentialDistributedSampler + gather all results (#4243 ) * Distributed eval: SequentialDistributedSampler + gather all results * For consistency only write to disk from world_master Close https://github.com/huggingface/transformers/issues/4272 * Working distributed eval * Hook into scripts * Fix #3721 again * TPU.mesh_reduce: stay in tensor space Thanks @jysohn23 * Just a small comment * whitespace * torch.hub: pip install packaging * Add test scenarii	2020-05-18 22:02:39 -04:00
Julien Chaumond	4c06893610	Fix nn.DataParallel compatibility in PyTorch 1.5 (#4300 ) * Test case for #3936 * multigpu tests pass on pytorch 1.4.0 * Fixup * multigpu tests pass on pytorch 1.5.0 * Update src/transformers/modeling_utils.py * Update src/transformers/modeling_utils.py * rename multigpu to require_multigpu * mode doc	2020-05-18 20:34:50 -04:00
Sam Shleifer	a699525d25	[test_pipelines] Mark tests > 10s @slow, small speedups (#4421 )	2020-05-18 12:23:21 -04:00
Patrick von Platen	026a5d0888	[T5 fp16] Fix fp16 in T5 (#4436 ) * fix fp16 in t5 * make style * refactor invert_attention_mask fn * fix typo	2020-05-18 17:25:58 +02:00
Funtowicz Morgan	31c799a0c9	Tag onnx export tests as slow (#4432 )	2020-05-18 09:24:41 -04:00
Lorenzo Ampil	18d233d525	Allow the creation of "entity groups" for NerPipeline #3548 (#3957 ) * Add index to be returned by NerPipeline to allow for the creation of * Add entity groups * Convert entity list to dict * Add entity to entity_group_disagg after updating entity groups * Change 'group' parameter to 'grouped_entities' * Add unit tests for grouped NER pipeline case * Correct variable name typo for NER_FINETUNED_MODELS * Sync grouped tests to recent test updates	2020-05-17 09:25:17 +02:00
Funtowicz Morgan	db0076a9df	Conversion script to export transformers models to ONNX IR. (#4253 ) * Added generic ONNX conversion script for PyTorch model. * WIP initial TF support. * TensorFlow/Keras ONNX export working. * Print framework version info * Add possibility to check the model is correctly loading on ONNX runtime. * Remove quantization option. * Specify ONNX opset version when exporting. * Formatting. * Remove unused imports. * Make functions more generally reusable from other part of the code. * isort happy. * flake happy * Export only feature-extraction for now * Correctly check inputs order / filter before export. * Removed task variable * Fix invalid args call in load_graph_from_args. * Fix invalid args call in convert. * Fix invalid args call in infer_shapes. * Raise exception and catch in caller function instead of exit. * Add 04-onnx-export.ipynb notebook * More WIP on the notebook * Remove unused imports * Simplify & remove unused constants. * Export with constant_folding in PyTorch * Let's try to put function args in the right order this time ... * Disable external_data_format temporary * ONNX notebook draft ready. * Updated notebooks charts + wording * Correct error while exporting last chart in notebook. * Addressing @LysandreJik comment. * Set ONNX opset to 11 as default value. * Set opset param mandatory * Added ONNX export unittests * Quality. * flake8 happy * Add keras2onnx dependency on extras["tf"] * Pin keras2onnx on github master to v1.6.5 * Second attempt. * Third attempt. * Use the right repo URL this time ... * Do the same for onnxconverter-common * Added keras2onnx and onnxconverter-common to 1.7.0 to supports TF2.2 * Correct commit hash. * Addressing PR review: Optimization are enabled by default. * Addressing PR review: small changes in the notebook * setup.py comment about keras2onnx versioning.	2020-05-14 16:35:52 -04:00
Sam Shleifer	7822cd38a0	[tests] make pipelines tests faster with smaller models (#4238 ) covers torch and tf. Also fixes a failing @slow test	2020-05-14 13:36:02 -04:00
Julien Chaumond	448c467256	Fix: unpin flake8 and fix cs errors (#4367 ) * Fix: unpin flake8 and fix cs errors * Ok we still need to quote those	2020-05-14 13:14:26 -04:00
Sam Shleifer	9a687ebb77	[Marian Fixes] prevent predicting pad_token_id before softmax, support language codes, name multilingual models (#4290 )	2020-05-13 17:29:41 -04:00
Julien Chaumond	241759101e	(v2) Improvements to the wandb integration (#4324 ) * Improvements to the wandb integration * small reorg + no global necessary * feat(trainer): log epoch and final metrics * Simplify logging a bit * Fixup * Fix crash when just running eval Co-authored-by: Chris Van Pelt <vanpelt@gmail.com> Co-authored-by: Boris Dayma <boris.dayma@gmail.com>	2020-05-12 21:52:01 -04:00
Julien Chaumond	4bf5042240	Fix BART tests on GPU (#4298 )	2020-05-12 09:11:50 -04:00
Sam Shleifer	3487be75ef	[Marian] documentation and AutoModel support (#4152 ) - MarianSentencepieceTokenizer - > MarianTokenizer - Start using unk token. - add docs page - add better generation params to MarianConfig - more conversion utilities	2020-05-10 13:54:57 -04:00
Patrick von Platen	cf08830c28	[Pipeline, Generation] tf generation pipeline bug (#4217 ) * fix PR * move tests to correct place	2020-05-08 08:30:05 -04:00
Jared T Nielsen	8bf7312654	Add AlbertForPreTraining and TFAlbertForPreTraining models. (#4057 ) * Add AlbertForPreTraining and TFAlbertForPreTraining models. * PyTorch conversion * TensorFlow conversion * style Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2020-05-07 19:44:51 -04:00
Julien Chaumond	0ae96ff8a7	BIG Reorganize examples (#4213 ) * Created using Colaboratory * [examples] reorganize files * remove run_tpu_glue.py as superseded by TPU support in Trainer * Bugfix: int, not tuple * move files around	2020-05-07 13:48:44 -04:00
Funtowicz Morgan	0a6cbea0a5	Rewritten batch support in pipelines. (#4154 ) * Rewritten batch support in pipelines. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix imports sorting 🔧 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Set pad_to_max_length=True by default on Pipeline. * Set pad_to_max_length=False for generation pipelines. Most generation models don't have a padding token. * Address @joeddav review comment: Uniformized args. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> Address @joeddav review comment: Uniformized *args (second). Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-05-07 09:52:40 -04:00
Patrick von Platen	dca34695d0	Reformer (#3351 ) * first copy & past commit from Bert and morgans LSH code * add easy way to compare to trax original code * translate most of function * make trax lsh self attention deterministic with numpy seed + copy paste code * add same config * add same config * make layer init work * implemented hash_vectors function for lsh attention * continue reformer translation * hf LSHSelfAttentionLayer gives same output as trax layer * refactor code * refactor code * refactor code * refactor * refactor + add reformer config * delete bogus file * split reformer attention layer into two layers * save intermediate step * save intermediate step * make test work * add complete reformer block layer * finish reformer layer * implement causal and self mask * clean reformer test and refactor code * fix merge conflicts * fix merge conflicts * update init * fix device for GPU * fix chunk length init for tests * include morgans optimization * improve memory a bit * improve comment * factorize num_buckets * better testing parameters * make whole model work * make lm model work * add t5 copy paste tokenizer * add chunking feed forward * clean config * add improved assert statements * make tokenizer work * improve test * correct typo * extend config * add complexer test * add new axial position embeddings * add local block attention layer * clean tests * refactor * better testing * save intermediate progress * clean test file * make shorter input length work for model * allow variable input length * refactor * make forward pass for pretrained model work * add generation possibility * finish dropout and init * make style * refactor * add first version of RevNet Layers * make forward pass work and add convert file * make uploaded model forward pass work * make uploaded model forward pass work * refactor code * add namedtuples and cache buckets * correct head masks * refactor * made reformer more flexible * make style * remove set max length * add 
attention masks * fix up tests * fix lsh attention mask * make random seed optional for the moment * improve memory in reformer * add tests * make style * make sure masks work correctly * detach gradients * save intermediate * correct backprob through gather * make style * change back num hashes * rename to labels * fix rotation shape * fix detach * update * fix trainer * fix backward dropout * make reformer more flexible * fix conflict * fix * fix * add tests for fixed seed in reformer layer * fix trainer typo * fix typo in activations * add fp16 tests * add fp16 training * support fp16 * correct gradient bug in reformer * add fast gelu * re-add dropout for embedding dropout * better naming * better naming * renaming * finalize test branch * finalize tests * add more tests * finish tests * fix * fix type trainer * fix fp16 tests * fix tests * fix tests * fix tests * fix issue with dropout * fix dropout seeds * correct random seed on gpu * finalize random seed for dropout * finalize random seed for dropout * remove duplicate line * correct half precision bug * make style * refactor * refactor * docstring * remove sinusoidal position encodings for reformer * move chunking to modeling_utils * make style * clean config * make style * fix tests * fix auto tests * pretrained models * fix docstring * update conversion file * Update pretrained_models.rst * fix rst * fix rst * update copyright * fix test path * fix test path * fix small issue in test * include reformer in generation tests * add docs for axial position encoding * finish docs * Update convert_reformer_trax_checkpoint_to_pytorch.py * remove isort * include sams comments * remove wrong comment in utils * correct typos * fix typo * Update reformer.rst * applied morgans optimization * make style * make gpu compatible * remove bogus file * big test refactor * add example for chunking * fix typo * add to README	2020-05-07 10:17:01 +02:00
Julien Plu	aad50151f3	TF version of the trainer (#4017 ) * First commit to add a TF version of the trainer. * Make the TF trainer closer to what the PT trainer looks like * Refactoring common code between the PT and TF trainer into an util file. * Some bugfix + better similarity with the PT trainer * Add missing class in transformers init * Bugfix over prediction + use classification report instead of simple metrics * Fix name error * Fix optimization tests + style * Apply style * Several bugfixes for multi-gpu training * Apply style * Apply style * Add glue example for the TF trainer * Several bugfixes + address the reviews * Fix on the TF training args file * Add a debug mode * Bugfix in utils_ner.py when segment_ids is None * Apply style * Apply style * Add TPU strategy * Fix selection strategy	2020-05-06 12:56:52 -04:00
Lysandre Debut	79b1c6966b	Pytorch 1.5.0 (#3973 ) * Standard deviation can no longer be set to 0 * Remove torch pinned version * 9th instead of 10th, silly me	2020-05-05 10:23:01 -04:00
Patrick von Platen	8e67573a64	[EncoderDecoder Tests] Improve tests (#4046 ) * Hoist bert model tester for patric * indent * make tests work * Update tests/test_modeling_bert.py Co-authored-by: Julien Chaumond <chaumond@gmail.com> Co-authored-by: sshleifer <sshleifer@gmail.com> Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-04 02:18:36 +02:00
Sam Shleifer	18db92dd9a	[testing] add timeout_decorator (#3543 )	2020-05-01 09:05:47 -04:00
Julien Chaumond	f39217a5ec	[tests] Light cleanup of tempfile in tests/	2020-04-30 22:30:15 -04:00
Julien Chaumond	f54dc3f4d5	[ci] Load pretrained models into the default (long-lived) cache There's an inconsistency right now where: - we load some models into CACHE_DIR - and some models in the default cache - and often, in both for the same models When running the RUN_SLOW tests, this takes a lot of disk space, time, and bandwidth. I'd rather always use the default cache	2020-04-30 22:30:15 -04:00
Julien Chaumond	ab90353f1a	[cli] {login, upload, s3} display more helpful error messages	2020-04-30 12:51:06 -04:00
Julien Chaumond	452dd0e4d9	[ci] Align test_hf_api.py with API change	2020-04-30 12:06:01 -04:00
Sam Shleifer	2c77842887	[Fix common tests on GPU] send model, ids to torch_device (#4014 )	2020-04-29 09:47:20 -04:00
Sam Shleifer	847e7f3379	MarianMTModel.from_pretrained('Helsinki-NLP/opus-marian-en-de') (#3908 ) Co-Authored-By: Stefan Schweter <stefan@schweter.it>	2020-04-28 18:22:37 -04:00
Patrick von Platen	fa49b9afea	Clean Encoder-Decoder models with Bart/T5-like API and add generate possibility (#3383 ) * change encoder decoder style to bart & t5 style * make encoder decoder generation dummy work for bert * make style * clean init config in encoder decoder * add tests for encoder decoder models * refactor and add last tests * refactor and add last tests * fix attn masks for bert encoder decoder * make style * refactor prepare inputs for Bert * refactor * finish encoder decoder * correct typo * add docstring to config * finish * add tests * better naming * make style * fix flake8 * clean docstring * make style * rename	2020-04-28 15:11:09 +02:00
Lorenzo Ampil	f16540fcba	Pipeline for Text Generation: GenerationPipeline (#3758 ) * Add GenerationPipeline * Fix parameter names * Correct parameter __call__ parameters * Add model type attribute and correct function calls for prepare_input * Take out trailing commas from init attributes * Remove unnecessary tokenization line * Implement support for multiple text inputs * Apply generation support for multiple input text prompts * Take out tensor coersion * Take out batch index * Add text prompt to return sequence * Squeeze token tensore before decoding * Return only a single list of sequences if only one prompt was used * Correct results variable name * Add GenerationPipeline to SUPPORTED_TASKS with the alias , initalized w GPT2 * Registedred AutoModelWithLMHead for both pt and t * Update docstring for GenerationPipeline * Add kwargs parameter to mode.generate * Take out kwargs parameter after all * Add generation pipeline example in pipeline docstring * Fix max length by squeezing tokens tensor * Apply ensure_tensor_on_device to pytorch tensor * Include generation step in torch.no_grad * Take out input from prepare_xlm_input and set 'en' as default xlm_language * Apply framework specific encoding during prepare_input * Format w make style * Move GenerationPipeline import to follow proper import sorting * Take out training comma from generation dict * Apply requested changes * Change name to TextGenerationPipeline * Apply TextGenerationPipeline rename to __init___ * Changing alias to * Set input mapping as input to ensure_tensor_on_device * Fix assertion placement * Add test_text_generation * Add TextGenerationPipeline to PipelineCommonTests * Take out whitespace * Format __init__ w black * Fix __init__ style * Forman __init___ * Add line to end of __init__ * Correct model tokenizer set for test_text_generation * Ensure to return list of list, not list of string (to pass test) * Limit test models to only 3 to limit runtime to address circleCI timeout error * 
Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update tests/test_pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Remove argument docstring, __init__, add additional __call__ arguments, and reformat results to list of dict * Fix blank result list * Add TextGenerationPipeline to pipelines.rst * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Fix typos from adding PADDING_TEXT_TOKEN_LENGTH * Fix incorrectly moved result list * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py Co-Authored-By: Patrick von Platen 
<patrick.v.platen@gmail.com> * Add back generation line and make style * Take out blank whitespace * Apply new alis, text-generation, to test_pipelines * Fix text generation alias in test * Update src/transformers/pipelines.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-04-22 09:37:03 -04:00
Julien Chaumond	dd9d483d03	Trainer (#3800 ) * doc * [tests] Add sample files for a regression task * [HUGE] Trainer * Feedback from @sshleifer * Feedback from @thomwolf + logging tweak * [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes * [glue] Use default max_seq_length of 128 like before * [glue] move DataTrainingArguments around * [ner] Change interface of InputExample, and align run_{tf,pl} * Re-align the pl scripts a little bit * ner * [ner] Add integration test * Fix language_modeling with API tweak * [ci] Tweak loss target * Don't break console output * amp.initialize: model must be on right device before * [multiple-choice] update for Trainer * Re-align to `827d6d6ef0`	2020-04-21 20:11:56 -04:00
Thomas Wolf	827d6d6ef0	Cleanup fast tokenizers integration (#3706 ) * First pass on utility classes and python tokenizers * finishing cleanup pass * style and quality * Fix tests * Updating following @mfuntowicz comment * style and quality * Fix Roberta * fix batch_size/seq_length in BatchEncoding * add alignment methods + tests * Fix OpenAI and Transfo-XL tokenizers * adding trim_offsets=True default for GPT2 and RoBERTa * style and quality * fix tests * add_prefix_space in roberta * bump up tokenizers to rc7 * style * unfortunately tensorflow doesn't like these - removing shape/seq_len for now * Update src/transformers/tokenization_utils.py Co-Authored-By: Stefan Schweter <stefan@schweter.it> * Adding doc and docstrings * making flake8 happy Co-authored-by: Stefan Schweter <stefan@schweter.it>	2020-04-18 13:43:57 +02:00
Lysandre Debut	8b63a01d95	XLM tokenizer should encode with bos token (#3791 ) * XLM tokenizer should encode with bos token * Update tests	2020-04-17 11:28:55 -04:00
Patrick von Platen	1d4a35b396	Higher tolerance for past testing in TF T5 (#3844 )	2020-04-17 11:26:16 -04:00
Patrick von Platen	d13eca11e2	Higher tolerance for past testing in T5 (#3843 )	2020-04-17 11:25:14 -04:00
Pierric Cistac	6d00033e97	Question Answering support for Albert and Roberta in TF (#3812 ) * Add TFAlbertForQuestionAnswering * Add TFRobertaForQuestionAnswering * Update TFAutoModel with Roberta/Albert for QA * Clean `super` TF Albert calls	2020-04-17 10:45:30 -04:00
Patrick von Platen	baca8fa8e6	clean pipelines (#3795 )	2020-04-16 10:21:34 -04:00
Patrick von Platen	38f7461df3	[TFT5, Cache] Add cache to TFT5 (#3772 ) * correct gpt2 test inputs * make style * delete modeling_gpt2 change in test file * translate from pytorch * correct tests * fix conflicts * fix conflicts * fix conflicts * fix conflicts * make tensorflow t5 caching work * make style * clean reorder cache * remove unnecessary spaces * fix test	2020-04-16 16:14:52 +02:00
Patrick von Platen	01c37dcdb5	[Config, Caching] Remove `output_past` everywhere and replace by `use_cache` argument (#3734 ) * remove output_past from pt * make style * add optional input length for gpt2 * add use cache to prepare input * save memory in gpt2 * correct gpt2 test inputs * make past input optional for gpt2 * finish use_cache for all models * make style * delete modeling_gpt2 change in test file * correct docstring * correct is true statements for gpt2	2020-04-14 14:40:28 -04:00
Teven	352d5472b0	Shift labels internally within TransfoXLLMHeadModel when called with labels (#3716 ) * Shifting labels inside TransfoXLLMHead * Changed doc to reflect change * Updated pytorch test * removed IDE whitespace changes * black reformat Co-authored-by: TevenLeScao <teven.lescao@gmail.com>	2020-04-13 18:11:23 +02:00
Julien Chaumond	b169ac9c2b	[examples] Generate argparsers from type hints on dataclasses (#3669 ) * [examples] Generate argparsers from type hints on dataclasses * [HfArgumentParser] way simpler API * Restore run_language_modeling.py for easier diff * [HfArgumentParser] final tweaks from code review	2020-04-10 12:21:58 -04:00
Sam Shleifer	7a7fdf71f8	Multilingual BART - (#3602 ) - support mbart-en-ro weights - add MBartTokenizer	2020-04-10 11:25:39 -04:00
Patrick von Platen	ce2298fb5f	[T5, generation] Add decoder caching for T5 (#3682 ) * initial commit to add decoder caching for T5 * better naming for caching * finish T5 decoder caching * correct test * added extensive past testing for T5 * clean files * make tests cleaner * improve docstring * improve docstring * better reorder cache * make style * Update src/transformers/modeling_t5.py Co-Authored-By: Yacine Jernite <yjernite@users.noreply.github.com> * make set output past work for all layers * improve docstring * improve docstring Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>	2020-04-10 01:02:50 +02:00
LysandreJik	31baeed614	Update quotes cc @julien-c	2020-04-09 09:09:00 -04:00
Lysandre Debut	6435b9f908	Updating the TensorFlow models to work as expected with tokenizers v3.0.0 (#3684 ) * Updating modeling tf files; adding tests * Merge `encode_plus` and `batch_encode_plus`	2020-04-08 16:22:44 -04:00
Sam Shleifer	715aa5b135	[Bart] Replace config.output_past with use_cache kwarg (#3632 )	2020-04-07 19:08:26 -04:00
Sam Shleifer	0a4b1068e1	Speedup torch summarization tests (#3663 )	2020-04-07 14:01:30 -04:00
Funtowicz Morgan	96ab75b8dd	Tokenizers v3.0.0 (#3185 ) * Renamed num_added_tokens to num_special_tokens_to_add Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Make fast tokenizers unittests work on Windows. * Entirely refactored unittest for tokenizers fast. * Remove ABC class for CommonFastTokenizerTest * Added embeded_special_tokens tests from allenai @dirkgr * Make embeded_special_tokens tests from allenai more generic * Uniformize vocab_size as a property for both Fast and normal tokenizers * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin) * Ensure providing None input raise the same ValueError than Python tokenizer + tests. * Fix invalid input for assert_padding when testing batch_encode_plus * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter. * Ensure tokenize() correctly forward add_special_tokens to rust. * Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast. Avoid stripping on None values. * unittests ensure tokenize() also throws a ValueError if provided None * Added add_special_tokens unittest for all supported models. * Style * Make sure TransfoXL test run only if PyTorch is provided. * Split up tokenizers tests for each model type. * Fix invalid unittest with new tokenizers API. * Filter out Roberta openai detector models from unittests. * Introduce BatchEncoding on fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward. * Introduce BatchEncoding on slow tokenizers path. Backward compatibility. 
* Improve error message on BatchEncoding for slow path * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases. * Style and format. * Added typing on all methods for PretrainedTokenizerFast * Style and format * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast. * Style and format * encode_plus now supports pretokenized inputs. * Remove user warning about add_special_tokens when working on pretokenized inputs. * Always go through the post processor. * Added support for pretokenized input pairs on encode_plus * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError. * Added pretokenized inputs support on batch_encode_plus * Update BatchEncoding methods name to match Encoding. * Bump setup.py tokenizers dependency to 0.7.0rc1 * Remove unused parameters in BertTokenizerFast * Make sure Roberta returns token_type_ids for unittests. * Added missing typings * Update add_tokens prototype to match tokenizers side and allow AddedToken * Bumping tokenizers to 0.7.0rc2 * Added documentation for BatchEncoding * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods. * Added higher-level typing for tokenize / encode_plus / batch_encode_plus. * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers. * Fix text-classification pipeline using the wrong tokenizer * Make pipelines works with BatchEncoding * Turn off add_special_tokens on tokenize by default. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove add_prefix_space from tokenize call in unittest. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Style and quality Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Correct message for batch_encode_plus none input exception. 
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix invalid list comprehension for offset_mapping overriding content every iteration. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * TransfoXL uses Strip normalizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.7.0rc3 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Support AddedTokens for special_tokens and use left stripping on mask for Roberta. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * SpecilaTokenMixin can use slots to faster access to underlying attributes. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove update_special_tokens from fast tokenizers. * Ensure TransfoXL unittests are run only when torch is available. * Style. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Style * Style 🙏🙏 * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol. * Remove Roberta warning on __init__. * Move documentation to Google style. Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>	2020-04-07 00:29:15 +02:00
Patrick von Platen	2ee410560e	[Generate, Test] Split generate test function into beam search, no beam search (#3601 ) * split beam search and no beam search test * fix test * clean generate tests	2020-04-06 10:37:05 +02:00
Lysandre Debut	d5d7d88612	ELECTRA (#3257 ) * Electra wip * helpers * Electra wip * Electra v1 * ELECTRA may be saved/loaded * Generator & Discriminator * Embedding size instead of halving the hidden size * ELECTRA Tokenizer * Revert BERT helpers * ELECTRA Conversion script * Archive maps * PyTorch tests * Start fixing tests * Tests pass * Same configuration for both models * Compatible with base + large * Simplification + weight tying * Archives * Auto + Renaming to standard names * ELECTRA is uncased * Tests * Slight API changes * Update tests * wip * ElectraForTokenClassification * temp * Simpler arch + tests Removed ElectraForPreTraining which will be in a script * Conversion script * Auto model * Update links to S3 * Split ElectraForPreTraining and ElectraForTokenClassification * Actually test PreTraining model * Remove num_labels from configuration * wip * wip * From discriminator and generator to electra * Slight API changes * Better naming * TensorFlow ELECTRA tests * Accurate conversion script * Added to conversion script * Fast ELECTRA tokenizer * Style * Add ELECTRA to README * Modeling Pytorch Doc + Real style * TF Docs * Docs * Correct links * Correct model initialized * random fixes * style * Addressing Patrick's and Sam's comments * Correct links in docs	2020-04-03 14:10:54 -04:00
Yohei Tamura	8594dd80dd	BertJapaneseTokenizer accept options for mecab (#3566 ) * BertJapaneseTokenizer accept options for mecab * black * fix mecab_option to Option[str]	2020-04-03 11:12:19 -04:00
Patrick von Platen	a4ee4da18a	[T5, TF 2.2] change tf t5 argument naming (#3547 ) * change tf t5 argument naming for TF 2.2 * correct bug in testing	2020-04-01 22:04:20 +02:00
Patrick von Platen	b815edf69f	[T5, Tests] Add extensive hard-coded integration tests and make sure PT and TF give equal results (#3550 ) * add some t5 integration tests * finish summarization and translation integration tests for T5 - results look good * add tf test * fix == vs is bug * fix tf beam search error and make tf t5 tests pass	2020-04-01 18:01:33 +02:00
Patrick von Platen	b38d552a92	[Generate] Add bad words list argument to the generate function (#3367 ) * add bad words list * make style * add bad_words_tokens * make style * better naming * make style * fix typo	2020-03-31 18:42:31 +02:00
Sam Shleifer	8deff3acf2	[bart-tiny-random] Put a 5MB model on S3 to allow faster exampl… (#3488 )	2020-03-30 12:28:27 -04:00
Patrick von Platen	75ec6c9e3a	[T5] make decoder input ids optional for t5 training (#3521 ) * make decoder input ids optional for t5 training * lm_labels should not be shifted in t5 * add tests * finish shift right functionality for PT T5 * move shift right to correct class * cleaner code * replace -100 values with pad token id * add assert statement * remove unnecessary for loop * make style	2020-03-30 13:45:26 +02:00
Sam Shleifer	f6a23d1911	[BART] add bart-large-xsum weights (#3422 )	2020-03-29 10:51:13 -04:00
Sam Shleifer	3ee431dd4c	[Bart/Memory] Two separate, smaller decoder attention masks (#3371 )	2020-03-26 21:34:15 -04:00
Sam Shleifer	39371ee454	[Bart/Memory] don't create lm_head (#3323 ) * delete lm_head, skips weight tying * Fixed s3	2020-03-26 18:40:39 -04:00
sakares saengkaew	1a6c546c6f	Add missing token classification for XLM (#3277 ) * Add the missing token classification for XLM * fix styling * Add XLMForTokenClassification to AutoModelForTokenClassification class * Fix docstring typo for non-existing class * Add the missing token classification for XLM * fix styling * fix styling * Add XLMForTokenClassification to AutoModelForTokenClassification class * Fix docstring typo for non-existing class * Add missing description for AlbertForTokenClassification * fix styling * Add missing docstring for AlBert * Slow tests should be slow Co-authored-by: Sakares Saengkaew <s.sakares@gmail.com> Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>	2020-03-26 10:22:13 -04:00
Patrick von Platen	022e8fab97	Adds translation pipeline (#3419 ) * fix merge conflicts * add t5 summarization example * change parameters for t5 summarization * make style * add first code snippet for translation * only add prefixes * add prefix patterns * make style * renaming * fix conflicts * remove unused patterns * solve conflicts * fix merge conflicts * remove translation example * remove summarization example * make sure tensors are in numpy for float comparsion * re-add t5 config * fix t5 import config typo * make style * remove unused numpy statements * update doctstring * import translation pipeline	2020-03-26 13:50:58 +01:00
Patrick von Platen	9c683ef01e	Add t5 to pipeline(task='summarization') (#3413 ) * solve conflicts * move warnings below * incorporate changes * add pad_to_max_length to pipelines * add bug fix for T5 beam search * add prefix patterns * make style * fix conflicts * adapt pipelines for task specific parameters * improve docstring * remove unused patterns	2020-03-26 11:03:13 +01:00
Patrick von Platen	e392ba6938	Add camembert integration tests (#3375 ) * add integration tests for camembert * use jplu/tf-camembert fro the moment * make style	2020-03-24 10:18:37 +01:00
Patrick von Platen	95e00d0808	Clean special token init in modeling_....py (#3264 ) * make style * fix conflicts	2020-03-20 21:41:04 +01:00
Patrick von Platen	bbf26c4e61	Support T5 Generation (#3228 ) * fix conflicts * update bart max length test * correct spelling mistakes * implemented model specific encode function * fix merge conflicts * better naming * save intermediate state -> need to rethink strucuture a bit * leave tf problem as it is for now * current version * add layers.pop * remove ipdb * make style * clean return cut decoding * remove ipdbs * Fix restoring layers in the decoders that doesnt exists. * push good intermediate solution for now * fix conflicts * always good to refuse to merge conflicts when rebasing * fix small bug * improve function calls * remove unused file * add correct scope behavior for t5_generate Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>	2020-03-19 23:18:23 +01:00
Sam Shleifer	ad7233fc01	[BART] cleanup: remove redundant kwargs, improve docstrings (#3319 )	2020-03-19 11:16:51 -04:00
Lysandre Debut	d6afbd323d	XLM-R Tokenizer now passes common tests + Integration tests (#3198 ) * XLM-R now passes common tests + Integration tests * Correct mask index * Model input names * Style * Remove text preprocessing * Unneccessary import	2020-03-18 09:52:49 -04:00
Patrick von Platen	292186a3e7	Adding LM Head to Transfo-XL and first step to fixing problem with Adaptive Embeddings in TransfoXL (#3286 ) * first commit * work in progress * make language generation task pass * update to working version for LM * delete print * remove dead code * make style	2020-03-18 09:24:27 -04:00
Sam Shleifer	38a555a83c	Add Summarization to Pipelines (#3128 ) * passing * Undo stupid chg * docs * undo rename * delete-cruft * only import if you have torch * Dont rely on dict ordering * Fix dict ordering upstream * docstring link * docstring link * remove trailing comma for 3.5 compat * new name * delegate kwarging * Update kwargs	2020-03-17 18:04:21 -04:00
Patrick von Platen	e8f44af5bf	[generate] do_sample default back to False (#3298 ) * change do_samples back * None better default as boolean * adapt do_sample to True in test example * make style	2020-03-17 10:52:37 -04:00
Sam Shleifer	b2c1a447fe	[BART] Delete redundant unit test (#3302 )	2020-03-16 23:09:10 -04:00
Sam Shleifer	5ea8ba67b4	[BART] Remove unused kwargs (#3279 ) * Remove unused kwargs * dont call forward in tests	2020-03-15 23:00:44 -04:00
Thomas Wolf	3814e167d9	Merge pull request #3225 from patrickvonplaten/finalize_merge_bart_generate_into_default_generate Complete merge Seq-2-Seq generation into default generation	2020-03-14 15:08:59 +01:00
Sam Shleifer	2bd79e23de	[BART] FP16 testing fixes (#3266 )	2020-03-13 19:48:26 -04:00
Patrick von Platen	6a82f774f2	fix typo	2020-03-12 21:10:51 +01:00
Patrick von Platen	f1c71da115	fix eos_token_ids in test	2020-03-12 21:00:54 +01:00
Patrick von Platen	6047f46b19	re-add eos token to get good bart results	2020-03-12 20:17:50 +01:00
Patrick von Platen	ac303eae46	fix problem with half	2020-03-11 12:24:30 +01:00
Patrick von Platen	bc9d5d917c	make all tensors half precision	2020-03-11 12:15:38 +01:00
Patrick von Platen	a332cc9f7f	finalize generation merge	2020-03-11 11:53:36 +01:00
Patrick von Platen	7351a8dbaf	re-add scoring filtering	2020-03-11 11:06:56 +01:00
Patrick von Platen	374deef48d	fixed typo	2020-03-11 11:06:56 +01:00
patrickvonplaten	41b437ea3a	add draft version of propsoed changes for ROGUE score	2020-03-11 11:06:56 +01:00
patrickvonplaten	a5751f7578	fix bug with attention_mask as optional input argument	2020-03-11 11:06:56 +01:00
patrickvonplaten	d880a5fbde	finalized PR	2020-03-11 11:06:56 +01:00
patrickvonplaten	2acfe63964	best current version and make style	2020-03-11 11:06:56 +01:00
patrickvonplaten	c62444da39	fix conflicts	2020-03-11 11:06:56 +01:00
Patrick von Platen	77e6775065	add current changes	2020-03-11 11:06:56 +01:00
Patrick von Platen	421216997b	comment out stuff	2020-03-11 11:06:56 +01:00
Patrick von Platen	7a11e925cf	work in progress	2020-03-11 11:06:56 +01:00
Patrick von Platen	aceb3fbaf4	only do output_past=True for language generation in bart	2020-03-11 11:06:56 +01:00
Patrick von Platen	7cba11fb9b	better naming	2020-03-11 11:06:56 +01:00
Patrick von Platen	ff648221bd	fix conflicts	2020-03-11 11:06:56 +01:00
Patrick von Platen	c0d9dd3ba9	refactored code a bit and made more generic	2020-03-11 11:06:56 +01:00
Patrick von Platen	d8e2b3c547	fix conflicts	2020-03-11 11:06:56 +01:00
Patrick von Platen	31f2437f07	Merge pull request #3191 from patrickvonplaten/add_integration_tests_lm_generate_torch_tf Add integration tests lm generate torch tf	2020-03-10 11:29:17 +01:00
Julien Chaumond	cbf8f5d32b	[model upload] Support for organizations	2020-03-09 17:33:57 -04:00
Lysandre	525b6b1c54	TFQA pipeline marked as slow test	2020-03-09 16:52:30 -04:00
Lysandre Debut	5164ea91a7	Skipping outputs (#3116 ) * Minimal example * Proposal 2 * Proposal 2 for fast tokenizers * Typings * Docs * Revert "Docs" for easier review This reverts commit eaf0f97062e809887704a542144c537f769d5223. * Remove unnecessary assignments * Tests * Fix faulty type * Remove prints * return_outputs -> model_input_names * Revert "Revert "Docs" for easier review" This reverts commit 6fdc69408102bf695797f2dfddbb6350c6b9e722. * code quality	2020-03-09 13:48:58 -04:00

351 Commits