* Added data collator for XLNet language modeling and related calls
Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.
Resolves: #4739, #2008 (partially)
* Changed name to `DataCollatorForPermutationLanguageModeling`
Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is handled similarly to `mems` for XLNet).
Changed calls and imports appropriately.
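For illustration, a minimal sketch of wiring the renamed collator into this kind of training setup (the checkpoint name is an example, not part of the change):

```python
from transformers import (
    DataCollatorForPermutationLanguageModeling,
    XLNetLMHeadModel,
    XLNetTokenizer,
)

# Checkpoint name is illustrative; any XLNet LM-head checkpoint works the same way.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

# `plm_probability` mirrors the new `--plm_probability` flag of the example script.
data_collator = DataCollatorForPermutationLanguageModeling(
    tokenizer=tokenizer,
    plm_probability=1 / 6,  # ratio of tokens to mask for permutation LM prediction
    max_span_length=5,      # maximum length of each masked span
)
```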
* Added detailed comments, changed variable names
Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain how it works. Also cleaned up variable names and made them more informative.
* Added tests for new data collator
Added tests in `tests/test_trainer.py` for `DataCollatorForPermutationLanguageModeling`, based on those for `DataCollatorForLanguageModeling`. A specific test has been added to check the handling of odd-length sequences (sketched below).
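Roughly, the odd-length check can be exercised like this (a sketch; the exact exception type and batch format follow the current collator and may differ):

```python
import pytest
import torch

from transformers import DataCollatorForPermutationLanguageModeling, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
collator = DataCollatorForPermutationLanguageModeling(tokenizer=tokenizer)

# The collator requires even-length sequences to build its permutation mask,
# so an odd-length example is expected to be rejected.
odd_length_batch = [torch.randint(5, 100, (7,), dtype=torch.long)]
with pytest.raises(ValueError):
    collator(odd_length_batch)
```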
* Fixed styling issues
* Exposing prepare_for_model for both slow & fast tokenizers
* Update method signature
* The traditional style commit
* Hide the warnings behind the verbose flag
* update default truncation strategy and prepare_for_model
* fix tests and prepare_for_model methods
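As a quick sketch of what is now available on both tokenizer classes (checkpoint and inputs are illustrative):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Start from raw ids, then let prepare_for_model add special tokens,
# truncate and pad — the same entry point for slow and fast tokenizers.
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
encoded = tokenizer.prepare_for_model(
    ids,
    add_special_tokens=True,
    truncation=True,
    padding="max_length",
    max_length=8,
)
print(encoded["input_ids"])  # e.g. [101, 7592, 2088, 102, 0, 0, 0, 0]
```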
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* Make QA pipeline support models with more than 2 outputs, such as BART, assuming start/end logits are the first two outputs.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
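The gist of the change, sketched as a hypothetical helper (name is ours, not the pipeline's actual code):

```python
def extract_qa_logits(model_outputs):
    """Keep only the start/end logits from a model's output tuple."""
    # Models such as BART return extra tensors after the QA logits; slicing
    # the first two outputs keeps the pipeline working for them as well.
    start_logits, end_logits = model_outputs[:2]
    return start_logits, end_logits
```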
* When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length.
This results in every sample going through the QA pipeline being of size 384, whatever the actual input size, making the overall pipeline very slow.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
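For context, a small illustration of the two padding behaviors (checkpoint name is an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

# padding="max_length" always pads up to max_length...
fixed = tokenizer("Who?", "A short context.", padding="max_length", max_length=384)
print(len(fixed["input_ids"]))  # 384, regardless of the actual input size

# ...whereas padding="longest" only pads up to the longest sample in the batch.
dynamic = tokenizer(["Who?", "And why is that?"], padding="longest")
print(len(dynamic["input_ids"][0]))  # a handful of tokens, not 384
```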
* Mask padding & question before applying softmax. Softmax has been refactored to operate in log space for speed and stability.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
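In spirit, the masking and log-space softmax look like this (a standalone sketch, not the pipeline's exact code):

```python
import numpy as np

def masked_log_softmax(logits, valid_mask):
    """Drop padding/question positions, then normalize in log space."""
    logits = np.where(valid_mask, logits, -np.inf)  # invalid positions never win
    shifted = logits - logits.max()                 # subtract max for stability
    return shifted - np.log(np.exp(shifted).sum())  # log-softmax in one pass
```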
* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Use PaddingStrategy.LONGEST instead of DO_NOT_PAD
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Revert "When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length."
This reverts commit 1b00a9a2
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Trigger CI after unattended failure
* Trigger CI
* Work on tokenizer summary
* Finish tutorial
* Link to it
* Apply suggestions from code review
Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Add vocab definition
Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Added PipelineException
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* fill-mask pipeline raises an exception when more than one mask_token is detected.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
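A hedged example of the new behavior (checkpoint name and message are illustrative):

```python
from transformers import pipeline
from transformers.pipelines import PipelineException

fill_mask = pipeline("fill-mask", model="distilroberta-base")

try:
    fill_mask("Paris is the <mask> of <mask>.")  # two mask tokens
except PipelineException as exc:
    print(exc)  # the pipeline expects exactly one mask_token in the input
```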
* Put everything in a function.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Added tests on the fill-mask pipeline when the input has != 1 mask_token
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
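The rough shape of these tests (the `fill_mask` fixture here is hypothetical):

```python
import pytest
from transformers.pipelines import PipelineException

def test_fill_mask_rejects_wrong_mask_count(fill_mask):
    with pytest.raises(PipelineException):
        fill_mask("No mask token at all.")           # zero masks
    with pytest.raises(PipelineException):
        fill_mask("<mask> next to another <mask>.")  # two masks
```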
* Fix numel() computation for TF
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
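The underlying issue is that TF tensors have no `numel()`; a hedged sketch of a framework-agnostic count (helper name is ours):

```python
import numpy as np

def num_elements(tensor):
    """Element count that works for both PyTorch and TensorFlow tensors."""
    if hasattr(tensor, "numel"):       # PyTorch
        return tensor.numel()
    return int(np.prod(tensor.shape))  # TensorFlow (static shapes) / NumPy
```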
* Addressing PR comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Remove function typing to avoid importing a specific framework.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Quality.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Retry typing with @julien-c tip.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Quality².
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Simplify fill-mask mask_token checking.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Trigger CI
* Add support for past states
* Style and forgotten self
* You mean, documenting is not enough? I have to actually add it too?
* Add memory support during evaluation
* Fix tests in eval and add TF support
* No need to change this line anymore
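A minimal sketch of what carrying memory across evaluation batches looks like (helper name and `return_dict` usage are ours; the output attribute follows XLNet's `mems`, while CTRL/GPT-style models use `past` instead):

```python
import torch

def evaluate_with_mems(model, eval_dataloader):
    """Hypothetical helper: thread XLNet-style `mems` through an eval loop."""
    model.eval()
    mems = None
    for batch in eval_dataloader:
        if mems is not None:
            batch["mems"] = mems              # reuse states from the last batch
        with torch.no_grad():
            outputs = model(**batch, return_dict=True)
        mems = getattr(outputs, "mems", None)  # updated memory for the next step
```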