Commit Graph

405 Commits

Author SHA1 Message Date
Quentin Lhoest
4fedc1256c
Fix tests imports dpr (#5576)
* fix test imports

* fix max_length

* style

* fix tests
2020-07-07 16:35:12 +02:00
Sam Shleifer
d4886173b2
[Bart] enable test_torchscript, update test_tie_weights (#5457)
* Passing all but one torchscript test

* Style

* move comment

* remove unneeded assert
2020-07-07 10:06:48 -04:00
Quentin Lhoest
fbd8792195
Add DPR model (#5279)
* beginning of dpr modeling

* wip

* implement forward

* remove biencoder + better init weights

* export dpr model to embed model for nlp lib

* add new api

* remove old code

* make style

* fix dumb typo

* don't load bert weights

* docs

* docs

* style

* move the `k` parameter

* fix init_weights

* add pretrained configs

* minor

* update config names

* style

* better config

* style

* clean code based on PR comments

* change Dpr to DPR

* fix config

* switch encoder config to a dict

* style

* inheritance -> composition

* add messages in assert statements

* add dpr reader tokenizer

* one tokenizer per model

* fix base_model_prefix

* fix imports

* typo

* add convert script

* docs

* change tokenizers conf names

* style

* change tokenizers conf names

* minor

* minor

* fix wrong names

* minor

* remove unused convert functions

* rename convert script

* use return_tensors in tokenizers

* remove n_questions dim

* move generate logic to tokenizer

* style

* add docs

* docs

* quality

* docs

* add tests

* style

* add tokenization tests

* DPR full tests

* Stay true to the attention mask building

* update docs

* missing param in bert input docs

* docs

* style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-07-07 08:56:12 -04:00
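The DPR addition splits the model into question/context encoders and a reader, each with its own tokenizer and pretrained config. A minimal sketch of the question-encoder side; the checkpoint identifier is assumed from the published DPR checkpoints and may differ from what the PR registers:

```python
import torch
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

# Assumed checkpoint name; the PR adds one tokenizer class per DPR model.
name = "facebook/dpr-question_encoder-single-nq-base"
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(name)
model = DPRQuestionEncoder.from_pretrained(name)

inputs = tokenizer("What is dense passage retrieval?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

question_embedding = outputs[0]   # pooled question embedding, shape (1, hidden_size)
print(question_embedding.shape)
```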
Abel
6912265711
Make T5 compatible with ONNX (#5518)
* Default decoder inputs to encoder ones for T5 if neither are specified.

* Fixing typo, now all tests are passing.

* Changing einsum to operations supported by onnx

* Adding a test to ensure T5 can be exported to ONNX with opset > 9

* Modified test for onnx export to make it faster

* Styling changes.

* Styling changes.

* Changing notation for matrix multiplication

Co-authored-by: Abel Riboulot <tkai@protomail.com>
2020-07-07 11:32:29 +02:00
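Per the first bullet, tracing T5 no longer requires explicit decoder inputs, which is what makes a plain ONNX export possible. A rough sketch of the kind of export the new test exercises; the exact call in the test may differ, and disabling `use_cache` here is an assumption made to keep the traced outputs flat:

```python
import torch
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5Model.from_pretrained("t5-small")
model.config.use_cache = False   # assumption: avoid nested past-state outputs in the traced graph
model.eval()

input_ids = tokenizer("translate English to German: Hello there", return_tensors="pt").input_ids

# Decoder inputs are omitted on purpose: per this PR they default to the encoder inputs.
torch.onnx.export(
    model,
    (input_ids,),
    "t5-small.onnx",
    opset_version=10,                    # the test targets opset > 9
    input_names=["input_ids"],
    output_names=["last_hidden_state"],
)
```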
Patrick von Platen
989ae326b5
[Reformer] Adapt Reformer MaskedLM Attn mask (#5560)
* fix attention mask

* fix slow test

* refactor attn masks

* fix fp16 generate test
2020-07-07 10:48:06 +02:00
Shashank Gupta
3dcb748e31
Added data collator for permutation (XLNet) language modeling and related calls (#5522)
* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is handled
similarly to `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain working. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues
2020-07-07 10:17:37 +02:00
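A hedged sketch of how the new collator feeds XLNetLMHeadModel; `plm_probability` and `max_span_length` follow the PR description, and the values used here are assumed defaults:

```python
from transformers import (
    XLNetTokenizer,
    XLNetLMHeadModel,
    DataCollatorForPermutationLanguageModeling,
)

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

collator = DataCollatorForPermutationLanguageModeling(
    tokenizer=tokenizer,
    plm_probability=1 / 6,   # fraction of tokens selected for prediction (assumed default)
    max_span_length=5,       # maximum length of a masked span (assumed default)
)

# The collator takes pre-tokenized examples; even sequence lengths are expected
# (odd-length handling is what the dedicated test above covers).
enc = tokenizer(
    "Permutation language modeling example.",
    padding="max_length", max_length=16, return_tensors="pt",
)
batch = collator([enc["input_ids"].squeeze(0)])  # input_ids, perm_mask, target_mapping, labels

outputs = model(**batch)
print(outputs[0])   # permutation LM loss
```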
Anthony MOI
5787e4c159
Various tokenizers fixes (#5558)
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
2020-07-06 18:27:53 -04:00
Sam Shleifer
58cca47c16
[cleanup] TF T5 tests only init t5-base once. (#5410) 2020-07-03 14:27:49 -04:00
Lysandre Debut
17ade127b9
Exposing prepare_for_model for both slow & fast tokenizers (#5479)
* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-07-03 16:51:21 +02:00
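With `prepare_for_model` public on both tokenizer flavours, already-converted id sequences can be turned into model-ready inputs without re-tokenizing. A small sketch using the updated truncation/padding defaults:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Ids produced elsewhere (here simply via encode without special tokens).
ids = tokenizer.encode("Hello world", add_special_tokens=False)
pair_ids = tokenizer.encode("A second sequence", add_special_tokens=False)

encoded = tokenizer.prepare_for_model(
    ids,
    pair_ids=pair_ids,
    add_special_tokens=True,   # adds [CLS]/[SEP]
    truncation=True,
    max_length=16,
    padding="max_length",
)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```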
Teven
6726416e4a
Changed expected_output_ids in TransfoXL generation test (#5462)
* Changed expected_output_ids in TransfoXL generation test to match #4826 generation PR.

* making black happy

* making isort happy
2020-07-02 11:56:44 +02:00
Patrick von Platen
d16e36c7e5
[Reformer] Add Masked LM Reformer (#5426)
* fix conflicts

* fix

* happy rebasing
2020-07-01 22:43:18 +02:00
Joe Davison
35befd9ce3
Fix tensor label type inference in default collator (#5250)
* allow tensor label inputs to default collator

* replace try/except with type check
2020-07-01 10:40:14 -06:00
Patrick von Platen
fe81f7d12c
finish reformer qa head (#5433) 2020-07-01 12:27:14 -04:00
Patrick von Platen
d697b6ca75
[Longformer] Major Refactor (#5219)
* refactor naming

* add small slow test

* refactor

* refactor naming

* rename selected to extra

* big global attention refactor

* make style

* refactor naming

* save intermed

* refactor functions

* finish function refactor

* fix tests

* fix longformer

* fix longformer

* fix longformer

* fix all tests but one

* finish longformer

* address sams and izs comments

* fix transpose
2020-07-01 17:43:32 +02:00
Sam Shleifer
e0d58ddb65
[fix] Marian tests import (#5442) 2020-07-01 11:42:22 -04:00
Funtowicz Morgan
608d5a7c44
Raises PipelineException on FillMaskPipeline when there are != 1 mask_token in the input (#5389)
* Added PipelineException

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fill-mask pipeline raises exception when more than one mask_token detected.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Put everything in a function.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added tests on pipeline fill-mask when input has != 1 mask_token

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix numel() computation for TF

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing PR comments.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove function typing to avoid import on specific framework.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Retry typing with @julien-c tip.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality².

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify fill-mask mask_token checking.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI
2020-07-01 17:27:47 +02:00
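A hedged illustration of the new behaviour: inputs with anything other than exactly one mask token now raise `PipelineException` instead of failing further down. The model is pinned explicitly for reproducibility:

```python
from transformers import pipeline
from transformers.pipelines import PipelineException

fill_mask = pipeline("fill-mask", model="distilroberta-base")

# Exactly one mask token: works as before.
print(fill_mask(f"The capital of France is {fill_mask.tokenizer.mask_token}.")[0])

# No mask token (or more than one): now rejected up front.
try:
    fill_mask("This input has no mask token at all.")
except PipelineException as exc:
    print("rejected:", exc)
```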
Sam Shleifer
43cb03a93d
MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182) 2020-07-01 10:32:50 -04:00
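`prepare_translation_batch` keeps its signature but now builds the batch through the new tokenizer API. A sketch, with the checkpoint name assumed:

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"   # assumed English-to-German checkpoint
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer.prepare_translation_batch(src_texts=["I love translation."])
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```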
Sam Shleifer
13deb95a40
Move tests/utils.py -> transformers/testing_utils.py (#5350) 2020-07-01 10:31:17 -04:00
Sam Shleifer
32d2031458
[fix] slow fill_mask test failure (#5406) 2020-06-30 15:28:15 -04:00
Patrick von Platen
4bcc35cd69
[Docs] Benchmark docs (#5360)
* first doc version

* add benchmark docs

* fix typos

* improve README

* Update docs/source/benchmarks.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* fix naming and docs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-29 16:08:57 +02:00
Sam Shleifer
28a690a80e
[mBART] skip broken forward pass test, stronger integration test (#5327) 2020-06-28 15:08:28 -04:00
Sam Shleifer
393b8dc09a
examples/seq2seq/run_eval.py fixes and docs (#5322) 2020-06-26 19:20:43 -04:00
Thomas Wolf
601d4d699c
[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)
* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
2020-06-26 19:48:14 +02:00
Sam Shleifer
798dbff6a7
[pipelines] Change summarization default to distilbart-cnn-12-6 (#5289) 2020-06-26 11:43:23 -04:00
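A quick sketch of the summarization pipeline; the checkpoint is pinned explicitly to the new default named in the PR title, so the snippet behaves the same whichever default ships:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = (
    "The Transformers library provides thousands of pretrained models to perform tasks "
    "on text such as classification, information extraction, question answering, "
    "summarization, translation and text generation."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```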
Funtowicz Morgan
135791e8ef
Add pad_to_multiple_of on tokenizers (reimport) (#5054)
* Add new parameter `pad_to_multiple_of` on tokenizers.

* unittest for pad_to_multiple_of

* Add .name when logging enum.

* Fix missing .items() on dict in tests.

* Add special check + warning if the tokenizer doesn't have proper pad_token.

* Use the correct logger format specifier.

* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.

* Skip test if tokenizer doesn't have pad_token

* Fix RobertaTokenizer on empty input

* Format.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix and updating to simpler API

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-06-26 11:55:57 +02:00
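`pad_to_multiple_of` is mostly useful for hardware that prefers dimensions in fixed multiples (e.g. Tensor Cores). A sketch of how it composes with ordinary batch padding; the keyword placement follows the PR description:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    ["short", "a slightly longer example sentence"],
    padding=True,            # pad to the longest sequence in the batch...
    pad_to_multiple_of=8,    # ...then round that length up to a multiple of 8
    return_tensors="pt",
)
print(encoded["input_ids"].shape)   # sequence dimension is a multiple of 8
```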
Lysandre Debut
364a5ae1f0
Refactor Code samples; Test code samples (#5036)
* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rest of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-06-25 16:46:00 -04:00
Thomas Wolf
315f464b0a
[tokenizers] Several small improvements and bug fixes (#5287)
* avoid recursion in id checks for fast tokenizers

* better typings and fix #5232

* align slow and fast tokenizers behaviors for Roberta and GPT2

* style and quality

* fix tests - improve typings
2020-06-25 22:17:14 +02:00
Thomas Wolf
27cf1d97f0
[Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252)
* fix-5181

Padding to the max sequence length while truncating to another length was broken on slow tokenizers

* clean up and fix #5155

* fix XLM test

* Fix tests for Transfo-XL

* logging only above WARNING in tests

* switch slow tokenizers tests in @slow

* fix Marian truncation tokenization test

* style and quality

* make the test a lot faster by limiting the sequence length used in tests
2020-06-25 17:24:28 +02:00
Thomas Wolf
7ac9110711
Add more tests on tokenizers serialization - fix bugs (#5056)
* update tests for fast tokenizers + fix small bug in saving/loading

* better tests on serialization

* fixing serialization

* comment cleanup
2020-06-24 21:53:08 +02:00
Lysandre Debut
cf10d4cfdd
Cleaning TensorFlow models (#5229)
* Cleaning TensorFlow models

Update all classes


Style

* Don't average loss
2020-06-24 11:37:20 -04:00
Patrick von Platen
c2a26ec8a6
[Use cache] Align logic of use_cache with output_attentions and output_hidden_states (#5194)
* fix use cache

* add bart use cache

* fix bart

* finish bart
2020-06-24 16:09:17 +02:00
Patrick von Platen
9fe09cec76
[Benchmark] Extend Benchmark to all model type extensions (#5241)
* add benchmark for all kinds of models

* improved import

* delete bogus files

* make style
2020-06-24 15:11:42 +02:00
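A minimal sketch of the benchmark utilities these benchmark PRs build up; argument names follow the documented `PyTorchBenchmarkArguments` fields:

```python
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["bert-base-uncased"],    # the PR extends support to all model types
    batch_sizes=[8],
    sequence_lengths=[64, 128],
    inference=True,                  # benchmark forward passes only
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()            # prints time and memory tables
```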
Sam Shleifer
58918c76f4
[bart] add config.extra_pos_embeddings to facilitate reuse (#5190) 2020-06-23 11:35:42 -04:00
Thomas Wolf
11fdde0271
Tokenizers API developments (#5103)
* Add return lengths

* make pad a bit more flexible so it can be used as collate_fn

* check all kwargs sent to encoding method are known

* fixing kwargs in encodings

* New AddedToken class in python

This class lets you specify specific tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how whitespace is stripped around special tokens.

* style and quality

* switched to the huggingface tokenizers library for AddedTokens

* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state

* style and quality

* do not raise an error on additional or unused kwargs for tokenize() but only a warning

* transfo-xl pretrained model requires torch

* Update src/transformers/tokenization_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-23 13:36:57 +02:00
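The `AddedToken` class mentioned above gives per-token control over whitespace stripping and single-word matching when registering extra tokens. A hedged sketch; the class is taken from the huggingface tokenizers package, per the later bullet:

```python
from tokenizers import AddedToken
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# lstrip=True strips the whitespace to the left of the token when it is matched.
entity_token = AddedToken("<ent>", lstrip=True, rstrip=False, single_word=False)
tokenizer.add_tokens([entity_token])

print(tokenizer.tokenize("An entity: <ent> appears here."))
```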
Sam Shleifer
5144104070
[fix] remove unused import (#5206) 2020-06-22 23:39:04 -04:00
Sam Shleifer
0d158e38c9
[fix] mobilebert had wrong path, causing slow test failure (#5205) 2020-06-22 23:31:36 -04:00
Thomas Wolf
ebc36108dc
[tokenizers] Fix #5081 and improve backward compatibility (#5125)
* fix #5081 and improve backward compatibility (slightly)

* add nlp to setup.cfg - style and quality

* align default to previous default

* remove test that doesn't generalize
2020-06-22 17:25:43 +02:00
Joseph Liu
f4e1f02210
Output hidden states (#4978)
* Configure all models to use output_hidden_states as argument passed to forward()

* Pass all tests

* Remove cast_bool_to_primitive in TF Flaubert model

* correct tf xlnet

* add pytorch test

* add tf test

* Fix broken tests

* Configure all models to use output_hidden_states as argument passed to forward()

* Pass all tests

* Remove cast_bool_to_primitive in TF Flaubert model

* correct tf xlnet

* add pytorch test

* add tf test

* Fix broken tests

* Refactor output_hidden_states for mobilebert

* Reset and remerge to master

Co-authored-by: Joseph Liu <joseph.liu@coinflex.com>
Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2020-06-22 10:10:45 -04:00
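After this refactor, hidden-state output is requested per forward call instead of only through the config. A short sketch:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hidden states on demand.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs[-1]   # tuple: embedding output plus one tensor per layer
print(len(hidden_states), hidden_states[0].shape)
```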
RafaelWO
b99ad457f4
Added feature to move added tokens in vocabulary for Transformer-XL (#4953)
* Fixed resize_token_embeddings for transfo_xl model

* Fixed resize_token_embeddings for transfo_xl.

Added custom methods to TransfoXLPreTrainedModel for resizing layers of
the AdaptiveEmbedding.

* Updated docstring

* Fixed resizing cutoffs; added check for new size of embedding layer.

* Added test for resize_token_embeddings

* Fixed code quality

* Fixed unchanged cutoffs in model.config

* Added feature to move added tokens in tokenizer.

* Fixed code quality

* Added feature to move added tokens in tokenizer.

* Fixed code quality

* Fixed docstring, renamed sym to token.

Co-authored-by: Rafael Weingartner <rweingartner.its-b2015@fh-salzburg.ac.at>
2020-06-22 15:40:52 +02:00
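The Transformer-XL work above is what makes the usual add-tokens-then-resize pattern safe with the AdaptiveEmbedding (the new helper for moving added tokens inside the vocabulary is not shown). A sketch:

```python
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

num_added = tokenizer.add_tokens(["<custom_token>"])
print("added", num_added, "token(s)")

# Resizes the adaptive embedding layers to the new vocabulary size.
model.resize_token_embeddings(len(tokenizer))
```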
Patrick von Platen
fa0be6d761
Benchmarks (#4912)
* finish benchmark

* fix isort

* fix setup cfg

* retab

* fix time measuring of tf graph mode

* fix tf cuda

* clean code

* better error message
2020-06-22 12:06:56 +02:00
Vasily Shamporov
9a3f91088c
Add MobileBert (#4901)
* Add MobileBert

* Quality + Conversion script

* style

* Update src/transformers/modeling_mobilebert.py

* Links to S3

* Style

* TFMobileBert

Slight fixes to the pytorch MobileBert
Style

* MobileBertForMaskedLM (PT + TF)

* MobileBertForNextSentencePrediction (PT + TF)

* MobileBertFor{MultipleChoice, TokenClassification} (PT + TF)

* Tests + Auto

* Doc

* Tests

* Addressing @sgugger's comments

* Addressing @patrickvonplaten's comments

* Style

* Style

* Integration test

* style

* Model card

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-19 16:38:36 -04:00
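A hedged sketch of the new MobileBERT classes with the masked-LM head; the checkpoint identifier is assumed from the published MobileBERT weights:

```python
import torch
from transformers import MobileBertForMaskedLM, MobileBertTokenizer

name = "google/mobilebert-uncased"   # assumed checkpoint name
tokenizer = MobileBertTokenizer.from_pretrained(name)
model = MobileBertForMaskedLM.from_pretrained(name)

inputs = tokenizer(f"The capital of France is {tokenizer.mask_token}.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]

# Position of the mask token, then the highest-scoring replacement for it.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))
```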
Sam Shleifer
84be482f66
AutoTokenizer supports mbart-large-en-ro (#5121) 2020-06-18 20:47:37 -04:00
Sylvain Gugger
5f721ad6e4
Fix #5114 (#5122) 2020-06-18 19:20:04 -04:00
Deniz
32e94cff64
tf add resize_token_embeddings method (#4351)
* resize token embeddings

* add tokens

* add tokens

* add tokens

* add t5 token method

* add t5 token method

* add t5 token method

* typo

* debugging input

* debugging input

* debug

* debug

* debug

* trying to set embedding tokens properly

* set embeddings for generation head too

* set embeddings for generation head too

* debugging

* debugging

* enable generation

* add base method

* add base method

* add base method

* return logits in the main call

* reverting to generation

* revert back

* set embeddings for the bert main layer

* description

* fix conflicts

* logging

* set base model as self

* refactor

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* v0

* v0

* finalize

* final

* black

* add tests

* revert back the emb call

* comments

* comments

* add the second test

* add vocab size config

* add tf models

* add tf models. add common tests

* remove model specific embedding tests

* stylish

* remove files

* stylez

* Update src/transformers/modeling_tf_transfo_xl.py

change the error.

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* adding unchanged weight test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-18 18:41:26 -04:00
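The TensorFlow counterpart mirrors the PyTorch behaviour: new rows are appended to the embedding matrix while existing weights stay untouched (the last bullet adds a test for exactly that). A hedged sketch:

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["<new_tok_1>", "<new_tok_2>"])
model.resize_token_embeddings(len(tokenizer))

print(model.config.vocab_size)   # should now match len(tokenizer)
```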
Suraj Patil
ca2d0f98c4
ElectraForMultipleChoice (#4954)
* add ElectraForMultipleChoice

* add  test_for_multiple_choice

* add ElectraForMultipleChoice in auto model

* add ElectraForMultipleChoice in all_model_classes

* add SequenceSummary related parameters

* get rid pooler, use SequenceSummary instead

* add electra multiple choice test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-18 14:59:35 -04:00
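Multiple-choice heads take inputs of shape (batch, num_choices, seq_len). A sketch of the new Electra head; the checkpoint name is assumed:

```python
import torch
from transformers import ElectraForMultipleChoice, ElectraTokenizer

name = "google/electra-small-discriminator"   # assumed checkpoint
tokenizer = ElectraTokenizer.from_pretrained(name)
model = ElectraForMultipleChoice.from_pretrained(name)

prompt = "The weather today is"
choices = ["sunny and warm.", "a database index."]

enc = tokenizer([prompt, prompt], choices, padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}   # -> (batch=1, num_choices=2, seq_len)
labels = torch.tensor([0])                             # the first choice is the right one

outputs = model(**inputs, labels=labels)
loss, logits = outputs[:2]
print(logits.shape)   # (1, 2)
```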
Sylvain Gugger
20fa828984
Make default_data_collator more flexible and deprecate old behavior (#5060)
* Make default_data_collator more flexible

* Accept tensors for all features

* Document code

* Refactor

* Formatting
2020-06-17 15:24:51 -04:00
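A small sketch of the more flexible collator: features may carry plain ints or tensors, and the `label` key is renamed to `labels` for the model (the values here are illustrative):

```python
import torch
from transformers import default_data_collator

features = [
    {"input_ids": torch.tensor([101, 2023, 102]), "label": 1},
    {"input_ids": torch.tensor([101, 2003, 102]), "label": 0},
]
batch = default_data_collator(features)
print(batch["input_ids"].shape)   # torch.Size([2, 3])
print(batch["labels"])            # tensor([1, 0])
```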
Sam Shleifer
3d495c61ef
Fix marian tokenizer save pretrained (#5043) 2020-06-16 09:48:19 -04:00
Amil Khare
c852036b4a
[cleanup] Hoist ModelTester objects to top level (#4939)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-06-16 08:03:43 -04:00
Funtowicz Morgan
9e03364999
Ability to pickle/unpickle BatchEncoding pickle (reimport) (#5039)
* Added is_fast property on BatchEncoding to indicate if the object comes from a Fast Tokenizer.

* Added __getstate__() & __setstate__() to make BatchEncoding picklable.

* Correct tokens() return type from List[int] to List[str]

* Added unittest for BatchEncoding pickle/unpickle

* Added unittest for BatchEncoding is_fast

* More careful checking on BatchEncoding unpickle tests.

* Formatting.

* is_fast should assertTrue on Rust tokenizers.

* Ensure tensorflow has correct way of checking array_equal

* More formatting.
2020-06-16 09:25:25 +02:00
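A short sketch of the two behaviours the unittests above cover, `is_fast` and pickling a `BatchEncoding`:

```python
import pickle
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer("BatchEncoding can be pickled now.")

print(encoding.is_fast)     # True: produced by a Rust-backed (fast) tokenizer
print(encoding.tokens())    # token strings, i.e. List[str] per the fix above

restored = pickle.loads(pickle.dumps(encoding))
print(restored["input_ids"] == encoding["input_ids"])   # True
```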
Sylvain Gugger
f9f8a5312e
Add DistilBertForMultipleChoice (#5032)
* Add `DistilBertForMultipleChoice`
2020-06-15 18:31:41 -04:00