Commit Graph

15053 Commits

Lysandre
1abd53b1aa Patch ALBERT with heads in TensorFlow 2020-02-19 18:24:40 -05:00
Funtowicz Morgan
e676764241
Override build_inputs_with_special_tokens for fast tokenizers (#2912)
* Override build_inputs_with_special_tokens for fast impl + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality + format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-19 16:09:51 -05:00
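A quick sketch of what the overridden method does (checkpoint name and printed output are illustrative, assuming a standard BERT vocabulary):

```python
from transformers import BertTokenizerFast

# build_inputs_with_special_tokens wraps raw token ids with the model's
# special tokens ([CLS] ... [SEP] in BERT's case).
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
ids = tok.convert_tokens_to_ids(tok.tokenize("hello world"))
with_special = tok.build_inputs_with_special_tokens(ids)
print(tok.convert_ids_to_tokens(with_special))
# ['[CLS]', 'hello', 'world', '[SEP]']
```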
Lysandre
59c23ad9c9 README link + better instructions for release 2020-02-19 11:57:17 -05:00
Lysandre
22b2b5790e Documentation v2.5.0 2020-02-19 11:53:30 -05:00
Lysandre
fb560dcb07 Release: v2.5.0
Welcome Rust Tokenizers
2020-02-19 11:46:19 -05:00
Funtowicz Morgan
3f3fa7f7da
Integrate fast tokenizers library inside transformers (#2674)
* Implemented fast version of tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumped tokenizers version requirements to latest 0.2.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added matching tests

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching OpenAI GPT tokenization!

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching GPT2 on tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose add_prefix_space as a constructor parameter for GPT2 (see the sketch below)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
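A hedged sketch of the new parameter in use (checkpoint name illustrative; kwargs passed to from_pretrained are forwarded to the tokenizer constructor):

```python
from transformers import GPT2TokenizerFast

# GPT-2's byte-level BPE treats a leading space as part of the token, so a
# sentence-initial word normally tokenizes differently from the same word
# mid-sentence. add_prefix_space=True prepends a space to avoid that.
tok = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
print(tok.tokenize("hello"))  # tokenized as if it followed a space
```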

* Matching Roberta tokenization!

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed fast implementation of CTRL.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Binding TransformerXL tokenizers to Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Updating tests accordingly.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added tokenizers as top-level modules.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black & isort.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rename LookupTable to WordLevel to match the Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use "fast" suffix instead of "ru" for rust tokenizers implementations.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Introduce tokenize() method on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus dispatches to batch_encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* batch_encode_plus now dispatches to encode if there is only one input element.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind all the encode_plus parameters to the forwarded batch_encode_plus call (see the sketch below).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
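A minimal sketch of the dispatch described in the three bullets above (simplified, illustrative signatures; the real methods take many more parameters):

```python
class FastTokenizerSketch:
    def encode_plus(self, text, **kwargs):
        # Single-input path: forward every parameter by name to the
        # batched implementation with a batch of one.
        return self.batch_encode_plus([text], **kwargs)

    def batch_encode_plus(self, texts, **kwargs):
        # Batched path: encode each input; a single-element batch can be
        # unwrapped so callers get an encode-like result back.
        encodings = [self._encode_one(t, **kwargs) for t in texts]
        return encodings[0] if len(encodings) == 1 else encodings

    def _encode_one(self, text, **kwargs):
        return {"input_ids": [ord(c) for c in text]}  # placeholder encoding
```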

* Bump tokenizers dependency to 0.3.0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix tokenization_auto with support for new (python, fast) mapping schema.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give correct fixtures path in test_tokenization_fast.py for the CLI.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose max_len_ properties on BertTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* _convert_encoding should keep the batch axis of the tensor even if there is only one sample in the batch.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add warning message for RobertaTokenizerFast if used for MLM.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained().

This makes it easy to enable or disable Rust-based tokenizer instantiation (see the example below).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
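For example (a sketch; the default value of use_fast, and which architectures have fast implementations, depend on the installed version):

```python
from transformers import AutoTokenizer

# use_fast=True requests the Rust-backed tokenizer when one exists for the
# architecture; use_fast=False keeps the pure-Python implementation.
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
```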

* Let the tokenizers library handle all the truncation and padding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to provide tokenizer arguments during pipeline creation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update test_fill_mask pipeline to not use fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix too many parameters passed to convert_encoding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* When enabling padding, max_length should be set to None.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Avoid returning nested tensors of length 1 when calling encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure output is padded when return_tensor is not None.

Tensor creation requires all the initial list inputs to be exactly the same size.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Disable the transfoxl unittest if PyTorch is not available (it is required to load the model)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus should not remove the leading batch axis if return_tensor is set

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Temporarily disable fast tokenizers on QA pipelines.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix formatting issues.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.4.0

* Update style

* Enable truncation + stride unit test on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add unittest ensuring the special_tokens sets match between Python and Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure special_tokens are correctly set during construction.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give more warning feedback to the user in case of padding without pad_token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* quality & format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added the possibility to add a single token as a str

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added unittest for add_tokens and add_special_tokens on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix rebase mismatch on pipelines qa default model.

QA requires cased input while the tokenizers would be uncased.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Using offset mapping relative to the original string + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: save_vocabulary requires folder and file name

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Simplify import for Bert.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: truncate_and_pad disables padding according to the same heuristic as the one enabling padding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Remove private member access in tokenize()

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Bump tokenizers dependency to 0.4.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Use named arguments when applicable.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Relax type checking to include tuple and list objects.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Document the truncate_and_pad manager behavior.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Raise an exception if return_offsets_mapping is not available with the current tokenizer (see the sketch below).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
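A sketch of the resulting behavior (checkpoint name illustrative; offset mappings are only produced by the Rust-backed tokenizers):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
enc = tok.encode_plus("hello world", return_offsets_mapping=True)
print(enc["offset_mapping"])  # (start, end) character spans, one per token

# The same call on a slow (pure-Python) tokenizer raises instead of
# silently omitting the mapping.
```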

* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* On PyTorch we need to stack tensors to get a proper new axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generalize tests to different frameworks, removing hard-coded return_tensors="..."

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizer dependency for num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improved error message for padding strategy without pad token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumping tokenizers dependency to 0.5.0 for release.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Optimized convert_encoding: around a 4x improvement. 🚀

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose pad_to_max_length in encode_plus to avoid duplicating the parameter in kwargs (see the example below)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
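For instance (a sketch using the parameter names of this era of the API):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
enc = tok.encode_plus(
    "a short sequence",
    max_length=16,
    pad_to_max_length=True,  # pad input_ids up to max_length
)
print(len(enc["input_ids"]))  # 16, padded with the tokenizer's pad token
```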

* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove the perfect alignment constraint for Roberta (allowing at most a 1% difference)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Triggering final CI

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-19 11:35:40 -05:00
Bin Wang
ffb93ec0cc Create README.md 2020-02-19 10:51:16 -05:00
Sam Shleifer
20fc18fbda
Skip flaky test_tf_question_answering (#2845)
* Skip flaky test

* Style
2020-02-18 16:14:50 -05:00
VictorSanh
2ae98336d1 fix vocab size in binarized_data (distil): int16 vs int32 2020-02-18 16:17:35 +00:00
VictorSanh
0dbddba6d2 fix typo in hans example call 2020-02-17 20:19:57 +00:00
Manuel Romero
29ab4b7f40 Create README.md 2020-02-17 10:58:43 -05:00
Stefan Schweter
c88ed74ccf [model_cards] 🇹🇷 Add new (cased) BERTurk model 2020-02-17 09:54:46 -05:00
Thomas Wolf
5b2d4f2657
Merge pull request #2881 from patrickvonplaten/add_vim_swp_to_gitignore
update .gitignore to ignore .swp files created when using vim
2020-02-17 14:36:49 +01:00
Patrick von Platen
fb4d8d0832 update .gitignore to ignore .swp files created when using vim 2020-02-17 14:26:32 +01:00
Manuel Romero
6083c1566e Update README.md
Training the model for more epochs improved the results. This commit updates the model's reported results and adds a GIF of it in use with **transformers/pipelines**.
2020-02-16 10:09:34 -05:00
Julien Chaumond
73028c5df0 [model_cards] EsperBERTo 2020-02-14 15:16:33 -05:00
Timo Moeller
81fb8d3251
Update model card: new performance chart (#2864)
* Update model performance for correct German conll03 dataset

* Adjust text

* Adjust line spacing
2020-02-14 13:39:23 -05:00
Julien Chaumond
4e69104a1f [model_cards] Also use the thumbnail as meta
Co-Authored-By: Ilias Chalkidis <ihalk@di.uoa.gr>
2020-02-14 10:27:11 -05:00
Julien Chaumond
73d79d42b4 [model_cards] nlptown/bert-base-multilingual-uncased-sentiment
cc @yvespeirsman

Co-Authored-By: Yves Peirsman <yvespeirsman@users.noreply.github.com>
2020-02-14 09:51:11 -05:00
Yves Peirsman
47b735f994
Added model card for bert-base-multilingual-uncased-sentiment (#2859)
* Created model card for nlptown/bert-base-multilingual-sentiment

* Delete model card

* Created model card for bert-base-multilingual-uncased-sentiment as README
2020-02-14 09:31:15 -05:00
Julien Chaumond
7d22fefd37 [pipeline] Alias NerPipeline as TokenClassificationPipeline 2020-02-14 09:18:10 -05:00
Manuel Romero
61a2b7dc9d Fix typo 2020-02-14 09:13:07 -05:00
Ilias Chalkidis
6e261d3a22 Fix typos 2020-02-14 09:11:07 -05:00
Manuel Romero
4e597c8e4d Fix typo 2020-02-14 09:07:42 -05:00
Julien Chaumond
925a13ced1 [model_cards] mv README.md 2020-02-13 23:07:29 -05:00
Manuel Romero
575a3b7aa1 Create distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es.md 2020-02-13 23:04:52 -05:00
Julien Chaumond
4d36472b96 [run_ner] Don't crash when fine-tuning a local model whose name doesn't end with a digit 2020-02-14 03:25:29 +00:00
Ilias Chalkidis
8514018300 Update with additional information
Added a "Pre-training details" section
2020-02-13 21:54:42 -05:00
Ilias Chalkidis
1eec69a900 Create README.md 2020-02-13 19:27:22 -05:00
Felix MIKAELIAN
8744402f1e
add model_card flaubert-base-uncased-squad (#2833)
* add model_card

* Add tag

cc @fmikaelian

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-02-13 17:19:13 -05:00
Severin Simmler
7f98edd7e3
Model card: Literary German BERT (#2843)
* feat: create model card

* chore: add description

* feat: stats plot

* Delete prosa-jahre.svg

* feat: years plot (again)

* chore: add more details

* fix: typos

* feat: kfold plot

* feat: kfold plot

* Rename model_cards/severinsimmler/literary-german-bert.md to model_cards/severinsimmler/literary-german-bert/README.md

* Support for linked images + add tags

cc @severinsimmler

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-02-13 15:43:44 -05:00
Joe Davison
f1e8a51f08
Preserve spaces in GPT-2 tokenizers (#2778)
* Preserve spaces in GPT-2 tokenizers

Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
tokenizers, enabling correct BPE encoding. Automatically inserts a space
in front of the first token in the encode function when adding special tokens.

* Add tokenization preprocessing method

* Add framework argument to pipeline factory

Also fixes a pipeline test issue: each test input is now treated as a
distinct sequence.
2020-02-13 13:29:43 -05:00
Sam Shleifer
0ed630f139
Attempt to increase timeout for circleci slow tests (#2844) 2020-02-13 09:11:03 -05:00
Sam Shleifer
ef74b0f07a
get_activation('relu') provides a simple mapping from strings i… (#2807)
* activations.py contains a mapping from string to activation function
* resolves some `gelu` vs `gelu_new` ambiguity
2020-02-13 08:28:33 -05:00
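A sketch of the helper introduced here (which string names beyond "relu" are supported depends on what activations.py registers):

```python
import torch
from transformers.activations import get_activation

# get_activation maps a config string to the corresponding callable,
# resolving variants such as "gelu" vs "gelu_new" explicitly.
relu = get_activation("relu")
print(relu(torch.tensor([-1.0, 2.0])))  # tensor([0., 2.])
```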
Lysandre
f54a5bd37f Raise error when using an mlm flag for a clm model + correct TextDataset 2020-02-12 13:23:14 -05:00
Lysandre
569897ce2c Fix a few issues regarding the language modeling script 2020-02-12 13:23:14 -05:00
Julien Chaumond
21da895013 [model_cards] Better image for social sharing 2020-02-11 20:30:08 -05:00
Julien Chaumond
9a70910d47 [model_cards] Tweak @mrm8488's model card 2020-02-11 20:20:39 -05:00
Julien Chaumond
9274734a0d [model_cards] mv to correct location + tweak tag 2020-02-11 20:13:57 -05:00
Manuel Romero
69f948461f Create bert-base-spanish-wwm-cased-finetuned-spa-squad2-es.md 2020-02-11 20:07:15 -05:00
Julien Chaumond
e0b6247cf7 [model_cards] Change formatting slightly as we updated our markdown engine
cc @tholor @loretoparisi @simonefrancia
2020-02-11 18:25:21 -05:00
sshleifer
5f2dd71d1b Smaller diff 2020-02-11 17:20:09 -05:00
sshleifer
31158af57c formatting 2020-02-11 17:20:09 -05:00
sshleifer
5dd61fb9a9 Add more specific testing advice to Contributing.md 2020-02-11 17:20:09 -05:00
Oleksiy Syvokon
ee5de0ba44 BERT decoder: Fix causal mask dtype.
PyTorch < 1.3 requires multiplication operands to be of the same type.
This was violated when using the default attention mask (i.e.,
attention_mask=None) with BERT in decoder mode.

In particular, this broke Model2Model and caused the quickstart
tutorial to fail.
2020-02-11 15:19:22 -05:00
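A minimal sketch of the dtype mismatch (illustrative tensors, not the actual model internals):

```python
import torch

seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len))     # float32
attention_mask = torch.ones(1, seq_len, dtype=torch.long)  # int64

# PyTorch < 1.3 rejects float32 * int64 multiplication; casting the masks
# to a common dtype before combining them is the essence of the fix.
attention_mask = attention_mask.to(causal_mask.dtype)
combined = causal_mask[None, :, :] * attention_mask[:, None, :]
```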
jiyeon
bed38d3afe Fix typo in src/transformers/data/processors/squad.py 2020-02-11 11:22:24 -05:00
Stefan Schweter
498d06e914
[model_cards] Add new German Europeana BERT models (#2805)
* [model_cards] New German Europeana BERT models from dbmdz

* [model_cards] Update German Europeana BERT models from dbmdz
2020-02-11 10:49:39 -05:00
Funtowicz Morgan
3e3a9e2c01
Merge pull request #2793 from huggingface/tensorflow-210-circleci-fix
Fix CircleCI cuInit error on TensorFlow >= 2.1.0.
2020-02-11 10:48:42 +00:00
Julien Chaumond
1f5db9a13c [model_cards] Rm extraneous tag 2020-02-10 17:45:13 -05:00
Julien Chaumond
95bac8dabb [model_cards] Add language metadata to existing model cards
This will enable filtering on language (amongst other tags) on the website

cc @loretoparisi, @stefan-it, @HenrykBorzymowski, @marma
2020-02-10 17:42:42 -05:00