* Added model_cards for Polish SQuAD models
* Corrected a mistake in the Polish design cards
* Updated model_cards for the squad2_dutch model
* added links to benchmark models
Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
* Initial commit to get BERT + run_glue.py on TPU
* Add README section for TPU and address comments.
* Cleanup TPU bits from run_glue.py (#3)
TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.
We plan to upstream this directly into `huggingface/transformers`
(either the `master` or a `tpu` branch) once it has been more thoroughly tested.
* No need to call `xm.mark_step()` explicitly (#4)
For gradient accumulation we accumulate over batches from a `ParallelLoader`
instance, which marks the step itself on `next()`.
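A minimal sketch of the loop this describes, assuming `torch_xla` is installed; `model`, `optimizer`, `train_dataloader`, and `gradient_accumulation_steps` are placeholder names, not the script's exact variables:

```python
# Hedged sketch only; placeholder names, not the exact run_glue_tpu.py code.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
loader = pl.ParallelLoader(train_dataloader, [device]).per_device_loader(device)

for step, batch in enumerate(loader):
    loss = model(**batch)[0] / gradient_accumulation_steps
    loss.backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        xm.optimizer_step(optimizer)
        optimizer.zero_grad()
        # No explicit xm.mark_step(): the per-device loader marks the step
        # itself when the next batch is fetched.
```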
* Resolve R/W conflicts from multiprocessing (#5)
* Add XLNet in list of models for `run_glue_tpu.py` (#6)
* Add RoBERTa to list of models in TPU GLUE (#7)
* Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)
* Use barriers to reduce duplicate work/resources (#9)
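The barrier pattern referred to here is roughly the following, assuming `torch_xla`'s `xm.rendezvous()`; `load_and_cache_examples` is a stand-in for whatever expensive setup would otherwise be duplicated on every core:

```python
# Rough sketch; load_and_cache_examples is a placeholder for shared setup work.
import torch_xla.core.xla_model as xm

if not xm.is_master_ordinal():
    xm.rendezvous("dataset_cache")   # non-master processes wait here

dataset = load_and_cache_examples(args, task, tokenizer)  # master builds the cache, the others then read it

if xm.is_master_ordinal():
    xm.rendezvous("dataset_cache")   # master releases the others once done
```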
* Shard eval dataset and aggregate eval metrics (#10)
* Shard eval dataset and aggregate eval metrics
Also, instead of calling `eval_loss.item()` on every step, accumulate the loss
as a tensor on the device.
* Change defaultdict to float
* Reduce the pred, label tensors instead of metrics
As brought up during review, some metrics like F1 cannot be aggregated
via averaging. Which metrics a GLUE task uses depends on the dataset, so
instead we sync the prediction and label tensors across cores and compute the
metrics accurately on those, as sketched below.
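A hedged sketch of the resulting evaluation pattern, assuming `torch_xla`'s `xm.mesh_reduce()` and the GLUE `compute_metrics` helper; `outputs`, `preds`, and `labels` are illustrative names:

```python
# Illustrative sketch, not the exact script code.
import numpy as np
import torch_xla.core.xla_model as xm

# Inside the sharded eval loop: keep the loss on the device as a tensor
# instead of calling .item() on every step.
eval_loss += outputs[0].detach()

# After the loop: gather predictions and labels from all cores, then compute
# metrics such as F1 on the full arrays (F1 cannot be averaged across shards).
preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
labels = xm.mesh_reduce("eval_labels", labels, np.concatenate)
results = compute_metrics(eval_task, np.argmax(preds, axis=1), labels)
```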
* Only use tb_writer from master (#11)
* Apply huggingface black code formatting
* Style
* Remove `--do_lower_case` as the example uses a cased model
* Add option to specify tensorboard logdir
This is needed for our testing framework, which checks regressions
against key metrics written by the summary writer.
* Using configuration for `xla_device`
* Prefix TPU specific comments.
* num_cores clarification and namespace eval metrics
* Cache features file under `args.cache_dir`
Instead of under `args.data_dir`. This is needed because our test infra uses a
`data_dir` on a read-only filesystem.
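Roughly, the caching path changes as below; the variables follow the usual examples-script conventions and are illustrative:

```python
# Illustrative sketch; args/task/features follow the usual examples-script names.
import os
import torch

cache_root = args.cache_dir if args.cache_dir else args.data_dir
cached_features_file = os.path.join(cache_root, "cached_{}_{}".format(task, args.max_seq_length))

if os.path.exists(cached_features_file):
    features = torch.load(cached_features_file)
else:
    features = convert_examples_to_features(examples, tokenizer, max_length=args.max_seq_length)
    torch.save(features, cached_features_file)
```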
* Rename `run_glue_tpu` to `run_tpu_glue`
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
* [examples] Generate argparsers from type hints on dataclasses
* [HfArgumentParser] way simpler API
* Restore run_language_modeling.py for easier diff
* [HfArgumentParser] final tweaks from code review
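`HfArgumentParser` builds an `argparse` parser from a dataclass's type hints and defaults; a minimal usage sketch (the dataclass here is illustrative, not the one shipped in the examples):

```python
# Minimal sketch; ModelArguments is an illustrative dataclass.
from dataclasses import dataclass, field
from typing import Optional
from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    model_name_or_path: str = field(metadata={"help": "Path or identifier of the pretrained model"})
    cache_dir: Optional[str] = field(default=None, metadata={"help": "Where to cache downloaded models"})

parser = HfArgumentParser(ModelArguments)
(model_args,) = parser.parse_args_into_dataclasses()
```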
* Big cleanup of `glue_convert_examples_to_features`
* Use batch_encode_plus
* Cleaner wrapping of glue_convert_examples_to_features for TF
@lysandrejik
* Cleanup syntax, thanks to @mfuntowicz
* Raise explicit error in case of user error
* Optimize causal mask using torch.where
Instead of multiplying by 1.0 float mask, use torch.where with a bool mask for increased performance.
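A standalone illustration of the change; `scores` and `causal_mask` are illustrative names, not GPT-2's exact attributes:

```python
# Standalone illustration; not the model's exact code.
import torch

scores = torch.randn(1, 4, 4)                       # attention scores
causal_mask = torch.tril(torch.ones(4, 4)).bool()   # lower-triangular bool mask
mask_value = torch.finfo(scores.dtype).min

# Before: multiply by a 1.0/0.0 float mask and subtract a large constant.
# scores = scores * causal_mask.float() - 1e4 * (1 - causal_mask.float())

# After: select directly with torch.where on the bool mask.
scores = torch.where(causal_mask, scores, torch.full_like(scores, mask_value))
```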
* Maintain compatibility with torch 1.0.0 - thanks for the PR feedback
* Fix typo
* reformat line for CI
* Renamed num_added_tokens to num_special_tokens_to_add
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Cherry-pick: partially fix space-only input without special tokens added to the output (#3091)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Make fast tokenizers unittests work on Windows.
* Entirely refactored the unit tests for fast tokenizers.
* Remove ABC class for CommonFastTokenizerTest
* Added embeded_special_tokens tests from allenai @dirkgr
* Make embeded_special_tokens tests from allenai more generic
* Uniformize vocab_size as a property for both Fast and normal tokenizers
* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
* Ensure providing a None input raises the same ValueError as the Python tokenizer, and add tests.
* Fix invalid input for assert_padding when testing batch_encode_plus
* Move add_special_tokens from the constructor to a parameter of the tokenize/encode/[batch_]encode_plus methods.
* Ensure tokenize() correctly forwards add_special_tokens to Rust.
* Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping None values.
* Unit tests ensure tokenize() also raises a ValueError if provided None.
* Added add_special_tokens unittest for all supported models.
* Style
* Make sure TransfoXL tests run only if PyTorch is available.
* Split up tokenizers tests for each model type.
* Fix invalid unittest with new tokenizers API.
* Filter out Roberta openai detector models from unittests.
* Introduce BatchEncoding on fast tokenizers path.
This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.
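A small usage sketch of `BatchEncoding` on the fast path; the exact extras available depend on the `tokenizers`/`transformers` versions pinned by this PR:

```python
# Usage sketch; exact extras depend on the pinned tokenizers/transformers versions.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Hello world!", return_offsets_mapping=True)

input_ids = encoding["input_ids"]      # dict-style access keeps model forward working
offsets = encoding["offset_mapping"]   # per-token (char_start, char_end) from Rust
```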
* Introduce BatchEncoding on slow tokenizers path.
Backward compatibility.
* Improve error message on BatchEncoding for slow path
* Make add_prefix_space True by default on Roberta fast to match the Python tokenizer in the majority of cases.
* Style and format.
* Added typing on all methods for PretrainedTokenizerFast
* Style and format
* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
* Style and format
* encode_plus now supports pretokenized inputs.
* Remove user warning about add_special_tokens when working on pretokenized inputs.
* Always go through the post processor.
* Added support for pretokenized input pairs on encode_plus
* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
* Added pretokenized inputs support on batch_encode_plus
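A hedged sketch of the pretokenized path added in these commits; `is_pretokenized` is the flag name used here (later releases rename it), so treat the exact signature as version-dependent:

```python
# Sketch of the pretokenized input path; flag name as used at the time of this PR.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

words = ["My", "name", "is", "Wolfgang"]
single = tokenizer.encode_plus(words, is_pretokenized=True, add_special_tokens=True)
batch = tokenizer.batch_encode_plus([words, ["Hello", "world"]], is_pretokenized=True)
```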
* Update BatchEncoding methods name to match Encoding.
* Bump setup.py tokenizers dependency to 0.7.0rc1
* Remove unused parameters in BertTokenizerFast
* Make sure Roberta returns token_type_ids for unittests.
* Added missing typings
* Update add_tokens prototype to match tokenizers side and allow AddedToken
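With this change `add_tokens` accepts `tokenizers.AddedToken` objects as well as plain strings; a brief hedged sketch (the token strings are made up):

```python
# Hedged sketch; the token strings are made up. lstrip=True mirrors the
# left-stripping later applied to Roberta's mask token.
from tokenizers import AddedToken
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
tokenizer.add_tokens([AddedToken("<new_tok>", lstrip=True), "plain_tok"])
```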
* Bumping tokenizers to 0.7.0rc2
* Added documentation for BatchEncoding
* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
* Fix text-classification pipeline using the wrong tokenizer
* Make pipelines work with BatchEncoding
* Turn off add_special_tokens on tokenize by default.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove add_prefix_space from tokenize call in unittest.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style and quality
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correct the message for the batch_encode_plus None-input exception.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix invalid list comprehension for offset_mapping overriding content on every iteration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* TransfoXL uses Strip normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Bump tokenizers dependency to 0.7.0rc3
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* SpecialTokensMixin can use slots for faster access to underlying attributes.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove update_special_tokens from fast tokenizers.
* Ensure TransfoXL unittests are run only when torch is available.
* Style.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Style
* Style 🙏🙏
* Remove slots on SpecialTokensMixin; needs a deeper dive into the pickle protocol.
* Remove Roberta warning on __init__.
* Move documentation to Google style.
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
* Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py
`convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it would be helpful if someone more familiar with this part of the codebase checked.
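A hedged sketch of the fix: derive the pad token id from the tokenizer rather than relying on the default; the surrounding arguments are abridged:

```python
# Hedged sketch; the call is abridged to the relevant argument.
pad_token_id = tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0]  # 0 for BERT, 1 for RoBERTa, 5 for XLNet

features = convert_examples_to_features(
    examples,
    label_list,
    args.max_seq_length,
    tokenizer,
    pad_token=pad_token_id,
)
```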
* Simplifying change to match recent commits