Commit Graph

3776 Commits

Patrick von Platen
092cf881a5
[Generation, EncoderDecoder] Apply Encoder Decoder 1.5GB memory… (#3778) 2020-04-13 22:29:28 -04:00
Teven
352d5472b0
Shift labels internally within TransfoXLLMHeadModel when called with labels (#3716)
* Shifting labels inside TransfoXLLMHead

* Changed doc to reflect change

* Updated pytorch test

* removed IDE whitespace changes

* black reformat

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
2020-04-13 18:11:23 +02:00
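The shift this commit moves inside the model can be sketched generically (a pure-Python illustration of the causal-LM convention, not the actual TransfoXLLMHeadModel code): the prediction at position i is scored against the token at position i+1.

```python
# Minimal sketch of internal label shifting for a causal LM loss
# (hypothetical simplification; not the actual TransfoXLLMHeadModel code).
def shift_for_lm_loss(predictions, labels):
    """Align each position's prediction with the *next* token id.

    The last prediction has no target and the first label has no
    predictor, so both are dropped before computing the loss.
    """
    return predictions[:-1], labels[1:]

preds, targets = shift_for_lm_loss(["p0", "p1", "p2"], [10, 11, 12])
# preds pairs with targets as ("p0", 11), ("p1", 12)
```

Callers can then pass the unshifted input ids as `labels` and let the model align them.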
elk-cloner
5ebd898953
fix dataset shuffling for Distributed training (huggingface#3721) (#3766) 2020-04-13 10:11:18 -04:00
HenrykBorzymowski
7972a4019f
updated dutch squad model card (#3736)
* added model_cards for polish squad models

* corrected mistake in polish design cards

* updated model_cards for squad2_dutch model

* added links to benchmark models

Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
2020-04-11 06:44:59 -04:00
HUSEIN ZOLKEPLI
f8c1071c51
Added README huseinzol05/albert-tiny-bahasa-cased (#3746)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme

* added albert base

* added albert tiny
2020-04-11 06:42:06 -04:00
Jin Young Sohn
700ccf6e35
Fix glue_convert_examples_to_features API breakage (#3742) 2020-04-10 16:03:27 -04:00
Anthony MOI
b7cf9f43d2
Update tokenizers to 0.7.0-rc5 (#3705) 2020-04-10 14:23:49 -04:00
Jin Young Sohn
551b450527
Add run_glue_tpu.py that trains models on TPUs (#3702)
* Initial commit to get BERT + run_glue.py on TPU

* Add README section for TPU and address comments.

* Cleanup TPU bits from run_glue.py (#3)

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* Cleanup TPU bits from run_glue.py

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* No need to call `xm.mark_step()` explicitly (#4)

For gradient accumulation we're accumulating on batches from a
`ParallelLoader` instance, which marks the step itself on next().

* Resolve R/W conflicts from multiprocessing (#5)

* Add XLNet in list of models for `run_glue_tpu.py` (#6)

* Add RoBERTa to list of models in TPU GLUE (#7)

* Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)

* Use barriers to reduce duplicate work/resources (#9)

* Shard eval dataset and aggregate eval metrics (#10)

* Shard eval dataset and aggregate eval metrics

Also, instead of calling `eval_loss.item()` every time, do the summation
with tensors on the device.

* Change defaultdict to float

* Reduce the pred, label tensors instead of metrics

As brought up during review, some metrics like F1 cannot be aggregated
via averaging. GLUE task metrics depend largely on the dataset, so
instead we sync the prediction and label tensors so that the metrics can
be computed accurately on those instead.

* Only use tb_writer from master (#11)

* Apply huggingface black code formatting

* Style

* Remove `--do_lower_case` as example uses cased

* Add option to specify tensorboard logdir

This is needed for our testing framework, which checks regressions
against key metrics written by the summary writer.

* Using configuration for `xla_device`

* Prefix TPU specific comments.

* num_cores clarification and namespace eval metrics

* Cache features file under `args.cache_dir`

Instead of under `args.data_dir`. This is needed as our test infra uses
data_dir with a read-only filesystem.

* Rename `run_glue_tpu` to `run_tpu_glue`

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-04-10 12:53:54 -04:00
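The "reduce the pred, label tensors instead of metrics" point above can be demonstrated with a toy binary-F1 computation (the shard contents here are made up for illustration): averaging per-shard F1 generally differs from F1 computed on the gathered predictions and labels.

```python
# Toy demonstration (made-up shard data) of why F1 cannot be averaged
# across shards and must be computed on the gathered predictions/labels.
def binary_f1(preds, labels):
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

shard_a = ([1, 0], [1, 1])              # per-shard F1 = 2/3
shard_b = ([1, 1, 1, 1], [0, 0, 0, 1])  # per-shard F1 = 0.4
gathered = ([1, 0, 1, 1, 1, 1], [1, 1, 0, 0, 0, 1])

avg_of_shards = (binary_f1(*shard_a) + binary_f1(*shard_b)) / 2  # ~0.533
true_f1 = binary_f1(*gathered)                                    # 0.5
```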
Julien Chaumond
cbad305ce6
[docs] The use of do_lower_case in scripts is on its way to deprecation (#3738) 2020-04-10 12:34:04 -04:00
Julien Chaumond
b169ac9c2b
[examples] Generate argparsers from type hints on dataclasses (#3669)
* [examples] Generate argparsers from type hints on dataclasses

* [HfArgumentParser] way simpler API

* Restore run_language_modeling.py for easier diff

* [HfArgumentParser] final tweaks from code review
2020-04-10 12:21:58 -04:00
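The idea behind generating argparsers from dataclass type hints can be sketched with plain `argparse` and `dataclasses.fields` (a minimal illustration only; the real `HfArgumentParser` additionally handles bools, optionals, enums, and help strings):

```python
import argparse
from dataclasses import dataclass, fields
from typing import get_type_hints

# Hypothetical dataclass for illustration; field names are made up.
@dataclass
class TrainingArgs:
    learning_rate: float = 5e-5
    num_train_epochs: int = 3

def parser_from_dataclass(cls):
    """Build an ArgumentParser whose flags mirror the dataclass fields."""
    parser = argparse.ArgumentParser()
    hints = get_type_hints(cls)  # resolves string annotations as well
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=hints[f.name], default=f.default)
    return parser

args = parser_from_dataclass(TrainingArgs).parse_args(["--num_train_epochs", "5"])
```

Each field's type hint becomes the argument's `type` converter, so `--num_train_epochs 5` parses to an `int`.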
Sam Shleifer
7a7fdf71f8
Multilingual BART - (#3602)
- support mbart-en-ro weights
- add MBartTokenizer
2020-04-10 11:25:39 -04:00
Julien Chaumond
f98d0ef2a2
Big cleanup of glue_convert_examples_to_features (#3688)
* Big cleanup of `glue_convert_examples_to_features`

* Use batch_encode_plus

* Cleaner wrapping of glue_convert_examples_to_features for TF

@lysandrejik

* Cleanup syntax, thanks to @mfuntowicz

* Raise explicit error in case of user error
2020-04-10 10:20:18 -04:00
Patrick von Platen
ce2298fb5f
[T5, generation] Add decoder caching for T5 (#3682)
* initial commit to add decoder caching for T5

* better naming for caching

* finish T5 decoder caching

* correct test

* added extensive past testing for T5

* clean files

* make tests cleaner

* improve docstring

* improve docstring

* better reorder cache

* make style

* Update src/transformers/modeling_t5.py

Co-Authored-By: Yacine Jernite <yjernite@users.noreply.github.com>

* make set output past work for all layers

* improve docstring

* improve docstring

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
2020-04-10 01:02:50 +02:00
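The caching pattern this commit adds can be sketched conceptually (pure Python with a toy step function; the real T5 implementation caches attention key/value tensors per layer): each generation step feeds only the newest token plus the cached past, instead of re-encoding the whole prefix.

```python
# Conceptual sketch of decoder caching during generation (toy example;
# the actual T5 cache holds per-layer attention key/value tensors).
def generate_with_cache(step_fn, start_token, num_steps):
    """Feed only the newest token each step, reusing cached state."""
    past, token, out = None, start_token, [start_token]
    for _ in range(num_steps):
        token, past = step_fn(token, past)  # past avoids recomputing history
        out.append(token)
    return out

# Toy step: "predict" token+1 and keep a running step count as the cache.
seq = generate_with_cache(lambda t, p: (t + 1, (p or 0) + 1), 0, 3)
```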
calpt
9384e5f6de
Fix force_download of files on Windows (#3697) 2020-04-09 14:44:57 -04:00
Julien Chaumond
bc65afc4df [Exbert] Change style of button 2020-04-09 10:44:42 -04:00
LysandreJik
31baeed614 Update quotes
cc @julien-c
2020-04-09 09:09:00 -04:00
Teven
f8208fa456
Correct transformers-cli env call 2020-04-09 09:03:19 +02:00
Lysandre Debut
6435b9f908
Updating the TensorFlow models to work as expected with tokenizers v3.0.0 (#3684)
* Updating modeling tf files; adding tests

* Merge `encode_plus` and `batch_encode_plus`
2020-04-08 16:22:44 -04:00
LysandreJik
500aa12318 close #3699 2020-04-08 14:32:47 -04:00
Julien Chaumond
a594ee9c84
More doc for model cards (#3698)
see https://github.com/huggingface/transformers/pull/3679#pullrequestreview-389368270
2020-04-08 12:12:52 -04:00
Julien Chaumond
83703cd077 Update doc for {Summarization,Translation}Pipeline and other tweaks 2020-04-08 09:45:00 -04:00
Seyone Chithrananda
a1b3b4167e
Created README.md for model card ChemBERTa (#3666)
* created readme.md

* update readme with fixes

Fixes from PR comments
2020-04-08 09:10:20 -04:00
Lorenzo Ampil
747907dc5e Fix typo in FeatureExtractionPipeline docstring 2020-04-08 09:08:56 -04:00
Sam Shleifer
715aa5b135
[Bart] Replace config.output_past with use_cache kwarg (#3632) 2020-04-07 19:08:26 -04:00
Sam Shleifer
e344e3d402
[examples] SummarizationDataset cleanup (#3451) 2020-04-07 19:05:58 -04:00
Patrick von Platen
b0ad069517
[Tokenization] fix edge case for bert tokenization (#3517)
* fix edge case for bert tokenization

* add Lysandres comments for improvement

* use new is_pretokenized_flag
2020-04-07 16:26:31 -04:00
Patrick von Platen
80fa0f7812
[Examples, Benchmark] Improve benchmark utils (#3674)
* improve and add features to benchmark utils

* update benchmark style

* remove output files
2020-04-07 16:25:57 -04:00
Michael Pang
05deb52dc1
Optimize causal mask using torch.where (#2715)
* Optimize causal mask using torch.where

Instead of multiplying by a 1.0 float mask, use torch.where with a bool mask for increased performance.

* Maintain compatibility with torch 1.0.0 - thanks for PR feedback

* Fix typo

* reformat line for CI
2020-04-07 22:19:18 +02:00
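The mask change above can be sketched in pure Python (the PR itself uses `torch.where` on tensors; the score values here are made up): selecting with a bool mask replaces the float multiply-and-offset arithmetic while producing the same result.

```python
# Pure-Python sketch of the bool-mask causal attention idea (illustrative
# values; the real code applies torch.where to attention score tensors).
scores = [[0.0, 1.0], [2.0, 3.0]]
causal = [[True, False], [True, True]]  # lower-triangular: attend to past only

# Old style: arithmetic with a 1.0/0.0 float mask plus a large negative offset.
float_mask = [[1.0 if m else 0.0 for m in row] for row in causal]
old = [[s * m - 1e4 * (1.0 - m) for s, m in zip(srow, mrow)]
       for srow, mrow in zip(scores, float_mask)]

# New style: select directly with the bool mask (torch.where semantics).
new = [[s if m else -1e4 for s, m in zip(srow, crow)]
       for srow, crow in zip(scores, causal)]
```

Both produce the score where attention is allowed and a large negative value where it is masked, but the select avoids two multiplies and a subtract per element.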
Sam Shleifer
0a4b1068e1
Speedup torch summarization tests (#3663) 2020-04-07 14:01:30 -04:00
Myle Ott
5aa8a278a3
Fix roberta checkpoint conversion script (#3642) 2020-04-07 12:03:23 -04:00
Julien Chaumond
11cc1e168b [model_cards] Turn down spurious warnings
Close #3639 + spurious warning mentioned in #3227

cc @lysandrejik @thomwolf
2020-04-07 10:20:19 -04:00
Teven
0a9d09b42a
fixed TransfoXLLMHeadModel documentation (#3661)
Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
2020-04-07 00:47:51 +02:00
Funtowicz Morgan
96ab75b8dd
Tokenizers v3.0.0 (#3185)
* Renamed num_added_tokens to num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Cherry-Pick: Partially fix space-only input without special tokens added to the output #3091

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forward add_special_tokens to rust.

* Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast;
avoid stripping on None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL test run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding methods name to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines works with BatchEncoding

* Turn off add_special_tokens on tokenize by default.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus none input exception.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* SpecialTokensMixin can use slots for faster access to underlying attributes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-04-07 00:29:15 +02:00
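The `BatchEncoding` structure introduced here can be sketched roughly (a hypothetical simplification; the real class wraps Rust `Encoding` objects and exposes many alignment methods): it behaves like the dict that model forwards expect, while also carrying extra mappings such as character offsets.

```python
# Rough sketch of the dict-like BatchEncoding idea (hypothetical names;
# the real class wraps Rust Encoding objects with richer alignment APIs).
class BatchEncodingSketch(dict):
    def __init__(self, data, offsets=None):
        super().__init__(data)          # keeps plain-dict model-forward usage
        self.offsets = offsets          # e.g. (start, end) char span per token

enc = BatchEncodingSketch({"input_ids": [0, 7, 2]},
                          offsets=[(0, 0), (0, 5), (0, 0)])
```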
Ethan Perez
e52d1258e0
Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631)
* Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py

`convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone more familiar with this part of the codebase checked.

* Simplifying change to match recent commits
2020-04-06 16:52:22 -04:00
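The bug described in this commit comes down to one rule: each tokenizer defines its own pad token id, so padding code must look it up rather than hard-code 0. A toy illustration (the lookup table below uses the ids stated in the PR description; the helper is hypothetical):

```python
# Hypothetical illustration of why a hard-coded pad_token=0 breaks padding:
# pad token ids per the PR description (BERT 0, RoBERTa 1, XLNet 5).
PAD_TOKEN_IDS = {"bert": 0, "roberta": 1, "xlnet": 5}

def pad_to_length(ids, max_len, model_type):
    pad_id = PAD_TOKEN_IDS[model_type]  # look up, never assume 0
    return ids + [pad_id] * (max_len - len(ids))

padded = pad_to_length([101, 2054], 4, "roberta")
# With a hard-coded 0, RoBERTa inputs would be padded with a real token id.
```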
ktrapeznikov
0ac33ddd8d Create README.md 2020-04-06 16:35:29 -04:00
Manuel Romero
326e6ebae7 Add model card 2020-04-06 16:30:01 -04:00
Manuel Romero
43eca3f878 Add model card 2020-04-06 16:29:51 -04:00
Manuel Romero
6bec88ca42 Create README.md 2020-04-06 16:29:44 -04:00
Manuel Romero
769b60f935
Add model card (#3655)
* Add model card

* Fix model name in fine-tuning script
2020-04-06 16:29:36 -04:00
Manuel Romero
c4bcb01906
Create model card (#3654)
* Create model card

* Fix model name in fine-tuning script
2020-04-06 16:29:25 -04:00
Manuel Romero
6903a987b8 Create README.md 2020-04-06 16:29:02 -04:00
MichalMalyska
760872dbde
Create README.md (#3662) 2020-04-06 16:27:50 -04:00
jjacampos
47e1334c0b
Add model card for BERTeus (#3649)
* Add model card for BERTeus

* Update README
2020-04-06 16:21:25 -04:00
Suchin
529534dc2f
BioMed Roberta-Base (AllenAI) (#3643)
* added model card

* updated README

* updated README

* updated README

* added evals

* removed pico eval

* Tweaks

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-06 16:12:09 -04:00
Lysandre Debut
261c4ff4e2
Update notebooks (#3620)
* Update notebooks

* From local to global link

* from local links to *actual* global links
2020-04-06 14:32:39 -04:00
Julien Chaumond
39a34cc375 [model_cards] ELECTRA (w/ examples of usage)
Co-Authored-By: Kevin Clark <clarkkev@users.noreply.github.com>
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2020-04-06 11:43:33 -04:00
LysandreJik
ea6dba2787 Re-pin isort 2020-04-06 10:09:54 -04:00
LysandreJik
11c3257a18 unpin isort for pypi 2020-04-06 10:06:41 -04:00
LysandreJik
36bffc81b3 Release: v2.8.0 2020-04-06 10:03:53 -04:00
Patrick von Platen
2ee410560e
[Generate, Test] Split generate test function into beam search, no beam search (#3601)
* split beam search and no beam search test

* fix test

* clean generate tests
2020-04-06 10:37:05 +02:00