Commit Graph

4435 Commits

Author SHA1 Message Date
Lysandre
b0892fa0e8 Release: v3.0.2 2020-07-06 18:49:44 -04:00
Sylvain Gugger
f1e2e423ab
Fix fast tokenizers too (#5562) 2020-07-06 18:45:01 -04:00
Anthony MOI
5787e4c159
Various tokenizers fixes (#5558)
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
2020-07-06 18:27:53 -04:00
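A minimal sketch of the strip_accents change from #5558: BertTokenizerFast no longer forces a strip_accents value, so the checkpoint's own normalizer settings apply unless the caller overrides them explicitly (bert-base-uncased is just the standard checkpoint used for illustration):

```python
from transformers import BertTokenizerFast

# Respects whatever the checkpoint's normalizer specifies (no implicit default).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Callers who want accent stripping can still request it explicitly.
stripped = BertTokenizerFast.from_pretrained("bert-base-uncased", strip_accents=True)
```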
Sylvain Gugger
21f28c34b7
Fix #5507 (#5559)
* Fix #5507

* Fix formatting
2020-07-06 17:26:48 -04:00
Lysandre Debut
9d9b872b66
The add_space_before_punct_symbol is only for TransfoXL (#5549) 2020-07-06 12:17:05 -04:00
Lysandre Debut
d6b0b9d451
GPT2 tokenizer should not output token type IDs (#5546)
* GPT2 tokenizer should not output token type IDs

* Same for OpenAIGPT
2020-07-06 11:33:57 -04:00
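After #5546 the GPT-2 (and OpenAI GPT) tokenizers no longer emit token_type_ids, since those models do not use them; a quick illustration:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoding = tokenizer("Hello world")

# Only input_ids and attention_mask are returned after the fix.
assert "token_type_ids" not in encoding
```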
Sylvain Gugger
7833b21a5a
Fix #5544 (#5551) 2020-07-06 11:22:24 -04:00
Thomas Wolf
c473484087
Fix the tokenization warning noted in #5505 (#5550)
* fix warning

* style and quality
2020-07-06 11:15:25 -04:00
Lysandre
1bbc28bee7 Imports organization 2020-07-06 10:27:10 -04:00
Mohamed Taher Alrefaie
1bc13697b1
Update convert_pytorch_checkpoint_to_tf2.py (#5531)
fixed ImportError: cannot import name 'hf_bucket_url'
2020-07-06 09:55:10 -04:00
Arnav Sharma
b2309cc6bf
Typo fix in training doc (#5495) 2020-07-06 09:15:22 -04:00
ELanning
7ecff0ccbb
Fix typo in training (#5510) 2020-07-06 09:14:57 -04:00
Sam Shleifer
58cca47c16
[cleanup] TF T5 tests only init t5-base once. (#5410) 2020-07-03 14:27:49 -04:00
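An illustrative sketch (class and method names are hypothetical, not the test suite's exact code) of the cleanup in #5410: load t5-base once per test class instead of once per test method.

```python
import unittest

from transformers import TFT5ForConditionalGeneration


class TFT5IntegrationTestSketch(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # The expensive download/initialization happens once for the whole class.
        cls.model = TFT5ForConditionalGeneration.from_pretrained("t5-base")
```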
Patrick von Platen
991172922f
better error message (#5497) 2020-07-03 19:25:25 +02:00
Thomas Wolf
b58a15a31e unpinning specific git versions in setup.py 2020-07-03 17:38:39 +02:00
Thomas Wolf
fedabcd154 Release: 3.0.1 2020-07-03 17:02:44 +02:00
Lysandre Debut
17ade127b9
Exposing prepare_for_model for both slow & fast tokenizers (#5479)
* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-07-03 16:51:21 +02:00
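With #5479, prepare_for_model is part of the public API on both slow and fast tokenizers; a short sketch of the intended use, turning pre-tokenized ids into a model-ready encoding:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))

# Adds special tokens, builds the attention mask, and pads to max_length.
encoding = tokenizer.prepare_for_model(ids, padding="max_length", max_length=16)
print(encoding["input_ids"])
```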
Manuel Romero
814ed7ee76
Create model card (#5396)
Create model card for electricidad-small (Spanish Electra) fine-tuned on SQuAD-es v1
2020-07-03 08:29:09 -04:00
Moseli Motsoehli
49281ac939
grammar corrections and train data update (#5448)
- fixed grammar and spelling
- added an intro
- updated Training data references
2020-07-03 08:25:57 -04:00
chrisliu
97355339f6
Update upstream (#5456) 2020-07-03 08:16:27 -04:00
Manuel Romero
55b932a818
Create model card (#5464)
Create model card for electra-small-discriminator fine-tuned on SQUAD v2.0
2020-07-03 06:19:49 -04:00
Funtowicz Morgan
21cd8c4086
QA Pipelines fixes (#5429)
* Make the QA pipeline support models with more than 2 outputs, such as BART, assuming start/end are the first two outputs.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length.

This results in every sample going through the QA pipeline being of size 384 regardless of the actual input size, making the overall pipeline very slow.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Mask padding & question before applying softmax. Softmax has been refactored to operate in log space for speed and stability.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Format.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use PaddingStrategy.LONGEST instead of DO_NOT_PAD

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Revert "When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length."

This reverts commit 1b00a9a2

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI after unattended failure

* Trigger CI
2020-07-03 10:29:20 +02:00
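A minimal numpy sketch (not the pipeline's exact code) of the masking idea from #5429: question and padding positions are excluded before normalizing the start/end logits, and the softmax runs in log space for speed and numerical stability.

```python
import numpy as np

def masked_log_softmax(logits, mask):
    # Drop masked positions (question/padding) from the distribution entirely.
    logits = np.where(mask == 1, logits, -np.inf)
    shifted = logits - np.max(logits)  # stabilize before exponentiating
    return shifted - np.log(np.sum(np.exp(shifted)))

start_logits = np.array([2.0, 5.0, 1.0, 0.5])
context_mask = np.array([0, 1, 1, 0])  # 0 marks question or padding tokens
print(masked_log_softmax(start_logits, context_mask))
```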
Pierric Cistac
8438bab38e
Fix roberta model ordering for TFAutoModel (#5414) 2020-07-02 19:23:55 -04:00
Sylvain Gugger
6b735a7253
Tokenizer summary (#5467)
* Work on tokenizer summary

* Finish tutorial

* Link to it

* Apply suggestions from code review

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add vocab definition

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-07-02 17:07:42 -04:00
Shen
ef0e9d806c
Update: ElectraDiscriminatorPredictions forward. (#5471)
`ElectraDiscriminatorPredictions.forward` should not need `attention_mask`.
2020-07-02 13:57:33 -04:00
Manuel Romero
13a8588f2d
Create model card (#5432)
Create model card for electra-base-discriminator fine-tuned on SQUAD v1.1
2020-07-02 10:16:30 -04:00
Julien Chaumond
a0a6387a0d
[model_cards] roberta-large-mnli: fix sep_token 2020-07-02 10:04:02 -04:00
Julien Chaumond
215db688da
Create roberta-large-mnli-README.md 2020-07-02 09:43:54 -04:00
Lysandre Debut
69d313e808
Bans SentencePiece 0.1.92 (#5418) 2020-07-02 09:23:00 -04:00
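The ban in #5418 amounts to a dependency specifier along these lines (illustrative; not the repo's exact setup.py contents):

```python
# sentencepiece 0.1.92 was reported to crash, so exclude it from the range.
install_requires = [
    "sentencepiece != 0.1.92",
]
```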
George Ho
84e56669af
Fix typo in glossary (#5466) 2020-07-02 09:19:33 -04:00
Teven
c6a510c6fa
Fixing missing arguments for TransfoXL tokenizer when using TextGenerationPipeline (#5465)
* Override _parse_and_tokenize in `TextGenerationPipeline` to allow for TransfoXL tokenizer arguments
2020-07-02 13:53:33 +02:00
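An illustrative sketch (not the library's exact code) of the override from #5465: the pipeline's tokenization step forwards the TransfoXL-specific keyword arguments that the generic implementation had been dropping.

```python
class TextGenerationPipelineSketch:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _parse_and_tokenize(self, inputs, **kwargs):
        if "TransfoXL" in type(self.tokenizer).__name__:
            # TransfoXL tokenization handles punctuation spacing itself.
            kwargs.setdefault("add_space_before_punct_symbol", True)
        return self.tokenizer(inputs, return_tensors="pt", **kwargs)
```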
Teven
6726416e4a
Changed expected_output_ids in TransfoXL generation test (#5462)
* Changed expected_output_ids in TransfoXL generation test to match #4826 generation PR.

* making black happy

* making isort happy
2020-07-02 11:56:44 +02:00
tommccoy
812def00c9
fix use of mems in Transformer-XL (#4826)
Fixed duplicated memory use in Transformer-XL generation, which led to bad predictions and poor performance.
2020-07-02 11:19:07 +02:00
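A hedged sketch of the bug fixed in #4826: when mems already encode the earlier context, only the newest token should be fed at each generation step; re-feeding the whole sequence duplicates that context and degrades predictions.

```python
def prepare_inputs_for_generation(input_ids, past=None):
    if past is not None:
        # Memories cover everything before the last token; pass only that token.
        return {"input_ids": input_ids[:, -1:], "mems": past}
    return {"input_ids": input_ids}
```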
Patrick von Platen
306f1a2695
Add Reformer MLM notebook (#5450)
* Add Reformer MLM notebook

* Update notebooks/README.md
2020-07-02 00:20:49 +02:00
Patrick von Platen
d16e36c7e5
[Reformer] Add Masked LM Reformer (#5426)
* fix conflicts

* fix

* happy rebasing
2020-07-01 22:43:18 +02:00
Funtowicz Morgan
f4323dbf8c
Don't discard entity_group when the token is the last in the sequence. (#5439)
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2020-07-01 20:30:42 +02:00
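An illustrative sketch (simplified, not the pipeline's exact code) of the bug fixed in #5439: the final token of the sequence must also flush the current entity group, otherwise the last entity is silently dropped.

```python
def group_entities(entities):
    groups, current = [], []
    for i, ent in enumerate(entities):
        current.append(ent)
        is_last = i == len(entities) - 1
        next_differs = not is_last and entities[i + 1]["entity"] != ent["entity"]
        if is_last or next_differs:  # the is_last check is the fix
            groups.append({
                "entity_group": ent["entity"],
                "word": " ".join(e["word"] for e in current),
            })
            current = []
    return groups
```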
Joe Davison
35befd9ce3
Fix tensor label type inference in default collator (#5250)
* allow tensor label inputs to default collator

* replace try/except with type check
2020-07-01 10:40:14 -06:00
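A sketch of the change in #5250 (simplified; field handling omitted): the collator inspects the first label's type directly instead of wrapping conversion in try/except, so tensor labels infer the right dtype.

```python
import torch

def infer_label_dtype(first_label):
    # Tensor labels carry their own dtype; scalars fall back on int vs. float.
    if isinstance(first_label, torch.Tensor):
        return torch.long if first_label.dtype == torch.long else torch.float
    return torch.long if isinstance(first_label, int) else torch.float
```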
Patrick von Platen
fe81f7d12c
finish reformer qa head (#5433) 2020-07-01 12:27:14 -04:00
Patrick von Platen
d697b6ca75
[Longformer] Major Refactor (#5219)
* refactor naming

* add small slow test

* refactor

* refactor naming

* rename selected to extra

* big global attention refactor

* make style

* refactor naming

* save intermed

* refactor functions

* finish function refactor

* fix tests

* fix longformer

* fix longformer

* fix longformer

* fix all tests but one

* finish longformer

* address sams and izs comments

* fix transpose
2020-07-01 17:43:32 +02:00
Sam Shleifer
e0d58ddb65
[fix] Marian tests import (#5442) 2020-07-01 11:42:22 -04:00
Funtowicz Morgan
608d5a7c44
Raises PipelineException on FillMaskPipeline when there are != 1 mask_token in the input (#5389)
* Added PipelineException

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fill-mask pipeline raises exception when more than one mask_token detected.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Put everything in a function.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added tests on pipeline fill-mask when input has != 1 mask_token

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix numel() computation for TF

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing PR comments.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove function typing to avoid import on specific framework.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Retry typing with @julien-c tip.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality².

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify fill-mask mask_token checking.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI
2020-07-01 17:27:47 +02:00
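With #5389, fill-mask inputs must contain exactly one mask token; anything else raises a PipelineException. A quick illustration (distilroberta-base is the pipeline's usual default checkpoint):

```python
from transformers import pipeline
from transformers.pipelines import PipelineException

fill_mask = pipeline("fill-mask", model="distilroberta-base")

try:
    fill_mask("Paris is the <mask> of <mask>.")  # two masks -> exception
except PipelineException as exc:
    print(exc)
```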
Sylvain Gugger
6c55e9fc32
Fix dropdown bug in searches (#5440)
* Trigger CI

* Fix dropdown bug in searches
2020-07-01 11:02:59 -04:00
Sylvain Gugger
734a28a767
Clean up diffs in Trainer/TFTrainer (#5417)
* Cleanup and unify Trainer/TFTrainer

* Forgot to adapt TFTrainingArgs

* In tf scripts n_gpu -> n_replicas

* Update src/transformers/training_args.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* Formatting

* Fix typo

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-07-01 11:00:20 -04:00
Sam Shleifer
43cb03a93d
MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182) 2020-07-01 10:32:50 -04:00
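A short usage sketch for #5182; prepare_translation_batch now routes through the new tokenizer API internally, but the call itself is unchanged (the checkpoint name is a standard Helsinki-NLP model):

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
batch = tokenizer.prepare_translation_batch(["Hello world"])
print(batch["input_ids"].shape)
```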
Sam Shleifer
13deb95a40
Move tests/utils.py -> transformers/testing_utils.py (#5350) 2020-07-01 10:31:17 -04:00
sgugger
9c219305f5 Trigger CI 2020-07-01 10:22:50 -04:00
Sylvain Gugger
64e3d966b1
Add support for past states (#5399)
* Add support for past states

* Style and forgotten self

* You mean, documenting is not enough? I have to actually add it too?

* Add memory support during evaluation

* Fix tests in eval and add TF support

* No need to change this line anymore
2020-07-01 08:11:55 -04:00
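A hedged sketch of the mechanism from #5399 (the past_index/mems names follow the Trainer's convention, but the class below is illustrative): the trainer keeps the model's returned memory states between steps and feeds them back in as inputs.

```python
class PastStateTracker:
    def __init__(self, past_index):
        self.past_index = past_index  # which output element holds the memories
        self._past = None

    def prepare_inputs(self, inputs):
        if self._past is not None:
            inputs["mems"] = self._past  # reuse memories from the previous step
        return inputs

    def update(self, outputs):
        self._past = outputs[self.past_index]
```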
Sylvain Gugger
4ade7491f4
Fix examples titles and optimization doc page (#5408) 2020-07-01 08:11:25 -04:00
Moseli Motsoehli
d60d231ea4
Create README.md (#5422)
* Create README.md

* Update model_cards/MoseliMotsoehli/TswanaBert/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-07-01 05:01:51 -04:00
Jay
298bdab18a
Create model card for schmidek/electra-small-cased (#5400) 2020-07-01 04:01:56 -04:00