Commit Graph

83 Commits

Author SHA1 Message Date
Thomas Wolf
ba8c4d0ac0
[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)
* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-18 20:51:24 +02:00
Sam Shleifer
96e47d9229
[cleanup] assign todos, faster bart-cnn test (#7835)
* 2 beam output

* unassign/remove TODOs

* remove one more
2020-10-16 03:11:18 -04:00
Sam Shleifer
e3e6517355
Fix 3 failing slow bart/blender tests (#7652) 2020-10-07 22:05:03 -04:00
Sam Shleifer
960faaaf28
Blenderbot (#7418)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-07 19:09:23 -04:00
Patrick von Platen
62f5ae68ec
[Seq2Seq] Fix a couple of bugs and clean examples (#7474)
* clean T5

* fix t5 tests

* fix index typo

* fix tf common test

* fix examples

* change positional ordering for Bart and FSTM

* add signature test

* clean docs and add tests

* add docs to encoder decoder

* clean docs

* correct two doc strings

* remove sig test for TF Elektra & Funnel

* fix tf t5 slow tests

* fix input_ids to inputs in tf

* Update src/transformers/modeling_bart.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_bart.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* implement lysandre results

* make style

* fix encoder decoder typo

* fix tf slow tests

* fix slow tests

* renaming

* remove unused input

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-01 17:38:50 +02:00
Sam Shleifer
9336086ab5
prepare_seq2seq_batch makes labels/ decoder_input_ids made later. (#6654)
* broken test

* batch parity

* tests pass

* boom boom

* boom boom

* split out bart tokenizer tests

* fix tests

* boom boom

* Fixed dataset bug

* Fix marian

* Undo extra

* Get marian working

* Fix t5 tok tests

* Test passing

* Cleanup

* better assert msg

* require torch

* Fix mbart tests

* undo extra decoder_attn_mask change

* Fix import

* pegasus tokenizer can ignore src_lang kwargs

* unused kwarg test cov

* boom boom

* add todo for pegasus issue

* cover one word translation edge case

* Cleanup

* doc
2020-08-28 11:15:17 -04:00
Lysandre
a75c64d80c Black 20 release 2020-08-26 17:20:22 +02:00
Sylvain Gugger
a573777901
Update repo to isort v5 (#6686)
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
2020-08-24 11:03:01 -04:00
Suraj Patil
7581884dee
[BartTokenizerFast] add prepare_seq2seq_batch (#6543) 2020-08-19 10:37:48 -04:00
Sam Shleifer
ab42d74850
Fix bart base test (#6587) 2020-08-18 21:28:10 -04:00
Sam Shleifer
1529bf9680
add BartConfig.force_bos_token_to_be_generated (#6526)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-08-18 19:15:50 -04:00
Suraj Patil
2a77813d53
[BartTokenizer] add prepare s2s batch (#6212)
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
2020-08-17 11:44:46 -04:00
Sam Shleifer
66fa8ceaea
PegasusForConditionalGeneration (torch version) (#6340)
Co-authored-by: Jingqing  Zhang <jingqing.zhang15@imperial.ac.uk>
2020-08-11 14:31:23 -04:00
Sylvain Gugger
d951c14ae4
Model output test (#6155)
* Use return_dict=True in all tests

* Formatting
2020-07-31 09:44:37 -04:00
Sam Shleifer
353b8f1e7a
Add mbart-large-cc25, support translation finetuning (#5129)
improve unittests for finetuning, especially w.r.t testing frozen parameters
fix freeze_embeds for T5
add streamlit setup.cfg
2020-07-07 13:23:01 -04:00
Sam Shleifer
d4886173b2
[Bart] enable test_torchscript, update test_tie_weights (#5457)
* Passing all but one torchscript test

* Style

* move comment

* remove unneeded assert
2020-07-07 10:06:48 -04:00
Sam Shleifer
13deb95a40
Move tests/utils.py -> transformers/testing_utils.py (#5350) 2020-07-01 10:31:17 -04:00
Sam Shleifer
28a690a80e
[mBART] skip broken forward pass test, stronger integration test (#5327) 2020-06-28 15:08:28 -04:00
Sam Shleifer
393b8dc09a
examples/seq2seq/run_eval.py fixes and docs (#5322) 2020-06-26 19:20:43 -04:00
Thomas Wolf
601d4d699c
[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)
* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
2020-06-26 19:48:14 +02:00
Patrick von Platen
c2a26ec8a6
[Use cache] Align logic of use_cache with output_attentions and output_hidden_states (#5194)
* fix use cache

* add bart use cache

* fix bart

* finish bart
2020-06-24 16:09:17 +02:00
Sam Shleifer
84be482f66
AutoTokenizer supports mbart-large-en-ro (#5121) 2020-06-18 20:47:37 -04:00
Patrick von Platen
ebba39e4e1
[Bart] Question Answering Model is added to tests (#5024)
* fix test

* Update tests/test_modeling_common.py

* Update tests/test_modeling_common.py
2020-06-15 22:50:09 +02:00
Sam Shleifer
a9f1fc6c94
Add bart-base (#5014) 2020-06-15 13:29:26 -04:00
Suraj Patil
e93ccb3290
BartForQuestionAnswering (#4908) 2020-06-12 15:47:57 -04:00
Sam Shleifer
5620033115
[mbart] Fix fp16 testing logic (#4949) 2020-06-11 22:11:34 -04:00
Sam Shleifer
08b59d10e5
MBartTokenizer:add language codes (#3776) 2020-06-11 13:02:33 -04:00
Sylvain Gugger
f1fe18465d
Use labels to remove deprecation warnings (#4807) 2020-06-05 16:41:46 -04:00
Julien Chaumond
b42586ea56
Fix CI after killing archive maps (#4724)
* 🐛 Fix model ids for BART and Flaubert
2020-06-02 10:21:09 -04:00
Julien Chaumond
d4c2cb402d
Kill model archive maps (#4636)
* Kill model archive maps

* Fixup

* Also kill model_archive_map for MaskedBertPreTrainedModel

* Unhook config_archive_map

* Tokenizers: align with model id changes

* make style && make quality

* Fix CI
2020-06-02 09:39:33 -04:00
Sam Shleifer
b86e42e0ac
[ci] fix 3 remaining slow GPU failures (#4584) 2020-05-25 19:20:50 -04:00
Sam Shleifer
956c4c4eb4
[gpu slow tests] fix mbart-large-enro gpu tests (#4472) 2020-05-19 19:45:31 -04:00
Julien Chaumond
4bf5042240
Fix BART tests on GPU (#4298) 2020-05-12 09:11:50 -04:00
Sam Shleifer
18db92dd9a
[testing] add timeout_decorator (#3543) 2020-05-01 09:05:47 -04:00
Julien Chaumond
f54dc3f4d5 [ci] Load pretrained models into the default (long-lived) cache
There's an inconsistency right now where:
- we load some models into CACHE_DIR
- and some models in the default cache
- and often, in both for the same models

When running the RUN_SLOW tests, this takes a lot of disk space, time, and bandwidth.

I'd rather always use the default cache
2020-04-30 22:30:15 -04:00
Sam Shleifer
847e7f3379
MarianMTModel.from_pretrained('Helsinki-NLP/opus-marian-en-de') (#3908)
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
2020-04-28 18:22:37 -04:00
Sam Shleifer
7a7fdf71f8
Multilingual BART - (#3602)
- support mbart-en-ro weights
- add MBartTokenizer
2020-04-10 11:25:39 -04:00
Sam Shleifer
715aa5b135
[Bart] Replace config.output_past with use_cache kwarg (#3632) 2020-04-07 19:08:26 -04:00
Sam Shleifer
8deff3acf2
[bart-tiny-random] Put a 5MB model on S3 to allow faster exampl… (#3488) 2020-03-30 12:28:27 -04:00
Patrick von Platen
75ec6c9e3a
[T5] make decoder input ids optional for t5 training (#3521)
* make decoder input ids optional for t5 training

* lm_lables should not be shifted in t5

* add tests

* finish shift right functionality for PT T5

* move shift right to correct class

* cleaner code

* replace -100 values with pad token id

* add assert statement

* remove unnecessary for loop

* make style
2020-03-30 13:45:26 +02:00
Sam Shleifer
f6a23d1911
[BART] add bart-large-xsum weights (#3422) 2020-03-29 10:51:13 -04:00
Sam Shleifer
3ee431dd4c
[Bart/Memory] Two separate, smaller decoder attention masks (#3371) 2020-03-26 21:34:15 -04:00
Sam Shleifer
39371ee454
[Bart/Memory] don't create lm_head (#3323)
* delete lm_head, skips weight tying
* Fixed s3
2020-03-26 18:40:39 -04:00
Patrick von Platen
95e00d0808
Clean special token init in modeling_....py (#3264)
* make style

* fix conflicts
2020-03-20 21:41:04 +01:00
Patrick von Platen
bbf26c4e61
Support T5 Generation (#3228)
* fix conflicts

* update bart max length test

* correct spelling mistakes

* implemented model specific encode function

* fix merge conflicts

* better naming

* save intermediate state -> need to rethink strucuture a bit

* leave tf problem as it is for now

* current version

* add layers.pop

* remove ipdb

* make style

* clean return cut decoding

* remove ipdbs

* Fix restoring layers in the decoders that doesnt exists.

* push good intermediate solution for now

* fix conflicts

* always good to refuse to merge conflicts when rebasing

* fix small bug

* improve function calls

* remove unused file

* add correct scope behavior for t5_generate

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2020-03-19 23:18:23 +01:00
Sam Shleifer
ad7233fc01
[BART] cleanup: remove redundant kwargs, improve docstrings (#3319) 2020-03-19 11:16:51 -04:00
Patrick von Platen
e8f44af5bf
[generate] do_sample default back to False (#3298)
* change do_samples back

* None better default as boolean

* adapt do_sample to True in test example

* make style
2020-03-17 10:52:37 -04:00
Sam Shleifer
b2c1a447fe
[BART] Delete redundant unit test (#3302) 2020-03-16 23:09:10 -04:00
Sam Shleifer
5ea8ba67b4
[BART] Remove unused kwargs (#3279)
* Remove unused kwargs
* dont call forward in tests
2020-03-15 23:00:44 -04:00
Thomas Wolf
3814e167d9
Merge pull request #3225 from patrickvonplaten/finalize_merge_bart_generate_into_default_generate
Complete merge Seq-2-Seq generation into default generation
2020-03-14 15:08:59 +01:00