Commit Graph

393 Commits

Author SHA1 Message Date
Patrick von Platen
fe81f7d12c
finish reformer qa head (#5433) 2020-07-01 12:27:14 -04:00
Patrick von Platen
d697b6ca75
[Longformer] Major Refactor (#5219)
* refactor naming

* add small slow test

* refactor

* refactor naming

* rename selected to extra

* big global attention refactor

* make style

* refactor naming

* save intermed

* refactor functions

* finish function refactor

* fix tests

* fix longformer

* fix longformer

* fix longformer

* fix all tests but one

* finish longformer

* address sams and izs comments

* fix transpose
2020-07-01 17:43:32 +02:00
Sam Shleifer
e0d58ddb65
[fix] Marian tests import (#5442) 2020-07-01 11:42:22 -04:00
Funtowicz Morgan
608d5a7c44
Raises PipelineException on FillMaskPipeline when there are != 1 mask_token in the input (#5389)
* Added PipelineException

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fill-mask pipeline raises exception when more than one mask_token detected.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Put everything in a function.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added tests on pipeline fill-mask when input has != 1 mask_token

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix numel() computation for TF

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing PR comments.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove function typing to avoid import on specific framework.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Retry typing with @julien-c tip.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality².

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify fill-mask mask_token checking.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI
2020-07-01 17:27:47 +02:00
Sam Shleifer
43cb03a93d
MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182) 2020-07-01 10:32:50 -04:00
Sam Shleifer
13deb95a40
Move tests/utils.py -> transformers/testing_utils.py (#5350) 2020-07-01 10:31:17 -04:00
Sam Shleifer
32d2031458
[fix] slow fill_mask test failure (#5406) 2020-06-30 15:28:15 -04:00
Patrick von Platen
4bcc35cd69
[Docs] Benchmark docs (#5360)
* first doc version

* add benchmark docs

* fix typos

* improve README

* Update docs/source/benchmarks.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* fix naming and docs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-29 16:08:57 +02:00
Sam Shleifer
28a690a80e
[mBART] skip broken forward pass test, stronger integration test (#5327) 2020-06-28 15:08:28 -04:00
Sam Shleifer
393b8dc09a
examples/seq2seq/run_eval.py fixes and docs (#5322) 2020-06-26 19:20:43 -04:00
Thomas Wolf
601d4d699c
[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)
* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
2020-06-26 19:48:14 +02:00
Sam Shleifer
798dbff6a7
[pipelines] Change summarization default to distilbart-cnn-12-6 (#5289) 2020-06-26 11:43:23 -04:00
Funtowicz Morgan
135791e8ef
Add pad_to_multiple_of on tokenizers (reimport) (#5054)
* Add new parameter `pad_to_multiple_of` on tokenizers.

* unittest for pad_to_multiple_of

* Add .name when logging enum.

* Fix missing .items() on dict in tests.

* Add special check + warning if the tokenizer doesn't have proper pad_token.

* Use the correct logger format specifier.

* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.

* Skip test if tokenizer doesn't have pad_token

* Fix RobertaTokenizer on empty input

* Format.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix and updating to simpler API

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-06-26 11:55:57 +02:00
Lysandre Debut
364a5ae1f0
Refactor Code samples; Test code samples (#5036)
* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rust of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-06-25 16:46:00 -04:00
Thomas Wolf
315f464b0a
[tokenizers] Several small improvements and bug fixes (#5287)
* avoid recursion in id checks for fast tokenizers

* better typings and fix #5232

* align slow and fast tokenizers behaviors for Roberta and GPT2

* style and quality

* fix tests - improve typings
2020-06-25 22:17:14 +02:00
Thomas Wolf
27cf1d97f0
[Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252)
* fix-5181

Padding to max sequence length while truncation to another length was wrong on slow tokenizers

* clean up and fix #5155

* fix XLM test

* Fix tests for Transfo-XL

* logging only above WARNING in tests

* switch slow tokenizers tests in @slow

* fix Marian truncation tokenization test

* style and quality

* make the test a lot faster by limiting the sequence length used in tests
2020-06-25 17:24:28 +02:00
Thomas Wolf
7ac9110711
Add more tests on tokenizers serialization - fix bugs (#5056)
* update tests for fast tokenizers + fix small bug in saving/loading

* better tests on serialization

* fixing serialization

* comment cleanup
2020-06-24 21:53:08 +02:00
Lysandre Debut
cf10d4cfdd
Cleaning TensorFlow models (#5229)
* Cleaning TensorFlow models

Update all classes


stylr

* Don't average loss
2020-06-24 11:37:20 -04:00
Patrick von Platen
c2a26ec8a6
[Use cache] Align logic of use_cache with output_attentions and output_hidden_states (#5194)
* fix use cache

* add bart use cache

* fix bart

* finish bart
2020-06-24 16:09:17 +02:00
Patrick von Platen
9fe09cec76
[Benchmark] Extend Benchmark to all model type extensions (#5241)
* add benchmark for all kinds of models

* improved import

* delete bogus files

* make style
2020-06-24 15:11:42 +02:00
Sam Shleifer
58918c76f4
[bart] add config.extra_pos_embeddings to facilitate reuse (#5190) 2020-06-23 11:35:42 -04:00
Thomas Wolf
11fdde0271
Tokenizers API developments (#5103)
* Add return lengths

* make pad a bit more flexible so it can be used as collate_fn

* check all kwargs sent to encoding method are known

* fixing kwargs in encodings

* New AddedToken class in python

This class let you specify specifique tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.

* style and quality

* switched to hugginface tokenizers library for AddedTokens

* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state

* style and quality

* do not raise an error on additional or unused kwargs for tokenize() but only a warning

* transfo-xl pretrained model requires torch

* Update src/transformers/tokenization_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-23 13:36:57 +02:00
Sam Shleifer
5144104070
[fix] remove unused import (#5206) 2020-06-22 23:39:04 -04:00
Sam Shleifer
0d158e38c9
[fix] mobilebert had wrong path, causing slow test failure (#5205) 2020-06-22 23:31:36 -04:00
Thomas Wolf
ebc36108dc
[tokenizers] Fix #5081 and improve backward compatibility (#5125)
* fix #5081 and improve backward compatibility (slightly)

* add nlp to setup.cfg - style and quality

* align default to previous default

* remove test that doesn't generalize
2020-06-22 17:25:43 +02:00
Joseph Liu
f4e1f02210
Output hidden states (#4978)
* Configure all models to use output_hidden_states as argument passed to foward()

* Pass all tests

* Remove cast_bool_to_primitive in TF Flaubert model

* correct tf xlnet

* add pytorch test

* add tf test

* Fix broken tests

* Configure all models to use output_hidden_states as argument passed to foward()

* Pass all tests

* Remove cast_bool_to_primitive in TF Flaubert model

* correct tf xlnet

* add pytorch test

* add tf test

* Fix broken tests

* Refactor output_hidden_states for mobilebert

* Reset and remerge to master

Co-authored-by: Joseph Liu <joseph.liu@coinflex.com>
Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2020-06-22 10:10:45 -04:00
RafaelWO
b99ad457f4
Added feature to move added tokens in vocabulary for Transformer-XL (#4953)
* Fixed resize_token_embeddings for transfo_xl model

* Fixed resize_token_embeddings for transfo_xl.

Added custom methods to TransfoXLPreTrainedModel for resizing layers of
the AdaptiveEmbedding.

* Updated docstring

* Fixed resizinhg cutoffs; added check for new size of embedding layer.

* Added test for resize_token_embeddings

* Fixed code quality

* Fixed unchanged cutoffs in model.config

* Added feature to move added tokens in tokenizer.

* Fixed code quality

* Added feature to move added tokens in tokenizer.

* Fixed code quality

* Fixed docstring, renamed sym to 	oken.

Co-authored-by: Rafael Weingartner <rweingartner.its-b2015@fh-salzburg.ac.at>
2020-06-22 15:40:52 +02:00
Patrick von Platen
fa0be6d761
Benchmarks (#4912)
* finish benchmark

* fix isort

* fix setup cfg

* retab

* fix time measuring of tf graph mode

* fix tf cuda

* clean code

* better error message
2020-06-22 12:06:56 +02:00
Vasily Shamporov
9a3f91088c
Add MobileBert (#4901)
* Add MobileBert

* Quality + Conversion script

* style

* Update src/transformers/modeling_mobilebert.py

* Links to S3

* Style

* TFMobileBert

Slight fixes to the pytorch MobileBert
Style

* MobileBertForMaskedLM (PT + TF)

* MobileBertForNextSentencePrediction (PT + TF)

* MobileFor{MultipleChoice, TokenClassification} (PT + TF)


ss

* Tests + Auto

* Doc

* Tests

* Addressing @sgugger's comments

* Adressing @patrickvonplaten's comments

* Style

* Style

* Integration test

* style

* Model card

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-19 16:38:36 -04:00
Sam Shleifer
84be482f66
AutoTokenizer supports mbart-large-en-ro (#5121) 2020-06-18 20:47:37 -04:00
Sylvain Gugger
5f721ad6e4
Fix #5114 (#5122) 2020-06-18 19:20:04 -04:00
Deniz
32e94cff64
tf add resize_token_embeddings method (#4351)
* resize token embeddings

* add tokens

* add tokens

* add tokens

* add t5 token method

* add t5 token method

* add t5 token method

* typo

* debugging input

* debugging input

* debug

* debug

* debug

* trying to set embedding tokens properly

* set embeddings for generation head too

* set embeddings for generation head too

* debugging

* debugging

* enable generation

* add base method

* add base method

* add base method

* return logits in the main call

* reverting to generation

* revert back

* set embeddings for the bert main layer

* description

* fix conflicts

* logging

* set base model as self

* refactor

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* tf_bert add method

* v0

* v0

* finalize

* final

* black

* add tests

* revert back the emb call

* comments

* comments

* add the second test

* add vocab size condig

* add tf models

* add tf models. add common tests

* remove model specific embedding tests

* stylish

* remove files

* stylez

* Update src/transformers/modeling_tf_transfo_xl.py

change the error.

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* adding unchanged weight test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-18 18:41:26 -04:00
Suraj Patil
ca2d0f98c4
ElectraForMultipleChoice (#4954)
* add ElectraForMultipleChoice

* add  test_for_multiple_choice

* add ElectraForMultipleChoice in auto model

* add ElectraForMultipleChoice in all_model_classes

* add SequenceSummary related parameters

* get rid pooler, use SequenceSummary instead

* add electra multiple choice test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-18 14:59:35 -04:00
Sylvain Gugger
20fa828984
Make default_data_collator more flexible and deprecate old behavior (#5060)
* Make default_data_collator more flexible

* Accept tensors for all features

* Document code

* Refactor

* Formatting
2020-06-17 15:24:51 -04:00
Sam Shleifer
3d495c61ef
Fix marian tokenizer save pretrained (#5043) 2020-06-16 09:48:19 -04:00
Amil Khare
c852036b4a
[cleanup] Hoist ModelTester objects to top level (#4939)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-06-16 08:03:43 -04:00
Funtowicz Morgan
9e03364999
Ability to pickle/unpickle BatchEncoding pickle (reimport) (#5039)
* Added is_fast property on BatchEncoding to indicate if the object comes from a Fast Tokenizer.

* Added __get_state__() & __set_state__() to be pickable.

* Correct tokens() return type from List[int] to List[str]

* Added unittest for BatchEncoding pickle/unpickle

* Added unittest for BatchEncoding is_fast

* More careful checking on BatchEncoding unpickle tests.

* Formatting.

* is_fast should assertTrue on Rust tokenizers.

* Ensure tensorflow has correct way of checking array_equal

* More formatting.
2020-06-16 09:25:25 +02:00
Sylvain Gugger
f9f8a5312e
Add DistilBertForMultipleChoice (#5032)
* Add `DistilBertForMultipleChoice`
2020-06-15 18:31:41 -04:00
Anthony MOI
36434220fc
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-06-15 17:12:51 -04:00
Patrick von Platen
ebba39e4e1
[Bart] Question Answering Model is added to tests (#5024)
* fix test

* Update tests/test_modeling_common.py

* Update tests/test_modeling_common.py
2020-06-15 22:50:09 +02:00
Sam Shleifer
a9f1fc6c94
Add bart-base (#5014) 2020-06-15 13:29:26 -04:00
Sylvain Gugger
1affde2f10
Make DataCollator a callable (#5015)
* Make DataCollator a callable

* Update src/transformers/data/data_collator.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-06-15 11:58:33 -04:00
Suraj Patil
e93ccb3290
BartForQuestionAnswering (#4908) 2020-06-12 15:47:57 -04:00
Sylvain Gugger
538531cde5
Add AlbertForMultipleChoice (#4959)
* Add AlbertForMultipleChoice

* Make up to date and add all models to common tests
2020-06-12 14:20:19 -04:00
Patrick von Platen
86578bb04c
[AutoModel] Split AutoModelWithLMHead into clm, mlm, encoder-decoder (#4933)
* first commit

* add new auto models

* better naming

* fix bert automodel

* fix automodel for pretraining

* add models to init

* fix name typo

* fix typo

* better naming

* future warning instead of depreciation warning
2020-06-12 10:01:49 +02:00
Sam Shleifer
5620033115
[mbart] Fix fp16 testing logic (#4949) 2020-06-11 22:11:34 -04:00
Sam Shleifer
08b59d10e5
MBartTokenizer:add language codes (#3776) 2020-06-11 13:02:33 -04:00
Sylvain Gugger
20451195f0
Support multiple choice in tf common model tests (#4920)
* Support multiple choice in tf common model tests

* Add the input_embeds test
2020-06-11 10:31:26 -04:00
RafaelWO
e80d6c689b
Fix resize_token_embeddings for Transformer-XL (#4759)
* Fixed resize_token_embeddings for transfo_xl model

* Fixed resize_token_embeddings for transfo_xl.

Added custom methods to TransfoXLPreTrainedModel for resizing layers of
the AdaptiveEmbedding.

* Updated docstring

* Fixed resizinhg cutoffs; added check for new size of embedding layer.

* Added test for resize_token_embeddings

* Fixed code quality

* Fixed unchanged cutoffs in model.config

Co-authored-by: Rafael Weingartner <rweingartner.its-b2015@fh-salzburg.ac.at>
2020-06-10 19:03:06 -04:00
Sylvain Gugger
d541938c48
Make multiple choice models work with input_embeds (#4921) 2020-06-10 18:38:34 -04:00