Commit Graph

5742 Commits

Author SHA1 Message Date
Ceyda Cinarel
29b536a73a
[WIP] Ner pipeline grouped_entities fixes (#5970)
* Bug fix: NER pipeline shouldn't group separate entities of same type

* style fix

* [Bug Fix] Shouldn't group entities that are both 'B' even if they are same type
	(B-type1 B-type1) != (B-type1 I-type1)
[Bug Fix] add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions. Because some models train on only the first token of a word and not on the subsequent wordpieces (BERT NER default). So it makes sense doing the same thing at inference time.
	The simplest fix is to just group the subwords with the first wordpiece.
	[TODO] how to handle ignored scores? just set them to 0 and calculate zero invariant mean ?
	[TODO] handle different wordpiece_prefix ## ? possible approaches:
		get it from tokenizer? but currently most tokenizers dont have a wordpiece_prefix property?
		have an _is_subword(token)
[Feature add] added option to `skip_special_tokens`. Cause It was harder to remove them after grouping.
[Additional Changes] remove B/I prefix on returned grouped_entities
[Feature Request/TODO] Return indexes?
[Bug TODO]  can't use fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')

* use offset_mapping to fix [UNK] token problem

* ignore score for subwords

* modify ner_pipeline test

* modify ner_pipeline test

* modify ner_pipeline test

* ner_pipeline change ignore_subwords default to true

* add ner_pipeline ignore_subword=False test case

* fix offset_mapping index

* fix style again duh

* change is_subword and convert_tokens_to_string logic

* merge tests with new test structure

* change test names

* remove old tests

* ner tests for fast tokenizer

* fast tokenizers have convert_tokens_to_string

* Fix the incorrect merge

Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-03 17:21:04 -05:00
Stas Bekman
1bb4bba53c
[CIs] Better reports everywhere (#8275)
* make it possible to invoke testconf.py in both test suites without crashing on having the same option added

* perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts

* add `pytest --make-reports` to all CIs (and artifacts)

* fix
2020-11-03 16:57:12 -05:00
Sylvain Gugger
7f556d2e39
Data collator for token classification (#8274)
* Add DataCollatorForTokenClassification and clean tests

* Make quality
2020-11-03 16:33:27 -05:00
Philip May
6a064447f2
improve documentation of training_args.py (#8270)
* improve documentation of training_args.py

- do_train
- do_eval
- do_predict

* fix line too long

* fix style with black on training_args.py

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix line length with utils/style_doc

* black reformatting

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-03 15:57:17 -05:00
Sylvain Gugger
4c19f3baab
Clean Trainer tests and datasets dep (#8268) 2020-11-03 15:50:55 -05:00
Patrick von Platen
068e6b5edd
make files independent (#8267) 2020-11-03 21:13:33 +01:00
Stas Bekman
cd360dcb26
[examples] minimal version requirement run-time check in PL (#8133)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-03 13:17:11 -05:00
Stas Bekman
971c638ee9
forward the worker stderr to the parent process (#8262) 2020-11-03 12:04:53 -05:00
Lysandre
eb6313e823 Fix Tatoeba skip 2020-11-03 10:35:00 -05:00
guillaume-be
74f6f91a9d
Updated ConversationalPipeline to work with encoder-decoder models (#8207)
* Updated ConversationalPipeline to work with encoder-decoder models (e.g. BlenderBot)

* Addition of integration test for EncoderDecoder conversation model

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-03 10:33:01 -05:00
Nicolas Patry
c66ffa3a17
[FIX] TextGenerationPipeline is currently broken. (#8256)
* [FIX] TextGenerationPipeline is currently broken.

It's most likely due to #8180.
What's missing is a multi vs single string handler at the beginning of
the pipe.
And also there was no testing of this pipeline.

* Fixing Conversational tests too.
2020-11-03 10:10:22 -05:00
Patrick von Platen
a1bbcf3f6c
Refactoring the generate() function (#6949)
* first draft

* show design proposition for new generate method

* up

* make better readable

* make first version

* gpt2 tests pass

* make beam search for gpt2 work

* add first encoder-decoder code

* delete typo

* make t5 work

* save indermediate

* make bart work with beam search

* finish beam search bart / t5

* add default kwargs

* make more tests pass

* fix no bad words sampler

* some fixes and tests for all distribution processors

* fix test

* fix rag slow tests

* merge to master

* add nograd to generate

* make all slow tests pass

* speed up generate

* fix edge case bug

* small fix

* correct typo

* add type hints and docstrings

* fix typos in tests

* add beam search tests

* add tests for beam scorer

* fix test rag

* finish beam search tests

* move generation tests in seperate file

* fix generation tests

* more tests

* add aggressive generation tests

* fix tests

* add gpt2 sample test

* add more docstring

* add more docs

* finish doc strings

* apply some more of sylvains and sams comments

* fix some typos

* make fix copies

* apply lysandres and sylvains comments

* final corrections on examples

* small fix for reformer
2020-11-03 16:04:22 +01:00
Sam Shleifer
b63beb743c
Skip tatoeba tests if Tatoeba-Challenge not cloned (#8260) 2020-11-03 09:49:29 -05:00
Patrick von Platen
9f1747f999
[Seq2Seq] Correct import in Seq2Seq Trainer (#8254) 2020-11-03 07:56:41 -05:00
Stas Bekman
504ff7bb12
2 SinusoidalPositionalEmbedding fixes (#8226) 2020-11-02 18:50:26 -05:00
Patrick von Platen
f744b81572
add new notebooks (#8246) 2020-11-02 20:21:55 +01:00
Patrick von Platen
dc26726df2
fix encoder decoder bug (#8243) 2020-11-02 20:12:34 +01:00
Lysandre Debut
9a23af4aff
Add XLMProphetNetTokenizer to tokenization auto (#8245) 2020-11-02 14:10:09 -05:00
Patrick von Platen
5b178f3c87
Create README.md 2020-11-02 20:03:44 +01:00
Sylvain Gugger
e1b1b614b1
Add line by line option to mlm/plm scripts (#8240)
* Make line by line optional in run_mlm

* Add option to disable dynamic padding

* Add option to plm too and update README

* Typos

* More typos

* Even more typos

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-02 12:27:04 -05:00
Patrick von Platen
ebec410c71
Create README.md 2020-11-02 17:53:22 +01:00
Sylvain Gugger
5406f31a1a
Fix TensorBoardCallback for older versions of PyTorch (#8239) 2020-11-02 10:43:28 -05:00
Sylvain Gugger
d1ad4bff44
Fix bad import with PyTorch <= 1.4.1 (#8237) 2020-11-02 10:26:37 -05:00
Lysandre Debut
3c8d401cf6
Patch reports (#8238) 2020-11-02 10:26:25 -05:00
Martin Monperrus
93354bc779
doc: fix typo (#8235) 2020-11-02 08:53:17 -05:00
Santiago Castro
0c92e7d9fa
Fix ignore list behavior in doctests (#8213) 2020-11-02 08:47:37 -05:00
Nicolas Patry
84caa23301
Fix the behaviour of DefaultArgumentHandler (removing it). (#8180)
* Some work to fix the behaviour of DefaultArgumentHandler by removing it.

* Fixing specific pipelines argument checking.
2020-11-02 12:33:50 +01:00
Zhiqi Huang
00cc2d1df2
DynaBERT model cards update (#8192)
* Update README.md

* Update README.md
2020-11-02 13:19:38 +08:00
Kushal
aa79aa4e7d
Added 12 model cards for Indian Language Models (#8198)
* Create README.md

* added model cards
2020-11-02 13:17:43 +08:00
Patrick von Platen
9bd30f7cf4
[Seq2SeqTrainer] Move import to init to make file self-contained (#8194)
* boom boom

* reverse order
2020-11-01 23:31:55 +01:00
guillaume-be
1f12934df4
[Bug fix] Fixed value for BlenderBot pad token (#8205) 2020-11-01 10:21:57 -05:00
Abi See
8f1c960ee7
Fix two bugs with --logging_first_step (#8193)
* make sure that logging_first_step evaluates

* fix bug with incorrect loss on logging_first_step

* fix style

* logging_first_step only logs, not evals
2020-10-30 16:45:38 -04:00
Avital Oliver
689ff74f99
Minor style improvements for the Flax BERT and RoBERTa examples (#8178)
* Minor style improvements:

1. Use `@nn.compact` rather than `@compact` (as to not make it seem
   like compact is a standard Python decorator.
2. Move attribute docstrings from two `__call__` methods to comments
   on the attributes themselves. (This was probably a remnant from
   the pre-Linen version where the attributes were arguments to
   `call`.)

* Use black on the Flax modeling code
2020-10-30 16:25:39 -04:00
Sylvain Gugger
9eb3a410cd
Remove deprecated arguments from new run_clm (#8197) 2020-10-30 15:27:20 -04:00
TFUsers
00112c3539
Replace swish with silu (#8166)
* Replace swish with silu

* revert nn.silu to nn.swish due to older version

* simplify optimized silu conditional and fix format

* Update activations.py

* Update activations_tf.py

* Update modeling_flax_utils.py

* Update modeling_openai.py

* add swish testcase

* add pytorch swish testcase

* Add more robust python version check

* more formatting fixes

Co-authored-by: TFUsers <TFUsers@gmail.com>
2020-10-30 15:09:10 -04:00
Sylvain Gugger
cdc48ce92d
Finalize lm examples (#8188)
* Finish the cleanup of the language-modeling examples

* Update main README

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Propagate changes

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-30 14:20:18 -04:00
Sylvain Gugger
089cc1015e
Doc fixes and filter warning in wandb (#8189) 2020-10-30 12:37:34 -04:00
Sam Shleifer
566b083eb1
TFMarian, TFMbart, TFPegasus, TFBlenderbot (#7987)
* Start plumbing

* Marian close

* Small stubs for all children

* Fixed bart

* marian working

* pegasus test is good, but failing

* Checkin tests

* More model files

* Subtle marian, pegasus integration test failures

* Works well

* rm print

* boom boom

* Still failing model2doc

* merge master

* Equivalence test failing, all others fixed

* cleanup

* Fix embed_scale

* Cleanup marian pipeline test

* Undo extra changes

* Smaller delta

* Cleanup model testers

* undo delta

* fix tests import structure

* cross test decorator

* Cleaner set_weights

* Respect authorized_unexpected_keys

* No warnings

* No warnings

* style

* Nest tf import

* black

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* functional dropout

* fixup

* Fixup

* style_doc

* embs

* shape list

* delete slow force_token_id_to_be_generated func

* fixup

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-30 11:23:16 -04:00
Santiago Castro
6279072f5f
Fix typo: s/languaged/language/ (#8165) 2020-10-30 11:22:03 -04:00
Lysandre Debut
10f8c63620
Ci test tf super slow (#8007)
* Test TF GPU CI

* Change cache

* Fix missing torch requirement

* Fix some model tests


Style

* LXMERT

* MobileBERT

* Longformer skip test

* XLNet

* The rest of the tests

* RAG goes OOM in multi gpu setup

* YAML test files

* Last fixes

* Skip doctests

* Fill mask tests

* Yaml files

* Last test fix

* Style

* Update cache

* Change ONNX tests to slow + use tiny model
2020-10-30 10:25:48 -04:00
Nicolas Patry
7e36deec7a
Fixing some warnings in DeBerta (#8176)
* Fixing some warnings in DeBerta

* Fixing docs with their rewritten version.
2020-10-30 09:15:41 -04:00
Stas Bekman
0538820737
[CI] Better reports #2 (#8163) 2020-10-29 19:30:05 -04:00
wlhgtc
9a21b50614
Fix eval ref miss in Chinese WWM. (#8115)
* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme

* fix eval ref file miss bug

* format file

* MOD: move ref code to contrib

* MOD: add delimeter check

* reformat code

* refomat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-29 17:08:39 -04:00
Santiago Castro
fdf893c441
Fix typo: indinces -> indices (#8159)
* Fix typo: indinces -> indices

* Fix some more

* Fix some more

* Fix some more

* Fix CI
2020-10-29 17:04:20 -04:00
Stas Bekman
c83cec44f8
improve error checking (#8157) 2020-10-29 14:05:24 -04:00
Sylvain Gugger
691176283d
Add a template for examples and apply it for mlm and plm examples (#8153)
* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Styling
2020-10-29 13:38:11 -04:00
Sam Shleifer
49e4fece5c
[s2s] distillBART docs for paper replication (#8150) 2020-10-29 12:01:15 -04:00
Sylvain Gugger
acf56408d8
Smarter prediction loop and no- -> no_ in console args (#8151)
* Smarter prediction loop and no- -> no_ in console args

* Fix test
2020-10-29 10:56:25 -04:00
Sylvain Gugger
b0f1c0ee30
Document tokenizer_class in configurations (#8152) 2020-10-29 10:43:45 -04:00
Santiago Castro
969859d5f6
Fix doc errors and typos across the board (#8139)
* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes
2020-10-29 10:33:33 -04:00