Commit Graph

5759 Commits

Author SHA1 Message Date
Guillem García Subies
77c8f6c627
Corrected typo in readme (#8320) 2020-11-05 07:48:36 -05:00
Patrick von Platen
226b9debb7
Update PULL_REQUEST_TEMPLATE.md 2020-11-05 09:40:15 +01:00
Patrick von Platen
6f35c61f93
Update bug-report.md 2020-11-05 09:39:05 +01:00
Yifan Peng
638c0b7c50
Create README.md (#8223)
* Create README.md

* Update README.md

* Apply suggestions from code review

Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-05 03:03:19 -05:00
Sylvain Gugger
9c4aa4ac1a
Clean up data collators and datasets (#8308)
* Clean up data collators and datasets

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Remove needless clone

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-04 17:24:49 -05:00
Manuel Romero
b1d3e95eb5
Fix path to old run_language_modeling.py script (#8302) 2020-11-04 13:17:57 -05:00
Sylvain Gugger
b6e58db277
Speedup doc build (#8301)
* Try -j option

* Try other thing

* Bigger machine

* Test lower sphinx version

* Remove trailing space
2020-11-04 11:51:21 -05:00
Victor SANH
969ccac2e9
adding model cards for distilled models (#8300)
* adding model cards for distil models

* forgot the languages
2020-11-04 11:41:45 -05:00
Nicolas Patry
7342d9a583
Improve QA pipeline error handling (#8286)
- The issue is that with previous code we would have the following:

```python
qa_pipeline = (...)
qa_pipeline(question="Where was he born ?", context="")
# -> IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

The goal here is to improve this to actually return a ValueError
wherever possible.

While at it, I tried to simplify QuestionArgumentHandler's code to
make it smaller and more compact while keeping backward compatibility.
2020-11-04 11:30:42 -05:00
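The idea behind the QA pipeline change above (7342d9a583) can be illustrated with a minimal sketch. This is not the pipeline's actual code: `validate_qa_inputs` is a hypothetical helper, and the checks are assumptions that only show how empty input might be rejected early with a `ValueError` instead of surfacing later as an `IndexError`.

```python
def validate_qa_inputs(question, context):
    """Hypothetical helper: reject unusable input before it reaches the model."""
    # An empty or whitespace-only question cannot be answered.
    if not isinstance(question, str) or not question.strip():
        raise ValueError("`question` must be a non-empty string")
    # An empty context leaves nothing to extract an answer from.
    if not isinstance(context, str) or not context.strip():
        raise ValueError("`context` must be a non-empty string")
    return {"question": question, "context": context}
```

With this sketch, `validate_qa_inputs("Where was he born ?", "")` raises a `ValueError` that names the offending argument.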
Branden Chan
38630e7a87
Update model cards of deepset/roberta-base-squad2 v1 and v2 (#8241)
* update deepset/roberta-base-squad2 to v2

* Update model_cards/deepset/roberta-base-squad2/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-04 11:21:25 -05:00
Manuel Romero
04561ecbe6
Model card: T5-base fine-tuned on QASC (#8299) 2020-11-04 11:20:15 -05:00
Sylvain Gugger
854b44aa38 Revert size change as it doesn't change anything 2020-11-04 11:13:24 -05:00
Sylvain Gugger
414985c427 Upgrade resource for doc building 2020-11-04 10:44:19 -05:00
Sylvain Gugger
cf89724696
Fix validation file loading in scripts (#8298) 2020-11-04 10:42:18 -05:00
Patrick von Platen
cb966e640b
[Generate Test] fix greedy generate test (#8293)
* fix greedy generate test

* delete ipdb
2020-11-04 15:44:36 +01:00
Pengzhi Gao
734afa37f6
Fix typo in language-modeling README.md (#8287) 2020-11-04 09:38:02 -05:00
Stas Bekman
7a7e2c2606
[blenderbot] regex fix (#8282)
Fixing:

```
src/transformers/tokenization_blenderbot.py:163: DeprecationWarning: invalid escape sequence \s
    token = re.sub("\s{2,}", " ", token)
```
2020-11-04 09:02:28 -05:00
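The warning quoted in the regex fix above (7a7e2c2606) appears because `\s` inside an ordinary string literal is not a valid Python escape sequence. One common way to avoid it is a raw string, sketched below. The function name `collapse_whitespace` is hypothetical, and this log does not show whether the commit used exactly this form.

```python
import re


def collapse_whitespace(token: str) -> str:
    # The r"" prefix passes the backslash to the regex engine unchanged,
    # so Python never tries to interpret "\s" as a string escape.
    return re.sub(r"\s{2,}", " ", token)
```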
Ceyda Cinarel
29b536a73a
[WIP] Ner pipeline grouped_entities fixes (#5970)
* Bug fix: NER pipeline shouldn't group separate entities of same type

* style fix

* [Bug Fix] Shouldn't group entities that are both 'B' even if they are the same type
	(B-type1 B-type1) != (B-type1 I-type1)
[Bug Fix] Add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions, because some models train on only the first token of a word and not on the subsequent wordpieces (BERT NER default), so it makes sense to do the same thing at inference time.
	The simplest fix is to just group the subwords with the first wordpiece.
	[TODO] How to handle ignored scores? Just set them to 0 and calculate a zero invariant mean?
	[TODO] Handle a different wordpiece_prefix than ##? Possible approaches:
		get it from the tokenizer? But currently most tokenizers don't have a wordpiece_prefix property.
		have an _is_subword(token)
[Feature add] Added option to `skip_special_tokens`, because it was harder to remove them after grouping.
[Additional Changes] Remove B/I prefix on returned grouped_entities
[Feature Request/TODO] Return indexes?
[Bug TODO] Can't use fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')

* use offset_mapping to fix [UNK] token problem

* ignore score for subwords

* modify ner_pipeline test

* modify ner_pipeline test

* modify ner_pipeline test

* ner_pipeline change ignore_subwords default to true

* add ner_pipeline ignore_subwords=False test case

* fix offset_mapping index

* fix style again duh

* change is_subword and convert_tokens_to_string logic

* merge tests with new test structure

* change test names

* remove old tests

* ner tests for fast tokenizer

* fast tokenizers have convert_tokens_to_string

* Fix the incorrect merge

Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-03 17:21:04 -05:00
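The grouping rule described in the NER commit above (29b536a73a), where `(B-type1 B-type1)` must not be merged but `(B-type1 I-type1)` should be, can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: `group_entities` is a hypothetical function, and the input format and the `entity_group` and `word` keys are chosen for the example. Subword handling and scores are left out.

```python
def group_entities(tagged_tokens):
    """Group (word, tag) pairs by BIO tag; hypothetical sketch.

    A "B-" tag always opens a new group, so two adjacent "B-" tokens of
    the same type stay separate, while "I-" extends the open group of
    the same type.
    """
    groups = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            # Outside any entity: close the open group, if any.
            current = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current is not None and current["entity_group"] == entity_type:
            current["word"] += " " + word
        else:
            # "B-" tag, a type change, or a stray "I-" with no open group.
            # The B/I prefix is dropped from the returned label.
            current = {"entity_group": entity_type, "word": word}
            groups.append(current)
    return groups
```

For example, `[("John", "B-PER"), ("Mary", "B-PER")]` yields two separate `PER` groups, while `[("New", "B-LOC"), ("York", "I-LOC")]` yields a single `LOC` group.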
Stas Bekman
1bb4bba53c
[CIs] Better reports everywhere (#8275)
* make it possible to invoke conftest.py in both test suites without crashing on having the same option added

* perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts

* add `pytest --make-reports` to all CIs (and artifacts)

* fix
2020-11-03 16:57:12 -05:00
Sylvain Gugger
7f556d2e39
Data collator for token classification (#8274)
* Add DataCollatorForTokenClassification and clean tests

* Make quality
2020-11-03 16:33:27 -05:00
Philip May
6a064447f2
improve documentation of training_args.py (#8270)
* improve documentation of training_args.py

- do_train
- do_eval
- do_predict

* fix line too long

* fix style with black on training_args.py

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix line length with utils/style_doc

* black reformatting

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-03 15:57:17 -05:00
Sylvain Gugger
4c19f3baab
Clean Trainer tests and datasets dep (#8268) 2020-11-03 15:50:55 -05:00
Patrick von Platen
068e6b5edd
make files independent (#8267) 2020-11-03 21:13:33 +01:00
Stas Bekman
cd360dcb26
[examples] minimal version requirement run-time check in PL (#8133)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-03 13:17:11 -05:00
Stas Bekman
971c638ee9
forward the worker stderr to the parent process (#8262) 2020-11-03 12:04:53 -05:00
Lysandre
eb6313e823 Fix Tatoeba skip 2020-11-03 10:35:00 -05:00
guillaume-be
74f6f91a9d
Updated ConversationalPipeline to work with encoder-decoder models (#8207)
* Updated ConversationalPipeline to work with encoder-decoder models (e.g. BlenderBot)

* Addition of integration test for EncoderDecoder conversation model

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-03 10:33:01 -05:00
Nicolas Patry
c66ffa3a17
[FIX] TextGenerationPipeline is currently broken. (#8256)
* [FIX] TextGenerationPipeline is currently broken.

It's most likely due to #8180.
What's missing is a multi vs single string handler at the beginning of
the pipe.
And also there was no testing of this pipeline.

* Fixing Conversational tests too.
2020-11-03 10:10:22 -05:00
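The "multi vs single string handler" mentioned in the commit above (c66ffa3a17) can be pictured as a small normalization step at the start of the pipeline. The sketch below is an assumption about the general shape of such a handler, not the code that was merged; `normalize_prompts` is a hypothetical name.

```python
def normalize_prompts(text_inputs):
    """Hypothetical handler: always return a list of prompt strings."""
    # A single string becomes a one-element batch.
    if isinstance(text_inputs, str):
        return [text_inputs]
    # A list or tuple is accepted only if every element is a string.
    if isinstance(text_inputs, (list, tuple)) and all(
        isinstance(text, str) for text in text_inputs
    ):
        return list(text_inputs)
    raise ValueError("expected a string or a list of strings")
```

Downstream code can then iterate over prompts without checking which form the caller passed.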
Patrick von Platen
a1bbcf3f6c
Refactoring the generate() function (#6949)
* first draft

* show design proposition for new generate method

* up

* make more readable

* make first version

* gpt2 tests pass

* make beam search for gpt2 work

* add first encoder-decoder code

* delete typo

* make t5 work

* save intermediate

* make bart work with beam search

* finish beam search bart / t5

* add default kwargs

* make more tests pass

* fix no bad words sampler

* some fixes and tests for all distribution processors

* fix test

* fix rag slow tests

* merge to master

* add nograd to generate

* make all slow tests pass

* speed up generate

* fix edge case bug

* small fix

* correct typo

* add type hints and docstrings

* fix typos in tests

* add beam search tests

* add tests for beam scorer

* fix test rag

* finish beam search tests

* move generation tests in separate file

* fix generation tests

* more tests

* add aggressive generation tests

* fix tests

* add gpt2 sample test

* add more docstring

* add more docs

* finish doc strings

* apply some more of Sylvain's and Sam's comments

* fix some typos

* make fix copies

* apply Lysandre's and Sylvain's comments

* final corrections on examples

* small fix for reformer
2020-11-03 16:04:22 +01:00
Sam Shleifer
b63beb743c
Skip tatoeba tests if Tatoeba-Challenge not cloned (#8260) 2020-11-03 09:49:29 -05:00
Patrick von Platen
9f1747f999
[Seq2Seq] Correct import in Seq2Seq Trainer (#8254) 2020-11-03 07:56:41 -05:00
Stas Bekman
504ff7bb12
2 SinusoidalPositionalEmbedding fixes (#8226) 2020-11-02 18:50:26 -05:00
Patrick von Platen
f744b81572
add new notebooks (#8246) 2020-11-02 20:21:55 +01:00
Patrick von Platen
dc26726df2
fix encoder decoder bug (#8243) 2020-11-02 20:12:34 +01:00
Lysandre Debut
9a23af4aff
Add XLMProphetNetTokenizer to tokenization auto (#8245) 2020-11-02 14:10:09 -05:00
Patrick von Platen
5b178f3c87
Create README.md 2020-11-02 20:03:44 +01:00
Sylvain Gugger
e1b1b614b1
Add line by line option to mlm/plm scripts (#8240)
* Make line by line optional in run_mlm

* Add option to disable dynamic padding

* Add option to plm too and update README

* Typos

* More typos

* Even more typos

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-02 12:27:04 -05:00
Patrick von Platen
ebec410c71
Create README.md 2020-11-02 17:53:22 +01:00
Sylvain Gugger
5406f31a1a
Fix TensorBoardCallback for older versions of PyTorch (#8239) 2020-11-02 10:43:28 -05:00
Sylvain Gugger
d1ad4bff44
Fix bad import with PyTorch <= 1.4.1 (#8237) 2020-11-02 10:26:37 -05:00
Lysandre Debut
3c8d401cf6
Patch reports (#8238) 2020-11-02 10:26:25 -05:00
Martin Monperrus
93354bc779
doc: fix typo (#8235) 2020-11-02 08:53:17 -05:00
Santiago Castro
0c92e7d9fa
Fix ignore list behavior in doctests (#8213) 2020-11-02 08:47:37 -05:00
Nicolas Patry
84caa23301
Fix the behaviour of DefaultArgumentHandler (removing it). (#8180)
* Some work to fix the behaviour of DefaultArgumentHandler by removing it.

* Fixing specific pipelines argument checking.
2020-11-02 12:33:50 +01:00
Zhiqi Huang
00cc2d1df2
DynaBERT model cards update (#8192)
* Update README.md

* Update README.md
2020-11-02 13:19:38 +08:00
Kushal
aa79aa4e7d
Added 12 model cards for Indian Language Models (#8198)
* Create README.md

* added model cards
2020-11-02 13:17:43 +08:00
Patrick von Platen
9bd30f7cf4
[Seq2SeqTrainer] Move import to init to make file self-contained (#8194)
* boom boom

* reverse order
2020-11-01 23:31:55 +01:00
guillaume-be
1f12934df4
[Bug fix] Fixed value for BlenderBot pad token (#8205) 2020-11-01 10:21:57 -05:00
Abi See
8f1c960ee7
Fix two bugs with --logging_first_step (#8193)
* make sure that logging_first_step evaluates

* fix bug with incorrect loss on logging_first_step

* fix style

* logging_first_step only logs, not evals
2020-10-30 16:45:38 -04:00
Avital Oliver
689ff74f99
Minor style improvements for the Flax BERT and RoBERTa examples (#8178)
* Minor style improvements:

1. Use `@nn.compact` rather than `@compact` (so as not to make it seem
   like compact is a standard Python decorator).
2. Move attribute docstrings from two `__call__` methods to comments
   on the attributes themselves. (This was probably a remnant from
   the pre-Linen version where the attributes were arguments to
   `call`.)

* Use black on the Flax modeling code
2020-10-30 16:25:39 -04:00