Commit Graph

5522 Commits

Author SHA1 Message Date
Huang Lianzhe
6303b5a718
[Bug Fix] The actual batch_size is inconsistent with the settings. (#7235)
* [bug fix] fixed the bug that the actual batch_size is inconsistent with the parameter settings

* reformat

* reformat

* reformat

* add support for dict and BatchEncoding

* add support for dict and BatchEncoding

* add documentation for DataCollatorForNextSentencePrediction

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* rename variables

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-09-22 12:31:21 -04:00
Ola Piktus
c754c41c61
RAG (#6813)
* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* Formatting / renaming prior to actual work

* First commit

* improve comments

* Retrieval evaluation scripts

* refactor to include modeling outputs + MPI retriever

* Fix rag-token model + refactor

* Various fixes + finetuning logic

* use_bos fix

* Retrieval refactor

* Finetuning refactoring and cleanup

* Add documentation and cleanup

* Remove set_up_rag_env.sh file

* Fix retrieval wit HF index

* Fix import errors

* Fix quality errors

* Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867

* fix quality

* Fix RAG Sequence generation

* minor cleanup plus initial tests

* fix test

* fix tests 2

* Comments fix

* post-merge fixes

* Improve readme + post-rebase refactor

* Extra dependencied for tests

* Fix tests

* Fix tests 2

* Refactor test requirements

* Fix tests 3

* Post-rebase refactor

* rename nlp->datasets

* RAG integration tests

* add tokenizer to slow integration test and allow retriever to run on cpu

* add tests; fix position ids warning

* change structure

* change structure

* add from encoder generator

* save working solution

* make all integration tests pass

* add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained

* don't save paths

* delete unnecessary imports

* pass config to AutoTokenizer.from_pretrained for Rag tokenizers

* init wiki_dpr only once

* hardcode legacy index and passages paths (todo: add the right urls)

* finalize config

* finalize retriver api and config api

* LegacyIndex index download refactor

* add dpr to autotokenizer

* make from pretrained more flexible

* fix ragfortokengeneration

* small name changes in tokenizer

* add labels to models

* change default index name

* add retrieval tests

* finish token generate

* align test with previous version and make all tests pass

* add tests

* finalize tests

* implement thoms suggestions

* add first version of test

* make first tests work

* make retriever platform agnostic

* naming

* style

* add legacy index URL

* docstrings + simple retrieval test for distributed

* clean model api

* add doc_ids to retriever's outputs

* fix retrieval tests

* finish model outputs

* finalize model api

* fix generate problem for rag

* fix generate for other modles

* fix some tests

* save intermediate

* set generate to default

* big refactor generate

* delete rag_api

* correct pip faiss install

* fix auto tokenization test

* fix faiss install

* fix test

* move the distributed logic to examples

* model page

* docs

* finish tests

* fix dependencies

* fix import in __init__

* Refactor eval_rag and finetune scripts

* start docstring

* add psutil to test

* fix tf test

* move require torch to top

* fix retrieval test

* align naming

* finish automodel

* fix repo consistency

* test ragtokenizer save/load

* add rag model output docs

* fix ragtokenizer save/load from pretrained

* fix tokenizer dir

* remove torch in retrieval

* fix docs

* fixe finetune scripts

* finish model docs

* finish docs

* remove auto model for now

* add require torch

* remove solved todos

* integrate sylvains suggestions

* sams comments

* correct mistake on purpose

* improve README

* Add generation test cases

* fix rag token

* clean token generate

* fix test

* add note to test

* fix attention mask

* add t5 test for rag

* Fix handling prefix in finetune.py

* don't overwrite index_name

Co-authored-by: Patrick Lewis <plewis@fb.com>
Co-authored-by: Aleksandra Piktus <piktus@devfair0141.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5102.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5067.h2.fair>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
2020-09-22 18:29:58 +02:00
Sylvain Gugger
1ee2194fb6
Mark big downloads slow (#7325)
* Make big downloads as slow

* Add import

* Right order for slow decorator

* More slow tests
2020-09-22 12:21:52 -04:00
Julien Plu
585217c87f
Add generic text classification example in TF (#5716)
* Add new example with nlp

* Update README

* replace nlp by datasets

* Update examples/text-classification/README.md

Add Lysandre's suggestion.

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-09-22 12:05:05 -04:00
Lysandre
6e21f24220 Documentation version 2020-09-22 18:04:39 +02:00
Lysandre
3ebb1b3a2b Release: v3.2.0 2020-09-22 17:36:51 +02:00
Sylvain Gugger
01f0fd0bab
Fixes for LayoutLM (#7318) 2020-09-22 10:37:11 -04:00
Julien Plu
702a76ff92
Create an XLA parameter and fix the mixed precision (#7311)
* Create an XLA parameter and fix mixed precision creation

* Fix issue brought by intellisense

* Complete docstring
2020-09-22 10:19:34 -04:00
Sylvain Gugger
596342c2b9
Support for Windows in check_copies (#7316) 2020-09-22 10:17:48 -04:00
Sylvain Gugger
89edf504bf
Add possibility to evaluate every epoch (#7302)
* Add possibility to evaluate every epoch

* Remove multitype arg

* Remove needless import

* Use a proper enum

* Apply suggestions from @LysandreJik

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* One else and formatting

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-09-22 09:52:29 -04:00
Sylvain Gugger
21ca148090
is_pretokenized -> is_split_into_words (#7236)
* is_pretokenized -> is_split_into_words

* Fix tests
2020-09-22 09:34:35 -04:00
Julien Plu
324f361e91
Fix saving TF custom models (#7291)
* Fix #7277

* Apply style

* Add a full training pipeline test

* Apply style
2020-09-22 09:31:13 -04:00
Minghao Li
cd9a0585ea
Add LayoutLM Model (#7064)
* first version

* finish test docs readme model/config/tokenization class

* apply make style and make quality

* fix layoutlm GitHub link

* fix conflict in index.rst and add layoutlm to pretrained_models.rst

* fix bug in test_parents_and_children_in_mappings

* reformat modeling_auto.py and tokenization_auto.py

* fix bug in test_modeling_layoutlm.py

* Update docs/source/model_doc/layoutlm.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update docs/source/model_doc/layoutlm.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* remove inh, add tokenizer fast, and update some doc

* copy and rename necessary class from modeling_bert to modeling_layoutlm

* Update src/transformers/configuration_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/configuration_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/configuration_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/configuration_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_layoutlm.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* add mish to activations.py, import ACT2FN and import logging from utils

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-09-22 09:28:02 -04:00
Sylvain Gugger
244e1b5ba3
Fix #7304 (#7305) 2020-09-22 09:20:03 -04:00
Lysandre Debut
e46108817e
Adds FSMT to LM head AutoModel (#7312) 2020-09-22 06:35:51 -04:00
Stas Bekman
e2964b8a19
[fsmt] no need to pass device (#7292) 2020-09-22 05:39:06 -04:00
Sylvain Gugger
e4b94d8e58
Copy code from Bert to Roberta and add safeguard script (#7219)
* Copy code from Bert to Roberta and add safeguard script

* Fix docstring

* Comment code

* Formatting

* Update src/transformers/modeling_roberta.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add test and fix bugs

* Fix style and make new comand

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-09-22 05:02:27 -04:00
Sam Shleifer
656c27c3a3
[s2s] save hostname with repo info (#7301)
* save hostname
2020-09-21 17:26:24 -04:00
Thomas Winters
34a1b75f01
Added RobBERT-v2 model card (#7286)
* Added RobBERT-v2 model card

* minor Tweaks

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-09-21 16:17:28 -04:00
jjacampos
6513d16a48
IXAmBERT model card (#7283)
This PR includes the model card for the IXAmBERT model which has been recently uploaded to the huggingface repository.
2020-09-21 16:15:31 -04:00
Stas Bekman
af4b98ed97
[s2s] adjust finetune + test to work with fsmt (#7263) 2020-09-21 15:13:19 -04:00
Stas Bekman
8d562a2d1a
[s2s] s/alpha_loss_encoder/alpha_encoder_loss/ (#7298)
fix to match `distillation.py:        self.alpha_encoder_loss`
2020-09-21 14:14:26 -04:00
Stas Bekman
cbb2f75a16
[s2s tests] fix test_run_eval_search (#7297) 2020-09-21 14:00:40 -04:00
Suraj Patil
7a88ed6c2a
[model card] distlbart-mnli model cards (#7278) 2020-09-21 12:26:18 -04:00
Sylvain Gugger
63276b76d4
Fix #7284 (#7289) 2020-09-21 10:31:26 -04:00
Raphaël Bournhonesque
8d464374ba
Disable missing weight warning (#7282) 2020-09-21 09:14:48 -04:00
Stas Bekman
8ff88d25e9
[fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test (#7224)
* fix USE_CUDA, add pipeline

* USE_CUDA fix

* recode SinusoidalPositionalEmbedding into nn.Embedding subclass

was needed for torchscript to work - this is now part of the state_dict, so will have to remove these keys during save_pretrained

* back out (ci debug)

* restore

* slow last?

* facilitate not saving certain keys and test

* remove no longer used keys

* style

* fix logging import

* cleanup

* Update src/transformers/modeling_utils.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* fix bug in max_positional_embeddings

* rename keys to keys_to_never_save per suggestion, improve the setup

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-09-21 09:13:35 -04:00
Dat Quoc Nguyen
67c4b0c517
Add model cards for new pre-trained BERTweet-COVID19 models (#7269)
Two new pre-trained models "vinai/bertweet-covid19-base-cased" and "vinai/bertweet-covid19-base-uncased" are resulted by further pre-training the pre-trained model "vinai/bertweet-base" on a  corpus of 23M COVID-19 English Tweets for 40 epochs.
2020-09-21 06:12:51 -04:00
Patrick von Platen
0cbe1139b1
Update README.md 2020-09-21 11:53:08 +02:00
Lysandre
aae4edb5f0 Addressing review comment 2020-09-21 11:37:00 +02:00
Suraj Patil
43b9d93875
[example/glue] fix compute_metrics_fn for bart like models (#7248)
* fix compute_metrics_fn

* p.predictions -> preds

* apply suggestions
2020-09-21 05:34:20 -04:00
guillaume-be
39062d05f0
Fixed target_mapping preparation for XLNet when batch size > 1 (incl. beam search) (#7267) 2020-09-21 04:53:52 -04:00
Nadir El Manouzi
4b3e55bdcc
Add "Fine-tune ALBERT for sentence-pair classification" notebook to the community notebooks (#7255) 2020-09-21 04:25:22 -04:00
Stas Bekman
7cbf0f722d
examples/seq2seq/__init__.py mutates sys.path (#7194) 2020-09-20 16:54:42 -04:00
Manuel Romero
a4faeceaed
Fix typo in model name (#7268) 2020-09-20 19:12:30 +02:00
Stas Bekman
47ab3e8262
@slow has to be last (#7251)
Found an issue when `@slow` isn't the last decorator (gets ignored!), so documenting this significance.
2020-09-20 09:17:29 -04:00
Stas Bekman
4f6e525742
model card improvements (#7221) 2020-09-19 17:02:05 -04:00
Stas Bekman
eb074af75e
fsmt tiny model card + script (#7244) 2020-09-19 14:37:12 -04:00
Manuel Romero
1d90d0f386
Add title to model card (#7240) 2020-09-19 02:10:45 -04:00
Manuel Romero
c9b7ef042f
Create README.md (#7239) 2020-09-19 02:09:29 -04:00
Sam Shleifer
83dba10b8f
[s2s] distributed_eval.py saves better speed info (#7242) 2020-09-18 15:46:01 -04:00
Dat Quoc Nguyen
af2322c7a0
Add new pre-trained models BERTweet and PhoBERT (#6129)
* Add BERTweet and PhoBERT models

* Update modeling_auto.py

Re-add `bart` to LM_MAPPING

* Update tokenization_auto.py

Re-add `from .configuration_mobilebert import MobileBertConfig`
not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`

* Add BERTweet and PhoBERT to pretrained_models.rst

* Update tokenization_auto.py

Remove BertweetTokenizer and PhobertTokenizer out of tokenization_auto.py (they are currently not supported by AutoTokenizer.

* Update BertweetTokenizer - without nltk

* Update model card for BERTweet

* PhoBERT - with Auto mode - without import fastBPE

* PhoBERT - with Auto mode - without import fastBPE

* BERTweet - with Auto mode - without import fastBPE

* Add PhoBERT and BERTweet to TF modeling auto

* Improve Docstrings for PhobertTokenizer and BertweetTokenizer

* Update PhoBERT and BERTweet model cards

* Fixed a merge conflict in tokenization_auto

* Used black to reformat BERTweet- and PhoBERT-related files

* Used isort to reformat BERTweet- and PhoBERT-related files

* Reformatted BERTweet- and PhoBERT-related files based on flake8

* Updated test files

* Updated test files

* Updated tf test files

* Updated tf test files

* Updated tf test files

* Updated tf test files

* Update commits from huggingface

* Delete unnecessary files

* Add tokenizers to auto and init files

* Add test files for tokenizers

* Revised model cards

* Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files

* Revised test files

* Update orders of Phobert and Bertweet tokenizers in auto tokenization file
2020-09-18 13:16:43 -04:00
Patrick von Platen
9397436ea5
Create README.md 2020-09-18 16:52:00 +02:00
Patrick von Platen
7eeca4d399
Create README.md 2020-09-18 16:44:02 +02:00
Patrick von Platen
31516c776a
Update README.md 2020-09-18 16:37:14 +02:00
Patrick von Platen
4c14669a78
Update README.md 2020-09-18 16:35:11 +02:00
Yih-Dar
3a03bab9db
Fix a few countings (steps / epochs) in trainer_tf.py (#7175) 2020-09-18 09:28:56 -04:00
Stefan Schweter
ee9eae4e06
token-classification: update url of GermEval 2014 dataset (#6571) 2020-09-18 06:18:06 -04:00
Julien Chaumond
eef8d94d19 [model_cards]
We use ISO 639-1 cc @gentaiscool
2020-09-18 12:09:24 +02:00
Patrick von Platen
afd6a9f827
Create README.md 2020-09-18 11:41:12 +02:00