* HerBERT transformer model for Polish language understanding.
* HerbertTokenizerFast generated with HerbertConverter
* Herbert base and large model cards
* Herbert model cards with tags
* Herbert tensorflow models
* Herbert model tests based on Bert test suit
* src/transformers/tokenization_herbert.py edited online with Bitbucket
* src/transformers/tokenization_herbert.py edited online with Bitbucket
* docs/source/model_doc/herbert.rst edited online with Bitbucket
* Herbert tokenizer tests and bug fixes
* src/transformers/configuration_herbert.py edited online with Bitbucket
* Copyrights and tests for TFHerbertModel
* model_cards/allegro/herbert-base-cased/README.md edited online with Bitbucket
* model_cards/allegro/herbert-large-cased/README.md edited online with Bitbucket
* Bug fixes after testing
* Reformat modified_only_fixup
* Proper order of configuration
* Herbert proper documentation formatting
* Formatting with make modified_only_fixup
* Dummies fixed
* Adding missing models to documentation
* Removing HerBERT model as it is a simple extension of BERT
* Update model_cards/allegro/herbert-base-cased/README.md
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* Update model_cards/allegro/herbert-large-cased/README.md
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* HerbertTokenizer deprecated configuration removed
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* Improving Pipelines by defaulting to framework='tf' when
pytorch seems unavailable.
* Actually changing the default resolution order to account for model
defaults
Adding a new tests for each pipeline to check that pipeline(task) works
too without manually adding the framework too.
* Add Documentation for GPT-1 Classification
* Add GPT-1 with Classification head
* Add tests for GPT-1 Classification
* Add GPT-1 For Classification to auto models
* Remove authorized missing keys, change checkpoint to openai-gpt
* Reintroduce clean_text call which was removed by mistake in #4723
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Added unittest for clean_text parameter on Bert tokenizer.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Better unittest name.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Adapt unittest to use untrained tokenizer.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Code quality + update test
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* [WIP] SP tokenizers
* fixing tests for T5
* WIP tokenizers
* serialization
* update T5
* WIP T5 tokenization
* slow to fast conversion script
* Refactoring to move tokenzier implementations inside transformers
* Adding gpt - refactoring - quality
* WIP adding several tokenizers to the fast world
* WIP Roberta - moving implementations
* update to dev4 switch file loading to in-memory loading
* Updating and fixing
* advancing on the tokenizers - updating do_lower_case
* style and quality
* moving forward with tokenizers conversion and tests
* MBart, T5
* dumping the fast version of transformer XL
* Adding to autotokenizers + style/quality
* update init and space_between_special_tokens
* style and quality
* bump up tokenizers version
* add protobuf
* fix pickle Bert JP with Mecab
* fix newly added tokenizers
* style and quality
* fix bert japanese
* fix funnel
* limite tokenizer warning to one occurence
* clean up file
* fix new tokenizers
* fast tokenizers deep tests
* WIP adding all the special fast tests on the new fast tokenizers
* quick fix
* adding more fast tokenizers in the fast tests
* all tokenizers in fast version tested
* Adding BertGenerationFast
* bump up setup.py for CI
* remove BertGenerationFast (too early)
* bump up tokenizers version
* Clean old docstrings
* Typo
* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
* Initial callback proposal
* Finish various callbacks
* Post-rebase conflicts
* Fix tests
* Don't use something that's not set
* Documentation
* Remove unwanted print.
* Document all models can work
* Add tests + small fixes
* Update docs/source/internal/trainer_utils.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments
* Fix TF tests
* Real fix this time
* This one should work
* Fix typo
* Really fix typo
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* configuration_squeezebert.py
thin wrapper around bert tokenizer
fix typos
wip sb model code
wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working
set up squeezebert to use BertModelOutput when returning results.
squeezebert documentation
formatting
allow head mask that is an array of [None, ..., None]
docs
docs cont'd
path to vocab
docs and pointers to cloud files (WIP)
line length and indentation
squeezebert model cards
formatting of model cards
untrack modeling_squeezebert_scratchpad.py
update aws paths to vocab and config files
get rid of stub of NSP code, and advise users to pretrain with mlm only
fix rebase issues
redo rebase of modeling_auto.py
fix issues with code formatting
more code format auto-fixes
move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert
tests for squeezebert modeling and tokenization
fix typo
move squeezebert before bert in modeling_auto.py to fix inheritance problem
disable test_head_masking, since squeezebert doesn't yet implement head masking
fix issues exposed by the test_modeling_squeezebert.py
fix an issue exposed by test_tokenization_squeezebert.py
fix issue exposed by test_modeling_squeezebert.py
auto generated code style improvement
issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head()
update copyright
resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask
docs
add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli
autogenerated formatting tweaks
integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings
* tiny change to order of imports
* Trainer should not modify its TrainingArguments
* Trainer should not modify its TrainingArguments
* Trainer should not modify its TrainingArguments
* Add test of resumed training
* Fixes
* Non multiGPU test
* Clean Trainer state
* Add more to the state
* Documentation
* One last test
* Make resume training test more complete
* Unwanted changes
* GPT2 gradient checkpointing
* find_unused_parameters removed if checkpointing
* find_unused_parameters removed if checkpointing
* Update src/transformers/configuration_gpt2.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Added a test for generation with checkpointing
* Update src/transformers/configuration_gpt2.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Changed name to all no_... arguments and all references to them, inverting the boolean condition
* Change benchmark tests to use new Benchmark Args
* Update src/transformers/benchmark/benchmark_args_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/benchmark/benchmark.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Fix Style. Add --no options in help
* fix some part of tests
* Update src/transformers/benchmark/benchmark_args_utils.py
* Update src/transformers/benchmark/benchmark_args_utils.py
* Update src/transformers/benchmark/benchmark_args_utils.py
* fix all tests
* make style
* add backwards compability
* make backwards compatible
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: fmcurti <fcurti@DESKTOP-RRQURBM.localdomain>
* skip decorators: docs, tests, bugs
* another important note
* style
* bloody style
* add @pytest.mark.parametrize
* add note
* no idea what it wants :(
* added rag WIP
* path fix
* Formatting / renaming prior to actual work
* added rag WIP
* path fix
* Formatting / renaming prior to actual work
* added rag WIP
* path fix
* Formatting / renaming prior to actual work
* added rag WIP
* Formatting / renaming prior to actual work
* First commit
* improve comments
* Retrieval evaluation scripts
* refactor to include modeling outputs + MPI retriever
* Fix rag-token model + refactor
* Various fixes + finetuning logic
* use_bos fix
* Retrieval refactor
* Finetuning refactoring and cleanup
* Add documentation and cleanup
* Remove set_up_rag_env.sh file
* Fix retrieval wit HF index
* Fix import errors
* Fix quality errors
* Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867
* fix quality
* Fix RAG Sequence generation
* minor cleanup plus initial tests
* fix test
* fix tests 2
* Comments fix
* post-merge fixes
* Improve readme + post-rebase refactor
* Extra dependencied for tests
* Fix tests
* Fix tests 2
* Refactor test requirements
* Fix tests 3
* Post-rebase refactor
* rename nlp->datasets
* RAG integration tests
* add tokenizer to slow integration test and allow retriever to run on cpu
* add tests; fix position ids warning
* change structure
* change structure
* add from encoder generator
* save working solution
* make all integration tests pass
* add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained
* don't save paths
* delete unnecessary imports
* pass config to AutoTokenizer.from_pretrained for Rag tokenizers
* init wiki_dpr only once
* hardcode legacy index and passages paths (todo: add the right urls)
* finalize config
* finalize retriver api and config api
* LegacyIndex index download refactor
* add dpr to autotokenizer
* make from pretrained more flexible
* fix ragfortokengeneration
* small name changes in tokenizer
* add labels to models
* change default index name
* add retrieval tests
* finish token generate
* align test with previous version and make all tests pass
* add tests
* finalize tests
* implement thoms suggestions
* add first version of test
* make first tests work
* make retriever platform agnostic
* naming
* style
* add legacy index URL
* docstrings + simple retrieval test for distributed
* clean model api
* add doc_ids to retriever's outputs
* fix retrieval tests
* finish model outputs
* finalize model api
* fix generate problem for rag
* fix generate for other modles
* fix some tests
* save intermediate
* set generate to default
* big refactor generate
* delete rag_api
* correct pip faiss install
* fix auto tokenization test
* fix faiss install
* fix test
* move the distributed logic to examples
* model page
* docs
* finish tests
* fix dependencies
* fix import in __init__
* Refactor eval_rag and finetune scripts
* start docstring
* add psutil to test
* fix tf test
* move require torch to top
* fix retrieval test
* align naming
* finish automodel
* fix repo consistency
* test ragtokenizer save/load
* add rag model output docs
* fix ragtokenizer save/load from pretrained
* fix tokenizer dir
* remove torch in retrieval
* fix docs
* fixe finetune scripts
* finish model docs
* finish docs
* remove auto model for now
* add require torch
* remove solved todos
* integrate sylvains suggestions
* sams comments
* correct mistake on purpose
* improve README
* Add generation test cases
* fix rag token
* clean token generate
* fix test
* add note to test
* fix attention mask
* add t5 test for rag
* Fix handling prefix in finetune.py
* don't overwrite index_name
Co-authored-by: Patrick Lewis <plewis@fb.com>
Co-authored-by: Aleksandra Piktus <piktus@devfair0141.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5102.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5067.h2.fair>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
* Copy code from Bert to Roberta and add safeguard script
* Fix docstring
* Comment code
* Formatting
* Update src/transformers/modeling_roberta.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Add test and fix bugs
* Fix style and make new comand
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* fix USE_CUDA, add pipeline
* USE_CUDA fix
* recode SinusoidalPositionalEmbedding into nn.Embedding subclass
was needed for torchscript to work - this is now part of the state_dict, so will have to remove these keys during save_pretrained
* back out (ci debug)
* restore
* slow last?
* facilitate not saving certain keys and test
* remove no longer used keys
* style
* fix logging import
* cleanup
* Update src/transformers/modeling_utils.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* fix bug in max_positional_embeddings
* rename keys to keys_to_never_save per suggestion, improve the setup
* Update src/transformers/modeling_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Add BERTweet and PhoBERT models
* Update modeling_auto.py
Re-add `bart` to LM_MAPPING
* Update tokenization_auto.py
Re-add `from .configuration_mobilebert import MobileBertConfig`
not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`
* Add BERTweet and PhoBERT to pretrained_models.rst
* Update tokenization_auto.py
Remove BertweetTokenizer and PhobertTokenizer out of tokenization_auto.py (they are currently not supported by AutoTokenizer.
* Update BertweetTokenizer - without nltk
* Update model card for BERTweet
* PhoBERT - with Auto mode - without import fastBPE
* PhoBERT - with Auto mode - without import fastBPE
* BERTweet - with Auto mode - without import fastBPE
* Add PhoBERT and BERTweet to TF modeling auto
* Improve Docstrings for PhobertTokenizer and BertweetTokenizer
* Update PhoBERT and BERTweet model cards
* Fixed a merge conflict in tokenization_auto
* Used black to reformat BERTweet- and PhoBERT-related files
* Used isort to reformat BERTweet- and PhoBERT-related files
* Reformatted BERTweet- and PhoBERT-related files based on flake8
* Updated test files
* Updated test files
* Updated tf test files
* Updated tf test files
* Updated tf test files
* Updated tf test files
* Update commits from huggingface
* Delete unnecessary files
* Add tokenizers to auto and init files
* Add test files for tokenizers
* Revised model cards
* Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files
* Revised test files
* Update orders of Phobert and Bertweet tokenizers in auto tokenization file