* Add Luke training
* Fix true label tags
* Fix true label tags
* Fix true label tags
* Update the data collator for Luke
* Some training refactor for Luke
* Improve data collator for Luke
* Fix import
* Fix datasets concatenation
* Add the --max_entity_length argument for Luke models
* Remove unused code
* Fix style issues
* Fix style issues
* Move the Luke training into a separate folder
* Fix style
* Fix naming
* Fix filtering
* Fix filtering
* Fix filter
* Update some preprocessing
* Move luke to research_projects
* Checkstyle
* Address comments
* Fix style
* Fix the inconsistency of loss calculation between PT/TF XLNetLMHeadModel
* override test_loss_computation
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* add xlm roberta xl
* add convert xlm xl fairseq checkpoint to pytorch
* fix init and docs for xlm-roberta-xl
* fix indentation
* add test for XLM-R xl,xxl
* fix model hub name
* fix some stuff
* up
* correct init
* fix more
* fix as suggestions
* add torch_device
* fix default values of doc strings
* fix leftovers
* merge to master
* up
* correct hub names
* fix docs
* fix model
* up
* finalize
* last fix
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add copied from
* make style
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* clean commit of changes
* apply review feedback, make edits
* fix backticks, minor formatting
* 🖍 make fixup and minor edits
* 🖍 fix # in header
* 📝 update code sample without from_pt
* 📝 final review
* [deepspeed] saving checkpoint fallback when fp16 weights aren't saved
* Bump required deepspeed version to match usage when saving checkpoints
* update version
Co-authored-by: Mihai Balint <balint.mihai@gmail.com>
* Fix support for `batch_size` and `num_return_sequences` in the `text-generation` pipeline
And `text2text-generation` too.
The bug was caused by the output batch dimension containing both the incoming batch
**and** the generated `num_return_sequences` per input.
The fix consists in splitting these back into
separate dimensions.
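A minimal sketch of the splitting idea in PyTorch (the shapes and names here are illustrative, not the pipeline's actual internals):

```python
import torch

batch_size, num_return_sequences, seq_len = 2, 3, 5

# generate() returns a single flat batch dimension that mixes the
# incoming examples with the sequences generated per example:
# shape (batch_size * num_return_sequences, seq_len).
generated = torch.zeros(batch_size * num_return_sequences, seq_len, dtype=torch.long)

# Splitting that flat dimension back into two recovers, for each
# input example, its num_return_sequences generated sequences.
per_example = generated.reshape(batch_size, num_return_sequences, seq_len)
assert per_example.shape == (batch_size, num_return_sequences, seq_len)
```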
* TF support.
* Odd backward compatibility script in the way.
* add new test
* add a feature to save the sentencepiece tokenizer model when the init file was deleted
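A hedged sketch of the underlying idea using the public `sentencepiece` API (file names here are illustrative): a loaded processor can re-serialize its model proto, so the tokenizer can still be saved even after the original file is gone.

```python
import sentencepiece as spm

# Load an existing SentencePiece model file.
sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")

# Even if "spiece.model" is later deleted from disk, the in-memory
# processor can re-serialize its model proto, so saving can fall back
# to writing these bytes instead of copying the (now missing) file.
with open("spiece.restored.model", "wb") as f:
    f.write(sp.serialized_model_proto())
```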
* update marian
* update m2m_100
* fix marian
* update speech to text
* override test for layoutxlm
* fix saving bartpho
* remove hardcoded values for bartpho
* special token string version
* finish bartpho
* override layoutxlm test
* add mbart
* move special tokens list
* format
* Revert "format"
This reverts commit 37a40df379.
* simplify list of string of special tokens
* Rewrite `self.fairseq_tokens_to_ids` initialization logic with special tokens
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
* Fix prediction with generate() and the inference of column names
Should now have very few differences with the PyTorch implementation
* Minor edit to parent class
* Update src/transformers/keras_callbacks.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Explaining the dict conversion
* Putting main_input_name back
* Fixes to main_input_name
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fix_torch_device_generate_test
* remove @
* doc tests
* up
* up
* fix doctests
* adapt files
* finish refactor
* up
* save intermediate
* add more logic
* new change
* improve
* next try
* next try
* next try
* next try
* fix final spaces
* fix final spaces
* improve
* renaming
* correct more bugs
* finish wavlm
* add comment
* run on test runner
* finish all speech models
* adapt
* finish
* Added missing code in example notebook - custom datasets fine-tuning
Added the missing code to the tokenize_and_align_labels function in the example notebook on custom datasets - token classification.
The missing code handles adding labels for all but the first token in a single word.
The added code was taken directly from the official Hugging Face example - this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).
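For reference, a minimal sketch of that alignment pattern, assuming a fast tokenizer (so `word_ids()` is available); the `else` branch is the piece that was missing:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(words, word_labels, label_all_tokens=True):
    # words: list of word strings; word_labels: one label id per word.
    tokenized = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous_word_id = [], None
    for word_id in tokenized.word_ids():
        if word_id is None:
            labels.append(-100)  # special tokens ([CLS], [SEP]) are ignored by the loss
        elif word_id != previous_word_id:
            labels.append(word_labels[word_id])  # first sub-token of a word
        else:
            # all but the first sub-token of a word: label them too,
            # or mask them with -100, depending on label_all_tokens
            labels.append(word_labels[word_id] if label_all_tokens else -100)
        previous_word_id = word_id
    tokenized["labels"] = labels
    return tokenized
```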
* Changes requested in the review - keep the code as simple as possible
* Avoid using get_list_of_files in config
* WIP: change tokenizer file getter
* Remove call in tokenizer files
* Remove last call to get_list_of_files
* Better tests
* Unit tests for new function
* Document bad API