* Adding `AutomaticSpeechRecognitionPipeline`.
- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.
Future work:
- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.
* Update tests/test_pipelines_automatic_speech_recognition.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Adding docs + main import + type checking + LICENSE.
* Doc style !.
* Fixing TYPE_HINT.
* Specifying waveform shape in the docs.
* Adding asserts + specify in the documentation the shape of the input
np.ndarray.
* Update src/transformers/pipelines/automatic_speech_recognition.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Adding require to tests + move the `feature_extractor` doc.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Initial support for upload to hub
* push -> upload
* Fixes + examples
* Fix torchhub test
* Torchhub test I hate you
* push_model_to_hub -> push_to_hub
* Apply mixin to other pretrained models
* Remove ABC inheritance
* Add tests
* Typo
* Run tests
* Install git-lfs
* Change approach
* Add push_to_hub to all
* Staging test suite
* Typo
* Maybe like this?
* More deps
* Cache
* Adapt name
* Quality
* MOAR tests
* Put it in testing_utils
* Docs + torchhub last hope
* Styling
* Wrong method
* Typos
* Update src/transformers/file_utils.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Base move
* Examples reorganization
* Update references
* Put back test data
* Move conftest
* More fixes
* Move test data to test fixtures
* Update path
* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments and clean
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Added documentation for data collator.
* Update docs/source/data_collator.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Added documentation for data collator.
* Added documentation for the data collator.
* Merge branch 'doc_DataCollator' of C:\Users\mahii\PycharmProjects\transformers with conflicts.
* Update documentation for the data collator.
* Update documentation for the data collator.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Amna <A.A.Ahmad@student.tudelft.nl>
* make fairscale and deepspeed setup extras
* fix default
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* no reason not to ask for the good version
* update the CIs
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Documentation about loading a fast tokenizer within Transformers
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* pass hf optimizer and scheduler to deepspeed if not specified in ds config
* pass hf optimizer and scheduler to deepspeed if not specified in ds config
* update
* make init_deepspeed support config dict
* fix docstring formatting
* clean up trainer's comments
* add new tests
* fix type
* composit argparse doesn't work
* style
* add a new test, rename others
* document new functionality
* complete tests, add docs
* style
* correct level
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add new methods to the doc
* must tell DS we are using a non-native optimizer
* add protection against cpu_offload + HF optimizer combo
* fix the cli overrides
* sync docs + tests
* restore AdamW
* better docs
* need new version
* no longer needed
* remove outdate information
* refactor duplicated code
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Use extlinks to point hyperlink with the version of code
* Point to version on release and master until then
* Apply style
* Correct links
* Add missing backtick
* Simple missing backtick after all.
Co-authored-by: Raghavendra Sugeeth P S <raghav-5305@raghav-5305.csez.zohocorpin.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* add past_key_values
* add use_cache option
* make mask before cutting ids
* adjust position_ids according to past_key_values
* flatten past_key_values
* fix positional embeds
* fix _reorder_cache
* set use_cache to false when not decoder, fix attention mask init
* add test for caching
* add past_key_values for Roberta
* fix position embeds
* add caching test for roberta
* add doc
* make style
* doc, fix attention mask, test
* small fixes
* adress patrick's comments
* input_ids shouldn't start with pad token
* use_cache only when decoder
* make consistent with bert
* make copies consistent
* add use_cache to encoder
* add past_key_values to tapas attention
* apply suggestions from code review
* make coppies consistent
* add attn mask in tests
* remove copied from longformer
* apply suggestions from code review
* fix bart test
* nit
* simplify model outputs
* fix doc
* fix output ordering
* Add label smoothing in Trainer
* Add options for scheduler and Adafactor in Trainer
* Put Seq2SeqTrainer in the main lib
* Apply suggestions from code review
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Address review comments and adapt scripts
* Documentation
* Move test not using script to tests folder
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Add early stopping patience and minimum threshold metric must improve to prevent early stopping to pytorch trainer
* Add early stopping test
* Set patience counter to 0 if best metric not defined yet
* Make early stopping a callback. Add callback event for updating the best metric for early stopping callback to trigger on.
* Run make style
* make funciton name sensible
* Improve new argument docstring wording and hope that flakey CI test passes.
* Use on_evaluation callback instead of custom. Remove some debug printing
* Move early stopping arguments and state into early stopping callback
* Run make style
* Remove old code
* Fix docs formatting. make style went rogue on me.
* Remove copied attributes and fix variable
* Add assertions on training arguments instead of mutating them. Move comment out of public docs.
* Make separate test for early stopping callback. Add test of invalid arguments.
* Run make style... I remembered before CI this time!
* appease flake8
* Add EarlyStoppingCallback to callback docs
* Make docstring EarlyStoppingCallabck match other callbacks.
* Fix typo in docs
* Output cross-attention with decoder attention output
* Update src/transformers/modeling_bert.py
* add cross-attention for t5 and bart as well
* fix tests
* correct typo in docs
* add sylvains and sams comments
* correct typo
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* first draft
* show design proposition for new generate method
* up
* make better readable
* make first version
* gpt2 tests pass
* make beam search for gpt2 work
* add first encoder-decoder code
* delete typo
* make t5 work
* save indermediate
* make bart work with beam search
* finish beam search bart / t5
* add default kwargs
* make more tests pass
* fix no bad words sampler
* some fixes and tests for all distribution processors
* fix test
* fix rag slow tests
* merge to master
* add nograd to generate
* make all slow tests pass
* speed up generate
* fix edge case bug
* small fix
* correct typo
* add type hints and docstrings
* fix typos in tests
* add beam search tests
* add tests for beam scorer
* fix test rag
* finish beam search tests
* move generation tests in seperate file
* fix generation tests
* more tests
* add aggressive generation tests
* fix tests
* add gpt2 sample test
* add more docstring
* add more docs
* finish doc strings
* apply some more of sylvains and sams comments
* fix some typos
* make fix copies
* apply lysandres and sylvains comments
* final corrections on examples
* small fix for reformer
* first attempt to add AzureML callbacks
* func arg fix
* var name fix, but still won't fix error...
* fixing as in https://discuss.huggingface.co/t/how-to-integrate-an-azuremlcallback-for-logging-in-azure/1713/2
* Avoid lint check of azureml import
* black compliance
* Make isort happy
* Fix point typo in docs
* Add AzureML to Callbacks docs
* Attempt to make sphinx happy
* Format callback docs
* Make documentation style happy
* Make docs compliant to style
Co-authored-by: Davide Fiocco <davide.fiocco@frontiersin.net>
* Important files
* Styling them all
* Revert "Styling them all"
This reverts commit 7d029395fd.
* Syling them for realsies
* Fix syntax error
* Fix benchmark_utils
* More fixes
* Fix modeling auto and script
* Remove new line
* Fixes
* More fixes
* Fix more files
* Style
* Add FSMT
* More fixes
* More fixes
* More fixes
* More fixes
* Fixes
* More fixes
* More fixes
* Last fixes
* Make sphinx happy
* Add MLflow integration class
Add integration code for MLflow in integrations.py along with the code
that checks that MLflow is installed.
* Add MLflowCallback import
Add import of MLflowCallback in trainer.py
* Handle model argument
Allow the callback to handle model argument and store model config items as hyperparameters.
* Log parameters to MLflow in batches
MLflow cannot log more than a hundred parameters at once.
Code added to split the parameters into batches of 100 items and log the batches one by one.
* Fix style
* Add docs on MLflow callback
* Fix issue with unfinished runs
The "fluent" api used in MLflow integration allows only one run to be active at any given moment. If the Trainer is disposed off and a new one is created, but the training is not finished, it will refuse to log the results when the next trainer is created.
* Add MLflow integration class
Add integration code for MLflow in integrations.py along with the code
that checks that MLflow is installed.
* Add MLflowCallback import
Add import of MLflowCallback in trainer.py
* Handle model argument
Allow the callback to handle model argument and store model config items as hyperparameters.
* Log parameters to MLflow in batches
MLflow cannot log more than a hundred parameters at once.
Code added to split the parameters into batches of 100 items and log the batches one by one.
* Fix style
* Add docs on MLflow callback
* Fix issue with unfinished runs
The "fluent" api used in MLflow integration allows only one run to be active at any given moment. If the Trainer is disposed off and a new one is created, but the training is not finished, it will refuse to log the results when the next trainer is created.