* Adding support for `microphone` streaming within pipeline.
- Uses `ffmpeg` to get microphone data.
- Makes sure alignment is made to `size_of_sample`.
- Works by sending `{"raw": ..data.., "stride": (n, left, right),
"partial": bool}`
directly to the pipeline enabling to stream partial results and still
get inference.
- Let's `partial` information flow through the pipeline to enable caller
to get it back and choose to display text or not.
- The striding reconstitution is bound to have errors since CTC does not
keep previous state. Currently most of the errors are we don't know if
there's a space or not between two chunks.
Since we have some left striding info, we could use that during decoding
to choose what to do with those spaces and even extra letters maybe (if
the stride is long enough, it's bound to cover at least a few symbols)
Fixing tests.
Protecting with `require_torch`.
`raw_ctc` support for nicer demo.
Post rebase fixes.
Revamp to split raw_mic_data from it's live chunking.
- Requires a refactor to make everything a bit cleaner.
Automatic resampling.
Small fix.
Small fix.
* Post rebase fix (need to let super handle more logic, reorder args.)
* Update docstrings
* Docstring format.
* Remove print.
* Prevent flow of `input_values`.
* Fixing `stride` too.
* Fixing the PR by removing `raw_ctc`.
* Better docstrings.
* Fixing init.
* Update src/transformers/pipelines/audio_utils.py
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
* Update tests/test_pipelines_automatic_speech_recognition.py
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
* Quality.
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
* [WIP] Make chuking smartly (long files) work on asr ctc_with_lm.
* Slow test with functionality.
* Fixing regular test.
* fix for batch size 1
* Handling batch outside `rescale_Stride`.
- Renamed to `rescale_stride`.
* Disable equality in the test.
* Remove print.
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Pipeline ASR with LM.
* Revamped into `self.decoder`.
* Fixing.
* 2nd fix.
* Update src/transformers/pipelines/__init__.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Fixing.
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Hotfix `chunk_length_s` instead of `_ms`.
* Adding fix of `pad_token` which should be last/previous token for CTC
proper decoding
* Fixing ChunkPipeline unwrapping.
* Adding a PackIterator specific test.
* Pipeline chunks.
* Batching for Chunking pipelines ?
* Batching for `question-answering` and `zero-shot-cls`.
* Fixing for FNet.
* Making ASR a chunk pipeline.
* Chunking ASR API.
* doc style.
* Fixing ASR test.
* Fixing QA eror (p_mask, padding is 1, not 0).
* Enable both vad and simple chunking.
* Max length for vad.
* remove inference mode, crashing on s2t.
* Revert ChunkPipeline for ASRpipeline.
Too many knobs for simple integration within the pipeline, better stick
to external convenience functions instead, more control to be had,
simpler pipeline and also easier to replace with other things later.
* Drop necessity for PT for these.
* Enabling generators.
* Add mic + cleanup.
* Typo.
* Typo2.
* Remove ASR work, it does not belong in this PR anymore.
* Update src/transformers/pipelines/pt_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/pipelines/zero_shot_classification.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Adding many comments.
* Doc quality.
* `hidden_states` handling.
* Adding doc.
* Bad rebase.
* Autofixing docs.
* Fixing CRITICAL bug in the new Zerocls pipeline.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* add new wav2vec2 translation
* correct
* up
* add tests
* correct end copy
* correct more
* up
* correct unispeech sat
* finish
* finalize
* finish
* up
* fix_torch_device_generate_test
* remove @
* up
* correct some bugs
* correct model
* finish speech2text extension
* up
* up
* up
* up
* Update utils/custom_init_isort.py
* up
* up
* update with tokenizer
* correct old tok
* correct old tok
* fix bug
* up
* up
* add more tests
* up
* fix docs
* up
* fix some more tests
* add better config
* correct some more things
"
* fix tests
* improve docs
* Apply suggestions from code review
* Apply suggestions from code review
* final fixes
* finalize
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* apply suggestions Lysandre and Sylvain
* apply nicos suggestions
* upload everything
* finish
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
Co-authored-by: your_github_username <your_github_email>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Adding support for `pipeline("automatic-speech-recognition")`.
- Ugly `"config"` choice for AutoModel. It would be great to have the
possibility to have something like `AutoModelFor` that would implement
the same logic (Load the config, check Architectures and load the first
one)
* Remove `model_id` was not needed in the end.
* Rebased !
* Remove old code.
* Rename `nlp`.
* Adding `AutomaticSpeechRecognitionPipeline`.
- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.
Future work:
- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.
* Update tests/test_pipelines_automatic_speech_recognition.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Adding docs + main import + type checking + LICENSE.
* Doc style !.
* Fixing TYPE_HINT.
* Specifying waveform shape in the docs.
* Adding asserts + specify in the documentation the shape of the input
np.ndarray.
* Update src/transformers/pipelines/automatic_speech_recognition.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Adding require to tests + move the `feature_extractor` doc.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>