Commit Graph

17 Commits

Author SHA1 Message Date
Nima Yaqmuri
07ae53e6e7
Fix/speecht5 bug (#28481)
* Fix bug in SpeechT5 speech decoder prenet's forward method

- Removed redundant `repeat` operation on speaker_embeddings in the forward method. This line was erroneously duplicating the embeddings, leading to incorrect input size for concatenation and performance issues.
- Maintained original functionality of the method, ensuring the integrity of the speech decoder prenet's forward pass remains intact.
- This change resolves a critical bug affecting the model's performance in handling speaker embeddings.

* Refactor SpeechT5 text to speech integration tests

- Updated SpeechT5ForTextToSpeechIntegrationTests to accommodate the variability in sequence lengths due to dropout in the speech decoder pre-net. This change ensures that our tests are robust against random variations in generated speech, enhancing the reliability of our test suite.
- Removed hardcoded dimensions in test assertions. Replaced with dynamic checks based on model configuration and seed settings, ensuring tests remain valid across different runs and configurations.
- Added new test cases to thoroughly validate the shapes of generated spectrograms and waveforms. These tests leverage seed settings to ensure consistent and predictable behavior in testing, addressing potential issues in speech generation and vocoder processing.
- Fixed existing test cases where incorrect assumptions about output shapes led to potential errors.

* Fix bug in SpeechT5 speech decoder prenet's forward method

- Removed redundant `repeat` operation on speaker_embeddings in the forward method. This line was erroneously duplicating the embeddings, leading to incorrect input size for concatenation and performance issues.
- Maintained original functionality of the method, ensuring the integrity of the speech decoder prenet's forward pass remains intact.
- This change resolves a critical bug affecting the model's performance in handling speaker embeddings.

* Refactor SpeechT5 text to speech integration tests

- Updated SpeechT5ForTextToSpeechIntegrationTests to accommodate the variability in sequence lengths due to dropout in the speech decoder pre-net. This change ensures that our tests are robust against random variations in generated speech, enhancing the reliability of our test suite.
- Removed hardcoded dimensions in test assertions. Replaced with dynamic checks based on model configuration and seed settings, ensuring tests remain valid across different runs and configurations.
- Added new test cases to thoroughly validate the shapes of generated spectrograms and waveforms. These tests leverage seed settings to ensure consistent and predictable behavior in testing, addressing potential issues in speech generation and vocoder processing.
- Fixed existing test cases where incorrect assumptions about output shapes led to potential errors.

* Enhance handling of speaker embeddings in SpeechT5

- Refined the generate and generate_speech functions in the SpeechT5 class to robustly handle two scenarios for speaker embeddings: matching the batch size (one embedding per sample) and one-to-many (a single embedding for all samples in the batch).
- The update includes logic to repeat the speaker embedding when a single embedding is provided for multiple samples, and a ValueError is raised for any mismatched dimensions.
- Also added corresponding test cases to validate both scenarios, ensuring complete coverage and functionality for diverse speaker embedding situations.

* Improve Test Robustness with Randomized Speaker Embeddings
2024-01-16 14:14:28 +00:00
Sihan Chen
4309abedbc
Add speecht5 batch generation and fix wrong attention mask when padding (#25943)
* fix speecht5 wrong attention mask when padding

* enable batch generation and add parameter attention_mask

* fix doc

* fix format

* batch postnet inputs, return batched lengths, and consistent to old api

* fix format

* fix format

* fix the format

* fix doc-builder error

* add test, cross attention and docstring

* optimize code based on reviews

* docbuild

* refine

* not skip slow test

* add consistent dropout for batching

* loose atol

* add another test regarding to the consistency of vocoder

* fix format

* refactor

* add return_concrete_lengths as parameter for consistency w/wo batching

* fix review issues

* fix cross_attention issue
2023-11-14 09:54:09 +00:00
Younes Belkada
f7ea959b96
[core/ GC / tests] Stronger GC tests (#27124)
* stronger GC tests

* better tests and skip failing tests

* break down into 3 sub-tests

* break down into 3 sub-tests

* refactor a bit

* more refactor

* fix

* last nit

* credits contrib and suggestions

* credits contrib and suggestions

---------

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-10-30 19:53:46 +01:00
Yoach Lacombe
db611aabee
🚨 🚨 Raise error when no speaker embeddings in speecht5._generate_speech (#26418)
* add warning when no speaker embeddings in speecht5._generate_speech

* modify warning to error

* adapt generation test
2023-10-17 15:59:35 +02:00
Yih-Dar
54e17a15dc
Fix failing tests on main due to torch 2.1 (#26607)
* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-10-05 10:27:05 +02:00
Yoach Lacombe
6d3f9c1e2e
add generate method to SpeechT5ForTextToSpeech (#25233)
* add generate method to SpeechT5ForTextToSpeech

* update speecht5forTTS docstrings

* Remove defaults to None in generate docstrings

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

---------

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-08-03 14:12:07 +01:00
Yih-Dar
bd90cda9a6
CI with num_hidden_layers=2 🚀🚀🚀 (#25266)
* CI with layers=2

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-08-02 20:22:36 +02:00
Sylvain Gugger
f295fc8a16
Fix last models for common tests that are too big. (#25058)
* Fix last models for common tests that are too big.

* Remove print statement
2023-07-25 07:56:04 -04:00
Sylvain Gugger
1023705440
Check models used for common tests are small (#24824)
* First models

* Conditional DETR

* Treat DETR models, skip others

* Skip LayoutLMv2 as well

* Fix last tests
2023-07-14 14:43:19 -04:00
Xiaoli Wang
239ace152b
Fix TypeError: Object of type int64 is not JSON serializable (#24340)
* Fix TypeError: Object of type int64 is not JSON serializable

* Convert numpy.float64 and numpy.int64 to float and int for json serialization

* Black reformatted examples/pytorch/token-classification/run_ner_no_trainer.py

* * make style
2023-06-27 12:15:49 +01:00
Matthijs Hollemans
ac2bc50a10
TTS fine-tuning for SpeechT5 (#21824)
* wrong argument name

* append eos_token_id

* all tokenizers need mask and ctc_blank tokens

* remove reduction factor from feature extractor

* add proper TTS loss

* did shifting the wrong way around

* mask out padded portions

* remove logits again (don't really need it)

* fix unit tests

* fixup

* pad also returns the decoder attention mask, since that's useful to have

* clean up feature extractor logic

* pad can handle TTS task too

* remove stop_labels from loss calculation

* simplify logic

* fixup

* do -100 masking properly

* small STFT optimization (calculate mel filterbanks only once)

* replace torchaudio fbanks with audio_utils

* remove torchaudio dependency

* simplify & speed up the STFT

* don't serialize window and mel filters

* output cross attentions when generating speech

* add guided attention loss

* fix failing test

* Update src/transformers/models/speecht5/feature_extraction_speecht5.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* Update src/transformers/models/speecht5/modeling_speecht5.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* change type annotation of attention_mask to LongTensor

* extract loss into class

* remove unused frame_signal_scale argument

* use config object in loss class

* fix type annotations in doc comments

* change optional to just bool

* implement missing tokenizer method

* add deprecation warning

* Update src/transformers/models/speecht5/feature_extraction_speecht5.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/speecht5/feature_extraction_speecht5.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add deprecation warning for stop_labels

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-04-18 10:12:30 +01:00
Yih-Dar
b338414e61
Update tiny model creation script and some others files (#22006)
* Update 1

* Update 2

* Update 3

* Update 4

* Update 5

* Update 6

* Update 7

* Update 8

* Update 9

* Update 10

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-03-07 22:31:14 +01:00
Yih-Dar
871c31a6f1
🔥Rework pipeline testing by removing PipelineTestCaseMeta 🚀 (#21516)
* Add PipelineTesterMixin

* remove class PipelineTestCaseMeta

* move validate_test_components

* Add for ViT

* Add to SPECIAL_MODULE_TO_TEST_MAP

* style and quality

* Add feature-extraction

* update

* raise instead of skip

* add tiny_model_summary.json

* more explicit

* skip tasks not in mapping

* add availability check

* Add Copyright

* A way to diable irrelevant tests

* update with main

* remove disable_irrelevant_tests

* skip tests

* better skip message

* better skip message

* Add all pipeline task tests

* revert

* Import PipelineTesterMixin

* subclass test classes with PipelineTesterMixin

* Add pipieline_model_mapping

* Fix import after adding pipieline_model_mapping

* Fix style and quality after adding pipieline_model_mapping

* Fix one more import after adding pipieline_model_mapping

* Fix style and quality after adding pipieline_model_mapping

* Fix test issues

* Fix import requirements

* Fix mapping for MobileViTModelTest

* Update

* Better skip message

* pipieline_model_mapping could not be None

* Remove some PipelineTesterMixin

* Fix typo

* revert tests_fetcher.py

* update

* rename

* revert

* Remove PipelineTestCaseMeta from ZeroShotAudioClassificationPipelineTests

* style and quality

* test fetcher for all pipeline/model tests

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-02-28 19:40:57 +01:00
Sanchit Gandhi
3dae0d7b4f
[SpeechT5] Fix HiFiGAN tests (#21788) 2023-02-24 16:55:38 +01:00
Sanchit Gandhi
82e61f3445
[SpeechT5HifiGan] Handle batched inputs (#21702)
* [SpeechT5HifiGan] Handle batched inputs

* fix docstring

* rebase and new ruff style
2023-02-22 11:16:56 +01:00
Yih-Dar
0db5d911fc
Fix SpeechT5ForSpeechToSpeechIntegrationTests device issue (#21460)
* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-02-06 10:43:07 +01:00
Matthijs Hollemans
e4bacf6614
[WIP] add SpeechT5 model (#18922)
* make SpeechT5 model by copying Wav2Vec2

* add paper to docs

* whoops added docs in wrong file

* remove SpeechT5Tokenizer + put CTC back in the name

* remove deprecated class

* remove unused docstring

* delete SpeechT5FeatureExtractor, use Wav2Vec2FeatureExtractor instead

* remove classes we don't need right now

* initial stab at speech encoder prenet

* add more speech encoder prenet stuff

* improve SpeechEncoderPrenet

* add encoder (not finished yet)

* add relative position bias to self-attention

* add encoder CTC layers

* fix formatting

* add decoder from BART, doesn't work yet

* make it work with generate loop

* wrap the encoder into a speech encoder class

* wrap the decoder in a text decoder class

* changed my mind

* changed my mind again ;-)

* load decoder weights, make it work

* add weights for text decoder postnet

* add SpeechT5ForCTC model that uses only the encoder

* clean up EncoderLayer and DecoderLayer

* implement _init_weights in SpeechT5PreTrainedModel

* cleanup config + Encoder and Decoder

* add head + cross attention masks

* improve doc comments

* fixup

* more cleanup

* more fixup

* TextDecoderPrenet works now, thanks Kendall

* add CTC loss

* add placeholders for other pre/postnets

* add type annotation

* fix freeze_feature_encoder

* set padding tokens to 0 in decoder attention mask

* encoder attention mask downsampling

* remove features_pen calculation

* disable the padding tokens thing again

* fixup

* more fixup

* code review fixes

* rename encoder/decoder wrapper classes

* allow checkpoints to be loaded into SpeechT5Model

* put encoder into wrapper for CTC model

* clean up conversion script

* add encoder for TTS model

* add speech decoder prenet

* add speech decoder post-net

* attempt to reconstruct the generation loop

* add speech generation loop

* clean up generate_speech

* small tweaks

* fix forward pass

* enable always dropout on speech decoder prenet

* sort declaration

* rename models

* fixup

* fix copies

* more fixup

* make consistency checker happy

* add Seq2SeqSpectrogramOutput class

* doc comments

* quick note about loss and labels

* add HiFi-GAN implementation (from Speech2Speech PR)

* rename file

* add vocoder to TTS model

* improve vocoder

* working on tokenizer

* more better tokenizer

* add CTC tokenizer

* fix decode and batch_code in CTC tokenizer

* fix processor

* two processors and feature extractors

* use SpeechT5WaveformFeatureExtractor instead of Wav2Vec2

* cleanup

* more cleanup

* even more fixup

* notebooks

* fix log-mel spectrograms

* support reduction factor

* fixup

* shift spectrograms to right to create decoder inputs

* return correct labels

* add labels for stop token prediction

* fix doc comments

* fixup

* remove SpeechT5ForPreTraining

* more fixup

* update copyright headers

* add usage examples

* add SpeechT5ProcessorForCTC

* fixup

* push unofficial checkpoints to hub

* initial version of tokenizer unit tests

* add slow test

* fix failing tests

* tests for CTC tokenizer

* finish CTC tokenizer tests

* processor tests

* initial test for feature extractors

* tests for spectrogram feature extractor

* fixup

* more fixup

* add decorators

* require speech for tests

* modeling tests

* more tests for ASR model

* fix imports

* add fake tests for the other models

* fixup

* remove jupyter notebooks

* add missing SpeechT5Model tests

* add missing tests for SpeechT5ForCTC

* add missing tests for SpeechT5ForTextToSpeech

* sort tests by name

* fix Hi-Fi GAN tests

* fixup

* add speech-to-speech model

* refactor duplicate speech generation code

* add processor for SpeechToSpeech model

* add usage example

* add tests for speech-to-speech model

* fixup

* enable gradient checkpointing for SpeechT5FeatureEncoder

* code review

* push_to_hub now takes repo_id

* improve doc comments for HiFi-GAN config

* add missing test

* add integration tests

* make number of layers in speech decoder prenet configurable

* rename variable

* rename variables

* add auto classes for TTS and S2S

* REMOVE CTC!!!

* S2S processor does not support save/load_pretrained

* fixup

* these models are now in an auto mapping

* fix doc links

* rename HiFiGAN to HifiGan, remove separate config file

* REMOVE auto classes

* there can be only one

* fixup

* replace assert

* reformat

* feature extractor can process input and target at same time

* update checkpoint names

* fix commit hash
2023-02-03 12:43:46 -05:00