Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration (#24799)

* initial commit

* Add inital testing files and modify __init__ files to add UnivNet imports.

* Fix some bugs

* Add checkpoint conversion script and add references to transformers pre-trained model.

* Add UnivNet entries for auto.

* Add initial docs for UnivNet.

* Handle input and output shapes in UnivNetGan.forward and add initial docstrings.

* Write tests and make them pass.

* Write docs.

* Add UnivNet doc to _toctree.yml and improve docs.

* fix typo

* make fixup

* make fix-copies

* Add upsample_rates parameter to config and improve config documentation.

* make fixup

* make fix-copies

* Remove unused upsample_rates config parameter.

* apply suggestions from review

* make style

* Verify and add reason for skipped tests inherited from ModelTesterMixin.

* Add initial UnivNetGan integration tests

* make style

* Remove noise_length input to UnivNetGan and improve integration tests.

* Fix bug and make style

* Make UnivNet integration tests pass

* Add initial code for UnivNetFeatureExtractor.

* make style

* Add initial tests for UnivNetFeatureExtractor.

* make style

* Properly initialize weights for UnivNetGan

* Get feature extractor fast tests passing

* make style

* Get feature extractor integration tests passing

* Get UnivNet integration tests passing

* make style

* Add UnivNetGan usage example

* make style and use feature extractor from hub in integration tests

* Update tips in docs

* apply suggestions from review

* make style

* Calculate padding directly instead of using get_padding methods.

* Update UnivNetFeatureExtractor.to_dict to be UnivNet-specific.

* Update feature extractor to support using model(**inputs) and add the ability to generate noise and pad the end of the spectrogram in __call__.

* Perform padding before generating noise to ensure the shapes are correct.

* Rename UnivNetGan.forward's noise_waveform argument to noise_sequence.

* make style

* Add tests to test generating noise and padding the end for UnivNetFeatureExtractor.__call__.

* Add tests for checking batched vs unbatched inputs for UnivNet feature extractor and model.

* Add expected mean and stddev checks to the integration tests and make them pass.

* make style

* Make it possible to use model(**inputs), where inputs is the output of the feature extractor.

* fix typo in UnivNetGanConfig example

* Calculate spectrogram_zero from other config values.

* apply suggestions from review

* make style

* Refactor UnivNet conversion script to use load_state_dict (following persimmon).

* Rename UnivNetFeatureExtractor to UnivNetGanFeatureExtractor.

* make style

* Switch to using torch.tensor and torch.testing.assert_close for testing expected values/slices.

* make style

* Use config in UnivNetGan modeling blocks.

* make style

* Rename the spectrogram argument of UnivNetGan.forward to input_features, following Whisper.

* make style

* Improving padding documentation.

* Add UnivNet usage example to the docs.

* apply suggestions from review

* Move dynamic_range_compression computation into the mel_spectrogram method of the feature extractor.

* Improve UnivNetGan.forward return docstring.

* Update table in docs/source/en/index.md.

* make fix-copies

* Rename UnivNet components to have pattern UnivNet*.

* make style

* make fix-copies

* Update docs

* make style

* Increase tolerance on flaky unbatched integration test.

* Remove torch.no_grad decorators from UnivNet integration tests to try to avoid flax/Tensorflow test errors.

* Add padding_mask argument to UnivNetModel.forward and add batch_decode feature extractor method to remove padding.

* Update documentation and clean up padding code.

* make style

* make style

* Remove torch dependency from UnivNetFeatureExtractor.

* make style

* Fix UnivNetModel usage example

* Clean up feature extractor code/docstrings.

* apply suggestions from review

* make style

* Add comments for tests skipped via ModelTesterMixin flags.

* Add comment for model parallel tests skipped via the test_model_parallel ModelTesterMixin flag.

* Add # Copied from statements to copied UnivNetFeatureExtractionTest tests.

* Simplify UnivNetFeatureExtractorTest.test_batch_decode.

* Add support for unbatched padding_masks in UnivNetModel.forward.

* Refactor unbatched padding_mask support.

* make style
This commit is contained in:
dg845 2023-11-22 08:21:36 -08:00 committed by GitHub
parent 4151fbb49c
commit 7f6a804d30
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
24 changed files with 2299 additions and 0 deletions

View File

@ -494,6 +494,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.

View File

@ -469,6 +469,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.

View File

@ -443,6 +443,7 @@ conda install -c huggingface transformers
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research से) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. द्वाराअनुसंधान पत्र [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) के साथ जारी किया गया
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (माइक्रोसॉफ्ट रिसर्च से) साथ में दिया गया पेपर [UniSpeech: यूनिफाइड स्पीच रिप्रेजेंटेशन लर्निंग विद लेबलेड एंड अनलेबल्ड डेटा](https:/ /arxiv.org/abs/2101.07597) चेंगई वांग, यू वू, याओ कियान, केनिची कुमातानी, शुजी लियू, फुरु वेई, माइकल ज़ेंग, ज़ुएदोंग हुआंग द्वारा।
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [UNISPEECH-SAT: यूनिवर्सल स्पीच रिप्रेजेंटेशन लर्निंग विद स्पीकर अवेयर प्री-ट्रेनिंग ](https://arxiv.org/abs/2110.05752) सानयुआन चेन, यू वू, चेंग्यी वांग, झेंगयांग चेन, झूओ चेन, शुजी लियू, जियान वू, याओ कियान, फुरु वेई, जिन्यु ली, जियांगज़ान यू द्वारा पोस्ट किया गया।
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (सिंघुआ यूनिवर्सिटी और ननकाई यूनिवर्सिटी से) साथ में पेपर [विजुअल अटेंशन नेटवर्क](https://arxiv.org/ pdf/2202.09741.pdf) मेंग-हाओ गुओ, चेंग-ज़े लू, झेंग-निंग लियू, मिंग-मिंग चेंग, शि-मिन हू द्वारा।
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (मल्टीमीडिया कम्प्यूटिंग ग्रुप, नानजिंग यूनिवर्सिटी से) साथ में पेपर [वीडियोएमएई: मास्क्ड ऑटोएन्कोडर स्व-पर्यवेक्षित वीडियो प्री-ट्रेनिंग के लिए डेटा-कुशल सीखने वाले हैं] (https://arxiv.org/abs/2203.12602) ज़ान टोंग, यिबिंग सॉन्ग, जुए द्वारा वांग, लिमिन वांग द्वारा पोस्ट किया गया।

View File

@ -503,6 +503,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research から) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. から公開された研究論文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research から) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang から公開された研究論文: [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (Microsoft Research から) Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu から公開された研究論文: [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752)
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (Peking University から) Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. から公開された研究論文 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221)
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (Tsinghua University and Nankai University から) Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu から公開された研究論文: [Visual Attention Network](https://arxiv.org/abs/2202.09741)
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (Multimedia Computing Group, Nanjing University から) Zhan Tong, Yibing Song, Jue Wang, Limin Wang から公開された研究論文: [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602)

View File

@ -418,6 +418,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research 에서 제공)은 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.의 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)논문과 함께 발표했습니다.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research 에서) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 의 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 논문과 함께 발표했습니다.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (Microsoft Research 에서) Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 의 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 논문과 함께 발표했습니다.
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (Peking University 에서 제공)은 Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.의 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221)논문과 함께 발표했습니다.
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (Tsinghua University and Nankai University 에서) Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu 의 [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) 논문과 함께 발표했습니다.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (Multimedia Computing Group, Nanjing University 에서) Zhan Tong, Yibing Song, Jue Wang, Limin Wang 의 [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 논문과 함께 발표했습니다.

View File

@ -442,6 +442,7 @@ conda install -c huggingface transformers
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (来自 Google Research) 伴随论文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) 由 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant 发布。
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (来自 Microsoft Research) 伴随论文 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 由 Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 发布。
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (来自 Microsoft Research) 伴随论文 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 由 Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 发布。
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (来自 Peking University) 伴随论文 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) 由 Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun 发布。
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (来自 Tsinghua University and Nankai University) 伴随论文 [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) 由 Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu 发布。
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (来自 Multimedia Computing Group, Nanjing University) 伴随论文 [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 由 Zhan Tong, Yibing Song, Jue Wang, Limin Wang 发布。

View File

@ -454,6 +454,7 @@ conda install -c huggingface transformers
1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.

View File

@ -628,6 +628,8 @@
title: UniSpeech
- local: model_doc/unispeech-sat
title: UniSpeech-SAT
- local: model_doc/univnet
title: UnivNet
- local: model_doc/vits
title: VITS
- local: model_doc/wav2vec2

View File

@ -269,6 +269,7 @@ Flax), PyTorch, and/or TensorFlow.
| [UMT5](model_doc/umt5) | ✅ | ❌ | ❌ |
| [UniSpeech](model_doc/unispeech) | ✅ | ❌ | ❌ |
| [UniSpeechSat](model_doc/unispeech-sat) | ✅ | ❌ | ❌ |
| [UnivNet](model_doc/univnet) | ✅ | ❌ | ❌ |
| [UPerNet](model_doc/upernet) | ✅ | ❌ | ❌ |
| [VAN](model_doc/van) | ✅ | ❌ | ❌ |
| [VideoMAE](model_doc/videomae) | ✅ | ❌ | ❌ |

View File

@ -0,0 +1,80 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# UnivNet
## Overview
The UnivNet model was proposed in [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kin, and Juntae Kim.
The UnivNet model is a generative adversarial network (GAN) trained to synthesize high fidelity speech waveforms. The UnivNet model shared in `transformers` is the *generator*, which maps a conditioning log-mel spectrogram and optional noise sequence to a speech waveform (e.g. a vocoder). Only the generator is required for inference. The *discriminator* used to train the `generator` is not implemented.
The abstract from the paper is the following:
*Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.*
Tips:
- The `noise_sequence` argument for [`UnivNetModel.forward`] should be standard Gaussian noise (such as from `torch.randn`) of shape `([batch_size], noise_length, model.config.model_in_channels)`, where `noise_length` should match the length dimension (dimension 1) of the `input_features` argument. If not supplied, it will be randomly generated; a `torch.Generator` can be supplied to the `generator` argument so that the forward pass can be reproduced. (Note that [`UnivNetFeatureExtractor`] will return generated noise by default, so it shouldn't be necessary to generate `noise_sequence` manually.)
- Padding added by [`UnivNetFeatureExtractor`] can be removed from the [`UnivNetModel`] output through the [`UnivNetFeatureExtractor.batch_decode`] method, as shown in the usage example below.
- Padding the end of each waveform with silence can reduce artifacts at the end of the generated audio sample. This can be done by supplying `pad_end = True` to [`UnivNetFeatureExtractor.__call__`]. See [this issue](https://github.com/seungwonpark/melgan/issues/8) for more details.
Usage Example:
```python
import torch
from scipy.io.wavfile import write
from datasets import Audio, load_dataset
from transformers import UnivNetFeatureExtractor, UnivNetModel
model_id_or_path = "dg845/univnet-dev"
model = UnivNetModel.from_pretrained(model_id_or_path)
feature_extractor = UnivNetFeatureExtractor.from_pretrained(model_id_or_path)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# Resample the audio to the model and feature extractor's sampling rate.
ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
# Pad the end of the converted waveforms to reduce artifacts at the end of the output audio samples.
inputs = feature_extractor(
ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], pad_end=True, return_tensors="pt"
)
with torch.no_grad():
audio = model(**inputs)
# Remove the extra padding at the end of the output.
audio = feature_extractor.batch_decode(**audio)[0]
# Convert to wav file
write("sample_audio.wav", feature_extractor.sampling_rate, audio)
```
This model was contributed by [dg845](https://huggingface.co/dg845).
To the best of my knowledge, there is no official code release, but an unofficial implementation can be found at [maum-ai/univnet](https://github.com/maum-ai/univnet) with pretrained checkpoints [here](https://github.com/maum-ai/univnet#pre-trained-model).
## UnivNetConfig
[[autodoc]] UnivNetConfig
## UnivNetFeatureExtractor
[[autodoc]] UnivNetFeatureExtractor
- __call__
## UnivNetModel
[[autodoc]] UnivNetModel
- forward

View File

@ -611,6 +611,11 @@ _import_structure = {
"UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"UniSpeechSatConfig",
],
"models.univnet": [
"UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP",
"UnivNetConfig",
"UnivNetFeatureExtractor",
],
"models.upernet": ["UperNetConfig"],
"models.videomae": ["VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "VideoMAEConfig"],
"models.vilt": [
@ -2977,6 +2982,12 @@ else:
"UniSpeechSatPreTrainedModel",
]
)
_import_structure["models.univnet"].extend(
[
"UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST",
"UnivNetModel",
]
)
_import_structure["models.upernet"].extend(
[
"UperNetForSemanticSegmentation",
@ -4817,6 +4828,11 @@ if TYPE_CHECKING:
from .models.umt5 import UMT5Config
from .models.unispeech import UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechConfig
from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig
from .models.univnet import (
UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
UnivNetConfig,
UnivNetFeatureExtractor,
)
from .models.upernet import UperNetConfig
from .models.videomae import VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP, VideoMAEConfig
from .models.vilt import (
@ -6807,6 +6823,7 @@ if TYPE_CHECKING:
UniSpeechSatModel,
UniSpeechSatPreTrainedModel,
)
from .models.univnet import UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST, UnivNetModel
from .models.upernet import UperNetForSemanticSegmentation, UperNetPreTrainedModel
from .models.videomae import (
VIDEOMAE_PRETRAINED_MODEL_ARCHIVE_LIST,

View File

@ -211,6 +211,7 @@ from . import (
umt5,
unispeech,
unispeech_sat,
univnet,
upernet,
videomae,
vilt,

View File

@ -218,6 +218,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("umt5", "UMT5Config"),
("unispeech", "UniSpeechConfig"),
("unispeech-sat", "UniSpeechSatConfig"),
("univnet", "UnivNetConfig"),
("upernet", "UperNetConfig"),
("van", "VanConfig"),
("videomae", "VideoMAEConfig"),
@ -424,6 +425,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("tvp", "TVP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("unispeech", "UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("unispeech-sat", "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("univnet", "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("van", "VAN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("videomae", "VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vilt", "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@ -667,6 +669,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("umt5", "UMT5"),
("unispeech", "UniSpeech"),
("unispeech-sat", "UniSpeechSat"),
("univnet", "UnivNet"),
("upernet", "UPerNet"),
("van", "VAN"),
("videomae", "VideoMAE"),

View File

@ -91,6 +91,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("tvlt", "TvltFeatureExtractor"),
("unispeech", "Wav2Vec2FeatureExtractor"),
("unispeech-sat", "Wav2Vec2FeatureExtractor"),
("univnet", "UnivNetFeatureExtractor"),
("van", "ConvNextFeatureExtractor"),
("videomae", "VideoMAEFeatureExtractor"),
("vilt", "ViltFeatureExtractor"),

View File

@ -204,6 +204,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("umt5", "UMT5Model"),
("unispeech", "UniSpeechModel"),
("unispeech-sat", "UniSpeechSatModel"),
("univnet", "UnivNetModel"),
("van", "VanModel"),
("videomae", "VideoMAEModel"),
("vilt", "ViltModel"),

View File

@ -0,0 +1,65 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_torch_available,
)
_import_structure = {
"configuration_univnet": [
"UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP",
"UnivNetConfig",
],
"feature_extraction_univnet": ["UnivNetFeatureExtractor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_univnet"] = [
"UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST",
"UnivNetModel",
]
if TYPE_CHECKING:
from .configuration_univnet import (
UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
UnivNetConfig,
)
from .feature_extraction_univnet import UnivNetFeatureExtractor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_univnet import (
UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST,
UnivNetModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

View File

@ -0,0 +1,127 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" UnivNetModel model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"dg845/univnet-dev": "https://huggingface.co/dg845/univnet-dev/resolve/main/config.json",
}
class UnivNetConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`UnivNetModel`]. It is used to instantiate a
UnivNet vocoder model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the UnivNet
[dg845/univnet-dev](https://huggingface.co/dg845/univnet-dev) architecture, which corresponds to the 'c32'
architecture in [maum-ai/univnet](https://github.com/maum-ai/univnet/blob/master/config/default_c32.yaml).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
model_in_channels (`int`, *optional*, defaults to 64):
The number of input channels for the UnivNet residual network. This should correspond to
`noise_sequence.shape[1]` and the value used in the [`UnivNetFeatureExtractor`] class.
model_hidden_channels (`int`, *optional*, defaults to 32):
The number of hidden channels of each residual block in the UnivNet residual network.
num_mel_bins (`int`, *optional*, defaults to 100):
The number of frequency bins in the conditioning log-mel spectrogram. This should correspond to the value
used in the [`UnivNetFeatureExtractor`] class.
resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 3, 3]`):
A tuple of integers defining the kernel sizes of the 1D convolutional layers in the UnivNet residual
network. The length of `resblock_kernel_sizes` defines the number of resnet blocks and should match that of
`resblock_stride_sizes` and `resblock_dilation_sizes`.
resblock_stride_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[8, 8, 4]`):
A tuple of integers defining the stride sizes of the 1D convolutional layers in the UnivNet residual
network. The length of `resblock_stride_sizes` should match that of `resblock_kernel_sizes` and
`resblock_dilation_sizes`.
resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 9, 27], [1, 3, 9, 27], [1, 3, 9, 27]]`):
A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the
UnivNet residual network. The length of `resblock_dilation_sizes` should match that of
`resblock_kernel_sizes` and `resblock_stride_sizes`. The length of each nested list in
`resblock_dilation_sizes` defines the number of convolutional layers per resnet block.
kernel_predictor_num_blocks (`int`, *optional*, defaults to 3):
The number of residual blocks in the kernel predictor network, which calculates the kernel and bias for
each location variable convolution layer in the UnivNet residual network.
kernel_predictor_hidden_channels (`int`, *optional*, defaults to 64):
The number of hidden channels for each residual block in the kernel predictor network.
kernel_predictor_conv_size (`int`, *optional*, defaults to 3):
The kernel size of each 1D convolutional layer in the kernel predictor network.
kernel_predictor_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for each residual block in the kernel predictor network.
initializer_range (`float`, *optional*, defaults to 0.01):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
leaky_relu_slope (`float`, *optional*, defaults to 0.2):
The angle of the negative slope used by the leaky ReLU activation.
Example:
```python
>>> from transformers import UnivNetModel, UnivNetConfig
>>> # Initializing a Tortoise TTS style configuration
>>> configuration = UnivNetConfig()
>>> # Initializing a model (with random weights) from the Tortoise TTS style configuration
>>> model = UnivNetModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "univnet"
def __init__(
self,
model_in_channels=64,
model_hidden_channels=32,
num_mel_bins=100,
resblock_kernel_sizes=[3, 3, 3],
resblock_stride_sizes=[8, 8, 4],
resblock_dilation_sizes=[[1, 3, 9, 27], [1, 3, 9, 27], [1, 3, 9, 27]],
kernel_predictor_num_blocks=3,
kernel_predictor_hidden_channels=64,
kernel_predictor_conv_size=3,
kernel_predictor_dropout=0.0,
initializer_range=0.01,
leaky_relu_slope=0.2,
**kwargs,
):
if not (len(resblock_kernel_sizes) == len(resblock_stride_sizes) == len(resblock_dilation_sizes)):
raise ValueError(
"`resblock_kernel_sizes`, `resblock_stride_sizes`, and `resblock_dilation_sizes` must all have the"
" same length (which will be the number of resnet blocks in the model)."
)
self.model_in_channels = model_in_channels
self.model_hidden_channels = model_hidden_channels
self.num_mel_bins = num_mel_bins
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_stride_sizes = resblock_stride_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.kernel_predictor_num_blocks = kernel_predictor_num_blocks
self.kernel_predictor_hidden_channels = kernel_predictor_hidden_channels
self.kernel_predictor_conv_size = kernel_predictor_conv_size
self.kernel_predictor_dropout = kernel_predictor_dropout
self.initializer_range = initializer_range
self.leaky_relu_slope = leaky_relu_slope
super().__init__(**kwargs)

View File

@ -0,0 +1,162 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import torch
from transformers import UnivNetConfig, UnivNetModel, logging
logging.set_verbosity_info()
logger = logging.get_logger("transformers.models.univnet")
def get_kernel_predictor_key_mapping(config: UnivNetConfig, old_prefix: str = "", new_prefix: str = ""):
mapping = {}
# Initial conv layer
mapping[f"{old_prefix}.input_conv.0.weight_g"] = f"{new_prefix}.input_conv.weight_g"
mapping[f"{old_prefix}.input_conv.0.weight_v"] = f"{new_prefix}.input_conv.weight_v"
mapping[f"{old_prefix}.input_conv.0.bias"] = f"{new_prefix}.input_conv.bias"
# Kernel predictor resnet blocks
for i in range(config.kernel_predictor_num_blocks):
mapping[f"{old_prefix}.residual_convs.{i}.1.weight_g"] = f"{new_prefix}.resblocks.{i}.conv1.weight_g"
mapping[f"{old_prefix}.residual_convs.{i}.1.weight_v"] = f"{new_prefix}.resblocks.{i}.conv1.weight_v"
mapping[f"{old_prefix}.residual_convs.{i}.1.bias"] = f"{new_prefix}.resblocks.{i}.conv1.bias"
mapping[f"{old_prefix}.residual_convs.{i}.3.weight_g"] = f"{new_prefix}.resblocks.{i}.conv2.weight_g"
mapping[f"{old_prefix}.residual_convs.{i}.3.weight_v"] = f"{new_prefix}.resblocks.{i}.conv2.weight_v"
mapping[f"{old_prefix}.residual_convs.{i}.3.bias"] = f"{new_prefix}.resblocks.{i}.conv2.bias"
# Kernel output conv
mapping[f"{old_prefix}.kernel_conv.weight_g"] = f"{new_prefix}.kernel_conv.weight_g"
mapping[f"{old_prefix}.kernel_conv.weight_v"] = f"{new_prefix}.kernel_conv.weight_v"
mapping[f"{old_prefix}.kernel_conv.bias"] = f"{new_prefix}.kernel_conv.bias"
# Bias output conv
mapping[f"{old_prefix}.bias_conv.weight_g"] = f"{new_prefix}.bias_conv.weight_g"
mapping[f"{old_prefix}.bias_conv.weight_v"] = f"{new_prefix}.bias_conv.weight_v"
mapping[f"{old_prefix}.bias_conv.bias"] = f"{new_prefix}.bias_conv.bias"
return mapping
def get_key_mapping(config: UnivNetConfig):
mapping = {}
# NOTE: inital conv layer keys are the same
# LVC Residual blocks
for i in range(len(config.resblock_stride_sizes)):
# LVCBlock initial convt layer
mapping[f"res_stack.{i}.convt_pre.1.weight_g"] = f"resblocks.{i}.convt_pre.weight_g"
mapping[f"res_stack.{i}.convt_pre.1.weight_v"] = f"resblocks.{i}.convt_pre.weight_v"
mapping[f"res_stack.{i}.convt_pre.1.bias"] = f"resblocks.{i}.convt_pre.bias"
# Kernel predictor
kernel_predictor_mapping = get_kernel_predictor_key_mapping(
config, old_prefix=f"res_stack.{i}.kernel_predictor", new_prefix=f"resblocks.{i}.kernel_predictor"
)
mapping.update(kernel_predictor_mapping)
# LVC Residual blocks
for j in range(len(config.resblock_dilation_sizes[i])):
mapping[f"res_stack.{i}.conv_blocks.{j}.1.weight_g"] = f"resblocks.{i}.resblocks.{j}.conv.weight_g"
mapping[f"res_stack.{i}.conv_blocks.{j}.1.weight_v"] = f"resblocks.{i}.resblocks.{j}.conv.weight_v"
mapping[f"res_stack.{i}.conv_blocks.{j}.1.bias"] = f"resblocks.{i}.resblocks.{j}.conv.bias"
# Output conv layer
mapping["conv_post.1.weight_g"] = "conv_post.weight_g"
mapping["conv_post.1.weight_v"] = "conv_post.weight_v"
mapping["conv_post.1.bias"] = "conv_post.bias"
return mapping
def rename_state_dict(state_dict, keys_to_modify, keys_to_remove):
model_state_dict = {}
for key, value in state_dict.items():
if key in keys_to_remove:
continue
if key in keys_to_modify:
new_key = keys_to_modify[key]
model_state_dict[new_key] = value
else:
model_state_dict[key] = value
return model_state_dict
def convert_univnet_checkpoint(
checkpoint_path,
pytorch_dump_folder_path,
config_path=None,
repo_id=None,
safe_serialization=False,
):
model_state_dict_base = torch.load(checkpoint_path, map_location="cpu")
# Get the generator's state dict
state_dict = model_state_dict_base["model_g"]
if config_path is not None:
config = UnivNetConfig.from_pretrained(config_path)
else:
config = UnivNetConfig()
keys_to_modify = get_key_mapping(config)
keys_to_remove = set()
hf_state_dict = rename_state_dict(state_dict, keys_to_modify, keys_to_remove)
model = UnivNetModel(config)
# Apply weight norm since the original checkpoint has weight norm applied
model.apply_weight_norm()
model.load_state_dict(hf_state_dict)
# Remove weight norm in preparation for inference
model.remove_weight_norm()
model.save_pretrained(pytorch_dump_folder_path, safe_serialization=safe_serialization)
if repo_id:
print("Pushing to the hub...")
model.push_to_hub(repo_id)
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
parser.add_argument(
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
)
parser.add_argument(
"--safe_serialization", action="store_true", help="Whether to save the model using `safetensors`."
)
args = parser.parse_args()
convert_univnet_checkpoint(
args.checkpoint_path,
args.pytorch_dump_folder_path,
args.config_path,
args.push_to_hub,
args.safe_serialization,
)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,456 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for UnivNetModel."""
from typing import Any, Dict, List, Optional, Union
import numpy as np
from ...audio_utils import mel_filter_bank, optimal_fft_length, spectrogram, window_function
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import PaddingStrategy, TensorType, logging
logger = logging.get_logger(__name__)
class UnivNetFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs a UnivNet feature extractor.
This class extracts log-mel-filter bank features from raw speech using the short time Fourier Transform (STFT). The
STFT implementation follows that of TacoTron 2 and Hifi-GAN.
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.
Args:
feature_size (`int`, *optional*, defaults to 1):
The feature dimension of the extracted features.
sampling_rate (`int`, *optional*, defaults to 24000):
The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
padding_value (`float`, *optional*, defaults to 0.0):
The value to pad with when applying the padding strategy defined by the `padding` argument to
[`UnivNetFeatureExtractor.__call__`]. Should correspond to audio silence. The `pad_end` argument to
`__call__` will also use this padding value.
do_normalize (`bool`, *optional*, defaults to `False`):
Whether to perform Tacotron 2 normalization on the input. Normalizing can help to significantly improve the
performance for some models.
num_mel_bins (`int`, *optional*, defaults to 100):
The number of mel-frequency bins in the extracted spectrogram features. This should match
`UnivNetModel.config.num_mel_bins`.
hop_length (`int`, *optional*, defaults to 256):
The direct number of samples between sliding windows. Otherwise referred to as "shift" in many papers. Note
that this is different from other audio feature extractors such as [`SpeechT5FeatureExtractor`] which take
the `hop_length` in ms.
win_length (`int`, *optional*, defaults to 1024):
The direct number of samples for each sliding window. Note that this is different from other audio feature
extractors such as [`SpeechT5FeatureExtractor`] which take the `win_length` in ms.
win_function (`str`, *optional*, defaults to `"hann_window"`):
Name for the window function used for windowing, must be accessible via `torch.{win_function}`
filter_length (`int`, *optional*, defaults to 1024):
The number of FFT components to use. If `None`, this is determined using
`transformers.audio_utils.optimal_fft_length`.
max_length_s (`int`, *optional*, defaults to 10):
The maximum input lenght of the model in seconds. This is used to pad the audio.
fmin (`float`, *optional*, defaults to 0.0):
Minimum mel frequency in Hz.
fmax (`float`, *optional*):
Maximum mel frequency in Hz. If not set, defaults to `sampling_rate / 2`.
mel_floor (`float`, *optional*, defaults to 1e-09):
Minimum value of mel frequency banks. Note that the way [`UnivNetFeatureExtractor`] uses `mel_floor` is
different than in [`transformers.audio_utils.spectrogram`].
center (`bool`, *optional*, defaults to `False`):
Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`, frame
`t` will start at time `t * hop_length`.
compression_factor (`float`, *optional*, defaults to 1.0):
The multiplicative compression factor for dynamic range compression during spectral normalization.
compression_clip_val (`float`, *optional*, defaults to 1e-05):
The clip value applied to the waveform before applying dynamic range compression during spectral
normalization.
normalize_min (`float`, *optional*, defaults to -11.512925148010254):
The min value used for Tacotron 2-style linear normalization. The default is the original value from the
Tacotron 2 implementation.
normalize_max (`float`, *optional*, defaults to 2.3143386840820312):
The max value used for Tacotron 2-style linear normalization. The default is the original value from the
Tacotron 2 implementation.
model_in_channels (`int`, *optional*, defaults to 64):
The number of input channels to the [`UnivNetModel`] model. This should match
`UnivNetModel.config.model_in_channels`.
pad_end_length (`int`, *optional*, defaults to 10):
If padding the end of each waveform, the number of spectrogram frames worth of samples to append. The
number of appended samples will be `pad_end_length * hop_length`.
return_attention_mask (`bool`, *optional*, defaults to `True`):
Whether or not [`~UnivNetFeatureExtractor.__call__`] should return `attention_mask`.
"""
model_input_names = ["input_features", "noise_sequence", "padding_mask"]
def __init__(
self,
feature_size: int = 1,
sampling_rate: int = 24000,
padding_value: float = 0.0,
do_normalize: bool = False,
num_mel_bins: int = 100,
hop_length: int = 256,
win_length: int = 1024,
win_function: str = "hann_window",
filter_length: Optional[int] = 1024,
max_length_s: int = 10,
fmin: float = 0.0,
fmax: Optional[float] = None,
mel_floor: float = 1e-9,
center: bool = False,
compression_factor: float = 1.0,
compression_clip_val: float = 1e-5,
normalize_min: float = -11.512925148010254,
normalize_max: float = 2.3143386840820312,
model_in_channels: int = 64,
pad_end_length: int = 10,
return_attention_mask=True,
**kwargs,
):
super().__init__(
feature_size=feature_size,
sampling_rate=sampling_rate,
padding_value=padding_value,
return_attention_mask=return_attention_mask,
**kwargs,
)
self.do_normalize = do_normalize
self.num_mel_bins = num_mel_bins
self.hop_length = hop_length
self.win_length = win_length
self.win_function = win_function
self.filter_length = filter_length
self.fmin = fmin
if fmax is None:
# Follows the librosa.filters.mel implementation
fmax = float(sampling_rate) / 2
self.fmax = fmax
self.mel_floor = mel_floor
self.max_length_s = max_length_s
self.num_max_samples = max_length_s * sampling_rate
if self.filter_length is None:
self.n_fft = optimal_fft_length(self.win_length)
else:
self.n_fft = self.filter_length
self.n_freqs = (self.n_fft // 2) + 1
self.window = window_function(window_length=self.win_length, name=self.win_function, periodic=True)
self.mel_filters = mel_filter_bank(
num_frequency_bins=self.n_freqs,
num_mel_filters=self.num_mel_bins,
min_frequency=self.fmin,
max_frequency=self.fmax,
sampling_rate=self.sampling_rate,
norm="slaney",
mel_scale="slaney",
)
self.center = center
self.compression_factor = compression_factor
self.compression_clip_val = compression_clip_val
self.normalize_min = normalize_min
self.normalize_max = normalize_max
self.model_in_channels = model_in_channels
self.pad_end_length = pad_end_length
def normalize(self, spectrogram):
return 2 * ((spectrogram - self.normalize_min) / (self.normalize_max - self.normalize_min)) - 1
def denormalize(self, spectrogram):
return self.normalize_min + (self.normalize_max - self.normalize_min) * ((spectrogram + 1) / 2)
def mel_spectrogram(self, waveform: np.ndarray) -> np.ndarray:
"""
Calculates log MEL spectrograms from a batch of waveforms. Note that the input waveform(s) will be padded by
`int(self.n_fft - self.hop_length) / 2` on both sides using the `reflect` padding mode.
Args:
waveform (`np.ndarray` of shape `(length,)`):
The input waveform. This must be a single real-valued, mono waveform.
Returns:
`numpy.ndarray`: Array containing a log-mel spectrogram of shape `(num_frames, num_mel_bins)`.
"""
# Do custom padding based on the official MelGAN and Hifi-GAN implementations
# See https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/utils/stft.py#L84-L86
waveform = np.pad(
waveform,
(int((self.n_fft - self.hop_length) / 2), int((self.n_fft - self.hop_length) / 2)),
mode="reflect",
)
# Get the complex spectrogram.
# Note: waveform must be unbatched currently due to the implementation of spectrogram(...).
complex_spectrogram = spectrogram(
waveform,
window=self.window,
frame_length=self.n_fft,
hop_length=self.hop_length,
fft_length=self.n_fft,
power=None,
center=self.center,
mel_filters=None,
mel_floor=None,
)
# Apply the MEL filter bank and MEL floor manually since UnivNet uses a slightly different implementation
amplitude_spectrogram = np.sqrt(
np.real(complex_spectrogram) ** 2 + np.imag(complex_spectrogram) ** 2 + self.mel_floor
)
mel_spectrogram = np.matmul(self.mel_filters.T, amplitude_spectrogram)
# Perform spectral normalization to get the log mel spectrogram.
log_mel_spectrogram = np.log(
np.clip(mel_spectrogram, a_min=self.compression_clip_val, a_max=None) * self.compression_factor
)
# Return spectrogram with num_mel_bins last
return log_mel_spectrogram.T
def generate_noise(
self,
noise_length: int,
generator: Optional[np.random.Generator] = None,
) -> np.ndarray:
"""
Generates a random noise sequence of standard Gaussian noise for use in the `noise_sequence` argument of
[`UnivNetModel.forward`].
Args:
spectrogram_length (`int`):
The length (dim 0) of the generated noise.
model_in_channels (`int`, *optional*, defaults to `None`):
The number of features (dim 1) of the generated noise. This should correspond to the
`model_in_channels` of the [`UnivNetGan`] model. If not set, this will default to
`self.config.model_in_channels`.
generator (`numpy.random.Generator`, *optional*, defaults to `None`)
An optional `numpy.random.Generator` random number generator to control noise generation. If not set, a
new generator with fresh entropy will be created.
Returns:
`numpy.ndarray`: Array containing random standard Gaussian noise of shape `(noise_length,
model_in_channels)`.
"""
if generator is None:
generator = np.random.default_rng()
noise_shape = (noise_length, self.model_in_channels)
noise = generator.standard_normal(noise_shape, dtype=np.float32)
return noise
def batch_decode(self, waveforms, waveform_lengths=None) -> List[np.ndarray]:
r"""
Removes padding from generated audio after running [`UnivNetModel.forward`]. This returns a ragged list of 1D
audio waveform arrays and not a single tensor/array because in general the waveforms will have different
lengths after removing padding.
Args:
waveforms (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
The batched output waveforms from the [`UnivNetModel`].
waveform_lengths (`torch.FloatTensor` of shape `(batch_size,)`, *optional*):
The batched lengths of each waveform before padding.
Returns:
`List[np.ndarray]`: A ragged list of 1D waveform arrays with padding removed.
"""
# Collapse the batched waveform tensor to a list of 1D audio waveforms
waveforms = [waveform.detach().clone().cpu().numpy() for waveform in waveforms]
if waveform_lengths is not None:
waveforms = [waveform[: waveform_lengths[i]] for i, waveform in enumerate(waveforms)]
return waveforms
def __call__(
self,
raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
sampling_rate: Optional[int] = None,
padding: Union[bool, str, PaddingStrategy] = True,
max_length: Optional[int] = None,
truncation: bool = True,
pad_to_multiple_of: Optional[int] = None,
return_noise: bool = True,
generator: Optional[np.random.Generator] = None,
pad_end: bool = False,
pad_length: Optional[int] = None,
do_normalize: Optional[str] = None,
return_attention_mask: Optional[bool] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model one or several sequence(s).
Args:
raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
stereo, i.e. single float per timestep.
sampling_rate (`int`, *optional*):
The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
pipeline.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the input `raw_speech` waveforms (according to the model's padding side and
padding index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
If `pad_end = True`, that padding will occur before the `padding` strategy is applied.
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`, *optional*, defaults to `True`):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
pad_to_multiple_of (`int`, *optional*):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
return_noise (`bool`, *optional*, defaults to `True`):
Whether to generate and return a noise waveform for use in [`UnivNetModel.forward`].
generator (`numpy.random.Generator`, *optional*, defaults to `None`):
An optional `numpy.random.Generator` random number generator to use when generating noise.
pad_end (`bool`, *optional*, defaults to `False`):
Whether to pad the end of each waveform with silence. This can help reduce artifacts at the end of the
generated audio sample; see https://github.com/seungwonpark/melgan/issues/8 for more details. This
padding will be done before the padding strategy specified in `padding` is performed.
pad_length (`int`, *optional*, defaults to `None`):
If padding the end of each waveform, the length of the padding in spectrogram frames. If not set, this
will default to `self.config.pad_end_length`.
do_normalize (`bool`, *optional*):
Whether to perform Tacotron 2 normalization on the input. Normalizing can help to significantly improve
the performance for some models. If not set, this will default to `self.config.do_normalize`.
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific feature_extractor's default.
[What are attention masks?](../glossary#attention-mask)
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.np.array` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
"""
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
if sampling_rate is not None:
if sampling_rate != self.sampling_rate:
raise ValueError(
f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
f" was sampled with {self.sampling_rate} and not {sampling_rate}."
)
else:
logger.warning(
"It is strongly recommended to pass the `sampling_rate` argument to this function. "
"Failing to do so can result in silent errors that might be hard to debug."
)
is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
if is_batched_numpy and len(raw_speech.shape) > 2:
raise ValueError(f"Only mono-channel audio is supported for input to {self}")
is_batched = is_batched_numpy or (
isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
)
if is_batched:
raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
elif not is_batched and not isinstance(raw_speech, np.ndarray):
raw_speech = np.asarray(raw_speech, dtype=np.float32)
elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
raw_speech = raw_speech.astype(np.float32)
# always return batch
if not is_batched:
raw_speech = [np.asarray(raw_speech, dtype=np.float32)]
# Pad end to reduce artifacts
if pad_end:
pad_length = pad_length if pad_length is not None else self.pad_end_length
raw_speech = [
np.pad(waveform, (0, pad_length * self.hop_length), constant_values=self.padding_value)
for waveform in raw_speech
]
batched_speech = BatchFeature({"input_features": raw_speech})
padded_inputs = self.pad(
batched_speech,
padding=padding,
max_length=max_length if max_length is not None else self.num_max_samples,
truncation=truncation,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
)
# make sure list is in array format
# input_features = padded_inputs.get("input_features").transpose(2, 0, 1)
input_features = padded_inputs.get("input_features")
mel_spectrograms = [self.mel_spectrogram(waveform) for waveform in input_features]
if isinstance(input_features[0], List):
batched_speech["input_features"] = [np.asarray(mel, dtype=np.float32) for mel in mel_spectrograms]
else:
batched_speech["input_features"] = [mel.astype(np.float32) for mel in mel_spectrograms]
# convert attention_mask to correct format
attention_mask = padded_inputs.get("attention_mask")
if attention_mask is not None:
batched_speech["padding_mask"] = [np.asarray(array, dtype=np.int32) for array in attention_mask]
if return_noise:
noise = [
self.generate_noise(spectrogram.shape[0], generator)
for spectrogram in batched_speech["input_features"]
]
batched_speech["noise_sequence"] = noise
if do_normalize:
batched_speech["input_features"] = [
self.normalize(spectrogram) for spectrogram in batched_speech["input_features"]
]
if return_tensors is not None:
batched_speech = batched_speech.convert_to_tensors(return_tensors)
return batched_speech
def to_dict(self) -> Dict[str, Any]:
output = super().to_dict()
# Don't serialize these as they are derived from the other properties.
names = ["window", "mel_filters", "n_fft", "n_freqs", "num_max_samples"]
for name in names:
if name in output:
del output[name]
return output

View File

@ -0,0 +1,636 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch UnivNetModel model."""
from dataclasses import dataclass
from typing import Optional, Tuple, Union
import torch
import torch.utils.checkpoint
from torch import nn
from ...modeling_utils import ModelOutput, PreTrainedModel
from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
from .configuration_univnet import UnivNetConfig
logger = logging.get_logger(__name__)
# General docstring
_CONFIG_FOR_DOC = "UnivNetConfig"
_CHECKPOINT_FOR_DOC = "dg845/univnet-dev"
UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST = [
"dg845/univnet-dev",
# See all UnivNet models at https://huggingface.co/models?filter=univnet
]
@dataclass
class UnivNetModelOutput(ModelOutput):
"""
Output class for the [`UnivNetModel`], which includes the generated audio waveforms and the original unpadded
lengths of those waveforms (so that the padding can be removed by [`UnivNetModel.batch_decode`]).
Args:
waveforms (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
Batched 1D (mono-channel) output audio waveforms.
waveform_lengths (`torch.FloatTensor` of shape `(batch_size,)`):
The batched length in samples of each unpadded waveform in `waveforms`.
"""
waveforms: torch.FloatTensor = None
waveform_lengths: torch.FloatTensor = None
class UnivNetKernelPredictorResidualBlock(nn.Module):
"""
Implementation of the residual block for the kernel predictor network inside each location variable convolution
block (LVCBlock).
Parameters:
config: (`UnivNetConfig`):
Config for the `UnivNetModel` model.
"""
def __init__(
self,
config: UnivNetConfig,
):
super().__init__()
self.channels = config.model_in_channels
self.kernel_size = config.kernel_predictor_conv_size
self.dropout_prob = config.kernel_predictor_dropout
self.leaky_relu_slope = config.leaky_relu_slope
padding = (self.kernel_size - 1) // 2
self.dropout = nn.Dropout(self.dropout_prob)
self.conv1 = nn.Conv1d(self.channels, self.channels, self.kernel_size, padding=padding, bias=True)
self.conv2 = nn.Conv1d(self.channels, self.channels, self.kernel_size, padding=padding, bias=True)
def forward(self, hidden_states: torch.FloatTensor):
# hidden_states should have shape (batch_size, channels, seq_length)
residual = hidden_states
hidden_states = self.dropout(hidden_states)
hidden_states = self.conv1(hidden_states)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = self.conv2(hidden_states)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
return hidden_states + residual
def apply_weight_norm(self):
nn.utils.weight_norm(self.conv1)
nn.utils.weight_norm(self.conv2)
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.conv1)
nn.utils.remove_weight_norm(self.conv2)
class UnivNetKernelPredictor(nn.Module):
"""
Implementation of the kernel predictor network which supplies the kernel and bias for the location variable
convolutional layers (LVCs) in each UnivNet LVCBlock.
Based on the KernelPredictor implementation in
[maum-ai/univnet](https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/model/lvcnet.py#L7).
Parameters:
config: (`UnivNetConfig`):
Config for the `UnivNetModel` model.
conv_kernel_size (`int`, *optional*, defaults to 3):
The kernel size for the location variable convolutional layer kernels (convolutional weight tensor).
conv_layers (`int`, *optional*, defaults to 4):
The number of location variable convolutional layers to output kernels and biases for.
"""
def __init__(
self,
config: UnivNetConfig,
conv_kernel_size: int = 3,
conv_layers: int = 4,
):
super().__init__()
self.conv_in_channels = config.model_hidden_channels
self.conv_out_channels = 2 * config.model_hidden_channels
self.conv_kernel_size = conv_kernel_size
self.conv_layers = conv_layers
self.kernel_channels = (
self.conv_in_channels * self.conv_out_channels * self.conv_kernel_size * self.conv_layers
)
self.bias_channels = self.conv_out_channels * self.conv_layers
self.resnet_in_channels = config.num_mel_bins
self.resnet_hidden_channels = config.kernel_predictor_hidden_channels
self.resnet_kernel_size = config.kernel_predictor_conv_size
self.num_blocks = config.kernel_predictor_num_blocks
self.leaky_relu_slope = config.leaky_relu_slope
padding = (self.resnet_kernel_size - 1) // 2
self.input_conv = nn.Conv1d(self.resnet_in_channels, self.resnet_hidden_channels, 5, padding=2, bias=True)
self.resblocks = nn.ModuleList([UnivNetKernelPredictorResidualBlock(config) for _ in range(self.num_blocks)])
self.kernel_conv = nn.Conv1d(
self.resnet_hidden_channels, self.kernel_channels, self.resnet_kernel_size, padding=padding, bias=True
)
self.bias_conv = nn.Conv1d(
self.resnet_hidden_channels, self.bias_channels, self.resnet_kernel_size, padding=padding, bias=True
)
def forward(self, spectrogram: torch.FloatTensor):
"""
Maps a conditioning log-mel spectrogram to a tensor of convolutional kernels and biases, for use in location
variable convolutional layers. Note that the input spectrogram should have shape (batch_size, input_channels,
seq_length).
Args:
spectrogram (`torch.FloatTensor` of shape `(batch_size, input_channels, seq_length)`):
Tensor containing the log-mel spectrograms.
Returns:
Tuple[`torch.FloatTensor, `torch.FloatTensor`]: tuple of tensors where the first element is the tensor of
location variable convolution kernels of shape `(batch_size, self.conv_layers, self.conv_in_channels,
self.conv_out_channels, self.conv_kernel_size, seq_length)` and the second element is the tensor of
location variable convolution biases of shape `(batch_size, self.conv_layers. self.conv_out_channels,
seq_length)`.
"""
batch_size, _, seq_length = spectrogram.shape
hidden_states = self.input_conv(spectrogram)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
for resblock in self.resblocks:
hidden_states = resblock(hidden_states)
kernel_hidden_states = self.kernel_conv(hidden_states)
bias_hidden_states = self.bias_conv(hidden_states)
# Reshape kernels and biases to appropriate shape
kernels = kernel_hidden_states.view(
batch_size,
self.conv_layers,
self.conv_in_channels,
self.conv_out_channels,
self.conv_kernel_size,
seq_length,
).contiguous()
biases = bias_hidden_states.view(
batch_size,
self.conv_layers,
self.conv_out_channels,
seq_length,
).contiguous()
return kernels, biases
def apply_weight_norm(self):
nn.utils.weight_norm(self.input_conv)
for layer in self.resblocks:
layer.apply_weight_norm()
nn.utils.weight_norm(self.kernel_conv)
nn.utils.weight_norm(self.bias_conv)
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.input_conv)
for layer in self.resblocks:
layer.remove_weight_norm()
nn.utils.remove_weight_norm(self.kernel_conv)
nn.utils.remove_weight_norm(self.bias_conv)
class UnivNetLvcResidualBlock(nn.Module):
"""
Implementation of the location variable convolution (LVC) residual block for the UnivNet residual network.
Parameters:
config: (`UnivNetConfig`):
Config for the `UnivNetModel` model.
kernel_size (`int`):
The kernel size for the dilated 1D convolutional layer.
dilation (`int`):
The dilation for the dilated 1D convolutional layer.
"""
def __init__(
self,
config: UnivNetConfig,
kernel_size: int,
dilation: int,
):
super().__init__()
self.hidden_channels = config.model_hidden_channels
self.kernel_size = kernel_size
self.dilation = dilation
self.leaky_relu_slope = config.leaky_relu_slope
padding = self.dilation * (self.kernel_size - 1) // 2
self.conv = nn.Conv1d(
self.hidden_channels,
self.hidden_channels,
self.kernel_size,
padding=padding,
dilation=self.dilation,
)
def forward(self, hidden_states, kernel, bias, hop_size=256):
residual = hidden_states
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = self.conv(hidden_states)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = self.location_variable_convolution(hidden_states, kernel, bias, hop_size=hop_size)
# Gated activation unit
hidden_states = torch.sigmoid(hidden_states[:, : self.hidden_channels, :]) * torch.tanh(
hidden_states[:, self.hidden_channels :, :]
)
# Skip connection
hidden_states = residual + hidden_states
return hidden_states
# Based on https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/model/lvcnet.py#L171
def location_variable_convolution(
self,
hidden_states: torch.FloatTensor,
kernel: torch.FloatTensor,
bias: torch.FloatTensor,
dilation: int = 1,
hop_size: int = 256,
):
"""
Performs location-variable convolution operation on the input sequence (hidden_states) using the local
convolution kernel. This was introduced in [LVCNet: Efficient Condition-Dependent Modeling Network for Waveform
Generation](https://arxiv.org/abs/2102.10815) by Zhen Zheng, Jianzong Wang, Ning Cheng, and Jing Xiao.
Time: 414 μs ± 309 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each), test on NVIDIA V100.
Args:
hidden_states (`torch.FloatTensor` of shape `(batch_size, in_channels, in_length)`):
The input sequence of shape (batch, in_channels, in_length).
kernel (`torch.FloatTensor` of shape `(batch_size, in_channels, out_channels, kernel_size, kernel_length)`):
The local convolution kernel of shape (batch, in_channels, out_channels, kernel_size, kernel_length).
bias (`torch.FloatTensor` of shape `(batch_size, out_channels, kernel_length)`):
The bias for the local convolution of shape (batch, out_channels, kernel_length).
dilation (`int`, *optional*, defaults to 1):
The dilation of convolution.
hop_size (`int`, *optional*, defaults to 256):
The hop_size of the conditioning sequence.
Returns:
`torch.FloatTensor`: the output sequence after performing local convolution with shape (batch_size,
out_channels, in_length).
"""
batch, _, in_length = hidden_states.shape
batch, _, out_channels, kernel_size, kernel_length = kernel.shape
if in_length != (kernel_length * hop_size):
raise ValueError(
f"Dim 2 of `hidden_states` should be {kernel_length * hop_size}) but got {in_length}. Please check"
" `hidden_states` or `kernel` and `hop_size` to make sure they are correct."
)
padding = dilation * int((kernel_size - 1) / 2)
# (batch, in_channels, in_length + 2*padding)
hidden_states = nn.functional.pad(hidden_states, (padding, padding), "constant", 0)
# (batch, in_channels, kernel_length, hop_size + 2*padding)
hidden_states = hidden_states.unfold(2, hop_size + 2 * padding, hop_size)
if hop_size < dilation:
hidden_states = nn.functional.pad(hidden_states, (0, dilation), "constant", 0)
# (batch, in_channels, kernel_length, (hop_size + 2*padding)/dilation, dilation)
hidden_states = hidden_states.unfold(3, dilation, dilation)
hidden_states = hidden_states[:, :, :, :, :hop_size]
# (batch, in_channels, kernel_length, dilation, (hop_size + 2*padding)/dilation)
hidden_states = hidden_states.transpose(3, 4)
# (batch, in_channels, kernel_length, dilation, _, kernel_size)
hidden_states = hidden_states.unfold(4, kernel_size, 1)
# Apply local convolution kernel to hidden_states.
output_hidden_states = torch.einsum("bildsk,biokl->bolsd", hidden_states, kernel)
output_hidden_states = output_hidden_states.to(memory_format=torch.channels_last_3d)
bias = bias.unsqueeze(-1).unsqueeze(-1).to(memory_format=torch.channels_last_3d)
output_hidden_states = output_hidden_states + bias
output_hidden_states = output_hidden_states.contiguous().view(batch, out_channels, -1)
return output_hidden_states
def apply_weight_norm(self):
nn.utils.weight_norm(self.conv)
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.conv)
class UnivNetLvcBlock(nn.Module):
"""
Implementation of the location variable convolution (LVC) residual block of the UnivNet residual block. Includes a
`UnivNetKernelPredictor` inside to predict the kernels and biases of the LVC layers.
Based on LVCBlock in
[maum-ai/univnet](https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/model/lvcnet.py#L98)
Parameters:
config (`UnivNetConfig`):
Config for the `UnivNetModel` model.
layer_id (`int`):
An integer corresponding to the index of the current LVC resnet block layer. This should be between 0 and
`len(config.resblock_stride_sizes) - 1)` inclusive.
lvc_hop_size (`int`, *optional*, defaults to 256):
The hop size for the location variable convolutional layers.
"""
def __init__(
self,
config: UnivNetConfig,
layer_id: int,
lvc_hop_size: int = 256,
):
super().__init__()
self.hidden_channels = config.model_hidden_channels
self.kernel_size = config.resblock_kernel_sizes[layer_id]
self.stride = config.resblock_stride_sizes[layer_id]
self.dilations = config.resblock_dilation_sizes[layer_id]
self.cond_hop_length = lvc_hop_size
self.leaky_relu_slope = config.leaky_relu_slope
self.num_blocks = len(self.dilations)
self.convt_pre = nn.ConvTranspose1d(
self.hidden_channels,
self.hidden_channels,
2 * self.stride,
stride=self.stride,
padding=self.stride // 2 + self.stride % 2,
output_padding=self.stride % 2,
)
self.kernel_predictor = UnivNetKernelPredictor(config, self.kernel_size, self.num_blocks)
self.resblocks = nn.ModuleList(
[UnivNetLvcResidualBlock(config, self.kernel_size, self.dilations[i]) for i in range(self.num_blocks)]
)
def forward(self, hidden_states: torch.FloatTensor, spectrogram: torch.FloatTensor):
# hidden_states: (batch_size, hidden_channels, seq_length)
# spectrogram: (batch_size, cond_channels, cond_length)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = self.convt_pre(hidden_states)
kernels, biases = self.kernel_predictor(spectrogram)
for i, resblock in enumerate(self.resblocks):
kernel = kernels[:, i, :, :, :, :]
bias = biases[:, i, :, :]
hidden_states = resblock(hidden_states, kernel, bias, hop_size=self.cond_hop_length)
return hidden_states
def apply_weight_norm(self):
nn.utils.weight_norm(self.convt_pre)
self.kernel_predictor.apply_weight_norm()
for layer in self.resblocks:
layer.apply_weight_norm()
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.convt_pre)
self.kernel_predictor.remove_weight_norm()
for layer in self.resblocks:
layer.remove_weight_norm()
UNIVNET_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.
Parameters:
config ([`UnivNetConfig`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
UNIVNET_INPUTS_DOCSTRING = r"""
Converts a noise waveform and a conditioning spectrogram to a speech waveform. Passing a batch of log-mel
spectrograms returns a batch of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a
single, un-batched speech waveform.
Args:
input_features (`torch.FloatTensor`):
Tensor containing the log-mel spectrograms. Can be batched and of shape `(batch_size, sequence_length,
config.num_mel_channels)`, or un-batched and of shape `(sequence_length, config.num_mel_channels)`.
noise_sequence (`torch.FloatTensor`, *optional*):
Tensor containing a noise sequence of standard Gaussian noise. Can be batched and of shape `(batch_size,
sequence_length, config.model_in_channels)`, or un-batched and of shape (sequence_length,
config.model_in_channels)`. If not supplied, will be randomly generated.
padding_mask (`torch.BoolTensor`, *optional*):
Mask indicating which parts of each sequence are padded. Mask values are selected in `[0, 1]`:
- 1 for tokens that are **not masked**
- 0 for tokens that are **masked**
The mask can be batched and of shape `(batch_size, sequence_length)` or un-batched and of shape
`(sequence_length,)`.
generator (`torch.Generator`, *optional*):
A [torch generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation
deterministic.
return_dict:
Whether to return a [`~utils.ModelOutput`] subclass instead of a plain tuple.
"""
@add_start_docstrings(
"""UnivNet GAN vocoder.""",
UNIVNET_START_DOCSTRING,
)
class UnivNetModel(PreTrainedModel):
config_class = UnivNetConfig
main_input_name = "input_features"
def __init__(self, config: UnivNetConfig):
super().__init__(config)
self.num_kernels = len(config.resblock_kernel_sizes)
self.leaky_relu_slope = config.leaky_relu_slope
self.conv_pre = nn.Conv1d(
config.model_in_channels,
config.model_hidden_channels,
kernel_size=7,
stride=1,
padding=3,
padding_mode="reflect",
)
# Initialize location-variable convolution ResNet Blocks.
num_layers = len(config.resblock_stride_sizes)
hop_length = 1
hop_lengths = []
for stride in config.resblock_stride_sizes:
hop_length = hop_length * stride
hop_lengths.append(hop_length)
self.resblocks = nn.ModuleList(
[
UnivNetLvcBlock(
config,
layer_id=i,
lvc_hop_size=hop_lengths[i],
)
for i in range(num_layers)
]
)
self.conv_post = nn.Conv1d(config.model_hidden_channels, 1, 7, padding=3, padding_mode="reflect")
# Initialize weights and apply final processing
self.post_init()
@add_start_docstrings_to_model_forward(UNIVNET_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=UnivNetModelOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_features: torch.FloatTensor,
noise_sequence: Optional[torch.FloatTensor] = None,
padding_mask: Optional[torch.FloatTensor] = None,
generator: Optional[torch.Generator] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple[torch.FloatTensor], UnivNetModelOutput]:
r"""
Returns:
Example:
```python
>>> from transformers import UnivNetFeatureExtractor, UnivNetModel
>>> from datasets import load_dataset, Audio
>>> model = UnivNetModel.from_pretrained("dg845/univnet-dev")
>>> feature_extractor = UnivNetFeatureExtractor.from_pretrained("dg845/univnet-dev")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> # Resample the audio to the feature extractor's sampling rate.
>>> ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> inputs = feature_extractor(
... ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt"
... )
>>> audio = model(**inputs).waveforms
>>> list(audio.shape)
[1, 140288]
```
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# Resolve batch sizes for noise_sequence and spectrogram
spectrogram_batched = input_features.dim() == 3
if not spectrogram_batched:
input_features = input_features.unsqueeze(0)
spectrogram_batch_size, spectrogram_length, _ = input_features.shape
if noise_sequence is not None:
noise_sequence_batched = noise_sequence.dim() == 3
if not noise_sequence_batched:
noise_sequence = noise_sequence.unsqueeze(0)
else:
# Randomly generate noise_sequence
noise_sequence_shape = (spectrogram_batch_size, spectrogram_length, self.config.model_in_channels)
noise_sequence = torch.randn(
noise_sequence_shape, generator=generator, dtype=input_features.dtype, device=input_features.device
)
noise_sequence_batch_size = noise_sequence.shape[0]
if spectrogram_batch_size > 1 and noise_sequence_batch_size == 1:
# Repeat noise_sequence spectrogram_batch_size times
noise_sequence = noise_sequence.repeat(spectrogram_batch_size, 1, 1)
elif noise_sequence_batch_size > 1 and spectrogram_batch_size == 1:
# Repeat spectrogram noise_sequence_batch_size times
input_features = input_features.repeat(noise_sequence_batch_size, 1, 1)
if noise_sequence_batch_size != spectrogram_batch_size:
raise ValueError(
f"The batch size of `noise_sequence` is {noise_sequence_batch_size} and the batch size of"
f" `input_features` is {spectrogram_batch_size}, but the two are expected to be equal."
)
if padding_mask is not None:
if padding_mask.dim() == 1:
padding_mask = padding_mask.unsqueeze(0)
padding_mask_batch_size = padding_mask.shape[0]
if padding_mask_batch_size != spectrogram_batch_size:
raise ValueError(
f"The batch size of `padding_mask` is {padding_mask_batch_size} and the batch size of"
f" `input_features` is {spectrogram_batch_size}, but the two are expected to be equal."
)
# Change shapes to have channels before sequence lengths
hidden_states = noise_sequence.transpose(2, 1)
input_features = input_features.transpose(2, 1)
hidden_states = self.conv_pre(hidden_states)
for resblock in self.resblocks:
hidden_states = resblock(hidden_states, input_features)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = self.conv_post(hidden_states)
hidden_states = torch.tanh(hidden_states)
# Remove sequence length dimension since this collapses to 1
# NOTE: keep waveforms batched even if there's only one
waveform = hidden_states.squeeze(1)
# Get sequence lengths for UnivNetFeatureExtractor.batch_decode.
waveform_lengths = None
if padding_mask is not None:
# Padding is always contiguous and added on the right
waveform_lengths = torch.sum(padding_mask, dim=1)
if not return_dict:
outputs = (waveform, waveform_lengths)
return outputs
return UnivNetModelOutput(
waveforms=waveform,
waveform_lengths=waveform_lengths,
)
def _init_weights(self, module):
"""Initialize the weights."""
if isinstance(module, (nn.Linear, nn.Conv1d, nn.ConvTranspose1d)):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
def apply_weight_norm(self):
nn.utils.weight_norm(self.conv_pre)
for layer in self.resblocks:
layer.apply_weight_norm()
nn.utils.weight_norm(self.conv_post)
def remove_weight_norm(self):
nn.utils.remove_weight_norm(self.conv_pre)
for layer in self.resblocks:
layer.remove_weight_norm()
nn.utils.remove_weight_norm(self.conv_post)

View File

@ -7985,6 +7985,16 @@ class UniSpeechSatPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
class UnivNetModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class UperNetForSemanticSegmentation(metaclass=DummyObject):
_backends = ["torch"]

View File

View File

@ -0,0 +1,365 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import os
import random
import tempfile
import unittest
import numpy as np
from datasets import Audio, load_dataset
from transformers import UnivNetFeatureExtractor
from transformers.testing_utils import check_json_file_has_correct_format, require_torch, slow
from transformers.utils.import_utils import is_torch_available
from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
if is_torch_available():
import torch
global_rng = random.Random()
# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
"""Creates a random float32 tensor"""
if rng is None:
rng = global_rng
values = []
for batch_idx in range(shape[0]):
values.append([])
for _ in range(shape[1]):
values[-1].append(rng.random() * scale)
return values
class UnivNetFeatureExtractionTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
min_seq_length=400,
max_seq_length=2000,
feature_size=1,
sampling_rate=24000,
padding_value=0.0,
do_normalize=True,
num_mel_bins=100,
hop_length=256,
win_length=1024,
win_function="hann_window",
filter_length=1024,
max_length_s=10,
fmin=0.0,
fmax=12000,
mel_floor=1e-9,
center=False,
compression_factor=1.0,
compression_clip_val=1e-5,
normalize_min=-11.512925148010254,
normalize_max=2.3143386840820312,
model_in_channels=64,
pad_end_length=10,
):
self.parent = parent
self.batch_size = batch_size
self.min_seq_length = min_seq_length
self.max_seq_length = max_seq_length
self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
self.feature_size = feature_size
self.sampling_rate = sampling_rate
self.padding_value = padding_value
self.do_normalize = do_normalize
self.num_mel_bins = num_mel_bins
self.hop_length = hop_length
self.win_length = win_length
self.win_function = win_function
self.filter_length = filter_length
self.max_length_s = max_length_s
self.fmin = fmin
self.fmax = fmax
self.mel_floor = mel_floor
self.center = center
self.compression_factor = compression_factor
self.compression_clip_val = compression_clip_val
self.normalize_min = normalize_min
self.normalize_max = normalize_max
self.model_in_channels = model_in_channels
self.pad_end_length = pad_end_length
def prepare_feat_extract_dict(self):
return {
"feature_size": self.feature_size,
"sampling_rate": self.sampling_rate,
"padding_value": self.padding_value,
"do_normalize": self.do_normalize,
"num_mel_bins": self.num_mel_bins,
"hop_length": self.hop_length,
"win_length": self.win_length,
"win_function": self.win_function,
"filter_length": self.filter_length,
"max_length_s": self.max_length_s,
"fmin": self.fmin,
"fmax": self.fmax,
"mel_floor": self.mel_floor,
"center": self.center,
"compression_factor": self.compression_factor,
"compression_clip_val": self.compression_clip_val,
"normalize_min": self.normalize_min,
"normalize_max": self.normalize_max,
"model_in_channels": self.model_in_channels,
"pad_end_length": self.pad_end_length,
}
def prepare_inputs_for_common(self, equal_length=False, numpify=False):
def _flatten(list_of_lists):
return list(itertools.chain(*list_of_lists))
if equal_length:
speech_inputs = floats_list((self.batch_size, self.max_seq_length))
else:
# make sure that inputs increase in size
speech_inputs = [
_flatten(floats_list((x, self.feature_size)))
for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
]
if numpify:
speech_inputs = [np.asarray(x) for x in speech_inputs]
return speech_inputs
class UnivNetFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
feature_extraction_class = UnivNetFeatureExtractor
def setUp(self):
self.feat_extract_tester = UnivNetFeatureExtractionTester(self)
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_feat_extract_from_and_save_pretrained
def test_feat_extract_from_and_save_pretrained(self):
feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
with tempfile.TemporaryDirectory() as tmpdirname:
saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
check_json_file_has_correct_format(saved_file)
feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)
dict_first = feat_extract_first.to_dict()
dict_second = feat_extract_second.to_dict()
mel_1 = feat_extract_first.mel_filters
mel_2 = feat_extract_second.mel_filters
self.assertTrue(np.allclose(mel_1, mel_2))
self.assertEqual(dict_first, dict_second)
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_feat_extract_to_json_file
def test_feat_extract_to_json_file(self):
feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
with tempfile.TemporaryDirectory() as tmpdirname:
json_file_path = os.path.join(tmpdirname, "feat_extract.json")
feat_extract_first.to_json_file(json_file_path)
feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)
dict_first = feat_extract_first.to_dict()
dict_second = feat_extract_second.to_dict()
mel_1 = feat_extract_first.mel_filters
mel_2 = feat_extract_second.mel_filters
self.assertTrue(np.allclose(mel_1, mel_2))
self.assertEqual(dict_first, dict_second)
def test_call(self):
# Tests that all call wrap to encode_plus and batch_encode_plus
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
# create three inputs of length 800, 1000, and 1200
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
# Test feature size
input_features = feature_extractor(
np_speech_inputs, padding="max_length", max_length=1600, return_tensors="np"
).input_features
self.assertTrue(input_features.ndim == 3)
# Note: for some reason I get a weird padding error when feature_size > 1
# self.assertTrue(input_features.shape[-2] == feature_extractor.feature_size)
# Note: we use the shape convention (batch_size, seq_len, num_mel_bins)
self.assertTrue(input_features.shape[-1] == feature_extractor.num_mel_bins)
# Test not batched input
encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
# Test batched
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
# Test 2-D numpy arrays are batched.
speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
np_speech_inputs = np.asarray(speech_inputs)
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
# Test truncation required
speech_inputs = [
floats_list((1, x))[0]
for x in range((feature_extractor.num_max_samples - 100), (feature_extractor.num_max_samples + 500), 200)
]
np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
speech_inputs_truncated = [x[: feature_extractor.num_max_samples] for x in speech_inputs]
np_speech_inputs_truncated = [np.asarray(speech_input) for speech_input in speech_inputs_truncated]
encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs_truncated, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
def test_batched_unbatched_consistency(self):
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
speech_inputs = floats_list((1, 800))[0]
np_speech_inputs = np.asarray(speech_inputs)
# Test unbatched vs batched list
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor([speech_inputs], return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
# Test np.ndarray vs List[np.ndarray]
encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor([np_speech_inputs], return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
# Test unbatched np.ndarray vs batched np.ndarray
encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(
np.expand_dims(np_speech_inputs, axis=0), return_tensors="np"
).input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
def test_generate_noise(self):
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
features = feature_extractor(speech_inputs, return_noise=True)
input_features = features.input_features
noise_features = features.noise_sequence
for spectrogram, noise in zip(input_features, noise_features):
self.assertEqual(spectrogram.shape[0], noise.shape[0])
def test_pad_end(self):
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
input_features1 = feature_extractor(speech_inputs, padding=False, pad_end=False).input_features
input_features2 = feature_extractor(speech_inputs, padding=False, pad_end=True).input_features
for spectrogram1, spectrogram2 in zip(input_features1, input_features2):
self.assertEqual(spectrogram1.shape[0] + self.feat_extract_tester.pad_end_length, spectrogram2.shape[0])
def test_generate_noise_and_pad_end(self):
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
features = feature_extractor(speech_inputs, padding=False, return_noise=True, pad_end=True)
input_features = features.input_features
noise_features = features.noise_sequence
for spectrogram, noise in zip(input_features, noise_features):
self.assertEqual(spectrogram.shape[0], noise.shape[0])
@require_torch
def test_batch_decode(self):
import torch
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
input_lengths = list(range(800, 1400, 200))
pad_samples = feature_extractor.pad_end_length * feature_extractor.hop_length
output_features = {
"waveforms": torch.tensor(floats_list((3, max(input_lengths) + pad_samples))),
"waveform_lengths": torch.tensor(input_lengths),
}
waveforms = feature_extractor.batch_decode(**output_features)
for input_length, waveform in zip(input_lengths, waveforms):
self.assertTrue(len(waveform.shape) == 1, msg="Individual output waveforms should be 1D")
self.assertEqual(waveform.shape[0], input_length)
@require_torch
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_double_precision_pad
def test_double_precision_pad(self):
import torch
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
py_speech_inputs = np_speech_inputs.tolist()
for inputs in [py_speech_inputs, np_speech_inputs]:
np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
self.assertTrue(np_processed.input_features.dtype == np.float32)
pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
self.assertTrue(pt_processed.input_features.dtype == torch.float32)
def _load_datasamples(self, num_samples):
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=self.feat_extract_tester.sampling_rate))
# automatic decoding with librispeech
speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
return [x["array"] for x in speech_samples], [x["sampling_rate"] for x in speech_samples]
@slow
@require_torch
def test_integration(self):
# fmt: off
EXPECTED_INPUT_FEATURES = torch.tensor(
[
-5.0229, -6.1358, -5.8346, -5.4447, -5.6707, -5.8577, -5.0464, -5.0058,
-5.6015, -5.6410, -5.4325, -5.6116, -5.3700, -5.7956, -5.3196, -5.3274,
-5.9655, -5.6057, -5.8382, -5.9602, -5.9005, -5.9123, -5.7669, -6.1441,
-5.5168, -5.1405, -5.3927, -6.0032, -5.5784, -5.3728
],
)
# fmt: on
input_speech, sr = self._load_datasamples(1)
feature_extractor = UnivNetFeatureExtractor()
input_features = feature_extractor(input_speech, sampling_rate=sr[0], return_tensors="pt").input_features
self.assertEqual(input_features.shape, (1, 548, 100))
input_features_mean = torch.mean(input_features)
input_features_stddev = torch.std(input_features)
EXPECTED_MEAN = torch.tensor(-6.18862009)
EXPECTED_STDDEV = torch.tensor(2.80845642)
torch.testing.assert_close(input_features_mean, EXPECTED_MEAN, atol=5e-5, rtol=5e-6)
torch.testing.assert_close(input_features_stddev, EXPECTED_STDDEV)
torch.testing.assert_close(input_features[0, :30, 0], EXPECTED_INPUT_FEATURES, atol=1e-4, rtol=1e-5)

View File

@ -0,0 +1,365 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import inspect
import random
import unittest
from datasets import Audio, load_dataset
from transformers import UnivNetConfig, UnivNetFeatureExtractor
from transformers.testing_utils import (
is_torch_available,
require_torch,
require_torch_gpu,
slow,
torch_device,
)
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
ModelTesterMixin,
floats_tensor,
)
if is_torch_available():
import torch
from transformers import UnivNetModel
class UnivNetModelTester:
def __init__(
self,
parent,
batch_size=2,
seq_length=7,
in_channels=8,
hidden_channels=8,
num_mel_bins=20,
kernel_predictor_hidden_channels=8,
seed=0,
is_training=False,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.in_channels = in_channels
self.hidden_channels = hidden_channels
self.num_mel_bins = num_mel_bins
self.kernel_predictor_hidden_channels = kernel_predictor_hidden_channels
self.seed = seed
self.is_training = is_training
def prepare_noise_sequence(self):
generator = torch.manual_seed(self.seed)
noise_shape = (self.seq_length, self.in_channels)
# Create noise on CPU for reproducibility
noise_sequence = torch.randn(noise_shape, generator=generator, dtype=torch.float)
return noise_sequence
def prepare_config_and_inputs(self):
spectrogram = floats_tensor([self.seq_length, self.num_mel_bins], scale=1.0)
noise_sequence = self.prepare_noise_sequence()
noise_sequence = noise_sequence.to(spectrogram.device)
config = self.get_config()
return config, spectrogram, noise_sequence
def get_config(self):
return UnivNetConfig(
model_in_channels=self.in_channels,
model_hidden_channels=self.hidden_channels,
num_mel_bins=self.num_mel_bins,
kernel_predictor_hidden_channels=self.kernel_predictor_hidden_channels,
)
def create_and_check_model(self, config, spectrogram, noise_sequence):
model = UnivNetModel(config=config).to(torch_device).eval()
result = model(spectrogram, noise_sequence)[0]
self.parent.assertEqual(result.shape, (1, self.seq_length * 256))
def prepare_config_and_inputs_for_common(self):
config, spectrogram, noise_sequence = self.prepare_config_and_inputs()
inputs_dict = {"input_features": spectrogram, "noise_sequence": noise_sequence}
return config, inputs_dict
@require_torch
class UnivNetModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (UnivNetModel,) if is_torch_available() else ()
# UnivNetModel currently cannot be traced with torch.jit.trace.
test_torchscript = False
# The UnivNetModel is not a transformer and does not use any attention mechanisms, so skip transformer/attention
# related tests.
test_pruning = False
test_resize_embeddings = False
test_resize_position_embeddings = False
test_head_masking = False
# UnivNetModel is not a sequence classification model.
test_mismatched_shapes = False
# UnivNetModel does not have a base_model_prefix attribute.
test_missing_keys = False
# UnivNetModel does not implement a parallelize method.
test_model_parallel = False
is_encoder_decoder = False
has_attentions = False
input_name = "input_features"
def setUp(self):
self.model_tester = UnivNetModelTester(self)
self.config_tester = ConfigTester(self, config_class=UnivNetConfig)
def test_config(self):
self.config_tester.create_and_test_config_to_json_string()
self.config_tester.create_and_test_config_to_json_file()
self.config_tester.create_and_test_config_from_and_save_pretrained()
self.config_tester.create_and_test_config_from_and_save_pretrained_subfolder()
self.config_tester.create_and_test_config_with_num_labels()
self.config_tester.check_config_can_be_init_without_params()
self.config_tester.check_config_arguments_init()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = [
"input_features",
]
self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
@unittest.skip(reason="UnivNetModel does not output hidden_states.")
def test_hidden_states_output(self):
pass
@unittest.skip(reason="UnivNetModel.forward does not accept an inputs_embeds argument.")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="UnivNetModel does not use input embeddings and thus has no get_input_embeddings method.")
def test_model_common_attributes(self):
pass
@unittest.skip(reason="UnivNetModel does not support all arguments tested, such as output_hidden_states.")
def test_model_outputs_equivalence(self):
pass
@unittest.skip(reason="UnivNetModel does not output hidden_states.")
def test_retain_grad_hidden_states_attentions(self):
pass
def test_batched_inputs_outputs(self):
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
batched_spectrogram = inputs["input_features"].unsqueeze(0).repeat(2, 1, 1)
batched_noise_sequence = inputs["noise_sequence"].unsqueeze(0).repeat(2, 1, 1)
with torch.no_grad():
batched_outputs = model(
batched_spectrogram.to(torch_device),
batched_noise_sequence.to(torch_device),
)[0]
self.assertEqual(
batched_spectrogram.shape[0],
batched_outputs.shape[0],
msg="Got different batch dims for input and output",
)
def test_unbatched_inputs_outputs(self):
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(inputs["input_features"].to(torch_device), inputs["noise_sequence"].to(torch_device))[
0
]
self.assertTrue(outputs.shape[0] == 1, msg="Unbatched input should create batched output with bsz = 1")
def test_unbatched_batched_outputs_consistency(self):
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
unbatched_spectrogram = inputs["input_features"].detach().clone()
unbatched_noise_sequence = inputs["noise_sequence"].detach().clone()
batched_spectrogram = inputs["input_features"].unsqueeze(0)
batched_noise_sequence = inputs["noise_sequence"].unsqueeze(0)
with torch.no_grad():
unbatched_outputs = model(
unbatched_spectrogram.to(torch_device),
unbatched_noise_sequence.to(torch_device),
)[0]
batched_outputs = model(
batched_spectrogram.to(torch_device),
batched_noise_sequence.to(torch_device),
)[0]
torch.testing.assert_close(unbatched_outputs, batched_outputs)
@require_torch_gpu
@slow
class UnivNetModelIntegrationTests(unittest.TestCase):
def tearDown(self):
super().tearDown()
gc.collect()
torch.cuda.empty_cache()
def _load_datasamples(self, num_samples, sampling_rate=24000):
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
# automatic decoding with librispeech
speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
return [x["array"] for x in speech_samples], [x["sampling_rate"] for x in speech_samples]
def get_inputs(self, device, num_samples: int = 3, noise_length: int = 10, seed: int = 0):
generator = torch.manual_seed(seed)
# Note: hardcode model_in_channels -> 64
if num_samples == 1:
noise_sequence_shape = (64, noise_length)
else:
noise_sequence_shape = (num_samples, 64, noise_length)
# Explicity generate noise_sequence on CPU for consistency.
noise_sequence = torch.randn(noise_sequence_shape, generator=generator, dtype=torch.float32, device="cpu")
# Put noise_sequence on the desired device.
noise_sequence = noise_sequence.to(device)
# Note: hardcode num_mel_channels -> 100
if num_samples == 1:
spectrogram_shape = [100, noise_length]
else:
spectrogram_shape = [num_samples, 100, noise_length]
spectrogram = floats_tensor(spectrogram_shape, scale=1.0, rng=random.Random(seed))
# Note: spectrogram should already be on torch_device
# Permute to match diffusers implementation
if num_samples == 1:
noise_sequence = noise_sequence.transpose(1, 0)
spectrogram = spectrogram.transpose(1, 0)
else:
noise_sequence = noise_sequence.transpose(2, 1)
spectrogram = spectrogram.transpose(2, 1)
inputs = {
"input_features": spectrogram,
"noise_sequence": noise_sequence,
"generator": generator,
}
return inputs
def test_model_inference_batched(self):
# Load sample checkpoint from Tortoise TTS
model = UnivNetModel.from_pretrained("dg845/univnet-dev")
model.eval().to(torch_device)
# Get batched noise and spectrogram inputs.
input_speech = self.get_inputs(torch_device, num_samples=3)
with torch.no_grad():
waveform = model(**input_speech)[0]
waveform = waveform.cpu()
waveform_mean = torch.mean(waveform)
waveform_stddev = torch.std(waveform)
waveform_slice = waveform[-1, -9:].flatten()
EXPECTED_MEAN = torch.tensor(-0.19989729)
EXPECTED_STDDEV = torch.tensor(0.35230172)
EXPECTED_SLICE = torch.tensor([-0.3408, -0.6045, -0.5052, 0.1160, -0.1556, -0.0405, -0.3024, -0.5290, -0.5019])
torch.testing.assert_close(waveform_mean, EXPECTED_MEAN, atol=1e-4, rtol=1e-5)
torch.testing.assert_close(waveform_stddev, EXPECTED_STDDEV, atol=1e-4, rtol=1e-5)
torch.testing.assert_close(waveform_slice, EXPECTED_SLICE, atol=5e-4, rtol=1e-5)
def test_model_inference_unbatched(self):
# Load sample checkpoint from Tortoise TTS
model = UnivNetModel.from_pretrained("dg845/univnet-dev")
model.eval().to(torch_device)
# Get unbatched noise and spectrogram inputs.
input_speech = self.get_inputs(torch_device, num_samples=1)
with torch.no_grad():
waveform = model(**input_speech)[0]
waveform = waveform.cpu()
waveform_mean = torch.mean(waveform)
waveform_stddev = torch.std(waveform)
waveform_slice = waveform[-1, -9:].flatten()
EXPECTED_MEAN = torch.tensor(-0.22895093)
EXPECTED_STDDEV = torch.tensor(0.33986747)
EXPECTED_SLICE = torch.tensor([-0.3276, -0.5504, -0.3484, 0.3574, -0.0373, -0.1826, -0.4880, -0.6431, -0.5162])
torch.testing.assert_close(waveform_mean, EXPECTED_MEAN, atol=1e-4, rtol=1e-5)
torch.testing.assert_close(waveform_stddev, EXPECTED_STDDEV, atol=1e-4, rtol=1e-5)
torch.testing.assert_close(waveform_slice, EXPECTED_SLICE, atol=1e-3, rtol=1e-5)
def test_integration(self):
feature_extractor = UnivNetFeatureExtractor.from_pretrained("dg845/univnet-dev")
model = UnivNetModel.from_pretrained("dg845/univnet-dev")
model.eval().to(torch_device)
audio, sr = self._load_datasamples(1, sampling_rate=feature_extractor.sampling_rate)
input_features = feature_extractor(audio, sampling_rate=sr[0], return_tensors="pt").input_features
input_features = input_features.to(device=torch_device)
input_speech = self.get_inputs(torch_device, num_samples=1, noise_length=input_features.shape[1])
input_speech["input_features"] = input_features
with torch.no_grad():
waveform = model(**input_speech)[0]
waveform = waveform.cpu()
waveform_mean = torch.mean(waveform)
waveform_stddev = torch.std(waveform)
waveform_slice = waveform[-1, -9:].flatten()
EXPECTED_MEAN = torch.tensor(0.00051374)
EXPECTED_STDDEV = torch.tensor(0.058105603)
# fmt: off
EXPECTED_SLICE = torch.tensor([-4.3934e-04, -1.8203e-04, -3.3033e-04, -3.8716e-04, -1.6125e-04, 3.5389e-06, -3.3149e-04, -3.7613e-04, -2.3331e-04])
# fmt: on
torch.testing.assert_close(waveform_mean, EXPECTED_MEAN, atol=5e-6, rtol=1e-5)
torch.testing.assert_close(waveform_stddev, EXPECTED_STDDEV, atol=1e-4, rtol=1e-5)
torch.testing.assert_close(waveform_slice, EXPECTED_SLICE, atol=5e-6, rtol=1e-5)