Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration (#24799)

* initial commit
* Add initial testing files and modify __init__ files to add UnivNet imports.
* Fix some bugs
* Add checkpoint conversion script and add references to transformers pre-trained model.
* Add UnivNet entries for auto.
* Add initial docs for UnivNet.
* Handle input and output shapes in UnivNetGan.forward and add initial docstrings.
* Write tests and make them pass.
* Write docs.
* Add UnivNet doc to _toctree.yml and improve docs.
* fix typo
* make fixup
* make fix-copies
* Add upsample_rates parameter to config and improve config documentation.
* make fixup
* make fix-copies
* Remove unused upsample_rates config parameter.
* apply suggestions from review
* make style
* Verify and add reason for skipped tests inherited from ModelTesterMixin.
* Add initial UnivNetGan integration tests
* make style
* Remove noise_length input to UnivNetGan and improve integration tests.
* Fix bug and make style
* Make UnivNet integration tests pass
* Add initial code for UnivNetFeatureExtractor.
* make style
* Add initial tests for UnivNetFeatureExtractor.
* make style
* Properly initialize weights for UnivNetGan
* Get feature extractor fast tests passing
* make style
* Get feature extractor integration tests passing
* Get UnivNet integration tests passing
* make style
* Add UnivNetGan usage example
* make style and use feature extractor from hub in integration tests
* Update tips in docs
* apply suggestions from review
* make style
* Calculate padding directly instead of using get_padding methods.
* Update UnivNetFeatureExtractor.to_dict to be UnivNet-specific.
* Update feature extractor to support using model(**inputs) and add the ability to generate noise and pad the end of the spectrogram in __call__.
* Perform padding before generating noise to ensure the shapes are correct.
* Rename UnivNetGan.forward's noise_waveform argument to noise_sequence.
* make style
* Add tests to test generating noise and padding the end for UnivNetFeatureExtractor.__call__.
* Add tests for checking batched vs unbatched inputs for UnivNet feature extractor and model.
* Add expected mean and stddev checks to the integration tests and make them pass.
* make style
* Make it possible to use model(**inputs), where inputs is the output of the feature extractor.
* fix typo in UnivNetGanConfig example
* Calculate spectrogram_zero from other config values.
* apply suggestions from review
* make style
* Refactor UnivNet conversion script to use load_state_dict (following persimmon).
* Rename UnivNetFeatureExtractor to UnivNetGanFeatureExtractor.
* make style
* Switch to using torch.tensor and torch.testing.assert_close for testing expected values/slices.
* make style
* Use config in UnivNetGan modeling blocks.
* make style
* Rename the spectrogram argument of UnivNetGan.forward to input_features, following Whisper.
* make style
* Improving padding documentation.
* Add UnivNet usage example to the docs.
* apply suggestions from review
* Move dynamic_range_compression computation into the mel_spectrogram method of the feature extractor.
* Improve UnivNetGan.forward return docstring.
* Update table in docs/source/en/index.md.
* make fix-copies
* Rename UnivNet components to have pattern UnivNet*.
* make style
* make fix-copies
* Update docs
* make style
* Increase tolerance on flaky unbatched integration test.
* Remove torch.no_grad decorators from UnivNet integration tests to try to avoid flax/Tensorflow test errors.
* Add padding_mask argument to UnivNetModel.forward and add batch_decode feature extractor method to remove padding.
* Update documentation and clean up padding code.
* make style
* make style
* Remove torch dependency from UnivNetFeatureExtractor.
* make style
* Fix UnivNetModel usage example
* Clean up feature extractor code/docstrings.
* apply suggestions from review
* make style
* Add comments for tests skipped via ModelTesterMixin flags.
* Add comment for model parallel tests skipped via the test_model_parallel ModelTesterMixin flag.
* Add # Copied from statements to copied UnivNetFeatureExtractionTest tests.
* Simplify UnivNetFeatureExtractorTest.test_batch_decode.
* Add support for unbatched padding_masks in UnivNetModel.forward.
* Refactor unbatched padding_mask support.
* make style
2025-08-01 02:31:11 +06:00 · 2023-11-22 08:21:36 -08:00 · 2023-11-22 08:21:36 -08:00 · 7f6a804d30
commit 7f6a804d30
parent 4151fbb49c
24 changed files with 2299 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -494,6 +494,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
--- a/README_es.md
+++ b/README_es.md
@@ -469,6 +469,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
--- a/README_hd.md
+++ b/README_hd.md
@@ -443,6 +443,7 @@ conda install -c huggingface transformers
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research से) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. द्वाराअनुसंधान पत्र [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) के साथ जारी किया गया
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (माइक्रोसॉफ्ट रिसर्च से) साथ में दिया गया पेपर [UniSpeech: यूनिफाइड स्पीच रिप्रेजेंटेशन लर्निंग विद लेबलेड एंड अनलेबल्ड डेटा](https:/ /arxiv.org/abs/2101.07597) चेंगई वांग, यू वू, याओ कियान, केनिची कुमातानी, शुजी लियू, फुरु वेई, माइकल ज़ेंग, ज़ुएदोंग हुआंग द्वारा।
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [UNISPEECH-SAT: यूनिवर्सल स्पीच रिप्रेजेंटेशन लर्निंग विद स्पीकर अवेयर प्री-ट्रेनिंग ](https://arxiv.org/abs/2110.05752) सानयुआन चेन, यू वू, चेंग्यी वांग, झेंगयांग चेन, झूओ चेन, शुजी लियू, जियान वू, याओ कियान, फुरु वेई, जिन्यु ली, जियांगज़ान यू द्वारा पोस्ट किया गया।
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (सिंघुआ यूनिवर्सिटी और ननकाई यूनिवर्सिटी से) साथ में पेपर [विजुअल अटेंशन नेटवर्क](https://arxiv.org/ pdf/2202.09741.pdf) मेंग-हाओ गुओ, चेंग-ज़े लू, झेंग-निंग लियू, मिंग-मिंग चेंग, शि-मिन हू द्वारा।
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (मल्टीमीडिया कम्प्यूटिंग ग्रुप, नानजिंग यूनिवर्सिटी से) साथ में पेपर [वीडियोएमएई: मास्क्ड ऑटोएन्कोडर स्व-पर्यवेक्षित वीडियो प्री-ट्रेनिंग के लिए डेटा-कुशल सीखने वाले हैं] (https://arxiv.org/abs/2203.12602) ज़ान टोंग, यिबिंग सॉन्ग, जुए द्वारा वांग, लिमिन वांग द्वारा पोस्ट किया गया।
--- a/README_ja.md
+++ b/README_ja.md
@@ -503,6 +503,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research から) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. から公開された研究論文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research から) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang から公開された研究論文: [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (Microsoft Research から) Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu から公開された研究論文: [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752)
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (Peking University から) Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. から公開された研究論文 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221)
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (Tsinghua University and Nankai University から) Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu から公開された研究論文: [Visual Attention Network](https://arxiv.org/abs/2202.09741)
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (Multimedia Computing Group, Nanjing University から) Zhan Tong, Yibing Song, Jue Wang, Limin Wang から公開された研究論文: [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602)
--- a/README_ko.md
+++ b/README_ko.md
@@ -418,6 +418,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research 에서 제공)은 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.의 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)논문과 함께 발표했습니다.
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research 에서) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 의 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 논문과 함께 발표했습니다.
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (Microsoft Research 에서) Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 의 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 논문과 함께 발표했습니다.
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (Peking University 에서 제공)은 Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.의 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221)논문과 함께 발표했습니다.
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (Tsinghua University and Nankai University 에서) Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu 의 [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) 논문과 함께 발표했습니다.
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (Multimedia Computing Group, Nanjing University 에서) Zhan Tong, Yibing Song, Jue Wang, Limin Wang 의 [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 논문과 함께 발표했습니다.
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -442,6 +442,7 @@ conda install -c huggingface transformers
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (来自 Google Research) 伴随论文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) 由 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant 发布。
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (来自 Microsoft Research) 伴随论文 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 由 Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 发布。
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (来自 Microsoft Research) 伴随论文 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 由 Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 发布。
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (来自 Peking University) 伴随论文 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) 由 Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun 发布。
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (来自 Tsinghua University and Nankai University) 伴随论文 [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) 由 Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu 发布。
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (来自 Multimedia Computing Group, Nanjing University) 伴随论文 [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 由 Zhan Tong, Yibing Song, Jue Wang, Limin Wang 发布。
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -454,6 +454,7 @@ conda install -c huggingface transformers
 1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
+1. **[UnivNet](https://huggingface.co/docs/transformers/main/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 
 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.
 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -628,6 +628,8 @@
        title: UniSpeech
      - local: model_doc/unispeech-sat
        title: UniSpeech-SAT
+      - local: model_doc/univnet
+        title: UnivNet
      - local: model_doc/vits
        title: VITS
      - local: model_doc/wav2vec2
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -269,6 +269,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                          [UMT5](model_doc/umt5)                          |       ✅        |         ❌         |      ❌      |
 |                     [UniSpeech](model_doc/unispeech)                     |       ✅        |         ❌         |      ❌      |
 |                 [UniSpeechSat](model_doc/unispeech-sat)                  |       ✅        |         ❌         |      ❌      |
+|                       [UnivNet](model_doc/univnet)                       |       ✅        |         ❌         |      ❌      |
 |                       [UPerNet](model_doc/upernet)                       |       ✅        |         ❌         |      ❌      |
 |                           [VAN](model_doc/van)                           |       ✅        |         ❌         |      ❌      |
 |                      [VideoMAE](model_doc/videomae)                      |       ✅        |         ❌         |      ❌      |
--- a/docs/source/en/model_doc/univnet.md
+++ b/docs/source/en/model_doc/univnet.md
@@ -0,0 +1,80 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# UnivNet
+
+## Overview
+
+The UnivNet model was proposed in [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim.
+The UnivNet model is a generative adversarial network (GAN) trained to synthesize high-fidelity speech waveforms. The UnivNet model shared in `transformers` is the *generator*, which maps a conditioning log-mel spectrogram and an optional noise sequence to a speech waveform (i.e., it acts as a vocoder). Only the generator is required for inference. The *discriminator* used to train the generator is not implemented.
+
+The abstract from the paper is the following:
+
+*Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.*
+
+Tips:
+
+- The `noise_sequence` argument for [`UnivNetModel.forward`] should be standard Gaussian noise (such as from `torch.randn`) of shape `([batch_size], noise_length, model.config.model_in_channels)`, where `noise_length` should match the length dimension (dimension 1) of the `input_features` argument. If not supplied, it will be randomly generated; a `torch.Generator` can be supplied to the `generator` argument so that the forward pass can be reproduced. (Note that [`UnivNetFeatureExtractor`] will return generated noise by default, so it shouldn't be necessary to generate `noise_sequence` manually.)
+- Padding added by [`UnivNetFeatureExtractor`] can be removed from the [`UnivNetModel`] output through the [`UnivNetFeatureExtractor.batch_decode`] method, as shown in the usage example below.
+- Padding the end of each waveform with silence can reduce artifacts at the end of the generated audio sample. This can be done by supplying `pad_end=True` to [`UnivNetFeatureExtractor.__call__`]. See [this issue](https://github.com/seungwonpark/melgan/issues/8) for more details.
+
+Usage Example:
+
+```python
+import torch
+from scipy.io.wavfile import write
+from datasets import Audio, load_dataset
+
+from transformers import UnivNetFeatureExtractor, UnivNetModel
+
+model_id_or_path = "dg845/univnet-dev"
+model = UnivNetModel.from_pretrained(model_id_or_path)
+feature_extractor = UnivNetFeatureExtractor.from_pretrained(model_id_or_path)
+
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+# Resample the audio to the model and feature extractor's sampling rate.
+ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
+# Pad the end of the converted waveforms to reduce artifacts at the end of the output audio samples.
+inputs = feature_extractor(
+    ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], pad_end=True, return_tensors="pt"
+)
+
+with torch.no_grad():
+    audio = model(**inputs)
+
+# Remove the extra padding at the end of the output.
+audio = feature_extractor.batch_decode(**audio)[0]
+# Convert to wav file
+write("sample_audio.wav", feature_extractor.sampling_rate, audio)
+```
+
+This model was contributed by [dg845](https://huggingface.co/dg845).
+To the best of my knowledge, there is no official code release, but an unofficial implementation can be found at [maum-ai/univnet](https://github.com/maum-ai/univnet) with pretrained checkpoints [here](https://github.com/maum-ai/univnet#pre-trained-model).
+
+
+## UnivNetConfig
+
+[[autodoc]] UnivNetConfig
+
+## UnivNetFeatureExtractor
+
+[[autodoc]] UnivNetFeatureExtractor
+    - __call__
+
+## UnivNetModel
+
+[[autodoc]] UnivNetModel
+    - forward
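Reviewer note on the `noise_sequence` tip in `univnet.md` above: the shape rule can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not code from the commit. `expected_noise_shape` and `standard_gaussian_noise` are hypothetical helpers, the channel count (64) and mel-bin count (100) are placeholder values, and `random.gauss` stands in for `torch.randn`.

```python
import random


def expected_noise_shape(input_features_shape, model_in_channels):
    """Return the noise_sequence shape implied by an input_features shape.

    Batched input (batch, length, mel_bins) -> (batch, length, model_in_channels).
    Unbatched input (length, mel_bins)      -> (length, model_in_channels).
    """
    if len(input_features_shape) == 3:
        batch_size, length, _mel_bins = input_features_shape
        return (batch_size, length, model_in_channels)
    if len(input_features_shape) == 2:
        length, _mel_bins = input_features_shape
        return (length, model_in_channels)
    raise ValueError("input_features must be 2-D (unbatched) or 3-D (batched)")


def standard_gaussian_noise(shape, seed=None):
    """Build nested lists of N(0, 1) samples with the given shape."""
    rng = random.Random(seed)  # a seeded RNG makes the noise reproducible

    def build(dims):
        if len(dims) == 1:
            return [rng.gauss(0.0, 1.0) for _ in range(dims[0])]
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(list(shape))


# Placeholder sizes: 2 spectrograms, 50 frames, 100 mel bins, 64 noise channels.
noise_shape = expected_noise_shape((2, 50, 100), model_in_channels=64)
noise = standard_gaussian_noise(noise_shape, seed=0)
```

Seeding the stand-in RNG plays the role that the tip assigns to the `generator` argument: the same seed yields the same noise, so a forward pass can be reproduced.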
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -611,6 +611,11 @@ _import_structure = {
        "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "UniSpeechSatConfig",
    ],
+    "models.univnet": [
+        "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "UnivNetConfig",
+        "UnivNetFeatureExtractor",
+    ],
    "models.upernet": ["UperNetConfig"],
    "models.videomae": ["VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "VideoMAEConfig"],
    "models.vilt": [
@@ -2977,6 +2982,12 @@ else:
            "UniSpeechSatPreTrainedModel",
        ]
    )
+    _import_structure["models.univnet"].extend(
+        [
+            "UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "UnivNetModel",
+        ]
+    )
    _import_structure["models.upernet"].extend(
        [
            "UperNetForSemanticSegmentation",
@@ -4817,6 +4828,11 @@ if TYPE_CHECKING:
    from .models.umt5 import UMT5Config
    from .models.unispeech import UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechConfig
    from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig
+    from .models.univnet import (
+        UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        UnivNetConfig,
+        UnivNetFeatureExtractor,
+    )
    from .models.upernet import UperNetConfig
    from .models.videomae import VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP, VideoMAEConfig
    from .models.vilt import (
@@ -6807,6 +6823,7 @@ if TYPE_CHECKING:
            UniSpeechSatModel,
            UniSpeechSatPreTrainedModel,
        )
+        from .models.univnet import UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST, UnivNetModel
        from .models.upernet import UperNetForSemanticSegmentation, UperNetPreTrainedModel
        from .models.videomae import (
            VIDEOMAE_PRETRAINED_MODEL_ARCHIVE_LIST,
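Reviewer note on the hunks above: the config and feature extractor names are registered unconditionally, while the model names are appended only in the branch that runs when PyTorch is available. A minimal sketch of that pattern follows. It is not the library's actual loader: `build_import_structure` is a hypothetical helper, and `importlib.util.find_spec` is used here as an assumed stand-in for the library's own availability check.

```python
import importlib.util


def build_import_structure(torch_available):
    """Map a submodule name to the public names it exposes."""
    # Names that do not need PyTorch are always registered.
    structure = {
        "models.univnet": [
            "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP",
            "UnivNetConfig",
            "UnivNetFeatureExtractor",
        ],
    }
    # Model classes are added only when the optional dependency is present.
    if torch_available:
        structure["models.univnet"].extend(
            [
                "UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST",
                "UnivNetModel",
            ]
        )
    return structure


# Stand-in availability check: True only if a `torch` package can be located.
torch_available = importlib.util.find_spec("torch") is not None
import_structure = build_import_structure(torch_available)
```

Keeping the feature extractor in the unconditional list matches the commit message entry "Remove torch dependency from UnivNetFeatureExtractor".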
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -211,6 +211,7 @@ from . import (
    umt5,
    unispeech,
    unispeech_sat,
+    univnet,
    upernet,
    videomae,
    vilt,
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -218,6 +218,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("umt5", "UMT5Config"),
        ("unispeech", "UniSpeechConfig"),
        ("unispeech-sat", "UniSpeechSatConfig"),
+        ("univnet", "UnivNetConfig"),
        ("upernet", "UperNetConfig"),
        ("van", "VanConfig"),
        ("videomae", "VideoMAEConfig"),
@@ -424,6 +425,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("tvp", "TVP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("unispeech", "UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("unispeech-sat", "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("univnet", "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("van", "VAN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("videomae", "VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("vilt", "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -667,6 +669,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("umt5", "UMT5"),
        ("unispeech", "UniSpeech"),
        ("unispeech-sat", "UniSpeechSat"),
+        ("univnet", "UnivNet"),
        ("upernet", "UPerNet"),
        ("van", "VAN"),
        ("videomae", "VideoMAE"),
--- a/src/transformers/models/auto/feature_extraction_auto.py
+++ b/src/transformers/models/auto/feature_extraction_auto.py
@@ -91,6 +91,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("tvlt", "TvltFeatureExtractor"),
        ("unispeech", "Wav2Vec2FeatureExtractor"),
        ("unispeech-sat", "Wav2Vec2FeatureExtractor"),
+        ("univnet", "UnivNetFeatureExtractor"),
        ("van", "ConvNextFeatureExtractor"),
        ("videomae", "VideoMAEFeatureExtractor"),
        ("vilt", "ViltFeatureExtractor"),
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -204,6 +204,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("umt5", "UMT5Model"),
        ("unispeech", "UniSpeechModel"),
        ("unispeech-sat", "UniSpeechSatModel"),
+        ("univnet", "UnivNetModel"),
        ("van", "VanModel"),
        ("videomae", "VideoMAEModel"),
        ("vilt", "ViltModel"),
--- a/src/transformers/models/univnet/__init__.py
+++ b/src/transformers/models/univnet/__init__.py
@@ -0,0 +1,65 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_univnet": [
+        "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "UnivNetConfig",
+    ],
+    "feature_extraction_univnet": ["UnivNetFeatureExtractor"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_univnet"] = [
+        "UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "UnivNetModel",
+    ]
+
+
+if TYPE_CHECKING:
+    from .configuration_univnet import (
+        UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        UnivNetConfig,
+    )
+    from .feature_extraction_univnet import UnivNetFeatureExtractor
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_univnet import (
+            UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST,
+            UnivNetModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
--- a/src/transformers/models/univnet/configuration_univnet.py
+++ b/src/transformers/models/univnet/configuration_univnet.py
@@ -0,0 +1,127 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" UnivNetModel model configuration"""
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "dg845/univnet-dev": "https://huggingface.co/dg845/univnet-dev/resolve/main/config.json",
+}
+
+
+class UnivNetConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`UnivNetModel`]. It is used to instantiate a
+    UnivNet vocoder model according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the UnivNet
+    [dg845/univnet-dev](https://huggingface.co/dg845/univnet-dev) architecture, which corresponds to the 'c32'
+    architecture in [maum-ai/univnet](https://github.com/maum-ai/univnet/blob/master/config/default_c32.yaml).
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        model_in_channels (`int`, *optional*, defaults to 64):
+            The number of input channels for the UnivNet residual network. This should correspond to
+            `noise_sequence.shape[1]` and the value used in the [`UnivNetFeatureExtractor`] class.
+        model_hidden_channels (`int`, *optional*, defaults to 32):
+            The number of hidden channels of each residual block in the UnivNet residual network.
+        num_mel_bins (`int`, *optional*, defaults to 100):
+            The number of frequency bins in the conditioning log-mel spectrogram. This should correspond to the value
+            used in the [`UnivNetFeatureExtractor`] class.
+        resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 3, 3]`):
+            A tuple of integers defining the kernel sizes of the 1D convolutional layers in the UnivNet residual
+            network. The length of `resblock_kernel_sizes` defines the number of resnet blocks and should match that of
+            `resblock_stride_sizes` and `resblock_dilation_sizes`.
+        resblock_stride_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[8, 8, 4]`):
+            A tuple of integers defining the stride sizes of the 1D convolutional layers in the UnivNet residual
+            network. The length of `resblock_stride_sizes` should match that of `resblock_kernel_sizes` and
+            `resblock_dilation_sizes`.
+        resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 9, 27], [1, 3, 9, 27], [1, 3, 9, 27]]`):
+            A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the
+            UnivNet residual network. The length of `resblock_dilation_sizes` should match that of
+            `resblock_kernel_sizes` and `resblock_stride_sizes`. The length of each nested list in
+            `resblock_dilation_sizes` defines the number of convolutional layers per resnet block.
+        kernel_predictor_num_blocks (`int`, *optional*, defaults to 3):
+            The number of residual blocks in the kernel predictor network, which calculates the kernel and bias for
+            each location variable convolution layer in the UnivNet residual network.
+        kernel_predictor_hidden_channels (`int`, *optional*, defaults to 64):
+            The number of hidden channels for each residual block in the kernel predictor network.
+        kernel_predictor_conv_size (`int`, *optional*, defaults to 3):
+            The kernel size of each 1D convolutional layer in the kernel predictor network.
+        kernel_predictor_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout probability for each residual block in the kernel predictor network.
+        initializer_range (`float`, *optional*, defaults to 0.01):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        leaky_relu_slope (`float`, *optional*, defaults to 0.2):
+            The angle of the negative slope used by the leaky ReLU activation.
+
+    Example:
+
+    ```python
+    >>> from transformers import UnivNetModel, UnivNetConfig
+
+    >>> # Initializing a Tortoise TTS style configuration
+    >>> configuration = UnivNetConfig()
+
+    >>> # Initializing a model (with random weights) from the Tortoise TTS style configuration
+    >>> model = UnivNetModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```
+    """
+
+    model_type = "univnet"
+
+    def __init__(
+        self,
+        model_in_channels=64,
+        model_hidden_channels=32,
+        num_mel_bins=100,
+        resblock_kernel_sizes=[3, 3, 3],
+        resblock_stride_sizes=[8, 8, 4],
+        resblock_dilation_sizes=[[1, 3, 9, 27], [1, 3, 9, 27], [1, 3, 9, 27]],
+        kernel_predictor_num_blocks=3,
+        kernel_predictor_hidden_channels=64,
+        kernel_predictor_conv_size=3,
+        kernel_predictor_dropout=0.0,
+        initializer_range=0.01,
+        leaky_relu_slope=0.2,
+        **kwargs,
+    ):
+        if not (len(resblock_kernel_sizes) == len(resblock_stride_sizes) == len(resblock_dilation_sizes)):
+            raise ValueError(
+                "`resblock_kernel_sizes`, `resblock_stride_sizes`, and `resblock_dilation_sizes` must all have the"
+                " same length (which will be the number of resnet blocks in the model)."
+            )
+
+        self.model_in_channels = model_in_channels
+        self.model_hidden_channels = model_hidden_channels
+        self.num_mel_bins = num_mel_bins
+        self.resblock_kernel_sizes = resblock_kernel_sizes
+        self.resblock_stride_sizes = resblock_stride_sizes
+        self.resblock_dilation_sizes = resblock_dilation_sizes
+        self.kernel_predictor_num_blocks = kernel_predictor_num_blocks
+        self.kernel_predictor_hidden_channels = kernel_predictor_hidden_channels
+        self.kernel_predictor_conv_size = kernel_predictor_conv_size
+        self.kernel_predictor_dropout = kernel_predictor_dropout
+        self.initializer_range = initializer_range
+        self.leaky_relu_slope = leaky_relu_slope
+        super().__init__(**kwargs)
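Review note: a minimal sketch of the length check that `UnivNetConfig.__init__` performs, written in plain Python so it runs without `transformers` installed. The helper name `check_resblock_settings` is hypothetical and not part of this PR. The sketch also computes the product of the default `resblock_stride_sizes`, which works out to 256, the same number as the default `hop_length` of `UnivNetFeatureExtractor`. Reading that product as the overall upsampling factor of the vocoder is an assumption about the architecture, not something this diff states.

```python
from math import prod


def check_resblock_settings(kernel_sizes, stride_sizes, dilation_sizes):
    """Hypothetical helper mirroring the length check in UnivNetConfig.__init__."""
    if not (len(kernel_sizes) == len(stride_sizes) == len(dilation_sizes)):
        raise ValueError(
            "`resblock_kernel_sizes`, `resblock_stride_sizes`, and `resblock_dilation_sizes` "
            "must all have the same length."
        )
    # The shared length is the number of resnet blocks in the model.
    return len(kernel_sizes)


# Defaults copied from the UnivNetConfig signature in this diff.
kernel_sizes = [3, 3, 3]
stride_sizes = [8, 8, 4]
dilation_sizes = [[1, 3, 9, 27], [1, 3, 9, 27], [1, 3, 9, 27]]

num_blocks = check_resblock_settings(kernel_sizes, stride_sizes, dilation_sizes)
# Assumption: the strides multiply to the total upsampling factor.
total_upsampling = prod(stride_sizes)
print(num_blocks, total_upsampling)  # 3 256
```

Passing lists of different lengths, for example two kernel sizes with three strides, raises the `ValueError` before any model is built.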
--- a/src/transformers/models/univnet/convert_univnet.py
+++ b/src/transformers/models/univnet/convert_univnet.py
@@ -0,0 +1,162 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+import torch
+
+from transformers import UnivNetConfig, UnivNetModel, logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger("transformers.models.univnet")
+
+
+def get_kernel_predictor_key_mapping(config: UnivNetConfig, old_prefix: str = "", new_prefix: str = ""):
+    mapping = {}
+    # Initial conv layer
+    mapping[f"{old_prefix}.input_conv.0.weight_g"] = f"{new_prefix}.input_conv.weight_g"
+    mapping[f"{old_prefix}.input_conv.0.weight_v"] = f"{new_prefix}.input_conv.weight_v"
+    mapping[f"{old_prefix}.input_conv.0.bias"] = f"{new_prefix}.input_conv.bias"
+
+    # Kernel predictor resnet blocks
+    for i in range(config.kernel_predictor_num_blocks):
+        mapping[f"{old_prefix}.residual_convs.{i}.1.weight_g"] = f"{new_prefix}.resblocks.{i}.conv1.weight_g"
+        mapping[f"{old_prefix}.residual_convs.{i}.1.weight_v"] = f"{new_prefix}.resblocks.{i}.conv1.weight_v"
+        mapping[f"{old_prefix}.residual_convs.{i}.1.bias"] = f"{new_prefix}.resblocks.{i}.conv1.bias"
+
+        mapping[f"{old_prefix}.residual_convs.{i}.3.weight_g"] = f"{new_prefix}.resblocks.{i}.conv2.weight_g"
+        mapping[f"{old_prefix}.residual_convs.{i}.3.weight_v"] = f"{new_prefix}.resblocks.{i}.conv2.weight_v"
+        mapping[f"{old_prefix}.residual_convs.{i}.3.bias"] = f"{new_prefix}.resblocks.{i}.conv2.bias"
+
+    # Kernel output conv
+    mapping[f"{old_prefix}.kernel_conv.weight_g"] = f"{new_prefix}.kernel_conv.weight_g"
+    mapping[f"{old_prefix}.kernel_conv.weight_v"] = f"{new_prefix}.kernel_conv.weight_v"
+    mapping[f"{old_prefix}.kernel_conv.bias"] = f"{new_prefix}.kernel_conv.bias"
+
+    # Bias output conv
+    mapping[f"{old_prefix}.bias_conv.weight_g"] = f"{new_prefix}.bias_conv.weight_g"
+    mapping[f"{old_prefix}.bias_conv.weight_v"] = f"{new_prefix}.bias_conv.weight_v"
+    mapping[f"{old_prefix}.bias_conv.bias"] = f"{new_prefix}.bias_conv.bias"
+
+    return mapping
+
+
+def get_key_mapping(config: UnivNetConfig):
+    mapping = {}
+
+    # NOTE: initial conv layer keys are the same
+
+    # LVC Residual blocks
+    for i in range(len(config.resblock_stride_sizes)):
+        # LVCBlock initial convt layer
+        mapping[f"res_stack.{i}.convt_pre.1.weight_g"] = f"resblocks.{i}.convt_pre.weight_g"
+        mapping[f"res_stack.{i}.convt_pre.1.weight_v"] = f"resblocks.{i}.convt_pre.weight_v"
+        mapping[f"res_stack.{i}.convt_pre.1.bias"] = f"resblocks.{i}.convt_pre.bias"
+
+        # Kernel predictor
+        kernel_predictor_mapping = get_kernel_predictor_key_mapping(
+            config, old_prefix=f"res_stack.{i}.kernel_predictor", new_prefix=f"resblocks.{i}.kernel_predictor"
+        )
+        mapping.update(kernel_predictor_mapping)
+
+        # LVC Residual blocks
+        for j in range(len(config.resblock_dilation_sizes[i])):
+            mapping[f"res_stack.{i}.conv_blocks.{j}.1.weight_g"] = f"resblocks.{i}.resblocks.{j}.conv.weight_g"
+            mapping[f"res_stack.{i}.conv_blocks.{j}.1.weight_v"] = f"resblocks.{i}.resblocks.{j}.conv.weight_v"
+            mapping[f"res_stack.{i}.conv_blocks.{j}.1.bias"] = f"resblocks.{i}.resblocks.{j}.conv.bias"
+
+    # Output conv layer
+    mapping["conv_post.1.weight_g"] = "conv_post.weight_g"
+    mapping["conv_post.1.weight_v"] = "conv_post.weight_v"
+    mapping["conv_post.1.bias"] = "conv_post.bias"
+
+    return mapping
+
+
+def rename_state_dict(state_dict, keys_to_modify, keys_to_remove):
+    model_state_dict = {}
+    for key, value in state_dict.items():
+        if key in keys_to_remove:
+            continue
+
+        if key in keys_to_modify:
+            new_key = keys_to_modify[key]
+            model_state_dict[new_key] = value
+        else:
+            model_state_dict[key] = value
+    return model_state_dict
+
+
+def convert_univnet_checkpoint(
+    checkpoint_path,
+    pytorch_dump_folder_path,
+    config_path=None,
+    repo_id=None,
+    safe_serialization=False,
+):
+    model_state_dict_base = torch.load(checkpoint_path, map_location="cpu")
+    # Get the generator's state dict
+    state_dict = model_state_dict_base["model_g"]
+
+    if config_path is not None:
+        config = UnivNetConfig.from_pretrained(config_path)
+    else:
+        config = UnivNetConfig()
+
+    keys_to_modify = get_key_mapping(config)
+    keys_to_remove = set()
+    hf_state_dict = rename_state_dict(state_dict, keys_to_modify, keys_to_remove)
+
+    model = UnivNetModel(config)
+    # Apply weight norm since the original checkpoint has weight norm applied
+    model.apply_weight_norm()
+    model.load_state_dict(hf_state_dict)
+    # Remove weight norm in preparation for inference
+    model.remove_weight_norm()
+
+    model.save_pretrained(pytorch_dump_folder_path, safe_serialization=safe_serialization)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        model.push_to_hub(repo_id)
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
+    parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
+    parser.add_argument(
+        "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
+    )
+    parser.add_argument(
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+    )
+    parser.add_argument(
+        "--safe_serialization", action="store_true", help="Whether to save the model using `safetensors`."
+    )
+
+    args = parser.parse_args()
+
+    convert_univnet_checkpoint(
+        args.checkpoint_path,
+        args.pytorch_dump_folder_path,
+        args.config_path,
+        args.push_to_hub,
+        args.safe_serialization,
+    )
+
+
+if __name__ == "__main__":
+    main()
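Review note: the renaming step can be illustrated without a real checkpoint. The sketch below reuses the logic of `rename_state_dict` on a toy dictionary whose values are placeholder strings rather than tensors. The `conv_post.1.*` keys come from `get_key_mapping` in this diff, while `conv_pre.bias` is a hypothetical stand-in for a key that needs no renaming.

```python
def rename_state_dict(state_dict, keys_to_modify, keys_to_remove):
    # Same behavior as the converter: skip removed keys, rename mapped keys,
    # and copy every other entry unchanged.
    renamed = {}
    for key, value in state_dict.items():
        if key in keys_to_remove:
            continue
        renamed[keys_to_modify.get(key, key)] = value
    return renamed


# Mapping for the output conv layer, as built by get_key_mapping.
keys_to_modify = {
    "conv_post.1.weight_g": "conv_post.weight_g",
    "conv_post.1.weight_v": "conv_post.weight_v",
    "conv_post.1.bias": "conv_post.bias",
}

# Toy checkpoint: strings stand in for tensors; "conv_pre.bias" is a hypothetical key.
toy_state_dict = {
    "conv_pre.bias": "tensor-0",
    "conv_post.1.weight_g": "tensor-1",
    "conv_post.1.weight_v": "tensor-2",
    "conv_post.1.bias": "tensor-3",
}

converted = rename_state_dict(toy_state_dict, keys_to_modify, keys_to_remove=set())
print(sorted(converted))
# ['conv_post.bias', 'conv_post.weight_g', 'conv_post.weight_v', 'conv_pre.bias']
```

Values are carried over untouched; only the keys change, which is why the converter can load the result directly after applying weight norm to the new model.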
--- a/src/transformers/models/univnet/feature_extraction_univnet.py
+++ b/src/transformers/models/univnet/feature_extraction_univnet.py
@@ -0,0 +1,456 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Feature extractor class for UnivNetModel."""
+
+from typing import Any, Dict, List, Optional, Union
+
+import numpy as np
+
+from ...audio_utils import mel_filter_bank, optimal_fft_length, spectrogram, window_function
+from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ...feature_extraction_utils import BatchFeature
+from ...utils import PaddingStrategy, TensorType, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class UnivNetFeatureExtractor(SequenceFeatureExtractor):
+    r"""
+    Constructs a UnivNet feature extractor.
+
+    This class extracts log-mel filter bank features from raw speech using the short-time Fourier transform (STFT). The
+    STFT implementation follows that of Tacotron 2 and HiFi-GAN.
+
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+    Args:
+        feature_size (`int`, *optional*, defaults to 1):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 24000):
+            The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value to pad with when applying the padding strategy defined by the `padding` argument to
+            [`UnivNetFeatureExtractor.__call__`]. Should correspond to audio silence. The `pad_end` argument to
+            `__call__` will also use this padding value.
+        do_normalize (`bool`, *optional*, defaults to `False`):
+            Whether to perform Tacotron 2 normalization on the input. Normalizing can help to significantly improve the
+            performance for some models.
+        num_mel_bins (`int`, *optional*, defaults to 100):
+            The number of mel-frequency bins in the extracted spectrogram features. This should match
+            `UnivNetModel.config.num_mel_bins`.
+        hop_length (`int`, *optional*, defaults to 256):
+            The direct number of samples between sliding windows. Otherwise referred to as "shift" in many papers. Note
+            that this is different from other audio feature extractors such as [`SpeechT5FeatureExtractor`] which take
+            the `hop_length` in ms.
+        win_length (`int`, *optional*, defaults to 1024):
+            The direct number of samples for each sliding window. Note that this is different from other audio feature
+            extractors such as [`SpeechT5FeatureExtractor`] which take the `win_length` in ms.
+        win_function (`str`, *optional*, defaults to `"hann_window"`):
+            Name for the window function used for windowing, must be accessible via `torch.{win_function}`
+        filter_length (`int`, *optional*, defaults to 1024):
+            The number of FFT components to use. If `None`, this is determined using
+            `transformers.audio_utils.optimal_fft_length`.
+        max_length_s (`int`, *optional*, defaults to 10):
+            The maximum input length of the model in seconds. This is used to pad the audio.
+        fmin (`float`, *optional*, defaults to 0.0):
+            Minimum mel frequency in Hz.
+        fmax (`float`, *optional*):
+            Maximum mel frequency in Hz. If not set, defaults to `sampling_rate / 2`.
+        mel_floor (`float`, *optional*, defaults to 1e-09):
+            Minimum value of mel frequency banks. Note that the way [`UnivNetFeatureExtractor`] uses `mel_floor` is
+            different than in [`transformers.audio_utils.spectrogram`].
+        center (`bool`, *optional*, defaults to `False`):
+            Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`, frame
+            `t` will start at time `t * hop_length`.
+        compression_factor (`float`, *optional*, defaults to 1.0):
+            The multiplicative compression factor for dynamic range compression during spectral normalization.
+        compression_clip_val (`float`, *optional*, defaults to 1e-05):
+            The clip value applied to the mel spectrogram before applying dynamic range compression during spectral
+            normalization.
+        normalize_min (`float`, *optional*, defaults to -11.512925148010254):
+            The min value used for Tacotron 2-style linear normalization. The default is the original value from the
+            Tacotron 2 implementation.
+        normalize_max (`float`, *optional*, defaults to 2.3143386840820312):
+            The max value used for Tacotron 2-style linear normalization. The default is the original value from the
+            Tacotron 2 implementation.
+        model_in_channels (`int`, *optional*, defaults to 64):
+            The number of input channels to the [`UnivNetModel`] model. This should match
+            `UnivNetModel.config.model_in_channels`.
+        pad_end_length (`int`, *optional*, defaults to 10):
+            If padding the end of each waveform, the number of spectrogram frames worth of samples to append. The
+            number of appended samples will be `pad_end_length * hop_length`.
+        return_attention_mask (`bool`, *optional*, defaults to `True`):
+            Whether or not [`~UnivNetFeatureExtractor.__call__`] should return `attention_mask`.
+    """
+
+    model_input_names = ["input_features", "noise_sequence", "padding_mask"]
+
+    def __init__(
+        self,
+        feature_size: int = 1,
+        sampling_rate: int = 24000,
+        padding_value: float = 0.0,
+        do_normalize: bool = False,
+        num_mel_bins: int = 100,
+        hop_length: int = 256,
+        win_length: int = 1024,
+        win_function: str = "hann_window",
+        filter_length: Optional[int] = 1024,
+        max_length_s: int = 10,
+        fmin: float = 0.0,
+        fmax: Optional[float] = None,
+        mel_floor: float = 1e-9,
+        center: bool = False,
+        compression_factor: float = 1.0,
+        compression_clip_val: float = 1e-5,
+        normalize_min: float = -11.512925148010254,
+        normalize_max: float = 2.3143386840820312,
+        model_in_channels: int = 64,
+        pad_end_length: int = 10,
+        return_attention_mask=True,
+        **kwargs,
+    ):
+        super().__init__(
+            feature_size=feature_size,
+            sampling_rate=sampling_rate,
+            padding_value=padding_value,
+            return_attention_mask=return_attention_mask,
+            **kwargs,
+        )
+
+        self.do_normalize = do_normalize
+
+        self.num_mel_bins = num_mel_bins
+        self.hop_length = hop_length
+        self.win_length = win_length
+        self.win_function = win_function
+        self.filter_length = filter_length
+        self.fmin = fmin
+        if fmax is None:
+            # Follows the librosa.filters.mel implementation
+            fmax = float(sampling_rate) / 2
+        self.fmax = fmax
+        self.mel_floor = mel_floor
+
+        self.max_length_s = max_length_s
+        self.num_max_samples = max_length_s * sampling_rate
+
+        if self.filter_length is None:
+            self.n_fft = optimal_fft_length(self.win_length)
+        else:
+            self.n_fft = self.filter_length
+        self.n_freqs = (self.n_fft // 2) + 1
+
+        self.window = window_function(window_length=self.win_length, name=self.win_function, periodic=True)
+
+        self.mel_filters = mel_filter_bank(
+            num_frequency_bins=self.n_freqs,
+            num_mel_filters=self.num_mel_bins,
+            min_frequency=self.fmin,
+            max_frequency=self.fmax,
+            sampling_rate=self.sampling_rate,
+            norm="slaney",
+            mel_scale="slaney",
+        )
+
+        self.center = center
+        self.compression_factor = compression_factor
+        self.compression_clip_val = compression_clip_val
+        self.normalize_min = normalize_min
+        self.normalize_max = normalize_max
+        self.model_in_channels = model_in_channels
+        self.pad_end_length = pad_end_length
+
+    def normalize(self, spectrogram):
+        return 2 * ((spectrogram - self.normalize_min) / (self.normalize_max - self.normalize_min)) - 1
+
+    def denormalize(self, spectrogram):
+        return self.normalize_min + (self.normalize_max - self.normalize_min) * ((spectrogram + 1) / 2)
+
+    def mel_spectrogram(self, waveform: np.ndarray) -> np.ndarray:
+        """
+        Calculates a log-mel spectrogram from a single waveform. Note that the input waveform will be padded by
+        `int((self.n_fft - self.hop_length) / 2)` on both sides using the `reflect` padding mode.
+
+        Args:
+            waveform (`np.ndarray` of shape `(length,)`):
+                The input waveform. This must be a single real-valued, mono waveform.
+
+        Returns:
+            `numpy.ndarray`: Array containing a log-mel spectrogram of shape `(num_frames, num_mel_bins)`.
+        """
+        # Do custom padding based on the official MelGAN and Hifi-GAN implementations
+        # See https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/utils/stft.py#L84-L86
+        waveform = np.pad(
+            waveform,
+            (int((self.n_fft - self.hop_length) / 2), int((self.n_fft - self.hop_length) / 2)),
+            mode="reflect",
+        )
+
+        # Get the complex spectrogram.
+        # Note: waveform must be unbatched currently due to the implementation of spectrogram(...).
+        complex_spectrogram = spectrogram(
+            waveform,
+            window=self.window,
+            frame_length=self.n_fft,
+            hop_length=self.hop_length,
+            fft_length=self.n_fft,
+            power=None,
+            center=self.center,
+            mel_filters=None,
+            mel_floor=None,
+        )
+
+        # Apply the MEL filter bank and MEL floor manually since UnivNet uses a slightly different implementation
+        amplitude_spectrogram = np.sqrt(
+            np.real(complex_spectrogram) ** 2 + np.imag(complex_spectrogram) ** 2 + self.mel_floor
+        )
+        mel_spectrogram = np.matmul(self.mel_filters.T, amplitude_spectrogram)
+
+        # Perform spectral normalization to get the log mel spectrogram.
+        log_mel_spectrogram = np.log(
+            np.clip(mel_spectrogram, a_min=self.compression_clip_val, a_max=None) * self.compression_factor
+        )
+
+        # Return spectrogram with num_mel_bins last
+        return log_mel_spectrogram.T
+
+    def generate_noise(
+        self,
+        noise_length: int,
+        generator: Optional[np.random.Generator] = None,
+    ) -> np.ndarray:
+        """
+        Generates a random sequence of standard Gaussian noise for use in the `noise_sequence` argument of
+        [`UnivNetModel.forward`]. The number of features (dim 1) of the generated noise is `self.model_in_channels`,
+        which should correspond to the `model_in_channels` of the [`UnivNetModel`].
+
+        Args:
+            noise_length (`int`):
+                The length (dim 0) of the generated noise.
+            generator (`numpy.random.Generator`, *optional*, defaults to `None`):
+                An optional `numpy.random.Generator` random number generator to control noise generation. If not set, a
+                new generator with fresh entropy will be created.
+
+        Returns:
+            `numpy.ndarray`: Array containing random standard Gaussian noise of shape `(noise_length,
+            self.model_in_channels)`.
+        """
+        if generator is None:
+            generator = np.random.default_rng()
+
+        noise_shape = (noise_length, self.model_in_channels)
+        noise = generator.standard_normal(noise_shape, dtype=np.float32)
+
+        return noise
+
+    def batch_decode(self, waveforms, waveform_lengths=None) -> List[np.ndarray]:
+        r"""
+        Removes padding from generated audio after running [`UnivNetModel.forward`]. This returns a ragged list of 1D
+        audio waveform arrays and not a single tensor/array because in general the waveforms will have different
+        lengths after removing padding.
+
+        Args:
+            waveforms (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
+                The batched output waveforms from the [`UnivNetModel`].
+            waveform_lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+                The batched lengths of each waveform before padding.
+
+        Returns:
+            `List[np.ndarray]`: A ragged list of 1D waveform arrays with padding removed.
+        """
+        # Collapse the batched waveform tensor to a list of 1D audio waveforms
+        waveforms = [waveform.detach().clone().cpu().numpy() for waveform in waveforms]
+
+        if waveform_lengths is not None:
+            waveforms = [waveform[: waveform_lengths[i]] for i, waveform in enumerate(waveforms)]
+
+        return waveforms
+
+    def __call__(
+        self,
+        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
+        sampling_rate: Optional[int] = None,
+        padding: Union[bool, str, PaddingStrategy] = True,
+        max_length: Optional[int] = None,
+        truncation: bool = True,
+        pad_to_multiple_of: Optional[int] = None,
+        return_noise: bool = True,
+        generator: Optional[np.random.Generator] = None,
+        pad_end: bool = False,
+        pad_length: Optional[int] = None,
+        do_normalize: Optional[bool] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+    ) -> BatchFeature:
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+
+        Args:
+            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
+                values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
+                stereo, i.e. single float per timestep.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
+                `sampling_rate` when calling the feature extractor to prevent silent errors.
+            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
+                Select a strategy to pad the input `raw_speech` waveforms (according to the model's padding side and
+                padding index) among:
+
+                - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a
+                  single sequence is provided).
+                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
+                  acceptable input length for the model if that argument is not provided.
+                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
+                  lengths).
+
+                If `pad_end = True`, that padding will occur before the `padding` strategy is applied.
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see above).
+            truncation (`bool`, *optional*, defaults to `True`):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            pad_to_multiple_of (`int`, *optional*):
+                If set will pad the sequence to a multiple of the provided value.
+
+                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                `>= 7.0` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
+            return_noise (`bool`, *optional*, defaults to `True`):
+                Whether to generate and return a noise waveform for use in [`UnivNetModel.forward`].
+            generator (`numpy.random.Generator`, *optional*, defaults to `None`):
+                An optional `numpy.random.Generator` random number generator to use when generating noise.
+            pad_end (`bool`, *optional*, defaults to `False`):
+                Whether to pad the end of each waveform with silence. This can help reduce artifacts at the end of the
+                generated audio sample; see https://github.com/seungwonpark/melgan/issues/8 for more details. This
+                padding will be done before the padding strategy specified in `padding` is performed.
+            pad_length (`int`, *optional*, defaults to `None`):
+                If padding the end of each waveform, the length of the padding in spectrogram frames. If not set, this
+                will default to `self.pad_end_length`.
+            do_normalize (`bool`, *optional*):
+                Whether to perform Tacotron 2 normalization on the input. Normalizing can help to significantly improve
+                the performance for some models. If not set, this will default to `self.do_normalize`.
+            return_attention_mask (`bool`, *optional*):
+                Whether to return the attention mask. If left to the default, will return the attention mask according
+                to the specific feature_extractor's default.
+
+                [What are attention masks?](../glossary#attention-mask)
+
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of list of python integers. Acceptable values are:
+
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+        """
+        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
+                    f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
+                    f" was sampled with {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+
+        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
+        if is_batched_numpy and len(raw_speech.shape) > 2:
+            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
+        is_batched = is_batched_numpy or (
+            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
+        )
+
+        if is_batched:
+            raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
+        elif not is_batched and not isinstance(raw_speech, np.ndarray):
+            raw_speech = np.asarray(raw_speech, dtype=np.float32)
+        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype == np.dtype(np.float64):
+            raw_speech = raw_speech.astype(np.float32)
+
+        # always return batch
+        if not is_batched:
+            raw_speech = [np.asarray(raw_speech, dtype=np.float32)]
+
+        # Pad end to reduce artifacts
+        if pad_end:
+            pad_length = pad_length if pad_length is not None else self.pad_end_length
+            raw_speech = [
+                np.pad(waveform, (0, pad_length * self.hop_length), constant_values=self.padding_value)
+                for waveform in raw_speech
+            ]
+
+        batched_speech = BatchFeature({"input_features": raw_speech})
+
+        padded_inputs = self.pad(
+            batched_speech,
+            padding=padding,
+            max_length=max_length if max_length is not None else self.num_max_samples,
+            truncation=truncation,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_attention_mask=return_attention_mask,
+        )
+
+        # make sure list is in array format
+        # input_features = padded_inputs.get("input_features").transpose(2, 0, 1)
+        input_features = padded_inputs.get("input_features")
+
+        mel_spectrograms = [self.mel_spectrogram(waveform) for waveform in input_features]
+
+        if isinstance(input_features[0], List):
+            batched_speech["input_features"] = [np.asarray(mel, dtype=np.float32) for mel in mel_spectrograms]
+        else:
+            batched_speech["input_features"] = [mel.astype(np.float32) for mel in mel_spectrograms]
+
+        # convert attention_mask to correct format
+        attention_mask = padded_inputs.get("attention_mask")
+        if attention_mask is not None:
+            batched_speech["padding_mask"] = [np.asarray(array, dtype=np.int32) for array in attention_mask]
+
+        if return_noise:
+            noise = [
+                self.generate_noise(spectrogram.shape[0], generator)
+                for spectrogram in batched_speech["input_features"]
+            ]
+            batched_speech["noise_sequence"] = noise
+
+        if do_normalize:
+            batched_speech["input_features"] = [
+                self.normalize(spectrogram) for spectrogram in batched_speech["input_features"]
+            ]
+
+        if return_tensors is not None:
+            batched_speech = batched_speech.convert_to_tensors(return_tensors)
+
+        return batched_speech
+
+    def to_dict(self) -> Dict[str, Any]:
+        output = super().to_dict()
+
+        # Don't serialize these as they are derived from the other properties.
+        names = ["window", "mel_filters", "n_fft", "n_freqs", "num_max_samples"]
+        for name in names:
+            if name in output:
+                del output[name]
+
+        return output
--- a/src/transformers/models/univnet/modeling_univnet.py
+++ b/src/transformers/models/univnet/modeling_univnet.py
@@ -0,0 +1,636 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch UnivNet model."""
+
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+
+from ...modeling_utils import ModelOutput, PreTrainedModel
+from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+from .configuration_univnet import UnivNetConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "UnivNetConfig"
+
+_CHECKPOINT_FOR_DOC = "dg845/univnet-dev"
+
+UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "dg845/univnet-dev",
+    # See all UnivNet models at https://huggingface.co/models?filter=univnet
+]
+
+
+@dataclass
+class UnivNetModelOutput(ModelOutput):
+    """
+    Output class for the [`UnivNetModel`], which includes the generated audio waveforms and the original unpadded
+    lengths of those waveforms (so that the padding can be removed by [`UnivNetFeatureExtractor.batch_decode`]).
+
+    Args:
+        waveforms (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
+            Batched 1D (mono-channel) output audio waveforms.
+        waveform_lengths (`torch.FloatTensor` of shape `(batch_size,)`):
+            The batched length in samples of each unpadded waveform in `waveforms`.
+    """
+
+    waveforms: torch.FloatTensor = None
+    waveform_lengths: torch.FloatTensor = None
+
+
+class UnivNetKernelPredictorResidualBlock(nn.Module):
+    """
+    Implementation of the residual block for the kernel predictor network inside each location variable convolution
+    block (LVCBlock).
+
+    Parameters:
+        config (`UnivNetConfig`):
+            Config for the `UnivNetModel` model.
+    """
+
+    def __init__(
+        self,
+        config: UnivNetConfig,
+    ):
+        super().__init__()
+        self.channels = config.model_in_channels
+        self.kernel_size = config.kernel_predictor_conv_size
+        self.dropout_prob = config.kernel_predictor_dropout
+        self.leaky_relu_slope = config.leaky_relu_slope
+
+        padding = (self.kernel_size - 1) // 2
+
+        self.dropout = nn.Dropout(self.dropout_prob)
+        self.conv1 = nn.Conv1d(self.channels, self.channels, self.kernel_size, padding=padding, bias=True)
+        self.conv2 = nn.Conv1d(self.channels, self.channels, self.kernel_size, padding=padding, bias=True)
+
+    def forward(self, hidden_states: torch.FloatTensor):
+        # hidden_states should have shape (batch_size, channels, seq_length)
+        residual = hidden_states
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.conv1(hidden_states)
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+        hidden_states = self.conv2(hidden_states)
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+        return hidden_states + residual
+
+    def apply_weight_norm(self):
+        nn.utils.weight_norm(self.conv1)
+        nn.utils.weight_norm(self.conv2)
+
+    def remove_weight_norm(self):
+        nn.utils.remove_weight_norm(self.conv1)
+        nn.utils.remove_weight_norm(self.conv2)
+
+
+class UnivNetKernelPredictor(nn.Module):
+    """
+    Implementation of the kernel predictor network which supplies the kernel and bias for the location variable
+    convolutional layers (LVCs) in each UnivNet LVCBlock.
+
+    Based on the KernelPredictor implementation in
+    [maum-ai/univnet](https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/model/lvcnet.py#L7).
+
+    Parameters:
+        config (`UnivNetConfig`):
+            Config for the `UnivNetModel` model.
+        conv_kernel_size (`int`, *optional*, defaults to 3):
+            The kernel size for the location variable convolutional layer kernels (convolutional weight tensor).
+        conv_layers (`int`, *optional*, defaults to 4):
+            The number of location variable convolutional layers to output kernels and biases for.
+    """
+
+    def __init__(
+        self,
+        config: UnivNetConfig,
+        conv_kernel_size: int = 3,
+        conv_layers: int = 4,
+    ):
+        super().__init__()
+
+        self.conv_in_channels = config.model_hidden_channels
+        self.conv_out_channels = 2 * config.model_hidden_channels
+        self.conv_kernel_size = conv_kernel_size
+        self.conv_layers = conv_layers
+
+        self.kernel_channels = (
+            self.conv_in_channels * self.conv_out_channels * self.conv_kernel_size * self.conv_layers
+        )
+        self.bias_channels = self.conv_out_channels * self.conv_layers
+
+        self.resnet_in_channels = config.num_mel_bins
+        self.resnet_hidden_channels = config.kernel_predictor_hidden_channels
+        self.resnet_kernel_size = config.kernel_predictor_conv_size
+        self.num_blocks = config.kernel_predictor_num_blocks
+
+        self.leaky_relu_slope = config.leaky_relu_slope
+
+        padding = (self.resnet_kernel_size - 1) // 2
+
+        self.input_conv = nn.Conv1d(self.resnet_in_channels, self.resnet_hidden_channels, 5, padding=2, bias=True)
+
+        self.resblocks = nn.ModuleList([UnivNetKernelPredictorResidualBlock(config) for _ in range(self.num_blocks)])
+
+        self.kernel_conv = nn.Conv1d(
+            self.resnet_hidden_channels, self.kernel_channels, self.resnet_kernel_size, padding=padding, bias=True
+        )
+        self.bias_conv = nn.Conv1d(
+            self.resnet_hidden_channels, self.bias_channels, self.resnet_kernel_size, padding=padding, bias=True
+        )
+
+    def forward(self, spectrogram: torch.FloatTensor):
+        """
+        Maps a conditioning log-mel spectrogram to a tensor of convolutional kernels and biases, for use in location
+        variable convolutional layers. Note that the input spectrogram should have shape (batch_size, input_channels,
+        seq_length).
+
+        Args:
+            spectrogram (`torch.FloatTensor` of shape `(batch_size, input_channels, seq_length)`):
+                Tensor containing the log-mel spectrograms.
+
+        Returns:
+            `Tuple[torch.FloatTensor, torch.FloatTensor]`: tuple of tensors where the first element is the tensor of
+            location variable convolution kernels of shape `(batch_size, self.conv_layers, self.conv_in_channels,
+            self.conv_out_channels, self.conv_kernel_size, seq_length)` and the second element is the tensor of
+            location variable convolution biases of shape `(batch_size, self.conv_layers, self.conv_out_channels,
+            seq_length)`.
+        """
+        batch_size, _, seq_length = spectrogram.shape
+
+        hidden_states = self.input_conv(spectrogram)
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+
+        for resblock in self.resblocks:
+            hidden_states = resblock(hidden_states)
+
+        kernel_hidden_states = self.kernel_conv(hidden_states)
+        bias_hidden_states = self.bias_conv(hidden_states)
+
+        # Reshape kernels and biases to appropriate shape
+        kernels = kernel_hidden_states.view(
+            batch_size,
+            self.conv_layers,
+            self.conv_in_channels,
+            self.conv_out_channels,
+            self.conv_kernel_size,
+            seq_length,
+        ).contiguous()
+        biases = bias_hidden_states.view(
+            batch_size,
+            self.conv_layers,
+            self.conv_out_channels,
+            seq_length,
+        ).contiguous()
+
+        return kernels, biases
+
+    def apply_weight_norm(self):
+        nn.utils.weight_norm(self.input_conv)
+        for layer in self.resblocks:
+            layer.apply_weight_norm()
+        nn.utils.weight_norm(self.kernel_conv)
+        nn.utils.weight_norm(self.bias_conv)
+
+    def remove_weight_norm(self):
+        nn.utils.remove_weight_norm(self.input_conv)
+        for layer in self.resblocks:
+            layer.remove_weight_norm()
+        nn.utils.remove_weight_norm(self.kernel_conv)
+        nn.utils.remove_weight_norm(self.bias_conv)
+
+
+class UnivNetLvcResidualBlock(nn.Module):
+    """
+    Implementation of the location variable convolution (LVC) residual block for the UnivNet residual network.
+
+    Parameters:
+        config (`UnivNetConfig`):
+            Config for the `UnivNetModel` model.
+        kernel_size (`int`):
+            The kernel size for the dilated 1D convolutional layer.
+        dilation (`int`):
+            The dilation for the dilated 1D convolutional layer.
+    """
+
+    def __init__(
+        self,
+        config: UnivNetConfig,
+        kernel_size: int,
+        dilation: int,
+    ):
+        super().__init__()
+        self.hidden_channels = config.model_hidden_channels
+        self.kernel_size = kernel_size
+        self.dilation = dilation
+        self.leaky_relu_slope = config.leaky_relu_slope
+
+        padding = self.dilation * (self.kernel_size - 1) // 2
+
+        self.conv = nn.Conv1d(
+            self.hidden_channels,
+            self.hidden_channels,
+            self.kernel_size,
+            padding=padding,
+            dilation=self.dilation,
+        )
+
+    def forward(self, hidden_states, kernel, bias, hop_size=256):
+        residual = hidden_states
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+        hidden_states = self.conv(hidden_states)
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+        hidden_states = self.location_variable_convolution(hidden_states, kernel, bias, hop_size=hop_size)
+        # Gated activation unit
+        hidden_states = torch.sigmoid(hidden_states[:, : self.hidden_channels, :]) * torch.tanh(
+            hidden_states[:, self.hidden_channels :, :]
+        )
+        # Skip connection
+        hidden_states = residual + hidden_states
+
+        return hidden_states
+
+    # Based on https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/model/lvcnet.py#L171
+    def location_variable_convolution(
+        self,
+        hidden_states: torch.FloatTensor,
+        kernel: torch.FloatTensor,
+        bias: torch.FloatTensor,
+        dilation: int = 1,
+        hop_size: int = 256,
+    ):
+        """
+        Performs location-variable convolution operation on the input sequence (hidden_states) using the local
+        convolution kernel. This was introduced in [LVCNet: Efficient Condition-Dependent Modeling Network for Waveform
+        Generation](https://arxiv.org/abs/2102.10815) by Zhen Zheng, Jianzong Wang, Ning Cheng, and Jing Xiao.
+
+        Time: 414 μs ± 309 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each), tested on an NVIDIA V100.
+
+        Args:
+            hidden_states (`torch.FloatTensor` of shape `(batch_size, in_channels, in_length)`):
+                The input sequence of shape (batch, in_channels, in_length).
+            kernel (`torch.FloatTensor` of shape `(batch_size, in_channels, out_channels, kernel_size, kernel_length)`):
+                The local convolution kernel of shape (batch, in_channels, out_channels, kernel_size, kernel_length).
+            bias (`torch.FloatTensor` of shape `(batch_size, out_channels, kernel_length)`):
+                The bias for the local convolution of shape (batch, out_channels, kernel_length).
+            dilation (`int`, *optional*, defaults to 1):
+                The dilation of convolution.
+            hop_size (`int`, *optional*, defaults to 256):
+                The hop_size of the conditioning sequence.
+        Returns:
+            `torch.FloatTensor`: the output sequence after performing local convolution with shape (batch_size,
+            out_channels, in_length).
+        """
+        batch, _, in_length = hidden_states.shape
+        batch, _, out_channels, kernel_size, kernel_length = kernel.shape
+        if in_length != (kernel_length * hop_size):
+            raise ValueError(
+                f"Dim 2 of `hidden_states` should be {kernel_length * hop_size} but got {in_length}. Please check"
+                " `hidden_states` or `kernel` and `hop_size` to make sure they are correct."
+            )
+
+        padding = dilation * int((kernel_size - 1) / 2)
+
+        # (batch, in_channels, in_length + 2*padding)
+        hidden_states = nn.functional.pad(hidden_states, (padding, padding), "constant", 0)
+        # (batch, in_channels, kernel_length, hop_size + 2*padding)
+        hidden_states = hidden_states.unfold(2, hop_size + 2 * padding, hop_size)
+
+        if hop_size < dilation:
+            hidden_states = nn.functional.pad(hidden_states, (0, dilation), "constant", 0)
+        # (batch, in_channels, kernel_length, (hop_size + 2*padding)/dilation, dilation)
+        hidden_states = hidden_states.unfold(3, dilation, dilation)
+        hidden_states = hidden_states[:, :, :, :, :hop_size]
+        # (batch, in_channels, kernel_length, dilation, (hop_size + 2*padding)/dilation)
+        hidden_states = hidden_states.transpose(3, 4)
+        # (batch, in_channels, kernel_length, dilation, _, kernel_size)
+        hidden_states = hidden_states.unfold(4, kernel_size, 1)
+
+        # Apply local convolution kernel to hidden_states.
+        output_hidden_states = torch.einsum("bildsk,biokl->bolsd", hidden_states, kernel)
+
+        output_hidden_states = output_hidden_states.to(memory_format=torch.channels_last_3d)
+        bias = bias.unsqueeze(-1).unsqueeze(-1).to(memory_format=torch.channels_last_3d)
+        output_hidden_states = output_hidden_states + bias
+        output_hidden_states = output_hidden_states.contiguous().view(batch, out_channels, -1)
+
+        return output_hidden_states
+
+    def apply_weight_norm(self):
+        nn.utils.weight_norm(self.conv)
+
+    def remove_weight_norm(self):
+        nn.utils.remove_weight_norm(self.conv)
+
+
+class UnivNetLvcBlock(nn.Module):
+    """
+    Implementation of the location variable convolution (LVC) block of the UnivNet residual network. Includes a
+    `UnivNetKernelPredictor` to predict the kernels and biases of the LVC layers.
+
+    Based on LVCBlock in
+    [maum-ai/univnet](https://github.com/maum-ai/univnet/blob/9bb2b54838bb6d7ce767131cc7b8b61198bc7558/model/lvcnet.py#L98)
+
+    Parameters:
+        config (`UnivNetConfig`):
+            Config for the `UnivNetModel` model.
+        layer_id (`int`):
+            An integer corresponding to the index of the current LVC resnet block layer. This should be between 0 and
+            `len(config.resblock_stride_sizes) - 1` inclusive.
+        lvc_hop_size (`int`, *optional*, defaults to 256):
+            The hop size for the location variable convolutional layers.
+    """
+
+    def __init__(
+        self,
+        config: UnivNetConfig,
+        layer_id: int,
+        lvc_hop_size: int = 256,
+    ):
+        super().__init__()
+        self.hidden_channels = config.model_hidden_channels
+        self.kernel_size = config.resblock_kernel_sizes[layer_id]
+        self.stride = config.resblock_stride_sizes[layer_id]
+        self.dilations = config.resblock_dilation_sizes[layer_id]
+        self.cond_hop_length = lvc_hop_size
+        self.leaky_relu_slope = config.leaky_relu_slope
+        self.num_blocks = len(self.dilations)
+
+        self.convt_pre = nn.ConvTranspose1d(
+            self.hidden_channels,
+            self.hidden_channels,
+            2 * self.stride,
+            stride=self.stride,
+            padding=self.stride // 2 + self.stride % 2,
+            output_padding=self.stride % 2,
+        )
+
+        self.kernel_predictor = UnivNetKernelPredictor(config, self.kernel_size, self.num_blocks)
+
+        self.resblocks = nn.ModuleList(
+            [UnivNetLvcResidualBlock(config, self.kernel_size, self.dilations[i]) for i in range(self.num_blocks)]
+        )
+
+    def forward(self, hidden_states: torch.FloatTensor, spectrogram: torch.FloatTensor):
+        # hidden_states: (batch_size, hidden_channels, seq_length)
+        # spectrogram: (batch_size, cond_channels, cond_length)
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+        hidden_states = self.convt_pre(hidden_states)
+
+        kernels, biases = self.kernel_predictor(spectrogram)
+
+        for i, resblock in enumerate(self.resblocks):
+            kernel = kernels[:, i, :, :, :, :]
+            bias = biases[:, i, :, :]
+            hidden_states = resblock(hidden_states, kernel, bias, hop_size=self.cond_hop_length)
+
+        return hidden_states
+
+    def apply_weight_norm(self):
+        nn.utils.weight_norm(self.convt_pre)
+        self.kernel_predictor.apply_weight_norm()
+        for layer in self.resblocks:
+            layer.apply_weight_norm()
+
+    def remove_weight_norm(self):
+        nn.utils.remove_weight_norm(self.convt_pre)
+        self.kernel_predictor.remove_weight_norm()
+        for layer in self.resblocks:
+            layer.remove_weight_norm()
+
+
+UNIVNET_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`UnivNetConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+UNIVNET_INPUTS_DOCSTRING = r"""
+    Converts a noise waveform and a conditioning spectrogram to a speech waveform. Passing a batch of log-mel
+    spectrograms returns a batch of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a
+    single, un-batched speech waveform.
+
+    Args:
+        input_features (`torch.FloatTensor`):
+            Tensor containing the log-mel spectrograms. Can be batched and of shape `(batch_size, sequence_length,
+            config.num_mel_bins)`, or un-batched and of shape `(sequence_length, config.num_mel_bins)`.
+        noise_sequence (`torch.FloatTensor`, *optional*):
+            Tensor containing a noise sequence of standard Gaussian noise. Can be batched and of shape `(batch_size,
+            sequence_length, config.model_in_channels)`, or un-batched and of shape `(sequence_length,
+            config.model_in_channels)`. If not supplied, will be randomly generated.
+        padding_mask (`torch.BoolTensor`, *optional*):
+            Mask indicating which parts of each sequence are padded. Mask values are selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**
+            - 0 for tokens that are **masked**
+
+            The mask can be batched and of shape `(batch_size, sequence_length)` or un-batched and of shape
+            `(sequence_length,)`.
+        generator (`torch.Generator`, *optional*):
+            A [torch generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation
+            deterministic.
+        return_dict (`bool`, *optional*):
+            Whether to return a [`~utils.ModelOutput`] subclass instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    """UnivNet GAN vocoder.""",
+    UNIVNET_START_DOCSTRING,
+)
+class UnivNetModel(PreTrainedModel):
+    config_class = UnivNetConfig
+    main_input_name = "input_features"
+
+    def __init__(self, config: UnivNetConfig):
+        super().__init__(config)
+
+        self.num_kernels = len(config.resblock_kernel_sizes)
+        self.leaky_relu_slope = config.leaky_relu_slope
+
+        self.conv_pre = nn.Conv1d(
+            config.model_in_channels,
+            config.model_hidden_channels,
+            kernel_size=7,
+            stride=1,
+            padding=3,
+            padding_mode="reflect",
+        )
+
+        # Initialize location-variable convolution ResNet Blocks.
+        num_layers = len(config.resblock_stride_sizes)
+        hop_length = 1
+        hop_lengths = []
+        for stride in config.resblock_stride_sizes:
+            hop_length = hop_length * stride
+            hop_lengths.append(hop_length)
+
+        self.resblocks = nn.ModuleList(
+            [
+                UnivNetLvcBlock(
+                    config,
+                    layer_id=i,
+                    lvc_hop_size=hop_lengths[i],
+                )
+                for i in range(num_layers)
+            ]
+        )
+
+        self.conv_post = nn.Conv1d(config.model_hidden_channels, 1, 7, padding=3, padding_mode="reflect")
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(UNIVNET_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=UnivNetModelOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_features: torch.FloatTensor,
+        noise_sequence: Optional[torch.FloatTensor] = None,
+        padding_mask: Optional[torch.FloatTensor] = None,
+        generator: Optional[torch.Generator] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], UnivNetModelOutput]:
+        r"""
+        Returns:
+
+        Example:
+
+         ```python
+         >>> from transformers import UnivNetFeatureExtractor, UnivNetModel
+         >>> from datasets import load_dataset, Audio
+
+         >>> model = UnivNetModel.from_pretrained("dg845/univnet-dev")
+         >>> feature_extractor = UnivNetFeatureExtractor.from_pretrained("dg845/univnet-dev")
+
+         >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+         >>> # Resample the audio to the feature extractor's sampling rate.
+         >>> ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
+         >>> inputs = feature_extractor(
+         ...     ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt"
+         ... )
+         >>> audio = model(**inputs).waveforms
+         >>> list(audio.shape)
+         [1, 140288]
+         ```
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # Resolve batch sizes for noise_sequence and spectrogram
+        spectrogram_batched = input_features.dim() == 3
+        if not spectrogram_batched:
+            input_features = input_features.unsqueeze(0)
+        spectrogram_batch_size, spectrogram_length, _ = input_features.shape
+
+        if noise_sequence is not None:
+            noise_sequence_batched = noise_sequence.dim() == 3
+            if not noise_sequence_batched:
+                noise_sequence = noise_sequence.unsqueeze(0)
+        else:
+            # Randomly generate noise_sequence
+            noise_sequence_shape = (spectrogram_batch_size, spectrogram_length, self.config.model_in_channels)
+            noise_sequence = torch.randn(
+                noise_sequence_shape, generator=generator, dtype=input_features.dtype, device=input_features.device
+            )
+        noise_sequence_batch_size = noise_sequence.shape[0]
+
+        if spectrogram_batch_size > 1 and noise_sequence_batch_size == 1:
+            # Repeat noise_sequence spectrogram_batch_size times
+            noise_sequence = noise_sequence.repeat(spectrogram_batch_size, 1, 1)
+        elif noise_sequence_batch_size > 1 and spectrogram_batch_size == 1:
+            # Repeat spectrogram noise_sequence_batch_size times
+            input_features = input_features.repeat(noise_sequence_batch_size, 1, 1)
+
+        # Compare the batch sizes after the broadcasting above; the cached sizes predate the repeat.
+        if noise_sequence.shape[0] != input_features.shape[0]:
+            raise ValueError(
+                f"The batch size of `noise_sequence` is {noise_sequence_batch_size} and the batch size of"
+                f" `input_features` is {spectrogram_batch_size}, but the two are expected to be equal."
+            )
+
+        if padding_mask is not None:
+            if padding_mask.dim() == 1:
+                padding_mask = padding_mask.unsqueeze(0)
+            padding_mask_batch_size = padding_mask.shape[0]
+            if padding_mask_batch_size != spectrogram_batch_size:
+                raise ValueError(
+                    f"The batch size of `padding_mask` is {padding_mask_batch_size} and the batch size of"
+                    f" `input_features` is {spectrogram_batch_size}, but the two are expected to be equal."
+                )
+
+        # Change shapes to have channels before sequence lengths
+        hidden_states = noise_sequence.transpose(2, 1)
+        input_features = input_features.transpose(2, 1)
+
+        hidden_states = self.conv_pre(hidden_states)
+
+        for resblock in self.resblocks:
+            hidden_states = resblock(hidden_states, input_features)
+
+        hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
+        hidden_states = self.conv_post(hidden_states)
+        hidden_states = torch.tanh(hidden_states)
+
+        # Remove the channel dimension (dim 1), which collapses to 1 after conv_post
+        # NOTE: keep waveforms batched even if there's only one
+        waveform = hidden_states.squeeze(1)
+
+        # Get sequence lengths for UnivNetFeatureExtractor.batch_decode.
+        waveform_lengths = None
+        if padding_mask is not None:
+            # Padding is always contiguous and added on the right
+            waveform_lengths = torch.sum(padding_mask, dim=1)
+
+        if not return_dict:
+            outputs = (waveform, waveform_lengths)
+            return outputs
+
+        return UnivNetModelOutput(
+            waveforms=waveform,
+            waveform_lengths=waveform_lengths,
+        )
+
+    def _init_weights(self, module):
+        """Initialize the weights."""
+        if isinstance(module, (nn.Linear, nn.Conv1d, nn.ConvTranspose1d)):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+
+    def apply_weight_norm(self):
+        nn.utils.weight_norm(self.conv_pre)
+        for layer in self.resblocks:
+            layer.apply_weight_norm()
+        nn.utils.weight_norm(self.conv_post)
+
+    def remove_weight_norm(self):
+        nn.utils.remove_weight_norm(self.conv_pre)
+        for layer in self.resblocks:
+            layer.remove_weight_norm()
+        nn.utils.remove_weight_norm(self.conv_post)
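
Review note: the comments in `forward` describe a broadcasting rule for `input_features` and `noise_sequence`: an unbatched input gains a batch dimension of 1, and a batch of size 1 is repeated to match the other input. Below is a minimal sketch of that rule on plain shape tuples, so it runs without torch. `resolve_batch_shapes` is a hypothetical helper written for illustration, not part of this PR, and it assumes the intended behavior is to compare batch sizes after the repeat.

```python
def resolve_batch_shapes(spectrogram_shape, noise_shape):
    """Return the (spectrogram, noise) shapes after batch broadcasting.

    Shapes are (seq_len, channels) when unbatched and
    (batch, seq_len, channels) when batched.
    """
    # Unbatched inputs gain a leading batch dimension of 1.
    if len(spectrogram_shape) == 2:
        spectrogram_shape = (1, *spectrogram_shape)
    if len(noise_shape) == 2:
        noise_shape = (1, *noise_shape)

    spec_batch, noise_batch = spectrogram_shape[0], noise_shape[0]

    # A batch of size 1 is repeated to match the other input.
    if spec_batch > 1 and noise_batch == 1:
        noise_shape = (spec_batch, *noise_shape[1:])
    elif noise_batch > 1 and spec_batch == 1:
        spectrogram_shape = (noise_batch, *spectrogram_shape[1:])

    # Compare the batch sizes after broadcasting (assumed intent).
    if spectrogram_shape[0] != noise_shape[0]:
        raise ValueError(
            f"Batch sizes differ: noise {noise_batch}, spectrogram {spec_batch}."
        )
    return spectrogram_shape, noise_shape


# A batched spectrogram with a single unbatched noise sequence.
print(resolve_batch_shapes((3, 7, 100), (7, 64)))
```

Under this rule, only two batch sizes that differ and are both greater than 1 are rejected.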
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -7985,6 +7985,16 @@ class UniSpeechSatPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


+UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class UnivNetModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 class UperNetForSemanticSegmentation(metaclass=DummyObject):
    _backends = ["torch"]

--- a/tests/models/univnet/__init__.py
+++ b/tests/models/univnet/__init__.py
--- a/tests/models/univnet/test_feature_extraction_univnet.py
+++ b/tests/models/univnet/test_feature_extraction_univnet.py
@@ -0,0 +1,365 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import itertools
+import os
+import random
+import tempfile
+import unittest
+
+import numpy as np
+from datasets import Audio, load_dataset
+
+from transformers import UnivNetFeatureExtractor
+from transformers.testing_utils import check_json_file_has_correct_format, require_torch, slow
+from transformers.utils.import_utils import is_torch_available
+
+from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
+
+
+if is_torch_available():
+    import torch
+
+
+global_rng = random.Random()
+
+
+# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
+def floats_list(shape, scale=1.0, rng=None, name=None):
+    """Creates a random float32 tensor"""
+    if rng is None:
+        rng = global_rng
+
+    values = []
+    for batch_idx in range(shape[0]):
+        values.append([])
+        for _ in range(shape[1]):
+            values[-1].append(rng.random() * scale)
+
+    return values
+
+
+class UnivNetFeatureExtractionTester(unittest.TestCase):
+    def __init__(
+        self,
+        parent,
+        batch_size=7,
+        min_seq_length=400,
+        max_seq_length=2000,
+        feature_size=1,
+        sampling_rate=24000,
+        padding_value=0.0,
+        do_normalize=True,
+        num_mel_bins=100,
+        hop_length=256,
+        win_length=1024,
+        win_function="hann_window",
+        filter_length=1024,
+        max_length_s=10,
+        fmin=0.0,
+        fmax=12000,
+        mel_floor=1e-9,
+        center=False,
+        compression_factor=1.0,
+        compression_clip_val=1e-5,
+        normalize_min=-11.512925148010254,
+        normalize_max=2.3143386840820312,
+        model_in_channels=64,
+        pad_end_length=10,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.min_seq_length = min_seq_length
+        self.max_seq_length = max_seq_length
+        self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
+
+        self.feature_size = feature_size
+        self.sampling_rate = sampling_rate
+        self.padding_value = padding_value
+        self.do_normalize = do_normalize
+        self.num_mel_bins = num_mel_bins
+        self.hop_length = hop_length
+        self.win_length = win_length
+        self.win_function = win_function
+        self.filter_length = filter_length
+        self.max_length_s = max_length_s
+        self.fmin = fmin
+        self.fmax = fmax
+        self.mel_floor = mel_floor
+        self.center = center
+        self.compression_factor = compression_factor
+        self.compression_clip_val = compression_clip_val
+        self.normalize_min = normalize_min
+        self.normalize_max = normalize_max
+        self.model_in_channels = model_in_channels
+        self.pad_end_length = pad_end_length
+
+    def prepare_feat_extract_dict(self):
+        return {
+            "feature_size": self.feature_size,
+            "sampling_rate": self.sampling_rate,
+            "padding_value": self.padding_value,
+            "do_normalize": self.do_normalize,
+            "num_mel_bins": self.num_mel_bins,
+            "hop_length": self.hop_length,
+            "win_length": self.win_length,
+            "win_function": self.win_function,
+            "filter_length": self.filter_length,
+            "max_length_s": self.max_length_s,
+            "fmin": self.fmin,
+            "fmax": self.fmax,
+            "mel_floor": self.mel_floor,
+            "center": self.center,
+            "compression_factor": self.compression_factor,
+            "compression_clip_val": self.compression_clip_val,
+            "normalize_min": self.normalize_min,
+            "normalize_max": self.normalize_max,
+            "model_in_channels": self.model_in_channels,
+            "pad_end_length": self.pad_end_length,
+        }
+
+    def prepare_inputs_for_common(self, equal_length=False, numpify=False):
+        def _flatten(list_of_lists):
+            return list(itertools.chain(*list_of_lists))
+
+        if equal_length:
+            speech_inputs = floats_list((self.batch_size, self.max_seq_length))
+        else:
+            # make sure that inputs increase in size
+            speech_inputs = [
+                _flatten(floats_list((x, self.feature_size)))
+                for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
+            ]
+
+        if numpify:
+            speech_inputs = [np.asarray(x) for x in speech_inputs]
+
+        return speech_inputs
+
+
+class UnivNetFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
+    feature_extraction_class = UnivNetFeatureExtractor
+
+    def setUp(self):
+        self.feat_extract_tester = UnivNetFeatureExtractionTester(self)
+
+    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_feat_extract_from_and_save_pretrained
+    def test_feat_extract_from_and_save_pretrained(self):
+        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
+            check_json_file_has_correct_format(saved_file)
+            feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)
+
+        dict_first = feat_extract_first.to_dict()
+        dict_second = feat_extract_second.to_dict()
+        mel_1 = feat_extract_first.mel_filters
+        mel_2 = feat_extract_second.mel_filters
+        self.assertTrue(np.allclose(mel_1, mel_2))
+        self.assertEqual(dict_first, dict_second)
+
+    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_feat_extract_to_json_file
+    def test_feat_extract_to_json_file(self):
+        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            json_file_path = os.path.join(tmpdirname, "feat_extract.json")
+            feat_extract_first.to_json_file(json_file_path)
+            feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)
+
+        dict_first = feat_extract_first.to_dict()
+        dict_second = feat_extract_second.to_dict()
+        mel_1 = feat_extract_first.mel_filters
+        mel_2 = feat_extract_second.mel_filters
+        self.assertTrue(np.allclose(mel_1, mel_2))
+        self.assertEqual(dict_first, dict_second)
+
+    def test_call(self):
+        # Tests that __call__ gives consistent results for list and numpy inputs, unbatched and batched
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
+        # create three inputs of length 800, 1000, and 1200
+        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
+        np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
+
+        # Test feature size
+        input_features = feature_extractor(
+            np_speech_inputs, padding="max_length", max_length=1600, return_tensors="np"
+        ).input_features
+        self.assertTrue(input_features.ndim == 3)
+        # Note: the feature_size check below is disabled because padding raises an error when feature_size > 1
+        # self.assertTrue(input_features.shape[-2] == feature_extractor.feature_size)
+        # Note: we use the shape convention (batch_size, seq_len, num_mel_bins)
+        self.assertTrue(input_features.shape[-1] == feature_extractor.num_mel_bins)
+
+        # Test not batched input
+        encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
+        self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
+
+        # Test batched
+        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+
+        # Test 2-D numpy arrays are batched.
+        speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
+        np_speech_inputs = np.asarray(speech_inputs)
+        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+
+        # Test truncation required
+        speech_inputs = [
+            floats_list((1, x))[0]
+            for x in range((feature_extractor.num_max_samples - 100), (feature_extractor.num_max_samples + 500), 200)
+        ]
+        np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
+
+        speech_inputs_truncated = [x[: feature_extractor.num_max_samples] for x in speech_inputs]
+        np_speech_inputs_truncated = [np.asarray(speech_input) for speech_input in speech_inputs_truncated]
+
+        encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(np_speech_inputs_truncated, return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+
+    def test_batched_unbatched_consistency(self):
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
+        speech_inputs = floats_list((1, 800))[0]
+        np_speech_inputs = np.asarray(speech_inputs)
+
+        # Test unbatched vs batched list
+        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor([speech_inputs], return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+
+        # Test np.ndarray vs List[np.ndarray]
+        encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor([np_speech_inputs], return_tensors="np").input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+
+        # Test unbatched np.ndarray vs batched np.ndarray
+        encoded_sequences_1 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
+        encoded_sequences_2 = feature_extractor(
+            np.expand_dims(np_speech_inputs, axis=0), return_tensors="np"
+        ).input_features
+        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
+            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
+
+    def test_generate_noise(self):
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
+        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
+
+        features = feature_extractor(speech_inputs, return_noise=True)
+        input_features = features.input_features
+        noise_features = features.noise_sequence
+
+        for spectrogram, noise in zip(input_features, noise_features):
+            self.assertEqual(spectrogram.shape[0], noise.shape[0])
+
+    def test_pad_end(self):
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
+        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
+
+        input_features1 = feature_extractor(speech_inputs, padding=False, pad_end=False).input_features
+        input_features2 = feature_extractor(speech_inputs, padding=False, pad_end=True).input_features
+
+        for spectrogram1, spectrogram2 in zip(input_features1, input_features2):
+            self.assertEqual(spectrogram1.shape[0] + self.feat_extract_tester.pad_end_length, spectrogram2.shape[0])
+
+    def test_generate_noise_and_pad_end(self):
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
+        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
+
+        features = feature_extractor(speech_inputs, padding=False, return_noise=True, pad_end=True)
+        input_features = features.input_features
+        noise_features = features.noise_sequence
+
+        for spectrogram, noise in zip(input_features, noise_features):
+            self.assertEqual(spectrogram.shape[0], noise.shape[0])
+
+    @require_torch
+    def test_batch_decode(self):
+        import torch
+
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
+        input_lengths = list(range(800, 1400, 200))
+        pad_samples = feature_extractor.pad_end_length * feature_extractor.hop_length
+        output_features = {
+            "waveforms": torch.tensor(floats_list((3, max(input_lengths) + pad_samples))),
+            "waveform_lengths": torch.tensor(input_lengths),
+        }
+        waveforms = feature_extractor.batch_decode(**output_features)
+
+        for input_length, waveform in zip(input_lengths, waveforms):
+            self.assertTrue(len(waveform.shape) == 1, msg="Individual output waveforms should be 1D")
+            self.assertEqual(waveform.shape[0], input_length)
+
+    @require_torch
+    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_double_precision_pad
+    def test_double_precision_pad(self):
+        import torch
+
+        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
+        np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
+        py_speech_inputs = np_speech_inputs.tolist()
+
+        for inputs in [py_speech_inputs, np_speech_inputs]:
+            np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
+            self.assertTrue(np_processed.input_features.dtype == np.float32)
+            pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
+            self.assertTrue(pt_processed.input_features.dtype == torch.float32)
+
+    def _load_datasamples(self, num_samples):
+        ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+        ds = ds.cast_column("audio", Audio(sampling_rate=self.feat_extract_tester.sampling_rate))
+        # automatic decoding with librispeech
+        speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
+
+        return [x["array"] for x in speech_samples], [x["sampling_rate"] for x in speech_samples]
+
+    @slow
+    @require_torch
+    def test_integration(self):
+        # fmt: off
+        EXPECTED_INPUT_FEATURES = torch.tensor(
+            [
+                -5.0229, -6.1358, -5.8346, -5.4447, -5.6707, -5.8577, -5.0464, -5.0058,
+                -5.6015, -5.6410, -5.4325, -5.6116, -5.3700, -5.7956, -5.3196, -5.3274,
+                -5.9655, -5.6057, -5.8382, -5.9602, -5.9005, -5.9123, -5.7669, -6.1441,
+                -5.5168, -5.1405, -5.3927, -6.0032, -5.5784, -5.3728
+            ],
+        )
+        # fmt: on
+
+        input_speech, sr = self._load_datasamples(1)
+
+        feature_extractor = UnivNetFeatureExtractor()
+        input_features = feature_extractor(input_speech, sampling_rate=sr[0], return_tensors="pt").input_features
+        self.assertEqual(input_features.shape, (1, 548, 100))
+
+        input_features_mean = torch.mean(input_features)
+        input_features_stddev = torch.std(input_features)
+
+        EXPECTED_MEAN = torch.tensor(-6.18862009)
+        EXPECTED_STDDEV = torch.tensor(2.80845642)
+
+        torch.testing.assert_close(input_features_mean, EXPECTED_MEAN, atol=5e-5, rtol=5e-6)
+        torch.testing.assert_close(input_features_stddev, EXPECTED_STDDEV)
+        torch.testing.assert_close(input_features[0, :30, 0], EXPECTED_INPUT_FEATURES, atol=1e-4, rtol=1e-5)
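
Review note: `test_batch_decode` above checks that each decoded waveform is trimmed back to its own length, and `UnivNetModel.forward` derives those lengths by summing `padding_mask` along the sequence axis, relying on padding being contiguous and on the right. The sketch below shows that relationship with plain Python lists. Both function names are hypothetical, and the real `UnivNetFeatureExtractor.batch_decode` operates on tensors and may differ in detail.

```python
def waveform_lengths_from_mask(padding_mask):
    """Count the unpadded samples in each row of a 0/1 mask.

    Assumes padding is contiguous and on the right, so the row sum
    equals the index at which the padding starts.
    """
    return [sum(row) for row in padding_mask]


def trim_waveforms(waveforms, waveform_lengths=None):
    """Drop right-side padding from each waveform when lengths are known."""
    if waveform_lengths is None:
        # Without lengths, return the waveforms unchanged (as copies).
        return [list(waveform) for waveform in waveforms]
    return [
        list(waveform[:length])
        for waveform, length in zip(waveforms, waveform_lengths)
    ]


# Two waveforms padded to a common length of 4 samples.
mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
waveforms = [[0.1, 0.2, 0.3, 0.0], [0.4, 0.5, 0.6, 0.7]]
lengths = waveform_lengths_from_mask(mask)
trimmed = trim_waveforms(waveforms, lengths)
print(lengths, [len(waveform) for waveform in trimmed])
```

Summing the mask only recovers the correct cut point under the right-padding assumption; with left or interleaved padding the slice would keep the wrong samples.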
--- a/tests/models/univnet/test_modeling_univnet.py
+++ b/tests/models/univnet/test_modeling_univnet.py
@@ -0,0 +1,365 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import inspect
+import random
+import unittest
+
+from datasets import Audio, load_dataset
+
+from transformers import UnivNetConfig, UnivNetFeatureExtractor
+from transformers.testing_utils import (
+    is_torch_available,
+    require_torch,
+    require_torch_gpu,
+    slow,
+    torch_device,
+)
+
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import (
+    ModelTesterMixin,
+    floats_tensor,
+)
+
+
+if is_torch_available():
+    import torch
+
+    from transformers import UnivNetModel
+
+
+class UnivNetModelTester:
+    def __init__(
+        self,
+        parent,
+        batch_size=2,
+        seq_length=7,
+        in_channels=8,
+        hidden_channels=8,
+        num_mel_bins=20,
+        kernel_predictor_hidden_channels=8,
+        seed=0,
+        is_training=False,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.in_channels = in_channels
+        self.hidden_channels = hidden_channels
+        self.num_mel_bins = num_mel_bins
+        self.kernel_predictor_hidden_channels = kernel_predictor_hidden_channels
+        self.seed = seed
+        self.is_training = is_training
+
+    def prepare_noise_sequence(self):
+        generator = torch.manual_seed(self.seed)
+        noise_shape = (self.seq_length, self.in_channels)
+        # Create noise on CPU for reproducibility
+        noise_sequence = torch.randn(noise_shape, generator=generator, dtype=torch.float)
+        return noise_sequence
+
+    def prepare_config_and_inputs(self):
+        spectrogram = floats_tensor([self.seq_length, self.num_mel_bins], scale=1.0)
+        noise_sequence = self.prepare_noise_sequence()
+        noise_sequence = noise_sequence.to(spectrogram.device)
+        config = self.get_config()
+        return config, spectrogram, noise_sequence
+
+    def get_config(self):
+        return UnivNetConfig(
+            model_in_channels=self.in_channels,
+            model_hidden_channels=self.hidden_channels,
+            num_mel_bins=self.num_mel_bins,
+            kernel_predictor_hidden_channels=self.kernel_predictor_hidden_channels,
+        )
+
+    def create_and_check_model(self, config, spectrogram, noise_sequence):
+        model = UnivNetModel(config=config).to(torch_device).eval()
+        result = model(spectrogram, noise_sequence)[0]
+        self.parent.assertEqual(result.shape, (1, self.seq_length * 256))
+
+    def prepare_config_and_inputs_for_common(self):
+        config, spectrogram, noise_sequence = self.prepare_config_and_inputs()
+        inputs_dict = {"input_features": spectrogram, "noise_sequence": noise_sequence}
+        return config, inputs_dict
+
+
+@require_torch
+class UnivNetModelTest(ModelTesterMixin, unittest.TestCase):
+    all_model_classes = (UnivNetModel,) if is_torch_available() else ()
+    # UnivNetModel currently cannot be traced with torch.jit.trace.
+    test_torchscript = False
+    # The UnivNetModel is not a transformer and does not use any attention mechanisms, so skip transformer/attention
+    # related tests.
+    test_pruning = False
+    test_resize_embeddings = False
+    test_resize_position_embeddings = False
+    test_head_masking = False
+    # UnivNetModel is not a sequence classification model.
+    test_mismatched_shapes = False
+    # UnivNetModel does not have a base_model_prefix attribute.
+    test_missing_keys = False
+    # UnivNetModel does not implement a parallelize method.
+    test_model_parallel = False
+    is_encoder_decoder = False
+    has_attentions = False
+
+    input_name = "input_features"
+
+    def setUp(self):
+        self.model_tester = UnivNetModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=UnivNetConfig)
+
+    def test_config(self):
+        self.config_tester.create_and_test_config_to_json_string()
+        self.config_tester.create_and_test_config_to_json_file()
+        self.config_tester.create_and_test_config_from_and_save_pretrained()
+        self.config_tester.create_and_test_config_from_and_save_pretrained_subfolder()
+        self.config_tester.create_and_test_config_with_num_labels()
+        self.config_tester.check_config_can_be_init_without_params()
+        self.config_tester.check_config_arguments_init()
+
+    def test_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_model(*config_and_inputs)
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names = [*signature.parameters.keys()]
+
+            expected_arg_names = [
+                "input_features",
+            ]
+            self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
+
+    @unittest.skip(reason="UnivNetModel does not output hidden_states.")
+    def test_hidden_states_output(self):
+        pass
+
+    @unittest.skip(reason="UnivNetModel.forward does not accept an inputs_embeds argument.")
+    def test_inputs_embeds(self):
+        pass
+
+    @unittest.skip(reason="UnivNetModel does not use input embeddings and thus has no get_input_embeddings method.")
+    def test_model_common_attributes(self):
+        pass
+
+    @unittest.skip(reason="UnivNetModel does not support all arguments tested, such as output_hidden_states.")
+    def test_model_outputs_equivalence(self):
+        pass
+
+    @unittest.skip(reason="UnivNetModel does not output hidden_states.")
+    def test_retain_grad_hidden_states_attentions(self):
+        pass
+
+    def test_batched_inputs_outputs(self):
+        config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            batched_spectrogram = inputs["input_features"].unsqueeze(0).repeat(2, 1, 1)
+            batched_noise_sequence = inputs["noise_sequence"].unsqueeze(0).repeat(2, 1, 1)
+            with torch.no_grad():
+                batched_outputs = model(
+                    batched_spectrogram.to(torch_device),
+                    batched_noise_sequence.to(torch_device),
+                )[0]
+
+            self.assertEqual(
+                batched_spectrogram.shape[0],
+                batched_outputs.shape[0],
+                msg="Got different batch dims for input and output",
+            )
+
+    def test_unbatched_inputs_outputs(self):
+        config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            with torch.no_grad():
+                outputs = model(inputs["input_features"].to(torch_device), inputs["noise_sequence"].to(torch_device))[
+                    0
+                ]
+            self.assertTrue(outputs.shape[0] == 1, msg="Unbatched input should create batched output with bsz = 1")
+
+    def test_unbatched_batched_outputs_consistency(self):
+        config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            unbatched_spectrogram = inputs["input_features"].detach().clone()
+            unbatched_noise_sequence = inputs["noise_sequence"].detach().clone()
+            batched_spectrogram = inputs["input_features"].unsqueeze(0)
+            batched_noise_sequence = inputs["noise_sequence"].unsqueeze(0)
+
+            with torch.no_grad():
+                unbatched_outputs = model(
+                    unbatched_spectrogram.to(torch_device),
+                    unbatched_noise_sequence.to(torch_device),
+                )[0]
+
+                batched_outputs = model(
+                    batched_spectrogram.to(torch_device),
+                    batched_noise_sequence.to(torch_device),
+                )[0]
+
+            torch.testing.assert_close(unbatched_outputs, batched_outputs)
+
+
+@require_torch_gpu
+@slow
+class UnivNetModelIntegrationTests(unittest.TestCase):
+    def tearDown(self):
+        super().tearDown()
+        gc.collect()
+        torch.cuda.empty_cache()
+
+    def _load_datasamples(self, num_samples, sampling_rate=24000):
+        ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+        ds = ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
+        # automatic decoding with librispeech
+        speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
+
+        return [x["array"] for x in speech_samples], [x["sampling_rate"] for x in speech_samples]
+
+    def get_inputs(self, device, num_samples: int = 3, noise_length: int = 10, seed: int = 0):
+        generator = torch.manual_seed(seed)
+        # Note: hardcode model_in_channels -> 64
+        if num_samples == 1:
+            noise_sequence_shape = (64, noise_length)
+        else:
+            noise_sequence_shape = (num_samples, 64, noise_length)
+        # Explicitly generate noise_sequence on CPU for consistency.
+        noise_sequence = torch.randn(noise_sequence_shape, generator=generator, dtype=torch.float32, device="cpu")
+        # Put noise_sequence on the desired device.
+        noise_sequence = noise_sequence.to(device)
+
+        # Note: hardcode num_mel_bins -> 100
+        if num_samples == 1:
+            spectrogram_shape = [100, noise_length]
+        else:
+            spectrogram_shape = [num_samples, 100, noise_length]
+        spectrogram = floats_tensor(spectrogram_shape, scale=1.0, rng=random.Random(seed))
+        # Note: spectrogram should already be on torch_device
+
+        # Swap the channel and sequence axes (channels last) to match the diffusers implementation
+        if num_samples == 1:
+            noise_sequence = noise_sequence.transpose(1, 0)
+            spectrogram = spectrogram.transpose(1, 0)
+        else:
+            noise_sequence = noise_sequence.transpose(2, 1)
+            spectrogram = spectrogram.transpose(2, 1)
+
+        inputs = {
+            "input_features": spectrogram,
+            "noise_sequence": noise_sequence,
+            "generator": generator,
+        }
+
+        return inputs
+
+    def test_model_inference_batched(self):
+        # Load sample checkpoint from Tortoise TTS
+        model = UnivNetModel.from_pretrained("dg845/univnet-dev")
+        model.eval().to(torch_device)
+
+        # Get batched noise and spectrogram inputs.
+        input_speech = self.get_inputs(torch_device, num_samples=3)
+
+        with torch.no_grad():
+            waveform = model(**input_speech)[0]
+        waveform = waveform.cpu()
+
+        waveform_mean = torch.mean(waveform)
+        waveform_stddev = torch.std(waveform)
+        waveform_slice = waveform[-1, -9:].flatten()
+
+        EXPECTED_MEAN = torch.tensor(-0.19989729)
+        EXPECTED_STDDEV = torch.tensor(0.35230172)
+        EXPECTED_SLICE = torch.tensor([-0.3408, -0.6045, -0.5052, 0.1160, -0.1556, -0.0405, -0.3024, -0.5290, -0.5019])
+
+        torch.testing.assert_close(waveform_mean, EXPECTED_MEAN, atol=1e-4, rtol=1e-5)
+        torch.testing.assert_close(waveform_stddev, EXPECTED_STDDEV, atol=1e-4, rtol=1e-5)
+        torch.testing.assert_close(waveform_slice, EXPECTED_SLICE, atol=5e-4, rtol=1e-5)
+
+    def test_model_inference_unbatched(self):
+        # Load sample checkpoint from Tortoise TTS
+        model = UnivNetModel.from_pretrained("dg845/univnet-dev")
+        model.eval().to(torch_device)
+
+        # Get unbatched noise and spectrogram inputs.
+        input_speech = self.get_inputs(torch_device, num_samples=1)
+
+        with torch.no_grad():
+            waveform = model(**input_speech)[0]
+        waveform = waveform.cpu()
+
+        waveform_mean = torch.mean(waveform)
+        waveform_stddev = torch.std(waveform)
+        waveform_slice = waveform[-1, -9:].flatten()
+
+        EXPECTED_MEAN = torch.tensor(-0.22895093)
+        EXPECTED_STDDEV = torch.tensor(0.33986747)
+        EXPECTED_SLICE = torch.tensor([-0.3276, -0.5504, -0.3484, 0.3574, -0.0373, -0.1826, -0.4880, -0.6431, -0.5162])
+
+        torch.testing.assert_close(waveform_mean, EXPECTED_MEAN, atol=1e-4, rtol=1e-5)
+        torch.testing.assert_close(waveform_stddev, EXPECTED_STDDEV, atol=1e-4, rtol=1e-5)
+        torch.testing.assert_close(waveform_slice, EXPECTED_SLICE, atol=1e-3, rtol=1e-5)
+
+    def test_integration(self):
+        feature_extractor = UnivNetFeatureExtractor.from_pretrained("dg845/univnet-dev")
+        model = UnivNetModel.from_pretrained("dg845/univnet-dev")
+        model.eval().to(torch_device)
+
+        audio, sr = self._load_datasamples(1, sampling_rate=feature_extractor.sampling_rate)
+
+        input_features = feature_extractor(audio, sampling_rate=sr[0], return_tensors="pt").input_features
+        input_features = input_features.to(device=torch_device)
+
+        input_speech = self.get_inputs(torch_device, num_samples=1, noise_length=input_features.shape[1])
+        input_speech["input_features"] = input_features
+
+        with torch.no_grad():
+            waveform = model(**input_speech)[0]
+        waveform = waveform.cpu()
+
+        waveform_mean = torch.mean(waveform)
+        waveform_stddev = torch.std(waveform)
+        waveform_slice = waveform[-1, -9:].flatten()
+
+        EXPECTED_MEAN = torch.tensor(0.00051374)
+        EXPECTED_STDDEV = torch.tensor(0.058105603)
+        # fmt: off
+        EXPECTED_SLICE = torch.tensor([-4.3934e-04, -1.8203e-04, -3.3033e-04, -3.8716e-04, -1.6125e-04, 3.5389e-06, -3.3149e-04, -3.7613e-04, -2.3331e-04])
+        # fmt: on
+
+        torch.testing.assert_close(waveform_mean, EXPECTED_MEAN, atol=5e-6, rtol=1e-5)
+        torch.testing.assert_close(waveform_stddev, EXPECTED_STDDEV, atol=1e-4, rtol=1e-5)
+        torch.testing.assert_close(waveform_slice, EXPECTED_SLICE, atol=5e-6, rtol=1e-5)