Add Jukebox model (replaces #16875) (#17826)

Arthur 2022-11-10 21:05:27 +01:00 committed by GitHub
parent 9740a03f61
commit 61a51f5f23
24 changed files with 4797 additions and 0 deletions

View File

@ -320,6 +320,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Jukebox](https://huggingface.co/docs/transformers/main/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -320,6 +320,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Jukebox](https://huggingface.co/docs/transformers/main/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -355,6 +355,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Jukebox](https://huggingface.co/docs/transformers/main/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -270,6 +270,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Jukebox](https://huggingface.co/docs/transformers/main/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -294,6 +294,7 @@ conda install -c huggingface transformers
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (来自 OpenAI) 伴随论文 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 由 Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 发布。
1. **[Jukebox](https://huggingface.co/docs/transformers/main/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) 由 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou 发布。
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) 由 Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou 发布。
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) 由 Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei 发布。

View File

@ -306,6 +306,7 @@ conda install -c huggingface transformers
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Jukebox](https://huggingface.co/docs/transformers/main/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.

View File

@ -275,6 +275,8 @@
title: HerBERT
- local: model_doc/ibert
title: I-BERT
- local: model_doc/jukebox
title: Jukebox
- local: model_doc/layoutlm
title: LayoutLM
- local: model_doc/led

View File

@ -108,6 +108,7 @@ The documentation is organized into five sections:
1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Jukebox](model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1. **[LayoutLMv3](model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
@ -263,6 +264,7 @@ Flax), PyTorch, and/or TensorFlow.
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
| ImageGPT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Jukebox | ✅ | ❌ | ✅ | ❌ | ❌ |
| LayoutLM | ✅ | ✅ | ✅ | ✅ | ❌ |
| LayoutLMv2 | ✅ | ✅ | ✅ | ❌ | ❌ |
| LayoutLMv3 | ✅ | ✅ | ✅ | ✅ | ❌ |

View File

@ -0,0 +1,79 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Jukebox
## Overview
The Jukebox model was proposed in [Jukebox: A generative model for music](https://arxiv.org/pdf/2005.00341.pdf)
by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and
Ilya Sutskever. It introduces a generative music model which can produce minute-long samples that can be conditioned on
an artist, genres, and lyrics.
The abstract from the paper is the following:
*We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples, along with model weights and code.*
As shown in the following figure, Jukebox is made of 3 `priors`, which are decoder-only models. They follow the architecture described in [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509), modified to support longer context length.
First, an autoencoder is used to encode the text lyrics. Next, the first prior (also called `top_prior`) attends to the last hidden states extracted from the lyrics encoder. Each subsequent prior is linked to the previous one via an `AudioConditioner` module. The `AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution.
The metadata such as *artist, genre and timing* are passed to each prior, in the form of a start token and a positional embedding for the timing data. The hidden states are mapped to the closest codebook vector from the VQVAE in order to convert them to raw audio.
![JukeboxModel](https://gist.githubusercontent.com/ArthurZucker/92c1acaae62ebf1b6a951710bdd8b6af/raw/c9c517bf4eff61393f6c7dec9366ef02bdd059a3/jukebox.svg)
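The final step described above, mapping a hidden state to its closest codebook vector, can be sketched in plain Python. This is a toy illustration of a nearest-codebook lookup under stated assumptions, not the `JukeboxVQVAE` implementation: the helper name `nearest_code`, the 2-D codebook, and the hidden-state values are all hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Hypothetical toy values: three 2-D codebook vectors and three hidden states.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
hidden_states = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

# Each hidden state is replaced by the index of its nearest codebook vector.
codes = [nearest_code(h, codebook) for h in hidden_states]
print(codes)
```

In a real VQ-VAE the same lookup is usually done with batched tensor operations over a much larger codebook; the loop above only shows the idea.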
Tips:
- This model only supports inference, mostly because training it requires a very large amount of memory. Feel free to open a PR and add what's missing to have a full integration with the Hugging Face `Trainer`!
- This model is very slow, and takes 8h to generate a minute-long audio using the 5b top prior on a V100 GPU. To automatically handle the device on which the model should execute, use `accelerate`.
- Contrary to the paper, the order of the priors goes from `0` to `1` as it felt more intuitive: we sample starting from `0`.
- Primed sampling (conditioning the sampling on raw audio) requires more memory than ancestral sampling and should be used with `fp16` set to `True`.
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/openai/jukebox).
## JukeboxConfig
[[autodoc]] JukeboxConfig
## JukeboxPriorConfig
[[autodoc]] JukeboxPriorConfig
## JukeboxVQVAEConfig
[[autodoc]] JukeboxVQVAEConfig
## JukeboxTokenizer
[[autodoc]] JukeboxTokenizer
- save_vocabulary
## JukeboxModel
[[autodoc]] JukeboxModel
- ancestral_sample
- primed_sample
- continue_sample
- upsample
- _sample
## JukeboxPrior
[[autodoc]] JukeboxPrior
- sample
- forward
## JukeboxVQVAE
[[autodoc]] JukeboxVQVAE
- forward
- encode
- decode

View File

@ -249,6 +249,13 @@ _import_structure = {
"models.hubert": ["HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "HubertConfig"],
"models.ibert": ["IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "IBertConfig"],
"models.imagegpt": ["IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ImageGPTConfig"],
"models.jukebox": [
"JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP",
"JukeboxConfig",
"JukeboxPriorConfig",
"JukeboxTokenizer",
"JukeboxVQVAEConfig",
],
"models.layoutlm": ["LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "LayoutLMConfig", "LayoutLMTokenizer"],
"models.layoutlmv2": [
"LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP",
@ -1483,6 +1490,15 @@ else:
"load_tf_weights_in_imagegpt",
]
)
_import_structure["models.jukebox"].extend(
[
"JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST",
"JukeboxModel",
"JukeboxPreTrainedModel",
"JukeboxVQVAE",
"JukeboxPrior",
]
)
_import_structure["models.layoutlm"].extend(
[
"LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST",
@ -3353,6 +3369,13 @@ if TYPE_CHECKING:
from .models.hubert import HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, HubertConfig
from .models.ibert import IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, IBertConfig
from .models.imagegpt import IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, ImageGPTConfig
from .models.jukebox import (
JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP,
JukeboxConfig,
JukeboxPriorConfig,
JukeboxTokenizer,
JukeboxVQVAEConfig,
)
from .models.layoutlm import LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMConfig, LayoutLMTokenizer
from .models.layoutlmv2 import (
LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP,
@ -4359,6 +4382,13 @@ if TYPE_CHECKING:
ImageGPTPreTrainedModel,
load_tf_weights_in_imagegpt,
)
from .models.jukebox import (
JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST,
JukeboxModel,
JukeboxPreTrainedModel,
JukeboxPrior,
JukeboxVQVAE,
)
from .models.layoutlm import (
LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST,
LayoutLMForMaskedLM,

View File

@ -78,6 +78,7 @@ from . import (
hubert,
ibert,
imagegpt,
jukebox,
layoutlm,
layoutlmv2,
layoutlmv3,

View File

@ -81,6 +81,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("hubert", "HubertConfig"),
("ibert", "IBertConfig"),
("imagegpt", "ImageGPTConfig"),
("jukebox", "JukeboxConfig"),
("layoutlm", "LayoutLMConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("layoutlmv3", "LayoutLMv3Config"),
@ -221,6 +222,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("jukebox", "JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlm", "LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv3", "LAYOUTLMV3_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@ -363,6 +365,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("hubert", "Hubert"),
("ibert", "I-BERT"),
("imagegpt", "ImageGPT"),
("jukebox", "Jukebox"),
("layoutlm", "LayoutLM"),
("layoutlmv2", "LayoutLMv2"),
("layoutlmv3", "LayoutLMv3"),

View File

@ -80,6 +80,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("hubert", "HubertModel"),
("ibert", "IBertModel"),
("imagegpt", "ImageGPTModel"),
("jukebox", "JukeboxModel"),
("layoutlm", "LayoutLMModel"),
("layoutlmv2", "LayoutLMv2Model"),
("layoutlmv3", "LayoutLMv3Model"),

View File

@ -143,6 +143,7 @@ else:
("herbert", ("HerbertTokenizer", "HerbertTokenizerFast" if is_tokenizers_available() else None)),
("hubert", ("Wav2Vec2CTCTokenizer", None)),
("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
("jukebox", ("JukeboxTokenizer", None)),
("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),

View File

@ -0,0 +1,74 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
_import_structure = {
    "configuration_jukebox": [
        "JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "JukeboxConfig",
        "JukeboxPriorConfig",
        "JukeboxVQVAEConfig",
    ],
    "tokenization_jukebox": ["JukeboxTokenizer"],
}
try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_jukebox"] = [
        "JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST",
        "JukeboxModel",
        "JukeboxPreTrainedModel",
        "JukeboxVQVAE",
        "JukeboxPrior",
    ]
if TYPE_CHECKING:
    from .configuration_jukebox import (
        JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP,
        JukeboxConfig,
        JukeboxPriorConfig,
        JukeboxVQVAEConfig,
    )
    from .tokenization_jukebox import JukeboxTokenizer
    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_jukebox import (
            JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST,
            JukeboxModel,
            JukeboxPreTrainedModel,
            JukeboxPrior,
            JukeboxVQVAE,
        )
else:
    import sys
    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
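The file above declares its exports in `_import_structure` and defers the actual imports until a name is used. A minimal sketch of that general idea, using only the standard library, might look like the following. `LazyNamespace` is a hypothetical, heavily simplified stand-in and is not the `_LazyModule` class from `transformers`; the `json` and `math` entries are placeholders chosen so the example runs anywhere.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Hypothetical stand-in for a lazy module: imports a submodule on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported name to the module that defines it.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # records which modules were actually imported

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails, i.e. before caching.
        mapping = self.__dict__.get("_name_to_module", {})
        if attr not in mapping:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(mapping[attr])
        self.loaded.append(mapping[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


# Placeholder structure: names are resolved from stdlib modules for illustration.
lazy = LazyNamespace("demo", {"json": ["dumps", "loads"], "math": ["sqrt"]})
```

Nothing is imported when `lazy` is created; accessing `lazy.sqrt` triggers the import of `math` once and caches the result.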

View File

@ -0,0 +1,639 @@
# coding=utf-8
# Copyright 2022 The OpenAI Team Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Jukebox configuration"""
import copy
import os
from typing import List, Union
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"openai/jukebox-5b-lyrics": "https://huggingface.co/openai/jukebox-5b-lyrics/blob/main/config.json",
"openai/jukebox-1b-lyrics": "https://huggingface.co/openai/jukebox-1b-lyrics/blob/main/config.json",
}
_LARGE_ATTENTION = [
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"block_attn",
"transpose_block_attn",
"prev_block_attn",
"cross_attention",
]
_RawColumnPreviousRowAttention = ["block_attn", "transpose_block_attn", "prev_block_attn"]
_FullDenseAttention = ["dense_attention"]
_PrimePrimeDenseAttention = ["prime_attn", "prime_attn", "dense_attn"]
def full_dense_attention(layer):
    return _FullDenseAttention[0]
def raw_column_previous_row_attention(layer):
    return _RawColumnPreviousRowAttention[layer % 3]
def large_separated_enc_dec_w_lyrics(layer):
    return _LARGE_ATTENTION[layer % 79]
def enc_dec_with_lyrics(layer):
    if layer % 16 == 15:
        return _PrimePrimeDenseAttention[layer % 3]
    return _RawColumnPreviousRowAttention[layer % 3]
ATTENTION_PATTERNS = {
    "full_dense_attention": full_dense_attention,
    "raw_column_previous_row_attention": raw_column_previous_row_attention,  # Alternate row, column and previous row attn
    "large_separated_enc_dec_w_lyrics": large_separated_enc_dec_w_lyrics,  # Used by large separated_enc_dec model with lyrics
    "enc_dec_with_lyrics": enc_dec_with_lyrics,  # Used by encoder_decoder model with lyrics
}
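To see what these layer-index functions produce, the standalone sketch below rebuilds the 79-entry list in a compact form and inspects two of the patterns. The compact construction and the names `TRIPLE`, `large_attention`, `cross_layers`, and `special_layers` are assumptions of this sketch rather than part of the source file; the 72-layer range is taken from the documented `num_layers` default.

```python
TRIPLE = ["block_attn", "transpose_block_attn", "prev_block_attn"]
PRIME = ["prime_attn", "prime_attn", "dense_attn"]

# Equivalent compact form of the explicit 79-entry list: six triples, one
# cross-attention layer, then six groups of three triples plus cross-attention.
large_attention = TRIPLE * 6 + ["cross_attention"] + (TRIPLE * 3 + ["cross_attention"]) * 6


def large_separated_enc_dec_w_lyrics(layer):
    return large_attention[layer % 79]


def enc_dec_with_lyrics(layer):
    # Every 16th layer (15, 31, 47, ...) uses a prime/dense attention type.
    if layer % 16 == 15:
        return PRIME[layer % 3]
    return TRIPLE[layer % 3]


# Positions of the cross-attention layers in the large pattern.
cross_layers = [i for i, name in enumerate(large_attention) if name == "cross_attention"]

# Layers that deviate from the repeating triple in a 72-layer stack.
special_layers = {i: enc_dec_with_lyrics(i) for i in range(72) if i % 16 == 15}

print(len(large_attention), cross_layers)
print(special_layers)
```

The explicit list in the source file is easier to audit entry by entry; the compact form only makes the repetition structure visible.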
class JukeboxPriorConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`JukeboxPrior`]. It is used to instantiate a
`JukeboxPrior` according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the top level prior from the
[openai/jukebox-1b-lyrics](https://huggingface.co/openai/jukebox-1b-lyrics) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
act_fn (`str`, *optional*, defaults to `"quick_gelu"`):
Activation function.
alignment_head (`int`, *optional*, defaults to 2):
Head that is responsible for the alignment between lyrics and music. Only used to compute the lyric to audio
alignment.
alignment_layer (`int`, *optional*, defaults to 68):
Index of the layer that is responsible for the alignment between lyrics and music. Only used to compute the
lyric to audio alignment.
attention_multiplier (`float`, *optional*, defaults to 0.25):
Multiplier coefficient used to define the hidden dimension of the attention layers. 0.25 means that
0.25*width of the model will be used.
attention_pattern (`str`, *optional*, defaults to `"enc_dec_with_lyrics"`):
Which attention pattern to use for the decoder.
attn_dropout (`int`, *optional*, defaults to 0):
Dropout probability for the post-attention layer dropout in the decoder.
attn_res_scale (`bool`, *optional*, defaults to `False`):
Whether or not to scale the residuals in the attention conditioner block.
blocks (`int`, *optional*, defaults to 64):
Number of blocks used in the `block_attn`. A sequence of length seq_len is factored as `[blocks, seq_len //
blocks]` in the `JukeboxAttention` layer.
conv_res_scale (`int`, *optional*):
Whether or not to scale the residuals in the conditioner block. Since the top level prior does not have a
conditioner, the default value is `None` and should not be modified.
num_layers (`int`, *optional*, defaults to 72):
Number of layers of the transformer architecture.
emb_dropout (`int`, *optional*, defaults to 0):
Embedding dropout used in the lyric decoder.
encoder_config (`JukeboxPriorConfig`, *optional*):
Configuration of the encoder which models the prior on the lyrics.
encoder_loss_fraction (`float`, *optional*, defaults to 0.4):
Multiplication factor used in front of the lyric encoder loss.
hidden_size (`int`, *optional*, defaults to 2048):
Hidden dimension of the attention layers.
init_scale (`float`, *optional*, defaults to 0.2):
Initialization scales for the prior modules.
is_encoder_decoder (`bool`, *optional*, defaults to `True`):
Whether or not the prior is an encoder-decoder model. In case it is not, and `nb_relevant_lyric_tokens` is
greater than 0, the `encoder` args should be specified for the lyric encoding.
mask (`bool`, *optional*, defaults to `False`):
Whether or not to mask the previous positions in the attention.
max_duration (`int`, *optional*, defaults to 600):
Maximum supported duration of the generated song in seconds.
max_nb_genres (`int`, *optional*, defaults to 1):
Maximum number of genres that can be used to condition the model.
merged_decoder (`bool`, *optional*, defaults to `True`):
Whether or not the decoder and the encoder inputs are merged. This is used for the separated
encoder-decoder architecture
metadata_conditioning (`bool`, *optional*, defaults to `True`):
Whether or not to condition on the artist and genre metadata.
metadata_dims (`List[int]`, *optional*, defaults to `[604, 7898]`):
Number of genres and the number of artists that were used to train the embedding layers of the prior
models.
min_duration (`int`, *optional*, defaults to 0):
Minimum duration of the generated audio on which the model was trained.
mlp_multiplier (`float`, *optional*, defaults to 1.0):
Multiplier coefficient used to define the hidden dimension of the MLP layers. 0.25 means that 0.25*width of
the model will be used.
music_vocab_size (`int`, *optional*, defaults to 2048):
Number of different music tokens. Should be similar to the `JukeboxVQVAEConfig.nb_discrete_codes`.
n_ctx (`int`, *optional*, defaults to 6144):
Number of context tokens for each prior. The context tokens are the music tokens that are attended to when
generating music tokens.
n_heads (`int`, *optional*, defaults to 2):
Number of attention heads.
nb_relevant_lyric_tokens (`int`, *optional*, defaults to 384):
Number of lyric tokens that are used when sampling a single window of length `n_ctx`
res_conv_depth (`int`, *optional*, defaults to 3):
Depth of the `JukeboxDecoderConvBock` used to upsample the previously sampled audio in the
`JukeboxMusicTokenConditioner`.
res_conv_width (`int`, *optional*, defaults to 128):
Width of the `JukeboxDecoderConvBock` used to upsample the previously sampled audio in the
`JukeboxMusicTokenConditioner`.
res_convolution_multiplier (`int`, *optional*, defaults to 1):
Multiplier used to scale the `hidden_dim` of the `JukeboxResConv1DBlock`.
res_dilation_cycle (`int`, *optional*):
Dilation cycle used to define the `JukeboxMusicTokenConditioner`. Usually similar to the ones used in the
corresponding level of the VQVAE. The first prior does not use it as it is not conditioned on upper level
tokens.
res_dilation_growth_rate (`int`, *optional*, defaults to 1):
Dilation growth rate used between each convolutional block of the `JukeboxMusicTokenConditioner`.
res_downs_t (`List[int]`, *optional*, defaults to `[3, 2, 2]`):
Downsampling rates used in the audio conditioning network
res_strides_t (`List[int]`, *optional*, defaults to `[2, 2, 2]`):
Striding used in the audio conditioning network
resid_dropout (`int`, *optional*, defaults to 0):
Residual dropout used in the attention pattern.
sampling_rate (`int`, *optional*, defaults to 44100):
Sampling rate used for training.
spread (`int`, *optional*):
Spread used in the `summary_spread_attention` pattern
timing_dims (`int`, *optional*, defaults to 64):
Dimension of the timing embedding.
zero_out (`bool`, *optional*, defaults to `False`):
Whether or not to zero out convolution weights when initializing.
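Example (a minimal sketch of the `[blocks, seq_len // blocks]` factoring described for `blocks` above; the helper
name `factor_sequence` is hypothetical and not part of the library, and the real `JukeboxAttention` layer operates
on tensors rather than lists):

```python
def factor_sequence(tokens, blocks):
    # Split a flat sequence into `blocks` rows of `len(tokens) // blocks` items each.
    if len(tokens) % blocks != 0:
        raise ValueError("sequence length must be divisible by `blocks`")
    block_len = len(tokens) // blocks
    return [tokens[i * block_len : (i + 1) * block_len] for i in range(blocks)]


rows = factor_sequence(list(range(12)), blocks=4)
print(rows)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```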
"""
model_type = "jukebox_prior"
attribute_map = {
"max_position_embeddings": "n_positions",
"num_attention_heads": "n_head",
}
def __init__(
self,
act_fn="quick_gelu",
level=0,
alignment_head=2,
alignment_layer=68,
attention_multiplier=0.25,
attention_pattern="enc_dec_with_lyrics",
attn_dropout=0,
attn_res_scale=False,
blocks=64,
conv_res_scale=None,
num_layers=72,
emb_dropout=0,
encoder_config=None,
encoder_loss_fraction=0.4,
hidden_size=2048,
init_scale=0.2,
is_encoder_decoder=True,
lyric_vocab_size=80,
mask=False,
max_duration=600,
max_nb_genres=1,
merged_decoder=True,
metadata_conditioning=True,
metadata_dims=[604, 7898],
min_duration=0,
mlp_multiplier=1.0,
music_vocab_size=2048,
n_ctx=6144,
n_heads=2,
nb_relevant_lyric_tokens=384,
res_conv_depth=3,
res_conv_width=128,
res_convolution_multiplier=1,
res_dilation_cycle=None,
res_dilation_growth_rate=1,
res_downs_t=[3, 2, 2],
res_strides_t=[2, 2, 2],
resid_dropout=0,
sampling_rate=44100,
spread=None,
timing_dims=64,
zero_out=False,
**kwargs
):
self.act_fn = act_fn
self.alignment_head = alignment_head
self.alignment_layer = alignment_layer
self.attention_multiplier = attention_multiplier
self.attention_pattern = attention_pattern
self.attn_dropout = attn_dropout
self.attn_res_scale = attn_res_scale
self.blocks = blocks
self.conv_res_scale = conv_res_scale
self.num_layers = num_layers
self.emb_dropout = emb_dropout
self.music_vocab_size = music_vocab_size
if encoder_config is not None:
self.encoder_config = JukeboxPriorConfig(**encoder_config)
else:
self.encoder_config = None
self.encoder_loss_fraction = encoder_loss_fraction
self.init_scale = init_scale
self.is_encoder_decoder = is_encoder_decoder
self.lyric_vocab_size = lyric_vocab_size
self.level = level
self.mask = mask
self.max_duration = max_duration
self.max_nb_genres = max_nb_genres
self.merged_decoder = merged_decoder
self.metadata_conditioning = metadata_conditioning
self.metadata_dims = metadata_dims
self.min_duration = min_duration
self.mlp_multiplier = mlp_multiplier
self.n_ctx = n_ctx
self.n_heads = n_heads
self.nb_relevant_lyric_tokens = nb_relevant_lyric_tokens
self.res_conv_depth = res_conv_depth
self.res_conv_width = res_conv_width
self.res_convolution_multiplier = res_convolution_multiplier
self.res_dilation_cycle = res_dilation_cycle
self.res_dilation_growth_rate = res_dilation_growth_rate
self.res_downs_t = res_downs_t
self.res_strides_t = res_strides_t
self.resid_dropout = resid_dropout
self.sampling_rate = sampling_rate
self.spread = spread
self.timing_dims = timing_dims
self.hidden_size = hidden_size
self.zero_out = zero_out
@classmethod
def from_pretrained(
cls, pretrained_model_name_or_path: Union[str, os.PathLike], level=0, **kwargs
) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the prior config dict if we are loading from JukeboxConfig
if config_dict.get("model_type") == "jukebox":
config_dict = config_dict[f"prior_{level}"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["encoder_config"] = self.encoder_config.to_dict() if self.encoder_config is not None else None
output["model_type"] = self.__class__.model_type
return output
class JukeboxVQVAEConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`JukeboxVQVAE`]. It is used to instantiate a
`JukeboxVQVAE` according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the VQVAE from
[openai/jukebox-1b-lyrics](https://huggingface.co/openai/jukebox-1b-lyrics) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
act_fn (`str`, *optional*, defaults to `"relu"`):
Activation function of the model.
nb_discrete_codes (`int`, *optional*, defaults to 2048):
Number of codes of the VQVAE.
commit (`float`, *optional*, defaults to 0.02):
Commit loss multiplier.
conv_input_shape (`int`, *optional*, defaults to 1):
Number of audio channels.
conv_res_scale (`bool`, *optional*, defaults to `False`):
Whether or not to scale the residuals of the `JukeboxResConv1DBlock`.
embed_dim (`int`, *optional*, defaults to 64):
Embedding dimension of the codebook vectors.
hop_fraction (`List[float]`, *optional*, defaults to `[0.125, 0.5, 0.5]`):
Fraction of non-intersecting window used when continuing the sampling process.
levels (`int`, *optional*, defaults to 3):
Number of hierarchical levels used in the VQVAE.
lmu (`float`, *optional*, defaults to 0.99):
Used in the codebook update, exponential moving average coefficient. For more detail refer to Appendix A.1
of the original [VQVAE paper](https://arxiv.org/pdf/1711.00937v2.pdf)
multipliers (`List[int]`, *optional*, defaults to `[2, 1, 1]`):
Depth and width multipliers used for each level. Used on the `res_conv_width` and `res_conv_depth`
res_conv_depth (`int`, *optional*, defaults to 4):
Depth of the encoder and decoder block. If no `multipliers` are used, this is the same for each level.
res_conv_width (`int`, *optional*, defaults to 32):
Width of the encoder and decoder block. If no `multipliers` are used, this is the same for each level.
res_convolution_multiplier (`int`, *optional*, defaults to 1):
Scaling factor of the hidden dimension used in the `JukeboxResConv1DBlock`.
res_dilation_cycle (`int`, *optional*):
Dilation cycle value used in the `JukeboxResnet`. If an int is used, each new Conv1 block will have a depth
reduced by a power of `res_dilation_cycle`.
res_dilation_growth_rate (`int`, *optional*, defaults to 3):
Resnet dilation growth rate used in the VQVAE (dilation_growth_rate ** depth)
res_downs_t (`List[int]`, *optional*, defaults to `[3, 2, 2]`):
Downsampling rate for each level of the hierarchical VQ-VAE.
res_strides_t (`List[int]`, *optional*, defaults to `[2, 2, 2]`):
Stride used for each level of the hierarchical VQ-VAE.
sample_length (`int`, *optional*, defaults to 1058304):
Provides the max input shape of the VQVAE. Is used to compute the input shape of each level.
init_scale (`float`, *optional*, defaults to 0.2):
Initialization scale.
zero_out (`bool`, *optional*, defaults to `False`):
Whether or not to zero out convolution weights when initializing.
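Example (a rough sketch of how `multipliers` could combine with `res_conv_width` and `res_conv_depth`, assuming
each level simply scales both values by its multiplier; the helper name `per_level_shapes` is hypothetical and not
part of the library):

```python
def per_level_shapes(res_conv_width, res_conv_depth, multipliers):
    # Assumption: each level scales both the width and the depth by its multiplier.
    return [(res_conv_width * m, res_conv_depth * m) for m in multipliers]


# Defaults documented above: width 32, depth 4, multipliers [2, 1, 1].
print(per_level_shapes(32, 4, [2, 1, 1]))  # [(64, 8), (32, 4), (32, 4)]
```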
"""
model_type = "jukebox_vqvae"
def __init__(
self,
act_fn="relu",
nb_discrete_codes=2048,
commit=0.02,
conv_input_shape=1,
conv_res_scale=False,
embed_dim=64,
hop_fraction=[0.125, 0.5, 0.5],
levels=3,
lmu=0.99,
multipliers=[2, 1, 1],
res_conv_depth=4,
res_conv_width=32,
res_convolution_multiplier=1,
res_dilation_cycle=None,
res_dilation_growth_rate=3,
res_downs_t=[3, 2, 2],
res_strides_t=[2, 2, 2],
sample_length=1058304,
init_scale=0.2,
zero_out=False,
**kwargs
):
self.hop_fraction = hop_fraction
self.conv_input_shape = conv_input_shape
self.sample_length = sample_length
# VQVAE parameters (all used)
self.levels = levels
self.embed_dim = embed_dim
self.nb_discrete_codes = nb_discrete_codes
self.res_conv_width = res_conv_width
self.res_conv_depth = res_conv_depth
self.res_convolution_multiplier = res_convolution_multiplier
self.res_dilation_growth_rate = res_dilation_growth_rate
self.res_dilation_cycle = res_dilation_cycle
self.multipliers = multipliers
self.res_downs_t = res_downs_t
self.res_strides_t = res_strides_t
self.lmu = lmu
self.commit = commit
self.conv_res_scale = conv_res_scale
self.act_fn = act_fn
self.init_scale = init_scale
self.zero_out = zero_out
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the vqvae config dict if we are loading from JukeboxConfig
if config_dict.get("model_type") == "jukebox":
config_dict = config_dict["vqvae_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class JukeboxConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`JukeboxModel`].
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information. Instantiating a configuration with the defaults will
yield a similar configuration to that of
[openai/jukebox-1b-lyrics](https://huggingface.co/openai/jukebox-1b-lyrics) architecture.
The downsampling and stride are used to determine downsampling of the input sequence. For example, downsampling =
(5, 3) and strides = (2, 2) will downsample the audio by 2**5 = 32 to get the first level of codes, and 2**8 = 256
to get the second level codes. This is mostly true for training the top level prior and the upsamplers.
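The arithmetic above can be sketched as follows (an illustrative helper only; `cumulative_downsampling` is a
hypothetical name and not part of the library):

```python
from itertools import accumulate


def cumulative_downsampling(downs_t, strides_t):
    # Each level downsamples by stride ** down; successive levels compose multiplicatively.
    per_level = [stride**down for stride, down in zip(strides_t, downs_t)]
    return list(accumulate(per_level, lambda a, b: a * b))


print(cumulative_downsampling((5, 3), (2, 2)))  # [32, 256]
```

With `res_downs_t=[3, 2, 2]` and `res_strides_t=[2, 2, 2]`, the same helper gives `[8, 32, 128]`.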
Args:
vqvae_config (`JukeboxVQVAEConfig`, *optional*):
Configuration for the `JukeboxVQVAE` model.
prior_config_list (`List[JukeboxPriorConfig]`, *optional*):
List of the configs for each of the `JukeboxPrior` of the model. The original architecture uses 3 priors.
nb_priors (`int`, *optional*, defaults to 3):
Number of prior models that will sequentially sample tokens. Each prior is a conditional autoregressive
(decoder) model, apart from the top prior, which can include a lyric encoder. The available models were
trained using a top prior and 2 upsampler priors.
sampling_rate (`int`, *optional*, defaults to 44100):
Sampling rate of the raw audio.
timing_dims (`int`, *optional*, defaults to 64):
Dimensions of the JukeboxRangeEmbedding layer which is equivalent to traditional positional embedding
layer. The timing embedding layer converts the absolute and relative position in the currently sampled
audio to a tensor of length `timing_dims` that will be added to the music tokens.
min_duration (`int`, *optional*, defaults to 0):
Minimum duration of the audios to generate
max_duration (`float`, *optional*, defaults to 600.0):
Maximum duration of the audios to generate
max_nb_genres (`int`, *optional*, defaults to 5):
Maximum number of genres that can be used to condition a single sample.
metadata_conditioning (`bool`, *optional*, defaults to `True`):
Whether or not to use metadata conditioning, corresponding to the artist, the genre and the min/maximum
duration.
init_std (`float`, *optional*, defaults to 0.2):
Standard deviation used to initialize the model.
Example:
```python
>>> from transformers import JukeboxModel, JukeboxConfig
>>> # Initializing a Jukebox configuration
>>> configuration = JukeboxConfig()
>>> # Initializing a model from the configuration
>>> model = JukeboxModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "jukebox"
is_composition = True
def __init__(
self,
vqvae_config=None,
prior_config_list=None,
nb_priors=3,
sampling_rate=44100,
timing_dims=64,
min_duration=0,
max_duration=600.0,
max_nb_genres=5,
metadata_conditioning=True,
init_std=0.2,
**kwargs,
):
if vqvae_config is None:
vqvae_config = {}
logger.info("vqvae_config is None. initializing the JukeboxVQVAE with default values.")
self.vqvae_config = JukeboxVQVAEConfig(**vqvae_config)
if prior_config_list is not None:
self.prior_configs = [JukeboxPriorConfig(**prior_config) for prior_config in prior_config_list]
else:
self.prior_configs = []
for prior_idx in range(nb_priors):
prior_config = kwargs.pop(f"prior_{prior_idx}", None)
if prior_config is None:
prior_config = {}
logger.info(
f"prior_{prior_idx}'s config is None. Initializing the JukeboxPriorConfig list with default"
" values."
)
self.prior_configs.append(JukeboxPriorConfig(**prior_config))
self.hop_fraction = self.vqvae_config.hop_fraction
self.init_std = init_std
self.nb_priors = nb_priors
# Metadata conditioning
self.max_nb_genres = max_nb_genres
self.sampling_rate = sampling_rate
self.timing_dims = timing_dims
self.min_duration = min_duration
self.max_duration = max_duration
self.metadata_conditioning = metadata_conditioning
super().__init__(**kwargs)
@classmethod
def from_configs(cls, prior_configs: List[JukeboxPriorConfig], vqvae_config: JukeboxVQVAEConfig, **kwargs):
r"""
Instantiate a [`JukeboxConfig`] (or a derived class) from a list of prior configurations and a VQ-VAE
configuration.
Returns:
[`JukeboxConfig`]: An instance of a configuration object
"""
prior_config_list = [config.to_dict() for config in prior_configs]
# pass the VQ-VAE config under the `vqvae_config` name expected by `__init__`
return cls(prior_config_list=prior_config_list, vqvae_config=vqvae_config.to_dict(), **kwargs)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
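Example (a minimal sketch of the flattening performed here, using plain dictionaries instead of configuration
objects; `flatten_priors` and the toy keys are hypothetical):

```python
import copy


def flatten_priors(config_dict):
    # Work on a copy so the caller's dictionary is left untouched.
    output = copy.deepcopy(config_dict)
    # Each entry of the list becomes its own top-level `prior_{i}` key.
    for i, prior in enumerate(output.pop("prior_configs")):
        output[f"prior_{i}"] = prior
    return output


nested = {"nb_priors": 2, "prior_configs": [{"level": 0}, {"level": 1}]}
print(flatten_priors(nested))
```

This layout is what lets `JukeboxPriorConfig.from_pretrained` pick a single prior via `config_dict[f"prior_{level}"]`.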
"""
output = copy.deepcopy(self.__dict__)
for i, config in enumerate(output.pop("prior_configs")):
output[f"prior_{i}"] = config.to_dict()
output["vqvae_config"] = self.vqvae_config.to_dict()
output["model_type"] = self.__class__.model_type
return output

@ -0,0 +1,280 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert Jukebox checkpoints"""
import argparse
import json
import os
from pathlib import Path
import torch
import requests
from transformers import JukeboxConfig, JukeboxModel
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
PREFIX = "https://openaipublic.azureedge.net/jukebox/models/"
MODEL_MAPPING = {
"jukebox-1b-lyrics": [
"5b/vqvae.pth.tar",
"5b/prior_level_0.pth.tar",
"5b/prior_level_1.pth.tar",
"1b_lyrics/prior_level_2.pth.tar",
],
"jukebox-5b-lyrics": [
"5b/vqvae.pth.tar",
"5b/prior_level_0.pth.tar",
"5b/prior_level_1.pth.tar",
"5b_lyrics/prior_level_2.pth.tar",
],
}
def replace_key(key):
if key.endswith(".model.1.bias") and len(key.split(".")) > 10:
key = key.replace(".model.1.bias", ".conv1d_1.bias")
elif key.endswith(".model.1.weight") and len(key.split(".")) > 10:
key = key.replace(".model.1.weight", ".conv1d_1.weight")
elif key.endswith(".model.3.bias") and len(key.split(".")) > 10:
key = key.replace(".model.3.bias", ".conv1d_2.bias")
elif key.endswith(".model.3.weight") and len(key.split(".")) > 10:
key = key.replace(".model.3.weight", ".conv1d_2.weight")
if "conditioner_blocks.0." in key:
key = key.replace("conditioner_blocks.0", "conditioner_blocks")
if "prime_prior" in key:
key = key.replace("prime_prior", "encoder")
if ".emb." in key and "total" not in key and "absolute" not in key and "relative" not in key:
key = key.replace(".emb.", ".")
if key.endswith("k"): # replace vqvae.X.k with vqvae.X.codebook
return key.replace(".k", ".codebook")
if "y_emb." in key:
return key.replace("y_emb.", "metadata_embedding.")
if "x_emb.emb." in key:
key = key.replace("0.x_emb.emb", "embed_tokens")
if "prime_state_ln" in key:
return key.replace("prime_state_ln", "encoder.final_layer_norm")
if ".ln" in key:
return key.replace(".ln", ".layer_norm")
if "_ln" in key:
return key.replace("_ln", "_layer_norm")
if "prime_state_proj" in key:
return key.replace("prime_state_proj", "encoder.proj_in")
if "prime_x_out" in key:
return key.replace("prime_x_out", "encoder.lm_head")
if "prior.x_out" in key:
return key.replace("x_out", "fc_proj_out")
if "x_emb" in key:
return key.replace("x_emb", "embed_tokens")
return key
def fix_jukebox_keys(state_dict, model_state_dict, key_prefix, mapping):
new_dict = {}
import re
# raw strings avoid invalid escape sequence warnings for `\d`
re_encoder_block_conv_in = re.compile(r"encoders.(\d*).level_blocks.(\d*).model.(\d*).(\d).(bias|weight)")
re_encoder_block_resnet = re.compile(
r"encoders.(\d*).level_blocks.(\d*).model.(\d*).(\d).model.(\d*).model.(\d*).(bias|weight)"
)
re_encoder_block_proj_out = re.compile(r"encoders.(\d*).level_blocks.(\d*).model.(\d*).(bias|weight)")
re_decoder_block_conv_out = re.compile(r"decoders.(\d*).level_blocks.(\d*).model.(\d*).(\d).(bias|weight)")
re_decoder_block_resnet = re.compile(
r"decoders.(\d*).level_blocks.(\d*).model.(\d*).(\d).model.(\d*).model.(\d*).(bias|weight)"
)
re_decoder_block_proj_in = re.compile(r"decoders.(\d*).level_blocks.(\d*).model.(\d*).(bias|weight)")
re_prior_cond_conv_out = re.compile(r"conditioner_blocks.(\d*).cond.model.(\d*).(\d).(bias|weight)")
re_prior_cond_resnet = re.compile(
r"conditioner_blocks.(\d*).cond.model.(\d*).(\d).model.(\d*).model.(\d*).(bias|weight)"
)
re_prior_cond_proj_in = re.compile(r"conditioner_blocks.(\d*).cond.model.(\d*).(bias|weight)")
for original_key, value in state_dict.items():
# rename vqvae.encoder keys
if re_encoder_block_conv_in.fullmatch(original_key):
regex_match = re_encoder_block_conv_in.match(original_key)
groups = regex_match.groups()
block_index = int(groups[2]) * 2 + int(groups[3])
re_new_key = f"encoders.{groups[0]}.level_blocks.{groups[1]}.downsample_block.{block_index}.{groups[-1]}"
key = re_encoder_block_conv_in.sub(re_new_key, original_key)
elif re_encoder_block_resnet.fullmatch(original_key):
regex_match = re_encoder_block_resnet.match(original_key)
groups = regex_match.groups()
block_index = int(groups[2]) * 2 + int(groups[3])
conv_index = {"1": 1, "3": 2}[groups[-2]]
prefix = f"encoders.{groups[0]}.level_blocks.{groups[1]}.downsample_block.{block_index}."
resnet_block = f"resnet_block.{groups[-3]}.conv1d_{conv_index}.{groups[-1]}"
re_new_key = prefix + resnet_block
key = re_encoder_block_resnet.sub(re_new_key, original_key)
elif re_encoder_block_proj_out.fullmatch(original_key):
regex_match = re_encoder_block_proj_out.match(original_key)
groups = regex_match.groups()
re_new_key = f"encoders.{groups[0]}.level_blocks.{groups[1]}.proj_out.{groups[-1]}"
key = re_encoder_block_proj_out.sub(re_new_key, original_key)
# rename vqvae.decoder keys
elif re_decoder_block_conv_out.fullmatch(original_key):
regex_match = re_decoder_block_conv_out.match(original_key)
groups = regex_match.groups()
block_index = int(groups[2]) * 2 + int(groups[3]) - 2
re_new_key = f"decoders.{groups[0]}.level_blocks.{groups[1]}.upsample_block.{block_index}.{groups[-1]}"
key = re_decoder_block_conv_out.sub(re_new_key, original_key)
elif re_decoder_block_resnet.fullmatch(original_key):
regex_match = re_decoder_block_resnet.match(original_key)
groups = regex_match.groups()
block_index = int(groups[2]) * 2 + int(groups[3]) - 2
conv_index = {"1": 1, "3": 2}[groups[-2]]
prefix = f"decoders.{groups[0]}.level_blocks.{groups[1]}.upsample_block.{block_index}."
resnet_block = f"resnet_block.{groups[-3]}.conv1d_{conv_index}.{groups[-1]}"
re_new_key = prefix + resnet_block
key = re_decoder_block_resnet.sub(re_new_key, original_key)
elif re_decoder_block_proj_in.fullmatch(original_key):
regex_match = re_decoder_block_proj_in.match(original_key)
groups = regex_match.groups()
re_new_key = f"decoders.{groups[0]}.level_blocks.{groups[1]}.proj_in.{groups[-1]}"
key = re_decoder_block_proj_in.sub(re_new_key, original_key)
# rename prior cond.model to upsampler.upsample_block and resnet
elif re_prior_cond_conv_out.fullmatch(original_key):
regex_match = re_prior_cond_conv_out.match(original_key)
groups = regex_match.groups()
block_index = int(groups[1]) * 2 + int(groups[2]) - 2
re_new_key = f"conditioner_blocks.upsampler.upsample_block.{block_index}.{groups[-1]}"
key = re_prior_cond_conv_out.sub(re_new_key, original_key)
elif re_prior_cond_resnet.fullmatch(original_key):
regex_match = re_prior_cond_resnet.match(original_key)
groups = regex_match.groups()
block_index = int(groups[1]) * 2 + int(groups[2]) - 2
conv_index = {"1": 1, "3": 2}[groups[-2]]
prefix = f"conditioner_blocks.upsampler.upsample_block.{block_index}."
resnet_block = f"resnet_block.{groups[-3]}.conv1d_{conv_index}.{groups[-1]}"
re_new_key = prefix + resnet_block
key = re_prior_cond_resnet.sub(re_new_key, original_key)
elif re_prior_cond_proj_in.fullmatch(original_key):
regex_match = re_prior_cond_proj_in.match(original_key)
groups = regex_match.groups()
re_new_key = f"conditioner_blocks.upsampler.proj_in.{groups[-1]}"
key = re_prior_cond_proj_in.sub(re_new_key, original_key)
# keep original key
else:
key = original_key
key = replace_key(key)
if f"{key_prefix}.{key}" not in model_state_dict or key is None:
print(f"failed converting {original_key} to {key}, does not match")
# handle mismatched shape
elif value.shape != model_state_dict[f"{key_prefix}.{key}"].shape:
val = model_state_dict[f"{key_prefix}.{key}"]
print(f"{original_key} -> {key}:\nshape {val.shape} and {value.shape} do not match")
key = original_key
mapping[key] = original_key
new_dict[key] = value
return new_dict
@torch.no_grad()
def convert_openai_checkpoint(model_name=None, pytorch_dump_folder_path=None):
"""
Copy/paste/tweak model's weights to our Jukebox structure.
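As a minimal sketch of the intended suffix renaming applied to the original checkpoint keys (`.b` to `.bias` and
`.w` to `.weight`), assuming only the trailing component should change; the helper name `rename_suffix` and the
example keys are hypothetical:

```python
def rename_suffix(key):
    # Only the trailing component is renamed; the rest of the key is left untouched.
    if key.endswith(".b"):
        return key[: -len("b")] + "bias"
    if key.endswith(".w"):
        return key[: -len("w")] + "weight"
    return key


print(rename_suffix("blocks.0.c_attn.b"))  # blocks.0.c_attn.bias
print(rename_suffix("width.c_proj.w"))  # width.c_proj.weight
```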
"""
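# strip any organization prefix (e.g. "openai/") so that the lookup matches the one used below
for file in MODEL_MAPPING[model_name.split("/")[-1]]: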
if not os.path.isfile(f"{pytorch_dump_folder_path}/{file.split('/')[-1]}"):
r = requests.get(f"{PREFIX}{file}", allow_redirects=True)
os.makedirs(f"{pytorch_dump_folder_path}/", exist_ok=True)
open(f"{pytorch_dump_folder_path}/{file.split('/')[-1]}", "wb").write(r.content)
model_to_convert = MODEL_MAPPING[model_name.split("/")[-1]]
config = JukeboxConfig.from_pretrained(model_name)
model = JukeboxModel(config)
weight_dict = []
mapping = {}
for i, dict_name in enumerate(model_to_convert):
old_dic = torch.load(f"{pytorch_dump_folder_path}/{dict_name.split('/')[-1]}")["model"]
new_dic = {}
for k in old_dic.keys():
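if k.endswith(".b"):
# rename only the trailing ".b" so that other "b" characters in the key are preserved
new_dic[k[:-1] + "bias"] = old_dic[k]
elif k.endswith(".w"):
# rename only the trailing ".w" so that other "w" characters in the key are preserved
new_dic[k[:-1] + "weight"] = old_dic[k]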
elif "level_2" not in dict_name and "cond.model." in k:
new_dic[k.replace(".blocks.", ".model.")] = old_dic[k]
else:
new_dic[k] = old_dic[k]
key_prefix = "vqvae" if i == 0 else f"priors.{3 - i}"
new_dic = fix_jukebox_keys(new_dic, model.state_dict(), key_prefix, mapping)
weight_dict.append(new_dic)
vqvae_state_dict = weight_dict.pop(0)
model.vqvae.load_state_dict(vqvae_state_dict)
for i in range(len(weight_dict)):
model.priors[i].load_state_dict(weight_dict[2 - i])
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
with open(f"{pytorch_dump_folder_path}/mapping.json", "w") as txtfile:
json.dump(mapping, txtfile)
print(f"Saving model {model_name} to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
return weight_dict
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--model_name",
default="jukebox-5b-lyrics",
type=str,
help="Name of the model you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path",
default="jukebox-5b-lyrics-converted",
type=str,
help="Path to the output PyTorch model directory.",
)
args = parser.parse_args()
convert_openai_checkpoint(args.model_name, args.pytorch_dump_folder_path)

File diff suppressed because it is too large Load Diff

@ -0,0 +1,424 @@
# coding=utf-8
# Copyright 2022 The Open AI Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for OpenAI Jukebox."""
import json
import os
import re
import unicodedata
from json.encoder import INFINITY
from typing import Any, Dict, List, Optional, Tuple, Union
import numpy as np
import regex
from transformers.utils.generic import _is_jax, _is_numpy
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...tokenization_utils_base import BatchEncoding
from ...utils import TensorType, is_flax_available, is_tf_available, is_torch_available, logging
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {
"artists_file": "artists.json",
"lyrics_file": "lyrics.json",
"genres_file": "genres.json",
}
PRETRAINED_VOCAB_FILES_MAP = {
"artists_file": {
"jukebox": "https://huggingface.co/ArthurZ/jukebox/blob/main/artists.json",
},
"genres_file": {
"jukebox": "https://huggingface.co/ArthurZ/jukebox/blob/main/genres.json",
},
"lyrics_file": {
"jukebox": "https://huggingface.co/ArthurZ/jukebox/blob/main/lyrics.json",
},
}
PRETRAINED_LYRIC_TOKENS_SIZES = {
"jukebox": 512,
}
class JukeboxTokenizer(PreTrainedTokenizer):
"""
Constructs a Jukebox tokenizer. Jukebox can be conditioned on 3 different inputs:
- Artists, unique ids are associated to each artist from the provided dictionary.
- Genres, unique ids are associated to each genre from the provided dictionary.
- Lyrics, character based tokenization. Must be initialized with the list of characters that are inside the
vocabulary.
This tokenizer does not require training. It should be able to process a different number of inputs, depending on
the number of genres on which the model should be conditioned (`n_genres`), since the model can be conditioned on
any of the three inputs. If `None` is provided, default values will be used.
```
>>> from transformers import JukeboxTokenizer
>>> tokenizer = JukeboxTokenizer.from_pretrained("openai/jukebox-1b-lyrics")
>>> tokenizer("Alan Jackson", "Country Rock", "old town road")['input_ids']
[tensor([[ 0, 0, 0, 6785, 546, 41, 38, 30, 76, 46, 41, 49,
40, 76, 44, 41, 27, 30]]), tensor([[ 0, 0, 0, 145, 0]]), tensor([[ 0, 0, 0, 145, 0]])]
```
You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
<Tip>
If nothing is provided, the genres and the artist will either be selected randomly or set to None
</Tip>
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Note that the code only supports composing from various genres.
Args:
artists_file (`str`):
Path to the vocabulary file which contains a mapping between artists and ids. The default file supports
both "v2" and "v3"
genres_file (`str`):
Path to the vocabulary file which contains a mapping between genres and ids.
lyrics_file (`str`):
Path to the vocabulary file which contains the accepted characters for the lyrics tokenization.
version (`List[str]`, *optional*, defaults to `["v3", "v2", "v2"]`):
List of the tokenizer versions. The `5b-lyrics`'s top level prior model was trained using `v3` instead of
`v2`.
n_genres (`int`, *optional*, defaults to 5):
Maximum number of genres to use for composition.
max_n_lyric_tokens (`int`, *optional*, defaults to 512):
Maximum number of lyric tokens to keep.
unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_lyric_input_size = PRETRAINED_LYRIC_TOKENS_SIZES
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
artists_file,
genres_file,
lyrics_file,
version=["v3", "v2", "v2"],
max_n_lyric_tokens=512,
n_genres=5,
unk_token="<|endoftext|>",
**kwargs
):
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
super().__init__(
unk_token=unk_token,
n_genres=n_genres,
version=version,
max_n_lyric_tokens=max_n_lyric_tokens,
**kwargs,
)
self.version = version
self.max_n_lyric_tokens = max_n_lyric_tokens
self.n_genres = n_genres
with open(artists_file, encoding="utf-8") as vocab_handle:
self.artists_encoder = json.load(vocab_handle)
with open(genres_file, encoding="utf-8") as vocab_handle:
self.genres_encoder = json.load(vocab_handle)
with open(lyrics_file, encoding="utf-8") as vocab_handle:
self.lyrics_encoder = json.load(vocab_handle)
oov = "[^A-Za-z0-9.,:;!?\-'\"()\[\] \t\n]+"
# In v2, we had a n_vocab=80 and in v3 we missed + and so n_vocab=79 of characters.
if len(self.lyrics_encoder) == 79:
oov = oov.replace("\-'", "\-+'")
self.out_of_vocab = regex.compile(oov)
self.artists_decoder = {v: k for k, v in self.artists_encoder.items()}
self.genres_decoder = {v: k for k, v in self.genres_encoder.items()}
self.lyrics_decoder = {v: k for k, v in self.lyrics_encoder.items()}
@property
def vocab_size(self):
return len(self.artists_encoder) + len(self.genres_encoder) + len(self.lyrics_encoder)
def get_vocab(self):
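# `dict()` accepts at most one positional mapping, so the three encoders are returned under separate keys.
return {
"artists_encoder": self.artists_encoder,
"genres_encoder": self.genres_encoder,
"lyrics_encoder": self.lyrics_encoder,
}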
def _convert_token_to_id(self, list_artists, list_genres, list_lyrics):
"""Converts the artist, genre and lyrics tokens to their index using the vocabulary.
Unknown artists, genres and characters are mapped to id 0, and each list of genre ids is padded with -1 up to
`n_genres`. Only the first entry of `list_lyrics` is converted; the other two lyric sequences are left empty.
"""
artists_id = [self.artists_encoder.get(artist, 0) for artist in list_artists]
for genres in range(len(list_genres)):
list_genres[genres] = [self.genres_encoder.get(genre, 0) for genre in list_genres[genres]]
list_genres[genres] = list_genres[genres] + [-1] * (self.n_genres - len(list_genres[genres]))
lyric_ids = [[self.lyrics_encoder.get(character, 0) for character in list_lyrics[0]], [], []]
return artists_id, list_genres, lyric_ids
def _tokenize(self, lyrics):
"""
Converts a string into a sequence of tokens (strings), using the tokenizer. Splits into words for word-based
vocabularies or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
Does NOT take care of added tokens. Only the lyrics are split into characters for the character-based vocabulary.
"""
# Only the lyrics are tokenized here, by splitting them into characters.
return [character for character in lyrics]
def tokenize(self, artist, genre, lyrics, **kwargs):
"""
Converts the artist, genre and lyrics strings into three sequences of tokens using the tokenizer.
"""
artist, genre, lyrics = self.prepare_for_tokenization(artist, genre, lyrics)
lyrics = self._tokenize(lyrics)
return artist, genre, lyrics
def prepare_for_tokenization(
self, artists: str, genres: str, lyrics: str, is_split_into_words: bool = False
) -> Tuple[str, str, str, Dict[str, Any]]:
"""
Performs any necessary transformations before tokenization.
This method should pop the arguments from kwargs and return the remaining `kwargs` as well. We test the
`kwargs` at the end of the encoding process to be sure all the arguments have been used.
Args:
artists (`str`):
The artist name to prepare. This will mostly lowercase the string.
genres (`str`):
The genre name to prepare. This will mostly lowercase the string.
lyrics (`str`):
The lyrics to prepare.
is_split_into_words (`bool`, *optional*, defaults to `False`):
Whether or not the input is already pre-tokenized (e.g., split into words). If set to `True`, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
kwargs:
Keyword arguments to use for the tokenization.
"""
for idx in range(len(self.version)):
if self.version[idx] == "v3":
artists[idx] = artists[idx].lower()
genres[idx] = [genres[idx].lower()]
else:
artists[idx] = self._normalize(artists[idx]) + ".v2"
genres[idx] = [
self._normalize(genre) + ".v2" for genre in genres[idx].split("_")
] # split is for the full dictionary with combined genres
if self.version[0] == "v2":
self.out_of_vocab = regex.compile("[^A-Za-z0-9.,:;!?\-'\"()\[\] \t\n]+")
vocab = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.,:;!?-+'\"()[] \t\n"
self.vocab = {vocab[index]: index + 1 for index in range(len(vocab))}
self.vocab["<unk>"] = 0
self.n_vocab = len(vocab) + 1
self.lyrics_encoder = self.vocab
self.lyrics_decoder = {v: k for k, v in self.vocab.items()}
self.lyrics_decoder[0] = ""
else:
self.out_of_vocab = regex.compile("[^A-Za-z0-9.,:;!?\-+'\"()\[\] \t\n]+")
lyrics = self._run_strip_accents(lyrics)
lyrics = lyrics.replace("\\", "\n")
lyrics = self.out_of_vocab.sub("", lyrics), [], []
return artists, genres, lyrics
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _normalize(self, text: str) -> str:
"""
Normalizes the input text. This process is for the genres and the artist
Args:
text (`str`):
Artist or Genre string to normalize
"""
accepted = (
[chr(i) for i in range(ord("a"), ord("z") + 1)]
+ [chr(i) for i in range(ord("A"), ord("Z") + 1)]
+ [chr(i) for i in range(ord("0"), ord("9") + 1)]
+ ["."]
)
accepted = frozenset(accepted)
pattern = re.compile(r"_+")
text = "".join([c if c in accepted else "_" for c in text.lower()])
text = pattern.sub("_", text).strip("_")
return text
def convert_lyric_tokens_to_string(self, lyrics: List[str]) -> str:
return " ".join(lyrics)
def convert_to_tensors(
self, inputs, tensor_type: Optional[Union[str, TensorType]] = None, prepend_batch_axis: bool = False
):
"""
Convert the inner content to tensors.
Args:
tensor_type (`str` or [`~utils.TensorType`], *optional*):
The type of tensors to use. If `str`, should be one of the values of the enum [`~utils.TensorType`]. If
unset, no modification is done.
prepend_batch_axis (`bool`, *optional*, defaults to `False`):
Whether or not to add the batch dimension during the conversion.
"""
# Convert to TensorType
if not isinstance(tensor_type, TensorType):
tensor_type = TensorType(tensor_type)
# Get a function reference for the correct framework
if tensor_type == TensorType.TENSORFLOW:
if not is_tf_available():
raise ImportError(
"Unable to convert output to TensorFlow tensors format, TensorFlow is not installed."
)
import tensorflow as tf
as_tensor = tf.constant
is_tensor = tf.is_tensor
elif tensor_type == TensorType.PYTORCH:
if not is_torch_available():
raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
import torch
as_tensor = torch.tensor
is_tensor = torch.is_tensor
elif tensor_type == TensorType.JAX:
if not is_flax_available():
raise ImportError("Unable to convert output to JAX tensors format, JAX is not installed.")
import jax.numpy as jnp # noqa: F811
as_tensor = jnp.array
is_tensor = _is_jax
else:
as_tensor = np.asarray
is_tensor = _is_numpy
# Do the tensor conversion in batch
try:
if prepend_batch_axis:
inputs = [inputs]
if not is_tensor(inputs):
inputs = as_tensor(inputs)
except: # noqa E722
raise ValueError(
"Unable to create tensor, you should probably activate truncation and/or padding "
"with 'padding=True' 'truncation=True' to have batched tensors with the same length."
)
return inputs
def __call__(self, artist, genres, lyrics="", return_tensors="pt") -> BatchEncoding:
"""Convert the raw string to a list of token ids
Args:
artist (`str`):
Name of the artist.
genres (`str`):
List of genres that will be mixed to condition the audio.
lyrics (`str`, *optional*, defaults to `""`):
Lyrics used to condition the generation.
"""
input_ids = [0, 0, 0]
artist = [artist] * len(self.version)
genres = [genres] * len(self.version)
artists_tokens, genres_tokens, lyrics_tokens = self.tokenize(artist, genres, lyrics)
artists_id, genres_ids, full_tokens = self._convert_token_to_id(artists_tokens, genres_tokens, lyrics_tokens)
attention_masks = [-INFINITY] * len(full_tokens[-1])
input_ids = [
self.convert_to_tensors(
[input_ids + [artists_id[i]] + genres_ids[i] + full_tokens[i]], tensor_type=return_tensors
)
for i in range(len(self.version))
]
return BatchEncoding({"input_ids": input_ids, "attention_masks": attention_masks})
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
"""
Saves the tokenizer's vocabulary dictionary to the provided save_directory.
Args:
save_directory (`str`):
Path to the directory in which to save the vocabulary files. It must already exist; otherwise an error is
logged and nothing is saved.
filename_prefix (`Optional[str]`, *optional*):
A prefix to add to the names of the files saved by the tokenizer.
"""
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
return
artists_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["artists_file"]
)
with open(artists_file, "w", encoding="utf-8") as f:
f.write(json.dumps(self.artists_encoder, ensure_ascii=False))
genres_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["genres_file"]
)
with open(genres_file, "w", encoding="utf-8") as f:
f.write(json.dumps(self.genres_encoder, ensure_ascii=False))
lyrics_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["lyrics_file"]
)
with open(lyrics_file, "w", encoding="utf-8") as f:
f.write(json.dumps(self.lyrics_encoder, ensure_ascii=False))
return (artists_file, genres_file, lyrics_file)
def _convert_id_to_token(self, artists_index, genres_index, lyric_index):
"""
Converts indices (integers) into tokens (str) using the vocab.
Args:
artists_index (`int`):
Index of the artist in its corresponding dictionary.
genres_index (`Union[List[int], int]`):
Index of the genre in its corresponding dictionary.
lyric_index (`List[int]`):
List of character indices, which each correspond to a character.
"""
artist = self.artists_decoder.get(artists_index)
genres = [self.genres_decoder.get(genre) for genre in genres_index]
lyrics = [self.lyrics_decoder.get(character) for character in lyric_index]
return artist, genres, lyrics
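
As a review aid, the text preprocessing and input layout implemented by the tokenizer above can be sketched in plain Python. This is a simplified, standalone approximation rather than the library code: the helper names `strip_accents`, `clean_lyrics`, `normalize_name` and `build_input_ids` are hypothetical, the regular expression is the variant without `+`, and the ids in the usage note are copied from the 5b tokenizer test in this commit.

```python
import re
import unicodedata

# Characters outside this set are dropped from the lyrics (variant without "+").
LYRIC_OOV = re.compile(r"[^A-Za-z0-9.,:;!?\-'\"()\[\] \t\n]+")


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_lyrics(lyrics):
    """Strip accents, turn backslashes into newlines, remove out-of-vocabulary characters."""
    lyrics = strip_accents(lyrics)
    lyrics = lyrics.replace("\\", "\n")
    return LYRIC_OOV.sub("", lyrics)


def normalize_name(text):
    """Lowercase an artist or genre name and collapse other characters into single underscores."""
    accepted = set("abcdefghijklmnopqrstuvwxyz0123456789.")
    replaced = "".join(c if c in accepted else "_" for c in text.lower())
    return re.sub(r"_+", "_", replaced).strip("_")


def build_input_ids(artist_id, genre_ids, lyric_ids, n_genres=5):
    """Lay out one sequence: three leading zeros, artist id, padded genre ids, lyric ids."""
    padded_genres = genre_ids + [-1] * (n_genres - len(genre_ids))
    return [0, 0, 0] + [artist_id] + padded_genres + lyric_ids
```

With `n_genres=5`, `build_input_ids(1069, [11], [9, 77, 39])` begins with `0, 0, 0, 1069, 11, -1, -1, -1, -1`, which is the prefix expected by the 5b tokenizer test in this commit.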


@ -2711,6 +2711,37 @@ def load_tf_weights_in_imagegpt(*args, **kwargs):
requires_backends(load_tf_weights_in_imagegpt, ["torch"])
JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST = None
class JukeboxModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class JukeboxPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class JukeboxPrior(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class JukeboxVQVAE(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST = None
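
The dummy classes in this hunk exist so that the Jukebox names can still be imported when PyTorch is missing, with the error deferred until a class is actually used. A rough standalone sketch of that pattern follows. It is not the library's `DummyObject` or `requires_backends` implementation: `requires_backends_sketch`, `AVAILABLE_BACKENDS` and `PlaceholderModel` are hypothetical names, and the set of available backends is hard-coded for illustration.

```python
# Hypothetical registry of installed backends; empty means "torch is not installed".
AVAILABLE_BACKENDS = set()


def requires_backends_sketch(obj, backends):
    """Raise ImportError if any backend needed by `obj` is unavailable."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [backend for backend in backends if backend not in AVAILABLE_BACKENDS]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class PlaceholderModel:
    """Importable stand-in that fails only when it is instantiated."""

    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends_sketch(self, self._backends)
```

Importing `PlaceholderModel` always succeeds; calling `PlaceholderModel()` raises `ImportError` until `"torch"` is present in `AVAILABLE_BACKENDS`.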


@ -0,0 +1,344 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow
from transformers.trainer_utils import set_seed
if is_torch_available():
import torch
from transformers import JukeboxModel, JukeboxTokenizer
@require_torch
class Jukebox1bModelTester(unittest.TestCase):
all_model_classes = (JukeboxModel,) if is_torch_available() else ()
model_id = "openai/jukebox-1b-lyrics"
metas = dict(
artist="Zac Brown Band",
genres="Country",
lyrics="""I met a traveller from an antique land,
Who said "Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away
""",
)
# fmt: off
EXPECTED_OUTPUT_2 = [
1864, 1536, 1213, 1870, 1357, 1536, 519, 880, 1323, 789, 1082, 534,
1000, 1445, 1105, 1130, 967, 515, 1434, 1620, 534, 1495, 283, 1445,
333, 1307, 539, 1631, 1528, 375, 1434, 673, 627, 710, 778, 1883,
1405, 1276, 1455, 1228
]
EXPECTED_OUTPUT_1 = [
1125, 1751, 697, 1776, 1141, 1476, 391, 697, 1125, 684, 867, 416,
844, 1372, 1274, 717, 1274, 844, 1299, 1419, 697, 1370, 317, 1125,
191, 1440, 1370, 1440, 1370, 282, 1621, 1370, 368, 349, 867, 1872,
1262, 869, 1728, 747
]
EXPECTED_OUTPUT_0 = [
1755, 842, 307, 1843, 1022, 1395, 234, 1554, 806, 739, 1022, 442,
616, 556, 268, 1499, 933, 457, 1440, 1837, 755, 985, 308, 902,
293, 1443, 1671, 1141, 1533, 555, 1562, 1061, 287, 417, 1022, 2008,
1186, 1015, 1777, 268
]
EXPECTED_Y_COND = [1058304, 0, 786432, 7169, 507, 76, 27, 40, 30, 76]
EXPECTED_PRIMED_0 = [
390, 1160, 1002, 1907, 1788, 1788, 1788, 1907, 1002, 1002, 1854, 1002,
1002, 1002, 1002, 1002, 1002, 1160, 1160, 1606, 596, 596, 1160, 1002,
1516, 596, 1002, 1002, 1002, 1907, 1788, 1788, 1788, 1854, 1788, 1907,
1907, 1788, 596, 1626
]
EXPECTED_PRIMED_1 = [
1236, 1668, 1484, 1920, 1848, 1409, 139, 864, 1828, 1272, 1599, 824,
1672, 139, 555, 1484, 824, 1920, 555, 596, 1579, 1599, 1231, 1599,
1637, 1407, 212, 824, 1599, 116, 1433, 824, 258, 1599, 1433, 1895,
1063, 1433, 1433, 1599
]
EXPECTED_PRIMED_2 = [
1684, 1873, 1119, 1189, 395, 611, 1901, 972, 890, 1337, 1392, 1927,
96, 972, 672, 780, 1119, 890, 158, 771, 1073, 1927, 353, 1331,
1269, 1459, 1333, 1645, 812, 1577, 1337, 606, 353, 981, 1466, 619,
197, 391, 302, 1930
]
EXPECTED_VQVAE_ENCODE = [
390, 1160, 1002, 1907, 1788, 1788, 1788, 1907, 1002, 1002, 1854, 1002,
1002, 1002, 1002, 1002, 1002, 1160, 1160, 1606, 596, 596, 1160, 1002,
1516, 596, 1002, 1002, 1002, 1907, 1788, 1788, 1788, 1854, 1788, 1907,
1907, 1788, 596, 1626
]
EXPECTED_VQVAE_DECODE = [
-0.0492, -0.0524, -0.0565, -0.0640, -0.0686, -0.0684, -0.0677, -0.0664,
-0.0605, -0.0490, -0.0330, -0.0168, -0.0083, -0.0075, -0.0051, 0.0025,
0.0136, 0.0261, 0.0386, 0.0497, 0.0580, 0.0599, 0.0583, 0.0614,
0.0740, 0.0889, 0.1023, 0.1162, 0.1211, 0.1212, 0.1251, 0.1336,
0.1502, 0.1686, 0.1883, 0.2148, 0.2363, 0.2458, 0.2507, 0.2531
]
EXPECTED_AUDIO_COND = [
0.0256, -0.0544, 0.1600, -0.0032, 0.1066, 0.0825, -0.0013, 0.3440,
0.0210, 0.0412, -0.1777, -0.0892, -0.0164, 0.0285, -0.0613, -0.0617,
-0.0137, -0.0201, -0.0175, 0.0215, -0.0627, 0.0520, -0.0730, 0.0970,
-0.0100, 0.0442, -0.0586, 0.0207, -0.0015, -0.0082
]
EXPECTED_META_COND = [
0.0415, 0.0877, 0.0022, -0.0055, 0.0751, 0.0334, 0.0324, -0.0068,
0.0011, 0.0017, -0.0676, 0.0655, -0.0143, 0.0399, 0.0303, 0.0743,
-0.0168, -0.0394, -0.1113, 0.0124, 0.0442, 0.0267, -0.0003, -0.1536,
-0.0116, -0.1837, -0.0180, -0.1026, -0.0777, -0.0456
]
EXPECTED_LYRIC_COND = [
76, 27, 40, 30, 76, 46, 44, 47, 40, 37, 38, 31, 45, 45, 76, 38, 31, 33,
45, 76, 41, 32, 76, 45, 46, 41, 40, 31, 78, 76
]
# fmt: on
def prepare_inputs(self):
tokenizer = JukeboxTokenizer.from_pretrained(self.model_id)
tokens = tokenizer(**self.metas)["input_ids"]
return tokens
@slow
def test_sampling(self):
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval()
labels = self.prepare_inputs()
set_seed(0)
zs = [torch.zeros(1, 0, dtype=torch.long).cpu() for _ in range(3)]
zs = model._sample(zs, labels, [0], sample_length=40 * model.priors[0].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[0][0], torch.tensor(self.EXPECTED_OUTPUT_2))
set_seed(0)
zs = model._sample(zs, labels, [1], sample_length=40 * model.priors[1].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[1][0], torch.tensor(self.EXPECTED_OUTPUT_1))
set_seed(0)
zs = model._sample(zs, labels, [2], sample_length=40 * model.priors[2].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[2][0], torch.tensor(self.EXPECTED_OUTPUT_0))
@slow
def test_conditioning(self):
torch.backends.cuda.matmul.allow_tf32 = False
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval()
labels = self.prepare_inputs()
set_seed(0)
zs = [torch.zeros(1, 0, dtype=torch.long) for _ in range(3)]
top_prior = model.priors[0]
start = 0
music_token_conds = top_prior.get_music_tokens_conds(zs, start=start, end=start + top_prior.n_ctx)
metadata = top_prior.get_metadata(labels[0].clone(), start, 1058304, 0)
self.assertIsNone(music_token_conds)
self.assertListEqual(metadata.numpy()[0][:10].tolist(), self.EXPECTED_Y_COND)
audio_conditioning, metadata_conditioning, lyric_tokens = top_prior.get_cond(music_token_conds, metadata)
torch.testing.assert_allclose(
audio_conditioning[0][0][:30].detach(), torch.tensor(self.EXPECTED_AUDIO_COND), atol=1e-4, rtol=1e-4
)
torch.testing.assert_allclose(
metadata_conditioning[0][0][:30].detach(), torch.tensor(self.EXPECTED_META_COND), atol=1e-4, rtol=1e-4
)
torch.testing.assert_allclose(
lyric_tokens[0, :30].detach(), torch.tensor(self.EXPECTED_LYRIC_COND), atol=1e-4, rtol=1e-4
)
@slow
def test_primed_sampling(self):
torch.backends.cuda.matmul.allow_tf32 = False
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval()
set_seed(0)
waveform = torch.rand((1, 5120, 1))
tokens = [i for i in self.prepare_inputs()]
zs = [model.vqvae.encode(waveform, start_level=2, bs_chunks=waveform.shape[0])[0], None, None]
zs = model._sample(
zs, tokens, sample_levels=[0], save_results=False, sample_length=40 * model.priors[0].raw_to_tokens
)
torch.testing.assert_allclose(zs[0][0][:40], torch.tensor(self.EXPECTED_PRIMED_0))
upper_2 = torch.cat((zs[0], torch.zeros(1, 2048 - zs[0].shape[-1])), dim=-1).long()
zs = [upper_2, model.vqvae.encode(waveform, start_level=1, bs_chunks=waveform.shape[0])[0], None]
zs = model._sample(
zs, tokens, sample_levels=[1], save_results=False, sample_length=40 * model.priors[1].raw_to_tokens
)
torch.testing.assert_allclose(zs[1][0][:40], torch.tensor(self.EXPECTED_PRIMED_1))
upper_1 = torch.cat((zs[1], torch.zeros(1, 2048 - zs[1].shape[-1])), dim=-1).long()
zs = [upper_2, upper_1, model.vqvae.encode(waveform, start_level=0, bs_chunks=waveform.shape[0])[0]]
zs = model._sample(
zs, tokens, sample_levels=[2], save_results=False, sample_length=40 * model.priors[2].raw_to_tokens
)
torch.testing.assert_allclose(zs[2][0][:40].cpu(), torch.tensor(self.EXPECTED_PRIMED_2))
@slow
def test_vqvae(self):
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval()
set_seed(0)
x = torch.rand((1, 5120, 1))
with torch.no_grad():
zs = model.vqvae.encode(x, start_level=2, bs_chunks=x.shape[0])
torch.testing.assert_allclose(zs[0][0], torch.tensor(self.EXPECTED_VQVAE_ENCODE))
with torch.no_grad():
x = model.vqvae.decode(zs, start_level=2, bs_chunks=x.shape[0])
torch.testing.assert_allclose(x[0, :40, 0], torch.tensor(self.EXPECTED_VQVAE_DECODE), atol=1e-4, rtol=1e-4)
@require_torch
class Jukebox5bModelTester(unittest.TestCase):
all_model_classes = (JukeboxModel,) if is_torch_available() else ()
model_id = "openai/jukebox-5b-lyrics"
metas = dict(
artist="Zac Brown Band",
genres="Country",
lyrics="""I met a traveller from an antique land,
Who said "Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away
""",
)
# fmt: off
EXPECTED_OUTPUT_2 = [
1489, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
1489, 1489, 1489, 1489, 1150, 1853, 1509, 1150, 1357, 1509, 6, 1272
]
EXPECTED_OUTPUT_1 = [
1125, 416, 1125, 1125, 1125, 1125, 1125, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416
]
EXPECTED_OUTPUT_0 = [
1755, 1061, 234, 1755, 1061, 1755, 185, 290, 307, 307, 616, 616,
616, 616, 616, 616, 307, 290, 417, 1755, 234, 1755, 185, 290,
290, 290, 307, 616, 616, 616, 616, 616, 290, 234, 234, 1755,
234, 234, 1755, 234, 185, 185, 307, 616, 616, 616, 616, 290,
1755, 1755, 1755, 234, 234, 1755, 1572, 290, 307, 616, 34, 616
]
EXPECTED_GPU_OUTPUTS_2 = [
1489, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653,
653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653, 653
]
EXPECTED_GPU_OUTPUTS_1 = [
1125, 1125, 416, 1125, 1125, 416, 1125, 1125, 416, 416, 1125, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416,
416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416, 416
]
EXPECTED_GPU_OUTPUTS_0 = [
491, 1755, 34, 1613, 1755, 417, 992, 1613, 222, 842, 1353, 1613,
844, 632, 185, 1613, 844, 632, 185, 1613, 185, 842, 677, 1613,
185, 114, 1353, 1613, 307, 89, 844, 1613, 307, 1332, 234, 1979,
307, 89, 1353, 616, 34, 842, 185, 842, 34, 842, 185, 842,
307, 114, 185, 89, 34, 1268, 185, 89, 34, 842, 185, 89
]
# fmt: on
def prepare_inputs(self, model_id):
tokenizer = JukeboxTokenizer.from_pretrained(model_id)
tokens = tokenizer(**self.metas)["input_ids"]
return tokens
@slow
def test_sampling(self):
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval()
labels = self.prepare_inputs(self.model_id)
set_seed(0)
zs = [torch.zeros(1, 0, dtype=torch.long).cpu() for _ in range(3)]
zs = model._sample(zs, labels, [0], sample_length=60 * model.priors[0].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[0][0], torch.tensor(self.EXPECTED_OUTPUT_2))
set_seed(0)
zs = model._sample(zs, labels, [1], sample_length=60 * model.priors[1].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[1][0], torch.tensor(self.EXPECTED_OUTPUT_1))
set_seed(0)
zs = model._sample(zs, labels, [2], sample_length=60 * model.priors[2].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[2][0], torch.tensor(self.EXPECTED_OUTPUT_0))
@slow
def test_slow_sampling(self):
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval().to("cuda")
labels = [i.cuda() for i in self.prepare_inputs(self.model_id)]
set_seed(0)
model.priors[0].cuda()
zs = [torch.zeros(1, 0, dtype=torch.long).cuda() for _ in range(3)]
zs = model._sample(zs, labels, [0], sample_length=60 * model.priors[0].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[0][0].cpu(), torch.tensor(self.EXPECTED_GPU_OUTPUTS_2))
model.priors[0].cpu()
set_seed(0)
model.priors[1].cuda()
zs = model._sample(zs, labels, [1], sample_length=60 * model.priors[1].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[1][0].cpu(), torch.tensor(self.EXPECTED_GPU_OUTPUTS_1))
model.priors[1].cpu()
set_seed(0)
model.priors[2].cuda()
zs = model._sample(zs, labels, [2], sample_length=60 * model.priors[2].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[2][0].cpu(), torch.tensor(self.EXPECTED_GPU_OUTPUTS_0))
@slow
def test_fp16_slow_sampling(self):
model = JukeboxModel.from_pretrained(self.model_id, min_duration=0).eval().half().to("cuda")
labels = [i.cuda() for i in self.prepare_inputs(self.model_id)]
set_seed(0)
zs = [torch.zeros(1, 0, dtype=torch.long).cuda() for _ in range(3)]
zs = model._sample(zs, labels, [0], sample_length=60 * model.priors[0].raw_to_tokens, save_results=False)
torch.testing.assert_allclose(zs[0][0].cpu(), torch.tensor(self.EXPECTED_GPU_OUTPUTS_2))
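
The sampling tests above call `set_seed(0)` before each level so that the sampled tokens can be compared against fixed expected values. A minimal standard-library sketch of that idea, using `random.Random` in place of the torch generators; `sample_tokens` and the vocabulary size of 2048 are illustrative assumptions, not part of the model API.

```python
import random


def sample_tokens(num_tokens, vocab_size, seed):
    """Draw `num_tokens` token ids from a generator seeded with `seed`."""
    rng = random.Random(seed)  # a fresh generator, so earlier draws cannot affect the result
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]


# Re-seeding with the same value reproduces the same sequence of token ids.
first_run = sample_tokens(40, 2048, seed=0)
second_run = sample_tokens(40, 2048, seed=0)
```

Because both runs start from the same seed, `first_run == second_run` holds, which is the property that lets the tests hard-code their expected outputs.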


@ -0,0 +1,209 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import JukeboxTokenizer
from transformers.testing_utils import require_torch
class JukeboxTokenizationTest(unittest.TestCase):
tokenizer_class = JukeboxTokenizer
metas = dict(
artist="Zac Brown Band",
genres="Country",
lyrics="""I met a traveller from an antique land,
Who said "Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away
""",
)
@require_torch
def test_1b_lyrics_tokenizer(self):
"""
How to run the same test with OpenAI
...
"""
import torch
tokenizer = JukeboxTokenizer.from_pretrained("openai/jukebox-1b-lyrics")
tokens = tokenizer(**self.metas)["input_ids"]
# fmt: off
EXPECTED_OUTPUT = [
torch.tensor([[
0, 0, 0, 7169, 507, 9, 76, 39, 31, 46, 76, 27,
76, 46, 44, 27, 48, 31, 38, 38, 31, 44, 76, 32,
44, 41, 39, 76, 27, 40, 76, 27, 40, 46, 35, 43,
47, 31, 76, 38, 27, 40, 30, 64, 78, 76, 76, 76,
76, 76, 76, 76, 76, 23, 34, 41, 76, 45, 27, 35,
30, 76, 71, 20, 49, 41, 76, 48, 27, 45, 46, 76,
27, 40, 30, 76, 46, 44, 47, 40, 37, 38, 31, 45,
45, 76, 38, 31, 33, 45, 76, 41, 32, 76, 45, 46,
41, 40, 31, 78, 76, 76, 76, 76, 76, 76, 76, 76,
19, 46, 27, 40, 30, 76, 35, 40, 76, 46, 34, 31,
76, 30, 31, 45, 31, 44, 46, 63, 76, 63, 76, 63,
76, 63, 76, 14, 31, 27, 44, 76, 46, 34, 31, 39,
64, 76, 41, 40, 76, 46, 34, 31, 76, 45, 27, 40,
30, 64, 78, 76, 76, 76, 76, 76, 76, 76, 76, 8,
27, 38, 32, 76, 45, 47, 40, 37, 76, 27, 76, 45,
34, 27, 46, 46, 31, 44, 31, 30, 76, 48, 35, 45,
27, 33, 31, 76, 38, 35, 31, 45, 64, 76, 49, 34,
41, 45, 31, 76, 32, 44, 41, 49, 40, 64, 78, 76,
76, 76, 76, 76, 76, 76, 76, 1, 40, 30, 76, 49,
44, 35, 40, 37, 38, 31, 30, 76, 38, 35, 42, 64,
76, 27, 40, 30, 76, 45, 40, 31, 31, 44, 76, 41,
32, 76, 29, 41, 38, 30, 76, 29, 41, 39, 39, 27,
40, 30, 64, 78, 76, 76, 76, 76, 76, 76, 76, 76,
20, 31, 38, 38, 76, 46, 34, 27, 46, 76, 35, 46,
45, 76, 45, 29, 47, 38, 42, 46, 41, 44, 76, 49,
31, 38, 38, 76, 46, 34, 41, 45, 31, 76, 42, 27,
45, 45, 35, 41, 40, 45, 76, 44, 31, 27, 30, 78,
76, 76, 76, 76, 76, 76, 76, 76, 23, 34, 35, 29,
34, 76, 51, 31, 46, 76, 45, 47, 44, 48, 35, 48,
31, 64, 76, 45, 46, 27, 39, 42, 31, 30, 76, 41,
40, 76, 46, 34, 31, 45, 31, 76, 38, 35, 32, 31,
38, 31, 45, 45, 76, 46, 34, 35, 40, 33, 45, 64,
78, 76, 76, 76, 76, 76, 76, 76, 76, 20, 34, 31,
76, 34, 27, 40, 30, 76, 46, 34, 27, 46, 76, 39,
41, 29, 37, 31, 30, 76, 46, 34, 31, 39, 64, 76,
27, 40, 30, 76, 46, 34, 31, 76, 34, 31, 27, 44,
46, 76, 46, 34, 27, 46, 76, 32, 31, 30, 66, 78,
76, 76, 76, 76, 76, 76, 76, 76, 1, 40, 30, 76,
41, 40, 76, 46, 34, 31, 76, 42, 31, 30, 31, 45,
46, 27, 38, 64, 76, 46, 34, 31, 45, 31, 76, 49,
41, 44, 30, 45, 76, 27, 42, 42, 31, 27, 44, 65,
78, 76, 76, 76, 76, 76, 76, 76, 76, 13, 51, 76,
40, 27, 39, 31, 76, 35, 45, 76, 15, 52, 51, 39,
27, 40, 30, 35, 27, 45, 64, 76, 11, 35, 40, 33,
76, 41, 32, 76, 11, 35, 40, 33, 45, 66, 78, 76,
76, 76, 76, 76, 76, 76, 76, 12, 41, 41, 37, 76,
41, 40, 76, 39, 51, 76, 23, 41, 44, 37, 45, 64,
76, 51, 31, 76, 13, 35, 33, 34, 46, 51, 64, 76,
27, 40, 30, 76, 30, 31, 45, 42, 27, 35, 44, 67,
78, 76, 76, 76, 76, 76, 76, 76, 76, 14, 41, 46,
34, 35, 40, 33, 76, 28, 31, 45, 35, 30, 31, 76,
44, 31, 39, 27, 35, 40, 45, 63, 76, 18, 41, 47,
40, 30, 76, 46, 34, 31, 76, 30, 31, 29, 27, 51,
78, 76, 76, 76, 76, 76, 76, 76, 76, 15, 32, 76,
46, 34, 27, 46, 76, 29, 41, 38, 41, 45, 45, 27,
38, 76, 23, 44, 31, 29, 37, 64, 76, 28, 41, 47,
40, 30, 38, 31, 45, 45, 76, 27, 40, 30, 76, 28,
27, 44, 31, 78, 76, 76, 76, 76, 76, 76, 76, 76,
20, 34, 31, 76, 38, 41, 40, 31, 76, 27, 40, 30,
76, 38, 31, 48, 31, 38, 76, 45, 27, 40, 30, 45,
76, 45, 46, 44, 31, 46, 29, 34, 76, 32, 27, 44,
76, 27, 49, 27, 51, 78, 76, 76, 76, 76, 76, 76,
76, 76]]),
torch.tensor([[0, 0, 0, 1069, 11]]),
torch.tensor([[0, 0, 0, 1069, 11]]),
]
# fmt: on
self.assertTrue(torch.allclose(tokens[0], EXPECTED_OUTPUT[0]))
self.assertTrue(torch.allclose(tokens[1], EXPECTED_OUTPUT[1]))
self.assertTrue(torch.allclose(tokens[2], EXPECTED_OUTPUT[2]))
@require_torch
def test_5b_lyrics_tokenizer(self):
"""
The outputs are similar to OpenAI's but do not have the same format, as this one is adapted to the HF integration.
"""
import torch
tokenizer = JukeboxTokenizer.from_pretrained("openai/jukebox-5b-lyrics")
tokens = tokenizer(**self.metas)["input_ids"]
# fmt: off
EXPECTED_OUTPUT = [
torch.tensor([[
0, 0, 0, 1069, 11, -1, -1, -1, -1, 9, 77, 39,
31, 46, 77, 27, 77, 46, 44, 27, 48, 31, 38, 38,
31, 44, 77, 32, 44, 41, 39, 77, 27, 40, 77, 27,
40, 46, 35, 43, 47, 31, 77, 38, 27, 40, 30, 64,
79, 77, 77, 77, 77, 77, 77, 77, 77, 23, 34, 41,
77, 45, 27, 35, 30, 77, 72, 20, 49, 41, 77, 48,
27, 45, 46, 77, 27, 40, 30, 77, 46, 44, 47, 40,
37, 38, 31, 45, 45, 77, 38, 31, 33, 45, 77, 41,
32, 77, 45, 46, 41, 40, 31, 79, 77, 77, 77, 77,
77, 77, 77, 77, 19, 46, 27, 40, 30, 77, 35, 40,
77, 46, 34, 31, 77, 30, 31, 45, 31, 44, 46, 63,
77, 63, 77, 63, 77, 63, 77, 14, 31, 27, 44, 77,
46, 34, 31, 39, 64, 77, 41, 40, 77, 46, 34, 31,
77, 45, 27, 40, 30, 64, 79, 77, 77, 77, 77, 77,
77, 77, 77, 8, 27, 38, 32, 77, 45, 47, 40, 37,
77, 27, 77, 45, 34, 27, 46, 46, 31, 44, 31, 30,
77, 48, 35, 45, 27, 33, 31, 77, 38, 35, 31, 45,
64, 77, 49, 34, 41, 45, 31, 77, 32, 44, 41, 49,
40, 64, 79, 77, 77, 77, 77, 77, 77, 77, 77, 1,
40, 30, 77, 49, 44, 35, 40, 37, 38, 31, 30, 77,
38, 35, 42, 64, 77, 27, 40, 30, 77, 45, 40, 31,
31, 44, 77, 41, 32, 77, 29, 41, 38, 30, 77, 29,
41, 39, 39, 27, 40, 30, 64, 79, 77, 77, 77, 77,
77, 77, 77, 77, 20, 31, 38, 38, 77, 46, 34, 27,
46, 77, 35, 46, 45, 77, 45, 29, 47, 38, 42, 46,
41, 44, 77, 49, 31, 38, 38, 77, 46, 34, 41, 45,
31, 77, 42, 27, 45, 45, 35, 41, 40, 45, 77, 44,
31, 27, 30, 79, 77, 77, 77, 77, 77, 77, 77, 77,
23, 34, 35, 29, 34, 77, 51, 31, 46, 77, 45, 47,
44, 48, 35, 48, 31, 64, 77, 45, 46, 27, 39, 42,
31, 30, 77, 41, 40, 77, 46, 34, 31, 45, 31, 77,
38, 35, 32, 31, 38, 31, 45, 45, 77, 46, 34, 35,
40, 33, 45, 64, 79, 77, 77, 77, 77, 77, 77, 77,
77, 20, 34, 31, 77, 34, 27, 40, 30, 77, 46, 34,
27, 46, 77, 39, 41, 29, 37, 31, 30, 77, 46, 34,
31, 39, 64, 77, 27, 40, 30, 77, 46, 34, 31, 77,
34, 31, 27, 44, 46, 77, 46, 34, 27, 46, 77, 32,
31, 30, 66, 79, 77, 77, 77, 77, 77, 77, 77, 77,
1, 40, 30, 77, 41, 40, 77, 46, 34, 31, 77, 42,
31, 30, 31, 45, 46, 27, 38, 64, 77, 46, 34, 31,
45, 31, 77, 49, 41, 44, 30, 45, 77, 27, 42, 42,
31, 27, 44, 65, 79, 77, 77, 77, 77, 77, 77, 77,
77, 13, 51, 77, 40, 27, 39, 31, 77, 35, 45, 77,
15, 52, 51, 39, 27, 40, 30, 35, 27, 45, 64, 77,
11, 35, 40, 33, 77, 41, 32, 77, 11, 35, 40, 33,
45, 66, 79, 77, 77, 77, 77, 77, 77, 77, 77, 12,
41, 41, 37, 77, 41, 40, 77, 39, 51, 77, 23, 41,
44, 37, 45, 64, 77, 51, 31, 77, 13, 35, 33, 34,
46, 51, 64, 77, 27, 40, 30, 77, 30, 31, 45, 42,
27, 35, 44, 67, 79, 77, 77, 77, 77, 77, 77, 77,
77, 14, 41, 46, 34, 35, 40, 33, 77, 28, 31, 45,
35, 30, 31, 77, 44, 31, 39, 27, 35, 40, 45, 63,
77, 18, 41, 47, 40, 30, 77, 46, 34, 31, 77, 30,
31, 29, 27, 51, 79, 77, 77, 77, 77, 77, 77, 77,
77, 15, 32, 77, 46, 34, 27, 46, 77, 29, 41, 38,
41, 45, 45, 27, 38, 77, 23, 44, 31, 29, 37, 64,
77, 28, 41, 47, 40, 30, 38, 31, 45, 45, 77, 27,
40, 30, 77, 28, 27, 44, 31, 79, 77, 77, 77, 77,
77, 77, 77, 77, 20, 34, 31, 77, 38, 41, 40, 31,
77, 27, 40, 30, 77, 38, 31, 48, 31, 38, 77, 45,
27, 40, 30, 45, 77, 45, 46, 44, 31, 46, 29, 34,
77, 32, 27, 44, 77, 27, 49, 27, 51, 79, 77, 77,
77, 77, 77, 77, 77, 77]]),
torch.tensor([[0, 0, 0, 1069, 11, -1, -1, -1, -1]]),
torch.tensor([[0, 0, 0, 1069, 11, -1, -1, -1, -1]]),
]
# fmt: on
self.assertTrue(torch.allclose(tokens[0], EXPECTED_OUTPUT[0]))
self.assertTrue(torch.allclose(tokens[1], EXPECTED_OUTPUT[1]))
self.assertTrue(torch.allclose(tokens[2], EXPECTED_OUTPUT[2]))

@ -51,6 +51,8 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
"TableTransformerDecoder", # Building part of bigger (tested) model.
"TimeSeriesTransformerEncoder", # Building part of bigger (tested) model.
"TimeSeriesTransformerDecoder", # Building part of bigger (tested) model.
"JukeboxVQVAE", # Building part of bigger (tested) model.
"JukeboxPrior", # Building part of bigger (tested) model.
"DeformableDetrEncoder", # Building part of bigger (tested) model.
"DeformableDetrDecoder", # Building part of bigger (tested) model.
"OPTDecoder", # Building part of bigger (tested) model.
@ -146,6 +148,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"CLIPSegTextModel",
"EsmForProteinFolding",
"TimeSeriesTransformerForPrediction",
"JukeboxVQVAE",
"JukeboxPrior",
"PegasusXEncoder",
"PegasusXDecoder",
"PegasusXDecoderWrapper",