Add X-CLIP (#18852)

* First draft

* Improve conversion script

* Make vision encoder work

* More improvements

* Improve conversion script

* Fix quality

* Add MultiframeIntegrationTransformer

* More improvements

* Make MiT output work

* Fix quality

* Add prompts generator

* Add tests

* Fix some tests

* Fix some more tests

* Fix more tests

* Improve conversion script

* Fix model outputs

* Fix more tests

* Add XClipProcessor

* Use processor in conversion script

* Fix integration test

* Update README, fix docs

* Fix all tests

* Add MIT output to XClipOutput

* Create better variable names

* Rename XClip to XCLIP

* Extend conversion script

* Add support for large models

* Add support for 16 frame models

* Add another model'

* Fix module issue

* Apply suggestions from code review

* Add figure to docs

* Fix CLIPProcessor issue

* Apply suggestions from code review

* Delete file

* Convert more checkpoints

* Convert last checkpoint

* Update nielsr to microsoft
This commit is contained in:
NielsRogge 2022-09-08 14:50:30 +02:00 committed by GitHub
parent 9832ac7c73
commit bb6f6d5338
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
26 changed files with 3260 additions and 11 deletions

View File

@ -383,6 +383,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

View File

@ -335,6 +335,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

View File

@ -359,6 +359,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (来自 Microsoft Research) 伴随论文 [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) 由 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling 发布。
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。

View File

@ -371,6 +371,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](https://huggingface.co/docs/transformers/main/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

View File

@ -470,6 +470,8 @@
title: Vision Text Dual Encoder
- local: model_doc/visual_bert
title: VisualBERT
- local: model_doc/xclip
title: X-CLIP
title: Multimodal models
- isExpanded: false
sections:

View File

@ -175,6 +175,7 @@ The documentation is organized into five sections:
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[X-CLIP](model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
@ -312,6 +313,7 @@ Flax), PyTorch, and/or TensorFlow.
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
| X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| XGLM | ✅ | ✅ | ✅ | ✅ | ✅ |
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
| XLM-ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |

View File

@ -0,0 +1,69 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# X-CLIP
## Overview
The X-CLIP model was proposed in [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
X-CLIP is a minimal extension of [CLIP](clip) for video. The model consists of a text encoder, a cross-frame vision encoder, a multi-frame integration Transformer, and a video-specific prompt generator.
The abstract from the paper is the following:
*Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinectics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited.*
Tips:
- Usage of X-CLIP is identical to CLIP.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png"
alt="drawing" width="600"/>
<small> X-CLIP architecture. Taken from the <a href="https://arxiv.org/abs/2208.02816">original paper.</a> </small>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/microsoft/VideoX/tree/master/X-CLIP).
## XCLIPProcessor
[[autodoc]] XCLIPProcessor
## XCLIPConfig
[[autodoc]] XCLIPConfig
- from_text_vision_configs
## XCLIPTextConfig
[[autodoc]] XCLIPTextConfig
## XCLIPVisionConfig
[[autodoc]] XCLIPVisionConfig
## XCLIPModel
[[autodoc]] XCLIPModel
- forward
- get_text_features
- get_video_features
## XCLIPTextModel
[[autodoc]] XCLIPTextModel
- forward
## XCLIPVisionModel
[[autodoc]] XCLIPVisionModel
- forward

View File

@ -165,6 +165,7 @@ _import_structure = {
"models.clip": [
"CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"CLIPConfig",
"CLIPProcessor",
"CLIPTextConfig",
"CLIPTokenizer",
"CLIPVisionConfig",
@ -368,6 +369,13 @@ _import_structure = {
"WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP",
"WavLMConfig",
],
"models.x_clip": [
"XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"XCLIPConfig",
"XCLIPProcessor",
"XCLIPTextConfig",
"XCLIPVisionConfig",
],
"models.xglm": ["XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XGLMConfig"],
"models.xlm": ["XLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMConfig", "XLMTokenizer"],
"models.xlm_prophetnet": ["XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMProphetNetConfig"],
@ -641,7 +649,6 @@ else:
_import_structure["image_utils"] = ["ImageFeatureExtractionMixin"]
_import_structure["models.beit"].append("BeitFeatureExtractor")
_import_structure["models.clip"].append("CLIPFeatureExtractor")
_import_structure["models.clip"].append("CLIPProcessor")
_import_structure["models.convnext"].append("ConvNextFeatureExtractor")
_import_structure["models.deit"].append("DeiTFeatureExtractor")
_import_structure["models.detr"].append("DetrFeatureExtractor")
@ -988,6 +995,15 @@ else:
"CLIPVisionModel",
]
)
_import_structure["models.x_clip"].extend(
[
"XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
"XCLIPModel",
"XCLIPPreTrainedModel",
"XCLIPTextModel",
"XCLIPVisionModel",
]
)
_import_structure["models.convbert"].extend(
[
"CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@ -3012,6 +3028,7 @@ if TYPE_CHECKING:
from .models.clip import (
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIPConfig,
CLIPProcessor,
CLIPTextConfig,
CLIPTokenizer,
CLIPVisionConfig,
@ -3189,6 +3206,13 @@ if TYPE_CHECKING:
from .models.wav2vec2_phoneme import Wav2Vec2PhonemeCTCTokenizer
from .models.wav2vec2_with_lm import Wav2Vec2ProcessorWithLM
from .models.wavlm import WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP, WavLMConfig
from .models.x_clip import (
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
XCLIPConfig,
XCLIPProcessor,
XCLIPTextConfig,
XCLIPVisionConfig,
)
from .models.xglm import XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XGLMConfig
from .models.xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig, XLMTokenizer
from .models.xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig
@ -3428,7 +3452,7 @@ if TYPE_CHECKING:
else:
from .image_utils import ImageFeatureExtractionMixin
from .models.beit import BeitFeatureExtractor
from .models.clip import CLIPFeatureExtractor, CLIPProcessor
from .models.clip import CLIPFeatureExtractor
from .models.convnext import ConvNextFeatureExtractor
from .models.deit import DeiTFeatureExtractor
from .models.detr import DetrFeatureExtractor
@ -4499,6 +4523,13 @@ if TYPE_CHECKING:
WavLMModel,
WavLMPreTrainedModel,
)
from .models.x_clip import (
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
XCLIPModel,
XCLIPPreTrainedModel,
XCLIPTextModel,
XCLIPVisionModel,
)
from .models.xglm import XGLM_PRETRAINED_MODEL_ARCHIVE_LIST, XGLMForCausalLM, XGLMModel, XGLMPreTrainedModel
from .models.xlm import (
XLM_PRETRAINED_MODEL_ARCHIVE_LIST,

View File

@ -151,6 +151,7 @@ from . import (
wav2vec2_phoneme,
wav2vec2_with_lm,
wavlm,
x_clip,
xglm,
xlm,
xlm_prophetnet,

View File

@ -144,6 +144,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("wav2vec2", "Wav2Vec2Config"),
("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
("wavlm", "WavLMConfig"),
("xclip", "XCLIPConfig"),
("xglm", "XGLMConfig"),
("xlm", "XLMConfig"),
("xlm-prophetnet", "XLMProphetNetConfig"),
@ -259,6 +260,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xclip", "X_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xglm", "XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xlm", "XLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xlm-prophetnet", "XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@ -408,6 +410,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("wav2vec2-conformer", "Wav2Vec2-Conformer"),
("wav2vec2_phoneme", "Wav2Vec2Phoneme"),
("wavlm", "WavLM"),
("xclip", "X-CLIP"),
("xglm", "XGLM"),
("xlm", "XLM"),
("xlm-prophetnet", "XLM-ProphetNet"),
@ -428,6 +431,7 @@ SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict(
("data2vec-text", "data2vec"),
("data2vec-vision", "data2vec"),
("donut-swin", "donut"),
("xclip", "x_clip"),
]
)

View File

@ -75,6 +75,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("vit_mae", "ViTFeatureExtractor"),
("wav2vec2", "Wav2Vec2FeatureExtractor"),
("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"),
("xclip", "CLIPFeatureExtractor"),
("yolos", "YolosFeatureExtractor"),
]
)

View File

@ -138,6 +138,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("wav2vec2", "Wav2Vec2Model"),
("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
("wavlm", "WavLMModel"),
("xclip", "XCLIPModel"),
("xglm", "XGLMModel"),
("xlm", "XLMModel"),
("xlm-prophetnet", "XLMProphetNetModel"),

View File

@ -58,6 +58,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("wav2vec2-conformer", "Wav2Vec2Processor"),
("wav2vec2_with_lm", "Wav2Vec2ProcessorWithLM"),
("wavlm", "Wav2Vec2Processor"),
("xclip", "CLIPProcessor"),
]
)

View File

@ -253,6 +253,7 @@ else:
("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
("xclip", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
(
"xglm",
(

View File

@ -36,6 +36,7 @@ _import_structure = {
"CLIPTextConfig",
"CLIPVisionConfig",
],
"processing_clip": ["CLIPProcessor"],
"tokenization_clip": ["CLIPTokenizer"],
}
@ -54,7 +55,6 @@ except OptionalDependencyNotAvailable:
pass
else:
_import_structure["feature_extraction_clip"] = ["CLIPFeatureExtractor"]
_import_structure["processing_clip"] = ["CLIPProcessor"]
try:
if not is_torch_available():
@ -108,6 +108,7 @@ if TYPE_CHECKING:
CLIPTextConfig,
CLIPVisionConfig,
)
from .processing_clip import CLIPProcessor
from .tokenization_clip import CLIPTokenizer
try:
@ -125,7 +126,6 @@ if TYPE_CHECKING:
pass
else:
from .feature_extraction_clip import CLIPFeatureExtractor
from .processing_clip import CLIPProcessor
try:
if not is_torch_available():

View File

@ -0,0 +1,73 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
_import_structure = {
"configuration_x_clip": [
"XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"XCLIPConfig",
"XCLIPTextConfig",
"XCLIPVisionConfig",
],
"processing_x_clip": ["XCLIPProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_x_clip"] = [
"XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
"XCLIPModel",
"XCLIPPreTrainedModel",
"XCLIPTextModel",
"XCLIPVisionModel",
]
if TYPE_CHECKING:
from .configuration_x_clip import (
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
XCLIPConfig,
XCLIPTextConfig,
XCLIPVisionConfig,
)
from .processing_x_clip import XCLIPProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_x_clip import (
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
XCLIPModel,
XCLIPPreTrainedModel,
XCLIPTextModel,
XCLIPVisionModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

View File

@ -0,0 +1,368 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" X-CLIP model configuration"""
import copy
import os
from typing import Union
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/xclip-base-patch32": "https://huggingface.co/microsoft/xclip-base-patch32/resolve/main/config.json",
}
class XCLIPTextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`XCLIPModel`]. It is used to instantiate an X-CLIP
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the X-CLIP
[microsoft/xclip-base-patch32](https://huggingface.co/microsoft/xclip-base-patch32) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 49408):
Vocabulary size of the X-CLIP text model. Defines the number of different tokens that can be represented by
the `inputs_ids` passed when calling [`XCLIPModel`].
hidden_size (`int`, *optional*, defaults to 512):
Dimensionality of the encoder layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (`int`, *optional*, defaults to 77):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (`float``, *optional*, defaults to 1):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
Example:
```python
>>> from transformers import XCLIPTextModel, XCLIPTextConfig
>>> # Initializing a XCLIPTextModel with microsoft/xclip-base-patch32 style configuration
>>> configuration = XCLIPTextConfig()
>>> # Initializing a XCLIPTextConfig from the microsoft/xclip-base-patch32 style configuration
>>> model = XCLIPTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "xclip_text_model"
def __init__(
self,
vocab_size=49408,
hidden_size=512,
intermediate_size=2048,
num_hidden_layers=12,
num_attention_heads=8,
max_position_embeddings=77,
hidden_act="quick_gelu",
layer_norm_eps=0.00001,
dropout=0.0,
attention_dropout=0.0,
initializer_range=0.02,
initializer_factor=1.0,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.dropout = dropout
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.max_position_embeddings = max_position_embeddings
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.attention_dropout = attention_dropout
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the text config dict if we are loading from XCLIPConfig
if config_dict.get("model_type") == "xclip":
config_dict = config_dict["text_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class XCLIPVisionConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`XCLIPModel`]. It is used to instantiate an X-CLIP
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the X-CLIP
[microsoft/xclip-base-patch32](https://huggingface.co/microsoft/xclip-base-patch32) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
mit_hidden_size (`int`, *optional*, defaults to 512):
Dimensionality of the encoder layers of the Multiframe Integration Transformer (MIT).
mit_intermediate_size (`int`, *optional*, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Multiframe Integration Transformer
(MIT).
mit_num_hidden_layers (`int`, *optional*, defaults to 1):
Number of hidden layers in the Multiframe Integration Transformer (MIT).
mit_num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Multiframe Integration Transformer (MIT).
image_size (`int`, *optional*, defaults to 224):
The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 32):
The size (resolution) of each patch.
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"`, `"gelu_new"` and ``"quick_gelu"` are supported.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (`float``, *optional*, defaults to 1):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
drop_path_rate (`float`, *optional*, defaults to 0.0):
Stochastic depth rate.
Example:
```python
>>> from transformers import XCLIPVisionModel, XCLIPVisionConfig
>>> # Initializing a XCLIPVisionModel with microsoft/xclip-base-patch32 style configuration
>>> configuration = XCLIPVisionConfig()
>>> # Initializing a XCLIPVisionModel model from the microsoft/xclip-base-patch32 style configuration
>>> model = XCLIPVisionModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "xclip_vision_model"
def __init__(
self,
hidden_size=768,
intermediate_size=3072,
num_hidden_layers=12,
num_attention_heads=12,
mit_hidden_size=512,
mit_intermediate_size=2048,
mit_num_hidden_layers=1,
mit_num_attention_heads=8,
num_channels=3,
image_size=224,
patch_size=32,
num_frames=8,
hidden_act="quick_gelu",
layer_norm_eps=0.00001,
dropout=0.0,
attention_dropout=0.0,
initializer_range=0.02,
initializer_factor=1.0,
drop_path_rate=0.0,
**kwargs
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.dropout = dropout
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.mit_hidden_size = mit_hidden_size
self.mit_intermediate_size = mit_intermediate_size
self.mit_num_hidden_layers = mit_num_hidden_layers
self.mit_num_attention_heads = mit_num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.num_frames = num_frames
self.image_size = image_size
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.drop_path_rate = drop_path_rate
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the vision config dict if we are loading from XCLIPConfig
if config_dict.get("model_type") == "xclip":
config_dict = config_dict["vision_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class XCLIPConfig(PretrainedConfig):
r"""
[`XCLIPConfig`] is the configuration class to store the configuration of a [`XCLIPModel`]. It is used to
instantiate X-CLIP model according to the specified arguments, defining the text model and vision model configs.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config_dict (`dict`, *optional*):
Dictionary of configuration options used to initialize [`XCLIPTextConfig`].
vision_config_dict (`dict`, *optional*):
Dictionary of configuration options used to initialize [`XCLIPVisionConfig`].
projection_dim (`int`, *optional*, defaults to 512):
Dimentionality of text and vision projection layers.
prompt_layers (`int`, *optional*, defaults to 2):
Number of layers in the video specific prompt generator.
prompt_alpha (`float`, *optional*, defaults to 0.1):
Alpha value to use in the video specific prompt generator.
prompt_hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
The non-linear activation function (function or string) in the video specific prompt generator. If string,
`"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
prompt_num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads in the cross-attention of the video specific prompt generator.
prompt_attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for the attention layers in the video specific prompt generator.
prompt_projection_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for the projection layers in the video specific prompt generator.
logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
The inital value of the *logit_scale* parameter. Default is used as per the original XCLIP implementation.
kwargs (*optional*):
Dictionary of keyword arguments.
"""
model_type = "xclip"
is_composition = True
def __init__(
self,
text_config_dict=None,
vision_config_dict=None,
projection_dim=512,
prompt_layers=2,
prompt_alpha=0.1,
prompt_hidden_act="quick_gelu",
prompt_num_attention_heads=8,
prompt_attention_dropout=0.0,
prompt_projection_dropout=0.0,
logit_scale_init_value=2.6592,
**kwargs
):
super().__init__(text_config_dict=text_config_dict, vision_config_dict=vision_config_dict, **kwargs)
if text_config_dict is None:
text_config_dict = {}
logger.info("text_config_dict is None. Initializing the XCLIPTextConfig with default values.")
if vision_config_dict is None:
vision_config_dict = {}
logger.info("vision_config_dict is None. initializing the XCLIPVisionConfig with default values.")
self.text_config = XCLIPTextConfig(**text_config_dict)
self.vision_config = XCLIPVisionConfig(**vision_config_dict)
self.projection_dim = projection_dim
self.prompt_layers = prompt_layers
self.prompt_alpha = prompt_alpha
self.prompt_hidden_act = prompt_hidden_act
self.prompt_num_attention_heads = prompt_num_attention_heads
self.prompt_attention_dropout = prompt_attention_dropout
self.prompt_projection_dropout = prompt_projection_dropout
self.logit_scale_init_value = logit_scale_init_value
self.initializer_factor = 1.0
@classmethod
def from_text_vision_configs(cls, text_config: XCLIPTextConfig, vision_config: XCLIPVisionConfig, **kwargs):
r"""
Instantiate a [`XCLIPConfig`] (or a derived class) from xclip text model configuration and xclip vision model
configuration.
Returns:
[`XCLIPConfig`]: An instance of a configuration object
"""
return cls(text_config_dict=text_config.to_dict(), vision_config_dict=vision_config.to_dict(), **kwargs)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["text_config"] = self.text_config.to_dict()
output["vision_config"] = self.vision_config.to_dict()
output["model_type"] = self.__class__.model_type
return output

View File

@ -0,0 +1,386 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import numpy as np
import torch
import gdown
from huggingface_hub import hf_hub_download
from transformers import (
CLIPTokenizer,
CLIPTokenizerFast,
VideoMAEFeatureExtractor,
XCLIPConfig,
XCLIPModel,
XCLIPProcessor,
XCLIPTextConfig,
XCLIPVisionConfig,
)
def get_xclip_config(model_name, num_frames):
text_config = XCLIPTextConfig()
# derive patch size from model name
start_idx = model_name.find("patch")
patch_size = int(model_name[start_idx + len("patch") : start_idx + len("patch") + 2])
vision_config = XCLIPVisionConfig(patch_size=patch_size, num_frames=num_frames)
if "large" in model_name:
text_config.hidden_size = 768
text_config.intermediate_size = 3072
text_config.num_attention_heads = 12
vision_config.hidden_size = 1024
vision_config.intermediate_size = 4096
vision_config.num_attention_heads = 16
vision_config.num_hidden_layers = 24
vision_config.mit_hidden_size = 768
vision_config.mit_intermediate_size = 3072
if model_name == "xclip-large-patch14-16-frames":
vision_config.image_size = 336
config = XCLIPConfig.from_text_vision_configs(text_config, vision_config)
if "large" in model_name:
config.projection_dim = 768
return config
def rename_key(name):
# text encoder
if name == "token_embedding.weight":
name = name.replace("token_embedding.weight", "text_model.embeddings.token_embedding.weight")
if name == "positional_embedding":
name = name.replace("positional_embedding", "text_model.embeddings.position_embedding.weight")
if "ln_1" in name:
name = name.replace("ln_1", "layer_norm1")
if "ln_2" in name:
name = name.replace("ln_2", "layer_norm2")
if "c_fc" in name:
name = name.replace("c_fc", "fc1")
if "c_proj" in name:
name = name.replace("c_proj", "fc2")
if name.startswith("transformer.resblocks"):
name = name.replace("transformer.resblocks", "text_model.encoder.layers")
if "attn.out_proj" in name and "message" not in name:
name = name.replace("attn.out_proj", "self_attn.out_proj")
if "ln_final" in name:
name = name.replace("ln_final", "text_model.final_layer_norm")
# visual encoder
if name == "visual.class_embedding":
name = name.replace("visual.class_embedding", "vision_model.embeddings.class_embedding")
if name == "visual.positional_embedding":
name = name.replace("visual.positional_embedding", "vision_model.embeddings.position_embedding.weight")
if name.startswith("visual.transformer.resblocks"):
name = name.replace("visual.transformer.resblocks", "vision_model.encoder.layers")
if "visual.conv1" in name:
name = name.replace("visual.conv1", "vision_model.embeddings.patch_embedding")
if "visual.ln_pre" in name:
name = name.replace("visual.ln_pre", "vision_model.pre_layernorm")
if "visual.ln_post" in name:
name = name.replace("visual.ln_post", "vision_model.post_layernorm")
if "visual.proj" in name:
name = name.replace("visual.proj", "visual_projection.weight")
if "text_projection" in name:
name = name.replace("text_projection", "text_projection.weight")
# things on top
if "prompts_visual_proj" in name:
name = name.replace("prompts_visual_proj", "prompts_visual_projection")
if "prompts_visual_ln" in name:
name = name.replace("prompts_visual_ln", "prompts_visual_layernorm")
# mit
if name == "mit.positional_embedding":
name = name.replace("positional", "position")
if name.startswith("mit.resblocks"):
name = name.replace("mit.resblocks", "mit.encoder.layers")
# prompts generator
if name.startswith("prompts_generator.norm"):
name = name.replace("prompts_generator.norm", "prompts_generator.layernorm")
return name
def convert_state_dict(orig_state_dict, config):
for key in orig_state_dict.copy().keys():
val = orig_state_dict.pop(key)
if "attn.in_proj" in key:
key_split = key.split(".")
if key.startswith("visual"):
layer_num = key_split[3]
dim = config.vision_config.hidden_size
if "message_attn" in key:
if "weight" in key:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.q_proj.weight"] = val[
:dim, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.k_proj.weight"] = val[
dim : dim * 2, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.v_proj.weight"] = val[
-dim:, :
]
else:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.q_proj.bias"] = val[
:dim
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.k_proj.bias"] = val[
dim : dim * 2
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.message_attn.v_proj.bias"] = val[
-dim:
]
else:
if "weight" in key:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[
:dim, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[
dim : dim * 2, :
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[
-dim:, :
]
else:
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[
dim : dim * 2
]
orig_state_dict[f"vision_model.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
elif key.startswith("mit"):
layer_num = key_split[2]
dim = config.vision_config.mit_hidden_size
if "weight" in key:
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[:dim, :]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[dim : dim * 2, :]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[-dim:, :]
else:
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[dim : dim * 2]
orig_state_dict[f"mit.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
else:
layer_num = key_split[2]
dim = config.text_config.hidden_size
if "weight" in key:
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[:dim, :]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[
dim : dim * 2, :
]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[-dim:, :]
else:
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[
dim : dim * 2
]
orig_state_dict[f"text_model.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
else:
new_key_name = rename_key(key)
if new_key_name in ["visual_projection.weight", "text_projection.weight"]:
val = val.T
orig_state_dict[new_key_name] = val
return orig_state_dict
def prepare_video(num_frames):
if num_frames == 8:
filename = "eating_spaghetti_8_frames.npy"
elif num_frames == 16:
filename = "eating_spaghetti.npy"
elif num_frames == 32:
filename = "eating_spaghetti_32_frames.npy"
file = hf_hub_download(
repo_id="datasets/hf-internal-testing/spaghetti-video",
filename=filename,
)
video = np.load(file)
return list(video)
def convert_xclip_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_hub=False):
model_to_url = {
# fully supervised kinetics-400 checkpoints
"xclip-base-patch32": "https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_32_8.pth",
"xclip-base-patch32-16-frames": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_32_16.pth"
),
"xclip-base-patch16": "https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_16_8.pth",
"xclip-base-patch16-16-frames": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k400_16_16.pth"
),
"xclip-large-patch14": "https://drive.google.com/u/0/uc?id=1NUOImq0o5DlQTST17iIP3vG7DgmHQuCx&amp;export=download&amp;confirm=t&amp;uuid=b26caedc-88e2-473e-830a-9d158b653cdb",
"xclip-large-patch14-16-frames": "https://drive.google.com/u/0/uc?id=1FOYgnJc097OJ4lGwtRCCydQyVPJEOH7d&amp;export=download&amp;confirm=t&amp;uuid=538fa810-e671-4050-b385-9a623f89804f",
# fully supervised kinetics-600 checkpoints
"xclip-base-patch16-kinetics-600": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k600_16_8.pth"
),
"xclip-base-patch16-kinetics-600-16-frames": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/k600_16_16.pth"
),
"xclip-large-patch14-kinetics-600": "https://drive.google.com/u/0/uc?id=1FV8C1INuM91sLAN4ImjzePLIlpMSihwV&amp;export=download&amp;confirm=t&amp;uuid=141d4977-4a65-44ae-864f-4b0c19f838be",
# few shot
"xclip-base-patch16-hmdb-2-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_2.pth"
),
"xclip-base-patch16-hmdb-4-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_4.pth"
),
"xclip-base-patch16-hmdb-8-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_8.pth"
),
"xclip-base-patch16-hmdb-16-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_hmdb_16.pth"
),
"xclip-base-patch16-ucf-2-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_2.pth"
),
"xclip-base-patch16-ucf-4-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_4.pth"
),
"xclip-base-patch16-ucf-8-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_8.pth"
),
"xclip-base-patch16-ucf-16-shot": (
"https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/few_ucf_16.pth"
),
# zero shot
"xclip-base-patch16-zero-shot": "https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/zero.pth",
}
checkpoint_url = model_to_url[model_name]
num_frames = 8
if "16-frames" in model_name:
num_frames = 16
elif "shot" in model_name:
num_frames = 32
config = get_xclip_config(model_name, num_frames)
model = XCLIPModel(config)
model.eval()
if "drive" in checkpoint_url:
output = "pytorch_model.bin"
gdown.cached_download(checkpoint_url, output, quiet=False)
state_dict = torch.load(output, map_location="cpu")["model"]
else:
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url)["model"]
state_dict = convert_state_dict(state_dict, config)
model = XCLIPModel(config)
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
assert missing_keys == ["text_model.embeddings.position_ids", "vision_model.embeddings.position_ids"]
model.eval()
size = 336 if model_name == "xclip-large-patch14-16-frames" else 224
feature_extractor = VideoMAEFeatureExtractor(size=size)
slow_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
fast_tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
processor = XCLIPProcessor(feature_extractor=feature_extractor, tokenizer=fast_tokenizer)
video = prepare_video(num_frames)
inputs = processor(
text=["playing sports", "eating spaghetti", "go shopping"], videos=video, return_tensors="pt", padding=True
)
print("Shape of pixel values:", inputs.pixel_values.shape)
with torch.no_grad():
outputs = model(**inputs)
# Verify outputs
logits_per_video = outputs.logits_per_video
probs = logits_per_video.softmax(dim=1)
print("Probs:", probs)
# kinetics-400
if model_name == "xclip-base-patch32":
expected_probs = torch.tensor([[0.0019, 0.9951, 0.0030]])
elif model_name == "xclip-base-patch32-16-frames":
expected_probs = torch.tensor([[7.0999e-04, 9.9883e-01, 4.5580e-04]])
elif model_name == "xclip-base-patch16":
expected_probs = torch.tensor([[0.0083, 0.9681, 0.0236]])
elif model_name == "xclip-base-patch16-16-frames":
expected_probs = torch.tensor([[7.6937e-04, 9.9728e-01, 1.9473e-03]])
elif model_name == "xclip-large-patch14":
expected_probs = torch.tensor([[0.0062, 0.9864, 0.0075]])
elif model_name == "xclip-large-patch14-16-frames":
expected_probs = torch.tensor([[3.3877e-04, 9.9937e-01, 2.8888e-04]])
# kinetics-600
elif model_name == "xclip-base-patch16-kinetics-600":
expected_probs = torch.tensor([[0.0555, 0.8914, 0.0531]])
elif model_name == "xclip-base-patch16-kinetics-600-16-frames":
expected_probs = torch.tensor([[3.8554e-04, 9.9929e-01, 3.2754e-04]])
elif model_name == "xclip-large-patch14-kinetics-600":
expected_probs = torch.tensor([[0.0036, 0.9920, 0.0045]])
# few shot
elif model_name == "xclip-base-patch16-hmdb-2-shot":
expected_probs = torch.tensor([[7.1890e-06, 9.9994e-01, 5.6559e-05]])
elif model_name == "xclip-base-patch16-hmdb-4-shot":
expected_probs = torch.tensor([[1.0320e-05, 9.9993e-01, 6.2435e-05]])
elif model_name == "xclip-base-patch16-hmdb-8-shot":
expected_probs = torch.tensor([[4.1377e-06, 9.9990e-01, 9.8386e-05]])
elif model_name == "xclip-base-patch16-hmdb-16-shot":
expected_probs = torch.tensor([[4.1347e-05, 9.9962e-01, 3.3411e-04]])
elif model_name == "xclip-base-patch16-ucf-2-shot":
expected_probs = torch.tensor([[8.5857e-05, 9.9928e-01, 6.3291e-04]])
elif model_name == "xclip-base-patch16-ucf-4-shot":
expected_probs = torch.tensor([[8.5857e-05, 9.9928e-01, 6.3291e-04]])
elif model_name == "xclip-base-patch16-ucf-8-shot":
expected_probs = torch.tensor([[0.0027, 0.9904, 0.0070]])
elif model_name == "xclip-base-patch16-ucf-16-shot":
expected_probs = torch.tensor([[9.8219e-04, 9.9593e-01, 3.0863e-03]])
# zero shot
elif model_name == "xclip-base-patch16-zero-shot":
expected_probs = torch.tensor([[3.5082e-04, 9.9785e-01, 1.7966e-03]])
else:
raise ValueError(f"Model name {model_name} not supported")
assert torch.allclose(probs, expected_probs, atol=1e-3)
print("Looks ok!")
if pytorch_dump_folder_path is not None:
print(f"Saving model {model_name} to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
if push_to_hub:
print("Pushing model, processor and slow tokenizer files to the hub...")
model.push_to_hub(model_name, organization="nielsr")
processor.push_to_hub(model_name, organization="nielsr")
slow_tokenizer.push_to_hub(model_name, organization="nielsr")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--model_name",
default="xclip-base-patch32",
type=str,
help="Name of the model.",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
)
args = parser.parse_args()
convert_xclip_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,109 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Image/Text processor class for XCLIP
"""
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
class XCLIPProcessor(ProcessorMixin):
r"""
Constructs an X-CLIP processor which wraps a VideoMAE feature extractor and a CLIP tokenizer into a single
processor.
[`XCLIPProcessor`] offers all the functionalities of [`VideoMAEFeatureExtractor`] and [`CLIPTokenizerFast`]. See
the [`~XCLIPProcessor.__call__`] and [`~XCLIPProcessor.decode`] for more information.
Args:
feature_extractor ([`VideoMAEFeatureExtractor`]):
The feature extractor is a required input.
tokenizer ([`CLIPTokenizerFast`]):
The tokenizer is a required input.
"""
feature_extractor_class = "VideoMAEFeatureExtractor"
tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast")
def __init__(self, feature_extractor, tokenizer):
super().__init__(feature_extractor, tokenizer)
self.current_processor = self.feature_extractor
def __call__(self, text=None, videos=None, return_tensors=None, **kwargs):
"""
Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `videos` and `kwargs` arguments to
VideoMAEFeatureExtractor's [`~VideoMAEFeatureExtractor.__call__`] if `videos` is not `None`. Please refer to
the doctsring of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
videos (`List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`, `List[List[PIL.Image.Image]]`, `List[List[np.ndarrray]]`,:
`List[List[torch.Tensor]]`): The video or batch of videos to be prepared. Each video should be a list
of frames, which can be either PIL images or NumPy arrays. In case of NumPy arrays/PyTorch tensors,
each frame should be of shape (H, W, C), where H and W are frame height and width, and C is a number of
channels.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `videos` is not `None`.
"""
if text is None and videos is None:
raise ValueError("You have to specify either text or videos. Both cannot be none.")
if text is not None:
encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
if videos is not None:
image_features = self.feature_extractor(videos, return_tensors=return_tensors, **kwargs)
if text is not None and videos is not None:
encoding["pixel_values"] = image_features.pixel_values
return encoding
elif text is not None:
return encoding
else:
return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)

View File

@ -5202,6 +5202,37 @@ class WavLMPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None
class XCLIPModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class XCLIPPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class XCLIPTextModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class XCLIPVisionModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
XGLM_PRETRAINED_MODEL_ARCHIVE_LIST = None

View File

@ -24,13 +24,6 @@ class CLIPFeatureExtractor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class CLIPProcessor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class ConvNextFeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]

View File

View File

@ -0,0 +1,672 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch XCLIP model. """
import inspect
import os
import tempfile
import unittest
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import XCLIPConfig, XCLIPTextConfig, XCLIPVisionConfig
from transformers.testing_utils import require_torch, require_torch_multi_gpu, require_vision, slow, torch_device
from transformers.utils import is_torch_available, is_vision_available
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
ModelTesterMixin,
_config_zero_init,
floats_tensor,
ids_tensor,
random_attention_mask,
)
if is_torch_available():
import torch
from torch import nn
from transformers import XCLIPModel, XCLIPTextModel, XCLIPVisionModel
from transformers.models.x_clip.modeling_x_clip import XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
from transformers import XCLIPProcessor
class XCLIPVisionModelTester:
def __init__(
self,
parent,
batch_size=8,
image_size=30,
patch_size=2,
num_channels=3,
num_frames=8, # important; the batch size * time must be divisible by the number of frames
is_training=True,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
mit_hidden_size=64,
dropout=0.1,
attention_dropout=0.1,
initializer_range=0.02,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.image_size = image_size
self.patch_size = patch_size
self.num_channels = num_channels
self.num_frames = num_frames
self.is_training = is_training
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.mit_hidden_size = mit_hidden_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.initializer_range = initializer_range
self.scope = scope
# in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token)
num_patches = (image_size // patch_size) ** 2
self.seq_length = num_patches + 1
def prepare_config_and_inputs(self):
pixel_values = floats_tensor(
[self.batch_size * self.num_frames, self.num_channels, self.image_size, self.image_size]
)
config = self.get_config()
return config, pixel_values
def get_config(self):
return XCLIPVisionConfig(
image_size=self.image_size,
patch_size=self.patch_size,
num_channels=self.num_channels,
num_frames=self.num_frames,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
mit_hidden_size=self.mit_hidden_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
initializer_range=self.initializer_range,
)
def create_and_check_model(self, config, pixel_values):
model = XCLIPVisionModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(pixel_values)
# expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
image_size = (self.image_size, self.image_size)
patch_size = (self.patch_size, self.patch_size)
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
self.parent.assertEqual(
result.last_hidden_state.shape, (self.batch_size * self.num_frames, num_patches + 1, self.hidden_size)
)
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size * self.num_frames, self.hidden_size))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, pixel_values = config_and_inputs
inputs_dict = {"pixel_values": pixel_values}
return config, inputs_dict
@require_torch
class XCLIPVisionModelTest(ModelTesterMixin, unittest.TestCase):
"""
Here we also overwrite some of the tests of test_modeling_common.py, as X-CLIP does not use input_ids, inputs_embeds,
attention_mask and seq_length.
"""
all_model_classes = (XCLIPVisionModel,) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_resize_embeddings = False
test_head_masking = False
def setUp(self):
self.model_tester = XCLIPVisionModelTester(self)
self.config_tester = ConfigTester(
self, config_class=XCLIPVisionConfig, has_text_modality=False, hidden_size=37
)
def test_config(self):
self.config_tester.run_common_tests()
@unittest.skip(reason="X-CLIP does not use inputs_embeds")
def test_inputs_embeds(self):
pass
def test_model_common_attributes(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
x = model.get_output_embeddings()
self.assertTrue(x is None or isinstance(x, nn.Linear))
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["pixel_values"]
self.assertListEqual(arg_names[:1], expected_arg_names)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_training(self):
pass
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="XCLIPVisionModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="XCLIPVisionModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = XCLIPVisionModel.from_pretrained(model_name)
self.assertIsNotNone(model)
def test_gradient_checkpointing_backward_compatibility(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
if not model_class.supports_gradient_checkpointing:
continue
print("Model class:", model_class)
config.gradient_checkpointing = True
model = model_class(config)
self.assertTrue(model.is_gradient_checkpointing)
def test_attention_outputs(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
# we add 1 here due to the special message token in X-CLIP's vision encoder
seq_len = getattr(self.model_tester, "seq_length", None) + 1
encoder_seq_length = getattr(self.model_tester, "encoder_seq_length", seq_len)
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(len(outputs.attentions), self.model_tester.num_hidden_layers)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(len(outputs.attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(outputs.attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_seq_length],
)
out_len = len(outputs)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(out_len + 1, len(outputs))
self_attentions = outputs.attentions
self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_seq_length],
)
@require_torch_multi_gpu
def test_multi_gpu_data_parallel_forward(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
# some params shouldn't be scattered by nn.DataParallel
# so just remove them if they are present.
blacklist_non_batched_params = ["head_mask", "decoder_head_mask", "cross_attn_head_mask"]
for k in blacklist_non_batched_params:
inputs_dict.pop(k, None)
# move input tensors to cuda:O
for k, v in inputs_dict.items():
if torch.is_tensor(v):
inputs_dict[k] = v.to(0)
for model_class in self.all_model_classes:
model = model_class(config=config)
model.to(0)
model.eval()
# Wrap model in nn.DataParallel
model = nn.DataParallel(model)
with torch.no_grad():
test = self._prepare_for_class(inputs_dict, model_class)
for k, v in test.items():
if isinstance(v, torch.Tensor):
print(k, v.shape)
else:
print(k, v)
_ = model(**self._prepare_for_class(inputs_dict, model_class))
class XCLIPTextModelTester:
def __init__(
self,
parent,
batch_size=8,
seq_length=7,
is_training=True,
use_input_mask=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
max_position_embeddings=512,
initializer_range=0.02,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.scope = scope
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
if input_mask is not None:
batch_size, seq_length = input_mask.shape
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
for batch_idx, start_index in enumerate(rnd_start_indices):
input_mask[batch_idx, :start_index] = 1
input_mask[batch_idx, start_index:] = 0
config = self.get_config()
return config, input_ids, input_mask
def get_config(self):
return XCLIPTextConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
max_position_embeddings=self.max_position_embeddings,
initializer_range=self.initializer_range,
)
def create_and_check_model(self, config, input_ids, input_mask):
model = XCLIPTextModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_ids, attention_mask=input_mask)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, input_mask = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
class XCLIPTextModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (XCLIPTextModel,) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_head_masking = False
def setUp(self):
self.model_tester = XCLIPTextModelTester(self)
self.config_tester = ConfigTester(self, config_class=XCLIPTextConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_training(self):
pass
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="X-CLIP does not use inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="XCLIPTextModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="XCLIPTextModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = XCLIPTextModel.from_pretrained(model_name)
self.assertIsNotNone(model)
class XCLIPModelTester:
def __init__(self, parent, projection_dim=64, mit_hidden_size=64, is_training=True):
self.parent = parent
self.projection_dim = projection_dim
self.mit_hidden_size = mit_hidden_size
self.text_model_tester = XCLIPTextModelTester(parent)
self.vision_model_tester = XCLIPVisionModelTester(parent)
self.is_training = is_training
def prepare_config_and_inputs(self):
text_config, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
vision_config, _ = self.vision_model_tester.prepare_config_and_inputs()
pixel_values = floats_tensor(
[
self.vision_model_tester.batch_size,
self.vision_model_tester.num_frames,
self.vision_model_tester.num_channels,
self.vision_model_tester.image_size,
self.vision_model_tester.image_size,
]
)
config = self.get_config()
return config, input_ids, attention_mask, pixel_values
def get_config(self):
return XCLIPConfig.from_text_vision_configs(
self.text_model_tester.get_config(),
self.vision_model_tester.get_config(),
projection_dim=self.projection_dim,
)
def create_and_check_model(self, config, input_ids, attention_mask, pixel_values):
model = XCLIPModel(config).to(torch_device).eval()
with torch.no_grad():
result = model(input_ids, pixel_values, attention_mask)
self.parent.assertEqual(
result.logits_per_video.shape,
(self.vision_model_tester.batch_size, self.text_model_tester.batch_size),
)
self.parent.assertEqual(
result.logits_per_text.shape,
(self.text_model_tester.batch_size, self.vision_model_tester.batch_size),
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, attention_mask, pixel_values = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"pixel_values": pixel_values,
"return_loss": True,
}
return config, inputs_dict
@require_torch
class XCLIPModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (XCLIPModel,) if is_torch_available() else ()
fx_compatible = False
test_head_masking = False
test_pruning = False
test_resize_embeddings = False
test_attention_outputs = False
test_torchscript = False
maxdiff = None
def setUp(self):
self.model_tester = XCLIPModelTester(self)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
@unittest.skip(reason="Hidden_states is tested in individual model tests")
def test_hidden_states_output(self):
pass
@unittest.skip(reason="Inputs_embeds is tested in individual model tests")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="Retain_grad is tested in individual model tests")
def test_retain_grad_hidden_states_attentions(self):
pass
@unittest.skip(reason="XCLIPModel does not have input/output embeddings")
def test_model_common_attributes(self):
pass
@unittest.skip(reason="XCLIPModel does not support feedforward chunking")
def test_feed_forward_chunking(self):
pass
# override as the `logit_scale`, `prompts_generator.alpha` parameters require special treatment
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
# check if `logit_scale` is initilized as per the original implementation
if name == "logit_scale":
self.assertAlmostEqual(
param.data.item(),
np.log(1 / 0.07),
delta=1e-3,
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
elif name == "prompts_generator.alpha":
self.assertAlmostEqual(param.data.mean().item(), model.config.prompt_alpha)
else:
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
def _create_and_check_torchscript(self, config, inputs_dict):
if not self.test_torchscript:
return
configs_no_init = _config_zero_init(config) # To be sure we have no Nan
configs_no_init.torchscript = True
configs_no_init.return_dict = False
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
model.to(torch_device)
model.eval()
try:
input_ids = inputs_dict["input_ids"]
pixel_values = inputs_dict["pixel_values"] # X-CLIP needs pixel_values
traced_model = torch.jit.trace(model, (input_ids, pixel_values))
except RuntimeError:
self.fail("Couldn't trace module.")
with tempfile.TemporaryDirectory() as tmp_dir_name:
pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
try:
torch.jit.save(traced_model, pt_file_name)
except Exception:
self.fail("Couldn't save module.")
try:
loaded_model = torch.jit.load(pt_file_name)
except Exception:
self.fail("Couldn't load module.")
model.to(torch_device)
model.eval()
loaded_model.to(torch_device)
loaded_model.eval()
model_state_dict = model.state_dict()
loaded_model_state_dict = loaded_model.state_dict()
self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))
models_equal = True
for layer_name, p1 in model_state_dict.items():
p2 = loaded_model_state_dict[layer_name]
if p1.data.ne(p2.data).sum() > 0:
models_equal = False
self.assertTrue(models_equal)
def test_load_vision_text_config(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
# Save XCLIPConfig and check if we can load XCLIPVisionConfig from it
with tempfile.TemporaryDirectory() as tmp_dir_name:
config.save_pretrained(tmp_dir_name)
vision_config = XCLIPVisionConfig.from_pretrained(tmp_dir_name)
self.assertDictEqual(config.vision_config.to_dict(), vision_config.to_dict())
# Save XCLIPConfig and check if we can load XCLIPTextConfig from it
with tempfile.TemporaryDirectory() as tmp_dir_name:
config.save_pretrained(tmp_dir_name)
text_config = XCLIPTextConfig.from_pretrained(tmp_dir_name)
self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict())
@slow
def test_model_from_pretrained(self):
for model_name in XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = XCLIPModel.from_pretrained(model_name)
self.assertIsNotNone(model)
# We will verify our results on a spaghetti video
def prepare_video():
file = hf_hub_download(
repo_id="datasets/hf-internal-testing/spaghetti-video", filename="eating_spaghetti_8_frames.npy"
)
video = np.load(file)
return list(video)
@require_vision
@require_torch
class XCLIPModelIntegrationTest(unittest.TestCase):
@slow
def test_inference(self):
model_name = "microsoft/xclip-base-patch32"
model = XCLIPModel.from_pretrained(model_name).to(torch_device)
processor = XCLIPProcessor.from_pretrained(model_name)
video = prepare_video()
inputs = processor(
text=["playing sports", "eating spaghetti", "go shopping"], videos=video, return_tensors="pt", padding=True
).to(torch_device)
# forward pass
with torch.no_grad():
outputs = model(**inputs)
# verify the logits
self.assertEqual(
outputs.logits_per_video.shape,
torch.Size((inputs.pixel_values.shape[0], inputs.input_ids.shape[0])),
)
self.assertEqual(
outputs.logits_per_text.shape,
torch.Size((inputs.input_ids.shape[0], inputs.pixel_values.shape[0])),
)
expected_logits = torch.tensor([[14.3819, 20.6031, 15.0526]], device=torch_device)
self.assertTrue(torch.allclose(outputs.logits_per_video, expected_logits, atol=1e-3))

View File

@ -49,6 +49,7 @@ CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK = {
"SpeechEncoderDecoderConfig",
"VisionEncoderDecoderConfig",
"VisionTextDualEncoderConfig",
"XCLIPConfig",
}

View File

@ -207,6 +207,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"TFWav2Vec2ForCTC",
"TFHubertForCTC",
"MaskFormerForInstanceSegmentation",
"XCLIPVisionModel",
"XCLIPTextModel",
]
# Update this list for models that have multiple model types for the same