Add CLIPSeg (#20066)
* Add first draft
* Update conversion script
* Improve conversion script
* Improve conversion script some more
* Add conditional embeddings
* Add initial decoder
* Fix activation function of decoder
* Make decoder outputs match original implementation
* Make decoder outputs match original implementation
* Add more copied from statements
* Improve model outputs
* Fix auto tokenizer file
* Fix more tests
* Add test
* Improve README and docs, improve conditional embeddings
* Fix more tests
* Remove print statements
* Remove initial embeddings
* Improve conversion script
* Add interpolation of position embeddings
* Finish addition of interpolation of position embeddings
* Add support for refined checkpoint
* Fix refined checkpoint
* Remove unused parameter
* Improve conversion script
* Add support for training
* Fix conversion script
* Add CLIPSegFeatureExtractor
* Fix processor
* Fix CLIPSegProcessor
* Fix conversion script
* Fix most tests
* Fix equivalence test
* Fix README
* Add model to doc tests
* Use better variable name
* Convert other checkpoint as well
* Update config, add link to paper
* Add docs
* Update organization
* Replace base_model_prefix with clip
* Fix base_model_prefix
* Fix checkpoint of config
* Fix config checkpoint
* Remove file
* Use logits for output
* Fix tests

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
This commit is contained in: parent 3e39fd09a9 · commit 258963062b
@@ -279,6 +279,7 @@ Current number of checkpoints:
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/main/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
@@ -279,6 +279,7 @@ Número actual de puntos de control:
1. **[CLIPSeg](https://huggingface.co/docs/transformers/main/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
@@ -314,6 +314,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[CLIPSeg](https://huggingface.co/docs/transformers/main/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
@@ -229,6 +229,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[CLIPSeg](https://huggingface.co/docs/transformers/main/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
@@ -253,6 +253,7 @@ conda install -c huggingface transformers
1. **[CLIPSeg](https://huggingface.co/docs/transformers/main/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
@@ -265,6 +265,7 @@ conda install -c huggingface transformers
1. **[CLIPSeg](https://huggingface.co/docs/transformers/main/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
@@ -466,6 +466,8 @@
    sections:
    - local: model_doc/clip
      title: CLIP
    - local: model_doc/clipseg
      title: CLIPSeg
    - local: model_doc/data2vec
      title: Data2Vec
    - local: model_doc/donut
@@ -67,6 +67,7 @@ The documentation is organized into five sections:
1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1. **[Conditional DETR](model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.
1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
@@ -223,6 +224,7 @@ Flax), PyTorch, and/or TensorFlow.
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
| CANINE | ✅ | ❌ | ✅ | ❌ | ❌ |
| CLIP | ✅ | ✅ | ✅ | ✅ | ✅ |
| CLIPSeg | ❌ | ❌ | ✅ | ❌ | ❌ |
| CodeGen | ✅ | ✅ | ✅ | ❌ | ❌ |
| Conditional DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
| ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
docs/source/en/model_doc/clipseg.mdx (new file, 93 lines)
@@ -0,0 +1,93 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# CLIPSeg

## Overview

The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke
and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero- and one-shot image segmentation.

The abstract from the paper is the following:

*Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties.*

Tips:

- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`].
- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text
(provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom
conditional embeddings (provided to the model as `conditional_embeddings`); see the usage sketch below.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/clipseg_architecture.png"
alt="drawing" width="600"/>

<small> CLIPSeg overview. Taken from the <a href="https://arxiv.org/abs/2112.10003">original paper.</a> </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/timojl/clipseg).
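The snippet below is a minimal sketch of text-prompted segmentation. It assumes the `CIDAS/clipseg-rd64-refined` checkpoint that the conversion script in this PR pushes to the Hub, reuses the processor arguments from that script's sanity check, and the prompt strings are purely illustrative.

```python
import torch
import requests
from PIL import Image

from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Checkpoint name assumed from the conversion script (pushed as CIDAS/<model_name>)
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Same COCO cats image used by the conversion script's sanity check
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompts = ["a cat", "a remote", "a blanket"]

# One (image, text) pair per prompt, padded to the model's maximum text length
inputs = processor(text=prompts, images=[image] * len(prompts), padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits holds one low-resolution segmentation logit map per prompt
masks = torch.sigmoid(outputs.logits)
print(masks.shape)
```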
## CLIPSegConfig

[[autodoc]] CLIPSegConfig
    - from_text_vision_configs

## CLIPSegTextConfig

[[autodoc]] CLIPSegTextConfig

## CLIPSegVisionConfig

[[autodoc]] CLIPSegVisionConfig

## CLIPSegProcessor

[[autodoc]] CLIPSegProcessor

## CLIPSegModel

[[autodoc]] CLIPSegModel
    - forward
    - get_text_features
    - get_image_features

## CLIPSegTextModel

[[autodoc]] CLIPSegTextModel
    - forward

## CLIPSegVisionModel

[[autodoc]] CLIPSegVisionModel
    - forward

## CLIPSegForImageSegmentation

[[autodoc]] CLIPSegForImageSegmentation
    - forward
@@ -171,6 +171,13 @@ _import_structure = {
        "CLIPTokenizer",
        "CLIPVisionConfig",
    ],
    "models.clipseg": [
        "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "CLIPSegConfig",
        "CLIPSegProcessor",
        "CLIPSegTextConfig",
        "CLIPSegVisionConfig",
    ],
    "models.codegen": ["CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", "CodeGenConfig", "CodeGenTokenizer"],
    "models.conditional_detr": ["CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConditionalDetrConfig"],
    "models.convbert": ["CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvBertConfig", "ConvBertTokenizer"],

@@ -1074,6 +1081,16 @@ else:
            "CLIPVisionModel",
        ]
    )
    _import_structure["models.clipseg"].extend(
        [
            "CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST",
            "CLIPSegModel",
            "CLIPSegPreTrainedModel",
            "CLIPSegTextModel",
            "CLIPSegVisionModel",
            "CLIPSegForImageSegmentation",
        ]
    )
    _import_structure["models.x_clip"].extend(
        [
            "XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",

@@ -3225,6 +3242,13 @@ if TYPE_CHECKING:
        CLIPTokenizer,
        CLIPVisionConfig,
    )
    from .models.clipseg import (
        CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP,
        CLIPSegConfig,
        CLIPSegProcessor,
        CLIPSegTextConfig,
        CLIPSegVisionConfig,
    )
    from .models.codegen import CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP, CodeGenConfig, CodeGenTokenizer
    from .models.conditional_detr import CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, ConditionalDetrConfig
    from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer

@@ -3993,6 +4017,14 @@ if TYPE_CHECKING:
            CLIPTextModel,
            CLIPVisionModel,
        )
        from .models.clipseg import (
            CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST,
            CLIPSegForImageSegmentation,
            CLIPSegModel,
            CLIPSegPreTrainedModel,
            CLIPSegTextModel,
            CLIPSegVisionModel,
        )
        from .models.codegen import (
            CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST,
            CodeGenForCausalLM,
@@ -37,6 +37,7 @@ from . import (
    camembert,
    canine,
    clip,
    clipseg,
    codegen,
    conditional_detr,
    convbert,

@@ -42,6 +42,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertConfig"),
        ("canine", "CanineConfig"),
        ("clip", "CLIPConfig"),
        ("clipseg", "CLIPSegConfig"),
        ("codegen", "CodeGenConfig"),
        ("conditional_detr", "ConditionalDetrConfig"),
        ("convbert", "ConvBertConfig"),

@@ -182,6 +183,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("camembert", "CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("conditional_detr", "CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("convbert", "CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),

@@ -315,6 +317,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("camembert", "CamemBERT"),
        ("canine", "CANINE"),
        ("clip", "CLIP"),
        ("clipseg", "CLIPSeg"),
        ("codegen", "CodeGen"),
        ("conditional_detr", "Conditional DETR"),
        ("convbert", "ConvBERT"),

@@ -39,6 +39,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
    [
        ("beit", "BeitFeatureExtractor"),
        ("clip", "CLIPFeatureExtractor"),
        ("clipseg", "ViTFeatureExtractor"),
        ("conditional_detr", "ConditionalDetrFeatureExtractor"),
        ("convnext", "ConvNextFeatureExtractor"),
        ("cvt", "ConvNextFeatureExtractor"),

@@ -41,6 +41,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertModel"),
        ("canine", "CanineModel"),
        ("clip", "CLIPModel"),
        ("clipseg", "CLIPSegModel"),
        ("codegen", "CodeGenModel"),
        ("conditional_detr", "ConditionalDetrModel"),
        ("convbert", "ConvBertModel"),

@@ -813,6 +814,7 @@ _MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        # Model for Zero Shot Image Classification mapping
        ("clip", "CLIPModel"),
        ("clipseg", "CLIPSegModel"),
    ]
)
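Because `CLIPSegModel` is registered in the zero-shot image classification mapping above and, per the model doc, is identical to `CLIPModel`, it can be scored against candidate text labels in the usual CLIP way. A minimal sketch, assuming the `CIDAS/clipseg-rd64-refined` checkpoint from the conversion script and CLIP-style `logits_per_image` outputs; the labels are illustrative:

```python
import torch
import requests
from PIL import Image

from transformers import CLIPSegModel, CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegModel.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
labels = ["a photo of cats", "a photo of a dog"]

inputs = processor(text=labels, images=image, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLIP-style image-text similarity scores, softmaxed over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```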
@@ -40,6 +40,7 @@ logger = logging.get_logger(__name__)
PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
        ("flava", "FlavaProcessor"),
        ("groupvit", "CLIPProcessor"),
        ("layoutlmv2", "LayoutLMv2Processor"),

@@ -93,6 +93,13 @@ else:
                "CLIPTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "clipseg",
            (
                "CLIPTokenizer",
                "CLIPTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("codegen", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
        ("convbert", ("ConvBertTokenizer", "ConvBertTokenizerFast" if is_tokenizers_available() else None)),
        (
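With the `clipseg` entries registered in the processor, feature-extractor and tokenizer mappings above, the Auto classes should resolve a CLIPSeg checkpoint to the right preprocessing classes. A minimal sketch, with the checkpoint name assumed from the conversion script:

```python
from transformers import AutoProcessor, AutoTokenizer

# Resolves via PROCESSOR_MAPPING_NAMES ("clipseg" -> CLIPSegProcessor)
processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

# Resolves via the tokenizer mapping ("clipseg" -> CLIPTokenizer / CLIPTokenizerFast)
tokenizer = AutoTokenizer.from_pretrained("CIDAS/clipseg-rd64-refined")

print(type(processor).__name__, type(tokenizer).__name__)
```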
src/transformers/models/clipseg/__init__.py (new file, 75 lines)
@@ -0,0 +1,75 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available


_import_structure = {
    "configuration_clipseg": [
        "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "CLIPSegConfig",
        "CLIPSegTextConfig",
        "CLIPSegVisionConfig",
    ],
    "processing_clipseg": ["CLIPSegProcessor"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_clipseg"] = [
        "CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST",
        "CLIPSegModel",
        "CLIPSegPreTrainedModel",
        "CLIPSegTextModel",
        "CLIPSegVisionModel",
        "CLIPSegForImageSegmentation",
    ]

if TYPE_CHECKING:
    from .configuration_clipseg import (
        CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP,
        CLIPSegConfig,
        CLIPSegTextConfig,
        CLIPSegVisionConfig,
    )
    from .processing_clipseg import CLIPSegProcessor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_clipseg import (
            CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST,
            CLIPSegForImageSegmentation,
            CLIPSegModel,
            CLIPSegPreTrainedModel,
            CLIPSegTextModel,
            CLIPSegVisionModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
src/transformers/models/clipseg/configuration_clipseg.py (new file, 383 lines)
@@ -0,0 +1,383 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" CLIPSeg model configuration"""

import copy
import os
from typing import Union

from ...configuration_utils import PretrainedConfig
from ...utils import logging


logger = logging.get_logger(__name__)

CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "CIDAS/clipseg-rd64": "https://huggingface.co/CIDAS/clipseg-rd64/resolve/main/config.json",
}


class CLIPSegTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CLIPSegModel`]. It is used to instantiate a
    CLIPSeg model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the CLIPSeg
    [CIDAS/clipseg-rd64](https://huggingface.co/CIDAS/clipseg-rd64) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 49408):
            Vocabulary size of the CLIPSeg text model. Defines the number of different tokens that can be represented
            by the `inputs_ids` passed when calling [`CLIPSegModel`].
        hidden_size (`int`, *optional*, defaults to 512):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        max_position_embeddings (`int`, *optional*, defaults to 77):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).

    Example:

    ```python
    >>> from transformers import CLIPSegTextConfig, CLIPSegTextModel

    >>> # Initializing a CLIPSegTextConfig with CIDAS/clipseg-rd64 style configuration
    >>> configuration = CLIPSegTextConfig()

    >>> # Initializing a CLIPSegTextModel (with random weights) from the CIDAS/clipseg-rd64 style configuration
    >>> model = CLIPSegTextModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "clipseg_text_model"

    def __init__(
        self,
        vocab_size=49408,
        hidden_size=512,
        intermediate_size=2048,
        num_hidden_layers=12,
        num_attention_heads=8,
        max_position_embeddings=77,
        hidden_act="quick_gelu",
        layer_norm_eps=0.00001,
        dropout=0.0,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        pad_token_id=1,
        bos_token_id=0,
        eos_token_id=2,
        **kwargs
    ):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.dropout = dropout
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.max_position_embeddings = max_position_embeddings
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the text config dict if we are loading from CLIPSegConfig
        if config_dict.get("model_type") == "clipseg":
            config_dict = config_dict["text_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)
class CLIPSegVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CLIPSegModel`]. It is used to instantiate a
    CLIPSeg model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the CLIPSeg
    [CIDAS/clipseg-rd64](https://huggingface.co/CIDAS/clipseg-rd64) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 32):
            The size (resolution) of each patch.
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).

    Example:

    ```python
    >>> from transformers import CLIPSegVisionConfig, CLIPSegVisionModel

    >>> # Initializing a CLIPSegVisionConfig with CIDAS/clipseg-rd64 style configuration
    >>> configuration = CLIPSegVisionConfig()

    >>> # Initializing a CLIPSegVisionModel (with random weights) from the CIDAS/clipseg-rd64 style configuration
    >>> model = CLIPSegVisionModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "clipseg_vision_model"

    def __init__(
        self,
        hidden_size=768,
        intermediate_size=3072,
        num_hidden_layers=12,
        num_attention_heads=12,
        num_channels=3,
        image_size=224,
        patch_size=32,
        hidden_act="quick_gelu",
        layer_norm_eps=0.00001,
        dropout=0.0,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        **kwargs
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.dropout = dropout
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.patch_size = patch_size
        self.image_size = image_size
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the vision config dict if we are loading from CLIPSegConfig
        if config_dict.get("model_type") == "clipseg":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)


class CLIPSegConfig(PretrainedConfig):
    r"""
    [`CLIPSegConfig`] is the configuration class to store the configuration of a [`CLIPSegModel`]. It is used to
    instantiate a CLIPSeg model according to the specified arguments, defining the text model and vision model
    configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIPSeg
    [CIDAS/clipseg-rd64](https://huggingface.co/CIDAS/clipseg-rd64) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`CLIPSegTextConfig`].
        vision_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`CLIPSegVisionConfig`].
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
            The initial value of the *logit_scale* parameter. Default is used as per the original CLIPSeg
            implementation.
        extract_layers (`List[int]`, *optional*, defaults to `[3, 6, 9]`):
            Layers to extract when forwarding the query image through the frozen visual backbone of CLIP.
        reduce_dim (`int`, *optional*, defaults to 64):
            Dimensionality to reduce the CLIP vision embedding.
        decoder_num_attention_heads (`int`, *optional*, defaults to 4):
            Number of attention heads in the decoder of CLIPSeg.
        decoder_attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        decoder_hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the decoder. If string, `"gelu"`, `"relu"`,
            `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
        decoder_intermediate_size (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layers in the Transformer decoder.
        conditional_layer (`int`, *optional*, defaults to 0):
            The layer of the Transformer encoder whose activations will be combined with the condition embeddings
            using FiLM (Feature-wise Linear Modulation). If 0, the last layer is used.
        use_complex_transposed_convolution (`bool`, *optional*, defaults to `False`):
            Whether to use a more complex transposed convolution in the decoder, enabling more fine-grained
            segmentation.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:

    ```python
    >>> from transformers import CLIPSegConfig, CLIPSegModel

    >>> # Initializing a CLIPSegConfig with CIDAS/clipseg-rd64 style configuration
    >>> configuration = CLIPSegConfig()

    >>> # Initializing a CLIPSegModel (with random weights) from the CIDAS/clipseg-rd64 style configuration
    >>> model = CLIPSegModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config

    >>> # We can also initialize a CLIPSegConfig from a CLIPSegTextConfig and a CLIPSegVisionConfig

    >>> # Initializing a CLIPSegText and CLIPSegVision configuration
    >>> config_text = CLIPSegTextConfig()
    >>> config_vision = CLIPSegVisionConfig()

    >>> config = CLIPSegConfig.from_text_vision_configs(config_text, config_vision)
    ```"""

    model_type = "clipseg"
    is_composition = True

    def __init__(
        self,
        text_config=None,
        vision_config=None,
        projection_dim=512,
        logit_scale_init_value=2.6592,
        extract_layers=[3, 6, 9],
        reduce_dim=64,
        decoder_num_attention_heads=4,
        decoder_attention_dropout=0.0,
        decoder_hidden_act="quick_gelu",
        decoder_intermediate_size=2048,
        conditional_layer=0,
        use_complex_transposed_convolution=False,
        **kwargs
    ):
        super().__init__(**kwargs)

        text_config_dict = kwargs.pop("text_config_dict", None)
        vision_config_dict = kwargs.pop("vision_config_dict", None)
        if text_config_dict is not None:
            text_config = text_config_dict
        if vision_config_dict is not None:
            vision_config = vision_config_dict

        if text_config is None:
            text_config = {}
            logger.info("text_config is None. Initializing the CLIPSegTextConfig with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("vision_config is None. Initializing the CLIPSegVisionConfig with default values.")

        self.text_config = CLIPSegTextConfig(**text_config)
        self.vision_config = CLIPSegVisionConfig(**vision_config)

        self.projection_dim = projection_dim
        self.logit_scale_init_value = logit_scale_init_value
        self.extract_layers = extract_layers
        self.reduce_dim = reduce_dim
        self.decoder_num_attention_heads = decoder_num_attention_heads
        self.decoder_attention_dropout = decoder_attention_dropout
        self.decoder_hidden_act = decoder_hidden_act
        self.decoder_intermediate_size = decoder_intermediate_size
        self.conditional_layer = conditional_layer
        self.initializer_factor = 1.0
        self.use_complex_transposed_convolution = use_complex_transposed_convolution

    @classmethod
    def from_text_vision_configs(cls, text_config: CLIPSegTextConfig, vision_config: CLIPSegVisionConfig, **kwargs):
        r"""
        Instantiate a [`CLIPSegConfig`] (or a derived class) from a CLIPSeg text model configuration and a CLIPSeg
        vision model configuration.

        Returns:
            [`CLIPSegConfig`]: An instance of a configuration object
        """

        return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

    def to_dict(self):
        """
        Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].

        Returns:
            `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        output = copy.deepcopy(self.__dict__)
        output["text_config"] = self.text_config.to_dict()
        output["vision_config"] = self.vision_config.to_dict()
        output["model_type"] = self.__class__.model_type
        return output
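For reference, a small sketch of how these options compose, mirroring what `get_clipseg_config` in the conversion script below does for the "rd16" and "refined" variants; the specific values here are illustrative rather than tied to a released checkpoint:

```python
from transformers import CLIPSegConfig, CLIPSegTextConfig, CLIPSegVisionConfig

# CLIP ViT-B/16 backbone, as used by the original CLIPSeg checkpoints
text_config = CLIPSegTextConfig()
vision_config = CLIPSegVisionConfig(patch_size=16)

# Decoder options: a 16-dim reduced embedding ("rd16") and the more complex
# transposed convolution used by the "refined" checkpoint
config = CLIPSegConfig.from_text_vision_configs(
    text_config,
    vision_config,
    reduce_dim=16,
    use_complex_transposed_convolution=True,
)
print(config.reduce_dim, config.use_complex_transposed_convolution)
```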
@ -0,0 +1,264 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Convert CLIPSeg checkpoints from the original repository. URL: https://github.com/timojl/clipseg."""
|
||||
|
||||
import argparse
|
||||
|
||||
import torch
|
||||
from PIL import Image
|
||||
|
||||
import requests
|
||||
from transformers import (
|
||||
CLIPSegConfig,
|
||||
CLIPSegForImageSegmentation,
|
||||
CLIPSegProcessor,
|
||||
CLIPSegTextConfig,
|
||||
CLIPSegVisionConfig,
|
||||
CLIPTokenizer,
|
||||
ViTFeatureExtractor,
|
||||
)
|
||||
|
||||
|
||||
def get_clipseg_config(model_name):
|
||||
text_config = CLIPSegTextConfig()
|
||||
vision_config = CLIPSegVisionConfig(patch_size=16)
|
||||
|
||||
use_complex_transposed_convolution = True if "refined" in model_name else False
|
||||
reduce_dim = 16 if "rd16" in model_name else 64
|
||||
|
||||
config = CLIPSegConfig.from_text_vision_configs(
|
||||
text_config,
|
||||
vision_config,
|
||||
use_complex_transposed_convolution=use_complex_transposed_convolution,
|
||||
reduce_dim=reduce_dim,
|
||||
)
|
||||
return config
|
||||
|
||||
|
||||
def rename_key(name):
|
||||
# update prefixes
|
||||
if "clip_model" in name:
|
||||
name = name.replace("clip_model", "clip")
|
||||
if "transformer" in name:
|
||||
if "visual" in name:
|
||||
name = name.replace("visual.transformer", "vision_model")
|
||||
else:
|
||||
name = name.replace("transformer", "text_model")
|
||||
if "resblocks" in name:
|
||||
name = name.replace("resblocks", "encoder.layers")
|
||||
if "ln_1" in name:
|
||||
name = name.replace("ln_1", "layer_norm1")
|
||||
if "ln_2" in name:
|
||||
name = name.replace("ln_2", "layer_norm2")
|
||||
if "c_fc" in name:
|
||||
name = name.replace("c_fc", "fc1")
|
||||
if "c_proj" in name:
|
||||
name = name.replace("c_proj", "fc2")
|
||||
if "attn" in name and "self" not in name:
|
||||
name = name.replace("attn", "self_attn")
|
||||
# text encoder
|
||||
if "token_embedding" in name:
|
||||
name = name.replace("token_embedding", "text_model.embeddings.token_embedding")
|
||||
if "positional_embedding" in name and "visual" not in name:
|
||||
name = name.replace("positional_embedding", "text_model.embeddings.position_embedding.weight")
|
||||
if "ln_final" in name:
|
||||
name = name.replace("ln_final", "text_model.final_layer_norm")
|
||||
# vision encoder
|
||||
if "visual.class_embedding" in name:
|
||||
name = name.replace("visual.class_embedding", "vision_model.embeddings.class_embedding")
|
||||
if "visual.conv1" in name:
|
||||
name = name.replace("visual.conv1", "vision_model.embeddings.patch_embedding")
|
||||
if "visual.positional_embedding" in name:
|
||||
name = name.replace("visual.positional_embedding", "vision_model.embeddings.position_embedding.weight")
|
||||
if "visual.ln_pre" in name:
|
||||
name = name.replace("visual.ln_pre", "vision_model.pre_layrnorm")
|
||||
if "visual.ln_post" in name:
|
||||
name = name.replace("visual.ln_post", "vision_model.post_layernorm")
|
||||
# projection layers
|
||||
if "visual.proj" in name:
|
||||
name = name.replace("visual.proj", "visual_projection.weight")
|
||||
if "text_projection" in name:
|
||||
name = name.replace("text_projection", "text_projection.weight")
|
||||
# decoder
|
||||
if "trans_conv" in name:
|
||||
name = name.replace("trans_conv", "transposed_convolution")
|
||||
if "film_mul" in name or "film_add" in name or "reduce" in name or "transposed_convolution" in name:
|
||||
name = "decoder." + name
|
||||
if "blocks" in name:
|
||||
name = name.replace("blocks", "decoder.layers")
|
||||
if "linear1" in name:
|
||||
name = name.replace("linear1", "mlp.fc1")
|
||||
if "linear2" in name:
|
||||
name = name.replace("linear2", "mlp.fc2")
|
||||
if "norm1" in name and "layer_" not in name:
|
||||
name = name.replace("norm1", "layer_norm1")
|
||||
if "norm2" in name and "layer_" not in name:
|
||||
name = name.replace("norm2", "layer_norm2")
|
||||
|
||||
return name
|
||||
|
||||
|
||||
def convert_state_dict(orig_state_dict, config):
|
||||
for key in orig_state_dict.copy().keys():
|
||||
val = orig_state_dict.pop(key)
|
||||
|
||||
if key.startswith("clip_model") and "attn.in_proj" in key:
|
||||
key_split = key.split(".")
|
||||
if "visual" in key:
|
||||
layer_num = int(key_split[4])
|
||||
dim = config.vision_config.hidden_size
|
||||
prefix = "vision_model"
|
||||
else:
|
||||
layer_num = int(key_split[3])
|
||||
dim = config.text_config.hidden_size
|
||||
prefix = "text_model"
|
||||
|
||||
if "weight" in key:
|
||||
orig_state_dict[f"clip.{prefix}.encoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[:dim, :]
|
||||
orig_state_dict[f"clip.{prefix}.encoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[
|
||||
dim : dim * 2, :
|
||||
]
|
||||
orig_state_dict[f"clip.{prefix}.encoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[-dim:, :]
|
||||
else:
|
||||
orig_state_dict[f"clip.{prefix}.encoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
|
||||
orig_state_dict[f"clip.{prefix}.encoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[dim : dim * 2]
|
||||
orig_state_dict[f"clip.{prefix}.encoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
|
||||
elif "self_attn" in key and "out_proj" not in key:
|
||||
key_split = key.split(".")
|
||||
layer_num = int(key_split[1])
|
||||
dim = config.reduce_dim
|
||||
if "weight" in key:
|
||||
orig_state_dict[f"decoder.layers.{layer_num}.self_attn.q_proj.weight"] = val[:dim, :]
|
||||
orig_state_dict[f"decoder.layers.{layer_num}.self_attn.k_proj.weight"] = val[dim : dim * 2, :]
|
||||
orig_state_dict[f"decoder.layers.{layer_num}.self_attn.v_proj.weight"] = val[-dim:, :]
|
||||
else:
|
||||
orig_state_dict[f"decoder.layers.{layer_num}.self_attn.q_proj.bias"] = val[:dim]
|
||||
orig_state_dict[f"decoder.layers.{layer_num}.self_attn.k_proj.bias"] = val[dim : dim * 2]
|
||||
orig_state_dict[f"decoder.layers.{layer_num}.self_attn.v_proj.bias"] = val[-dim:]
|
||||
else:
|
||||
new_name = rename_key(key)
|
||||
if "visual_projection" in new_name or "text_projection" in new_name:
|
||||
val = val.T
|
||||
orig_state_dict[new_name] = val
|
||||
|
||||
return orig_state_dict
|
||||
|
||||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
return image
|
||||
|
||||
|
||||
def convert_clipseg_checkpoint(model_name, checkpoint_path, pytorch_dump_folder_path, push_to_hub):
    config = get_clipseg_config(model_name)
    model = CLIPSegForImageSegmentation(config)
    model.eval()

    state_dict = torch.load(checkpoint_path, map_location="cpu")

    # remove some keys
    for key in state_dict.copy().keys():
        if key.startswith("model"):
            state_dict.pop(key, None)

    # rename some keys
    state_dict = convert_state_dict(state_dict, config)
    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)

    if missing_keys != ["clip.text_model.embeddings.position_ids", "clip.vision_model.embeddings.position_ids"]:
        raise ValueError("Missing keys that are not expected: {}".format(missing_keys))
    if unexpected_keys != ["decoder.reduce.weight", "decoder.reduce.bias"]:
        raise ValueError(f"Unexpected keys: {unexpected_keys}")

    feature_extractor = ViTFeatureExtractor(size=352)
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPSegProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

    image = prepare_img()
    text = ["a glass", "something to fill", "wood", "a jar"]

    inputs = processor(text=text, images=[image] * len(text), padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # verify values
    expected_conditional = torch.tensor([0.1110, -0.1882, 0.1645])
    expected_pooled_output = torch.tensor([0.2692, -0.7197, -0.1328])
    if model_name == "clipseg-rd64-refined":
        expected_masks_slice = torch.tensor(
            [[-10.0407, -9.9431, -10.2646], [-9.9751, -9.7064, -9.9586], [-9.6891, -9.5645, -9.9618]]
        )
    elif model_name == "clipseg-rd64":
        expected_masks_slice = torch.tensor(
            [[-7.2877, -7.2711, -7.2463], [-7.2652, -7.2780, -7.2520], [-7.2239, -7.2204, -7.2001]]
        )
    elif model_name == "clipseg-rd16":
        expected_masks_slice = torch.tensor(
            [[-6.3955, -6.4055, -6.4151], [-6.3911, -6.4033, -6.4100], [-6.3474, -6.3702, -6.3762]]
        )
    else:
        raise ValueError(f"Model name {model_name} not supported.")

    assert torch.allclose(outputs.logits[0, :3, :3], expected_masks_slice, atol=1e-3)
    assert torch.allclose(outputs.conditional_embeddings[0, :3], expected_conditional, atol=1e-3)
    assert torch.allclose(outputs.pooled_output[0, :3], expected_pooled_output, atol=1e-3)
    print("Looks ok!")

    if pytorch_dump_folder_path is not None:
        print(f"Saving model and processor to {pytorch_dump_folder_path}")
        model.save_pretrained(pytorch_dump_folder_path)
        processor.save_pretrained(pytorch_dump_folder_path)

    if push_to_hub:
        print(f"Pushing model and processor for {model_name} to the hub")
        model.push_to_hub(f"CIDAS/{model_name}")
        processor.push_to_hub(f"CIDAS/{model_name}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--model_name",
        default="clipseg-rd64",
        type=str,
        choices=["clipseg-rd16", "clipseg-rd64", "clipseg-rd64-refined"],
        help=(
            "Name of the model. Supported models are: clipseg-rd64, clipseg-rd16 and clipseg-rd64-refined (rd meaning"
            " reduce dimension)"
        ),
    )
    parser.add_argument(
        "--checkpoint_path",
        default="/Users/nielsrogge/Documents/CLIPSeg/clip_plus_rd64-uni.pth",
        type=str,
        help=(
            "Path to the original checkpoint. Note that the script assumes that the checkpoint includes both CLIP and"
            " the decoder weights."
        ),
    )
    parser.add_argument(
        "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
    )
    parser.add_argument(
        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
    )

    args = parser.parse_args()
    convert_clipseg_checkpoint(args.model_name, args.checkpoint_path, args.pytorch_dump_folder_path, args.push_to_hub)
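For context, a hedged sketch of how the conversion entry point above might be invoked directly from Python; the checkpoint and output paths are placeholders, not paths taken from this PR:

# Hypothetical call of the function defined above; adjust the paths to your local setup.
convert_clipseg_checkpoint(
    model_name="clipseg-rd64",
    checkpoint_path="./clip_plus_rd64-uni.pth",    # original checkpoint with CLIP + decoder weights
    pytorch_dump_folder_path="./clipseg-rd64-hf",  # folder where the model and processor are saved
    push_to_hub=False,
)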
1493  src/transformers/models/clipseg/modeling_clipseg.py  Normal file
File diff suppressed because it is too large.

108  src/transformers/models/clipseg/processing_clipseg.py  Normal file
@@ -0,0 +1,108 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Image/Text processor class for CLIPSeg
"""
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding


class CLIPSegProcessor(ProcessorMixin):
    r"""
    Constructs a CLIPSeg processor which wraps a CLIPSeg feature extractor and a CLIP tokenizer into a single
    processor.

    [`CLIPSegProcessor`] offers all the functionalities of [`ViTFeatureExtractor`] and [`CLIPTokenizerFast`]. See the
    [`~CLIPSegProcessor.__call__`] and [`~CLIPSegProcessor.decode`] for more information.

    Args:
        feature_extractor ([`ViTFeatureExtractor`]):
            The feature extractor is a required input.
        tokenizer ([`CLIPTokenizerFast`]):
            The tokenizer is a required input.
    """
    feature_extractor_class = "ViTFeatureExtractor"
    tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast")

    def __init__(self, feature_extractor, tokenizer):
        super().__init__(feature_extractor, tokenizer)
        self.current_processor = self.feature_extractor

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        """
        Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
        and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        ViTFeatureExtractor's [`~ViTFeatureExtractor.__call__`] if `images` is not `None`. Please refer to the
        docstring of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
                number of channels, H and W are image height and width.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """

        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

        if images is not None:
            image_features = self.feature_extractor(images, return_tensors=return_tensors, **kwargs)

        if text is not None and images is not None:
            encoding["pixel_values"] = image_features.pixel_values
            return encoding
        elif text is not None:
            return encoding
        else:
            return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)
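As a quick illustration of the processor above, a minimal usage sketch; it mirrors the integration test further down, and the checkpoint name is the one this PR pushes under the CIDAS organization:

# Minimal sketch: tokenize text prompts and preprocess an image with CLIPSegProcessor.
import requests
from PIL import Image

from transformers import CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a cat", "a remote"]
inputs = processor(text=texts, images=[image] * len(texts), padding=True, return_tensors="pt")
print(list(inputs.keys()))  # ['input_ids', 'attention_mask', 'pixel_values']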
@@ -56,7 +56,7 @@ class ProcessorMixin(PushToHubMixin):
         # Sanitize args and kwargs
         for key in kwargs:
             if key not in self.attributes:
-                raise TypeError(f"Unexepcted keyword argument {key}.")
+                raise TypeError(f"Unexpected keyword argument {key}.")
         for arg, attribute_name in zip(args, self.attributes):
             if attribute_name in kwargs:
                 raise TypeError(f"Got multiple values for argument {attribute_name}.")
@@ -1207,6 +1207,44 @@ class CLIPVisionModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST = None


class CLIPSegForImageSegmentation(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class CLIPSegModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class CLIPSegPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class CLIPSegTextModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class CLIPSegVisionModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST = None
0  tests/models/clipseg/__init__.py  Normal file

735  tests/models/clipseg/test_modeling_clipseg.py  Normal file
@@ -0,0 +1,735 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch CLIPSeg model. """
|
||||
|
||||
|
||||
import inspect
|
||||
import os
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
|
||||
import requests
|
||||
import transformers
|
||||
from transformers import MODEL_MAPPING, CLIPSegConfig, CLIPSegProcessor, CLIPSegTextConfig, CLIPSegVisionConfig
|
||||
from transformers.models.auto import get_values
|
||||
from transformers.testing_utils import (
|
||||
is_flax_available,
|
||||
is_pt_flax_cross_test,
|
||||
require_torch,
|
||||
require_vision,
|
||||
slow,
|
||||
torch_device,
|
||||
)
|
||||
from transformers.utils import is_torch_available, is_vision_available
|
||||
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import (
|
||||
ModelTesterMixin,
|
||||
_config_zero_init,
|
||||
floats_tensor,
|
||||
ids_tensor,
|
||||
random_attention_mask,
|
||||
)
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from transformers import CLIPSegForImageSegmentation, CLIPSegModel, CLIPSegTextModel, CLIPSegVisionModel
|
||||
from transformers.models.clipseg.modeling_clipseg import CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
|
||||
if is_flax_available():
|
||||
import jax.numpy as jnp
|
||||
from transformers.modeling_flax_pytorch_utils import (
|
||||
convert_pytorch_state_dict_to_flax,
|
||||
load_flax_weights_in_pytorch_model,
|
||||
)
|
||||
|
||||
|
||||
class CLIPSegVisionModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=12,
|
||||
image_size=30,
|
||||
patch_size=2,
|
||||
num_channels=3,
|
||||
is_training=True,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
initializer_range=0.02,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.num_channels = num_channels
|
||||
self.is_training = is_training
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.dropout = dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.initializer_range = initializer_range
|
||||
self.scope = scope
|
||||
|
||||
# in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token)
|
||||
num_patches = (image_size // patch_size) ** 2
|
||||
self.seq_length = num_patches + 1
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
|
||||
config = self.get_config()
|
||||
|
||||
return config, pixel_values
|
||||
|
||||
def get_config(self):
|
||||
return CLIPSegVisionConfig(
|
||||
image_size=self.image_size,
|
||||
patch_size=self.patch_size,
|
||||
num_channels=self.num_channels,
|
||||
hidden_size=self.hidden_size,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
dropout=self.dropout,
|
||||
attention_dropout=self.attention_dropout,
|
||||
initializer_range=self.initializer_range,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, pixel_values):
|
||||
model = CLIPSegVisionModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
result = model(pixel_values)
|
||||
# expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
|
||||
image_size = (self.image_size, self.image_size)
|
||||
patch_size = (self.patch_size, self.patch_size)
|
||||
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size))
|
||||
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, pixel_values = config_and_inputs
|
||||
inputs_dict = {"pixel_values": pixel_values}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class CLIPSegVisionModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
"""
|
||||
Here we also overwrite some of the tests of test_modeling_common.py, as CLIPSeg does not use input_ids, inputs_embeds,
|
||||
attention_mask and seq_length.
|
||||
"""
|
||||
|
||||
all_model_classes = (CLIPSegVisionModel,) if is_torch_available() else ()
|
||||
fx_compatible = False
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_head_masking = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = CLIPSegVisionModelTester(self)
|
||||
self.config_tester = ConfigTester(
|
||||
self, config_class=CLIPSegVisionConfig, has_text_modality=False, hidden_size=37
|
||||
)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
@unittest.skip(reason="CLIPSeg does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
def test_model_common_attributes(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
|
||||
x = model.get_output_embeddings()
|
||||
self.assertTrue(x is None or isinstance(x, nn.Linear))
|
||||
|
||||
def test_forward_signature(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
signature = inspect.signature(model.forward)
|
||||
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||
arg_names = [*signature.parameters.keys()]
|
||||
|
||||
expected_arg_names = ["pixel_values"]
|
||||
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_training(self):
|
||||
pass
|
||||
|
||||
def test_training_gradient_checkpointing(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPSegVisionModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_from_base(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPSegVisionModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_to_base(self):
|
||||
pass
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = CLIPSegVisionModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
class CLIPSegTextModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=12,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_input_mask=True,
|
||||
use_labels=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
max_position_embeddings=512,
|
||||
initializer_range=0.02,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_labels = use_labels
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.dropout = dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.initializer_range = initializer_range
|
||||
self.scope = scope
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||
|
||||
if input_mask is not None:
|
||||
batch_size, seq_length = input_mask.shape
|
||||
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
|
||||
for batch_idx, start_index in enumerate(rnd_start_indices):
|
||||
input_mask[batch_idx, :start_index] = 1
|
||||
input_mask[batch_idx, start_index:] = 0
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_ids, input_mask
|
||||
|
||||
def get_config(self):
|
||||
return CLIPSegTextConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
dropout=self.dropout,
|
||||
attention_dropout=self.attention_dropout,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
initializer_range=self.initializer_range,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_ids, input_mask):
|
||||
model = CLIPSegTextModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
result = model(input_ids, attention_mask=input_mask)
|
||||
result = model(input_ids)
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, input_ids, input_mask = config_and_inputs
|
||||
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class CLIPSegTextModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (CLIPSegTextModel,) if is_torch_available() else ()
|
||||
fx_compatible = False
|
||||
test_pruning = False
|
||||
test_head_masking = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = CLIPSegTextModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=CLIPSegTextConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_training(self):
|
||||
pass
|
||||
|
||||
def test_training_gradient_checkpointing(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPSeg does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPSegTextModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_from_base(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPSegTextModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_to_base(self):
|
||||
pass
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = CLIPSegTextModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
class CLIPSegModelTester:
|
||||
def __init__(self, parent, is_training=True):
|
||||
self.parent = parent
|
||||
self.text_model_tester = CLIPSegTextModelTester(parent)
|
||||
self.vision_model_tester = CLIPSegVisionModelTester(parent)
|
||||
self.is_training = is_training
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
text_config, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
|
||||
vision_config, pixel_values = self.vision_model_tester.prepare_config_and_inputs()
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_ids, attention_mask, pixel_values
|
||||
|
||||
def get_config(self):
|
||||
return CLIPSegConfig.from_text_vision_configs(
|
||||
self.text_model_tester.get_config(),
|
||||
self.vision_model_tester.get_config(),
|
||||
projection_dim=64,
|
||||
reduce_dim=32,
|
||||
extract_layers=[1, 2, 3],
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_ids, attention_mask, pixel_values):
|
||||
model = CLIPSegModel(config).to(torch_device).eval()
|
||||
with torch.no_grad():
|
||||
result = model(input_ids, pixel_values, attention_mask)
|
||||
self.parent.assertEqual(
|
||||
result.logits_per_image.shape, (self.vision_model_tester.batch_size, self.text_model_tester.batch_size)
|
||||
)
|
||||
self.parent.assertEqual(
|
||||
result.logits_per_text.shape, (self.text_model_tester.batch_size, self.vision_model_tester.batch_size)
|
||||
)
|
||||
|
||||
def create_and_check_model_for_image_segmentation(self, config, input_ids, attention_mask, pixel_values):
|
||||
model = CLIPSegForImageSegmentation(config).to(torch_device).eval()
|
||||
with torch.no_grad():
|
||||
result = model(input_ids, pixel_values)
|
||||
self.parent.assertEqual(
|
||||
result.logits.shape,
|
||||
(
|
||||
self.vision_model_tester.batch_size,
|
||||
self.vision_model_tester.image_size,
|
||||
self.vision_model_tester.image_size,
|
||||
),
|
||||
)
|
||||
self.parent.assertEqual(
|
||||
result.conditional_embeddings.shape, (self.text_model_tester.batch_size, config.projection_dim)
|
||||
)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, input_ids, attention_mask, pixel_values = config_and_inputs
|
||||
inputs_dict = {
|
||||
"input_ids": input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"pixel_values": pixel_values,
|
||||
}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class CLIPSegModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (CLIPSegModel, CLIPSegForImageSegmentation) if is_torch_available() else ()
|
||||
fx_compatible = False
|
||||
test_head_masking = False
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_attention_outputs = False
|
||||
|
||||
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
|
||||
# CLIPSegForImageSegmentation requires special treatment
|
||||
if return_labels:
|
||||
if model_class.__name__ == "CLIPSegForImageSegmentation":
|
||||
batch_size, _, height, width = inputs_dict["pixel_values"].shape
|
||||
inputs_dict["labels"] = torch.zeros(
|
||||
[batch_size, height, width], device=torch_device, dtype=torch.float
|
||||
)
|
||||
|
||||
return inputs_dict
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = CLIPSegModelTester(self)
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_model_for_image_segmentation(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model_for_image_segmentation(*config_and_inputs)
|
||||
|
||||
@unittest.skip(reason="Hidden_states is tested in individual model tests")
|
||||
def test_hidden_states_output(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="Inputs_embeds is tested in individual model tests")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="Retain_grad is tested in individual model tests")
|
||||
def test_retain_grad_hidden_states_attentions(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPSegModel does not have input/output embeddings")
|
||||
def test_model_common_attributes(self):
|
||||
pass
|
||||
|
||||
# override as some parameters require custom initialization
|
||||
def test_initialization(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
configs_no_init = _config_zero_init(config)
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
for name, param in model.named_parameters():
|
||||
if param.requires_grad:
|
||||
# check if `logit_scale` is initialized as per the original implementation
|
||||
if "logit_scale" in name:
|
||||
self.assertAlmostEqual(
|
||||
param.data.item(),
|
||||
np.log(1 / 0.07),
|
||||
delta=1e-3,
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
elif "film" in name or "transposed_conv" in name or "reduce" in name:
|
||||
# those parameters use PyTorch's default nn.Linear initialization scheme
|
||||
pass
|
||||
else:
|
||||
self.assertIn(
|
||||
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||
[0.0, 1.0],
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
|
||||
def _create_and_check_torchscript(self, config, inputs_dict):
|
||||
if not self.test_torchscript:
|
||||
return
|
||||
|
||||
configs_no_init = _config_zero_init(config) # To be sure we have no Nan
|
||||
configs_no_init.torchscript = True
|
||||
configs_no_init.return_dict = False
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
try:
|
||||
input_ids = inputs_dict["input_ids"]
|
||||
pixel_values = inputs_dict["pixel_values"] # CLIPSeg needs pixel_values
|
||||
traced_model = torch.jit.trace(model, (input_ids, pixel_values))
|
||||
except RuntimeError:
|
||||
self.fail("Couldn't trace module.")
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir_name:
|
||||
pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
|
||||
|
||||
try:
|
||||
torch.jit.save(traced_model, pt_file_name)
|
||||
except Exception:
|
||||
self.fail("Couldn't save module.")
|
||||
|
||||
try:
|
||||
loaded_model = torch.jit.load(pt_file_name)
|
||||
except Exception:
|
||||
self.fail("Couldn't load module.")
|
||||
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
loaded_model.to(torch_device)
|
||||
loaded_model.eval()
|
||||
|
||||
model_state_dict = model.state_dict()
|
||||
loaded_model_state_dict = loaded_model.state_dict()
|
||||
|
||||
self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))
|
||||
|
||||
models_equal = True
|
||||
for layer_name, p1 in model_state_dict.items():
|
||||
p2 = loaded_model_state_dict[layer_name]
|
||||
if p1.data.ne(p2.data).sum() > 0:
|
||||
models_equal = False
|
||||
|
||||
self.assertTrue(models_equal)
|
||||
|
||||
def test_load_vision_text_config(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
# Save CLIPSegConfig and check if we can load CLIPSegVisionConfig from it
|
||||
with tempfile.TemporaryDirectory() as tmp_dir_name:
|
||||
config.save_pretrained(tmp_dir_name)
|
||||
vision_config = CLIPSegVisionConfig.from_pretrained(tmp_dir_name)
|
||||
self.assertDictEqual(config.vision_config.to_dict(), vision_config.to_dict())
|
||||
|
||||
# Save CLIPSegConfig and check if we can load CLIPSegTextConfig from it
|
||||
with tempfile.TemporaryDirectory() as tmp_dir_name:
|
||||
config.save_pretrained(tmp_dir_name)
|
||||
text_config = CLIPSegTextConfig.from_pretrained(tmp_dir_name)
|
||||
self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict())
|
||||
|
||||
# overwrite from common since FlaxCLIPSegModel returns nested output
|
||||
# which is not supported in the common test
|
||||
@is_pt_flax_cross_test
|
||||
def test_equivalence_pt_to_flax(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
with self.subTest(model_class.__name__):
|
||||
|
||||
# load PyTorch class
|
||||
pt_model = model_class(config).eval()
|
||||
# Flax models don't use the `use_cache` option and cache is not returned as a default.
|
||||
# So we disable `use_cache` here for PyTorch model.
|
||||
pt_model.config.use_cache = False
|
||||
|
||||
fx_model_class_name = "Flax" + model_class.__name__
|
||||
|
||||
if not hasattr(transformers, fx_model_class_name):
|
||||
return
|
||||
|
||||
fx_model_class = getattr(transformers, fx_model_class_name)
|
||||
|
||||
# load Flax class
|
||||
fx_model = fx_model_class(config, dtype=jnp.float32)
|
||||
# make sure only Flax inputs that actually exist in the function args are forwarded
|
||||
fx_input_keys = inspect.signature(fx_model.__call__).parameters.keys()
|
||||
|
||||
# prepare inputs
|
||||
pt_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||
|
||||
# remove function args that don't exist in Flax
|
||||
pt_inputs = {k: v for k, v in pt_inputs.items() if k in fx_input_keys}
|
||||
|
||||
fx_state = convert_pytorch_state_dict_to_flax(pt_model.state_dict(), fx_model)
|
||||
fx_model.params = fx_state
|
||||
|
||||
with torch.no_grad():
|
||||
pt_outputs = pt_model(**pt_inputs).to_tuple()
|
||||
|
||||
# convert inputs to Flax
|
||||
fx_inputs = {k: np.array(v) for k, v in pt_inputs.items() if torch.is_tensor(v)}
|
||||
fx_outputs = fx_model(**fx_inputs).to_tuple()
|
||||
self.assertEqual(len(fx_outputs), len(pt_outputs), "Output lengths differ between Flax and PyTorch")
|
||||
for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs[:4]):
|
||||
self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2)
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
pt_model.save_pretrained(tmpdirname)
|
||||
fx_model_loaded = fx_model_class.from_pretrained(tmpdirname, from_pt=True)
|
||||
|
||||
fx_outputs_loaded = fx_model_loaded(**fx_inputs).to_tuple()
|
||||
self.assertEqual(
|
||||
len(fx_outputs_loaded), len(pt_outputs), "Output lengths differ between Flax and PyTorch"
|
||||
)
|
||||
for fx_output_loaded, pt_output in zip(fx_outputs_loaded[:4], pt_outputs[:4]):
|
||||
self.assert_almost_equals(fx_output_loaded, pt_output.numpy(), 4e-2)
|
||||
|
||||
# overwrite from common since FlaxCLIPSegModel returns nested output
|
||||
# which is not supported in the common test
|
||||
@is_pt_flax_cross_test
|
||||
def test_equivalence_flax_to_pt(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
with self.subTest(model_class.__name__):
|
||||
# load corresponding PyTorch class
|
||||
pt_model = model_class(config).eval()
|
||||
|
||||
# So we disable `use_cache` here for PyTorch model.
|
||||
pt_model.config.use_cache = False
|
||||
|
||||
fx_model_class_name = "Flax" + model_class.__name__
|
||||
|
||||
if not hasattr(transformers, fx_model_class_name):
|
||||
# no flax model exists for this class
|
||||
return
|
||||
|
||||
fx_model_class = getattr(transformers, fx_model_class_name)
|
||||
|
||||
# load Flax class
|
||||
fx_model = fx_model_class(config, dtype=jnp.float32)
|
||||
# make sure only Flax inputs that actually exist in the function args are forwarded
|
||||
fx_input_keys = inspect.signature(fx_model.__call__).parameters.keys()
|
||||
|
||||
pt_model = load_flax_weights_in_pytorch_model(pt_model, fx_model.params)
|
||||
|
||||
# make sure weights are tied in PyTorch
|
||||
pt_model.tie_weights()
|
||||
|
||||
# prepare inputs
|
||||
pt_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||
|
||||
# remove function args that don't exist in Flax
|
||||
pt_inputs = {k: v for k, v in pt_inputs.items() if k in fx_input_keys}
|
||||
|
||||
with torch.no_grad():
|
||||
pt_outputs = pt_model(**pt_inputs).to_tuple()
|
||||
|
||||
fx_inputs = {k: np.array(v) for k, v in pt_inputs.items() if torch.is_tensor(v)}
|
||||
|
||||
fx_outputs = fx_model(**fx_inputs).to_tuple()
|
||||
self.assertEqual(len(fx_outputs), len(pt_outputs), "Output lengths differ between Flax and PyTorch")
|
||||
|
||||
for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs[:4]):
|
||||
self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2)
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
fx_model.save_pretrained(tmpdirname)
|
||||
pt_model_loaded = model_class.from_pretrained(tmpdirname, from_flax=True)
|
||||
|
||||
with torch.no_grad():
|
||||
pt_outputs_loaded = pt_model_loaded(**pt_inputs).to_tuple()
|
||||
|
||||
self.assertEqual(
|
||||
len(fx_outputs), len(pt_outputs_loaded), "Output lengths differ between Flax and PyTorch"
|
||||
)
|
||||
for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs_loaded[:4]):
|
||||
self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2)
|
||||
|
||||
def test_training(self):
|
||||
if not self.model_tester.is_training:
|
||||
return
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.return_dict = True
|
||||
|
||||
if model_class in get_values(MODEL_MAPPING):
|
||||
continue
|
||||
|
||||
print("Model class:", model_class)
|
||||
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||
for k, v in inputs.items():
|
||||
print(k, v.shape)
|
||||
loss = model(**inputs).loss
|
||||
loss.backward()
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in CLIPSEG_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = CLIPSegModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
return image
|
||||
|
||||
|
||||
@require_vision
|
||||
@require_torch
|
||||
class CLIPSegModelIntegrationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_inference_image_segmentation(self):
|
||||
model_name = "CIDAS/clipseg-rd64-refined"
|
||||
processor = CLIPSegProcessor.from_pretrained(model_name)
|
||||
model = CLIPSegForImageSegmentation.from_pretrained(model_name).to(torch_device)
|
||||
|
||||
image = prepare_img()
|
||||
texts = ["a cat", "a remote", "a blanket"]
|
||||
inputs = processor(text=texts, images=[image] * len(texts), padding=True, return_tensors="pt").to(torch_device)
|
||||
|
||||
# forward pass
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
# verify the predicted masks
|
||||
self.assertEqual(
|
||||
outputs.logits.shape,
|
||||
torch.Size((3, 352, 352)),
|
||||
)
|
||||
expected_masks_slice = torch.tensor(
|
||||
[[-7.4577, -7.4952, -7.4072], [-7.3115, -7.0969, -7.1624], [-6.9472, -6.7641, -6.8911]]
|
||||
)
|
||||
self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_masks_slice, atol=1e-3))
|
||||
|
||||
# verify conditional and pooled output
|
||||
expected_conditional = torch.tensor([0.5601, -0.0314, 0.1980])
|
||||
expected_pooled_output = torch.tensor([0.2692, -0.7197, -0.1328])
|
||||
self.assertTrue(torch.allclose(outputs.conditional_embeddings[0, :3], expected_conditional, atol=1e-3))
|
||||
self.assertTrue(torch.allclose(outputs.pooled_output[0, :3], expected_pooled_output, atol=1e-3))
|
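The integration test above only checks raw logits of shape (3, 352, 352). As a non-authoritative sketch, one way to turn those logits into per-prompt probability masks (the sigmoid and the 0.5 threshold are choices made here for illustration, not something this PR prescribes):

# Continuing from `outputs` in the integration test above (illustration only).
masks = outputs.logits.sigmoid()   # (3, 352, 352): one soft mask per text prompt, values in [0, 1]
binary_masks = masks > 0.5         # arbitrary threshold for a hard segmentation
print(masks.shape)                 # torch.Size([3, 352, 352])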
188  tests/models/clipseg/test_processor_clipseg.py  Normal file
@@ -0,0 +1,188 @@
|
||||
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import json
|
||||
import os
|
||||
import shutil
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from transformers import CLIPTokenizer, CLIPTokenizerFast
|
||||
from transformers.models.clip.tokenization_clip import VOCAB_FILES_NAMES
|
||||
from transformers.testing_utils import require_vision
|
||||
from transformers.utils import FEATURE_EXTRACTOR_NAME, is_vision_available
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
from transformers import CLIPSegProcessor, ViTFeatureExtractor
|
||||
|
||||
|
||||
@require_vision
|
||||
class CLIPSegProcessorTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmpdirname = tempfile.mkdtemp()
|
||||
|
||||
# fmt: off
|
||||
vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", "lo", "l</w>", "w</w>", "r</w>", "t</w>", "low</w>", "er</w>", "lowest</w>", "newer</w>", "wider", "<unk>", "<|startoftext|>", "<|endoftext|>"]
|
||||
# fmt: on
|
||||
vocab_tokens = dict(zip(vocab, range(len(vocab))))
|
||||
merges = ["#version: 0.2", "l o", "lo w</w>", "e r</w>", ""]
|
||||
self.special_tokens_map = {"unk_token": "<unk>"}
|
||||
|
||||
self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
|
||||
self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
|
||||
with open(self.vocab_file, "w", encoding="utf-8") as fp:
|
||||
fp.write(json.dumps(vocab_tokens) + "\n")
|
||||
with open(self.merges_file, "w", encoding="utf-8") as fp:
|
||||
fp.write("\n".join(merges))
|
||||
|
||||
feature_extractor_map = {
|
||||
"do_resize": True,
|
||||
"size": 20,
|
||||
"do_center_crop": True,
|
||||
"crop_size": 18,
|
||||
"do_normalize": True,
|
||||
"image_mean": [0.48145466, 0.4578275, 0.40821073],
|
||||
"image_std": [0.26862954, 0.26130258, 0.27577711],
|
||||
}
|
||||
self.feature_extractor_file = os.path.join(self.tmpdirname, FEATURE_EXTRACTOR_NAME)
|
||||
with open(self.feature_extractor_file, "w", encoding="utf-8") as fp:
|
||||
json.dump(feature_extractor_map, fp)
|
||||
|
||||
def get_tokenizer(self, **kwargs):
|
||||
return CLIPTokenizer.from_pretrained(self.tmpdirname, **kwargs)
|
||||
|
||||
def get_rust_tokenizer(self, **kwargs):
|
||||
return CLIPTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
|
||||
|
||||
def get_feature_extractor(self, **kwargs):
|
||||
return ViTFeatureExtractor.from_pretrained(self.tmpdirname, **kwargs)
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmpdirname)
|
||||
|
||||
def prepare_image_inputs(self):
|
||||
"""This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
|
||||
or a list of PyTorch tensors if one specifies torchify=True."""
|
||||
|
||||
image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
|
||||
|
||||
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
|
||||
|
||||
return image_inputs
|
||||
|
||||
def test_save_load_pretrained_default(self):
|
||||
tokenizer_slow = self.get_tokenizer()
|
||||
tokenizer_fast = self.get_rust_tokenizer()
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
|
||||
processor_slow = CLIPSegProcessor(tokenizer=tokenizer_slow, feature_extractor=feature_extractor)
|
||||
processor_slow.save_pretrained(self.tmpdirname)
|
||||
processor_slow = CLIPSegProcessor.from_pretrained(self.tmpdirname, use_fast=False)
|
||||
|
||||
processor_fast = CLIPSegProcessor(tokenizer=tokenizer_fast, feature_extractor=feature_extractor)
|
||||
processor_fast.save_pretrained(self.tmpdirname)
|
||||
processor_fast = CLIPSegProcessor.from_pretrained(self.tmpdirname)
|
||||
|
||||
self.assertEqual(processor_slow.tokenizer.get_vocab(), tokenizer_slow.get_vocab())
|
||||
self.assertEqual(processor_fast.tokenizer.get_vocab(), tokenizer_fast.get_vocab())
|
||||
self.assertEqual(tokenizer_slow.get_vocab(), tokenizer_fast.get_vocab())
|
||||
self.assertIsInstance(processor_slow.tokenizer, CLIPTokenizer)
|
||||
self.assertIsInstance(processor_fast.tokenizer, CLIPTokenizerFast)
|
||||
|
||||
self.assertEqual(processor_slow.feature_extractor.to_json_string(), feature_extractor.to_json_string())
|
||||
self.assertEqual(processor_fast.feature_extractor.to_json_string(), feature_extractor.to_json_string())
|
||||
self.assertIsInstance(processor_slow.feature_extractor, ViTFeatureExtractor)
|
||||
self.assertIsInstance(processor_fast.feature_extractor, ViTFeatureExtractor)
|
||||
|
||||
def test_save_load_pretrained_additional_features(self):
|
||||
processor = CLIPSegProcessor(tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor())
|
||||
processor.save_pretrained(self.tmpdirname)
|
||||
|
||||
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
|
||||
feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)
|
||||
|
||||
processor = CLIPSegProcessor.from_pretrained(
|
||||
self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
|
||||
)
|
||||
|
||||
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
|
||||
self.assertIsInstance(processor.tokenizer, CLIPTokenizerFast)
|
||||
|
||||
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
|
||||
self.assertIsInstance(processor.feature_extractor, ViTFeatureExtractor)
|
||||
|
||||
def test_feature_extractor(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = CLIPSegProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
|
||||
|
||||
image_input = self.prepare_image_inputs()
|
||||
|
||||
input_feat_extract = feature_extractor(image_input, return_tensors="np")
|
||||
input_processor = processor(images=image_input, return_tensors="np")
|
||||
|
||||
for key in input_feat_extract.keys():
|
||||
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)
|
||||
|
||||
def test_tokenizer(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = CLIPSegProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
|
||||
|
||||
input_str = "lower newer"
|
||||
|
||||
encoded_processor = processor(text=input_str)
|
||||
|
||||
encoded_tok = tokenizer(input_str)
|
||||
|
||||
for key in encoded_tok.keys():
|
||||
self.assertListEqual(encoded_tok[key], encoded_processor[key])
|
||||
|
||||
def test_processor(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = CLIPSegProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
|
||||
|
||||
input_str = "lower newer"
|
||||
image_input = self.prepare_image_inputs()
|
||||
|
||||
inputs = processor(text=input_str, images=image_input)
|
||||
|
||||
self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask", "pixel_values"])
|
||||
|
||||
# test if it raises when no input is passed
|
||||
with pytest.raises(ValueError):
|
||||
processor()
|
||||
|
||||
def test_tokenizer_decode(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = CLIPSegProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
|
||||
|
||||
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
|
||||
|
||||
decoded_processor = processor.batch_decode(predicted_ids)
|
||||
decoded_tok = tokenizer.batch_decode(predicted_ids)
|
||||
|
||||
self.assertListEqual(decoded_tok, decoded_processor)
|
@@ -46,6 +46,7 @@ PRIVATE_MODELS = [
 # Being in this list is an exception and should **not** be the rule.
 IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
     # models to ignore for not tested
+    "CLIPSegDecoder",  # Building part of bigger (tested) model.
     "TableTransformerEncoder",  # Building part of bigger (tested) model.
     "TableTransformerDecoder",  # Building part of bigger (tested) model.
     "TimeSeriesTransformerEncoder",  # Building part of bigger (tested) model.
@@ -140,6 +141,9 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
 # should **not** be the rule.
 IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
     # models to ignore for model xxx mapping
+    "CLIPSegForImageSegmentation",
+    "CLIPSegVisionModel",
+    "CLIPSegTextModel",
     "EsmForProteinFolding",
     "TimeSeriesTransformerForPrediction",
     "PegasusXEncoder",

@@ -35,6 +35,7 @@ src/transformers/models/bloom/configuration_bloom.py
 src/transformers/models/camembert/configuration_camembert.py
 src/transformers/models/canine/configuration_canine.py
 src/transformers/models/clip/configuration_clip.py
+src/transformers/models/clipseg/modeling_clipseg.py
 src/transformers/models/codegen/configuration_codegen.py
 src/transformers/models/conditional_detr/configuration_conditional_detr.py
 src/transformers/models/conditional_detr/modeling_conditional_detr.py