
PyTorch TensorFlow Flax FlashAttention SDPA

CLIP

CLIP is a multimodal vision and language model motivated by overcoming the fixed number of object categories a conventional computer vision model is trained to recognize. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining at this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and a text encoder to extract visual and text features. Both sets of features are projected into a latent space with the same number of dimensions, and their dot product gives a similarity score.

You can find all the original CLIP checkpoints under the OpenAI organization.
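The projection-and-dot-product behavior described above can be inspected directly. The minimal sketch below uses [CLIPModel.get_text_features] and [CLIPModel.get_image_features] (both documented further down this page) to pull out the projected embeddings, normalize them, and take their dot product as a cosine similarity score; the 512-dimensional projection is specific to the openai/clip-vit-base-patch32 checkpoint.

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

text_inputs = processor(text=labels, return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)     # projected text features, shape (2, 512)
    image_embeds = model.get_image_features(**image_inputs)  # projected image features, shape (1, 512)

# normalize so the dot product is a cosine similarity
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.T)  # similarity score for each (image, label) pair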

Tip

Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks.

The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [Pipeline] or the [AutoModel] class.

import torch
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
    torch_dtype=torch.bfloat16,
    device=0
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.bfloat16, attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")

Notes

  • Use [CLIPImageProcessor] to resize (or rescale) and normalize images for the model, as shown in the sketch below.
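A minimal sketch of that preprocessing step, assuming the openai/clip-vit-base-patch32 checkpoint (which resizes and center-crops to 224x224, rescales pixel values to [0, 1], and normalizes with CLIP's image mean and std):

import requests
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# resize, center crop, rescale, and normalize in one call
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])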

CLIPConfig

autodoc CLIPConfig - from_text_vision_configs

CLIPTextConfig

autodoc CLIPTextConfig

CLIPVisionConfig

autodoc CLIPVisionConfig

CLIPTokenizer

autodoc CLIPTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

CLIPTokenizerFast

autodoc CLIPTokenizerFast

CLIPImageProcessor

autodoc CLIPImageProcessor - preprocess

CLIPImageProcessorFast

autodoc CLIPImageProcessorFast - preprocess

CLIPFeatureExtractor

autodoc CLIPFeatureExtractor

CLIPProcessor

autodoc CLIPProcessor

CLIPModel

autodoc CLIPModel - forward - get_text_features - get_image_features

CLIPTextModel

autodoc CLIPTextModel - forward

CLIPTextModelWithProjection

autodoc CLIPTextModelWithProjection - forward

CLIPVisionModelWithProjection

autodoc CLIPVisionModelWithProjection - forward

CLIPVisionModel

autodoc CLIPVisionModel - forward

CLIPForImageClassification

autodoc CLIPForImageClassification - forward

TFCLIPModel

autodoc TFCLIPModel - call - get_text_features - get_image_features

TFCLIPTextModel

autodoc TFCLIPTextModel - call

TFCLIPVisionModel

autodoc TFCLIPVisionModel - call

FlaxCLIPModel

autodoc FlaxCLIPModel - call - get_text_features - get_image_features

FlaxCLIPTextModel

autodoc FlaxCLIPTextModel - call

FlaxCLIPTextModelWithProjection

autodoc FlaxCLIPTextModelWithProjection - call

FlaxCLIPVisionModel

autodoc FlaxCLIPVisionModel - call