
# CLIP
CLIP is a multimodal vision and language model motivated by overcoming the limitation of training computer vision models on a fixed set of object categories. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining at this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and a text encoder to extract visual and text features. Both features are projected to a latent space with the same number of dimensions, and their dot product gives a similarity score.
You can find all the original CLIP checkpoints under the OpenAI organization.
> [!TIP]
> Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks.
The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [`Pipeline`] or the [`AutoModel`] class.
```py
import torch
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
    torch_dtype=torch.bfloat16,
    device=0
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
```
The same comparison can be run directly with [`AutoModel`].

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.bfloat16, attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# image-text similarity scores, converted to probabilities over the candidate labels
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```
## Notes
- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model (see the sketch below).
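For reference, a small sketch of calling [`CLIPImageProcessor`] on its own with the same checkpoint as above. [`CLIPProcessor`] bundles it with the tokenizer, so this step is only needed when preparing images separately.

```py
import requests
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# resizes, center-crops, rescales, and normalizes into model-ready pixel values
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # e.g. torch.Size([1, 3, 224, 224])
```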
## CLIPConfig

[[autodoc]] CLIPConfig
    - from_text_vision_configs

## CLIPTextConfig

[[autodoc]] CLIPTextConfig

## CLIPVisionConfig

[[autodoc]] CLIPVisionConfig

## CLIPTokenizer

[[autodoc]] CLIPTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## CLIPTokenizerFast

[[autodoc]] CLIPTokenizerFast

## CLIPImageProcessor

[[autodoc]] CLIPImageProcessor
    - preprocess

## CLIPImageProcessorFast

[[autodoc]] CLIPImageProcessorFast
    - preprocess

## CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor

## CLIPProcessor

[[autodoc]] CLIPProcessor

## CLIPModel

[[autodoc]] CLIPModel
    - forward
    - get_text_features
    - get_image_features

## CLIPTextModel

[[autodoc]] CLIPTextModel
    - forward

## CLIPTextModelWithProjection

[[autodoc]] CLIPTextModelWithProjection
    - forward

## CLIPVisionModelWithProjection

[[autodoc]] CLIPVisionModelWithProjection
    - forward

## CLIPVisionModel

[[autodoc]] CLIPVisionModel
    - forward

## CLIPForImageClassification

[[autodoc]] CLIPForImageClassification
    - forward

## TFCLIPModel

[[autodoc]] TFCLIPModel
    - call
    - get_text_features
    - get_image_features

## TFCLIPTextModel

[[autodoc]] TFCLIPTextModel
    - call

## TFCLIPVisionModel

[[autodoc]] TFCLIPVisionModel
    - call

## FlaxCLIPModel

[[autodoc]] FlaxCLIPModel
    - __call__
    - get_text_features
    - get_image_features

## FlaxCLIPTextModel

[[autodoc]] FlaxCLIPTextModel
    - __call__

## FlaxCLIPTextModelWithProjection

[[autodoc]] FlaxCLIPTextModelWithProjection
    - __call__

## FlaxCLIPVisionModel

[[autodoc]] FlaxCLIPVisionModel
    - __call__