PyTorch TensorFlow Flax FlashAttention SDPA
# CLIP [CLIP](https://huggingface.co/papers/2103.00020) is a is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and text encoder to get visual features and text features. Both features are projected to a latent space with the same number of dimensions and their dot product gives a similarity score. You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai?search_models=clip) organization. > [!TIP] > Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks. The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [`Pipeline`] or the [`AutoModel`] class. ```py import torch from transformers import pipeline clip = pipeline( task="zero-shot-image-classification", model="openai/clip-vit-base-patch32", torch_dtype=torch.bfloat16, device=0 ) labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels) ``` ```py import requests import torch from PIL import Image from transformers import AutoProcessor, AutoModel model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.bfloat16, attn_implementation="sdpa") processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32") url = "http://images.cocodataset.org/val2017/000000039769.jpg" image = Image.open(requests.get(url, stream=True).raw) labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] inputs = processor(text=labels, images=image, return_tensors="pt", padding=True) outputs = model(**inputs) logits_per_image = outputs.logits_per_image probs = logits_per_image.softmax(dim=1) most_likely_idx = probs.argmax(dim=1).item() most_likely_label = labels[most_likely_idx] print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}") ``` ## Notes - Use [`CLIPImageProcessor`] to resize (or rescale) and normalizes images for the model. ## CLIPConfig [[autodoc]] CLIPConfig - from_text_vision_configs ## CLIPTextConfig [[autodoc]] CLIPTextConfig ## CLIPVisionConfig [[autodoc]] CLIPVisionConfig ## CLIPTokenizer [[autodoc]] CLIPTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary ## CLIPTokenizerFast [[autodoc]] CLIPTokenizerFast ## CLIPImageProcessor [[autodoc]] CLIPImageProcessor - preprocess ## CLIPImageProcessorFast [[autodoc]] CLIPImageProcessorFast - preprocess ## CLIPFeatureExtractor [[autodoc]] CLIPFeatureExtractor ## CLIPProcessor [[autodoc]] CLIPProcessor ## CLIPModel [[autodoc]] CLIPModel - forward - get_text_features - get_image_features ## CLIPTextModel [[autodoc]] CLIPTextModel - forward ## CLIPTextModelWithProjection [[autodoc]] CLIPTextModelWithProjection - forward ## CLIPVisionModelWithProjection [[autodoc]] CLIPVisionModelWithProjection - forward ## CLIPVisionModel [[autodoc]] CLIPVisionModel - forward ## CLIPForImageClassification [[autodoc]] CLIPForImageClassification - forward ## TFCLIPModel [[autodoc]] TFCLIPModel - call - get_text_features - get_image_features ## TFCLIPTextModel [[autodoc]] TFCLIPTextModel - call ## TFCLIPVisionModel [[autodoc]] TFCLIPVisionModel - call ## FlaxCLIPModel [[autodoc]] FlaxCLIPModel - __call__ - get_text_features - get_image_features ## FlaxCLIPTextModel [[autodoc]] FlaxCLIPTextModel - __call__ ## FlaxCLIPTextModelWithProjection [[autodoc]] FlaxCLIPTextModelWithProjection - __call__ ## FlaxCLIPVisionModel [[autodoc]] FlaxCLIPVisionModel - __call__