Update altCLIP model card (#38306)

* Update altclip.md
* Update altclip.md
* Update altclip.md
* Update altclip.md
* Update altclip.md
* Update altclip.md
* Rename altclip.md to altclip.mdx
* Rename altclip.mdx to altclip.md
* Update altclip.md
* Update altclip.md
* Update altclip.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-07-04 13:20:12 +06:00 · 2025-06-11 23:48:34 +02:00 · 32dbf4bddb
commit 32dbf4bddb
parent 1dcb022e8f
1 changed files with 68 additions and 64 deletions
--- a/docs/source/en/model_doc/altclip.md
+++ b/docs/source/en/model_doc/altclip.md
@@ -14,103 +14,107 @@ rendered properly in your Markdown viewer.
 -->
-# AltCLIP
+<div style="float: right;">
-
+  <div class="flex flex-wrap space-x-1">
-<div class="flex flex-wrap space-x-1">
+    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 </div>
-## Overview
+# AltCLIP
-The AltCLIP model was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679v2) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu. AltCLIP
+[AltCLIP](https://huggingface.co/papers/2211.06679v2) replaces the [CLIP](./clip) text encoder with a multilingual XLM-R encoder and aligns image and text representations with teacher learning and contrastive learning.
 (Altering the Language Encoder in CLIP) is a neural network trained on a variety of image-text and text-text pairs. By switching CLIP's
 text encoder with a pretrained multilingual text encoder XLM-R, we obtain performance very close to CLIP on almost all tasks and extend the original CLIP's capabilities, such as multilingual understanding.
-The abstract from the paper is the following:
+You can find all the original AltCLIP checkpoints under the [AltClip](https://huggingface.co/collections/BAAI/alt-clip-diffusion-66987a97de8525205f1221bf) collection.
-*In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. 
+> [!TIP]
-Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained 
+> Click on the AltCLIP models in the right sidebar for more examples of how to apply AltCLIP to different tasks.
 multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of 
 teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art 
 performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, and COCO-CN. Further, we obtain very close performances with 
 CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
-This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
+The example below demonstrates how to calculate similarity scores between an image and one or more captions with the [`AutoModel`] class.
-## Usage tips and example
+<hfoptions id="usage">
-
+<hfoption id="AutoModel">
 The usage of AltCLIP is very similar to CLIP; the difference from CLIP is the text encoder. Note that AltCLIP uses bidirectional attention instead of causal attention
 and takes the [CLS] token in XLM-R to represent the text embedding.
 AltCLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
 classification. AltCLIP uses a ViT-like transformer to get visual features and a bidirectional language model to get the text
 features. Both the text and visual features are then projected to a latent space with the same dimension. The dot
 product between the projected image and text features is then used as a similarity score.
 To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
 which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
 also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
 The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.
 The [`AltCLIPProcessor`] wraps a [`CLIPImageProcessor`] and a [`XLMRobertaTokenizer`] into a single instance to both
 encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
 [`AltCLIPProcessor`] and [`AltCLIPModel`].
 ```python
->>> from PIL import Image
+import torch
->>> import requests
+import requests
 from PIL import Image
 from transformers import AltCLIPModel, AltCLIPProcessor
->>> from transformers import AltCLIPModel, AltCLIPProcessor
+model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", torch_dtype=torch.bfloat16)
 processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
->>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
->>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
+image = Image.open(requests.get(url, stream=True).raw)
->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
+outputs = model(**inputs)
 logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
 probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
->>> outputs = model(**inputs)
+labels = ["a photo of a cat", "a photo of a dog"]
->>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
+for label, prob in zip(labels, probs[0]):
->>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
+    print(f"{label}: {prob.item():.4f}")
 ```
-<Tip>
+</hfoption>
 </hfoptions>
-This model is based on `CLIPModel`, use it like you would use the original [CLIP](clip).
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
-</Tip>
+The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
 ```python
 # !pip install torchao
 import torch
 import requests
 from PIL import Image
 from transformers import AltCLIPModel, AltCLIPProcessor, TorchAoConfig
 model = AltCLIPModel.from_pretrained(
    "BAAI/AltCLIP",
    quantization_config=TorchAoConfig("int4_weight_only", group_size=128),
    torch_dtype=torch.bfloat16,
 )
 processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
 url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
 image = Image.open(requests.get(url, stream=True).raw)
 inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
 outputs = model(**inputs)
 logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
 probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
 labels = ["a photo of a cat", "a photo of a dog"]
 for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob.item():.4f}")
 ```
 ## Notes
 - AltCLIP uses bidirectional attention instead of causal attention and it uses the `[CLS]` token in XLM-R to represent a text embedding.
 - Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model.
 - [`AltCLIPProcessor`] combines [`CLIPImageProcessor`] and [`XLMRobertaTokenizer`] into a single instance to encode text and prepare images.
 ## AltCLIPConfig
 [[autodoc]] AltCLIPConfig
    - from_text_vision_configs
 ## AltCLIPTextConfig
 [[autodoc]] AltCLIPTextConfig
 ## AltCLIPVisionConfig
 [[autodoc]] AltCLIPVisionConfig
 ## AltCLIPProcessor
 [[autodoc]] AltCLIPProcessor
 ## AltCLIPModel
 [[autodoc]] AltCLIPModel
    - forward
    - get_text_features
    - get_image_features
 ## AltCLIPTextModel
 [[autodoc]] AltCLIPTextModel
    - forward
 ## AltCLIPVisionModel
 [[autodoc]] AltCLIPVisionModel
-    - forward
+
 ## AltCLIPProcessor
 [[autodoc]] AltCLIPProcessor
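
Note on the scoring step: the model card text in this diff describes the similarity score as the dot product between the projected image and text features, followed by a softmax over captions. The snippet below is a minimal, dependency-free sketch of that computation, not the library's implementation. The toy embeddings, the helper names (`l2_normalize`, `softmax`, `image_text_probs`), the L2 normalization step, and the `logit_scale` value of 100.0 are all assumptions made for illustration; the real model learns its own scale and produces much larger embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so the dot product becomes a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_embed, text_embeds, logit_scale=100.0):
    """Return one probability per caption for a single image.

    logit_scale is a hypothetical fixed temperature used only for this sketch.
    """
    image_unit = l2_normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        text_unit = l2_normalize(text_embed)
        cosine = sum(i * t for i, t in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy 3-dimensional embeddings standing in for projected model outputs.
image_embed = [0.9, 0.1, 0.0]
text_embeds = [
    [1.0, 0.0, 0.0],  # stands in for "a photo of a cat"
    [0.0, 1.0, 0.0],  # stands in for "a photo of a dog"
]
labels = ["a photo of a cat", "a photo of a dog"]

probs = image_text_probs(image_embed, text_embeds)
for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.4f}")
```

With the real model, `outputs.logits_per_image` already holds the scaled similarities, so only the final softmax is needed, as in the examples in the diff.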