# SigLIP

[SigLIP](https://huggingface.co/papers/2303.15343) is a multimodal image-text model similar to [CLIP](clip). It uses separate image and text encoders to generate representations for both modalities. Unlike CLIP, SigLIP employs a pairwise sigmoid loss on image-text pairs during training. This training loss eliminates the need for a global view of all pairwise similarities between images and texts within a batch. Consequently, it enables more efficient scaling to larger batch sizes while also delivering superior performance with smaller batch sizes.
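The snippet below is a minimal sketch of this pairwise sigmoid loss, using randomly generated embeddings and illustrative values for the learned scale and bias. It is not the library's internal implementation; it only shows why no batch-wide normalization is required.

```py
import torch
import torch.nn.functional as F

batch_size, dim = 4, 768

# Randomly generated, L2-normalized image and text embeddings (stand-ins for encoder outputs)
image_embeds = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_embeds = F.normalize(torch.randn(batch_size, dim), dim=-1)

# Learned scalars in the real model; illustrative values here
logit_scale, logit_bias = torch.tensor(10.0), torch.tensor(-10.0)

# Pairwise logits for every image-text combination in the batch
logits = image_embeds @ text_embeds.T * logit_scale + logit_bias

# +1 on the diagonal (matching pairs), -1 elsewhere (non-matching pairs)
labels = 2 * torch.eye(batch_size) - 1

# Each pair contributes an independent binary term, so no softmax over the batch is needed
loss = -F.logsigmoid(labels * logits).sum() / batch_size
```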
You can find all the original SigLIP checkpoints under the [SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) collection.

> [!TIP]
> Click on the SigLIP models in the right sidebar for more examples of how to apply SigLIP to different image and text tasks.

The examples below demonstrate how to generate similarity scores between texts and images with [`Pipeline`] or the [`AutoModel`] class.

```py
import torch
from transformers import pipeline

image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]

pipeline = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-224", device=0, torch_dtype=torch.bfloat16)
pipeline(image, candidate_labels=candidate_labels)
```

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-base-patch16-224", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to int4.

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained("google/siglip-base-patch16-224", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```

## Notes

- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities, which may limit the scalability of batch size.
- When using the standalone [`SiglipTokenizer`] or [`SiglipProcessor`], make sure to pass `padding="max_length"` because that is how the model was trained.
- To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor.
- Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention implementation.

    ```py
    # pip install -U flash-attn --no-build-isolation
    import torch
    from transformers import SiglipModel

    model = SiglipModel.from_pretrained(
        "google/siglip-so400m-patch14-384",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    ```

## SiglipConfig

[[autodoc]] SiglipConfig
    - from_text_vision_configs

## SiglipTextConfig

[[autodoc]] SiglipTextConfig

## SiglipVisionConfig

[[autodoc]] SiglipVisionConfig

## SiglipTokenizer

[[autodoc]] SiglipTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## SiglipImageProcessor

[[autodoc]] SiglipImageProcessor
    - preprocess

## SiglipImageProcessorFast

[[autodoc]] SiglipImageProcessorFast
    - preprocess

## SiglipProcessor

[[autodoc]] SiglipProcessor

## SiglipModel

[[autodoc]] SiglipModel
    - forward
    - get_text_features
    - get_image_features

## SiglipTextModel

[[autodoc]] SiglipTextModel
    - forward

## SiglipVisionModel

[[autodoc]] SiglipVisionModel
    - forward

## SiglipForImageClassification

[[autodoc]] SiglipForImageClassification
    - forward
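When only the embeddings are needed (for example, for retrieval or clustering), [`~SiglipModel.get_text_features`] and [`~SiglipModel.get_image_features`] listed above return the pooled text and image representations separately. A minimal sketch; the checkpoint and prompts are chosen for illustration only.

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["This is a photo of a Pallas cat.", "This is a photo of a lion."]

# padding="max_length" because that is how the model was trained
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"])

print(image_embeds.shape, text_embeds.shape)
```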