From 2166b6b4ff09f6dd3867ab982f262f66482aa968 Mon Sep 17 00:00:00 2001
From: DongKyu Kang
Date: Sat, 21 Jun 2025 05:46:19 +0900
Subject: [PATCH] Update blip model card (#38513)

* Update docs/source/en/model_doc/blip.md

* fix(docs/source/en/model_doc/blip.md): fix redundent typo error

* fix (docs/source/en/model_doc/blip.md): modify of review contents

* fix(docs/source/en/model_doc/blip.md): modify code block

* Update blip.md

---------

Co-authored-by: devkade
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/blip.md | 77 ++++++++++++++++++++++++--------
 1 file changed, 59 insertions(+), 18 deletions(-)

diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md
index e97f7374fd1..a8d4c5a14bb 100644
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@@ -14,35 +14,76 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# BLIP
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
-</div>
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
+    </div>
+</div>
 
-## Overview
+# BLIP
 
-The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://huggingface.co/papers/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
+[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapping Language-Image Pre-training) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks, whereas most existing pretrained models excel at only one or the other. BLIP uses a captioner to generate synthetic captions and a filter to remove the noisy ones, which improves training data quality and makes more effective use of noisy web data.
 
-BLIP is a model that is able to perform various multi-modal tasks including:
-- Visual Question Answering
-- Image-Text retrieval (Image-text matching)
-- Image Captioning
 
-The abstract from the paper is the following:
+You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.
 
-*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks.
-However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.*
+> [!TIP]
+> This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
+>
+> Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different vision language tasks.
 
-![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)
+The example below demonstrates how to perform visual question answering with [`Pipeline`] or the [`AutoModel`] class.
 
-This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
-The original code can be found [here](https://github.com/salesforce/BLIP).
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="visual-question-answering",
+    model="Salesforce/blip-vqa-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+pipeline(question="What is the weather in this image?", image=url)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```python
+import requests
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
+
+processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
+model = AutoModelForVisualQuestionAnswering.from_pretrained(
+    "Salesforce/blip-vqa-base",
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+question = "What is the weather in this image?"
+inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
+
+output = model.generate(**inputs)
+processor.batch_decode(output, skip_special_tokens=True)[0]
+```
+
+</hfoption>
+</hfoptions>
 
 ## Resources
 
-- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset
+Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) to learn how to fine-tune BLIP for image captioning on a custom dataset.
 
 ## BlipConfig
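The usage examples added by this patch cover visual question answering only, while the Resources link points at image captioning. For reference, a minimal captioning inference sketch is shown below; the `Salesforce/blip-image-captioning-base` checkpoint, the GPU placement, and the test image are assumptions for illustration and are not prescribed by the patch.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

# Checkpoint and device placement are assumed here, not taken from the patch
processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base",
    torch_dtype=torch.float16,
).to("cuda")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# No text prompt -> unconditional caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```

Calling `generate` without a text prompt yields an unconditional caption; passing a text prefix through the processor (for example `text="a photo of"`) produces a conditional caption instead.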