
# BLIP
BLIP (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for both understanding and generation tasks, whereas most existing pretrained models excel at only one or the other. It bootstraps its training data by using a captioner to generate synthetic captions and a filter to remove the noisy ones. This improves training data quality and makes more effective use of messy web data.
You can find all the original BLIP checkpoints under the BLIP collection.
> [!TIP]
> This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
>
> Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different vision language tasks.
The example below demonstrates how to perform visual question answering with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="visual-question-answering",
    model="Salesforce/blip-vqa-base",
    torch_dtype=torch.float16,
    device=0
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
pipeline(question="What is the weather in this image?", image=url)
```

</hfoption>
<hfoption id="AutoModel">

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering

processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = AutoModelForVisualQuestionAnswering.from_pretrained(
    "Salesforce/blip-vqa-base",
    torch_dtype=torch.float16,
    device_map="auto"
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

question = "What is the weather in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)

output = model.generate(**inputs)
processor.batch_decode(output, skip_special_tokens=True)[0]
```

</hfoption>
</hfoptions>
## Resources

- Refer to this notebook to learn how to fine-tune BLIP for image captioning on a custom dataset (a minimal training-step sketch follows below).
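
The sketch below is a rough, hedged illustration of a single supervised fine-tuning step for captioning with [`BlipForConditionalGeneration`]. The `Salesforce/blip-image-captioning-base` checkpoint, the learning rate, and the example caption are illustrative placeholders, not values taken from the notebook.

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

# Placeholder checkpoint and hyperparameters, for illustration only.
processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
caption = "a cat walking through the snow"  # example target caption

# Encode the image and target caption; the caption token ids also serve as labels,
# so the model returns a language modeling loss.
inputs = processor(images=image, text=caption, return_tensors="pt")
outputs = model(**inputs, labels=inputs.input_ids)

outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In a real run you would iterate over a dataloader of image-caption pairs instead of a single example; the notebook covers the full workflow.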
## BlipConfig

[[autodoc]] BlipConfig
    - from_text_vision_configs

## BlipTextConfig

[[autodoc]] BlipTextConfig

## BlipVisionConfig

[[autodoc]] BlipVisionConfig

## BlipProcessor

[[autodoc]] BlipProcessor

## BlipImageProcessor

[[autodoc]] BlipImageProcessor
    - preprocess

## BlipImageProcessorFast

[[autodoc]] BlipImageProcessorFast
    - preprocess
## BlipModel

`BlipModel` is going to be deprecated in future versions. Please use [`BlipForConditionalGeneration`], [`BlipForImageTextRetrieval`] or [`BlipForQuestionAnswering`] instead, depending on your use case.

[[autodoc]] BlipModel
    - forward
    - get_text_features
    - get_image_features
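
As a point of reference for migrating off `BlipModel`, here is a minimal, hedged image-text matching sketch with [`BlipForImageTextRetrieval`]; the `Salesforce/blip-itm-base-coco` checkpoint and the output handling are assumptions for illustration.

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForImageTextRetrieval

# Assumed checkpoint, for illustration only.
processor = AutoProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat walking through the snow"

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The image-text matching head returns two logits per pair (no-match, match);
# a softmax turns them into a match probability (assumed interpretation).
probs = torch.softmax(outputs.itm_score, dim=-1)
print(probs)
```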
## BlipTextModel

[[autodoc]] BlipTextModel
    - forward

## BlipTextLMHeadModel

[[autodoc]] BlipTextLMHeadModel
    - forward

## BlipVisionModel

[[autodoc]] BlipVisionModel
    - forward

## BlipForConditionalGeneration

[[autodoc]] BlipForConditionalGeneration
    - forward

## BlipForImageTextRetrieval

[[autodoc]] BlipForImageTextRetrieval
    - forward

## BlipForQuestionAnswering

[[autodoc]] BlipForQuestionAnswering
    - forward
## TFBlipModel

[[autodoc]] TFBlipModel
    - call
    - get_text_features
    - get_image_features

## TFBlipTextModel

[[autodoc]] TFBlipTextModel
    - call

## TFBlipTextLMHeadModel

[[autodoc]] TFBlipTextLMHeadModel
    - forward

## TFBlipVisionModel

[[autodoc]] TFBlipVisionModel
    - call

## TFBlipForConditionalGeneration

[[autodoc]] TFBlipForConditionalGeneration
    - call

## TFBlipForImageTextRetrieval

[[autodoc]] TFBlipForImageTextRetrieval
    - call

## TFBlipForQuestionAnswering

[[autodoc]] TFBlipForQuestionAnswering
    - call