<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# LLaVa

## Overview

LLaVa is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multi-modal version of LLMs fine-tuned for chat / instructions.

The LLaVa model was proposed in [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) and improved in [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/pdf/2310.03744) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.

The abstract from the paper is the following:

*Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_architecture.jpg"
alt="drawing" width="600"/>

<small> LLaVa architecture. Taken from the <a href="https://arxiv.org/abs/2304.08485">original paper.</a> </small>

This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ) and [ybelkada](https://huggingface.co/ybelkada).
The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/main/llava).

## Usage tips

- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating, as shown in the sketch after these tips.

- Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.
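
A minimal sketch of setting the padding side before batched generation (the llava-1.5-7b-hf checkpoint is used here purely as an example):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Left-pad batched prompts so generation continues directly after each prompt
processor.tokenizer.padding_side = "left"
```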

### Single image inference

For best results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. To do so, you need to construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries for "text" and "image" modalities, as follows:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What’s shown in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in more details."},
        ],
    },
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
>>> "USER: <image>\n<What’s shown in this image? ASSISTANT: This image shows a red stop sign.</s>USER: Describe the image in more details. ASSISTANT:"
|
||
```
|
||
|
||
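
To complete the single-image example, here is a minimal sketch of running generation with the templated prompt above (the stop-sign image URL is just an illustrative example; any image will do):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Download an example image and pair it with the templated prompt from above
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=text_prompt, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```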

### Batched inference

LLaVa also supports batched inference. Here is how you can do it:

```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Get two different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_stop = Image.open(requests.get(url, stream=True).raw)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)

# Prepare a batch of two prompts
conversation_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
prompts = [prompt_1, prompt_2]

# We can simply feed images in the order they have to be used in the text prompt
inputs = processor(images=[image_stop, image_cats], text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
```
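
If you only want the newly generated text rather than each prompt followed by its continuation, one option is to slice off the prompt tokens before decoding (`generate` returns the prompt tokens followed by the new tokens); a small sketch:

```python
# Keep only the tokens produced after the padded prompts
generated_only = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated_only, skip_special_tokens=True))
```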

- If you want to construct a chat prompt yourself, below is a list of the prompt formats accepted by each LLaVa checkpoint:

[llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) require the following format:
```bash
"<|im_start|>user <image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant"
```

For multiple-turn conversations:

```bash
"<|im_start|>user <image>\n<prompt1><|im_end|><|im_start|>assistant <answer1><|im_end|><|im_start|>user <image>\n<prompt2><|im_end|><|im_start|>assistant "
```

[llava-1.5 models](https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0) require the following format:
```bash
"USER: <image>\n<prompt> ASSISTANT:"
```

For multiple-turn conversations:

```bash
"USER: <image>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
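
As a sketch of how a hand-written prompt in the llava-1.5 format above can be fed to the processor (the image URL and checkpoint are just examples):

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Manually constructed prompt following the llava-1.5 single-turn format
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
```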

### Using Flash Attention 2

Flash Attention 2 is a faster, optimized attention implementation; please refer to the [Flash Attention 2 section of the performance docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one).
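
As a minimal sketch (assuming `flash-attn` is installed and a supported GPU is available), the model can be loaded with Flash Attention 2 enabled via the `attn_implementation` argument:

```python
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```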

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaVa.

<PipelineTag pipeline="image-to-text"/>

- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference.
- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎

## LlavaConfig

[[autodoc]] LlavaConfig

## LlavaProcessor

[[autodoc]] LlavaProcessor

## LlavaForConditionalGeneration

[[autodoc]] LlavaForConditionalGeneration
    - forward