<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# LLaVa

## Overview

LLaVa is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multimodal version of LLMs fine-tuned for chat/instructions.

The LLaVa model was proposed in [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) and improved in [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/pdf/2310.03744) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.

The abstract from the paper is the following:

*Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_architecture.jpg"
alt="drawing" width="600"/>

<small> LLaVa architecture. Taken from the <a href="https://arxiv.org/abs/2304.08485">original paper.</a> </small>

This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ) and [ybelkada](https://huggingface.co/ybelkada).
The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/main/llava).
## Usage tips

- We advise users to use `padding_side="left"` when computing batched generation, as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating, as in the sketch below.
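For example, batched inputs can be prepared as follows. This is a minimal sketch: the COCO image URL is only a placeholder, and the same image is reused for both prompts purely for illustration.

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Left padding keeps each prompt flush against the position where new tokens are generated.
processor.tokenizer.padding_side = "left"

prompts = [
    "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "USER: <image>\nDescribe this photo. ASSISTANT:",
]
# Placeholder image, reused for both prompts for illustration only.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
images = [Image.open(requests.get(url, stream=True).raw)] * 2

inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt")
```
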
- Note that the model has not been explicitly trained to process multiple images in the same prompt. Although this is technically possible, you may experience inaccurate results.
- For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. For that you need to construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries for "text" and "image" modalities, as follows:
```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What’s shown in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in more detail."},
        ],
    },
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Note that the template simply formats your prompt; you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
>>> "USER: <image>\nWhat’s shown in this image? ASSISTANT: This image shows a red stop sign.</s>USER: Describe the image in more detail. ASSISTANT:"
```
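The formatted prompt can then be passed to the processor together with an image to run generation end to end. Continuing the example above, here is a minimal sketch; the COCO image URL is only a placeholder, and any RGB image works.

```python
import requests
import torch
from PIL import Image
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

# Placeholder image for illustration; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# `processor` and `text_prompt` come from the snippet above.
inputs = processor(images=image, text=text_prompt, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
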
- If you want to construct a chat prompt yourself, below is a list of the prompt formats accepted by each LLaVa checkpoint:
[llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) require the following format:

```bash
"<|im_start|>user <image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant"
```
For multiple-turn conversations:

```bash
"<|im_start|>user <image>\n<prompt1><|im_end|><|im_start|>assistant <answer1><|im_end|><|im_start|>user <image>\n<prompt2><|im_end|><|im_start|>assistant "
```
[llava-1.5 models](https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0) require the following format:

```bash
"USER: <image>\n<prompt> ASSISTANT:"
```
For multiple-turn conversations:

```bash
"USER: <image>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
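Such a hand-written prompt can be passed to the processor in place of the chat-template output. A short sketch, reusing the `processor` and `image` objects from the snippets above:

```python
# Manually formatted llava-1.5 prompt, equivalent to what the chat template produces.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
# `processor` and `image` are reused from the snippets above.
inputs = processor(images=image, text=prompt, return_tensors="pt")
```
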
### Using Flash Attention 2

Flash Attention 2 is a faster, optimized attention implementation; please refer to the [Flash Attention 2 section of the performance docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one) for details.
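As an illustrative sketch, assuming the `flash-attn` package is installed and a supported GPU is available, the model can be loaded with Flash Attention 2 enabled via the `attn_implementation` argument:

```python
import torch
from transformers import LlavaForConditionalGeneration

# Requires the `flash-attn` package and a compatible GPU.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```
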
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaVa.
<PipelineTag pipeline="image-to-text"/>

- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run LLaVa on a free-tier Google Colab instance leveraging 4-bit inference.
- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎
## LlavaConfig

[[autodoc]] LlavaConfig

## LlavaProcessor

[[autodoc]] LlavaProcessor

## LlavaForConditionalGeneration

[[autodoc]] LlavaForConditionalGeneration
    - forward