
PyTorch FlashAttention Multimodal

# LLaVA-NeXT

LLaVA-NeXT is the latest model in the LLaVA family of multimodal models. It supports high-resolution inputs across several aspect ratios (up to 672 × 672, 336 × 1344, or 1344 × 336), improved OCR, stronger reasoning, and efficient deployment, while keeping a minimalist design and training on fewer than 1M visual instruction tuning samples.

You can find all the original LLaVA-NeXT checkpoints under the LLaVA-NeXT collection on Hugging Face.

> [!TIP]
> Click on the LLaVA-NeXT models in the right sidebar for more examples of OCR, visual question answering, document understanding, and multi-image reasoning.

The examples below demonstrate how to generate text from an image with [`Pipeline`] or by calling [`LlavaNextForConditionalGeneration`] directly.

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf", device="cuda")

# pass the image and question as chat messages; the pipeline applies the checkpoint's chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=100, return_full_text=False)
print(result[0]["generated_text"])
```

To run the same task without a pipeline, load the weights with [`LlavaNextForConditionalGeneration`] and build the prompt with [`LlavaNextProcessor`].

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests, torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
).to("cuda")

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png", stream=True).raw)

# apply_chat_template inserts the checkpoint-specific prompt syntax and the <image> placeholder
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Quantization reduces memory usage by representing weights in lower precision. Use bitsandbytes 4-bit quantization as shown below:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForImageTextToText.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quant_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
```
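
The quantized model can be used as a drop-in replacement for the full-precision one. A minimal sketch of generation, assuming the `image` and `conversation` variables from the earlier example are still defined:

```python
# build the prompt with the chat template and run generation on the 4-bit model
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```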

Use `AttentionMaskVisualizer` to explore which tokens the model attends to:

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

viz = AttentionMaskVisualizer("llava-hf/llava-v1.6-mistral-7b-hf")
viz("<image> What is shown in this image?")
```

## Notes

- **Prompt format matters.** Different checkpoints (Mistral, Vicuna, LLaMA, LLaMA3/OpenChat formats) require specific prompt syntax. Always build prompts with `processor.apply_chat_template`.
- **Multi-image support.** You can pass multiple images in a single conversation; just keep the `<image>` tokens aligned with the order of the images (see the sketch after this list).
- **Batching tip.** Set `processor.tokenizer.padding_side = "left"` for best batched generation results, as shown in the sketch below.
- Check out the official blog post for detailed benchmarks showing LLaVA-NeXT outperforming Gemini Pro on MM-Vet, TextVQA, ScienceQA, and more.
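
The sketch below is a minimal, non-authoritative example combining the multi-image and batching notes above. It assumes the `model`, `processor`, and `image` objects from the earlier generation example are still in scope, and it reuses the same image several times purely as a stand-in for distinct inputs.

```python
# left padding gives the best results for batched generation
processor.tokenizer.padding_side = "left"

# the first prompt references two images, the second references one;
# each {"type": "image"} placeholder consumes the next image from the flat list passed below
conversation_1 = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What is the difference between these two images?"},
    ]}
]
conversation_2 = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}
]

prompts = [
    processor.apply_chat_template(conversation_1, add_generation_prompt=True),
    processor.apply_chat_template(conversation_2, add_generation_prompt=True),
]

# images are matched to <image> tokens in order (the same image is repeated here only for illustration)
inputs = processor(images=[image, image, image], text=prompts, padding=True, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```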

## LlavaNextConfig

[[autodoc]] LlavaNextConfig

## LlavaNextImageProcessor

[[autodoc]] LlavaNextImageProcessor
    - preprocess

## LlavaNextImageProcessorFast

[[autodoc]] LlavaNextImageProcessorFast
    - preprocess

## LlavaNextProcessor

[[autodoc]] LlavaNextProcessor

## LlavaNextModel

[[autodoc]] LlavaNextModel

## LlavaNextForConditionalGeneration

[[autodoc]] LlavaNextForConditionalGeneration
    - forward