
PyTorch FlashAttention Multimodal

# LLaVA-NeXT

LLaVA-NeXT is the latest model in the LLaVA family of multimodal models. It supports high-resolution inputs across several aspect ratios (up to 672 × 672, 336 × 1344, or 1344 × 336), improved OCR, stronger reasoning, and efficient deployment, while keeping a minimalist design and training on fewer than 1M visual instruction tuning samples.

You can find all the original LLaVA-NeXT checkpoints under the LLaVA-NeXT collection on Hugging Face.

> [!TIP]
> Click on the LLaVA-NeXT models in the right sidebar for more examples of OCR, visual question answering, document understanding, and multi-image reasoning.

The examples below demonstrate how to generate text from an image with [`Pipeline`] or by calling [`LlavaNextForConditionalGeneration`] directly.

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf", device="cuda")

# pass the image and question as chat messages; the pipeline applies the checkpoint's chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=100, return_full_text=False)
print(result[0]["generated_text"])
```

To run the same task without a pipeline, load the weights with [`LlavaNextForConditionalGeneration`] and build the prompt with [`LlavaNextProcessor`].

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests, torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
).to("cuda")

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png", stream=True).raw)

# apply_chat_template inserts the checkpoint-specific prompt syntax and the <image> placeholder
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Quantization reduces memory usage by representing weights in lower precision. Use bitsandbytes 4-bit quantization as shown below:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForImageTextToText.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quant_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
```
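
The quantized model can be used as a drop-in replacement for the full-precision one. A minimal sketch of generation, assuming the `image` and `conversation` variables from the earlier example are still defined:

```python
# build the prompt with the chat template and run generation on the 4-bit model
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```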

Use `AttentionMaskVisualizer` to explore which tokens the model attends to:

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

viz = AttentionMaskVisualizer("llava-hf/llava-v1.6-mistral-7b-hf")
viz("<image> What is shown in this image?")
```

## Notes

- **Prompt format matters.** Different checkpoints (Mistral, Vicuna, LLaMA, LLaMA3/OpenChat formats) require specific prompt syntax. Always build prompts with `processor.apply_chat_template`.
- **Multi-image support.** You can pass multiple images in a single conversation; just keep the `<image>` tokens aligned with the order of the images (see the sketch after this list).
- **Batching tip.** Set `processor.tokenizer.padding_side = "left"` for best batched generation results, as shown in the sketch below.
- Check out the official blog post for detailed benchmarks showing LLaVA-NeXT outperforming Gemini Pro on MM-Vet, TextVQA, ScienceQA, and more.
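
The sketch below is a minimal, non-authoritative example combining the multi-image and batching notes above. It assumes the `model`, `processor`, and `image` objects from the earlier generation example are still in scope, and it reuses the same image several times purely as a stand-in for distinct inputs.

```python
# left padding gives the best results for batched generation
processor.tokenizer.padding_side = "left"

# the first prompt references two images, the second references one;
# each {"type": "image"} placeholder consumes the next image from the flat list passed below
conversation_1 = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What is the difference between these two images?"},
    ]}
]
conversation_2 = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}
]

prompts = [
    processor.apply_chat_template(conversation_1, add_generation_prompt=True),
    processor.apply_chat_template(conversation_2, add_generation_prompt=True),
]

# images are matched to <image> tokens in order (the same image is repeated here only for illustration)
inputs = processor(images=[image, image, image], text=prompts, padding=True, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```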

## LlavaNextConfig

[[autodoc]] LlavaNextConfig

## LlavaNextImageProcessor

[[autodoc]] LlavaNextImageProcessor
    - preprocess

## LlavaNextImageProcessorFast

[[autodoc]] LlavaNextImageProcessorFast
    - preprocess

## LlavaNextProcessor

[[autodoc]] LlavaNextProcessor

## LlavaNextModel

[[autodoc]] LlavaNextModel

## LlavaNextForConditionalGeneration

[[autodoc]] LlavaNextForConditionalGeneration
    - forward