LLaVA-NeXT
LLaVA‑NeXT is the latest version of the LLaVA family of multimodal models. It supports high-resolution inputs (up to 1344 × 336 pixels), improved OCR, richer visual reasoning, and efficient deployment, while keeping a minimalist design and training on under 1M visual instruction samples.
You can find all the original LLaVA‑NeXT checkpoints under the LLaVA-NeXT collection on Hugging Face.
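The set of high-resolution grids a checkpoint supports is stored on its config as image_grid_pinpoints; a quick way to inspect it (a minimal sketch that only downloads the config file):

from transformers import LlavaNextConfig

config = LlavaNextConfig.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# Each entry is a [height, width] resolution the input image may be resized and tiled into
print(config.image_grid_pinpoints)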
Tip
Click on the LLaVA‑NeXT models in the right sidebar for more examples on OCR, visual question answering, document understanding, and multi-image reasoning.
The example below demonstrates how to generate text based on an image with [pipeline] or the [AutoModel] class.
from transformers import pipeline
from PIL import Image
import requests

# Image-to-text pipeline backed by the LLaVA-NeXT Mistral-7B checkpoint
pipe = pipeline("image-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf", device="cuda")

image = Image.open(requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png", stream=True).raw)

# The Mistral checkpoint uses the [INST] ... [/INST] chat syntax, and the prompt must
# contain the <image> placeholder so the pipeline knows where the image goes
prompt = "[INST] <image>\nWhat does this chart show? [/INST]"
result = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(result[0]["generated_text"])
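Recent Transformers releases also expose an image-text-to-text pipeline that accepts chat-style messages directly; the sketch below assumes that task and message schema are available in your installed version:

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf", device="cuda")
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"},
        {"type": "text", "text": "What does this chart show?"},
    ]}
]
result = pipe(text=messages, max_new_tokens=100, return_full_text=False)
print(result[0]["generated_text"])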
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests, torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
).to("cuda")

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png", stream=True).raw)

# Build the prompt with the checkpoint's chat template; the {"type": "image"} entry
# places the <image> placeholder in the correct position for this checkpoint
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Pass images and text as keyword arguments to avoid relying on positional order
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
Quantization reduces memory usage by representing weights in lower precision. Use BitsAndBytes 4-bit quantization like this:
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization with float16 compute
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForImageTextToText.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quant_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
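The quantized model is used the same way as the full-precision one; a minimal sketch, reusing the image loaded in the earlier example:

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# device_map="auto" places the weights for you; send the inputs to the model's device
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))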
Use the AttentionMaskVisualizer to explore which tokens the model attends to:
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
viz = AttentionMaskVisualizer("llava-hf/llava-v1.6-mistral-7b-hf")
viz("<image> What is shown in this image?")

Notes
- Prompt format matters: different checkpoints (Mistral, Vicuna, LLaMA‑3, and others) expect different chat syntax, so always build prompts with processor.apply_chat_template.
- Multi-image support: you can pass several images per prompt; just make sure the number and order of <image> tokens matches the images (see the sketches after this list).
- Batching tip: set processor.tokenizer.padding_side = "left" for best batched generation results (see the sketches after this list).
- Check out the official blog post for detailed benchmarks showing LLaVA‑NeXT outperforming Gemini Pro on MM‑Vet, TextVQA, ScienceQA, and more.
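For multi-image prompts, the sketch below reuses the processor and model from the earlier example and assumes two PIL images named image1 and image2; the two {"type": "image"} entries must line up with the order of the images passed to the processor:

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What differs between these two charts?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=[image1, image2], text=prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))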
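For batched generation with left padding, a minimal sketch that again reuses the processor, model, and image from above (the second question is only illustrative):

processor.tokenizer.padding_side = "left"  # pad on the left so generation continues from real tokens

conversations = [
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What does this chart show?"}]}],
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Read out any numbers in the chart."}]}],
]
prompts = [processor.apply_chat_template(c, add_generation_prompt=True) for c in conversations]
inputs = processor(images=[image, image], text=prompts, padding=True, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True))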
LlavaNextConfig
autodoc LlavaNextConfig
LlavaNextImageProcessor
autodoc LlavaNextImageProcessor - preprocess
LlavaNextImageProcessorFast
autodoc LlavaNextImageProcessorFast - preprocess
LlavaNextProcessor
autodoc LlavaNextProcessor
LlavaNextModel
autodoc LlavaNextModel
LlavaNextForConditionalGeneration
autodoc LlavaNextForConditionalGeneration - forward