# LLaVA-OneVision ## Overview The LLaVA-OneVision model was proposed in [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) by LLaVA-OneVision architecture. Taken from the original paper. Tips: - We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating. - Llava-OneVision uses different number of patches for images and thus has to pad the inputs inside modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if model is in `eval()` mode, otherwise "right-padding". - Note that the model should use a specific prompt format, on which the large language model (LLM) was trained. You can use the processor's `apply_chat_template` to format your prompts correctly. For that you have to construct a conversation history, passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities. We will use [llava-onevision-qwen2-7b-si-hf](https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-si-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows: ```python from transformers import AutoProcessor processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-si-hf") conversation = [ { "role": "user", "content": [ {"type": "image"}, {"type": "text", "text": "What’s shown in this image?"}, ], }, { "role": "assistant", "content": [{"type": "text", "text": "This image shows a red stop sign."},] }, { "role": "user", "content": [ {"type": "text", "text": "Describe the image in more details."}, ], }, ] text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) # Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images print(text_prompt) >>> "<|im_start|>user\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\nPage showing the list of options.<|im_end|>" ``` This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main). ## Usage example ### Single image inference Here's how to load the model and perform inference in half-precision (`torch.float16`): ```python from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration import torch from PIL import Image import requests processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf") model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True) model.to("cuda:0") # prepare image and text prompt, using the appropriate prompt template url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true" image = Image.open(requests.get(url, stream=True).raw) conversation = [ { "role": "user", "content": [ {"type": "image"}, {"type": "text", "text": "What is shown in this image?"}, ], }, ] prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0", torch.float16) # autoregressively complete prompt output = model.generate(**inputs, max_new_tokens=100) print(processor.decode(output[0], skip_special_tokens=True)) 'user\n\nWhat is shown in this image?\nassistant\nThe image shows a radar chart, also known as a spider chart or a star chart, which is used to compare multiple quantitative variables. Each axis represents a different variable, and the chart is filled with' ``` ### Multi image inference LLaVa-OneVision can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). For that you have to use checkpoints with an "ov" suffix. Here is how you can do it: ```python import requests from PIL import Image import torch from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration # Load the model in half-precision model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto") processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf") # Get three different images url = "https://www.ilankelman.org/stopsigns/australia.jpg" image_stop = Image.open(requests.get(url, stream=True).raw) url = "http://images.cocodataset.org/val2017/000000039769.jpg" image_cats = Image.open(requests.get(url, stream=True).raw) url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg" image_snowman = Image.open(requests.get(url, stream=True).raw) # Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not conversation_1 = [ { "role": "user", "content": [ {"type": "image"}, {"type": "text", "text": "What is shown in this image?"}, ], }, { "role": "assistant", "content": [ {"type": "text", "text": "There is a red stop sign in the image."}, ], }, { "role": "user", "content": [ {"type": "image"}, {"type": "text", "text": "What about this image? How many cats do you see?"}, ], }, ] conversation_2 = [ { "role": "user", "content": [ {"type": "image"}, {"type": "text", "text": "What is shown in this image?"}, ], }, ] prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True) prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True) prompts = [prompt_1, prompt_2] # We can simply feed images in the order they have to be used in the text prompt inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16) # Generate generate_ids = model.generate(**inputs, max_new_tokens=30) processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False) ['user\n\nWhat is shown in this image?\nassistant\nThere is a red stop sign in the image.\nuser\n\nWhat about this image? How many cats do you see?\nassistant\ntwo', 'user\n\nWhat is shown in this image?\nassistant\n'] ``` ### Video inference LLaVa-OneVision also can perform inference with videos as input, where video frames are treated as multiple images. Here is how you can do it: ```python import av import numpy as np from huggingface_hub import hf_hub_download import torch from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration # Load the model in half-precision model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto") processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf") def read_video_pyav(container, indices): ''' Decode the video with PyAV decoder. Args: container (`av.container.input.InputContainer`): PyAV container. indices (`List[int]`): List of frame indices to decode. Returns: result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3). ''' frames = [] container.seek(0) start_index = indices[0] end_index = indices[-1] for i, frame in enumerate(container.decode(video=0)): if i > end_index: break if i >= start_index and i in indices: frames.append(frame) return np.stack([x.to_ndarray(format="rgb24") for x in frames]) # Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos, up to 32 frames) video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset") container = av.open(video_path) total_frames = container.streams.video[0].frames indices = np.arange(0, total_frames, total_frames / 8).astype(int) video = read_video_pyav(container, indices) # For videos we have to feed a "video" type instead of "image" conversation = [ { "role": "user", "content": [ {"type": "video"}, {"type": "text", "text": "Why is this video funny?"}, ], }, ] prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) inputs = processor(videos=list(video), text=prompt, return_tensors="pt").to("cuda:0", torch.float16) out = model.generate(**inputs, max_new_tokens=60) processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True) ["user\n\nWhy is this video funny?\nassistant\nThe video appears to be humorous because it shows a young child, who is wearing glasses and holding a book, seemingly reading with a serious and focused expression. The child's glasses are a bit oversized for their face, which adds a comical touch, as it's a common trope to see children wearing"] ``` ## Model optimization ### Quantization using bitsandbytes The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a GPU/accelerator that is supported by the library. bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links. Simply change the snippet above with: ```python from transformers import LlavaOnevisionForConditionalGeneration, BitsAndBytesConfig # specify how to quantize the model quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16, ) model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto") ``` ### Use Flash-Attention 2 to further speed-up generation First make sure to install flash-attn. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) regarding that package installation. Simply change the snippet above with: ```python from transformers import LlavaOnevisionForConditionalGeneration model = LlavaOnevisionForConditionalGeneration.from_pretrained( model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True, use_flash_attention_2=True ).to(0) ``` ## LlavaOnevisionConfig [[autodoc]] LlavaOnevisionConfig ## LlavaOnevisionProcessor [[autodoc]] LlavaOnevisionProcessor ## LlavaOnevisionImageProcessor [[autodoc]] LlavaOnevisionImageProcessor ## LlavaOnevisionVideoProcessor [[autodoc]] LlavaOnevisionVideoProcessor ## LlavaOnevisionForConditionalGeneration [[autodoc]] LlavaOnevisionForConditionalGeneration - forward