[docs] Updated llava_next model card

2025-07-03 12:50:06 +06:00 · 2025-06-29 22:48:08 -04:00 · 2025-06-29 22:48:08 -04:00 · 124afd759a
commit 124afd759a
parent e088b9c7e7
1 changed file with 42 additions and 5 deletions
--- a/docs/source/en/model_doc/llava_next.md
+++ b/docs/source/en/model_doc/llava_next.md
@@ -29,8 +29,6 @@ rendered properly in your Markdown viewer.
 You can find all the original LLaVA‑NeXT checkpoints under the [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) collection.

 > [!TIP]
-> This model was contributed by [nielsr](https://huggingface.co/nielsr).
->
 > Click on the LLaVA‑NeXT models in the right sidebar for more examples of how to apply LLaVA‑NeXT to different multimodal tasks.

 The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
@@ -121,10 +119,49 @@ viz("<image> What is shown in this image?")
 ## Notes

 * Different checkpoints (Mistral, Vicuna, etc.) require a specific prompt format depending on the underlying LLM. Always use [`~ProcessorMixin.apply_chat_template`] to ensure correct formatting. Refer to the [Templates](../chat_templating) guide for more details.
- The example below demonstrates inference with multiple input images.

-   ```py
-   add code snippet here
+* Set `padding_side="left"` during batched generation for more accurate results.
+
+```py
+processor.tokenizer.padding_side = "left"
+```
+
+* LLaVA-NeXT uses a different number of patches for each image, so it pads the inputs inside the modeling code unless padding was already applied during processing. The default is *left-padding* when the model is in `eval()` mode and *right-padding* otherwise.
+
+* In Transformers releases after v4.46, LLaVA models raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR on the checkpoint if you don't.
+
+  With these attributes set, LLaVA tries to infer the number of image tokens required per image and expands the text with the same number of `<image>` token placeholders. There are usually ~500 tokens per image, so make sure the text is not truncated; otherwise merging the embeddings fails. The values can be found in `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`.
+
+  Set `num_additional_image_tokens` to `1` if the vision backbone adds a `CLS` token, or to `0` if nothing extra is added.
+
+* The example below demonstrates inference with multiple input images.
+
+```python
+import requests
+import torch
+from PIL import Image
+from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
+
+processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
+).to("cuda")
+
+# Load multiple images
+url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"
+url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_comparison.png"
+
+image1 = Image.open(requests.get(url1, stream=True).raw)
+image2 = Image.open(requests.get(url2, stream=True).raw)
+
+conversation = [
+    {"role": "user", "content": [{"type": "image"}, {"type": "image"}, {"type": "text", "text": "Compare these two images and describe the differences."}]}
+]
+prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
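+# Pass images and text by keyword so the call does not depend on positional argument order
+inputs = processor(images=[image1, image2], text=prompt, return_tensors="pt").to("cuda")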
+
+output = model.generate(**inputs, max_new_tokens=100)
+print(processor.decode(output[0], skip_special_tokens=True))
+```
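
The note about processor attributes can be illustrated with a small, self-contained sketch. It is not the library's implementation: `sync_processor_attributes` and `estimate_tokens_per_tile` are hypothetical helpers, the `SimpleNamespace` objects stand in for a real config and processor, and the values `14` and `336` are illustrative assumptions. The estimate covers a single tile only. LLaVA-NeXT also splits high-resolution images into several tiles, so real placeholder counts are typically higher.

```python
from types import SimpleNamespace


def sync_processor_attributes(processor, config, num_additional_image_tokens):
    """Copy the image-token settings named in the note onto a processor.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    processor.patch_size = config.vision_config.patch_size
    processor.vision_feature_select_strategy = config.vision_feature_select_strategy
    # 1 if the vision backbone adds a CLS token, 0 if nothing extra is added.
    processor.num_additional_image_tokens = num_additional_image_tokens
    return processor


def estimate_tokens_per_tile(height, width, patch_size, num_additional_image_tokens, strategy):
    """Rough placeholder count for one image tile (multi-tile splitting is ignored)."""
    tokens = (height // patch_size) * (width // patch_size) + num_additional_image_tokens
    if strategy == "default":
        # Assumption: the "default" strategy drops the CLS token again.
        tokens -= 1
    return tokens


# Stand-ins for a real model config and processor (illustrative values).
config = SimpleNamespace(
    vision_config=SimpleNamespace(patch_size=14),
    vision_feature_select_strategy="default",
)
processor = sync_processor_attributes(SimpleNamespace(), config, num_additional_image_tokens=1)

tokens = estimate_tokens_per_tile(
    336,
    336,
    processor.patch_size,
    processor.num_additional_image_tokens,
    processor.vision_feature_select_strategy,
)
print(tokens)
```

With a real checkpoint, the same three assignments would be made on the loaded processor using values read from `model.config`, and the estimate gives a lower bound on how much room the expanded prompt needs before any truncation limit.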


 ## LlavaNextConfig