[docs] Updated llava_next model card
This commit is contained in:
parent
e088b9c7e7
commit
124afd759a
@@ -29,8 +29,6 @@ rendered properly in your Markdown viewer.
You can find all the original LLaVA-NeXT checkpoints under the [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) collection.
> [!TIP]
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
>
> Click on the LLaVA-NeXT models in the right sidebar for more examples of how to apply LLaVA-NeXT to different multimodal tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
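A minimal [`Pipeline`] sketch is shown here; the `image-text-to-text` task name, the Mistral checkpoint, and the generation settings are assumptions for illustration rather than the exact snippet from this model card.

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# the pipeline applies the checkpoint's chat template and fetches the image from the URL
out = pipe(text=messages, max_new_tokens=100)
print(out[0]["generated_text"])
```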
@@ -121,10 +119,49 @@ viz("<image> What is shown in this image?")
## Notes
* Different checkpoints (Mistral, Vicuna, etc.) require a specific prompt format depending on the underlying LLM. Always use [`~ProcessorMixin.apply_chat_template`] to ensure correct formatting. Refer to the [Templates](../chat_templating) guide for more details.
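
  For example, a minimal sketch (the Mistral-based checkpoint is only an assumption for illustration):

  ```py
  from transformers import LlavaNextProcessor

  processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

  conversation = [
      {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]}
  ]
  # the chat template expands this into the prompt format the underlying LLM expects
  prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
  print(prompt)
  ```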
* Set `padding_side="left"` during batched generation for more accurate results.

  ```py
  # pad prompts on the left so batched generation lines up correctly
  processor.tokenizer.padding_side = "left"
  ```
* LLaVA-NeXT uses a different number of patches per image, so it pads the inputs inside the modeling code in addition to the padding done during processing. The default is *left-padding* if the model is in `eval()` mode, otherwise *right-padding*.
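
  A tiny sketch of the practical upshot, assuming `model` and `inputs` were prepared as in the examples above:

  ```py
  # run generation in eval() mode so the internal patch padding defaults to left-padding
  model.eval()
  with torch.no_grad():
      output = model.generate(**inputs, max_new_tokens=50)
  ```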
* LLaVA models after v4.46 raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if you don't.

  Adding these attributes means LLaVA will try to infer the number of image tokens required per image and expand the text with the same number of `<image>` token placeholders. There are usually ~500 tokens per image, so make sure the text is not truncated, because that causes a failure when merging the embeddings. The attributes can be found in `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`.

  The `num_additional_image_tokens` should be `1` if the vision backbone adds a `CLS` token, or `0` if nothing extra is added.
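
  If you see these warnings, here is a minimal sketch of back-filling the attributes from the model config (the value `1` below assumes a vision backbone that adds a `CLS` token, as noted above):

  ```py
  from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

  model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # any LLaVA-NeXT checkpoint you own
  processor = LlavaNextProcessor.from_pretrained(model_id)
  model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

  # copy the values the warning asks for from the model config onto the processor
  processor.patch_size = model.config.vision_config.patch_size
  processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
  processor.num_additional_image_tokens = 1  # assumption: the CLIP vision backbone adds a CLS token
  ```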
* The example below demonstrates inference with multiple input images.

  ```python
  from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
  from PIL import Image
  import requests, torch

  processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
  model = LlavaNextForConditionalGeneration.from_pretrained(
      "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
  ).to("cuda")

  # Load multiple images
  url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"
  url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_comparison.png"

  image1 = Image.open(requests.get(url1, stream=True).raw)
  image2 = Image.open(requests.get(url2, stream=True).raw)

  conversation = [
      {"role": "user", "content": [{"type": "image"}, {"type": "image"}, {"type": "text", "text": "Compare these two images and describe the differences."}]}
  ]
  prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
  inputs = processor([image1, image2], prompt, return_tensors="pt").to("cuda")

  output = model.generate(**inputs, max_new_tokens=100)
  print(processor.decode(output[0], skip_special_tokens=True))
  ```

## LlavaNextConfig