[docs] Updated llava_next model card
This commit is contained in: parent e088b9c7e7, commit 124afd759a

@@ -29,8 +29,6 @@ rendered properly in your Markdown viewer.

You can find all the original LLaVA-NeXT checkpoints under the [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) collection.

> [!TIP]
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
>
> Click on the LLaVA-NeXT models in the right sidebar for more examples of how to apply LLaVA-NeXT to different multimodal tasks.

The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
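
A minimal sketch of the [`Pipeline`] path, assuming the `image-text-to-text` pipeline task and a sample image URL from the `documentation-images` dataset; adjust the model ID and prompt to your use case.

```py
from transformers import pipeline

# image-text-to-text covers multimodal, chat-style generation
pipe = pipeline("image-text-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
out = pipe(text=messages, max_new_tokens=80)
print(out[0]["generated_text"])
```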

@@ -121,10 +119,49 @@ viz("<image> What is shown in this image?")

## Notes

* Different checkpoints (Mistral, Vicuna, etc.) require a specific prompt format depending on the underlying LLM. Always use [`~ProcessorMixin.apply_chat_template`] to ensure correct formatting, as in the sketch below. Refer to the [Templates](../chat_templating) guide for more details.
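
A minimal sketch, assuming the Mistral-based checkpoint `llava-hf/llava-v1.6-mistral-7b-hf`; the printed prompt shows the checkpoint-specific format that the chat template inserts.

```py
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]}
]
# add_generation_prompt=True appends the tokens that cue the model to answer
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)  # "[INST] <image>\nWhat is shown in this image? [/INST]" for this checkpoint
```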

* Set `padding_side="left"` during batched generation for more accurate results.

```py
processor.tokenizer.padding_side = "left"
```

* LLaVA-NeXT uses a different number of patches for each image and therefore pads the inputs inside the modeling code, in addition to the padding done during processing. The default is *left-padding* if the model is in `eval()` mode, otherwise *right-padding*.

* LLaVA models after v4.46 raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if you don't.

Adding these attributes means LLaVA will try to infer the number of image tokens required per image and expand the text with the same number of `<image>` token placeholders. There are usually ~500 tokens per image, so make sure the text is not truncated, as truncation causes a failure when merging the embeddings. The attribute values can be read from `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`.

The `num_additional_image_tokens` should be `1` if the vision backbone adds a `CLS` token, or `0` if nothing extra is added.
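
A minimal sketch of setting these attributes, reading the values from the checkpoint's config as described above rather than hardcoding them:

```py
from transformers import AutoConfig, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
config = AutoConfig.from_pretrained(model_id)
processor = LlavaNextProcessor.from_pretrained(model_id)

processor.patch_size = config.vision_config.patch_size  # 14 for this checkpoint
processor.vision_feature_select_strategy = config.vision_feature_select_strategy  # "default" here
processor.num_additional_image_tokens = 1  # 1 because the CLIP vision backbone adds a CLS token
```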

* The example below demonstrates inference with multiple input images.

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests, torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
).to("cuda")

# Load multiple images
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_comparison.png"

image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "image"}, {"type": "text", "text": "Compare these two images and describe the differences."}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor([image1, image2], prompt, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

## LlavaNextConfig