[docs] Updated llava_next model card
This commit is contained in:
parent
e088b9c7e7
commit
124afd759a
@@ -29,8 +29,6 @@ rendered properly in your Markdown viewer.
You can find all the original LLaVA-NeXT checkpoints under the [LLaVA-NeXT](https://huggingface.co/collections/llava-hf/llava-next-65f75c4afac77fd37dbbe6cf) collection.
> [!TIP]
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
>
> Click on the LLaVA-NeXT models in the right sidebar for more examples of how to apply LLaVA-NeXT to different multimodal tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
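A minimal [`Pipeline`] sketch is shown here; the `image-text-to-text` task name, the Mistral checkpoint, and the generation settings are assumptions for illustration rather than the exact snippet from this model card.

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# the pipeline applies the checkpoint's chat template and fetches the image from the URL
out = pipe(text=messages, max_new_tokens=100)
print(out[0]["generated_text"])
```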
@@ -121,10 +119,49 @@ viz("<image> What is shown in this image?")
## Notes
* Different checkpoints (Mistral, Vicuna, etc.) require a specific prompt format depending on the underlying LLM. Always use [`~ProcessorMixin.apply_chat_template`] to ensure correct formatting. Refer to the [Templates](../chat_templating) guide for more details.
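
  For example, a minimal sketch (the Mistral-based checkpoint is only an assumption for illustration):

  ```py
  from transformers import LlavaNextProcessor

  processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

  conversation = [
      {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]}
  ]
  # the chat template expands this into the prompt format the underlying LLM expects
  prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
  print(prompt)
  ```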
* Set `padding_side="left"` during batched generation for more accurate results.

  ```py
  # pad prompts on the left so batched generation lines up correctly
  processor.tokenizer.padding_side = "left"
  ```
* LLaVA-NeXT uses a different number of patches per image, so it pads the inputs inside the modeling code in addition to the padding done during processing. The default is *left-padding* if the model is in `eval()` mode, otherwise *right-padding*.
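
  A tiny sketch of the practical upshot, assuming `model` and `inputs` were prepared as in the examples above:

  ```py
  # run generation in eval() mode so the internal patch padding defaults to left-padding
  model.eval()
  with torch.no_grad():
      output = model.generate(**inputs, max_new_tokens=50)
  ```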
* LLaVA models after v4.46 raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if you don't.

  Adding these attributes means LLaVA will try to infer the number of image tokens required per image and expand the text with the same number of `<image>` token placeholders. There are usually ~500 tokens per image, so make sure the text is not truncated, because that causes a failure when merging the embeddings. The attributes can be found in `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`.

  The `num_additional_image_tokens` should be `1` if the vision backbone adds a `CLS` token, or `0` if nothing extra is added.
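
  If you see these warnings, here is a minimal sketch of back-filling the attributes from the model config (the value `1` below assumes a vision backbone that adds a `CLS` token, as noted above):

  ```py
  from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

  model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # any LLaVA-NeXT checkpoint you own
  processor = LlavaNextProcessor.from_pretrained(model_id)
  model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

  # copy the values the warning asks for from the model config onto the processor
  processor.patch_size = model.config.vision_config.patch_size
  processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
  processor.num_additional_image_tokens = 1  # assumption: the CLIP vision backbone adds a CLS token
  ```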
* The example below demonstrates inference with multiple input images.

  ```python
  from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
  from PIL import Image
  import requests, torch

  processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
  model = LlavaNextForConditionalGeneration.from_pretrained(
      "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
  ).to("cuda")

  # Load multiple images
  url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_ocr.png"
  url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_comparison.png"

  image1 = Image.open(requests.get(url1, stream=True).raw)
  image2 = Image.open(requests.get(url2, stream=True).raw)

  conversation = [
      {"role": "user", "content": [{"type": "image"}, {"type": "image"}, {"type": "text", "text": "Compare these two images and describe the differences."}]}
  ]
  prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
  inputs = processor([image1, image2], prompt, return_tensors="pt").to("cuda")

  output = model.generate(**inputs, max_new_tokens=100)
  print(processor.decode(output[0], skip_special_tokens=True))
  ```

## LlavaNextConfig