Pixtral
Overview
The Pixtral model was released by the Mistral AI team, with a version of the code available in vLLM.
Tips:
- Pixtral is a multimodal model, taking images and text as input, and producing text as output.
- This model follows the Llava family, meaning image embeddings are inserted in place of the [IMG] token placeholders. The model uses [PixtralVisionModel] for its vision encoder, and [MistralForCausalLM] for its language decoder.
- The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).
- The format for one or multiple prompts is the following:
"<s>[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
Then, the processor will replace each [IMG] token with a number of [IMG] tokens that depends on the height and the width of the image. Each row of the image is separated by an [IMG_BREAK] token, and each image is terminated by an [IMG_END] token.
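The prompt format and placeholder expansion described above can be sketched in plain Python. This is an illustration only, not the processor's actual implementation: the helper names `build_prompt`, `expand_image`, and `expand_prompt` are hypothetical, the patch size of 16 is an assumed value, and the sketch assumes the number of [IMG] tokens per row and the number of rows are the image width and height divided by the patch size, rounded up.

```python
import math

IMG, IMG_BREAK, IMG_END = "[IMG]", "[IMG_BREAK]", "[IMG_END]"


def build_prompt(question: str, num_images: int) -> str:
    """Build a prompt with one [IMG] placeholder per image (hypothetical helper)."""
    return f"<s>[INST]{IMG * num_images}\n{question}[/INST]"


def expand_image(height: int, width: int, patch_size: int = 16) -> str:
    """Return the token string standing in for one image.

    patch_size=16 and the round-up rule are assumptions made for illustration.
    """
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    row = IMG * cols
    # Rows are separated by [IMG_BREAK]; the image is closed by [IMG_END].
    return IMG_BREAK.join([row] * rows) + IMG_END


def expand_prompt(prompt: str, image_sizes, patch_size: int = 16) -> str:
    """Replace each [IMG] placeholder with the expansion for the matching image."""
    parts = prompt.split(IMG)
    if len(parts) - 1 != len(image_sizes):
        raise ValueError("need exactly one (height, width) pair per [IMG] placeholder")
    expanded = parts[0]
    for (height, width), tail in zip(image_sizes, parts[1:]):
        expanded += expand_image(height, width, patch_size) + tail
    return expanded


prompt = build_prompt("What is shown?", num_images=1)
expanded = expand_prompt(prompt, image_sizes=[(32, 48)])
print(expanded)
```

With a 32x48 image and the assumed patch size, the single placeholder becomes two rows of three [IMG] tokens. In practice the processor performs this expansion itself, so user prompts only need one [IMG] per image.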
This model was contributed by amyeroberts and ArthurZ. The original code can be found here.
Usage
Here is an example of how to run it:
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the Pixtral checkpoint and its processor; the model is moved to a CUDA device
model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)
IMG_URLS = [
"https://picsum.photos/id/237/400/300",
"https://picsum.photos/id/231/200/300",
"https://picsum.photos/id/27/500/500",
"https://picsum.photos/id/17/150/600",
]
# One [IMG] placeholder per image, in the same order as IMG_URLS
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"
inputs = processor(images=IMG_URLS, text=PROMPT, return_tensors="pt").to("cuda")
# Generate up to 500 new tokens, then decode the first (and only) sequence in the batch
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:
1. **Image 1:**
- **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
- **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.
2. **Image 2:**
- **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
- **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.
3. **Image 3:**
- **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
- **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.
4. **Image 4:**
- **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
- **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.
Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""
PixtralVisionConfig
autodoc PixtralVisionConfig
PixtralVisionModel
autodoc PixtralVisionModel - forward
PixtralImageProcessor
autodoc PixtralImageProcessor - preprocess
PixtralProcessor
autodoc PixtralProcessor