transformers/docs/source/en/model_doc/idefics3.md
Andrés Marafioti f2c388e3f9
Add Idefics 3! (#32473)
* Add Idefics 3!

* fixes to make both pipelines identical

* fix for quantized models

* First pass at the review

* remove vocab size from the main config (it's still in the text_config)

* hot fix for merve

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* re-add model_type for text_config

* remove support for old_cache

* remove hidden_size from main config

* rename idefics3 HF repo

* few changes suggested in the PR

* fix to input_data_format computation

* remove overwrite of _autoset_attn_implementation following @zucchini-nlp suggestion

* improve example

* few improvements from amy's review

* big change to enable processing input images as numpy arrays

* Changes to the code to uniformize processor kwargs

* image processing tests

* image processing tests fixes and some bugs they discovered

* addressed review comments from Yoni

* fix modeling tests

* remove special tokens that are not special

* fixes tests

* skip failing tests - they also fail for idefics2

* added paper and readded the tests with multi gpu, who knows

* Update docs/source/en/model_doc/idefics3.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* review amy until image_processing_idefics3

* last comments from Amy

* review amy

* Update src/transformers/models/idefics3/image_processing_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/idefics3/modeling_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/idefics3.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* doc improvement - amy review

* fix runtime error during fine-tuning

* amy's review

* Update src/transformers/models/idefics3/image_processing_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/idefics3/image_processing_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/idefics3/modeling_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* ruff

* amy's comment on the order

* ruff ruff

* fix copies

* square images when they are not splitted

* ruff :(

* Update src/transformers/models/idefics3/image_processing_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/idefics3/test_processing_idefics3.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix small bug introduced in refactor

* amy's image processing changes

* fixes peft tests and ruff

* modify to_pil_image from transformers. and review from emanuele.

* add modified to_pil_image

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2024-09-25 21:28:49 +02:00

3.7 KiB
Raw Blame History

Idefics3

Overview

The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

  • It uses Llama3 for the text model.
  • It uses an updated processing logic for the images.
  • It removes the perceiver.

The abstract from the paper is the following:

The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

Usage tips

Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.

If do_resize is set to True, the model resizes images so that the longest edge is 4*364 pixels by default. The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 364} is the default, but you can change it to a different value if needed.

Heres how to control resizing and set a custom size:

image_processor = Idefics3ImageProcessor(do_resize=True, size={"longest_edge": 2 * 364}, max_image_size=364)

Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 364 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.

This model was contributed by amyeroberts and andimarafioti.

Idefics3Config

autodoc Idefics3Config

Idefics3Model

autodoc Idefics3Model - forward

Idefics3ForConditionalGeneration

autodoc Idefics3ForConditionalGeneration - forward

Idefics3ImageProcessor

autodoc Idefics3ImageProcessor - preprocess

Idefics3Processor

autodoc Idefics3Processor - call