Florence2

Overview

The Florence2 model was proposed in "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks" by Microsoft.

Florence-2 is a vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Given a simple text prompt, it performs tasks such as captioning, object detection, and segmentation, and returns the result as text. It is trained in a multi-task setup on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images. Its sequence-to-sequence architecture lets it perform well in both zero-shot and fine-tuned settings.

The abstract from the paper is the following:

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.

This model was contributed by hlky. The original code can be found here.
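Because every task result is generated as text, the output of a task such as object detection has to be parsed back into structured data. The sketch below illustrates that idea in plain Python. It rests on assumptions that are not stated on this page: that each box is emitted as a label followed by four location tokens of the form `<loc_N>`, that coordinates are quantized into 1000 bins per axis, and that a bin maps to its left edge in pixels. The helper `parse_detections` is hypothetical and is not part of the library; the processor shipped with the model may decode boxes differently.

```python
import re

# Assumption: coordinates are quantized into 1000 bins per axis.
NUM_BINS = 1000

# Assumption: one detection is a text label followed by four location tokens,
# in the order x1, y1, x2, y2.
DETECTION = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_detections(generated_text, image_width, image_height):
    """Hypothetical helper: return a list of (label, (x1, y1, x2, y2)) in pixels.

    Special tokens such as begin/end-of-sequence markers should be stripped
    from `generated_text` before calling this function.
    """
    detections = []
    for match in DETECTION.finditer(generated_text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(value) for value in match.groups()[1:])
        # Map each bin index to the left edge of its bin in pixel space.
        box = (
            x1 * image_width / NUM_BINS,
            y1 * image_height / NUM_BINS,
            x2 * image_width / NUM_BINS,
            y2 * image_height / NUM_BINS,
        )
        detections.append((label, box))
    return detections


example = "car<loc_100><loc_200><loc_500><loc_800>"
print(parse_detections(example, image_width=1000, image_height=500))
# [('car', (100.0, 100.0, 500.0, 400.0))]
```

The image size is passed in explicitly because the generated tokens are relative to the image and carry no resolution of their own.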

Florence2Config

autodoc Florence2Config

Florence2Processor

autodoc Florence2Processor

Florence2ForConditionalGeneration

autodoc Florence2ForConditionalGeneration - forward

Florence2LanguageForConditionalGeneration

autodoc Florence2LanguageForConditionalGeneration - forward

Florence2LanguageModel

autodoc Florence2LanguageModel - forward

Florence2Vision

autodoc Florence2Vision - forward

Florence2VisionModel

autodoc Florence2VisionModel - forward

Florence2VisionModelWithProjection

autodoc Florence2VisionModelWithProjection - forward