mirror of https://github.com/huggingface/transformers.git synced 2025-07-13 17:48:22 +06:00

Refactoring of ImageProcessorFast (#35069)

* add init and base image processing functions

* add add_fast_image_processor to transformers-cli

* add working fast image processor clip

* add fast image processor to doc, working tests

* remove "to be implemented" SigLip

* fix unprotected import

* fix unprotected vision import

* update ViTImageProcessorFast

* increase threshold slow fast equivalence

* add fast img blip

* add fast class in tests with cli

* improve cli

* add fast image processor convnext

* add LlavaPatchingMixin and fast image processor for llava_next and llava_onevision

* add device kwarg to ImagesKwargs for fast processing on cuda

* cleanup

* fix unprotected import

* group images by sizes and add batch processing

* Add batch equivalence tests, skip when center_crop is used

* cleanup

* update init and cli

* fix-copies

* refactor convnext, cleanup base

* fix

* remove patching mixins, add piped torchvision transforms for ViT

* fix unbatched processing

* fix f strings

* protect imports

* change llava onevision to class transforms (test)

* fix convnext

* improve formatting (following Pavel review)

* fix handling device arg

* improve cli

* fix

* fix inits

* Add distinction between preprocess and _preprocess, and support for arbitrary kwargs through valid_extra_kwargs

* uniformize qwen2_vl fast

* fix docstrings

* add fast image processor llava

* remove min_pixels max_pixels from accepted size

* nit

* nit

* refactor fast image processors docstrings

* cleanup and remove fast class transforms

* update add fast image processor transformers cli

* cleanup docstring

* uniformize pixtral fast and make _process_image explicit

* fix prepare image structure llava next/onevision

* Use typed kwargs instead of explicit args

* nit fix import Unpack

* clearly separate pops and gets in base preprocess. Use explicit typed kwargs

* make qwen2_vl preprocess arguments hashable

2025-02-04 17:52:31 -05:00

4.5 KiB


BLIP

Overview

The BLIP model was proposed in BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.

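BLIP is a model that can perform various multimodal tasks, including:

Visual question answering
Image-text retrieval (image-text matching)
Image captioning

The abstract from the paper is the following: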

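Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.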
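The caption bootstrapping idea described in the abstract (a captioner proposes synthetic captions, a filter drops noisy ones) can be illustrated with a small toy sketch. Everything here is hypothetical and only stands in for the real components: `Pair`, `bootstrap_captions`, `toy_captioner`, `toy_filter`, and the toy data are illustration names, not part of the `transformers` API or the original BLIP code. The sketch also assumes the filter is applied to both the web caption and the synthetic caption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    """One image-text pair; `source` records where the caption came from."""
    image_id: str
    caption: str
    source: str  # "web" or "synthetic"


def bootstrap_captions(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> List[Pair]:
    """For each web pair, add a synthetic caption, then keep only the
    captions (web or synthetic) that the filter accepts."""
    cleaned = []
    for pair in web_pairs:
        synthetic = Pair(pair.image_id, captioner(pair.image_id), "synthetic")
        for candidate in (pair, synthetic):
            if keep(candidate.image_id, candidate.caption):
                cleaned.append(candidate)
    return cleaned


# Toy stand-ins (assumptions): a real captioner and filter would be
# learned models; here they are a lookup table and a keyword check.
TOY_CONTENT = {"img1": {"dog", "beach"}, "img2": {"cat", "sofa"}}
TOY_CAPTIONS = {"img1": "a dog on a beach", "img2": "a cat on a sofa"}


def toy_captioner(image_id: str) -> str:
    return TOY_CAPTIONS[image_id]


def toy_filter(image_id: str, caption: str) -> bool:
    # Accept a caption if it mentions at least one thing in the image.
    words = set(caption.lower().split())
    return len(words & TOY_CONTENT[image_id]) > 0


web_pairs = [
    Pair("img1", "buy cheap sunglasses now", "web"),    # noisy alt-text
    Pair("img2", "my cat relaxing on the sofa", "web"),  # usable caption
]

cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
for pair in cleaned:
    print(pair.image_id, pair.source, "-", pair.caption)
```

In this toy run the noisy web caption for `img1` is discarded and replaced by its synthetic caption, while `img2` keeps both captions. The resulting list plays the role of the cleaned dataset used for further pre-training.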

This model was contributed by ybelkada. The original code can be found here.

Resources

Jupyter notebook on how to fine-tune BLIP for image captioning on a custom dataset

BlipConfig

autodoc BlipConfig - from_text_vision_configs

BlipTextConfig

autodoc BlipTextConfig

BlipVisionConfig

autodoc BlipVisionConfig

BlipProcessor

autodoc BlipProcessor

BlipImageProcessor

autodoc BlipImageProcessor - preprocess

BlipImageProcessorFast

autodoc BlipImageProcessorFast - preprocess

BlipModel

autodoc BlipModel - forward - get_text_features - get_image_features

BlipTextModel

autodoc BlipTextModel - forward

BlipVisionModel

autodoc BlipVisionModel - forward

BlipForConditionalGeneration

autodoc BlipForConditionalGeneration - forward

BlipForImageTextRetrieval

autodoc BlipForImageTextRetrieval - forward

BlipForQuestionAnswering

autodoc BlipForQuestionAnswering - forward

TFBlipModel

autodoc TFBlipModel - call - get_text_features - get_image_features

TFBlipTextModel

autodoc TFBlipTextModel - call

TFBlipVisionModel

autodoc TFBlipVisionModel - call

TFBlipForConditionalGeneration

autodoc TFBlipForConditionalGeneration - call

TFBlipForImageTextRetrieval

autodoc TFBlipForImageTextRetrieval - call

TFBlipForQuestionAnswering

autodoc TFBlipForQuestionAnswering - call
