
# ViTDet

## Overview
The ViTDet model was proposed in [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. ViTDet leverages the plain Vision Transformer as a backbone for the task of object detection.
The abstract from the paper is the following:
*We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors.*
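To make the "simple feature pyramid" idea from the abstract concrete, the sketch below builds multi-scale maps from a single stride-16 ViT feature map using plain up- and downsampling, instead of tapping intermediate stages as an FPN would. The specific layer choices (stride-2 transposed convolutions, max pooling) are illustrative assumptions, not the paper's exact architecture:

```python
import torch
from torch import nn

hidden_size = 768  # channel dimension of a ViT-Base feature map

# Each scale is derived from the same single-scale map by simple resampling.
# These layer choices are an illustrative assumption, not ViTDet's exact design.
scale_layers = {
    "stride_4": nn.Sequential(  # upsample 4x with two stride-2 deconvolutions
        nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
        nn.GELU(),
        nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
    ),
    "stride_8": nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
    "stride_16": nn.Identity(),  # the backbone's native scale
    "stride_32": nn.MaxPool2d(kernel_size=2, stride=2),
}

# A single-scale feature map at stride 16, e.g. a 14x14 grid from a 224x224 image.
feature_map = torch.randn(1, hidden_size, 14, 14)

with torch.no_grad():
    pyramid = {name: layer(feature_map) for name, layer in scale_layers.items()}

for name, feature in pyramid.items():
    print(name, tuple(feature.shape))
# stride_4  (1, 768, 56, 56)
# stride_8  (1, 768, 28, 28)
# stride_16 (1, 768, 14, 14)
# stride_32 (1, 768, 7, 7)
```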
Tips:
- At the moment, only the backbone is available; a minimal usage sketch is shown below.
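Since only the backbone is available, a randomly initialized model can be instantiated from a configuration. A minimal sketch (the printed shape assumes the default configuration and a 224x224 input):

```python
import torch
from transformers import VitDetConfig, VitDetModel

# Instantiate the backbone from a default, randomly initialized configuration.
config = VitDetConfig()
model = VitDetModel(config)

# ViTDet consumes pixel values of shape (batch_size, num_channels, height, width).
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values)

# The backbone returns a spatial feature map of shape
# (batch_size, hidden_size, height // patch_size, width // patch_size).
print(list(outputs.last_hidden_state.shape))  # [1, 768, 14, 14]
```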
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).
## VitDetConfig

[[autodoc]] VitDetConfig
## VitDetModel

[[autodoc]] VitDetModel
    - forward