
DPT
Overview
The DPT model was proposed in Vision Transformers for Dense Prediction by René Ranftl, Alexey Bochkovskiy and Vladlen Koltun. DPT is a model that leverages the Vision Transformer (ViT) as a backbone for dense prediction tasks like semantic segmentation and depth estimation.
The abstract from the paper is the following:
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art.
DPT architecture. Taken from the original paper.
This model was contributed by nielsr. The original code can be found here.
Usage tips
DPT is compatible with the [AutoBackbone] class. This allows the DPT framework to be used with various computer vision backbones available in the library, such as [VitDetBackbone] or [Dinov2Backbone]. One can create it as follows:
from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation
# initialize with a Transformer-based backbone such as DINOv2
# in that case, we also specify `reshape_hidden_states=False` to get feature maps of shape (batch_size, num_channels, height, width)
backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_features=["stage1", "stage2", "stage3", "stage4"], reshape_hidden_states=False)
config = DPTConfig(backbone_config=backbone_config)
model = DPTForDepthEstimation(config=config)
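Once the model is created (or loaded from a pretrained checkpoint), depth estimation follows the usual processor-plus-model pattern. The following is a minimal inference sketch; the Intel/dpt-large checkpoint and the COCO test image URL are illustrative choices, not requirements:
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DPTForDepthEstimation

# "Intel/dpt-large" is used here for illustration; any DPT depth checkpoint works
processor = AutoImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# upsample the predicted depth map back to the original image size
prediction = torch.nn.functional.interpolate(
    outputs.predicted_depth.unsqueeze(1),
    size=image.size[::-1],  # PIL size is (width, height)
    mode="bicubic",
    align_corners=False,
).squeeze()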
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT.
- Demo notebooks for [DPTForDepthEstimation] can be found here.
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
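Beyond the notebooks above, the sketch below shows minimal semantic segmentation inference with DPT. It assumes the Intel/dpt-large-ade checkpoint (fine-tuned on ADE20K) and relies on the image processor's post_process_semantic_segmentation helper:
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DPTForSemanticSegmentation

# "Intel/dpt-large-ade" is an ADE20K-finetuned DPT checkpoint, used here for illustration
processor = AutoImageProcessor.from_pretrained("Intel/dpt-large-ade")
model = DPTForSemanticSegmentation.from_pretrained("Intel/dpt-large-ade")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# resize the logits to the original image size and take the per-pixel argmax
segmentation = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]  # tensor of shape (height, width) holding class indices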
DPTConfig
autodoc DPTConfig
DPTFeatureExtractor
autodoc DPTFeatureExtractor
- __call__
- post_process_semantic_segmentation
DPTImageProcessor
autodoc DPTImageProcessor
- preprocess
DPTImageProcessorFast
autodoc DPTImageProcessorFast
- preprocess
- post_process_semantic_segmentation
- post_process_depth_estimation
DPTModel
autodoc DPTModel
- forward
DPTForDepthEstimation
autodoc DPTForDepthEstimation
- forward
DPTForSemanticSegmentation
autodoc DPTForSemanticSegmentation
- forward