PyTorch

# ZoeDepth

ZoeDepth is a depth estimation model that combines the generalization of relative depth estimation (how far objects are from each other) with the precision of metric depth estimation (absolute depth in metric units) from a single image. It is pre-trained on 12 datasets for relative depth and on 2 datasets (NYU Depth v2 and KITTI) for metric accuracy. A lightweight head with a metric bin module is used for each domain, and at inference time a latent classifier automatically routes each input image to the appropriate head.


You can find all the original ZoeDepth checkpoints under the Intel organization.
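
To browse them programmatically, here is a minimal sketch using `huggingface_hub` (the `author` and `search` filters are generic Hub arguments, not ZoeDepth-specific):

```py
from huggingface_hub import list_models

# list Intel-hosted checkpoints whose name matches "zoedepth"
for model in list_models(author="Intel", search="zoedepth"):
    print(model.id)
```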

The examples below demonstrate how to estimate depth with [`Pipeline`] or the [`AutoModel`] class.

```py
import requests
import torch
from transformers import pipeline
from PIL import Image

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

pipeline = pipeline(
    task="depth-estimation",
    model="Intel/zoedepth-nyu-kitti",
    torch_dtype=torch.float16,
    device=0
)
results = pipeline(image)
results["depth"]
```
For more control over pre- and post-processing, use [`AutoModelForDepthEstimation`] and [`AutoImageProcessor`] directly.

```py
import torch
import requests
from PIL import Image
from transformers import AutoModelForDepthEstimation, AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "Intel/zoedepth-nyu-kitti"
)
model = AutoModelForDepthEstimation.from_pretrained(
    "Intel/zoedepth-nyu-kitti",
    device_map="auto"
)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

# interpolate to the original size and visualize the prediction
# ZoeDepth dynamically pads the input image, so pass the original image size to
# `post_process_depth_estimation` to remove the padding and resize to the original dimensions
post_processed_output = image_processor.post_process_depth_estimation(
    outputs,
    source_sizes=[(image.height, image.width)],
)

predicted_depth = post_processed_output[0]["predicted_depth"]
depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
depth = depth.detach().cpu().numpy() * 255
Image.fromarray(depth.astype("uint8"))
```
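
Because the NYU and KITTI checkpoints predict metric depth, the post-processed `predicted_depth` values can be read directly as distances in meters. A minimal sketch, reusing the `predicted_depth` tensor from the example above:

```py
# ensure a (height, width) tensor and query the estimated distance at the image center
depth_map = predicted_depth.squeeze()
h, w = depth_map.shape
print(f"estimated depth at image center: {depth_map[h // 2, w // 2].item():.2f} m")
```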

## Notes

- In the original implementation, ZoeDepth performs inference on both the original and the flipped image and averages the results. The `post_process_depth_estimation` function handles this when the flipped outputs are passed to the optional `outputs_flipped` argument, as shown below.

    ```py
    with torch.no_grad():
        outputs = model(pixel_values=inputs.pixel_values)
        outputs_flipped = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))
        post_processed_output = image_processor.post_process_depth_estimation(
            outputs,
            source_sizes=[(image.height, image.width)],
            outputs_flipped=outputs_flipped,
        )
    ```
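
- [`ZoeDepthImageProcessorFast`] offers a faster, PyTorch-based preprocessing path. A minimal sketch of opting into it (the `use_fast` flag is the generic `AutoImageProcessor` switch, not a ZoeDepth-specific option):

    ```py
    from transformers import AutoImageProcessor

    # request the fast image processor class (ZoeDepthImageProcessorFast)
    image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu-kitti", use_fast=True)
    ```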

## Resources

- Refer to this notebook for an inference example.

## ZoeDepthConfig

[[autodoc]] ZoeDepthConfig

## ZoeDepthImageProcessor

[[autodoc]] ZoeDepthImageProcessor
    - preprocess

## ZoeDepthImageProcessorFast

[[autodoc]] ZoeDepthImageProcessorFast
    - preprocess

## ZoeDepthForDepthEstimation

[[autodoc]] ZoeDepthForDepthEstimation
    - forward