ApDepth: Aiming for Precise Monocular Depth Estimation Based on Diffusion Models

1 Beijing University of Chinese Medicine
2 Shenyang Jianzhu University
These authors contributed equally to this work.

ApDepth achieves strong results in object edge refinement


Overview

We present ApDepth, a diffusion model and associated fine-tuning protocol for monocular depth estimation.

Built on Marigold, its core contribution is addressing the limited feature-representation capability of diffusion models for depth estimation. Following Marigold, our model is derived from Stable Diffusion and fine-tuned on synthetic data (Hypersim and Virtual KITTI), and it achieves strong results in object edge refinement.

We first fine-tune the model to resolve the issue of excessive inference time, replacing multi-step denoising with single-step denoising. Next, we introduce a pre-trained depth estimation model to assist the diffusion model: we use only Depth Anything V2 Small as a "teacher model" that guides our model toward more accurate depth maps, and this alone yields strong results. Finally, we introduce a new training strategy that combines an MSE loss with a gradient loss to enhance the model's ability to capture edge information, sketched in the example below.
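The following is a minimal sketch of the combined objective described above, assuming a standard pixel-space formulation; the function names and the weighting factor are illustrative placeholders, not taken from the ApDepth code base.

```python
import torch
import torch.nn.functional as F

def gradient_loss(pred, target):
    # Penalize differences between the spatial gradients of the predicted
    # and ground-truth depth maps; this emphasizes object edges.
    # pred, target: (B, 1, H, W) depth maps.
    pred_dx = pred[:, :, :, 1:] - pred[:, :, :, :-1]
    pred_dy = pred[:, :, 1:, :] - pred[:, :, :-1, :]
    tgt_dx = target[:, :, :, 1:] - target[:, :, :, :-1]
    tgt_dy = target[:, :, 1:, :] - target[:, :, :-1, :]
    return (pred_dx - tgt_dx).abs().mean() + (pred_dy - tgt_dy).abs().mean()

def combined_loss(pred, target, grad_weight=0.5):
    # MSE keeps overall depth values accurate; the gradient term sharpens edges.
    # grad_weight is a hypothetical balancing factor.
    return F.mse_loss(pred, target) + grad_weight * gradient_loss(pred, target)
```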

Gallery

How it works

Fine-tuning protocol

ApDepth fine-tuning scheme

Starting from a pretrained Stable Diffusion model, the input RGB image is first processed by a pretrained depth estimation network to obtain a predicted depth map. Both the RGB image and the predicted depth map are then encoded into their latent representations using the frozen VAE encoder, producing z_rgb and z_depth. These two latent tensors are concatenated and fed into a modified Latent Diffusion U-Net, which performs a single denoising step to predict the target latent. The ground-truth depth is also encoded into its latent form z_gt, and the training objective minimizes the mean squared error between the predicted latent and z_gt.
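A hedged sketch of one fine-tuning step under this protocol is shown below. The objects `vae`, `unet`, and `teacher` are placeholders standing in for the frozen Stable Diffusion autoencoder, the modified U-Net, and Depth Anything V2 Small; the exact interfaces in ApDepth may differ (e.g., the single-channel depth maps are assumed to be replicated to three channels before VAE encoding, as in Marigold).

```python
import torch
import torch.nn.functional as F

def training_step(rgb, gt_depth, vae, unet, teacher, optimizer):
    # rgb: (B, 3, H, W), gt_depth: (B, 1, H, W)
    with torch.no_grad():
        coarse = teacher(rgb)                              # coarse depth from the teacher model
        z_rgb = vae.encode(rgb)                            # RGB latent (frozen encoder)
        z_depth = vae.encode(coarse.repeat(1, 3, 1, 1))    # coarse-depth latent
        z_gt = vae.encode(gt_depth.repeat(1, 3, 1, 1))     # ground-truth depth latent
    z_in = torch.cat([z_rgb, z_depth], dim=1)              # concatenate along channels
    t = torch.zeros(rgb.shape[0], dtype=torch.long, device=rgb.device)  # fixed single timestep
    z_pred = unet(z_in, t)                                 # one denoising step predicts the target latent
    loss = F.mse_loss(z_pred, z_gt)                        # MSE in latent space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```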

Single-Step Inference

ApDepth inference scheme

During inference, the input RGB image is first processed by a pretrained depth estimation model to generate a coarse predicted depth map. Both the RGB image and the predicted depth map are encoded into the latent space using the frozen VAE encoder to obtain the corresponding latent representations. These two latents are concatenated and fed into the Latent Diffusion U-Net, which performs only a single denoising step to generate the predicted latent representation. Finally, the latent decoder reconstructs the refined depth map from the predicted latent, producing the final output depth.
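The inference path can be summarized with the sketch below, under the same assumptions as the training sketch above (`vae`, `unet`, and `teacher` are placeholder components; the depth replication and the averaging of decoded channels are assumptions, not confirmed details of the ApDepth implementation).

```python
import torch

@torch.no_grad()
def infer_depth(rgb, vae, unet, teacher):
    # rgb: (B, 3, H, W) input image
    coarse = teacher(rgb)                               # coarse depth prediction
    z_rgb = vae.encode(rgb)                             # RGB latent
    z_depth = vae.encode(coarse.repeat(1, 3, 1, 1))     # coarse-depth latent
    z_in = torch.cat([z_rgb, z_depth], dim=1)           # concatenated latent input
    t = torch.zeros(rgb.shape[0], dtype=torch.long, device=rgb.device)
    z_pred = unet(z_in, t)                              # single denoising step
    depth = vae.decode(z_pred).mean(dim=1, keepdim=True)  # decode, average channels to one depth map
    return depth
```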

In our experiments, the model requires only 4 GB of GPU memory (GeForce RTX 4090) and about 5 minutes of inference time on the NYU Depth V2 dataset.

BibTeX

@article{haruko26apdepth,
  author    = {Haruko386 and Yuan, Shuai},
  title     = {ApDepth: Aiming for Precise Monocular Depth Estimation Based on Diffusion Models},
  journal   = {Under review},
  year      = {2026},
}