Seed Research | Depth Anything 3: A Single-Transformer Architecture for Space Reconstruction from Any Views

Date: 2025-11-27

Category: Models

Humans can quickly build a 3D representation of a space in their minds from a single photo or video. This ability stems from our capacity to understand and reason about spatial geometric relationships across different views. For machines to understand and interact with the physical world, they likewise need stronger spatial intelligence and accurate scene reconstruction.

Most existing visual space reconstruction models rely on complex architectures and multi-task learning techniques, using separate network modules for tasks such as depth estimation, camera pose estimation, and multi-view geometric reconstruction. These methods often suffer from oversized architectures and tightly coupled tasks, which reduce their flexibility in multi-task scenarios.

Recently, the ByteDance Seed team has unveiled Depth Anything 3 (DA3), a pioneering visual space reconstruction model.

As the latest open-source model of the Depth Anything series, it breaks new ground by extending monocular depth estimation to arbitrary views. Unlike conventional multi-view geometric reconstruction methods, DA3 replaces complex, specialized architectures with a simple, efficient transformer model to achieve accurate spatial perception across views.

DA3 has the following two key features:

(1) Simple architecture: Powered by a single plain transformer with no bespoke architectural components, the model benefits directly from strong visual pretraining. Its lightweight architecture makes training and deployment more efficient.

(2) Unified representation: The model employs a unified "depth-ray" representation, eliminating the need for additional multi-task training schemes. This lets it focus on learning fundamental spatial geometric relationships and deliver more stable performance across tasks, improving both accuracy and efficiency.

DA3 can quickly complete 3D reconstruction across diverse scenes 

Guided by its philosophy of "simple design and precise focus," DA3 sets a new state-of-the-art (SOTA) in visual space reconstruction. It surpasses prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric reconstruction accuracy. Notably, DA3 also raises the benchmark in monocular depth estimation, a task where DA2 had already delivered SOTA performance.


The tech report and model code of DA3 are now publicly available:

Tech Report:

https://arxiv.org/abs/2511.10647

Code Link:

https://github.com/ByteDance-Seed/depth-anything-3

Project Page:

https://depth-anything-3.github.io

HF Demo:

https://huggingface.co/spaces/depth-anything/depth-anything-3


 1. Distilling the Underlying Mechanisms of Space Reconstruction for Simplification 

In spatial intelligence, 3D perception and reconstruction technology serves as a pivotal bridge between 2D visual data and the 3D physical world, powering applications such as autonomous driving, robotics, VR, and architectural surveying. However, further progress is significantly hindered by two major challenges in mainstream models: overly complex architectures and inefficient, tightly coupled output representations.

DA3 responds by revisiting the essence of spatial geometric modeling. Embracing "minimal modeling" as its core design principle, it aims to redefine the key fundamental capabilities that a model should possess.

DA3 employs a single transformer, introduces an input-adaptive cross-view self-attention mechanism, and leverages a dual-DPT head to predict depth and ray maps from visual tokens.
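
The sketch below (in PyTorch) illustrates this design pattern: patch tokens from all input views are concatenated so that ordinary self-attention operates across views, and two lightweight heads decode depth and ray maps from the resulting tokens. The class names, layer sizes, and head structures here are illustrative assumptions, not the released architecture.

```python
# Minimal sketch of the single-transformer, dual-head pattern described above.
# Everything here is an illustrative stand-in, not the official DA3 code.
import torch
import torch.nn as nn


class DualHeadReconstructor(nn.Module):
    """Toy stand-in: one plain transformer over tokens from all views,
    with two heads predicting per-patch depth and ray values."""

    def __init__(self, dim=384, depth=12, heads=6, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.depth_head = nn.Linear(dim, patch * patch)       # 1 depth value per pixel
        self.ray_head = nn.Linear(dim, 3 * patch * patch)     # 3D ray per pixel

    def forward(self, views):                      # views: (B, V, 3, H, W)
        b, v, _, h, w = views.shape
        tok = self.embed(views.flatten(0, 1))      # (B*V, dim, H/p, W/p)
        tok = tok.flatten(2).transpose(1, 2)       # (B*V, N, dim)
        n = tok.shape[1]
        # Concatenating tokens from every view makes self-attention cross-view:
        # no separate fusion module is needed, and V can vary at inference time.
        tok = self.backbone(tok.reshape(b, v * n, -1)).reshape(b, v, n, -1)
        return self.depth_head(tok), self.ray_head(tok)


x = torch.randn(1, 4, 3, 224, 224)                 # four input views
depth, rays = DualHeadReconstructor()(x)
print(depth.shape, rays.shape)                     # (1, 4, 196, 256), (1, 4, 196, 768)
```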

  • Focusing on geometric essence with a single architecture

In conventional methods, visual space reconstruction models often require separate network modules and interfaces for different tasks, resulting in complex architectures and high computational demands. The DA3 team observed that geometric space reconstruction fundamentally relies on cross-view feature correlation and global spatial modeling, both of which can be handled under a single modeling mechanism without designing task-specific modules.

Through task abstraction, the team revisited the underlying mechanism of model capabilities: With a self-attention mechanism inherently capable of capturing long-range dependencies, a transformer can flexibly process an arbitrary number of input views and dynamically exchange cross-view information. It eliminates the need for additional feature fusion modules, aligning perfectly with the core requirements of spatial geometric modeling.

Experimental results further demonstrated that while complex, customized architectures may yield marginal performance gains on specific tasks, the overall benefits are limited. Such designs significantly increase training and deployment costs and hinder applications on mobile and embedded platforms.

Based on the above insights, DA3 uses a single transformer as its foundational architecture, achieving a streamlined model design that delivers stable and efficient inference performance across various 3D tasks.

  • Achieving minimal yet sufficient outputs in visual space reconstruction

In 3D reconstruction tasks, output formats are tightly coupled with specific tasks. The lack of a unified logical connection between these formats makes it difficult to reuse and share information. Additionally, redundant outputs increase the training burden, hindering the learning and reasoning of core geometric information.

To address this, the DA3 team revisited a fundamental question in geometric modeling: What exactly does a model need to output to fully describe a 3D space? The team ultimately proposed a unified "depth-ray" representation. Here, depth indicates the distance from a pixel to the camera, while the ray represents the pixel's projection direction in 3D space. This representation fully and concisely encompasses the core geometric information required for 3D reconstruction, eliminating the need for additional, redundant task-specific representations.

Existing mainstream methods typically use 3D point clouds as the core representation and predict point clouds from different views within the same coordinate system to enforce spatial geometric consistency. However, point clouds couple spatial geometry with camera motion, forcing the model to learn both types of information simultaneously, which increases optimization difficulty. To alleviate this coupling, current mainstream methods are forced to incorporate additional prediction heads to separately output depth, camera motion information, and point clouds in image coordinates, further increasing system complexity.

In contrast, the "depth-ray" representation naturally decouples geometry from camera motion. This makes it easier for the model to estimate high-precision geometry and camera motion, and 3D point clouds can be recovered quickly from the two maps, yielding a minimal and efficient output representation.
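
As a concrete illustration of why this representation is minimal yet sufficient, the sketch below recovers a point cloud directly from a depth map and a ray map. It assumes depth is measured along the ray and that the ray map stores unit directions (plus an optional camera center); these are conventions chosen for illustration rather than DA3's exact parameterization.

```python
# Sketch: recovering a 3D point cloud from a depth map and a ray map.
import numpy as np

def points_from_depth_and_rays(depth, rays, origin=None):
    """depth: (H, W) per-pixel distances; rays: (H, W, 3) unit ray directions;
    origin: optional (3,) camera center in the same frame (defaults to zero)."""
    pts = depth[..., None] * rays                  # scale each ray by its depth
    if origin is not None:
        pts = pts + origin                         # shift into the shared frame
    return pts.reshape(-1, 3)                      # (H*W, 3) point cloud

# Toy check: a 2x2 image whose rays all point along +Z recovers a flat plane.
depth = np.full((2, 2), 3.0)
rays = np.zeros((2, 2, 3)); rays[..., 2] = 1.0
print(points_from_depth_and_rays(depth, rays))     # four points at z = 3
```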


 2. High-Precision Output in a Single Step, Eliminating the Need for Iterative Optimization 

Employing the aforementioned strategy, DA3 can generate high-precision outputs through a single forward pass, bypassing the complex iterative optimization process required in conventional methods. This design significantly enhances inference speed, simplifies training and deployment, and ensures accuracy and efficiency in 3D reconstruction tasks.
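
To make "no iterative optimization" concrete, the sketch below contrasts a single feed-forward call with what a classical pipeline would do. DA3Model is a mock stand-in; its infer method and output field names are assumptions for illustration, not the released API.

```python
# Feed-forward inference sketch; the model class here is a mock placeholder.
import torch

class DA3Model(torch.nn.Module):                  # stand-in, not the real model
    def infer(self, images):                      # images: (V, 3, H, W)
        v, _, h, w = images.shape
        return {"depth": torch.rand(v, h, w), "rays": torch.rand(v, 3, h, w)}

views = torch.rand(8, 3, 224, 224)                # eight frames from a clip

# Feed-forward reconstruction: one pass, no per-scene refinement loop.
with torch.no_grad():
    pred = DA3Model().infer(views)

# A classical pipeline would instead alternate feature matching and bundle
# adjustment over many iterations before producing comparable geometry.
print(pred["depth"].shape, pred["rays"].shape)
```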

On parkour-level challenges, DA3 demonstrates excellent spatial perception. As shown in the figure below, given a 28-second video with rapid camera movement and highly dynamic content, DA3 captures the camera motion, human movement trajectories, and detailed scene changes in a single pass.

In large-scale environments, accurate visual geometry estimation is crucial for simultaneous localization and mapping (SLAM). As demonstrated below, DA3 maintains superior performance in SLAM reconstruction tasks across ultra-large-scale scenes spanning 10 kilometers. This advantage indicates that DA3's architectural approach is not only suitable for small-scale, static scenes but also provides reliable geometric reconstruction in large-scale dynamic environments.

According to our experiments, DA3 quickly recovers 3D structure from arbitrary input images in feed-forward 3D Gaussian reconstruction and novel view rendering tasks, while exhibiting strong generalization in novel view synthesis.

When a vehicle captures images from multiple viewpoints, DA3 estimates stable depth maps that can be fused into a consistent 3D structure. In this way, DA3 helps autonomous driving systems understand the 3D structure of their surroundings more accurately, enhancing perception and decision-making in complex scenarios.
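
As a simple illustration of such fusion, the sketch below lifts per-view depth maps into a shared world frame and concatenates them, using a plain pinhole model with placeholder intrinsics and poses; it shows the generic fusion step itself rather than any DA3-specific interface.

```python
# Sketch: fusing per-view z-depth maps into one world-frame point cloud,
# given camera intrinsics K and camera-to-world poses T (placeholders here).
import numpy as np

def backproject(depth, K, T_cam_to_world):
    """Lift a (H, W) z-depth map to world-frame points with a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)    # camera frame
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (T_cam_to_world @ cam_h.T).T[:, :3]                   # world frame

K = np.array([[500.0, 0, 112], [0, 500.0, 112], [0, 0, 1]])      # toy intrinsics
views = []
for i in range(4):                                # four placeholder viewpoints
    T = np.eye(4); T[0, 3] = 0.5 * i              # cameras shifted along x
    depth = np.full((224, 224), 5.0)              # dummy constant depth map
    views.append(backproject(depth, K, T))
cloud = np.concatenate(views, axis=0)             # fused point cloud
print(cloud.shape)                                # (200704, 3)
```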


 3. Team Vision: Achieving Human-Level Visual Recognition Capabilities 

In cognitive neuroscience, the human cerebral cortex processes information through two visual pathways: one is responsible for understanding semantic information, such as object identity and color, while the other handles spatial geometric information, including object location, motion, and spatial relationships. Humans efficiently accomplish visual recognition tasks through the collaboration of these two pathways. The original intention behind the Depth Anything series of research is to endow AI with human-like spatial perception capabilities.

Currently, visual language models have made significant breakthroughs in semantic understanding, matching or even surpassing human performance on specific tasks. However, spatial geometric perception remains a core challenge. AI is still in its preliminary stages in this field and requires further technological breakthroughs.

DA3, with its advancements, opens up new possibilities for visual space reconstruction. It overcomes the technical bottlenecks of monocular depth estimation and multi-view reconstruction, taking a further step towards accurate spatial geometric perception. As spatial intelligence evolves, we envision that AI will not only understand semantic information within images but also integrate semantic recognition with spatial perception, ultimately achieving comprehensive visual recognition.