Seed Research | Depth Anything 3: A Single-Transformer Architecture for Space Reconstruction from Any Views

Date: 2025-11-27

Category: Models

Humans can quickly build a 3D representation of a space in their minds from a single photo or video. This ability stems from our capacity to understand and reason about spatial geometric relationships across different views. For machines to understand and interact with the physical world, they likewise need stronger spatial intelligence and accurate scene reconstruction.

Most existing visual space reconstruction models rely on complex architectures and multi-task learning techniques, using separate network modules for tasks such as depth estimation, camera pose estimation, and multi-view geometric reconstruction. These methods often suffer from oversized architectures and tightly coupled tasks, which reduce their flexibility in multi-task scenarios.

Recently, the ByteDance Seed team has unveiled Depth Anything 3 (DA3), a pioneering visual space reconstruction model.

As the latest open-source model of the Depth Anything series, it breaks new ground by extending monocular depth estimation to arbitrary views. Unlike conventional multi-view geometric reconstruction methods, DA3 replaces complex, specialized architectures with a simple, efficient transformer model to achieve accurate spatial perception across views.

DA3 has the following two key features:

(1) Simple architecture: Powered by a single plain transformer with no bespoke architectural components, the model benefits directly from strong visual pretraining. Its lightweight architecture makes training and deployment more efficient.

(2) Unified representation: The model employs a unified "depth-ray" representation, eliminating the need for additional multi-task training schemes. This lets it focus on learning fundamental spatial geometric relationships and deliver more stable performance across tasks, improving both accuracy and efficiency.

DA3 can quickly complete 3D reconstruction across diverse scenes 

Guided by its philosophy of "simple design and precise focus," DA3 sets a new state-of-the-art (SOTA) in visual space reconstruction. It surpasses prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric reconstruction accuracy. Notably, DA3 also raises the benchmark in monocular depth estimation, a task where DA2 had already delivered SOTA performance.


The tech report and model code of DA3 are now publicly available:

Tech Report:

https://arxiv.org/abs/2511.10647

Code Link:

https://github.com/ByteDance-Seed/depth-anything-3

Project Page:

https://depth-anything-3.github.io

HF Demo:

https://huggingface.co/spaces/depth-anything/depth-anything-3


 1. Distilling the Underlying Mechanisms of Space Reconstruction for Simplification 

In spatial intelligence, 3D perception and reconstruction technology serves as a pivotal bridge between 2D visual data and the 3D physical world, powering applications such as autonomous driving, robotics, VR, and architectural surveying. However, further progress is significantly hindered by two major challenges in mainstream models: overly complex architectures and inefficient, tightly coupled output representations.

DA3 responds by revisiting the essence of spatial geometric modeling. Embracing "minimal modeling" as its core design principle, it aims to redefine the key fundamental capabilities that a model should possess.

DA3 employs a single transformer, introduces an input-adaptive cross-view self-attention mechanism, and leverages a dual-DPT head to predict depth and ray maps from visual tokens.
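
The sketch below (in PyTorch) illustrates this design pattern: patch tokens from all input views are concatenated so that ordinary self-attention operates across views, and two lightweight heads decode depth and ray maps from the resulting tokens. The class names, layer sizes, and head structures here are illustrative assumptions, not the released architecture.

```python
# Minimal sketch of the single-transformer, dual-head pattern described above.
# Everything here is an illustrative stand-in, not the official DA3 code.
import torch
import torch.nn as nn


class DualHeadReconstructor(nn.Module):
    """Toy stand-in: one plain transformer over tokens from all views,
    with two heads predicting per-patch depth and ray values."""

    def __init__(self, dim=384, depth=12, heads=6, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.depth_head = nn.Linear(dim, patch * patch)       # 1 depth value per pixel
        self.ray_head = nn.Linear(dim, 3 * patch * patch)     # 3D ray per pixel

    def forward(self, views):                      # views: (B, V, 3, H, W)
        b, v, _, h, w = views.shape
        tok = self.embed(views.flatten(0, 1))      # (B*V, dim, H/p, W/p)
        tok = tok.flatten(2).transpose(1, 2)       # (B*V, N, dim)
        n = tok.shape[1]
        # Concatenating tokens from every view makes self-attention cross-view:
        # no separate fusion module is needed, and V can vary at inference time.
        tok = self.backbone(tok.reshape(b, v * n, -1)).reshape(b, v, n, -1)
        return self.depth_head(tok), self.ray_head(tok)


x = torch.randn(1, 4, 3, 224, 224)                 # four input views
depth, rays = DualHeadReconstructor()(x)
print(depth.shape, rays.shape)                     # (1, 4, 196, 256), (1, 4, 196, 768)
```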

  • Focusing on geometric essence with a single architecture

In conventional methods, visual space reconstruction models often require separate network modules and interfaces for different tasks, resulting in complex architectures and high computational demands. The DA3 team observed that geometric space reconstruction fundamentally relies on cross-view feature correlation and global spatial modeling, both of which can be handled under a single modeling mechanism without designing task-specific modules.

Through task abstraction, the team revisited the underlying mechanism of model capabilities: With a self-attention mechanism inherently capable of capturing long-range dependencies, a transformer can flexibly process an arbitrary number of input views and dynamically exchange cross-view information. It eliminates the need for additional feature fusion modules, aligning perfectly with the core requirements of spatial geometric modeling.

Experimental results further demonstrated that while complex, customized architectures may yield marginal performance gains on specific tasks, the overall benefits are limited. Such designs significantly increase training and deployment costs and hinder applications on mobile and embedded platforms.

Based on the above insights, DA3 uses a single transformer as its foundational architecture, achieving a streamlined model design that delivers stable and efficient inference performance across various 3D tasks.

  • Achieving minimal yet sufficient outputs in visual space reconstruction

In 3D reconstruction tasks, output formats are tightly coupled with specific tasks. The lack of a unified logical connection between these formats makes it difficult to reuse and share information. Additionally, redundant outputs increase the training burden, hindering the learning and reasoning of core geometric information.

To address this, the DA3 team revisited a fundamental question in geometric modeling: What exactly does a model need to output to fully describe a 3D space? The team ultimately proposed a unified "depth-ray" representation. Here, depth indicates the distance from a pixel to the camera, while the ray represents the pixel's projection direction in 3D space. This representation fully and concisely encompasses the core geometric information required for 3D reconstruction, eliminating the need for additional, redundant task-specific representations.

Existing mainstream methods typically use 3D point clouds as the core representation and predict point clouds from different views within the same coordinate system to enforce spatial geometric consistency. However, point clouds couple spatial geometry with camera motion, forcing the model to learn both types of information simultaneously, which increases optimization difficulty. To alleviate this coupling, current mainstream methods are forced to incorporate additional prediction heads to separately output depth, camera motion information, and point clouds in image coordinates, further increasing system complexity.

In contrast, the "depth-ray" representation naturally decouples geometry from camera motion. This makes it easier for the model to estimate high-precision geometry and camera motion, and 3D point clouds can be recovered quickly from the two maps, yielding a minimal and efficient output representation.
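
As a concrete illustration of why this representation is minimal yet sufficient, the sketch below recovers a point cloud directly from a depth map and a ray map. It assumes depth is measured along the ray and that the ray map stores unit directions (plus an optional camera center); these are conventions chosen for illustration rather than DA3's exact parameterization.

```python
# Sketch: recovering a 3D point cloud from a depth map and a ray map.
import numpy as np

def points_from_depth_and_rays(depth, rays, origin=None):
    """depth: (H, W) per-pixel distances; rays: (H, W, 3) unit ray directions;
    origin: optional (3,) camera center in the same frame (defaults to zero)."""
    pts = depth[..., None] * rays                  # scale each ray by its depth
    if origin is not None:
        pts = pts + origin                         # shift into the shared frame
    return pts.reshape(-1, 3)                      # (H*W, 3) point cloud

# Toy check: a 2x2 image whose rays all point along +Z recovers a flat plane.
depth = np.full((2, 2), 3.0)
rays = np.zeros((2, 2, 3)); rays[..., 2] = 1.0
print(points_from_depth_and_rays(depth, rays))     # four points at z = 3
```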


 2. High-Precision Output in a Single Step, Eliminating the Need for Iterative Optimization 

Employing the aforementioned strategy, DA3 can generate high-precision outputs through a single forward pass, bypassing the complex iterative optimization process required in conventional methods. This design significantly enhances inference speed, simplifies training and deployment, and ensures accuracy and efficiency in 3D reconstruction tasks.
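
To make "no iterative optimization" concrete, the sketch below contrasts a single feed-forward call with what a classical pipeline would do. DA3Model is a mock stand-in; its infer method and output field names are assumptions for illustration, not the released API.

```python
# Feed-forward inference sketch; the model class here is a mock placeholder.
import torch

class DA3Model(torch.nn.Module):                  # stand-in, not the real model
    def infer(self, images):                      # images: (V, 3, H, W)
        v, _, h, w = images.shape
        return {"depth": torch.rand(v, h, w), "rays": torch.rand(v, 3, h, w)}

views = torch.rand(8, 3, 224, 224)                # eight frames from a clip

# Feed-forward reconstruction: one pass, no per-scene refinement loop.
with torch.no_grad():
    pred = DA3Model().infer(views)

# A classical pipeline would instead alternate feature matching and bundle
# adjustment over many iterations before producing comparable geometry.
print(pred["depth"].shape, pred["rays"].shape)
```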

On parkour-level challenges, DA3 demonstrates excellent spatial perception. As shown in the figure below, given a 28-second video with rapid camera movement and highly dynamic content, DA3 captures the camera motion, human movement trajectories, and detailed scene changes in a single pass.

In large-scale environments, accurate visual geometry estimation is crucial for simultaneous localization and mapping (SLAM). As demonstrated below, DA3 maintains superior performance in SLAM reconstruction tasks across ultra-large-scale scenes spanning 10 kilometers. This advantage indicates that DA3's architectural approach is not only suitable for small-scale, static scenes but also provides reliable geometric reconstruction in large-scale dynamic environments.

According to our experiments, DA3 quickly recovers 3D structure from arbitrary input images in feed-forward 3D Gaussian reconstruction and novel view rendering tasks, while exhibiting strong generalization in novel view synthesis.

When a vehicle captures images from multiple viewpoints, DA3 estimates stable depth maps that can be fused into a consistent 3D structure. In this way, DA3 helps autonomous driving systems understand the 3D structure of their surroundings more accurately, enhancing perception and decision-making in complex scenarios.
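
As a simple illustration of such fusion, the sketch below lifts per-view depth maps into a shared world frame and concatenates them, using a plain pinhole model with placeholder intrinsics and poses; it shows the generic fusion step itself rather than any DA3-specific interface.

```python
# Sketch: fusing per-view z-depth maps into one world-frame point cloud,
# given camera intrinsics K and camera-to-world poses T (placeholders here).
import numpy as np

def backproject(depth, K, T_cam_to_world):
    """Lift a (H, W) z-depth map to world-frame points with a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)    # camera frame
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (T_cam_to_world @ cam_h.T).T[:, :3]                   # world frame

K = np.array([[500.0, 0, 112], [0, 500.0, 112], [0, 0, 1]])      # toy intrinsics
views = []
for i in range(4):                                # four placeholder viewpoints
    T = np.eye(4); T[0, 3] = 0.5 * i              # cameras shifted along x
    depth = np.full((224, 224), 5.0)              # dummy constant depth map
    views.append(backproject(depth, K, T))
cloud = np.concatenate(views, axis=0)             # fused point cloud
print(cloud.shape)                                # (200704, 3)
```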


 3. Team Vision: Achieving Human-Level Visual Recognition Capabilities 

In cognitive neuroscience, the human cerebral cortex processes information through two visual pathways: one is responsible for understanding semantic information, such as object identity and color, while the other handles spatial geometric information, including object location, motion, and spatial relationships. Humans efficiently accomplish visual recognition tasks through the collaboration of these two pathways. The original intention behind the Depth Anything series of research is to endow AI with human-like spatial perception capabilities.

Currently, visual language models have made significant breakthroughs in semantic understanding, matching or even surpassing human performance on specific tasks. However, spatial geometric perception remains a core challenge. AI is still in its preliminary stages in this field and requires further technological breakthroughs.

DA3, with its advancements, opens up new possibilities for visual space reconstruction. It overcomes the technical bottlenecks of monocular depth estimation and multi-view reconstruction, taking a further step towards accurate spatial geometric perception. As spatial intelligence evolves, we envision that AI will not only understand semantic information within images but also integrate semantic recognition with spatial perception, ultimately achieving comprehensive visual recognition.