
Multimodal Interaction & World Model
The Seed Multimodal Interaction and World Model team is dedicated to developing models that have human-level multimodal understanding and interaction capabilities. The team is working to advance the exploration and development of multimodal assistant products.
Latest advancements
Selected papers
Apr 22, 2026
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0 [16], with
substantial improvements across generation fidelity, simulation-ready capabilities, and application
coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from
high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression
and more efficient decoding. For texture and material generation, we replace the cascaded
pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo
and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic
conditioning for improved material precision and visual fidelity. Beyond single-object generation,
Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware
decomposition, and training-free articulation generation, enabling coherent scene construction and
part-level physical interaction across physics and graphics engines. A large-scale human preference
study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates
of 69.0% to 89.9% in textured 3D asset generation.
Computer Vision
May 20, 2025
Emerging Properties in Unified Multimodal Pretraining
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community.
Computer Vision
May 13, 2025
Seed1.5-VL Technical Report
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).
LLM
Featured roles
Research Scientist - Seed Multimodal Interaction and World Model
San Jose
Experienced Hiring
Research Scientist Graduate- (Multimodal Interaction and World Model) - 2026 Start (PhD)
San Jose
Campus Recruitment
Student Researcher [Seed – Multimodal Interaction & World Model - RL Focused] – 2026 Start (PhD)
San Jose
Internship