Multimodal Interaction & World Model
The Seed Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities, and to advancing the exploration and development of multimodal assistant products.

Research topics

Foundations and applications of multimodal understanding models
Develop integrated models that understand audio-visual and linguistic inputs; strengthen fundamental image and video understanding (text, layout, grounding, and spatial relations) as well as multimodal reasoning. Improve the efficiency of model training and inference, achieve long-term memory retention for users, and optimize model performance across devices for a better experience.
Multimodal, Foundation

Multimodal agent and inference
Achieve advanced capabilities for multimodal models, including multimodal RAG, visual CoT, and agentic behavior, and build general multimodal agents for GUIs and games in the virtual world.
Multimodal, Foundation, Agent
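The GUI/game agents described above typically run an observe-think-act loop: capture a screenshot, ask the multimodal model for the next action, execute it, and repeat. A minimal sketch in Python, with a stubbed policy standing in for the model (all names here are illustrative, not the team's actual API):

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str            # e.g. "click", "type", "done"
    argument: str = ""   # coordinates or text payload


def stub_policy(screenshot: bytes, goal: str, history: list) -> Action:
    """Placeholder for a multimodal model mapping (screenshot, goal, history) to an action.

    A real agent would send the screenshot and goal to a VLM here.
    """
    if not history:
        return Action("click", "search_box")
    if len(history) == 1:
        return Action("type", goal)
    return Action("done")


def run_agent(goal: str, max_steps: int = 10) -> list:
    """Observe-think-act loop: capture the screen, query the model, execute the action."""
    history = []
    for _ in range(max_steps):
        screenshot = b"..."  # stand-in for a real screen capture
        action = stub_policy(screenshot, goal, history)
        history.append(action)
        if action.kind == "done":
            break
    return history
```

Swapping `stub_policy` for a call to a served multimodal model, and the screenshot stub for a real screen capture, yields the basic shape of a GUI agent.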

Unified models for generation and understanding
Explore unified representation and training methods for both continuous and discrete signals, and develop models that can interleave both generation and understanding.
Multimodal, World Model

World Model
Employ pre-training and simulation technologies to model various environments of the virtual and physical world, providing foundational capabilities for multimodal interactive exploration.
Multimodal, World Model
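At its core, a world model in the sense above is a learned transition function: given the current state and an action, predict the next state, so that trajectories can be simulated without touching the real environment. A toy deterministic stand-in (purely illustrative; a real world model would be a learned neural simulator over pixels or latents):

```python
def transition(state: dict, action: str) -> dict:
    """Toy stand-in for a learned transition model on a 1-D gridworld."""
    moves = {"left": -1, "right": 1, "stay": 0}
    return {"pos": state["pos"] + moves[action]}


def rollout(model, state: dict, actions: list) -> list:
    """Simulate a trajectory by iterating the model: the core use of a world model for planning."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory
```

Planning then amounts to rolling out candidate action sequences inside the model and picking the one whose simulated trajectory scores best.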
Selected Papers

May 13, 2025
Seed1.5-VL Technical Report
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
Seed Multimodal Team
LLM
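The abstract notes that Seed1.5-VL is served on Volcano Engine under the model ID above. As a hedged sketch, a chat-completions-style request for image understanding might be assembled as below; the message schema follows the common OpenAI-compatible convention for vision inputs and is an assumption, not something the report specifies:

```python
import base64

MODEL_ID = "doubao-1-5-thinking-vision-pro-250428"  # from the report


def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Assemble a chat payload with one image part and one text part.

    The field names here (messages/content/image_url) follow the widely used
    OpenAI-compatible convention and are an assumption about the serving API.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }


payload = build_vision_request(b"\x89PNG...", "What text appears in this image?")
```

Such a payload would then be POSTed to whatever chat-completions endpoint the serving platform exposes, with authentication handled separately.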

Jan 21, 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
Computer Vision
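The "Unified Action Modeling" idea above, one action space shared across platforms, can be illustrated with a small parser that turns a model's textual action call into a structured record. The action grammar below is a simplified assumption for illustration, not UI-TARS's actual format:

```python
import re
from dataclasses import dataclass


@dataclass
class UIAction:
    name: str
    args: dict


ACTION_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_action(text: str) -> UIAction:
    """Parse e.g. "click(x=320, y=48)" into a platform-independent UIAction.

    Integer-looking values are converted to int; everything else stays a string.
    """
    match = ACTION_RE.match(text.strip())
    if not match:
        raise ValueError(f"unparseable action: {text!r}")
    name, arg_str = match.groups()
    args = {}
    for part in filter(None, (p.strip() for p in arg_str.split(","))):
        key, _, value = part.partition("=")
        value = value.strip().strip("'\"")
        args[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return UIAction(name, args)
```

A per-platform executor would then map each `UIAction` onto the concrete mouse, keyboard, or touch primitives of desktop, web, or mobile, which is what makes the shared action space portable.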

Jul 28, 2024
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT.
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Computer Vision
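The "interleaved data format as a general template" can be sketched as a single sequence of text and image slots, into which multi-image pairs, video frames, multi-view renders, or image patches are all flattened. A hypothetical illustration (not the actual M4-Instruct schema):

```python
def interleave(segments: list) -> tuple:
    """Flatten mixed text/image segments into one prompt with <image> placeholders.

    Returns the prompt string and the ordered list of image references; a model's
    vision encoder would later substitute image features at each placeholder.
    """
    parts, images = [], []
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["value"])
        elif seg["type"] == "image":
            images.append(seg["value"])
            parts.append("<image>")
        else:
            raise ValueError(f"unknown segment type: {seg['type']}")
    return " ".join(parts), images


prompt, imgs = interleave([
    {"type": "text", "value": "Compare these two photos:"},
    {"type": "image", "value": "photo_a.png"},
    {"type": "image", "value": "photo_b.png"},
    {"type": "text", "value": "Which was taken first?"},
])
```

Because video frames and 3D views reduce to the same image-slot sequence, one template covers all four scenarios the paper names.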
Technical applications

Seed-VLM
Seed-VLM is an advanced visual assistant designed for Doubao's scenarios. Post-training ensures dependable performance, and comprehensive features such as visual chain of thought (Visual CoT) enhance the user experience.
Visual-Language Model
Featured Jobs
Research Scientist/Engineer - Multimodal Interaction & World Model
Singapore
Experienced Hiring
Research Scientist- Foundation Model, Vision and Language
San Jose / Seattle
Experienced Hiring
Research Scientist, Multimodal Interaction & World Model - 2025 Start
Singapore
Campus Recruitment
Research Scientist Graduate- (Foundation Model, Vision and Language) - 2025 Start (PhD)
San Jose / Seattle
Campus Recruitment
Research Scientist Intern - Multimodal Interaction & World Model - 2025 Start
Singapore
Internship
Student Researcher (Seed - Foundation Model - Vision and Language) - 2025 Start (PhD)
San Jose / Seattle
Internship