Multimodal Interaction & World Model
The Seed Multimodal Interaction & World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities, and to driving the exploration and development of multimodal assistant products.
Research Directions
Multimodal Understanding Foundation Models and Applications
Build understanding models that fuse audio-visual and language signals; improve fundamental understanding of text, layout, grounding, and spatial relations in images and videos; and strengthen multimodal reasoning. Improve training and inference efficiency, support long-term user memory, and optimize the model experience across all kinds of end devices.
Multimodal · Foundation
Multimodal Agents and Reasoning
Push forward advanced multimodal model capabilities, including multimodal RAG, visual CoT, and agents, and build general-purpose multimodal agents for virtual worlds such as GUIs and games.
Multimodal · Foundation · Agent
Unified Generation and Understanding Models
Explore unified representation and training methods for continuous and discrete signals, and build models that interleave generation and understanding.
Multimodal · World Model
World Model
Model the diverse environments of virtual and real worlds using pre-training, simulation, and related techniques, providing the foundational capability for multimodal interactive exploration.
Multimodal · World Model

Selected Papers

2025.01.21
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain. (A hedged sketch of a unified action space in this spirit appears after this entry.)
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
Computer Vision
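The unified action space is the load-bearing idea in UI-TARS: one small vocabulary of grounded actions has to cover desktop, web, and mobile alike. Below is a minimal sketch of what such a space could look like; the type names, fields, and example are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a cross-platform unified action space in the spirit
# of UI-TARS's Unified Action Modeling; names and fields are illustrative.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    CLICK = "click"      # mouse click / touch tap, shared across platforms
    DRAG = "drag"        # press-move-release gesture
    TYPE = "type"        # keyboard text input
    HOTKEY = "hotkey"    # key combination, e.g. "ctrl+c"
    SCROLL = "scroll"    # scroll or swipe
    WAIT = "wait"        # no-op step while the UI updates
    FINISH = "finish"    # terminate the episode


@dataclass
class UnifiedAction:
    """One step in an action trace, grounded in screen coordinates."""
    action_type: ActionType
    target: Optional[Tuple[float, float]] = None  # normalized (x, y) in [0, 1]
    end: Optional[Tuple[float, float]] = None     # drag end point, if any
    text: Optional[str] = None                    # payload for TYPE / HOTKEY
    thought: str = ""                             # System-2 reasoning before acting


# Example: a grounded click preceded by explicit reasoning.
step = UnifiedAction(
    action_type=ActionType.CLICK,
    target=(0.42, 0.17),
    thought="The search box is at the top; click it before typing the query.",
)
```

Standardizing actions this way is what lets large-scale traces from different platforms train a single grounding-and-interaction policy.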
2024.07.28
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks; their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT. (An illustrative sketch of the interleaved sample format appears after this entry.)
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Computer Vision
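The interleaved template is what lets one model cover multi-image, video, 3D, and single-image inputs. Here is an illustrative sketch of what an M4-Instruct-style sample could look like; the field names and the <image> placeholder convention are assumptions rather than the released format.

```python
# Illustrative sketch of an interleaved multimodal training sample, assuming
# a format in the spirit of M4-Instruct; field names are hypothetical.
sample = {
    "id": "m4_instruct_000001",
    "images": ["frame_0.jpg", "frame_1.jpg"],  # paths resolved by the loader
    "conversations": [
        {
            "role": "user",
            # <image> placeholders mark where each image is spliced into the text
            "content": "<image>\n<image>\nWhat changed between these two frames?",
        },
        {
            "role": "assistant",
            "content": "The car in the left lane has moved ahead of the truck.",
        },
    ],
}


def count_image_slots(turns: list[dict]) -> int:
    """Sanity check: placeholders in the text must match the image list."""
    return sum(turn["content"].count("<image>") for turn in turns)


assert count_image_slots(sample["conversations"]) == len(sample["images"])
```

The design choice worth noting is that video frames, 3D views, and image patches need no special handling: they are all just more images spliced into the same token stream.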
2024.03.10
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detailed-captioning performance and surpassing GPT-4o. (A minimal sketch of a DCScore-style decomposition loop appears after this entry.)
Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
Computer Vision
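DCScore's core move is scoring primitive information units rather than whole captions. The sketch below shows that recipe in precision-only form, with rule-based stand-ins for the judge-model calls; decompose and verify are hypothetical placeholders, and the full metric also rewards comprehensiveness (coverage of reference units), which this sketch omits.

```python
# Minimal sketch of a DCScore-style evaluation loop: decompose a caption into
# primitive information units, then verify each unit against the image.
# `decompose` and `verify` stand in for LLM/VLM judge calls.
from typing import Callable, List


def dcscore_like(
    caption: str,
    decompose: Callable[[str], List[str]],  # caption -> primitive units
    verify: Callable[[str], bool],          # unit -> supported by the image?
) -> float:
    """Fraction of primitive units supported by the image (precision-style).

    A unit judged unsupported counts as a hallucination; a longer caption
    only scores higher if its extra units are actually grounded.
    """
    units = decompose(caption)
    if not units:
        return 0.0
    supported = sum(verify(u) for u in units)
    return supported / len(units)


# Toy usage with rule-based stand-ins for the judge models.
ground_truth = {"a red bus", "parked on the street", "two people nearby"}
score = dcscore_like(
    "a red bus parked on the street. a dog sleeps on the roof.",
    decompose=lambda c: [s.strip() for s in c.split(".") if s.strip()],
    verify=lambda u: any(fact in u for fact in ground_truth),
)
print(f"DCScore-like value: {score:.2f}")  # 0.50: one of two units is grounded
```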
View More Papers
Technical Applications
Seed-VLM
Seed-VLM builds a frontier-experience visual assistant for Doubao scenarios. User-preference-aligned post-training ensures highly reliable responses, and Visual Chain-of-Thought (Visual CoT) enables a richer feature experience. A hedged sketch of a Visual CoT inference pattern is shown below.
Vision-Language Model
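For readers unfamiliar with the pattern, Visual CoT simply means the model produces grounded visual reasoning before the final answer. A hedged two-pass sketch follows, where the generate callable, prompt wording, and model interface are all illustrative assumptions rather than Seed-VLM's actual API.

```python
# Hedged sketch of a Visual CoT-style inference pattern: the model first
# emits grounded reasoning about the image, then answers using it.
# The `generate` interface and prompts are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class VisualCoTResult:
    reasoning: str  # intermediate visual reasoning: regions, relations, text
    answer: str     # final user-facing response


def visual_cot(generate, image_path: str, question: str) -> VisualCoTResult:
    """Two-pass prompting: reason about the image, then answer from that reasoning."""
    reasoning = generate(
        image=image_path,
        prompt=f"Describe the regions and relations relevant to: {question}",
    )
    answer = generate(
        image=image_path,
        prompt=f"Reasoning: {reasoning}\nUsing this reasoning, answer: {question}",
    )
    return VisualCoTResult(reasoning=reasoning, answer=answer)


# Toy usage with a stub generator in place of a real VLM call.
stub = lambda image, prompt: f"[model output for: {prompt[:40]}...]"
print(visual_cot(stub, "receipt.jpg", "What is the total amount?").answer)
```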

Open Positions

Multimodal World Model Algorithm Researcher/Expert – Doubao LLM
Beijing / Shanghai / Hangzhou / Shenzhen
Experienced Hire
Apply Now
Multimodal World Model Algorithm Engineer/Expert – Doubao LLM
Beijing / Shanghai / Hangzhou / Shenzhen
Experienced Hire
Apply Now
Vision Foundation Model Algorithm Expert – Doubao LLM (Top Seed)
Beijing / Shanghai / Hangzhou / Shenzhen
Campus Recruitment
Apply Now
Multimodal World Model Algorithm Intern – Doubao LLM
Beijing / Shanghai / Hangzhou / Shenzhen
Internship
Apply Now