Multimodal Interaction & World Model
The Seed Multimodal Interaction & World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities, and to driving the exploration and development of multimodal assistant products.
Research Directions
Multimodal Understanding Foundation Models and Applications
Build understanding models that fuse audio-visual and language signals; improve fundamental understanding of text, layout, grounding, and spatial relations in images and videos; and strengthen multimodal reasoning. Improve training and inference efficiency, support long-term user memory, and optimize the model experience across all kinds of end devices.
Multimodal · Foundation
Multimodal Agents and Reasoning
Push forward advanced multimodal model capabilities, including multimodal RAG, visual CoT, and agents, and build general-purpose multimodal agents for virtual worlds such as GUIs and games.
Multimodal · Foundation · Agent
Unified Generation and Understanding Models
Explore unified representation and training methods for continuous and discrete signals, and build models that interleave generation and understanding.
Multimodal · World Model
World Model
Model the diverse environments of virtual and real worlds using pre-training, simulation, and related techniques, providing the foundational capability for multimodal interactive exploration.
Multimodal · World Model

Selected Papers

2025.01.21
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain. (A hedged sketch of a unified action space in this spirit appears after this entry.)
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
Computer Vision
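The unified action space is the load-bearing idea in UI-TARS: one small vocabulary of grounded actions has to cover desktop, web, and mobile alike. Below is a minimal sketch of what such a space could look like; the type names, fields, and example are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a cross-platform unified action space in the spirit
# of UI-TARS's Unified Action Modeling; names and fields are illustrative.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    CLICK = "click"      # mouse click / touch tap, shared across platforms
    DRAG = "drag"        # press-move-release gesture
    TYPE = "type"        # keyboard text input
    HOTKEY = "hotkey"    # key combination, e.g. "ctrl+c"
    SCROLL = "scroll"    # scroll or swipe
    WAIT = "wait"        # no-op step while the UI updates
    FINISH = "finish"    # terminate the episode


@dataclass
class UnifiedAction:
    """One step in an action trace, grounded in screen coordinates."""
    action_type: ActionType
    target: Optional[Tuple[float, float]] = None  # normalized (x, y) in [0, 1]
    end: Optional[Tuple[float, float]] = None     # drag end point, if any
    text: Optional[str] = None                    # payload for TYPE / HOTKEY
    thought: str = ""                             # System-2 reasoning before acting


# Example: a grounded click preceded by explicit reasoning.
step = UnifiedAction(
    action_type=ActionType.CLICK,
    target=(0.42, 0.17),
    thought="The search box is at the top; click it before typing the query.",
)
```

Standardizing actions this way is what lets large-scale traces from different platforms train a single grounding-and-interaction policy.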
2024.07.28
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks; their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT. (An illustrative sketch of the interleaved sample format appears after this entry.)
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Computer Vision
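The interleaved template is what lets one model cover multi-image, video, 3D, and single-image inputs. Here is an illustrative sketch of what an M4-Instruct-style sample could look like; the field names and the <image> placeholder convention are assumptions rather than the released format.

```python
# Illustrative sketch of an interleaved multimodal training sample, assuming
# a format in the spirit of M4-Instruct; field names are hypothetical.
sample = {
    "id": "m4_instruct_000001",
    "images": ["frame_0.jpg", "frame_1.jpg"],  # paths resolved by the loader
    "conversations": [
        {
            "role": "user",
            # <image> placeholders mark where each image is spliced into the text
            "content": "<image>\n<image>\nWhat changed between these two frames?",
        },
        {
            "role": "assistant",
            "content": "The car in the left lane has moved ahead of the truck.",
        },
    ],
}


def count_image_slots(turns: list[dict]) -> int:
    """Sanity check: placeholders in the text must match the image list."""
    return sum(turn["content"].count("<image>") for turn in turns)


assert count_image_slots(sample["conversations"]) == len(sample["images"])
```

The design choice worth noting is that video frames, 3D views, and image patches need no special handling: they are all just more images spliced into the same token stream.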
2024.03.10
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detailed-captioning performance and surpassing GPT-4o. (A minimal sketch of a DCScore-style decomposition loop appears after this entry.)
Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
Computer Vision
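DCScore's core move is scoring primitive information units rather than whole captions. The sketch below shows that recipe in precision-only form, with rule-based stand-ins for the judge-model calls; decompose and verify are hypothetical placeholders, and the full metric also rewards comprehensiveness (coverage of reference units), which this sketch omits.

```python
# Minimal sketch of a DCScore-style evaluation loop: decompose a caption into
# primitive information units, then verify each unit against the image.
# `decompose` and `verify` stand in for LLM/VLM judge calls.
from typing import Callable, List


def dcscore_like(
    caption: str,
    decompose: Callable[[str], List[str]],  # caption -> primitive units
    verify: Callable[[str], bool],          # unit -> supported by the image?
) -> float:
    """Fraction of primitive units supported by the image (precision-style).

    A unit judged unsupported counts as a hallucination; a longer caption
    only scores higher if its extra units are actually grounded.
    """
    units = decompose(caption)
    if not units:
        return 0.0
    supported = sum(verify(u) for u in units)
    return supported / len(units)


# Toy usage with rule-based stand-ins for the judge models.
ground_truth = {"a red bus", "parked on the street", "two people nearby"}
score = dcscore_like(
    "a red bus parked on the street. a dog sleeps on the roof.",
    decompose=lambda c: [s.strip() for s in c.split(".") if s.strip()],
    verify=lambda u: any(fact in u for fact in ground_truth),
)
print(f"DCScore-like value: {score:.2f}")  # 0.50: one of two units is grounded
```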
View More Papers
Technical Applications
Seed-VLM
Seed-VLM builds a frontier-experience visual assistant for Doubao scenarios. User-preference-aligned post-training ensures highly reliable responses, and Visual Chain-of-Thought (Visual CoT) enables a richer feature experience. A hedged sketch of a Visual CoT inference pattern is shown below.
Vision-Language Model
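For readers unfamiliar with the pattern, Visual CoT simply means the model produces grounded visual reasoning before the final answer. A hedged two-pass sketch follows, where the generate callable, prompt wording, and model interface are all illustrative assumptions rather than Seed-VLM's actual API.

```python
# Hedged sketch of a Visual CoT-style inference pattern: the model first
# emits grounded reasoning about the image, then answers using it.
# The `generate` interface and prompts are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class VisualCoTResult:
    reasoning: str  # intermediate visual reasoning: regions, relations, text
    answer: str     # final user-facing response


def visual_cot(generate, image_path: str, question: str) -> VisualCoTResult:
    """Two-pass prompting: reason about the image, then answer from that reasoning."""
    reasoning = generate(
        image=image_path,
        prompt=f"Describe the regions and relations relevant to: {question}",
    )
    answer = generate(
        image=image_path,
        prompt=f"Reasoning: {reasoning}\nUsing this reasoning, answer: {question}",
    )
    return VisualCoTResult(reasoning=reasoning, answer=answer)


# Toy usage with a stub generator in place of a real VLM call.
stub = lambda image, prompt: f"[model output for: {prompt[:40]}...]"
print(visual_cot(stub, "receipt.jpg", "What is the total amount?").answer)
```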

Open Positions

Multimodal World Model Algorithm Researcher/Expert – Doubao LLM
Beijing / Shanghai / Hangzhou / Shenzhen
Experienced Hire
Apply Now
Multimodal World Model Algorithm Engineer/Expert – Doubao LLM
Beijing / Shanghai / Hangzhou / Shenzhen
Experienced Hire
Apply Now
Vision Foundation Model Algorithm Expert – Doubao LLM (Top Seed)
Beijing / Shanghai / Hangzhou / Shenzhen
Campus Recruitment
Apply Now
Multimodal World Model Algorithm Intern – Doubao LLM
Beijing / Shanghai / Hangzhou / Shenzhen
Internship
Apply Now