
2025-05-13

Seed1.5-VL Technical Report


Abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experience in building Seed1.5-VL across model design, data construction, and training at various stages, in the hope that it can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).

Authors

Seed Multimodal Team

Journal/Conference

arXiv

Copyright © 2026 Bytedance Seed
Contact us: seed.feedback@bytedance.com