2025-05-13

Seed1.5-VL Technical Report

Abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL consists of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we provide a comprehensive review of our experience building Seed1.5-VL across model design, data construction, and training at various stages, in the hope that it inspires further research. Seed1.5-VL is now accessible on Volcano Engine (Model ID: doubao-1-5-thinking-vision-pro-250428).

Authors

Seed Multimodal Team

Copyright © 2026 Bytedance Seed
Contact us: seed.feedback@bytedance.com