Latest Releases
July 24, 2025
Seed LiveInterpret 2.0 Official Release
A true end-to-end simultaneous interpretation model. It supports real-time voice cloning and delivers high-quality, low-latency speech-to-speech interpretation, even in challenging scenarios such as multi-speaker dialogue, disfluent speech, and long-form audio.
July 22, 2025
Seed GR-3
A Generalizable Vision-Language-Action Model for Long-Horizon and Complex Manipulation Tasks
GR-3 is a large-scale vision-language-action (VLA) model. It showcases strong capabilities in generalization to novel objects, environments, and instructions involving abstract concepts.
June 25, 2025
Introduction to Seed1.6 Model Technology
The latest general-purpose Seed1.6 model integrates multimodal capabilities and supports adaptive deep thinking, multimodal understanding, and graphical-interface interaction, while also enabling deep reasoning over a 256K-token long context.

Seed1.5-VL
Vision-Language Multimodal Large Model

UI-TARS
Next-Generation Native GUI Agent Model

BAGEL
Unified Multimodal Model
Selected Papers

Jul 24, 2025
Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, an approximately 70% reduction that drastically enhances practical usability.
Seed Speech Team
Speech
Speech&Audio
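The abstract's "duplex speech-to-speech understanding-generating framework" implies a streaming loop that consumes source audio while emitting translated speech as soon as it is ready. The sketch below illustrates only that shape; the `interpret_chunk` / `flush` model interface and the byte-chunk format are hypothetical placeholders, not Seed-LiveInterpret 2.0's actual API.

```python
# Minimal sketch of a duplex streaming loop: source audio chunks go in,
# translated speech chunks come out incrementally. The model interface
# (interpret_chunk, flush) and chunk format are hypothetical placeholders,
# not Seed-LiveInterpret 2.0's actual API.
from typing import AsyncIterator

async def duplex_interpret(mic_chunks: AsyncIterator[bytes], model) -> AsyncIterator[bytes]:
    """Feed incoming audio to the model and yield translated speech as it is produced."""
    async for chunk in mic_chunks:
        # The model may emit zero, one, or several output chunks per input chunk;
        # emitting partial output early is what keeps simultaneous latency low.
        for out_chunk in await model.interpret_chunk(chunk):
            yield out_chunk
    # Drain whatever speech remains once the source audio ends.
    for out_chunk in await model.flush():
        yield out_chunk
```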

Jun 11, 2025
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Notable advances in diffusion modeling have propelled rapid improvements in video generation, yet current foundational model still confront critical challenges in synergistically balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient pre-training paradigm that enables multiple features or functions such as interleaved multimodal positional encoding, native multi-shot generation capacity, and multi-task modeling; (iii) carefully-designed post-training optimization leveraging fine-grained supervised fine-tuning, video-specific RLHF with multi-dimensional reward mechanisms for considerable performance improvements; (iv) excellent model acceleration achieving 10× inference speedup through multi- stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution only with 41.4 seconds. Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation with superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation, and ultra-fast inference.
Seed Vision Team
Vision
Computer Vision
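A quick back-of-the-envelope reading of the reported figures, assuming a 24 fps output clip (the frame rate is an assumption; only the 41.4 s generation time and the 10× speedup come from the abstract):

```python
# Back-of-the-envelope check of the reported Seedance 1.0 figures.
# Assumption: 24 fps output; the abstract only gives 41.4 s and the 10x speedup.
clip_seconds = 5
fps = 24                                        # assumed frame rate
gen_seconds = 41.4                              # reported generation time at 1080p
speedup = 10                                    # reported inference speedup from distillation

frames = clip_seconds * fps                     # 120 frames
per_frame = gen_seconds / frames                # ~0.35 s of compute per frame
baseline_minutes = gen_seconds * speedup / 60   # ~6.9 min without acceleration
print(frames, round(per_frame, 2), round(baseline_minutes, 1))
```

Read this way, the distillation and system-level optimizations take the same clip from roughly seven minutes of inference to well under a minute.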

Jun 05, 2025
SeedEdit 3.0: Fast and High-Quality Generative Image Editing
We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0, which significantly improves over our previous SeedEdit versions in both edit-instruction following and image-content (e.g., ID/IP) preservation on real image inputs. Beyond the model upgrade that accompanies the new T2I model, this report presents several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real and synthetic image editing, where it achieves the best trade-off across multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT-4o (37.1%), and Gemini 2.0 (30.3%).
Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
Vision
Computer Vision
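The "joint learning pipeline for computing a diffusion loss and reward losses" can be pictured as a weighted sum of objectives. The sketch below is a generic illustration under that reading; the specific reward terms and weights are assumptions, not SeedEdit 3.0's actual recipe.

```python
# Minimal sketch of a joint objective that mixes a diffusion loss with
# reward-based losses, as the abstract describes. The reward terms and
# weights below are illustrative assumptions, not the paper's configuration.
import torch

def joint_edit_loss(diffusion_loss: torch.Tensor,
                    reward_losses: dict[str, torch.Tensor],
                    reward_weights: dict[str, float]) -> torch.Tensor:
    """Combine the denoising objective with one or more reward losses."""
    total = diffusion_loss
    for name, loss in reward_losses.items():
        total = total + reward_weights.get(name, 1.0) * loss
    return total

# Hypothetical usage with placeholder reward terms:
# total = joint_edit_loss(
#     diffusion_loss=noise_pred_mse,
#     reward_losses={"instruction_following": r_follow, "id_preservation": r_id},
#     reward_weights={"instruction_following": 0.5, "id_preservation": 0.5},
# )
```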

May 21, 2025
MMaDA: Multimodal Large Diffusion Language Models
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at:
https://github.com/Gen-Verse/MMaDA
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
LLM
Computer Vision
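UniGRPO is described here only at a high level (a unified policy-gradient RL algorithm with diversified reward modeling). As a rough illustration of the group-relative advantage normalization that GRPO-style methods build on, a small sketch follows; the grouping, rewards, and constants are assumptions, and UniGRPO's actual formulation for diffusion models may differ.

```python
# Rough sketch of group-relative advantage normalization in the spirit of
# GRPO-style policy-gradient methods. This is NOT UniGRPO's actual algorithm;
# the grouping, reward source, and epsilon are illustrative assumptions.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] scalar rewards for sampled responses.
    Returns advantages normalized within each prompt's group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical usage: 4 prompts, 8 sampled responses each.
# adv = group_relative_advantages(torch.randn(4, 8))
```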

May 17, 2025
Model Merging in Pre-training of Large Language Models
Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu
LLM
LLM
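The central operation the abstract describes, merging checkpoints saved along a constant-learning-rate run, reduces in its simplest form to parameter averaging. A minimal sketch is below; the uniform weighting and checkpoint file names are illustrative assumptions, and the paper studies a range of merging strategies and hyperparameters beyond this.

```python
# Minimal sketch: uniform averaging of pre-training checkpoints.
# Assumes each file holds a plain state dict of tensors; the uniform
# weighting and file names are illustrative, not the paper's exact setup.
import torch

def merge_checkpoints(paths):
    """Average the parameters of several checkpoints saved along one run."""
    merged = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    return {k: v / len(paths) for k, v in merged.items()}

# Hypothetical usage: merge the last few checkpoints of a constant-LR run.
# merged_state = merge_checkpoints(["step_90000.pt", "step_95000.pt", "step_100000.pt"])
# model.load_state_dict(merged_state)
```

Weighted or moving-average variants follow the same pattern with per-checkpoint coefficients in place of the uniform 1/N.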

May 13, 2025
Seed1.5-VL Technical Report
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).
Seed Multimodal Team
Multimodal
LLM
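The abstract fixes two architecture numbers: a 532M-parameter vision encoder and an MoE LLM with 20B active parameters per token. The toy config below merely records that shape; every field other than those two figures is an assumed placeholder.

```python
# Illustrative configuration capturing the architecture stated in the abstract.
# Only the 532M vision-encoder size and 20B active LLM parameters come from
# the report; the remaining fields are assumed placeholders.
from dataclasses import dataclass

@dataclass
class VLMConfig:
    vision_encoder_params: float = 532e6   # reported vision encoder size
    llm_active_params: float = 20e9        # reported active parameters per token (MoE)
    llm_total_params: float | None = None  # total expert capacity not stated; left unset
    context_length: int = 32768            # assumed placeholder, not from the report

cfg = VLMConfig()
print(f"{cfg.vision_encoder_params / 1e6:.0f}M vision encoder, "
      f"{cfg.llm_active_params / 1e9:.0f}B active LLM params")
```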