Home Top Seed Seed Edge ·Research News Join Us

EN

中文

Home Top Seed Seed Edge Research News Join Us

Latest Releases

July 24, 2025

Seed LiveInterpret 2.0 Official Release

A true end-to-end simultaneous interpretation model. It supports real-time voice cloning and delivers high-quality, low-latency speech-to-speech interpretation, even in challenging scenarios such as multi-speaker dialogue, disfluent speech, and long-form audio.

July 24, 2025

Seed LiveInterpret 2.0 Official Release

A true end-to-end simultaneous interpretation model. It supports real-time voice cloning and delivers high-quality, low-latency speech-to-speech interpretation, even in challenging scenarios such as multi-speaker dialogue, disfluent speech, and long-form audio.

July 22, 2025

Seed GR-3

A Generalizable Vision-Language-Action Model for Long-Horizon and Complex Manipulation Tasks

GR-3 is a large-scale vision-language-action (VLA) model. It showcases strong capabilities in generalization to novel objects, environments, and instructions involving abstract concepts.

July 22, 2025

Seed GR-3

A Generalizable Vision-Language-Action Model for Long-Horizon and Complex Manipulation Tasks

GR-3 is a large-scale vision-language-action (VLA) model. It showcases strong capabilities in generalization to novel objects, environments, and instructions involving abstract concepts.

June 25, 2025

Introduction to Seed1.6 Model Technology

The latest general-purpose Seed1.6 combines multimodal capabilities and supports adaptive deep thinking, multimodal understanding, and graphical interface interactions, while also enabling deep reasoning with 256K long context.

June 25, 2025

Introduction to Seed1.6 Model Technology

The latest general-purpose Seed1.6 combines multimodal capabilities and supports adaptive deep thinking, multimodal understanding, and graphical interface interactions, while also enabling deep reasoning with 256K long context.

Latest Releases

July 24, 2025

Seed LiveInterpret 2.0 Official Release

A true end-to-end simultaneous interpretation model. It supports real-time voice cloning and delivers high-quality, low-latency speech-to-speech interpretation, even in challenging scenarios such as multi-speaker dialogue, disfluent speech, and long-form audio.

July 24, 2025

Seed LiveInterpret 2.0 Official Release

A true end-to-end simultaneous interpretation model. It supports real-time voice cloning and delivers high-quality, low-latency speech-to-speech interpretation, even in challenging scenarios such as multi-speaker dialogue, disfluent speech, and long-form audio.

July 22, 2025

Seed GR-3

A Generalizable Vision-Language-Action Model for Long-Horizon and Complex Manipulation Tasks

GR-3 is a large-scale vision-language-action (VLA) model. It showcases strong capabilities in generalization to novel objects, environments, and instructions involving abstract concepts.

July 22, 2025

Seed GR-3

A Generalizable Vision-Language-Action Model for Long-Horizon and Complex Manipulation Tasks

GR-3 is a large-scale vision-language-action (VLA) model. It showcases strong capabilities in generalization to novel objects, environments, and instructions involving abstract concepts.

June 25, 2025

Introduction to Seed1.6 Model Technology

The latest general-purpose Seed1.6 combines multimodal capabilities and supports adaptive deep thinking, multimodal understanding, and graphical interface interactions, while also enabling deep reasoning with 256K long context.

June 25, 2025

Introduction to Seed1.6 Model Technology

The latest general-purpose Seed1.6 combines multimodal capabilities and supports adaptive deep thinking, multimodal understanding, and graphical interface interactions, while also enabling deep reasoning with 256K long context.

Seed1.5-VL

Seed1.5-VL

Vision-Language Multimodal Large Model

UI-TARS

UI-TARS

Next-Generation Native GUI Agent Model

BAGEL

BAGEL

Unified Multimodal Model

Selected Papers

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, which is around a near 70% reduction that drastically enhances practical usability.

Seed Speech Team

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Seed Speech Team

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Notable advances in diffusion modeling have propelled rapid improvements in video generation, yet current foundational model still confront critical challenges in synergistically balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient pre-training paradigm that enables multiple features or functions such as interleaved multimodal positional encoding, native multi-shot generation capacity, and multi-task modeling; (iii) carefully-designed post-training optimization leveraging fine-grained supervised fine-tuning, video-specific RLHF with multi-dimensional reward mechanisms for considerable performance improvements; (iv) excellent model acceleration achieving 10× inference speedup through multi- stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution only with 41.4 seconds. Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation with superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation, and ultra-fast inference.

Seed Vision Team

Computer Vision

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Seed Vision Team

Computer Vision

SeedEdit 3.0: Fast and High-Quality Generative Image Editing

We introduce SeedEdit 3.0, in companion with our T2I model Seedream 3.0, which significantly improves over our previous SeedEdit versions in both aspects of edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. Additional to model upgrading with T2I, in this report, we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and meta information is helpfult to connect VLM with diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks, for real/synthetic image editing, where it achieves a best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang

Computer Vision

SeedEdit 3.0: Fast and High-Quality Generative Image Editing

Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang

Computer Vision

MMaDA: Multimodal Large Diffusion Language Models

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang

Computer Vision

MMaDA: Multimodal Large Diffusion Language Models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang

Computer Vision

Model Merging in Pre-training of Large Language Models

Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.

Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu

Model Merging in Pre-training of Large Language Models

Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu

Seed1.5-VL Technical Report

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Seed Multimodal Team

Seed1.5-VL Technical Report

Seed Multimodal Team

View more

Do great things with great people

Join ByteDance Seed

User Agreement Privacy Policy

Follow ByteDance Seed to keep abreast of the latest technological developments, research achievements, and career opportunities.

Copyright © 2025 Bytedance Seed

Do great things with great people

Join ByteDance Seed

User Agreement Privacy Policy

Copyright © 2025 Bytedance Seed