Home · Top Seed · Seed Edge · Research · Team Updates · Join Us

Latest Results

2025.07.24

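Seed LiveInterpret 2.0 Officially Released

A truly end-to-end simultaneous interpretation model. It supports real-time voice cloning and delivers high-quality, low-latency speech-to-speech interpretation even in complex scenarios such as multi-speaker conversations, disfluent speech, and long audio.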

2025.07.22

Seed GR-3

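A generalizable robot manipulation model that supports long-horizon and complex tasks

GR-3 is a large-scale vision-language-action (VLA) model. It generalizes well to novel objects, new environments, and new instructions involving abstract concepts.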

2025.06.25

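Seed1.6 Model Series: Technical Introduction

The latest general-purpose model series, Seed1.6, integrates multimodal capabilities and supports adaptive deep thinking, multimodal understanding, and graphical user interface operation, along with deep reasoning over a 256K long context.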

Seed1.5 (Doubao-1.5-pro)

A new-generation flagship model with across-the-board performance upgrades, leading in knowledge, code, reasoning, and other areas

Seed1.6

Integrates multimodal capabilities; "adaptive thinking" balances model quality against inference performance

Selected Papers

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourse. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of nearly 70% that drastically enhances practical usability.

Seed Speech Team
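The latency claim in the abstract above can be checked with simple arithmetic. The following is a minimal sketch, assuming the rounded figures quoted in the abstract ("nearly 10 seconds" and "3 seconds"); the function name `relative_reduction` is hypothetical and chosen only for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction when a quantity drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# Rounded figures taken from the abstract; the exact measured values are not given here.
baseline_latency_s = 10.0
new_latency_s = 3.0

reduction = relative_reduction(baseline_latency_s, new_latency_s)
print(f"Latency reduction: {reduction:.0%}")
```

Because the baseline is described as "nearly" 10 seconds, the true reduction is slightly under the 70% computed from the rounded inputs, which matches the abstract's wording.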

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Notable advances in diffusion modeling have propelled rapid improvements in video generation, yet current foundation models still confront critical challenges in synergistically balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient pre-training paradigm that enables multiple features or functions such as interleaved multimodal positional encoding, native multi-shot generation capacity, and multi-task modeling; (iii) carefully designed post-training optimization leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for considerable performance improvements; (iv) excellent model acceleration achieving 10× inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds. Compared to state-of-the-art video generation models, Seedance 1.0 stands out for high-quality, fast video generation, with superior spatiotemporal fluidity and structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation, and ultra-fast inference.

Seed Vision Team

Computer Vision

SeedEdit 3.0: Fast and High-Quality Generative Image Editing

We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0, which significantly improves over our previous SeedEdit versions in both edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to upgrading the model with T2I, in this report we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real/synthetic image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang

Computer Vision

MMaDA: Multimodal Large Diffusion Language Models

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang

Computer Vision

Model Merging in Pre-training of Large Language Models

Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.

Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu
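The abstract above does not spell out the merging strategies it studies, so the following is only a minimal sketch of one common form of checkpoint merging: a weighted average of parameters across checkpoints. The `Checkpoint` type, the `merge_checkpoints` function, and the toy parameter names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical representation: parameter name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: Sequence[Checkpoint],
    weights: Optional[Sequence[float]] = None,
) -> Checkpoint:
    """Return the weighted average of several checkpoints, parameter by parameter."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        # Assumption: uniform averaging when no weights are supplied.
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = float(sum(weights))
    if total == 0:
        raise ValueError("weights must not sum to zero")

    merged: Checkpoint = {}
    for name, first in checkpoints[0].items():
        acc = [0.0] * len(first)
        for ckpt, w in zip(checkpoints, weights):
            values = ckpt[name]
            if len(values) != len(first):
                raise ValueError(f"shape mismatch for {name!r}")
            for i, v in enumerate(values):
                acc[i] += (w / total) * v
        merged[name] = acc
    return merged


# Toy checkpoints standing in for snapshots saved during pre-training.
ckpts = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
]
merged = merge_checkpoints(ckpts)
print(merged)
```

In a real training stack the same idea would be applied to framework tensors rather than Python lists; plain lists are used here only to keep the sketch dependency-free.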

Seed1.5-VL Technical Report

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).

Seed Multimodal Team

View more papers

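Work with outstanding people on challenging problems

Join ByteDance Seed

User Agreement · Privacy Policy

Follow ByteDance Seed for the latest technical progress, research results, and hiring information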

Copyright © 2025 Bytedance Seed
