Latest Results
2025.06.11
Seedance 1.0 Officially Released
A video generation model supporting multiple tasks (text-to-video and image-to-video). It accurately understands complex instructions, generates 1080p videos with fluid motion and rich detail, and natively supports coherent multi-shot narratives.
2025.06.06
SeedEdit 3.0 Officially Released
Its image editing reaches industry SOTA levels in usability rate on real images, generation involving complex logic, and preservation of subject identity (ID).
2025.05.20
BAGEL, a Unified Multimodal Model
Supports unified understanding and generation of text, images, and video, with stronger complex reasoning and compositional abilities such as free-form image editing, future frame prediction, 3D manipulation, and world navigation.

Seed1.5 (Doubao-1.5-pro)
New-generation flagship model
Selected Papers

2025.06.05
SeedEdit 3.0: Fast and High-Quality Generative Image Editing
We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0, which significantly improves over previous SeedEdit versions in both edit-instruction following and image content (e.g., ID/IP) preservation on real image inputs. Beyond the model upgrade with T2I, this report presents several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and a meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real/synthetic image editing, where it achieves the best trade-off across multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT-4o (37.1%), and Gemini 2.0 (30.3%).
Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
Vision
Computer Vision
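The SeedEdit 3.0 abstract mentions a joint learning pipeline that computes a diffusion loss together with reward losses, but the report's exact formulation is not given here. Below is a minimal sketch under stated assumptions: a standard noise-prediction diffusion loss combined with a weighted reward term from a hypothetical reward model (`model`, `reward_model`, and `reward_weight` are illustrative names, not from the paper).

```python
import torch
import torch.nn.functional as F

def joint_edit_loss(model, reward_model, x0, cond, reward_weight=0.1):
    """Minimal sketch of a joint diffusion + reward objective.

    Assumptions (not from the SeedEdit 3.0 report): `model` predicts noise
    given a noisy latent, a timestep, and the edit condition; `reward_model`
    scores a one-step reconstruction of the edited image.
    """
    # Standard noise-prediction diffusion loss with a cosine noise schedule.
    t = torch.randint(0, 1000, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).pow(2).view(-1, 1, 1, 1)
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred_noise = model(xt, t, cond)
    diffusion_loss = F.mse_loss(pred_noise, noise)

    # Reward loss on a one-step x0 estimate (higher reward -> lower loss).
    x0_hat = (xt - (1 - alpha_bar).sqrt() * pred_noise) / alpha_bar.sqrt()
    reward_loss = -reward_model(x0_hat, cond).mean()

    return diffusion_loss + reward_weight * reward_loss
```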

2025.05.21
MMaDA: Multimodal Large Diffusion Language Models
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at:
https://github.com/Gen-Verse/MMaDA
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
LLM
Computer Vision
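UniGRPO is only named in the abstract above; as a reference point, the sketch below shows the group-relative advantage computation that GRPO-family policy-gradient methods are built on, with diversified rewards assumed to be already combined per sample (function and variable names are illustrative, not from the MMaDA paper).

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of the group-relative advantage used by GRPO-style RL.

    `rewards` has shape (num_prompts, group_size): several sampled responses
    per prompt, each scored by one or more reward models (already summed).
    Each sample's advantage is its reward normalized within its own group,
    so no learned value function is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, rewards already aggregated
# across reward signals (e.g., answer correctness + image quality).
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.2],
                        [0.9, 0.8, 0.1, 0.3]])
advantages = group_relative_advantages(rewards)
```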

2025.05.20
Emerging Properties in Unified Multimodal Pretraining
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community.
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Multimodal
Computer Vision

2025.05.17
Model Merging in Pre-training of Large Language Models
Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu
LLM
LLM
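The abstract reports that merging checkpoints trained with constant learning rates improves performance; the specific merging strategies are studied in the paper's ablations and are not detailed here. The sketch below shows the simplest variant, a weighted average of several checkpoints' state dicts, with uniform weights as the default (paths and weighting are illustrative).

```python
import torch

def merge_checkpoints(paths, weights=None):
    """Sketch: weighted average of model state dicts from several checkpoints.

    Assumes all checkpoints share the same architecture and parameter names;
    uniform weights reduce this to plain checkpoint averaging.
    """
    if weights is None:
        weights = [1.0 / len(paths)] * len(paths)
    merged = None
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged

# Illustrative usage with hypothetical checkpoint files:
# merged_state = merge_checkpoints(["ckpt_10000.pt", "ckpt_20000.pt", "ckpt_30000.pt"])
# model.load_state_dict(merged_state)
```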

2025.05.13
Seed1.5-VL Technical Report
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
Seed Multimodal Team
Multimodal
LLM
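The abstract notes that the LLM is a Mixture-of-Experts model with 20B active parameters, i.e., only a subset of experts runs per token. Seed1.5-VL's actual MoE design is not given here; the following is a generic top-k MoE layer sketch, included only to illustrate why active parameters are a fraction of the total (all sizes and names are illustrative, not from the report).

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts feed-forward layer (illustrative only).

    Each token is routed to `k` of `num_experts` expert MLPs, so only a
    fraction of the layer's parameters is "active" for any given token.
    """
    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)   # routing probabilities
        topv, topi = gate.topk(self.k, dim=-1)  # pick k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        return out

# Example: 16 tokens through the sketch layer.
y = TopKMoE()(torch.randn(16, 1024))
```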

2025.04.24
Let the Code LLM Edit Itself When You Edit the Code
In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it into the original KV cache runs into the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing Positional Integrity Encoding (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while closely approximating the model performance of full recomputation.
Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He
Infrastructures
NLP
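The abstract describes PIE as removing the stale rotary matrices in the Key cache and reapplying the correct ones. Since rotary rotations compose additively in the angle, undoing the rotation at the old position and applying it at the new position amounts to a single rotation by the position difference; the sketch below illustrates that idea on a cached key tensor (a generic interleaved RoPE implementation is assumed; this is not the paper's code).

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a rotary position rotation to `x` at the given (possibly signed) positions.

    x: (seq, dim) key vectors with even `dim`; positions: (seq,) offsets.
    """
    dim = x.size(-1)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = positions[:, None].float() * inv_freq             # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def realign_key_cache(key_cache: torch.Tensor, old_pos: torch.Tensor, new_pos: torch.Tensor) -> torch.Tensor:
    """Sketch of the PIE idea: undo the stale rotation and apply the correct one.

    Because RoPE rotations compose additively, this is a single rotation by
    (new_pos - old_pos) applied to the cached keys, rather than re-encoding
    the whole sequence.
    """
    return rope_rotate(key_cache, new_pos - old_pos)

# Example: after inserting 3 tokens at position 10, keys originally cached
# for positions 10..19 now belong at positions 13..22.
keys = torch.randn(10, 64)
old = torch.arange(10, 20)
new = old + 3
keys_fixed = realign_key_cache(keys, old, new)
```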