Vision
The Seed Vision team focuses on foundational models for visual generation and multimodal generative models, carrying out leading research and application development to solve fundamental computer vision challenges in GenAI.
Research topics
Foundational models for visual generation
Researching and developing foundational models for visual generation (images and videos), ensuring high interactivity and controllability, understanding patterns in video, and exploring a range of vision-oriented tasks built on generative foundation models. A minimal training-step sketch follows the tags below.
Multimodal
Diffusion Model
Auto Regression Model
Foundation
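To make the topic concrete, here is a minimal sketch of one training step for a generative visual foundation model, assuming a rectified-flow (diffusion-style) objective. The `DiTBackbone` stand-in, the dimensions, and the plain timestep concatenation are hypothetical illustrations, not the team's actual architecture or recipe.

```python
# Minimal sketch of one flow-matching training step for a visual generation
# model. Hypothetical: a real model would operate on video/image latents with
# timestep embeddings and text conditioning.
import torch
import torch.nn as nn

class DiTBackbone(nn.Module):
    """Stand-in for a diffusion-transformer backbone (hypothetical)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the timestep by simple concatenation (a placeholder
        # for proper timestep and text conditioning).
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def training_step(model: DiTBackbone, x0: torch.Tensor) -> torch.Tensor:
    """One rectified-flow step: regress the velocity from data to noise."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0])                       # t ~ U(0, 1)
    x_t = (1 - t[:, None]) * x0 + t[:, None] * noise  # linear interpolation path
    v_target = noise - x0                             # constant velocity target
    v_pred = model(x_t, t)
    return nn.functional.mse_loss(v_pred, v_target)

model = DiTBackbone()
loss = training_step(model, torch.randn(8, 64))  # 8 latent vectors of dim 64
loss.backward()
```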
Multimodal generative models
Integrating multiple modalities into a unified generative model, jointly modeling generation and understanding, supporting interleaved and simultaneous generation across modalities (such as digital avatars), and enhancing the contextual capabilities and consistency of generative models. A sketch of interleaved sequence construction follows the tags below.
Multimodal
Diffusion Model
Auto Regression Model
Foundation
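As referenced above, one way to support interleaved generation is to place text and image tokens in a single autoregressive stream separated by modality sentinels, so the model can emit either modality at any point. The sentinel tokens, vocabulary offsets, and token ids below are assumptions for illustration, not an actual tokenizer spec.

```python
# Hypothetical sketch: flatten interleaved multimodal segments into one
# training sequence for an autoregressive generator.
from typing import List, Tuple

BOI, EOI = "<image>", "</image>"  # assumed sentinel tokens, not a real spec

def interleave(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Flatten (modality, tokens) segments into one token-id sequence."""
    vocab = {BOI: 100000, EOI: 100001}  # sentinels placed past the text ids
    seq: List[int] = []
    for modality, tokens in segments:
        if modality == "image":
            seq.append(vocab[BOI])
            seq.extend(tokens)   # e.g. VQ codebook indices for image patches
            seq.append(vocab[EOI])
        else:
            seq.extend(tokens)   # ordinary text token ids
    return seq

# A caption, then an image, then a follow-up instruction, in one stream.
seq = interleave([
    ("text",  [17, 52, 908]),
    ("image", [3, 44, 6021, 77]),
    ("text",  [12, 33]),
])
print(seq)
```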
3D/4D generative models
Building 3D/4D foundational generative models, learning visual world knowledge from video and 3D data, understanding the physical world's 3D space and physical laws, developing spatial intelligence and world models, and exploring generative-model-based physics and rendering engines.
3D
4D
World Model
Multimodal model design and optimization
Designing and optimizing multimodal network architectures, optimizing diffusion models, carrying out efficient large-scale distributed training and inference, and advancing model acceleration through techniques such as distillation and quantization. A distillation sketch follows the tags below.
Multimodal
Optimization
Distillation
Quantization
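A minimal sketch of the distillation idea behind model acceleration, assuming simple output matching between a frozen teacher and a smaller student. The toy feed-forward networks are placeholders; a real pipeline would distill a diffusion sampler's steps rather than a single forward map.

```python
# Hedged sketch of output distillation: a small student matches a frozen
# teacher's predictions under an MSE objective.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(64, 512), nn.SiLU(), nn.Linear(512, 64)).eval()
student = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 64))

def distill_step(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():  # the teacher is frozen; only the student trains
        target = teacher(x)
    return nn.functional.mse_loss(student(x), target)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
loss = distill_step(torch.randn(32, 64))
loss.backward()
opt.step()
```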

Selected Papers

Jun 11, 2025
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Notable advances in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still confront critical challenges in synergistically balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient pre-training paradigm that enables multiple features such as interleaved multimodal positional encoding, native multi-shot generation capacity, and multi-task modeling; (iii) carefully designed post-training optimization leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for considerable performance improvements; (iv) excellent model acceleration achieving a 10× inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds. Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation: superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation, and ultra-fast inference.
Seed Vision Team
Computer Vision
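The abstract's "multi-dimensional reward mechanisms" suggest several per-dimension scores being combined into a single training signal; the sketch below shows one plausible aggregation under that assumption. The dimension names and weights are hypothetical, not Seedance 1.0's actual reward model.

```python
# Hedged sketch: combine per-dimension reward scores into one scalar per video.
import torch

def aggregate_reward(scores: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-dimension reward scores for a batch of videos."""
    return sum(weights[k] * scores[k] for k in weights)

scores = {  # one score per video in a batch of 4 (hypothetical dimensions)
    "prompt_following":   torch.tensor([0.9, 0.4, 0.7, 0.2]),
    "motion_plausibility": torch.tensor([0.8, 0.6, 0.3, 0.5]),
    "visual_quality":     torch.tensor([0.7, 0.9, 0.6, 0.4]),
}
weights = {"prompt_following": 0.4, "motion_plausibility": 0.3, "visual_quality": 0.3}
r = aggregate_reward(scores, weights)  # tensor of shape (4,), one reward per video
```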
Jun 05, 2025
SeedEdit 3.0: Fast and High-Quality Generative Image Editing
We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0, which significantly improves over previous SeedEdit versions in both edit-instruction following and image content (e.g., ID/IP) preservation on real image inputs. Beyond the model upgrade with T2I, this report presents several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and a meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline that combines a diffusion loss with reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real/synthetic image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT-4o (37.1%), and Gemini 2.0 (30.3%).
Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
Computer Vision
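The "joint learning pipeline that combines a diffusion loss with reward losses" could, in its simplest reading, be a weighted sum of the two objectives optimized in one backward pass. Below is a hedged sketch under that assumption; the `reward_models` interface, the toy networks, and the weighting are illustrative only, not SeedEdit 3.0's actual losses.

```python
# Hedged sketch: add differentiable reward terms to a denoising loss.
import torch
import torch.nn as nn

def joint_loss(diffusion_loss: torch.Tensor,
               decoded_image: torch.Tensor,
               reward_models: list,
               reward_weight: float = 0.1) -> torch.Tensor:
    """Combine the diffusion loss with reward losses in one objective."""
    # Each reward model scores the decoded edit; higher is better, so the
    # negative mean reward is minimised alongside the diffusion loss.
    reward_loss = -torch.stack(
        [rm(decoded_image).mean() for rm in reward_models]
    ).mean()
    return diffusion_loss + reward_weight * reward_loss

# Toy usage with a linear "reward model" over flattened pixels (hypothetical).
rm = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))
img = torch.rand(2, 3, 8, 8, requires_grad=True)
loss = joint_loss(torch.tensor(0.5), img, [rm])
loss.backward()
```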
Apr 15, 2025
Seedream 3.0 Technical Report
We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT and a VLM-based reward model with scaling, thereby achieving outputs that align well with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm: by employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular text rendering of complicated Chinese characters, which is important for professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
Seed Vision Team
Computer Vision
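"Resolution-aware timestep sampling" plausibly shifts sampled timesteps toward the noisy end as resolution grows, since larger images retain more signal at a given noise level. The sketch below uses a shift rule of the kind found in some rectified-flow models; the exact formula and the token-count heuristic are assumptions, not Seedream 3.0's actual rule.

```python
# Hedged sketch: sample t in (0, 1) with a resolution-dependent shift.
import torch

def sample_timesteps(batch: int, num_tokens: int, base_tokens: int = 1024) -> torch.Tensor:
    """Shift sampled timesteps toward 1 (pure noise) for larger latent grids."""
    shift = (num_tokens / base_tokens) ** 0.5  # stronger shift for more tokens
    t = torch.rand(batch)
    # For shift > 1 this monotone map pushes mass toward the noisy end,
    # while keeping t = 0 and t = 1 fixed.
    return shift * t / (1.0 + (shift - 1.0) * t)

t_low = sample_timesteps(4, num_tokens=1024)   # base resolution: uniform-ish t
t_high = sample_timesteps(4, num_tokens=4096)  # larger grid: t shifted upward
```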
Technical capability demonstrations
Seedance
A multi-task video generation model for both text-to-video and image-to-video. It precisely follows complex prompts to create fluid, detailed 1080p videos, and natively supports coherent multi-shot storytelling.
SeedEdit
A universal image editing model that enables diverse editing operations on images through simple natural language input. These operations include photo retouching, clothing replacement, beautification, style transformation, as well as adding or removing elements in specified regions.
Seedream
A high-resolution, bilingual (Chinese-English) foundational text-to-image model. It delivers strong overall performance, particularly in image structure accuracy, quantity consistency, multi-object attribute relationships, small text generation and layout, visual aesthetics, and realism.

Featured Jobs

Research Scientist, Multimodal Foundation Model
Singapore
Experienced Hiring
Research Scientist - Foundation Model, Video Generation
San Jose
Experienced Hiring
Research Engineer - Foundation Model AI Platform
San Jose
Experienced Hiring
Research Scientist Graduate (Foundation Model, Video Generation) - 2025 Start (PhD)
San Jose
Campus Recruitment
Student Researcher (Seed - Foundation Model, Video Generation) - 2025 Start (PhD)
San Jose
Internship
Student Researcher (Seed - Foundation Model AI Platform) - 2025 Start (PhD)
San Jose
Internship