Introducing Seed Full-Duplex Speech LLM: Attentive Listening, Robust Interference Suppression, Enabling More Natural Interaction

Date

2026-04-09

Category

Models

Today, we officially introduce Seeduplex, a native full-duplex speech LLM. Compared with the previous-generation Doubao end-to-end speech model, which used a half-duplex paradigm, Seeduplex is built on an entirely new "listen while speaking" framework, delivering a notable leap in the naturalness and fluency of the interaction experience.

If an end-to-end architecture, by unifying the "listening" and "speaking" modules, gives AI the potential for more natural expression, then full-duplex technology enables AI to "listen" and "speak" simultaneously, allowing for more natural conversational pacing and more precise interference suppression. The result is no longer a simple back-and-forth of questions and answers, but a smooth, well-paced conversation in which the AI responds precisely to users' voice requests without being disrupted by ambient noise or other human voices.

Specifically, through innovations in model architecture and training optimization, and by overcoming engineering challenges such as latency spikes and stability issues under high concurrency, Seeduplex has achieved pioneering full-duplex real-time voice interaction performance. While maintaining the upper bound of model intelligence and ultra-low latency, it has delivered two key breakthroughs:

  • High-precision interference suppression: The model possesses a continuous "listening" capability, enabling it to better understand the user's acoustic environment and accurately ignore background noise and irrelevant conversations. In complex scenarios, compared to half-duplex models, its false response rate and false interruption rate have been reduced by half.
  • Adaptive endpoint detection: The model jointly leverages speech and semantic features to holistically determine user intent, enabling more natural conversational flow. When users hesitate or pause to think, the model listens patiently; once the user finishes speaking, it responds swiftly. Compared to half-duplex models, the rate of premature responses has decreased by 40%.

Currently, Seeduplex has been fully rolled out on the Doubao App. This means full-duplex technology has officially moved beyond the lab, pioneering large-scale deployment in the industry and bringing a continuous, high-quality, real-time voice interaction experience to hundreds of millions of users.

Multi-dimensional evaluations show that Seeduplex notably outperforms traditional half-duplex solutions and the voice-call features of mainstream apps in both conversation fluency and rhythmic pacing. In a human-likeness test focused on endpoint detection accuracy, it achieved an 8% improvement over the half-duplex solution, enabling a more natural and well-balanced conversation.

Project Page: https://seed.bytedance.com/seeduplex


A Comprehensive Upgrade in Voice Interaction Experience

More Precise, More Natural Conversational Flow

Natural human-to-human conversation is inherently a "full-duplex" exchange of listening and speaking simultaneously, filled with pauses, thinking, hesitation, background noise interference, and overlapping voices. A seamless voice conversation system must be capable of handling this kind of high-freedom, unstructured audio stream — hearing clearly through the noise while knowing when to wait patiently as you gather your thoughts.

In the past, traditional half-duplex systems typically relied on cascaded modular designs: using standalone VAD (Voice Activity Detection) for mechanical audio segmentation, or employing conventional algorithms for front-end noise reduction. Since the decision-making was limited to isolated acoustic features or localized textual semantic features, these systems were easily "derailed" in complex environments or triggered premature responses during user pauses.
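For intuition, here is a minimal sketch of the kind of energy-threshold VAD such cascaded systems rely on (illustrative only, not the Doubao pipeline; all names and thresholds are invented). Because it sees only frame energy, a thinking pause is indistinguishable from the end of an utterance:

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def naive_vad_endpoint(frames, energy_thresh=0.01, max_silence_frames=30):
    """Energy-threshold VAD: declares end-of-turn after a fixed run of
    low-energy frames. It sees only acoustics, so a thinking pause looks
    identical to a finished utterance."""
    silence_run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) < energy_thresh:
            silence_run += 1
            if silence_run >= max_silence_frames:
                return i  # endpoint declared, mid-thought or not
        else:
            silence_run = 0
    return None  # no endpoint found

# 50 frames of speech, a 40-frame thinking pause, then the user continues.
speech = [[0.3] * 160] * 50   # energy 0.09, above threshold
pause  = [[0.0] * 160] * 40   # a pause to think, not the end of the turn
more   = [[0.3] * 160] * 30   # user resumes speaking
frames = speech + pause + more
print(naive_vad_endpoint(frames))  # fires during the pause, at frame 79
```

The segmenter "derails" exactly as described: it cuts the turn at frame 79, in the middle of the user's pause, because no semantic signal tells it the thought is unfinished.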

Seeduplex is built on a proprietary large language model (LLM) foundation and pioneers a real-time full-duplex voice interaction framework with speech data pre-training, enabling native joint speech-semantic modeling capabilities. It holistically understands speech and semantic information within audio, dynamically orchestrates conversational rhythm, and achieves a notable leap over traditional systems in interference suppression and conversational pacing.

1. Precision interference suppression: powerful "acoustic focus" amidst the noise

Complex acoustic environments have always been a challenge for voice interaction. Background noise and human voice interference often "contaminate" user speech input, causing delayed system responses, interrupted replies, or false triggers. In the past, users often had to raise their voices or find a quiet corner to complete a reliable interaction.

The Seeduplex model continuously receives and understands user-side audio, perceiving the user's overall acoustic environment, and thereby precisely determining which sounds are genuine interactions with the model and which are interference. This enhanced interference suppression has significantly reduced Seeduplex's false response rate and false interruption rate.

  • Filtering interference and precisely identifying user voice

In a car with frequent broadcasts and mixed spoken navigation instructions, the model can reliably filter out background interference, precisely identify the primary user's voice, and respond quickly to their needs. 

  • Understanding intent, ignoring non-interactive sounds

Whether a user runs into a friend saying goodbye at a coffee shop or answers the door for a delivery while at home, when the conversation with the model is interrupted by other people or side chatter, the system can semantically identify which sounds genuinely carry interaction intent toward the model, avoiding erroneous responses and keeping the main conversation natural and coherent.

Even in overlapping multi-person conversation scenarios, it can accurately identify the dialogue target, distinguishing commands from casual chats among others.

  • Perceiving the environment, proactively linking contexts

The model can even parse ambient sounds and incorporate them into the reasoning context. For example, Seeduplex can understand content about Hangzhou being played in the background audio, and combine it with the user's plan to visit Hangzhou, proactively linking environmental information with the conversation to provide thoughtful responses. 

2. Adaptive endpoint detection: dynamic pacing and precise control

The core of truly natural interaction lies in accurately determining when the user is thinking and when they have finished speaking. By deeply integrating speech and semantic understanding, Seeduplex achieves greater flexibility in controlling dialogue rhythm.
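As a rough illustration of joint speech-semantic endpointing, here is a toy decision rule (every name and threshold below is invented for illustration, not a Seeduplex internal): the more semantically complete the utterance looks, the less trailing silence is required before replying.

```python
from dataclasses import dataclass

@dataclass
class TurnState:
    silence_ms: float        # trailing silence, measured acoustically
    semantic_complete: float # model's estimate that the utterance is finished, in [0, 1]

def should_respond(state: TurnState,
                   base_wait_ms: float = 600.0,
                   min_wait_ms: float = 150.0) -> bool:
    """Adaptive endpointing sketch: interpolate the required wait between
    min_wait_ms (clearly finished) and base_wait_ms (clearly mid-thought).
    A trailing 'um, and also...' keeps semantic_complete low, so the model
    keeps listening through long pauses."""
    required_wait = min_wait_ms + (1.0 - state.semantic_complete) * (base_wait_ms - min_wait_ms)
    return state.silence_ms >= required_wait

# Complete question + short pause -> respond quickly.
print(should_respond(TurnState(silence_ms=200, semantic_complete=0.95)))  # True
# Hesitation mid-thought -> keep listening even after a longer pause.
print(should_respond(TurnState(silence_ms=500, semantic_complete=0.2)))   # False
```

This captures both behaviors described above: patience during hesitation, and a fast response once the utterance is semantically complete.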

  • Patient listening, space for reflection

When expressing uncertain thoughts, users can think and self-correct on the fly, without the pressure of having the entire thought fully formed before speaking. The model actively listens throughout the interaction. Even when faced with complex expressions where the user repeatedly adjusts or even contradicts their previous logic, it accurately captures the user's true intent.

In high-pressure scenarios like a simulated English interview, the model can "understand" that stumbling is merely a pause for thought rather than the end of the dialogue. It patiently listens and waits for you to finish speaking before providing feedback, making the practice experience highly realistic.

  • Ultra-low latency, instant responses

While it knows when to wait, Seeduplex also handles rapid back-and-forth interactions with fast, decisive responses. In rapid-fire Q&A scenarios, it responds almost instantly once the user finishes speaking, cutting latency by around 250ms versus half-duplex systems and ensuring fast responses when needed.

In more demanding scenarios, such as poetry word games that test both reaction speed and a profound knowledge of poetry, the model similarly delivers fluent, seamless responses.

  • Keen perception, instant response to interruptions 

Beyond patient listening and instant responses, Seeduplex also swiftly handles user interruptions. For example, when a user suddenly says, "Wait a moment, let me take notes," the model keenly captures the intent to interrupt. It instantly and smoothly stops speaking, switching into a listening state to wait for the user to continue.


Implementing Full-duplex Voice Technology

From Demo to Large-scale Deployment

Enabling full-duplex capabilities places greater demands on model architecture, algorithm design, and engineering pipelines. Seeduplex adopts a native end-to-end modeling approach, equipping the system with streaming perception: it extracts features from the input audio signal and processes them in real time via the foundation model.

In terms of interaction logic, the model integrates acoustic features with dialogue context to infer the current state and determine whether to start replying, continue listening, or respond to user interruptions.
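Conceptually, that interaction logic can be sketched as a tiny turn-state controller. This is a deliberate simplification: in Seeduplex the decision is made natively by the model from acoustic features and dialogue context, not by an external hand-written state machine.

```python
from enum import Enum, auto

class DuplexState(Enum):
    LISTENING = auto()  # audio output idle, input being decoded
    SPEAKING = auto()   # audio output active, input STILL being decoded

def step(state: DuplexState,
         user_speaking: bool,
         endpoint_detected: bool,
         barge_in_intent: bool,
         reply_done: bool) -> DuplexState:
    """One control tick of a full-duplex loop: start replying, keep
    listening, or yield the floor on a genuine interruption."""
    if state is DuplexState.LISTENING:
        # Reply only once the (speech + semantic) endpointer fires.
        if endpoint_detected and not user_speaking:
            return DuplexState.SPEAKING
        return DuplexState.LISTENING
    # SPEAKING: input keeps flowing, so interruptions are caught immediately.
    if barge_in_intent:               # e.g. "wait a moment, let me take notes"
        return DuplexState.LISTENING  # stop TTS, yield the floor
    if reply_done:
        return DuplexState.LISTENING
    return DuplexState.SPEAKING
```

The key full-duplex property is that the transition out of `SPEAKING` is driven by the still-running input stream, which is what lets the model distinguish a genuine barge-in from background chatter it should talk through.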

To support the full rollout of the model on the Doubao app, the team carried out extensive optimizations across model architecture design, algorithm development, and engineering performance and stability:

  • Model architecture design: Developed an architecture better aligned with the native characteristics of real-time voice interaction, enabling unified modeling of speech and semantics as well as rhythm control directly from data, significantly improving interaction naturalness. 
  • Algorithms and training: Leveraged speech data for pretraining, and introduced a multi-capability, multi-task post-training framework. This enables joint optimization of dialogue capability, ultra-low latency, dialogue pacing, robust interference suppression, and directed understanding. As a result, the model delivers stable, efficient, and natural interaction performance.
  • Inference performance: Optimized performance through techniques such as speculative decoding and quantization, achieving an effective balance between cost and latency. 
  • Service stability: Addressed key issues such as audio input and playback stuttering, ensuring stable and continuous operation under high-traffic conditions.
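To illustrate the speculative-decoding idea mentioned in the inference bullet, here is a minimal greedy-verification sketch with toy deterministic "models" (real speculative sampling accepts or rejects draft tokens probabilistically, and the verification is one batched forward pass of the target model):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Greedy speculative-decoding sketch: a cheap draft model proposes k
    tokens autoregressively; the target model verifies them, accepting the
    longest agreeing prefix plus one corrected token."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # target's correction ends the step
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy deterministic "models": the draft agrees with the target except at
# context length 6, so three tokens are committed in a single step.
def target_next(ctx): return len(ctx) % 10
def draft_next(ctx): return len(ctx) % 10 if len(ctx) != 6 else 0

print(speculative_step(draft_next, target_next, [0, 1, 2, 3]))  # [4, 5, 6]
```

When the draft agrees with the target, several tokens are committed per target-model pass instead of one, which is how this technique trades a small amount of extra compute for lower generation latency.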

Ultimately, Seeduplex has broken through core technical bottlenecks, including data construction, ultra-low latency, and the joint optimization of model performance. Large-scale A/B testing demonstrated the value of the full-duplex interaction mode. Compared with the previous half-duplex model on the Doubao App, Seeduplex achieved gains across core metrics such as user call duration and retention. Overall call satisfaction increased by 8.34 percentage points, and user feedback shows a clear decline in mentions of issues such as premature responses, slow responses, and false interruptions.


Seeduplex Evaluation Results

Significantly Outperforming Half-Duplex Models across Multiple Dimensions

We conducted both objective and subjective evaluations on Seeduplex. The results show that Seeduplex significantly outperforms half-duplex models in interruption handling and endpoint detection, leading the industry across multiple key metrics.

Compared to the half-duplex dialogue framework previously used in the Doubao App, Seeduplex further improves the overall interaction experience, with its Endpoint MOS increasing by 8% and Dialogue Fluency MOS by 12%.

Specifically, while reducing endpoint latency by approximately 250ms, Seeduplex reduces the AI interruption rate in complex scenarios by 40%. When handling user interruptions, it reduces response latency by about 300ms while improving response accuracy. Furthermore, in scenarios with complex acoustic interference, it reduces both false response and false interruption rates by half.

Additionally, when benchmarked against the original half-duplex model and the voice call features of mainstream industry apps, Seeduplex demonstrates significant advantages in endpoint detection and interruption response, and notably improves user-rated dialogue fluency, particularly in terms of interaction pacing.

Subjective evaluation results on dialogue rhythm compared with major industry counterparts (the compared products are denoted by letters).

We also conducted real human conversation tests to benchmark the current level of human-machine dialogue against human-human interaction (average human baseline). Seeduplex significantly improves endpointing performance by 8% compared to the half-duplex solution. When responding to interruptions, real human dialogue experiences occasional lag, whereas Seeduplex delivers more stable performance, slightly better than the average human-to-human baseline. However, a considerable gap remains in overall dialogue fluency compared to real human dialogue, requiring further improvement.

The comparison with human-human conversations shows that Seeduplex has further narrowed the gap with real human communication.


Summary and Outlook

The launch of Seeduplex marks a critical step in the evolution of voice interaction, shifting from a "turn-based" format to real-time, natural interaction. Its native full-duplex framework not only enhances the model's interference suppression capabilities and pacing control, but more importantly, it provides a foundation for the comprehensive integration of perception, reasoning, and execution.

Moving forward, we will continue to drive breakthroughs in the following areas:

  • Further improve the model's audio understanding through deep optimization for complex interaction scenarios, such as multi-party conversations and smart devices. 
  • Continue to improve dialogue pacing diversity and control through data scaling and algorithm optimization. 
  • Build on real-time listening and speaking, introducing more proactive capabilities, such as providing backchannel feedback while listening and initiating interactions based on acoustic context and dialogue flow. 
  • Advance multimodal integration, adding visual input alongside existing speech and text to enable coordinated "listen, see, and speak" interactions. 
  • Unify perception, reasoning, and response, further exploring capabilities such as "listen while thinking" and "listen while searching" to enable deeper reasoning and execution, and further enhance conversational fluency.

Starting with full-duplex technology, we expect AI to continuously evolve, truly achieving a seamless synergy of listening, seeing, reasoning, speaking, and acting within a closed loop of perception, interaction, and execution.