# Fullduplex — full corpus
> an observatory for speech-to-speech, full-duplex & audio foundation models.
Generated at 2026-04-26T17:18:44.926Z. Canonical hub: https://fullduplex.ai/ · Article index: https://fullduplex.ai/blog · Machine index: https://fullduplex.ai/llms.txt
This file concatenates every published article (Markdown source, with frontmatter and a trailing attribution block) so that LLMs can ingest the entire publication in a single request.
---
# 01 — Speech-to-speech AI, a primer
_Canonical: https://fullduplex.ai/blog/sts-primer · Markdown: https://fullduplex.ai/blog/sts-primer/md_
---
title: "Speech-to-speech AI, a primer"
description: "What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions."
article_number: "01"
slug: sts-primer
published_at: 2026-04-02
reading_minutes: 11
tags: ["STS", "primer", "foundation-models"]
canonical_url: https://fullduplex.ai/blog/sts-primer
markdown_url: https://fullduplex.ai/blog/sts-primer/md
series: "The STS Series"
series_position: 1
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Speech-to-Speech AI: A Primer
*What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions.*
---
## 1. The telephone moment
In a natural conversation between two humans, the gap between one person finishing and the other starting averages about 200 milliseconds. That is about as long as a blink. It is also one of the most stable numbers in human behavior, measured the same way across [ten very different languages from Japanese to Yélî Dnye](https://www.pnas.org/doi/10.1073/pnas.0903616106). Until 2024, voice assistants needed about a full second to do the same thing. That difference is the difference between a conversation and a transaction, and it is why the voice AI demos of the last eighteen months feel qualitatively different from anything before.
Here is a simple way to hold the shift in your head. Old voice assistants worked like a walkie-talkie. One side presses the button, speaks a complete thought, releases, and waits. The other side does the same. Interruptions break it. Overlaps break it. Listening and speaking are separate modes and only one happens at a time. The new systems work like a telephone. Two people, two open channels, both able to listen and speak at once, able to interrupt and be interrupted, able to murmur *mhm* while the other person is still talking.
This is a primer on what changed, what the new models are actually doing, and why the terms you are about to see, speech-to-speech, full-duplex, audio foundation model, are worth distinguishing carefully.
## 2. Four words you will see everywhere
Four phrases do most of the work in this field, and they overlap in ways that quietly trip people up.
A **speech-to-speech (STS) model** is a model that takes audio in and emits audio out, without converting to text as an intermediate step. The audio is the input, the audio is the output, the model itself does the thinking in a representation that lives closer to sound than to written language.
**Full-duplex** describes how the conversation flows. A full-duplex system can listen and speak at the same time, the way a telephone can. A half-duplex system has to finish one before starting the other, the way a walkie-talkie does. Full-duplex is a property of the interaction pattern, not of the model architecture, though certain architectures make it much easier.
An **audio foundation model** is a big pretrained model that understands and generates audio. *Foundation* is a borrowed word from the text world, where it means the model was pretrained on a very large, broad corpus and can be adapted to many tasks. An audio foundation model does the same thing but with waveforms as its native material.
A **speech language model** (or SpeechLM) is a large model that treats speech the way GPT treats text: as a sequence of discrete tokens, predicted one after another. SpeechLMs are usually built on top of a neural audio codec that converts waveforms into tokens, which we will come to in a moment.
These terms overlap but are not interchangeable. Moshi, the open-source system Kyutai released in late 2024, is all four at once: a speech-to-speech model, full-duplex, a foundation model for audio, and a speech language model. VALL-E, an earlier Microsoft system, is a SpeechLM but only for text-to-speech, not STS. A traditional cascade of ASR plus LLM plus TTS is speech-in and speech-out at the system level, but there is no STS model at its core, and it is usually half-duplex in practice.
Before looking at the full landscape, it helps to separate three different kinds of voice AI product that sometimes get lumped together. A *speech-to-text* service turns audio into a transcript. A *text-to-speech* service turns text into audio. A *conversational AI* system does both, in a loop, and has to decide what to say. The first two are components. The third is the system you actually talk to.
That distinction matters because the same brand can appear in more than one layer. ElevenLabs sells a TTS service, an STT service, and a conversational AI product built on its own components. VAPI and Retell do not train speech models at all. They orchestrate Deepgram plus an LLM plus ElevenLabs into a voice agent. Moshi and OpenAI's Realtime API sit in a different place on the map. They are the model itself, not a pipeline of third-party components.
## 3. How audio becomes tokens *(optional)*
A language model works on discrete tokens, not on raw audio. Before any of the approaches above can work on speech, there has to be a way to turn a waveform into a sequence of discrete units and back again, without losing too much of what made the audio sound human.
That job falls to a **neural audio codec**. Think of it as MP3 encoding with one extra trick. Like MP3, it compresses a waveform into a much smaller representation. Unlike MP3, the compressed representation is a sequence of integers that a language model can read and write directly.
The number that matters most is the *frame rate*. Kyutai's Mimi codec, released with Moshi, emits tokens at 12.5 Hz, which is close to the rate at which word-like text tokens arrive in normal speech. That alignment is what lets audio sit side by side with text inside one model without overwhelming it. If that single detail is all you take from this section, you have the point.
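If the frame-rate point is easier to hold as arithmetic, here is a minimal sketch. The 12.5 Hz figure is Mimi's; the 50 Hz comparison and the eight-codebook depth are illustrative assumptions, not the spec of any particular codec.

```python
# Back-of-envelope: why the codec frame rate is the number that matters.
# 12.5 Hz is Mimi's rate; the 50 Hz comparison and the 8-codebook depth
# are illustrative assumptions, not claims about a specific codec.

def frames_per_minute(frame_rate_hz: float) -> int:
    """Time steps the model has to predict for one minute of audio."""
    return int(frame_rate_hz * 60)

def tokens_per_minute(frame_rate_hz: float, codebooks_per_frame: int) -> int:
    """Total discrete tokens for one minute of audio."""
    return frames_per_minute(frame_rate_hz) * codebooks_per_frame

print(frames_per_minute(12.5))      # 750 steps per minute -- Mimi's rate
print(frames_per_minute(50.0))      # 3000 steps per minute -- a 50 Hz codec
print(tokens_per_minute(12.5, 8))   # 6000 tokens per minute at 8 codebooks

# A minute of conversational English is on the order of 150 words, so a
# 12.5 Hz codec keeps the audio stream within shouting distance of the
# text stream; a 50 Hz codec does not.
```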
*Aside for technically curious readers: the trick inside most modern codecs is called **residual vector quantization** (RVQ): a stack of small dictionaries, where each layer encodes what the previous layer missed. Five layers with a vocabulary of 320 entries each can describe more acoustic variation than a single flat vocabulary of a billion. If that is interesting, the SoundStream and Moshi papers walk through it. If not, skip ahead.*
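For readers who want the mechanism in code rather than prose, the toy below runs RVQ with random codebooks. It is not how a trained codec builds its dictionaries; it only shows the bookkeeping the aside describes.

```python
import numpy as np

# Toy residual vector quantization (RVQ). Real codecs learn the codebooks
# inside an encoder/decoder so that each layer genuinely shrinks the error;
# here they are random, so only the mechanics are realistic, not the
# reconstruction quality.

rng = np.random.default_rng(0)
dim, n_layers, codebook_size = 8, 5, 320
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_layers)]

def rvq_encode(x):
    """One code index per layer; each layer quantizes what the last one missed."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]   # pass the leftover to the next layer
    return codes

def rvq_decode(codes):
    """Reconstruction is just the sum of the chosen codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=dim)                # stand-in for one frame's latent vector
codes = rvq_encode(x)
print(codes)                            # 5 integers, each in [0, 320)
print(round(float(np.linalg.norm(x - rvq_decode(codes))), 3))
# 5 layers x 320 entries = 320**5 possible reconstructions from 5 small tables.
```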
## 4. How we got here, in four years
The new wave of voice AI did not fall out of the sky in late 2024. It is the visible end of a research arc that began quietly around 2021 and gathered pace each year since.
In early 2021, a team at Meta published a paper called [**Generative Spoken Language Modeling**, or GSLM](https://arxiv.org/abs/2102.01192). It showed something that, at the time, felt almost heretical: you could train a language model on raw speech with no text at all, by clustering speech features into pseudo-words and then modeling the sequence of those units. The speech did not have to pass through writing to be learnable.
Later that year, Google released [**SoundStream**](https://arxiv.org/abs/2107.03312), the neural audio codec that delivered the RVQ trick described above. Together, GSLM and SoundStream were the grammar and the alphabet for a future speech language model.
In 2022, Google combined the two with its [**AudioLM**](https://arxiv.org/abs/2209.03143) system, which introduced a hierarchy of semantic tokens and acoustic tokens. Semantic tokens carried the content, acoustic tokens carried the voice. AudioLM could continue a short audio clip in the speaker's own voice for many seconds, with linguistic coherence and acoustic realism people had not quite seen before.
Also in 2022, Meta's follow-up [**dGSLM**](https://arxiv.org/abs/2203.16502) extended GSLM from monologue to two-speaker dialogue, trained on the Fisher corpus, and produced the first textless model with natural turn-taking behavior, including overlaps and backchannels. The pieces for a conversational speech model were on the table.
In 2023, two systems generalized the approach in different directions. Microsoft's [**VALL-E**](https://arxiv.org/abs/2301.02111) used the codec-plus-language-model recipe for high-quality text-to-speech, cloning a voice from a three-second sample. Fudan's [**SpeechGPT**](https://arxiv.org/abs/2305.11000) plugged speech tokens into a text LLM's vocabulary and produced one of the first models that could take a spoken instruction and answer in speech, end to end.
Then, in September 2024, Kyutai released [**Moshi**](https://arxiv.org/abs/2410.00037). Open weights under a CC-BY 4.0 license, code under Apache 2.0, running on a single GPU. The first real-time, full-duplex, speech-text foundation model available to anyone who wanted to study it. That is the moment the research arc met the demo stage, and it is why the second half of 2024 felt different from the first half.
A parallel thread, worth knowing so you do not mistake this for the only story, runs through Google's **Translatotron** ([2019](https://arxiv.org/abs/1904.06037) and [2021](https://arxiv.org/abs/2107.08661)), which did direct speech-to-speech translation without text. It sat outside the LLM lineage but proved the broader point that text is not a mandatory intermediate step for voice.
## 5. What the new architecture actually does
Moshi is the clearest public example of how these models are put together. Understanding its shape helps make the whole category concrete.
Moshi models two audio streams at once. One is the user's channel, the audio coming in. The other is the model's own channel, the audio going out. Both streams are represented in the same kind of Mimi tokens, and both are predicted by the same network. That is what gives the model the structural ability to listen and speak simultaneously. There is no push-to-talk state, no moment when the model stops hearing in order to speak.
Alongside the two audio streams, Moshi maintains a third stream: a time-aligned text transcript of what the model itself is saying. At each 80 millisecond frame, the model first predicts a text token, then predicts the audio tokens for that frame. The text token acts as a kind of inner monologue, a semantic handle that lets the model reason linguistically while generating audio. This technique, which Kyutai calls **Inner Monologue**, is the training detail that keeps the spoken output coherent over long turns.
It is worth pausing on what this buys you. Earlier speech language models, including SpeechGPT, followed a pattern where the model first produces a complete text response, then synthesizes audio for that text. Real-time conversation is almost impossible in that arrangement, because the audio cannot begin until the text is done. Moshi's frame-by-frame interleaving means text and audio are generated together. Each word the model says earns its place in the same forward pass.
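A structural sketch of that loop, with dummy stand-ins for the network and the codec, looks roughly like this. None of the names below are Kyutai's, and the stubs do no real work; the point is the ordering inside one 80 ms frame, with no separate listen and speak states.

```python
import random

# Structural sketch of the frame-by-frame loop described above. Not Kyutai's
# implementation: every function here is a dummy stand-in. What it shows is
# the ordering -- user audio in, the text token first, the model's own audio
# tokens next, all inside the same frame.

FRAME_SECONDS = 0.080            # one frame at 12.5 Hz
AUDIO_CODEBOOKS = 8              # illustrative RVQ depth, not a Moshi spec

def next_user_frame():           # stand-in for 80 ms of microphone audio, tokenized
    return [random.randrange(2048) for _ in range(AUDIO_CODEBOOKS)]

def predict_text_token(history, user_tokens):              # stand-in text head
    return random.randrange(32000)

def predict_own_audio(history, user_tokens, text_token):   # stand-in audio head
    return [random.randrange(2048) for _ in range(AUDIO_CODEBOOKS)]

history = []
for frame in range(5):
    user_tokens = next_user_frame()                         # listening never pauses
    text_token = predict_text_token(history, user_tokens)   # inner monologue first
    own_tokens = predict_own_audio(history, user_tokens, text_token)  # then audio
    history.append((user_tokens, text_token, own_tokens))   # both streams, one context

print(f"{len(history)} frames = {len(history) * FRAME_SECONDS * 1000:.0f} ms of dialogue, "
      "text and audio co-emitted at every step")
```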
That is why every part of the label *real-time, full-duplex, speech-text foundation model* is literal. Real-time because generation is frame by frame. Full-duplex because two audio channels are modeled at once. Speech-text because text and audio are co-generated, not stage-separated. Foundation model because the whole thing is pretrained at scale on conversational data, then aligned for dialogue.
## 6. What the cascade cannot do
The old pipeline, ASR to text LLM to TTS, still works, and in many narrow domains it works very well. [OpenAI's own developer documentation](https://platform.openai.com/docs/guides/voice-agents) frames voice as two valid tracks: chained pipelines, which remain reliable and easier to debug, and speech-to-speech models, which aim for lower latency and more natural conversation. The argument here is narrower than a dismissal of the cascade. It is that on two specific dimensions, the cascade is structurally disadvantaged by design.
The first is **paralinguistic loss**. Speech carries two kinds of information. There are the words themselves, which a transcript captures, and there is everything about how the words were said, which a transcript throws away. Pitch, prosody, emotion, timbre, rate, breath. When an ASR model converts speech to text, it throws away this second channel entirely. A text LLM reasoning on the transcript cannot recover information that was never passed to it. The TTS that speaks the answer has to invent prosody from scratch, based only on the words, with no clue about the user's mood, urgency, or register. Sarcasm becomes sincerity. A panicked question comes back at conversational pace. A sardonic *sure* gets a chipper *absolutely*. OpenAI made the same observation in its [Realtime API release notes](https://openai.com/index/introducing-gpt-realtime/), where the team acknowledged that traditional stitched pipelines tend to lose emotion, emphasis, and accents. That admission from a player whose first voice product was itself a cascade is a useful primary-source signal that the loss is a property of the architecture, not of any one implementation.
The second is **error propagation**. Each stage of the pipeline is independently trained on its own task, and none of them sees the full audio. An ASR mistake on a homophone (*knight* for *night*, *ate* for *eight*) changes the meaning the LLM reasons about, and the error cannot be corrected downstream because the downstream stages never saw the original waveform. Accented speech, which many ASR models still handle unevenly, compounds the same problem. The TTS can pronounce the wrong answer with perfect clarity, which is actually worse than a garbled one, because it sounds confident.
It is worth being honest about what this does not mean. Cascades are not dead. For high-accuracy, highly constrained domains, a cascade with domain-tuned ASR still often wins on task accuracy, and modular pipelines remain easier to debug. A recent line of work, including the [**X-Talk**](https://arxiv.org/abs/2512.18706) survey on modular systems, argues that well-engineered modular designs with paralinguistic side-channels can close much of the gap. The claim here is narrower. Cascades hit a structural ceiling on the naturalness of conversation. Latency below 300 milliseconds and faithful paralinguistic preservation are not problems the cascade architecture is shaped to solve.
## 7. What STS actually solves
Pulling the threads together, an integrated speech-to-speech model is structurally better positioned on three capabilities where the cascade is disadvantaged by design.
First, it brings latency closer to the conversational threshold. Around 200 milliseconds instead of around 1,000 in the best reported measurements. That is the difference between an exchange and a conversation, and it is now being reported from real systems rather than only from research papers.
Second, it preserves paralinguistic signal through the pipeline. The prosody, emotion, rate, and affect of what the user said are carried through rather than discarded and reinvented. That is why the best demos from this generation sound like they are responding to *how* you spoke, not just *what* you said.
Third, it supports natural turn-taking. Because the architecture models two audio channels at once, overlaps, interruptions, and backchannels behave the way they do in human conversation. Duplex is no longer a product feature bolted on top. It is built into the model.
The reason latency, tone, and turn-taking matter beyond demos is the size of the categories voice is the natural interface for. Start with headcount and call volume rather than TAM. The [US Bureau of Labor Statistics](https://www.bls.gov/ooh/office-and-administrative-support/customer-service-representatives.htm) counts roughly 2.8 million customer service representatives in 2024, at a median wage near $20 an hour, with the category projected to shrink through 2034. In the UK, [NHS 111 logged 1.68 million calls](https://www.england.nhs.uk/statistics/statistical-work-areas/iuc-ccas/) in a single month in 2025. The [World Health Organization estimates that at least 2.2 billion people](https://www.who.int/publications/i/item/9789241516570) have near or distance vision impairment, a population for which voice is not a convenience but the primary interface. [UNESCO](https://www.unesco.org/gem-report/en/inclusion) has repeatedly flagged that hundreds of millions of learners are taught in a language they do not speak at home. These are not TAM slides. They are problem sizes for which a conversational interface that preserves tone, handles interruption, and runs under a human turn-taking threshold is a credible lever.
Market research puts the same pressure on the supply side. [Gartner projects that conversational AI in contact centers will save roughly $80B](https://www.gartner.com/en/newsroom/press-releases/2022-08-31-gartner-predicts-conversational-ai-will-reduce-contact-centre-labour-costs-by-80-billion-in-2026) in agent labor cost by 2026. [Grand View Research sizes the AI voice agents market at $2.54B in 2025](https://www.grandviewresearch.com/industry-analysis/ai-voice-agents-market-report) with a projected 39% CAGR through 2033. ElevenLabs alone [reported more than $330M in ARR at the end of 2025, then raised $500M in February 2026 at an $11B valuation](https://www.sequoiacap.com/article/partnering-with-elevenlabs-series-d/), roughly three times the valuation it carried a year earlier. Estimates from different research firms vary by a factor of two or three depending on scope, so treat each individual figure as directional. The direction, though, is not ambiguous.
The market becomes easier to read once you split it into three layers, each with a distinct revenue model and a distinct KPI.
Underneath those layers, the capital market has already repriced voice as a standalone primitive. One cluster of rounds inside a single 14-month window, from January 2025 through February 2026, sets a rough floor for how investors read the category.
Most of these rounds fund companies selling into existing markets: contact centers, customer service, outbound sales, clinical intake, enterprise note-taking. These are categories where voice is already the interface and the KPI is task effectiveness. How many calls are handled, how many minutes are saved, how many issues are resolved on first contact. The math on these markets is well understood, and most of the capital in those rounds is betting on winning them. The one exception is [Sesame's $250M Series B](https://techcrunch.com/2025/10/21/sesame-the-conversational-ai-startup-from-oculus-founders-raises-250m-and-launches-beta/), led by Sequoia and Spark in October 2025, which took the company above a $1B valuation on the strength of a voice-companion product (Maya and Miles) and a smart-glasses roadmap, not a contact-center pitch. That round is the first billion-dollar price tag inside this window that is pointed at the second market below rather than the first.
What is more interesting, if less easily sized today, is the second market STS quietly opens up. Because the model can both read and write paralinguistic signal (pitch, prosody, rate, the shape of a breath), the interface becomes capable of carrying feelings, not just instructions. Text never could. The early consumer signals are already visible: companion apps crossed $120M in mobile revenue in 2025, and 48% of users report using their AI companion for mental-health support. These are small numbers today, and some of the usage patterns are known to be fragile. The novel part is the shape: a category where the KPI is presence rather than task completion.
> **oto perspective**
>
> The obvious wins for STS are in the left column. Customer support, clinical intake, enterprise voice copilots. These are categories where the KPI is task effectiveness and voice is already the interface. Capital and product teams are correctly racing into them.
>
> What we find structurally more interesting is the right column. Because STS can read and write paralinguistic signal (pitch, prosody, rate, the shape of a breath), it is the first interface a computer has ever had that can carry feelings. Picture the clock on your bedside table with STS inside it. At the start and end of each day, instead of opening TikTok, you spend ten minutes talking to a companion that actually understands how you sounded yesterday and the day before, and that helps you journal. The product KPI is how much better you feel, not tasks completed. The societal KPI, if this ever works at scale, is measured in suicide rates, mental-health incidence, and daily stress. Those categories are enormous, and they do not exist as products today for one reason: text cannot carry the signal. STS might.
>
> That is the market we think is worth building toward, and it is the reason we care about the quality of the data going in. A model that hallucinates a task is annoying. A model that hallucinates an emotion is something else.
The three capabilities above land differently across these two markets. Sub-conversational latency matters most where interruption and back-and-forth are constant. Paralinguistic preservation matters most where tone carries information the words do not. Full-duplex turn-taking matters most where the interaction is long and unstructured. A model that clears all three is a candidate default interface for most of the categories above, which makes the next question a question about inputs.
None of this means the category is finished. STS models still hallucinate, and when they do, there is no intermediate text transcript to point to, so debugging is harder. Specialized ASR and TTS still beat foundation models in narrow, high-accuracy domains. On evaluation, 2025 was the year the first STS-native benchmarks appeared: [Full-Duplex-Bench](https://arxiv.org/abs/2503.04721) (arxiv 2503.04721, March 2025) focuses on turn-taking and interruption behavior, and [URO-Bench](https://arxiv.org/abs/2502.17810) (arxiv 2502.17810, Feb 2025, EMNLP 2025) is the first S2S benchmark to score paralinguistic understanding and response. The stack is still fragmented, with no single dominant end-to-end standard for *is this a good STS agent*. Those are the threads later articles in this series pick up.
One final observation about where the bottleneck now sits. With [gpt-realtime generally available](https://openai.com/index/introducing-gpt-realtime/), [Gemini Live on Vertex](https://cloud.google.com/vertex-ai/generative-ai/docs/live-api), and open-weight models like [Moshi](https://github.com/kyutai-labs/moshi) and [Sesame CSM](https://github.com/SesameAILabs/csm) downloadable, the architecture side of STS is rapidly becoming a commodity. What separates a demo from a product that works across accents, emotional registers, and full conversational turns is not the model graph anymore. It is the data the model was trained on. Which leads to the next article.
## 8. What comes next: data
Full-duplex models have to learn from conversations that actually look like conversations. Two channels, one per speaker. Overlap left intact. Paralinguistic signals preserved. Not read speech, not scripted dialog, not bulked-up monologue transcripts.
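Concretely, one illustrative shape such a training example could take is sketched below. The field names are ours, not a published schema; what matters is what has to be present.

```python
# One illustrative shape a full-duplex training example could take. The field
# names are ours, not a standard; the point is what has to be present: one
# waveform per speaker, a shared clock, and overlap left exactly as it happened.

example = {
    "conversation_id": "conv_000123",
    "sample_rate_hz": 24_000,
    "channels": {
        "speaker_a": "conv_000123_a.wav",   # each speaker on their own channel,
        "speaker_b": "conv_000123_b.wav",   # so overlap is observed, not reconstructed
    },
    "turns": [
        # start/end in seconds on the shared clock; overlapping spans are allowed
        {"speaker": "a", "start": 0.00, "end": 4.21, "text": "so I was thinking--"},
        {"speaker": "b", "start": 3.95, "end": 4.40, "text": "mm-hm"},   # backchannel
        {"speaker": "a", "start": 4.21, "end": 7.80, "text": "that we should move it"},
    ],
    "license": "commercially usable, attribution required",  # provenance travels with the audio
}
```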
What is scarce is not speech data in general, but clean speaker-wise full-duplex conversational audio at scale. Most in-the-wild dialogue still exists as monaural mixtures, not separate channels, so overlap has to be reconstructed rather than observed. That conversational speech itself can scale is no longer in doubt. [J-CHAT](https://arxiv.org/abs/2407.15828), published in 2024, is a 76,000-hour Japanese dialogue speech corpus assembled from the public web. Recent work on full-duplex specifically, such as [InteractSpeech](https://aclanthology.org/2025.findings-emnlp.424/) (2025) and [DialogueSidon](https://arxiv.org/abs/2604.09344) (2026), is still measured in the low hundreds of hours, and the open ceiling for clean two-channel conversation remains [Fisher](https://catalog.ldc.upenn.edu/LDC2004S13), a 1,960-hour corpus collected by LDC in 2004. Moshi trained on it. Nearly every serious full-duplex effort does. Frontier models are already operating at scales where 2,000 hours of two-channel dialogue is a starting point, not a ceiling. The gap between what the next generation of STS models needs and what is actually available, licensed, and channel-separated, is the practical bottleneck the rest of this series looks at.
That is where we go next. What is in the public datasets, what is missing from them, what a full-duplex training set actually has to contain, and what it takes to build one at the scale the models now demand.
---
*oto builds large-scale two-channel full-duplex conversational speech datasets for next-generation speech-to-speech models. If you are training an STS model and running into the data ceiling described above, get in touch.*
---
_Originally published at [https://fullduplex.ai/blog/sts-primer](https://fullduplex.ai/blog/sts-primer)._
_Part of **The STS Series** · 01 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 02 — The full-duplex threshold
_Canonical: https://fullduplex.ai/blog/full-duplex-threshold · Markdown: https://fullduplex.ai/blog/full-duplex-threshold/md_
---
title: "The full-duplex threshold"
description: "A number, a biology fact, and a small cluster of systems. What the full-duplex threshold actually is, what it takes to cross it, and what conversations above it unlock."
article_number: "02"
slug: full-duplex-threshold
published_at: 2026-04-09
reading_minutes: 13
tags: ["full-duplex", "latency", "conversation"]
canonical_url: https://fullduplex.ai/blog/full-duplex-threshold
markdown_url: https://fullduplex.ai/blog/full-duplex-threshold/md
series: "The STS Series"
series_position: 2
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# The full-duplex threshold: when AI stops sounding like a walkie-talkie
## 1. The transaction and the conversation
Every time you have used a voice assistant you have probably noticed the same small rhythm. You say something. The device falls silent for a moment. A beep. A pause. A canned reply. Across ten years and tens of billions of dollars of product development, the rhythm has not really changed, because it is not a product decision. It is a structural one.
Voice assistants have been walkie-talkies. One side talks, releases, the other talks, releases. You are not having a conversation. You are filing a short request. Siri, Alexa, the airline's phone tree, the smart speaker in your kitchen. They all accept one utterance at a time, think, then reply with one utterance back. It is a transaction, not a conversation.
In 2024 a small cluster of voice systems stopped working that way. They started behaving like a telephone. You can interrupt them mid-sentence. They can hum agreement while you are still talking. They can begin speaking before you have finished. If you try one of them on a good day and then immediately go back to Siri, the older system feels broken.
That shift has a name. It is the crossing of the full-duplex threshold. This article is about what the threshold actually is, what it takes to cross it, why the timing is bounded by a number about human biology rather than engineering effort, and what conversations above the threshold unlock that transactions never could.
## 2. Walkie-talkie and telephone, formalized
The words come from telecommunications. A [half-duplex](https://en.wikipedia.org/wiki/Duplex_(telecommunications)) channel carries signal in only one direction at a time. Push to talk. Release to listen. A [full-duplex](https://en.wikipedia.org/wiki/Duplex_(telecommunications)#Full_duplex) channel carries signal both ways simultaneously. Talking and listening share the wire.
Every voice interface you have used sits somewhere on that axis. An airline IVR, a push-to-talk radio, a 2013-era Siri: walkie-talkie. A telephone call, a face-to-face conversation, the best 2024-2026 speech-to-speech systems: telephone. The push-to-talk button on a 2000s Nokia and the wake-word-then-utterance pattern on a 2020 Alexa are doing the same thing, one more explicitly than the other. Both are half-duplex.
The distinction is not stylistic. Half-duplex systems force you to serialize the two hardest parts of a conversation, listening and speaking, that humans overlap constantly. Once you notice the pattern you cannot un-notice it. The cadence of a walkie-talkie voice assistant is the sound of a channel that can only carry one direction at a time.
*Figure F1: two small side-by-side diagrams. Left, a walkie-talkie with a push-to-talk button and a dashed line showing one-direction-at-a-time transmission, labeled "one at a time." Right, a telephone with two overlapping waveforms, labeled "both at once."*
## 3. What humans actually do, measured in milliseconds
Start with a number most people have never heard. The modal gap between one person finishing a turn and the next person starting is about 200 milliseconds. Stivers et al. [measured this in 2009](https://www.pnas.org/doi/10.1073/pnas.0903616106) across 10 languages, from English and Japanese to Yélî Dnye, a language with about 4,000 speakers on a small island off Papua New Guinea. The distribution is remarkably stable. Wherever humans speak to each other, the gap clusters near 200 ms.
That number is small. It is small in a way that forces a specific conclusion about how the brain is organized.
Producing a single spoken word takes at least 600 milliseconds, measured end-to-end from intention to articulation [[Levinson & Torreira 2015](https://doi.org/10.3389/fpsyg.2015.00731)]. Formulating a sentence takes much longer. If you waited until the other person had stopped talking to begin planning what to say, you would not be answering at 200 ms. You would be answering at somewhere between 800 and 2,000 ms. You would sound like Siri.
So humans must be doing something else. A series of EEG studies has begun to fill in the mechanism. Bögels and colleagues [recorded brain activity](https://www.nature.com/articles/srep12881) while people answered spoken questions and found that the brain begins drafting the reply within about 500 ms of having enough information to answer. In most real conversations that moment arrives seconds before the other speaker finishes their sentence. Think of it as a kitchen during dinner rush: the listener is not waiting for the order to finish, then cooking. They are chopping, searing, plating, listening for the next order, all at once. By the time the current speaker falls silent, the reply is already half-built and waiting at the lips.
The 200 ms gap is not a politeness target. It is the shortest possible window to verify that the other person really did finish, release the staged response, and articulate. Below that window, overlap starts to happen. [Heldner and Edlund 2010](https://doi.org/10.1016/j.wocn.2010.08.002) measured actual gap distributions in three spontaneous speech corpora and found that overlaps account for roughly 40 percent of all transitions between speakers. Another 40 percent or so are gaps longer than 200 ms, and the remainder are shorter gaps, with the modal cluster near that value. Human conversation is not a sequence of tidy alternating turns. It is a tightly interleaved stream in which "your turn" and "my turn" are often literally true at the same time.
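The measurement behind these numbers is simple enough to sketch. The offset between one speaker's end and the next speaker's start is computed for every speaker change; negative offsets are overlaps, positive ones are gaps. The turns below are invented, purely to show the computation.

```python
# Floor-transfer offset: next speaker's start time minus current speaker's end
# time. Negative values are overlaps, positive values are gaps. These turns are
# made up; real studies compute this over thousands of annotated transitions.

turns = [  # (speaker, start_s, end_s)
    ("A", 0.00, 2.10),
    ("B", 2.28, 4.00),   # +180 ms gap -- near the human mode
    ("A", 3.90, 5.50),   # -100 ms -- B was still finishing: an overlap
    ("B", 6.30, 7.00),   # +800 ms -- a long, noticeable gap
]

offsets_ms = [
    round((nxt[1] - cur[2]) * 1000)
    for cur, nxt in zip(turns, turns[1:])
    if cur[0] != nxt[0]              # only speaker changes count as transitions
]
print(offsets_ms)                    # [180, -100, 800]
print(sum(o < 0 for o in offsets_ms), "overlap(s) out of", len(offsets_ms))
```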
That is the baseline any voice system is being measured against. Not an aspiration. A biological fact about how humans coordinate speech.
*Figure F2: horizontal strip chart, 0 to 1,500 ms. Ten language bars clustering around 200 ms, with a filled dot at 200 ms labelled "human modal." Source: Stivers et al. 2009.*
## 4. Four behaviors that live above the threshold
Call them the four micro-behaviors that a walkie-talkie cannot do.
The first is **barge-in**. You are talking to the assistant. It is halfway through its answer. You already have what you need and you cut in. A full-duplex system stops speaking mid-word and re-centers on you. A half-duplex system keeps going until it reaches the end of its utterance, then processes you.
The second is **backchannel**. You are telling it a story. It says "mm-hm" while you talk, not to take the floor, but to signal that it is tracking. A full-duplex system emits these short acknowledgements without taking the turn. A half-duplex system cannot emit sound while listening, so it stays silent, and the silence reads as indifference.
The third is **overlap and recovery**. You both start at the same moment. One yields, the other continues, the rhythm recovers within a second. In a half-duplex system only one of you has a microphone open at a time and both of you end up restarting.
The fourth is **co-completion**. You are searching for a word and the listener supplies it. "The place we went last year, with the, the," "the tree house?" "yes, the tree house." Humans do this constantly. Full-duplex systems are only beginning to.
The first three of these are what [Full-Duplex-Bench](https://arxiv.org/abs/2503.04721) (Lin et al. 2025) turns into measurable axes. The benchmark measures pause handling, backchanneling, turn-taking, and interruption. The trick inside is a single simple rule, applied twice with the sign flipped. The rule: the model has "taken over" the floor if it produces more than 3 words or keeps speaking for more than 1 second. That rule is scored in two different situations. When the user has clearly finished and is waiting for a reply, you want the rule to trigger (good turn-taking, higher is better). When the user has only paused mid-thought and is about to keep going, you want the rule to stay quiet (good pause handling, lower is better). Same detector, opposite expectation, depending on what the user was actually doing. One tool, two sides of the same mistake.
Backchanneling is scored by chopping time into 200 ms buckets and comparing how often the model says "mm-hm" in each bucket against how often a real human does in the same spot, using natural conversation recordings as the ground truth (the [Full-Duplex-Bench repository](https://github.com/DanielLin94144/Full-Duplex-Bench) ships the evaluation code). The closer the two timing patterns match, the better. Interruption, the most open-ended axis, asks a separate language model, GPT-4-turbo, to read the exchange and grade the response on a 0-to-5 scale. That last one is a tell: even a serious benchmark for full-duplex behavior cannot reduce "did the model respond reasonably when I interrupted?" to a deterministic rule. It needs a reader.
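To make the two-sided use of that rule concrete, here it is as a few lines of code. This is our paraphrase of the published definition, not the benchmark's evaluation script.

```python
# The takeover rule as described above -- our paraphrase, not the
# Full-Duplex-Bench evaluation code.

def took_over(words_spoken: int, speaking_seconds: float) -> bool:
    """The model has taken the floor if it says more than 3 words
    or keeps speaking for more than 1 second."""
    return words_spoken > 3 or speaking_seconds > 1.0

def score_turn_taking(words: int, secs: float) -> bool:
    # The user has finished and is waiting: taking over is the RIGHT move.
    return took_over(words, secs)

def score_pause_handling(words: int, secs: float) -> bool:
    # The user has only paused mid-thought: taking over is the WRONG move.
    return not took_over(words, secs)

# The same response, judged against two different user states:
print(score_turn_taking(words=12, secs=2.5))     # True  -- replied when invited
print(score_pause_handling(words=12, secs=2.5))  # False -- barged in on a pause
print(score_pause_handling(words=1, secs=0.4))   # True  -- a short "mm-hm" is fine
```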
The fourth behavior, co-completion, is not yet in the benchmark. That is worth sitting with. Co-completion is arguably the most conversational thing humans do together. No public evaluation of voice AI in 2026 measures it directly. The benchmarks can see three of the four threshold behaviors. The fourth one is waiting.
The successor benchmark, [Full-Duplex-Bench v3](https://arxiv.org/abs/2604.04847), pushes further. It extends the test set with real human disfluency (the "um"s, false starts, and restarts that real speakers produce) and tool-use scenarios (the agent has to actually look something up mid-conversation). On that benchmark the current commercial leader, OpenAI's [gpt-realtime](https://platform.openai.com/docs/guides/realtime), scores Pass@1 around 0.600. In plain terms: under realistic speech conditions the best-in-class commercial system still fails roughly four out of every ten tool-using conversations. That is the single highest public number on a current full-duplex eval. It is also a long way from reliable.
*Figure F3: four compact timeline snippets in a 2x2 grid, two speaker tracks per snippet, labelled "barge-in," "backchannel," "overlap," and "co-completion." The shape of each interaction is readable at a glance.*
## 5. Why walkie-talkie systems could not cross the threshold
A 2023-era voice assistant was a pipeline. The user spoke. Speech recognition converted audio to text. A language model read the text and generated a reply. Text-to-speech spoke the reply aloud. Each stage ran in strict order. Nothing began until the previous stage finished.
That architecture has a floor. The cheapest modern streaming ASR reports first-token latency in the [100 to 500 millisecond range](https://openai.com/index/introducing-our-next-generation-audio-models/). A typical LLM needs 350 to 1,000 milliseconds to generate a usable first token of text response. Text-to-speech adds another 75 to 200 ms. Add glue, routing, and network jitter and you cannot drive the end-to-end latency much below one second. One second is five times the human modal turn gap. It is the cadence of a walkie-talkie.
Some vendors shaved this by running the stages in tight concurrent streams. The first tokens of the LLM's output are handed to TTS while the LLM is still generating later tokens. This gets serious production cascades down into the 300 to 800 millisecond range when everything goes right. It is honest engineering and it makes voice products noticeably more responsive. It does not change the underlying architecture.
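The budget math is worth writing out once. The per-stage ranges below are the ones quoted above; the glue estimate and the share of work a streamed cascade can hide are our illustrative assumptions, not vendor benchmarks.

```python
# Cascade latency budget. Stage ranges are the ones quoted in the text; the
# glue range and the streamed "overlap factor" are illustrative assumptions.

stages_ms = {                      # (best case, typical case) per stage
    "streaming ASR first token": (100, 500),
    "LLM first usable token":    (350, 1000),
    "TTS first audio":           (75, 200),
    "glue / network / routing":  (50, 150),
}

best = sum(lo for lo, _ in stages_ms.values())
typical = sum(hi for _, hi in stages_ms.values())
print(f"strictly serial cascade: {best}-{typical} ms")       # 575-1850 ms

# Streaming the stages concurrently hides part of the LLM and TTS time behind
# earlier stages; assume, illustratively, that 40-60% of it can be hidden.
hidden_best, hidden_typical = 0.6 * (350 + 75), 0.4 * (1000 + 200)
print(f"tightly streamed cascade: ~{best - hidden_best:.0f}-{typical - hidden_typical:.0f} ms")
# ~320-1370 ms -- better, but still not 200 ms, and still half-duplex.
```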
The deeper limitation is that the pipeline cannot listen while it is speaking. The microphone is either open for input or closed for output. A user who tries to interrupt the assistant is ignored until the speech synthesizer finishes the current sentence. Imagine a person with only one ear and one mouth active at a time: while the mouth is working, the ears are shut off. That is the half-duplex mind. This is not a bug in a specific product. It is a property of the architecture. The machine can only be in one state at a time, and "I am speaking" and "I am listening" are different states.
There is a subtler loss in the cascade. Speech recognition throws away the melody of speech on its way in. Linguists call this *prosody*: the rise and fall of pitch, the stretched syllable, the micro-pauses that warn a sentence is not done yet. "Wait, I meant the green one, not the red," has a falling-then-rising pitch and a tiny pause right after "wait." Speech recognition flattens all of that into a plain string of text tokens. The language model then works only from the text. One of the strongest cues a human uses to judge "is it my turn yet?", the shape of the pitch and the micro-timing of the phrase boundary, has been deleted before the model ever sees it. [De Ruiter et al. 2006](https://doi.org/10.1353/lan.2006.0130) showed in a Dutch button-press study that the words and grammar of a sentence alone are enough to predict where a turn will end, while pitch alone is not; the two cues together are what give humans their precise launch timing. A system that sees only the text is working with half the evidence.
All of that means "just make it faster" was not going to work. The walkie-talkie feel was built into the shape of the architecture, not its speed.
*Figure F4: the half-duplex state machine. Five states in a loop, Listen, Transcribe, Think, Generate Voice, Speak, with a dashed arrow labeled "user interrupts here" bouncing off the Speak state.*
## 6. The systems that crossed it, and what to listen for
Three systems, three labs, one summer. In May 2024 OpenAI demonstrated [GPT-4o voice](https://openai.com/index/hello-gpt-4o/), a closed multimodal model that handles voice natively, with an average response latency reported around 320 milliseconds and live on-stage interruption handling. Google's [Gemini Live](https://deepmind.google/technologies/gemini/) arrived soon after with bidirectional streaming on Vertex. Then in September the French lab Kyutai released [Moshi](https://moshi.chat), an open-weights speech-to-speech model that listens and speaks on the same audio frame. These are the first systems that visibly cleared the full-duplex threshold in a commercial or open-weights form.
A note on what is citable from here on. Of the three, only Moshi ships with an academic paper describing the internals. GPT-4o and Gemini Live are closed commercial systems, and their architectural details are not public. The rest of this section leans on Moshi for specific numbers because it is the only one with public numbers, not because it is the only one doing the work.
Moshi is not a cascade. Think of it the way video plays on a screen, as a sequence of frames, 12.5 slices per second. At each slice the same model is simultaneously deciding two things: what the user is saying, and what it wants to say next. There is no separate "listening" phase and "speaking" phase. On a single Nvidia L4 GPU the end-to-end lag is about 200 milliseconds. The paper reports a theoretical floor of 160 milliseconds, made up of the 80 ms frame duration and an 80 ms acoustic look-ahead (the amount of future audio the model peeks at before committing to an output).
That 200 ms figure is worth a caveat. It is a measurement on specific hardware under specific load. The 160 ms floor is structural, but the extra 40 ms that shows up in practice is compute time on an L4. On other hardware the picture changes, sometimes for the worse. Treating 200 ms as a hardware-independent property of Moshi is one of the common mistakes in the current discourse.
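The split between the structural part and the hardware part fits in three lines of arithmetic, using the numbers reported in the Moshi paper.

```python
# Moshi's latency floor, as reported in the paper. Only the compute term
# varies with hardware.

frame_ms = 80        # one frame at 12.5 Hz
lookahead_ms = 80    # acoustic look-ahead before committing to an output
compute_ms = 40      # measured on an NVIDIA L4; the hardware-dependent part

print("structural floor:", frame_ms + lookahead_ms, "ms")                 # 160 ms
print("measured on an L4:", frame_ms + lookahead_ms + compute_ms, "ms")   # 200 ms
# On slower hardware only compute_ms grows; the 160 ms term does not shrink.
```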
The second wave has already changed the shape of the landscape. Kyutai has split its product line into three systems, each engaging with the threshold in a different way. [Moshi](https://github.com/kyutai-labs/moshi) is the true full-duplex dialogue model. [Hibiki](https://github.com/kyutai-labs/hibiki) is full-duplex in one direction only: it streams simultaneous French-to-English speech translation, keeping the rhythm of the original speaker intact. [Unmute](https://kyutai.org/unmute) is a modular cascade that wraps any LLM with Kyutai's streaming ASR and TTS, running at 450 to 750 ms end-to-end. That does not clear the 200 ms threshold, but it is well below the latency most enterprises are used to. Three products, three different answers to "how simultaneous should the conversation be." OpenAI and Google have made similar product-line choices between integrated and cascaded voice systems, but less transparently. Kyutai is the clearest example to point at because each tier is documented publicly.
Research labs have added several more integrated full-duplex systems in 2025. [SyncLLM](https://arxiv.org/abs/2409.15594) from NVIDIA trains a full-duplex model on a Llama-3-8B base, using 212,000 hours of synthetic two-channel dialogue plus only about 2,000 hours of real Fisher conversations. That 100-to-1 ratio of synthetic to real is itself a datapoint: it tells you how scarce real two-channel conversation data is when you actually sit down to train one of these models. [OmniFlatten](https://arxiv.org/abs/2410.17799) takes a different approach, flattening the user and model streams into a single interleaved sequence so a standard autoregressive decoder can learn the full-duplex pattern without architecture changes.
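The flattening idea is simple enough to show in a toy form. The chunking below is illustrative, not the paper's exact interleaving scheme.

```python
# Toy sketch of the "flattening" idea attributed to OmniFlatten above: the
# user stream and the model stream are interleaved frame by frame into one
# sequence, so an ordinary autoregressive decoder can model both sides.
# The per-frame chunking here is illustrative, not the paper's exact scheme.

user_frames  = ["u0", "u1", "u2", "u3"]   # stand-ins for per-frame audio tokens
model_frames = ["m0", "m1", "m2", "m3"]

flattened = []
for u, m in zip(user_frames, model_frames):
    flattened += [u, m]                    # one timeline, alternating streams

print(flattened)   # ['u0', 'm0', 'u1', 'm1', 'u2', 'm2', 'u3', 'm3']
# Predicting the next element of this single sequence is, implicitly, deciding
# both "what will the user say next" and "what should I say next".
```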
A third category sits just below the threshold. [Freeze-Omni](https://arxiv.org/abs/2411.00774) and [Mini-Omni2](https://arxiv.org/abs/2410.11190) claim "duplex" capability but achieve it by running interrupt detection on top of a half-duplex generator. Picture a walkie-talkie with a motion sensor bolted to the front: it is still a walkie-talkie, but it notices faster when the user starts to speak and switches direction sooner. The model is still only ever in one state at a time. This works well enough to feel responsive in most interactions, and is cheaper to train. It is also not the same thing as a native dual-stream model, and the distinction becomes visible whenever the conversation requires a backchannel or a co-completion.
The honest test is the one from §4. Try barge-in. Try a backchannel. Try starting at the same time. A system that handles all three without breaking rhythm is above the threshold. A system that handles only barge-in is using an interrupt detector.
*Figure F5: six demo cards, one per system, each with a one-line "listen for" prompt: Moshi, GPT-4o voice, Gemini Live, Kyutai Hibiki, SyncLLM, and OmniFlatten.*
## 7. Where the threshold changes what voice is useful for
Below the threshold, voice is good for transactions. Above it, voice becomes viable for conversation. Those are different markets.
The transactional applications were always going to work on a walkie-talkie. "Set a timer for ten minutes." "What's the weather." "Cancel my 3 pm." These are slot-fills. You have one thing to say, the assistant has one thing to say back. A 1,000 ms gap is fine. In fact, for commands, a deliberate pause is often reassuring: it tells you the system registered your intent.
The conversational applications were not. Any use case where the user speaks more than a sentence at a time, or where the system needs to react mid-utterance, or where pauses carry meaning, broke on half-duplex. Some of those use cases are now plausible products for the first time.
**Hands busy, eyes busy.** A driver, a surgeon, a factory worker, a cook with flour on both hands. A system that can be interrupted when the context shifts, and can hum acknowledgement without forcing a formal turn, is usable in environments where a transactional assistant was a liability.
**Accessibility.** A voice interface that cannot be interrupted is a worse interface for a user with motor or vision impairment, not a better one. The [WHO estimates 2.2 billion people with some form of vision impairment](https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment). A meaningful slice of them depends on voice interfaces. Full-duplex is the difference between a tool that respects their pace and one that forces them into a rigid listen-and-wait ritual.
**Language learning.** A practice partner that corrects you mid-sentence, the way a good human tutor does, not one that waits politely for you to finish your wrong sentence and then restart. The backchannel behavior matters here too: a learner knows they are being heard, not processed.
**Companionship and emotional conversation.** Journaling, therapy-adjacent reflection, light conversation for lonely hours. Pauses carry meaning in these contexts. A system that cannot respect a hesitant pause, or cannot say "mm-hm" while the user is working something out, reads as clinical even when its text content is good. This is one of the places where the team at oto believes voice genuinely carries feelings and not only tasks, and a walkie-talkie rhythm undercuts that directly. The primer article for this series [discusses the design argument at more length](./01-sts-primer).
**Children and elders.** Groups for whom typing is the wrong affordance, and for whom a robotic turn-based voice feels even more alien than it does to working-age adults. A grandchild who has just learned to talk does not naturally understand the concept of a beep.
None of these is a small market. Taken together, they are the difference between voice being a convenience layer on your phone and voice being a primary interface for a large share of software.
*Figure F6: a 2x3 grid of use-case cards, one recognizable icon and a two-word label per cell.*
## 8. What full-duplex does not fix
Crossing the threshold is not the same as finishing the job. An honest reader should close this article knowing what is still broken.
The most important limitation is that having the capability in the model is not the same as having good behavior. A system that technically *can* listen and speak at the same time still has to decide *when* to jump in and *when* to hold back. That judgment is learned from data, and the amount of real multi-channel conversation data available for that kind of training is small. The [CANDOR corpus](https://www.betterup.com/research/candor-research), a 1,656-pair naturalistic two-channel collection, is the single best publicly documented two-channel resource at scale. It ships under [CC BY-NC](https://creativecommons.org/licenses/by-nc/4.0/), which means academic labs can use it but commercial systems cannot train on it. The next tier is SyncLLM's 2,000 real Fisher hours and its 212,000 synthetic hours. Synthetic data is not nothing, but it is not the same thing either.
Second, capability is not reliability. In April 2026 two new benchmarks landed that let you calibrate how much of the work is actually done. [τ-Voice](https://arxiv.org/abs/2603.13686), from Sierra, ran 278 voice-agent tasks and found that voice agents retain only 30 to 45 percent of the equivalent text agent's task score. In plain terms: if you take the same underlying model and give it a voice instead of a keyboard, it loses most of its competence. Full-Duplex-Bench v3, as noted above, has gpt-realtime at around 0.600 Pass@1 on tool use under real human disfluency. Voice AI, in short, is not yet voice-first AI.
Third, integrated systems have their own ceilings. Moshi's context window — the amount of conversation the model can hold in its working memory at any moment — is roughly four minutes of audio. That is a property of the architecture, not a knob the developer can turn up. Four minutes is long enough for a quick exchange and short enough that a therapy session, a tutoring hour, or a long customer support call will run out of memory partway through. Long-horizon voice conversation is an open problem that full-duplex alone does not touch.
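The arithmetic behind that ceiling is short. The 3,000-frame context below is inferred from the four-minute figure and Mimi's 12.5 Hz frame rate, not quoted from the paper.

```python
# Why "about four minutes" is an architectural property rather than a setting:
# at 12.5 frames per second, the context fills at a fixed rate. The 3,000-frame
# figure is inferred from the four-minute number above, not quoted from the paper.

frame_rate_hz = 12.5
context_frames = 3000               # illustrative temporal context length

minutes_of_audio = context_frames / frame_rate_hz / 60
print(f"{minutes_of_audio:.1f} minutes of conversation fit in context")   # 4.0
# Doubling the usable conversation length means doubling the context the
# temporal transformer attends over -- a retraining decision, not a knob.
```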
Fourth, the quality of the threshold crossing depends on the language. Every benchmark and evaluation named in this article is dominated by English, with some coverage of Chinese and French. Stivers 2009 covered 10 languages and found the ~200 ms gap to be stable across all of them, but the systems are not stable across all of them. A Moshi trained on English conversation data does not automatically generalize to Japanese turn-taking, where the rhythm of handing the floor back and forth differs somewhat from that of Indo-European languages. Consumer voice assistants in 2026 commonly ship in 100+ locales. Full-duplex voice AI does not yet have the training data to do that.
Fifth, production latency is usually worse than demo latency. Network jitter, load balancing, cold starts, and shared inference clusters push even an integrated model into the 300 to 500 ms range on a busy day. Above the threshold on a best-case test is not the same thing as above the threshold in a user's kitchen.
None of this is a reason to discount the shift. The threshold is crossed. The walkie-talkie rhythm is no longer a structural necessity. But the systems that crossed it are brittle in ways the demos do not surface.
*Figure F7: two columns. "What works above the threshold": simultaneous listen and speak, barge-in, backchannel, short overlap recovery, human-adjacent turn gaps. "What does not yet": co-completion at scale, long-horizon memory, non-English turn-taking, tool-use reliability, production-latency consistency.*
## 9. What it feels like
Come back to the opening scene. The kitchen, the smart speaker, the beep. That is what below the threshold sounds like. Command. Pause. Reply. Command. Pause. Reply.
Now imagine the same device, but you can start your second sentence before the first reply has finished, and it just stops and recenters on you. You pause in the middle of an instruction, and it says "mm-hm" quietly, and waits. You ask it a complicated question, and it answers at your pace, not its own. You forget that it is a machine for a few seconds, because the small rhythms of the conversation work. The device has not become smarter. It has become present.
That is what the full-duplex threshold actually is. Not faster, exactly. Not smarter. Closer to the shape of human conversation. And once you have had a conversation in that shape, returning to a walkie-talkie feels like a regression.
The threshold is the easy part, in retrospect. What comes next is harder. A system that crosses the threshold still has to have something worth saying, has to remember you from one conversation to the next, has to work in the language you grew up speaking, has to respect when you do not want to be interrupted. The architecture has gotten out of the way. Everything else now has to be earned.
The engineering work is real. The product space is just beginning. The series continues with a deeper look at what actually has to be true about training data for systems above the threshold to keep improving. If you want the next piece when it goes up, the [oto newsletter](https://oto.earth/newsletter) is the easiest way to get it.
---
### Notes, corrections, and sources
Atomic notes directly supporting this draft: `2026-04-18-half-vs-full-duplex-analogy`, `2026-04-18-cascade-latency-math`, `2026-04-18-latency-threshold-crossed`, `2026-04-19-moshi-two-transformer-split`, `2026-04-19-moshi-200ms-hardware-dependent`, `2026-04-19-moshi-inner-monologue-text-first`, `2026-04-19-moshi-two-channel-training-is-the-trick`, `2026-04-19-fdb-pause-handling-definition`, `2026-04-19-fdb-backchanneling`, `2026-04-19-fdb-turn-taking`, `2026-04-19-fdb-interruption-definition`, `2026-04-19-fdb-four-axes-synthesis`.
Paper notes: `defossez-2024-moshi`, `stivers-2009-turn-taking`, `bogels-2015-neural-signatures`, `levinson-2015-timing-turn-taking`, `heldner-2010-pauses-gaps-overlaps`, `deruiter-2006-projecting-turn-end`.
Model notes used: `moshi`, `gpt-4o-voice`, `openai-realtime-api`, `gemini-live`, `kyutai-unmute`, `kyutai-hibiki`, `syncllm`, `omniflatten`, `freeze-omni`, `mini-omni2`.
Benchmark notes used: `full-duplex-bench`, `full-duplex-bench-v3`, `tau-voice`, `uro-bench`.
---
_Originally published at [https://fullduplex.ai/blog/full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold)._
_Part of **The STS Series** · 02 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 03 — From pipeline to integrated
_Canonical: https://fullduplex.ai/blog/pipeline-to-integrated · Markdown: https://fullduplex.ai/blog/pipeline-to-integrated/md_
---
title: "From pipeline to integrated"
description: "“Integrated” sounds like one architecture. It is at least four. A field guide to the 2026 full-duplex STS landscape — four families under one label, their latency math, their data bets, and their license exposure."
article_number: "03"
slug: pipeline-to-integrated
published_at: 2026-04-14
reading_minutes: 15
tags: ["STS", "architecture", "integrated"]
canonical_url: https://fullduplex.ai/blog/pipeline-to-integrated
markdown_url: https://fullduplex.ai/blog/pipeline-to-integrated/md
series: "The STS Series"
series_position: 3
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# From pipeline to integrated: how STS models actually work
## 1. Moshi as the landmark
In September 2024 a small French lab, [Kyutai](https://kyutai.org/), released [Moshi](https://github.com/kyutai-labs/moshi): a 7-billion-parameter model that listened and spoke at the same time, shipped as public weights under CC-BY 4.0, with the [Mimi codec](https://github.com/kyutai-labs/moshi) released alongside it under MIT. You could download Moshi, run it on a laptop with a recent GPU, and have a conversation that did not feel like Siri. The release paper claimed 160 milliseconds of theoretical latency and [200 milliseconds measured on an NVIDIA L4](https://arxiv.org/abs/2410.00037).
That single release turned a distinction that had lived inside research papers for years into a product fact. On one side of the line were cascades. Recognize the user's speech, run a language model, synthesize the reply, play it back. On the other side was something else: speech in, speech out, one model, continuous inference. The research community had sometimes called this split "pipeline versus end-to-end" or "modular versus integrated." Before Moshi, only closed commercial systems (GPT-4o voice, which OpenAI had demoed in May 2024) could be pointed at as instances of the integrated side. After Moshi, you could clone a repository and read the architecture.
Eighteen months later, the integrated side has fractured. As of April 2026 there are at least four architecturally distinct families of full-duplex STS, plus a growing closed commercial layer above them. This article maps the families and explains why the distinctions matter, especially when comparing latency, training-data needs, and licensing exposure across products.
## 2. The pipeline ancestor
The cascade is worth naming in one paragraph so the contrast later has something to push against. You already saw the detailed version in [Article 02](https://fullduplex.ai/blog/full-duplex-threshold). In short, the legacy voice pipeline is a five-stage loop. The device captures audio, an automatic speech recognition (ASR) model turns it into text, a language model reads the text and produces a reply, a text-to-speech (TTS) model renders that reply back into audio, and the speaker plays it. Each stage has to finish before the next can start, so the minimum achievable latency is the sum of stages. Once you add network hops, model routing, and cold starts, a well-tuned cascade lands near [one second end-to-end on a typical day](https://openai.com/index/gpt-4o-audio/).
It is fair to say that modern cascades have kept improving. [Deepgram's Aura-2 and Aura-Nova lines](https://deepgram.com/product/voice-agent) quote sub-second end-to-end latency for their agent stack, and [Cartesia's Sonic](https://cartesia.ai/sonic) is now one of the fastest commercial TTS engines at roughly 90 milliseconds time-to-first-audio. A cascade with excellent components can now dip well under a second under favorable conditions. The structural property that still holds is that a cascade cannot, by its own design, listen while it speaks. Every full-duplex claim you read is a claim that some part of the system escapes that constraint.
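To make the sum-of-stages point concrete, here is a minimal latency-budget sketch in Python. Every stage number except the ~90 ms time-to-first-audio figure echoed from above is an illustrative assumption, not a measurement of any particular product.

```python
# Illustrative cascade latency budget. All stage numbers are assumptions,
# except the TTS time-to-first-audio, which echoes the ~90 ms figure above.
stages_ms = {
    "endpointing (deciding the user finished)": 200,
    "ASR final transcript":                     150,
    "LLM time-to-first-token":                  300,
    "TTS time-to-first-audio":                   90,
    "network hops and buffering":               150,
}

total = sum(stages_ms.values())
print(f"minimum serial latency ~ {total} ms")  # ~890 ms on these assumptions
# The stages run one after another, so the floor is their sum; a full-duplex
# system overlaps the stages instead of adding them.
```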
## 3. Four families under one label
Call the other side "integrated" and it looks like one thing. In practice it is at least four. The cleanest way to keep them straight is not an architecture diagram. It is a kitchen.
Imagine a single cook who has to take orders from the dining room and plate dishes at the counter at the same time. Four kitchens solve this problem four different ways, and the four families of integrated STS map onto them almost exactly.
- **Family 1 is two intercoms at once.** One channel carries the incoming order, another carries the outgoing plating call. Both are open simultaneously. The cook also scribbles mental notes on a whiteboard to keep the thread straight across both lines. Moshi, PersonaPlex, and Sesame CSM are wired this way.
- **Family 2 is one intercom, alternating very fast.** Only one line is available, but orders and plating calls take very short turns on it, alternating so quickly that from the dining room it sounds like both directions are happening at once. OmniFlatten, Qwen2.5-Omni, Covo-Audio, and Kimi-Audio are wired this way.
- **Family 3 is a relay with a supervisor.** The classic kitchen pipeline (take order, prep, plate) is still there, stage by stage. A supervisor stands behind the line and shouts "cut in now!" every half second, so the relay overlaps at short time scales instead of waiting for each handoff. Freeze-Omni and MiniCPM-o are wired this way.
- **Family 4 is no tickets at all.** The cook never writes anything down, never converts the conversation into printable text. The whole kitchen runs on continuous hand signals, and an internal "speak or listen" instinct decides which way the signal flows. SALMONN-omni is the only public example.
The comparison table below is the one-line cheat sheet for the same split, in roughly the form researchers tend to draw it.
| Family | Everyday intuition | Example models | What "200 ms" means here |
|---|---|---|---|
| 1. Dual-stream + codec | Two intercoms at once | Moshi, PersonaPlex, CSM-1B | Theoretical floor on the single transformer's forward pass |
| 2. Interleaved / flatten | One intercom, very fast alternation | Qwen2.5-Omni, Covo-Audio-Chat-FD, Kimi-Audio | The length of one alternating block, not the response time |
| 3. Cascade + predictor | Relay with a supervisor | Freeze-Omni, MiniCPM-o 4.5 | The supervisor signal alone. Full pipeline is five to ten times slower |
| 4. Codec-free / thinking | No tickets, only hand signals | SALMONN-omni | Not yet standardized; the family is too young |
The reason the distinction matters is not aesthetic. Each family sets a different bound on what a published number actually means. A "200 millisecond" result from Family 1 is a theoretical floor on how long the single transformer takes to produce its reply. The same number from Family 2 is the length of one alternating block, not the response time: a Family 2 system still has to watch several blocks go by before it has formed a full reply. A Family 3 "200 ms" usually describes only the supervisor signal, while the full relay behind it reports separately, and that number is typically five to ten times higher. Three families, three different phenomena, one column header.
## 4. Family 1: Dual-stream with a neural codec
The reference system for this family is Moshi itself. Kyutai's decision that made the architecture tractable was the [Mimi codec](https://kyutai.org/Moshi.pdf), a streaming neural audio codec operating at 12.5 Hertz. Think of the codec as a smart compressor in the same spirit as MP3, but designed specifically so a language model can read its output the way it reads words. Every 80 milliseconds Mimi emits a small handful of "sound tokens" that capture tone and rhythm as well as content.
In Moshi, three streams of these tokens run side by side: one for the user's audio, one for the model's audio, and a third text stream for the model's "inner monologue," which is a running text draft of what the model is about to say. A single transformer reads all three streams together. The model's audio stream is fed back out through Mimi in reverse to become sound.
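A toy sketch of the dual-stream timeline may help. This is not Kyutai's code, and the token names are placeholders; the only details taken from the text above are the 12.5 Hz frame rate (one step every 80 ms) and the three parallel streams.

```python
# Toy illustration of the dual-stream layout (not Moshi's actual code).
FRAME_MS = 80  # 1000 ms / 12.5 Hz

user_audio  = ["u0", "u1", "u2", "u3"]   # codec tokens, user channel
model_audio = ["m0", "m1", "m2", "m3"]   # codec tokens, model channel
monologue   = ["t0", "t1", "t2", "t3"]   # text tokens, the inner monologue

for step, (u, m, t) in enumerate(zip(user_audio, model_audio, monologue)):
    # Both audio channels exist at every step, so listening and speaking are
    # simultaneous by construction: there is no separate turn-taking switch.
    print(f"{step * FRAME_MS:4d} ms   user={u}   model={m}   text={t}")
```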
That architecture has now been adopted by more than one lab. [NVIDIA PersonaPlex-7B-v1](https://arxiv.org/abs/2602.06053), released by NVIDIA ADLR in January 2026, initializes its weights directly from Moshi and fine-tunes on a corpus of 1,840 hours of synthetic customer-service dialogue and 410 hours of question-answering dialogue, generated by Qwen3-32B and GPT-OSS-120B as transcripts and rendered by [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) as speech, plus the [Fisher English corpus](https://catalog.ldc.upenn.edu/LDC2004S13) for casual dialogue. PersonaPlex's contribution is a hybrid system-prompt mechanism that conditions the model on a role (via text) and a voice (via a short audio sample). Code is MIT, weights are under the NVIDIA Open Model License, a bespoke license that permits commercial use with conditions. [Sesame's CSM-1B](https://github.com/SesameAILabs/csm), released under Apache 2.0, uses the same Mimi codec and a Llama backbone, trained on roughly one million hours of English conversational audio. Sesame's larger CSM-Medium at 8B parameters remains closed; only the 1B tier is public.
An honest caveat: "open weights" is not the same as "open data." Moshi's training corpus is known to include Fisher and undisclosed scraped English conversational audio. CSM publishes almost nothing about its 1-million-hour corpus. PersonaPlex is explicit about its synthetic corpus but inherits whatever Moshi was trained on at the base. Family 1 has the cleanest license stories for model weights, and the least transparent stories for the audio those weights learned from. That distinction is most of what Article 10 of this series will be about.
## 5. Family 2: Interleaved single-stream
The second family makes a different bet. Instead of running two audio streams in parallel, it packs speech and text into a single timeline with repeating blocks. The model reads a small block of text, then a small block of speech, then text, then speech, and so on. Full-duplex behavior is not really parallel here; it is very fast alternation. If the blocks are small enough, the outside observer cannot tell the difference.
The paper that named the design is [OmniFlatten](https://arxiv.org/abs/2410.17799), published by Alibaba's Tongyi Lab in October 2024. OmniFlatten is built on Qwen2-0.5B (yes, half a billion parameters, not seven) and uses a staged training recipe that progressively shrinks the interleaving grain: from a four-stream layout to three streams to two streams, with a final configuration of text-chunk size two and speech-chunk size ten. OmniFlatten was trained on 2,000 hours of synthetic dialogue rendered by [CosyVoice](https://github.com/FunAudioLLM/CosyVoice), making it one of the first full-duplex STS systems trained entirely on generated audio. The weights were not released; the productized descendant is [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni), which ships under Apache 2.0.
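The flatten trick is easy to show in miniature. The chunk sizes below (two text tokens, ten speech tokens) echo OmniFlatten's final configuration; the token contents and the function shape are toy assumptions, not the paper's implementation.

```python
# Toy flattening of text and speech into one serialized stream.
TEXT_CHUNK, SPEECH_CHUNK = 2, 10   # OmniFlatten's final chunk sizes

def flatten(text_tokens, speech_tokens):
    """Interleave fixed-size text and speech blocks on a single timeline."""
    stream, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        stream += text_tokens[ti:ti + TEXT_CHUNK];      ti += TEXT_CHUNK
        stream += speech_tokens[si:si + SPEECH_CHUNK];  si += SPEECH_CHUNK
    return stream

flat = flatten([f"T{i}" for i in range(6)], [f"S{i}" for i in range(30)])
print(flat[:14])   # ['T0', 'T1', 'S0', ..., 'S9', 'T2', 'T3']
# The model only ever sees one stream. "Full-duplex" is this alternation run
# fast enough that the seams are inaudible, so the block size is the floor
# on responsiveness.
```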
The family has grown rapidly. [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2) and [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) use close variants of the flatten idea. [LLaMA-Omni 2](https://arxiv.org/abs/2505.02625) is a Meta-LLaMA-based reimplementation. [Moonshot's Kimi-Audio](https://arxiv.org/abs/2504.18425), released under MIT, claims 13 million hours of speech pretraining. [Tencent's Covo-Audio and Covo-Audio-Chat-FD](https://huggingface.co/papers/2602.09823), released in March 2026 under CC BY 4.0, extend the pattern by adding a third kind of block (images) to the alternation, and ship a dedicated full-duplex variant alongside. That last release is worth calling out separately: as of April 2026, Covo-Audio-Chat-FD is the most permissively licensed full-duplex STS weight release in public. CC BY 4.0 is genuinely commercial-safe with attribution, which Family 2 has otherwise struggled to offer at full-duplex scale.
The tradeoff is that serialization has a built-in cadence. Turn-taking in Family 2 is a blocking pattern, not a concurrent behavior. The smallest unit of responsiveness is the block. This is why the fact that OmniFlatten achieves full-duplex at 0.5B parameters is interesting (the architecture scales down gracefully), and also why Family 2 latency numbers should always be read together with the block size. A 200-millisecond chunk cadence is not the same object as a 200-millisecond end-to-end response.
## 6. Family 3: Cascade with a chunk-level duplex predictor
Family 3 is the family that looks like a cascade and behaves, at short time scales, like a full-duplex system. Think back to the kitchen with the supervisor. The ASR, LLM, and TTS stages are still a line cook working stage by stage. The supervisor standing behind the line is a small extra model, a "state predictor," whose only job is to decide every fraction of a second whether the system should be listening, speaking, or cutting in. The predictor breaks the input and output into tiny slices so the three stages can overlap inside each slice rather than waiting for each other.
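A minimal sketch of the supervisor idea, with the state names and the toy decision rule as assumptions rather than any specific system's interface:

```python
# Toy chunk-level duplex controller. The relay it steers (ASR -> LLM -> TTS)
# is not shown; the point is that a tiny decision runs on every chunk.
LISTEN, SPEAK, INTERRUPT = "listen", "speak", "interrupt"

def decide(user_is_talking: bool, system_is_talking: bool) -> str:
    """Called once per audio chunk, a fraction of a second apart."""
    if user_is_talking and system_is_talking:
        return INTERRUPT   # barge-in: stop the TTS and flush its queue
    if user_is_talking:
        return LISTEN      # keep streaming chunks into the ASR
    return SPEAK           # hand the floor to the LLM -> TTS relay

for user, system in [(True, False), (True, True), (False, True), (False, False)]:
    print(decide(user, system))
# The controller itself answers in hundreds of milliseconds. The relay it
# triggers still takes its own time, which is why "predictor latency" and
# "real-scenario latency" are different numbers in the next paragraph.
```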
[Freeze-Omni](https://arxiv.org/abs/2411.00774), released by Tencent AI Lab and collaborators at NJU, Fudan, and NPU in November 2024, is the cleanest example. It pairs a [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) language model with a CTC-pretrained speech encoder and a TiCodec-based autoregressive speech decoder, then trains a chunk-level state predictor on 60,000 question-answer pairs using eight GPUs. The paper reports a [model-only latency of 160 to 320 milliseconds](https://arxiv.org/abs/2411.00774) and a real-scenario latency of roughly 1.2 seconds. That gap is the honest number: the state predictor itself runs in hundreds of milliseconds, but the full pipeline behind it takes more than a second to come back with a response on real hardware. The 160-millisecond figure is not a like-for-like comparison against Moshi's 200-millisecond number, and careful readers should not treat them as competitors on a single axis.
[OpenBMB's MiniCPM-o 4.5](https://github.com/OpenBMB/MiniCPM-o), a 9-billion-parameter on-device model, takes the same basic idea and applies it aggressively. It composes [SigLIP2](https://huggingface.co/google/siglip2-base-patch16-256) vision, [Whisper](https://github.com/openai/whisper) ASR, [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) TTS, and [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) language modeling into a single multimodal system, using time-division multiplexing to interleave the modalities. MiniCPM-o is the clearest evidence that Family 3 can be made to run on consumer hardware, since the entire stack fits on a Mac Studio or a modern NVIDIA 4090.
The family's value proposition is pragmatic. If your organization already has a strong ASR model and a strong LLM and a strong TTS model, Family 3 lets you reach full-duplex behavior without retraining the whole stack end-to-end. If you do not have those components, Family 3 is no cheaper than Family 1 or Family 2 and imports their latency and coordination problems at the same time.
## 7. Family 4: Codec-free with a thinking mechanism
The smallest of the four families, but architecturally distinct enough to deserve its own slot, is the codec-free line. To return to the kitchen image: the cook never writes a ticket, never reads an order aloud, never turns the conversation into printable text at any stage.
A second picture is the difference between sheet music and humming. The first three families all "write down" speech before doing anything with it. They turn audio into a sequence of discrete codec tokens that a transformer can read the way it reads words. Family 4 never writes it down. It keeps the audio as a continuous signal throughout, the way a person hums along to a melody without ever naming the notes.
[ByteDance's SALMONN-omni](https://arxiv.org/abs/2505.17060) is the representative release. SALMONN-omni does not use a neural audio codec at all. The model takes continuous audio embeddings from a [SALMONN](https://github.com/bytedance/SALMONN) encoder, runs them through a transformer, and emits an internal "thinking state" that decides when to speak versus listen, all without ever converting the audio into discrete codes.
The reason this matters even at a minority share is that it is the clearest counter-example to "integrated STS means a neural audio codec." Family 4 argues that speech tokenization is a design choice, not a requirement, and that the full-duplex behavior can emerge from continuous-space attention alone. The public tooling around SALMONN-omni is thinner than the other three families. Weights exist but the surrounding evaluation and benchmark plumbing is younger. Whether the family grows into a major line or stays a single-example curiosity is one of the live questions in STS research in 2026.
## 8. Closed commercial STS, alongside the families
The cleanest picture for the gap between the four open families and the closed commercial layer is a shop window. Behind the window is an open kitchen, which is the four families from the previous sections. All code visible. All architecture documented in papers. All weights downloadable with some license attached. In front of the window is the sales floor, where finished voice products are sold by the minute or by the token. You can watch a GPT-4o or Nova Sonic response come out the door. You can time it. You can compare it to a competitor. But the prep line that produced it is on the other side of the glass, and the vendor keeps the blinds drawn.
Most of the STS minutes actually used in production in 2026 are not spoken through any of the four open families above. They are spoken through closed commercial systems. [GPT-4o voice](https://openai.com/index/hello-gpt-4o/), introduced by OpenAI in May 2024, reports an [average latency of 320 milliseconds and a minimum of 232 milliseconds](https://openai.com/index/hello-gpt-4o/). [Gemini Live](https://deepmind.google/technologies/gemini/) has shipped on Google's apps since 2024, and the [Gemini 3.1 Flash Live API on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/live-api), released in March 2026, reports roughly 320 milliseconds first-token p50. [Amazon Nova Sonic](https://aws.amazon.com/bedrock/nova/), [Microsoft MAI-Voice-1](https://azure.microsoft.com/en-us/products/ai-services/), ByteDance's Doubao voice line, [Hume EVI](https://hume.ai/), and [Cartesia Sonic](https://cartesia.ai/sonic) all ship production voice stacks with latency, pricing, and demos published, but architectural internals kept private.
The honest reverse-engineering discipline, for readers trying to place these systems on the taxonomy, is that you can infer some things from behavior but not family membership. A system that can barge in gracefully, backchannel without hallucination, and recover from overlap without dropping state is very likely running something architecturally closer to Family 1 or Family 2 than a plain cascade. A system whose interruption behavior feels chunky and whose overlap handling falls back to a "please wait" cue is likely a Family 3 descendant. But vendors rarely confirm this in public, and behavior varies even within a single product across load conditions and model revisions. The short version: you can see the dish, but not the recipe.
The gap between the closed commercial layer and the open-weights layer is itself a business-model signal. Most progress on STS evaluation, on licensing clarity, and on architectural variety is happening in public, through research labs and Chinese academic groups. Most production usage is happening in private, behind a small number of commercial APIs from a small number of US hyperscalers and scaleups. Any investor thesis on voice AI in 2026 has to take a view on how that gap is going to close: whether the closed systems will be caught (Kyutai's Moshi, Sesame's CSM, Qwen2.5-Omni, and Covo-Audio-Chat-FD each argue yes, but at different scales and with different licenses), or whether the closed layer will stay ahead on quality and data scale indefinitely.
## 9. Why the taxonomy matters, and what comes next
Three consequences follow from the four-family split, and each one directly shapes a decision that matters outside the architecture community.
The first is that latency numbers are not cross-comparable. A 200-millisecond number from Moshi (Family 1) describes the theoretical floor of a dual-stream transformer on an L4. A 200-millisecond chunk cadence from OmniFlatten (Family 2) describes the size of an interleaving block. A 200-millisecond number from Freeze-Omni (Family 3) describes the state predictor alone, while the full pipeline's real-scenario number is closer to 1.2 seconds. Reading these three as if they are the same quantity is the most common way that product evaluations and benchmark tables mislead.
The second is that training-data implications differ wildly per family. Family 1 is hungriest for clean two-channel conversational speech. Moshi trained on Fisher and undisclosed English conversational corpora; PersonaPlex added 2,250 hours of synthetic customer-service dialogue on top. Family 2 can be trained largely or entirely on synthetic interleaved data, as OmniFlatten's 2,000-hour CosyVoice corpus demonstrated at 0.5B parameters. Family 3 leans on pretrained ASR and TTS corpora plus a small fine-tuning set for the state predictor. Family 4 needs continuous-embedding conversational audio, which is a quantity the field has not yet tried to scale. Picking a family is, in practice, picking a data bet. The forward pointer here is to [Article 04](../04-what-public-speech-datasets-contain/) on what public datasets actually contain, and to [Article 05](../../50-published/05-two-channel-imperative/) on why two-channel matters.
The third is that licensing exposure differs. Family 1 has the cleanest license stories for weights: Moshi under CC-BY 4.0, CSM-1B under Apache 2.0, PersonaPlex under the NVIDIA Open Model License. Family 2 ranges from MIT (Kimi-Audio) and CC BY 4.0 (Covo-Audio) at the permissive end to custom community licenses with commercial carve-outs (Baichuan-Omni) in the middle to paper-only with no weight release (OmniFlatten) at the far end. Families 3 and 4 are younger and less license-consistent. An enterprise buyer running a procurement review has to read at least five different license regimes to cover the open layer alone. That story will be the subject of Article 10 on consent and licensing.
Picking a family is, in practice, picking three bets at once. The cheat sheet below collapses the section into one table.
| If you are... | Likely family | Why |
|---|---|---|
| A lab chasing the shortest transformer forward pass | 1 | Fewest moving parts, one joint model reads all streams together |
| A startup minimizing training data cost | 2 | Can train full-duplex on 2,000 hours of synthetic interleaved audio, as OmniFlatten showed |
| An enterprise with a strong ASR + LLM + TTS stack already | 3 | Adds full-duplex behavior without retraining the pipeline end-to-end |
| A research group testing whether codec tokens are necessary at all | 4 | Only family that skips the neural codec entirely |
The architecture side of STS is no longer scarce. Four families, more than twenty-five public model releases, and a steady cadence of new arXiv papers mean that picking a model is now a question of fit, not availability. What remains scarce is the training audio those models learn from, and what truly separates a demo from a production system is the data. That is where the rest of this series goes.
---
*oto builds large-scale two-channel full-duplex conversational speech datasets for next-generation speech-to-speech models. If you are evaluating STS architectures and want to understand which data shape each family actually needs, [get in touch](https://oto.earth/contact) or [access the investor data room](https://oto.earth/investors).*
---
_Originally published at [https://fullduplex.ai/blog/pipeline-to-integrated](https://fullduplex.ai/blog/pipeline-to-integrated)._
_Part of **The STS Series** · 03 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 04 — The data ceiling
_Canonical: https://fullduplex.ai/blog/data-ceiling · Markdown: https://fullduplex.ai/blog/data-ceiling/md_
---
title: "The data ceiling"
description: "Full-duplex conversational recordings at internet scale do not exist. The two escape hatches engineers reach for first — better separation AI and bigger YouTube scrapes — do not escape. Full-duplex STS still leans on a 2004 telephone corpus for its post-training recipe."
article_number: "04"
slug: data-ceiling
published_at: 2026-04-19
reading_minutes: 16
tags: ["data", "datasets", "full-duplex"]
canonical_url: https://fullduplex.ai/blog/data-ceiling
markdown_url: https://fullduplex.ai/blog/data-ceiling/md
series: "The STS Series"
series_position: 4
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Why YouTube and podcasts cannot train full-duplex STS
## §1 Terms and thesis
**Full-duplex conversational recordings at internet scale do not exist.** That is the sentence this article defends. Before the defense, two terms need pinning down.
**Full-duplex** describes a conversation in which both parties can speak and listen at the same time, the way humans actually talk. It is the opposite of a walkie-talkie, where one side speaks and the other side waits. A full-duplex speech-to-speech (STS) model has to handle overlap, barge-in, backchannel, and pause without pretending a conversation is strictly turn-by-turn. [Article 02](/50-published/02-full-duplex-threshold/) treats this threshold in depth.
**Full-duplex training data** is recorded conversation that preserves the information a model needs to learn full-duplex behavior. The minimum bar is **speaker isolation at the source**: each participant written to a separate audio track, so an overlap between two people is two clearly attributed events rather than one acoustic blur. In the speech-research literature this property is almost always called "two-channel" or "dyadic two-track." This article uses "full-duplex," "full-duplex-ready," and "two-channel" interchangeably to mean the same thing: recordings from which a full-duplex model can actually learn turn-taking.
Now the thesis. **The largest open corpus of full-duplex conversational speech on the internet was collected in 2004.** It is called Fisher, it was organized by the U.S. Linguistic Data Consortium [(Cieri et al. 2004)](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2004-fisher.pdf), it contains approximately 1,960 hours of English telephone speech across 11,699 conversations, and each speaker was written to a separate disk track at collection time. No downstream separation was ever needed. That is not a historical footnote. As of April 2026, Fisher is still the default post-training corpus for state-of-the-art open-weights full-duplex STS models.
**Everything the public internet has released since then is one of four things: smaller, or mono, or synthetic, or not legally redistributable for AI training.** Switchboard (1997) is smaller at ~260 hours. CANDOR (2023) is CC BY-NC, which forbids commercial training use. Emilia (2024-2025) reaches 216,000 hours but is mono with source-level license ambiguity. OmniFlatten's internal training corpus (2024) is 100% TTS-synthesized. Abaka AI's 20,000-hour full-duplex corpus (March 2026) is commercial and direct-to-enterprise with gated pricing. Not one of these satisfies all three of (full-duplex at source, public, commercially usable at scale).
**The asymmetry this leaves is categorical, not gradual.** Modern STS models pre-train on millions of hours of mono audio. Moshi's backbone sees roughly 7,000,000 hours of web-scraped English speech [(Défossez et al. 2024)](https://arxiv.org/abs/2410.00037). Sesame CSM scales past 1,000,000 hours. Both then fine-tune their actual full-duplex behavior on a few thousand hours of Fisher. Twenty-two-year-old telephone calls are carrying the load the rest of the pipeline cannot.
This article defends the thesis in three moves. **First, it shows why the two obvious escape hatches out of full-duplex scarcity do not escape.** Can separation AI rescue mono recordings? Can YouTube and podcasts supply the conversational data we need? Both are the first questions a careful engineer asks. Both have answers that require engaging with the counterargument fairly rather than dismissing it. Second, it walks what open corpora actually contain, corpus by corpus. Third, it proposes a map of what data is fit for what training phase, which turns the scarcity into an operational requirements list.
Figure F1: corpus scale vs vintage scatter. Y axis log hours, X axis release year. Shape encodes channel count (1 vs 2), color encodes license permissiveness for commercial training. Fisher and AMI cluster in the lower-left; Emilia and GigaSpeech sit far to the upper-right on the mono-only row. The intersection of (full-duplex at source) and (large scale) and (commercial rights) is empty.
## §2 Why mono isn't enough, and why separation AI can't rescue it
### 2.1 The mono collapse, briefly
Full-duplex training data breaks the moment two speakers collapse onto a single channel. A 200 ms overlap that was two clearly attributed events becomes one acoustic blur; backchannels get absorbed into the main speaker's envelope; turn boundaries move from deterministic timestamps to inferred ones. [Article 05 (The two-channel imperative)](/50-published/05-two-channel-imperative/) gives the signal-processing treatment of this collapse. The rest of this section takes it for granted and engages the strongest counterargument: **can better separation AI undo the mono collapse after the fact?**
### 2.2 The separation counterargument, stated fairly
A natural reply is: source separation AI is improving fast, and diarization is a mature field. If mono recordings can be split into per-speaker tracks with high accuracy, the gap between mono web audio and clean full-duplex recordings should close. Current best-in-class separation models include [SepFormer](https://arxiv.org/abs/2010.13154) (2021), [Conv-TasNet](https://arxiv.org/abs/1809.07454) (2019), [TDANet](https://arxiv.org/abs/2209.15174) (2023), and [MossFormer2](https://arxiv.org/abs/2312.11825) (2024). Current best-in-class diarization includes [pyannote 3.x](https://arxiv.org/abs/2304.13880), NVIDIA NeMo, and [EEND-EDA](https://arxiv.org/abs/2005.09921). These systems exist, they are public, and they are actively used in production pipelines. The question is not whether separation AI works. It is whether it works well enough, under the conditions real conversation contains, to produce training labels clean enough for a model that has to place a turn onset within ±50 ms of the right moment.
### 2.3 Where the ceiling actually is
Two scoreboards appear in this section, so it is worth pausing on what each one means. SI-SDRi (scale-invariant signal-to-distortion ratio improvement) measures how cleanly a separation model pulls one voice out of a mix. It is reported in decibels on a log scale. As a rule of thumb: 10 dB is "cleaner than raw noise," 20 dB is "close to a clean studio voice," 25 dB is "indistinguishable from the original by ear." WER (Word Error Rate) measures how often the downstream speech-to-text system gets words wrong, as a percentage where 0% is perfect and 50% is garbage. The first score is the separation stage; the second is the transcription stage of the same pipeline.
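For readers who prefer the two scoreboards as formulas, here is a minimal sketch of both metrics. These are the standard textbook definitions; the demo signals and error counts are made up.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB: how much of the estimate is the target voice."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)   # optimal rescaling
    target, noise = alpha * ref, est - alpha * ref
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def wer(substitutions, deletions, insertions, n_reference_words):
    """Word error rate: edit operations as a fraction of reference words."""
    return (substitutions + deletions + insertions) / n_reference_words

clean = np.sin(np.linspace(0, 20 * np.pi, 8000))
print(round(si_sdr(clean, clean + 0.05 * np.random.randn(8000)), 1))  # ~23 dB
print(wer(20, 8, 7, 100))                                             # 0.35 -> 35% WER
# SI-SDRi, the "improvement" variant reported on benchmarks, is
# si_sdr(ref, separated) minus si_sdr(ref, original_mixture).
```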
WSJ0-2mix is the canonical benchmark and the research ideal: two studio recordings added together in software, then asked to be pulled apart again. On it, SepFormer scores about 22.3 dB and MossFormer2 about 24.1 dB. Near-clean by the rule of thumb above. Move to WHAMR!, which folds in noise and reverberation, and the best numbers fall to roughly 14 to 17 dB. Move to a benchmark recorded in an actual room at natural overlap rates and the numbers collapse.
The most load-bearing evidence sits in [LibriCSS](https://arxiv.org/abs/2001.11482) (Chen et al. 2020), a benchmark designed to measure WER after separation on recordings with controlled overlap rates. At 30% overlap, the condition closest to natural conversation, single-channel ASR with no separation produces a 34.6% WER. Roughly one word in three is wrong. A 7-channel microphone array with neural masking brings that down to 18.4%, still roughly one word in five. At 40% overlap the pair is 43.2% and 21.6%. These are not error rates that support supervised fine-tuning of a model whose job is to place a turn onset within ±50 ms of the correct moment.
Figure F2: separation AI accuracy degradation. Horizontal bar chart with WSJ0-2mix, WHAMR!, LibriCSS-OV20, LibriCSS-OV30, LibriCSS-OV40 on the y axis; best-published SI-SDR (top panel) and best-published WER after separation (bottom panel) on the x axis. The drop from synthetic benchmark to real overlap is visually dominant.
### 2.4 Diarization error on real conversation
Separation is only half of the pipeline. Even if you have two clean tracks, you still have to answer: which track belongs to which speaker, turn by turn. That is diarization, the "who spoke when" stage. Its error metric is DER (Diarization Error Rate), the fraction of audio time labeled with the wrong speaker, a missed speaker, or a hallucinated one.
[pyannote 3.1](https://huggingface.co/pyannote/speaker-diarization-3.1), the current research default, reports a DER of about 22.4% on AMI (meeting audio from a single distant microphone) and 11.3% on VoxConverse (YouTube-style interviews). These are good numbers for research purposes.
**At 22% DER, roughly one turn in four is mis-attributed to the wrong speaker.**
A full-duplex model trained on labels at this quality learns a world where "the model" and "the user" swap voices at random one time out of four. That is not the kind of label noise that averages out at scale. It corrupts the exact structure, who-speaks-when, that the model is supposed to learn.
Figure F4: DER visualization. Four sequential turn-blocks along a horizontal strip. Three are labeled correctly (green, alternating "user" / "model"), one is flipped and shown in red with the label "wrong speaker." Caption: one turn in four mis-attributed is what 22% DER looks like in practice.
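As a formula, DER is simple. Real scorers add forgiveness collars and overlap handling, so treat this as the textbook version only:

```python
def der(missed_s, false_alarm_s, confusion_s, total_speech_s):
    """Diarization error rate: mislabeled speech time over total speech time."""
    return (missed_s + false_alarm_s + confusion_s) / total_speech_s

# On an hour with ~3,600 s of speech, 22% DER means roughly 13 minutes of it
# was missed, hallucinated, or attributed to the wrong speaker.
print(round(der(120, 180, 490, 3600), 2))   # 0.22
```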
The [CHiME-6 dinner-party challenge](https://chimechallenge.org/challenges/chime6/) makes the same point from a different angle. CHiME-6 is a dinner-party audio set: six people, real room, real background noise, the everyday hard case. The challenge has two tracks. Track 1 hands the system a perfect speaker label table (a human wrote it). Track 2 asks the same system to build its own label table first and then transcribe. On Track 1, the baseline reaches ~51% WER. On Track 2, WER rises into the high 60s to 80s depending on the setup. The 15 to 30 percentage-point gap is the cost of building the label table yourself. It is the price every real-world pipeline pays. Oracle-label quality is what training needs. System-label quality is what training pipelines actually produce.
### 2.5 Compounding error is the real argument
Accuracy does not add across a pipeline. It multiplies.
Picture an assembly line with three stations. Station 1 is separation, which splits the mono mix into per-speaker tracks. Station 2 is diarization, which labels which track is which speaker at which time. Station 3 is ASR, which transcribes what was said. Each station has its own error rate. The labeled training data that drops off the end of the belt inherits the combined error of all three. And fixing one station can damage another.
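A back-of-the-envelope version of that multiplication, with per-station accuracies that are illustrative assumptions rather than measurements:

```python
# Label quality compounds multiplicatively across the three stations.
separation_ok  = 0.85   # fraction of overlapped speech separated cleanly
diarization_ok = 0.78   # fraction of time attributed to the right speaker
asr_ok         = 0.80   # 1 - WER on the separated, attributed audio

label_quality = separation_ok * diarization_ok * asr_ok
print(f"usable label fraction ~ {label_quality:.2f}")   # ~0.53
# Three stations that each look "mostly fine" in isolation leave roughly half
# of the derived labels trustworthy, before any interaction effects.
```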
[Raj et al. 2021](https://arxiv.org/abs/2011.02014) caught this composition on record. They added a separation front-end to a LibriCSS pipeline. On the hard sections, where speakers overlap, it dropped concatenated-minimum-permutation WER from 27.1% to 13.4%. A clear win. On the easy sections, where only one person was talking, the same front-end raised WER from 11.2% to 12.4%. A small loss, but a loss. The separator introduced artifacts the downstream ASR had never been trained on. Fix the overlap, and a fraction of the non-overlap breaks.
The [CHiME-8 DASR](https://arxiv.org/abs/2407.16447) organizers in 2024 state the ceiling plainly. Cornell et al. write that "accurate speaker counting in the first diarization pass is crucial to avoid compounding errors," and that "all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios." That is the peer-reviewed position of the people running the benchmark in 2024.
The takeaway is not "separation is broken." It is that separation plus diarization plus ASR, applied to mono web audio, produces training labels of a quality that the downstream model cannot tolerate. The model's job is to place a turn onset within tens of milliseconds. The upstream pipeline cannot deliver labels at that precision from a mono source. That is the ceiling.
Figure F3: compounding error bar chart. Two grouped bars per scenario. Scenario 1: CHiME-6 oracle diarization WER vs CHiME-6 system diarization WER. Scenario 2: LibriCSS zero-overlap WER with vs without separation front-end (Raj 2021). Scenario 3: WSJ0-2mix SI-SDRi vs LibriCSS WER after separation. Caption: clean benchmarks and real conditions differ by 20+ percentage points; this is the gap mono recordings start from.
## §3 Why YouTube and podcasts do not supply full-duplex data
### 3.1 The apparent abundance, and the scale illusion
**YouTube hosts billions of hours of speech. Podcasts, tens of millions.** Two-speaker interview formats are a well-established genre in both. At first glance this supply dwarfs Fisher by four or five orders of magnitude, and the careful engineer's second question follows naturally: if mono is hard to rescue, isn't there enough already-two-track content on the internet to skip the problem entirely?
**The answer is no, and the gap is not a rounding difference.** Once the license filter, the content-shape filter, and the training-phase filter are applied to the apparent abundance, the fraction that can legitimately be used to teach full-duplex turn-taking collapses by roughly seven orders of magnitude. The rest of this section walks each filter in turn.
Figure F7: apparent scale vs full-duplex-usable scale. Horizontal bars on a log scale in hours. "YouTube total content" at ~10 billion h (order-of-magnitude estimate), "Podcast catalog" at ~100 million h, "Public full-duplex-ready + redistributable" at ~2,000-3,000 h (Fisher + Switchboard + small academic). Caption: the apparent supply and the usable supply are separated by roughly seven orders of magnitude. The difference is categorical, not quantitative.
### 3.2 License, consent, and content shape
The top filter is legal. [YouTube's Terms of Service](https://www.youtube.com/static?template=terms) explicitly prohibit automated extraction and forbid training machine learning models on YouTube content without authorization. YouTube has added [opt-in third-party AI training controls](https://support.google.com/youtube/answer/15509945) that default off. A creator must actively grant permission before their audio is a legitimate training input. The position was pressure-tested through 2024 and 2025 by cases such as [Millette v. OpenAI](https://techcrunch.com/2024/08/05/youtuber-files-class-action-suit-over-openais-scrape-of-creators-transcripts/) and by YouTube's public statement, via CEO Neal Mohan, that scraping would be a "clear violation" of its terms.
Podcasts are a softer problem that is still a problem. RSS delivers the audio but does not license it. The RSS co-creator Dave Winer launched [a separate protocol called RSL](https://techcrunch.com/2025/09/10/rss-co-creator-launches-new-protocol-for-ai-data-licensing/) in 2025 precisely because RSS contains no training-license field. Interview guests and background participants are rarely under any contract that allows their voices to be used for model training.
Beneath the license filter is a content-shape filter. The majority of YouTube and podcast audio is not spontaneous dyadic conversation. It is monologue, scripted interview, edited panel, sports commentary, or audiobook narration. The shows with two speakers at once tend to be professionally produced, which means cross-talk has been cut out in post. Editing removes the backchannel, the repair, the hesitation at the turn-transition point, which are precisely the phenomena a full-duplex model has to learn.
### 3.3 What each training phase actually needs
Even if licensing and content-shape were somehow resolved, a third filter applies. **STS models do not train in one phase. They train in at least three**: a self-supervised pre-training phase that learns audio representations from raw waveforms, a mid-training phase that teaches the model to handle dialogue-shaped inputs, and a post-training phase that shapes the actual turn-taking behavior. Each phase tolerates different defects in its input.
[HuBERT](https://arxiv.org/abs/2106.07447) pre-trained its LARGE model on LibriLight, roughly 60,000 hours of mono audiobook audio. [Wav2Vec 2.0](https://arxiv.org/abs/2006.11477) used the same source. Neither model needed full-duplex data, dialogue structure, or turn boundaries to learn useful representations. For this phase, scraped-quality mono works. Moshi's backbone is pre-trained on about 7 million hours of web-scale mono speech with Whisper-generated pseudo-labels. Sesame CSM pre-trains on about 1 million hours of similar material. **The pre-training phase consumes mono audio by the million-hour. That is what YouTube-like corpora are fit for.**
The problem is that pre-training is not where a model learns to listen while it speaks. Sesame CSM, with its 1 million hours of mono pre-training, does not have a native full-duplex mode. Scaling mono pre-training is not sufficient. The full-duplex behavior is learned downstream, at the post-training stage where both Moshi and NVIDIA PersonaPlex [(Roy et al. 2026)](https://arxiv.org/abs/2602.06053) converge on the same answer: Fisher, or small samples of it. PersonaPlex in particular fine-tunes on 1,217 hours of Fisher alongside 2,250 hours of synthetic. **Real full-duplex dialogue carries 35% of the fine-tuning weight in a state-of-the-art open-weights STS model from January 2026.**
**YouTube-grade mono audio is fit for pre-training, partly fit for mid-training with careful curation, and structurally unfit for post-training. The post-training phase is the one the full-duplex behavior lives in, and no amount of YouTube scraping fills it.**
## §4 The training phase × data type matrix
Put the findings from §2 and §3 together and the picture is a grid, not a list. Three training phases. Four data types. Twelve cells, each with a different fitness answer.
The phases are pre-training (self-supervised representation learning on unlabeled audio), mid-training (continued pre-training and modality alignment on dialogue-shaped inputs), and post-training (supervised fine-tuning plus RLHF or DPO, where turn-taking behavior is actually shaped). The data types are web mono (YouTube, podcast, audiobook), public full-duplex dyadic (Fisher, CANDOR, AMI), synthetic dialogue (OmniFlatten's 2,000-hour CosyVoice-generated corpus, PersonaPlex's Chatterbox-rendered dialogs), and commercial full-duplex (Abaka AI's 20,000 hours, in-house collections like Kyutai's 170-hour seed).
Reading the grid row by row. At pre-training, web mono is the default across the field. HuBERT LARGE uses 60,000 hours of LibriLight. Moshi's backbone uses 7,000,000 hours of web-scale speech. Sesame CSM uses 1,000,000 hours. Public full-duplex dyadic corpora are not used at this phase because they are too small; Fisher's 2,000 hours is a rounding error against 7 million. Synthetic pre-training is rare in the STS literature because it offers no scale advantage. Commercial full-duplex at pre-training scale does not exist as a public recipe.
At mid-training, the picture diversifies. Freeze-Omni's 110,000-hour ASR corpus [(Wang et al. 2024)](https://arxiv.org/abs/2411.00774) sits in the web-mono column. Moshi's diarization-simulated multi-stream pre-training is a hybrid that treats mono web audio as if it were full-duplex by labeling speaker activity. Synthetic mid-training appears in OmniFlatten's Stage 1 modality-alignment pairs. Public full-duplex dyadic at mid-training scale is mostly absent from the recipes we have read.
**At post-training, the grid collapses toward two columns.** Web mono is not used, because turn-taking behavior is not learned from content without speaker separation. Public full-duplex dyadic is where Fisher and its relatives carry the day: Moshi's full-duplex fine-tune, PersonaPlex's 1,217-hour Fisher portion, every top-performing open-weights full-duplex STS that has published a recipe. Synthetic post-training is used additively, never as a replacement. PersonaPlex's 2,250 hours of synthetic sit alongside, not instead of, Fisher. Commercial full-duplex is the column that is structurally available but not yet dominant in published recipes. Abaka AI's March 2026 announcement proves the tier exists commercially. No peer-reviewed open-weights model has yet been published with commercial full-duplex as the majority post-training source.
Figure F5: training phase × data type matrix. Three rows (pre-training / mid-training / post-training) by four columns (web mono / public full-duplex dyadic / synthetic / commercial full-duplex). Each cell carries a traffic-light color (green fit, yellow partial, red unfit) plus one anchoring example. Central figure of the article. Legend: green = documented fit with at least one cited model recipe, yellow = works additively or in hybrid, red = structurally unfit or not documented.
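For readers who want the grid as data rather than prose, here is the same matrix condensed into a small structure. The fit / partial / unfit labels are this article's reading of the recipes cited above, not a field standard.

```python
# The §4 matrix as data: phase -> data type -> fitness (this article's reading).
matrix = {
    "pre-training": {
        "web mono":               "fit",      # HuBERT, Moshi backbone, CSM
        "public full-duplex":     "unfit",    # thousands of hours vs millions needed
        "synthetic dialogue":     "unfit",    # rare; no scale advantage
        "commercial full-duplex": "unfit",    # no public recipe at this scale
    },
    "mid-training": {
        "web mono":               "partial",  # fit with careful curation
        "public full-duplex":     "partial",  # mostly absent from published recipes
        "synthetic dialogue":     "partial",  # modality-alignment pairs
        "commercial full-duplex": "unfit",    # not documented in public recipes
    },
    "post-training": {
        "web mono":               "unfit",    # no per-speaker turn structure
        "public full-duplex":     "fit",      # Fisher and relatives carry the load
        "synthetic dialogue":     "partial",  # additive, never a replacement
        "commercial full-duplex": "partial",  # exists, not yet dominant
    },
}
print(matrix["post-training"])
```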
**The phase where full-duplex behavior is actually learned is served by exactly one column of data that exists in meaningful quantities today, and that column is full-duplex dyadic conversation at real overlap rates with redistribution rights.** Fisher is ~2,000 hours of it. Abaka AI claims 20,000 hours of it commercially. Everything else is either additive or unfit. This is the scarcity the rest of the article develops.
## §5 Public corpus walkthrough
What exists publicly is worth naming specifically, because the catalog drives the scarcity argument. Nine corpora carry most of the weight.
[Fisher](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2004-fisher.pdf) (LDC 2004, ~1,960 hours across 11,699 English telephone conversations, full-duplex by construction with each speaker written to a separate track at 8 kHz mu-law, LDC membership fee required, research-default with commercial tier negotiable). The anchor.
[Switchboard](https://catalog.ldc.upenn.edu/LDC97S62) (LDC 1997, ~260 hours, full-duplex telephone, same LDC licensing). Fisher's grandparent. Still cited, but one-eighth the scale.
[CANDOR](https://betterup.com/candor) (BetterUp 2023, ~850 hours across ~1,700 two-person video calls, CC BY-NC 4.0). Modern, naturalistic, full-duplex, and licensed out of commercial training. The non-commercial clause blocks the main use case.
[AMI](https://groups.inf.ed.ac.uk/ami/corpus/) and [ICSI](https://groups.inf.ed.ac.uk/ami/icsi/) (meeting corpora from 2005 and 2004, ~100 hours and ~72 hours, multi-channel close-talk plus distant microphones). Useful for diarization benchmarks and multi-party research. Awkward for dyadic STS training because the speaker count per session is four to six.
[Emilia](https://emilia-dataset.github.io/Emilia-Demo-Page/) (Amphion 2024-2025, ~216,000 hours total). The headline number hides a license split. The core 101,000 hours are CC BY-NC 4.0, non-commercial only. The Emilia-YODAS extension adds ~114,000 hours under CC BY 4.0 but is upstream-dependent on YODAS2's YouTube CC-BY 3.0 tagging. Amphion states in the dataset card that it "does not own the copyright to the audio files." The corpus is also mono and segmented to single-speaker-per-clip, so even for the commercially licensable portion the full-duplex signal is absent.
Several 2025-2026 academic dyadic sets (InterActSpeech, DialogueSidon, MultiDialog, DeepDialogue, MLC-SLM) add full-duplex or dialogue-shaped material at smaller scale. Useful as research anchors. Licenses mixed, hour counts mostly in the tens to low hundreds. Too small individually to carry a fine-tune, potentially useful in aggregate.
Figure F6: dataset comparison table. Columns: hours, channel count, speakers per session, license, commercial redistribution allowed. Rows: the corpora above plus Emilia split into core + YODAS. Traffic-light highlighting on the commercial-redistribution column makes the fitness landscape readable at a glance.
**Across the public catalog, the intersection of (full-duplex at source) and (commercial redistribution permitted) and (sufficient scale for fine-tuning) contains essentially Fisher, Switchboard, and a handful of smaller academic sets.** That intersection has not meaningfully expanded in twenty years of public dataset releases.
## §6 What synthetic data can and can't do
Synthetic dialogue is the third escape hatch the literature has explored, and it is the most interesting one because it works, partially, in ways that clarify what real data is for.
The strongest evidence that synthetic data works is [OmniFlatten](https://arxiv.org/abs/2410.17799), which trained a 0.5B-parameter model to usable full-duplex behavior on roughly 2,000 hours of dialogue that was 100% generated by the CosyVoice TTS system. No real full-duplex recordings at any stage. The result was not state-of-the-art, but it crossed the threshold of "the model does the behavior." So the ceiling on synthetic is not "zero."
The ceiling argument is more subtle. Synthetic dialogue is bounded in distribution by the TTS model that generates it. Prosody collapses toward the TTS's prior. Backchannel frequency becomes rule-based because the generator has to be told when to emit a backchannel. Disfluencies are either absent or scripted. Overlap structure reflects the script's turn taking, not the spontaneous timing humans produce. You cannot learn a behavior that was not in the generator's output distribution.
The honest working pattern in state-of-the-art recipes is additive. Moshi's 20,000 hours of synthetic instruction data are generated by Kyutai's own multi-stream TTS, which was itself trained on 170 hours of real full-duplex Kyutai recordings. That is roughly a 100× amplification from a real seed to a synthetic extension, but the real seed is not removable. PersonaPlex's 2,250 hours of synthetic customer-service and QA dialogs sit alongside 1,217 hours of real Fisher. The mix is roughly 35% real and 65% synthetic. The real portion is not the larger half, but it is the half that carries the in-distribution anchor.
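The seed-and-amplify arithmetic from those two recipes, laid out explicitly (the hour counts are the ones cited above):

```python
# Moshi: real full-duplex seed -> in-house TTS trained on it -> synthetic extension.
real_seed_h, synthetic_h = 170, 20_000
print(f"Moshi amplification ~ {synthetic_h / real_seed_h:.0f}x")   # ~118x

# PersonaPlex post-training mix: real Fisher hours vs synthetic hours.
fisher_h, persona_synthetic_h = 1_217, 2_250
real_share = fisher_h / (fisher_h + persona_synthetic_h)
print(f"PersonaPlex real share ~ {real_share:.0%}")                # ~35%
# In both cases the synthetic majority sits downstream of a real seed; remove
# the seed and the amplification has nothing to copy.
```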
**Synthetic dialogue shifts the training curve. It does not replace the need for real full-duplex data at the post-training stage. It multiplies the real seed.** The scarcity economics therefore sit on the size of the seed, not on the total hours fed to the model. A lab with 170 hours of real full-duplex can produce 20,000 hours of synthetic. A lab with zero hours of real full-duplex produces zero hours of useful synthetic. Articles 05 and 06 develop this further.
## §7 Commercial full-duplex data market
Outside the public corpus catalog sits a commercial tier that has begun to price full-duplex conversational data directly. [Abaka AI](https://www.abaka.ai) announced a 20,000-hour commercial corpus in March 2026, described as "100% real human-to-human" and delivered with full-duplex physical source isolation across seven languages. Pricing is direct-to-enterprise and not public. Adjacent suppliers in the call-center outsourcing industry have sold recorded audio for ASR training for years, but the full-duplex requirement is a newer ask and the supply is thinner than the headline numbers suggest.
The terms worth verifying per vendor are redistribution rights, consent documentation for every speaker, language coverage, and true channel isolation at source (as opposed to post-hoc separation). Article 10 covers the consent and licensing layer in detail. **The public corpus is not the complete picture, and a commercial procurement path exists for the full-duplex post-training phase.**
## §8 Eight requirements preview and scarcity economics
The §4 matrix produces a natural requirements list for training data intended for full-duplex post-training. Article 06 develops each in depth. In one sentence each: full-duplex capture at source, spontaneous dyadic structure, realistic overlap rate distribution, multi-register coverage across topics and emotions, speaker diversity sufficient to generalize, documented per-speaker consent for AI training use, commercial redistribution rights, and phase-fit labeling so the data can be routed to the training stage it actually helps. This is a long list because each item independently blocks usability. A corpus that satisfies six of eight is not 75% useful. It is zero percent useful for whichever training run needs the two it fails.
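The "six of eight is zero percent useful" logic is a conjunction, not a score. A toy check, with the requirement names paraphrased from the list above and the failing corpus invented for illustration:

```python
# The eight requirements gate usability jointly: one failure blocks the corpus
# for whichever training run needed that property.
REQUIREMENTS = [
    "full-duplex capture at source",
    "spontaneous dyadic structure",
    "realistic overlap rates",
    "multi-register coverage",
    "speaker diversity",
    "documented per-speaker consent",
    "commercial redistribution rights",
    "phase-fit labeling",
]

def usable_for_post_training(corpus):
    return all(corpus.get(req, False) for req in REQUIREMENTS)

example = {req: True for req in REQUIREMENTS}
example["commercial redistribution rights"] = False   # e.g. a CC BY-NC release
example["documented per-speaker consent"] = False
print(usable_for_post_training(example))   # False: six of eight is not 75% useful
```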
**Measured in hours available for legitimate commercial training as of April 2026, the intersection of all eight requirements runs to the low thousands in the public domain plus whatever the commercial tier will sell.** For comparison, language-model pre-training corpora are measured in trillions of tokens. The asymmetry is the investment thesis. Full-duplex speech-to-speech is architecturally solved enough that four distinct model families are shipping public weights ([Article 03](https://fullduplex.ai/blog/pipeline-to-integrated)). **The bottleneck is not architecture. It is how many hours of full-duplex dyadic audio exist with the right rights attached.**
Figure F8: scarcity triangle. Three overlapping regions (full-duplex physical capture / commercial redistribution rights / scale above 1,000 hours). Each labeled. Shaded intersection called out as "fit for full-duplex post-training." A small number printed next to the intersection: "Fisher ~2,000h + Abaka ~20,000h commercial + small academic sets. Total scale comparable to one Saturday of YouTube uploads."
## §9 Forward pointers
This article framed the data side of the speech-to-speech stack. Three other articles go deeper on the pieces it touched.
[Article 05](/50-published/05-two-channel-imperative/) (two-channel imperative, already published) is the longer argument for why mono audio breaks full-duplex training specifically at the label level. §2 of this article references it, and readers who want the signal-processing treatment of the mono-vs-stereo question should read that piece next.
Article 06 (eight requirements for next-gen STS training data) takes the one-sentence list in §8 and turns it into a spec sheet with operational definitions. If §4's matrix is the map, Article 06 is the site survey.
Article 10 (consent, licensing, and the opt-in economy for conversational data) takes the legal layer in §3 and the commercial market in §7 and treats them as one market-design problem. That piece is where the AI training lawsuit landscape and the opt-in economics get the full treatment.
**Data for full-duplex STS is a phase-fit problem, not a total-hours problem.** The bottleneck is how many hours of full-duplex dyadic conversation exist with commercial redistribution rights, for the post-training phase where turn-taking is actually learned. Everything else in the stack is already moving.
---
_Originally published at [https://fullduplex.ai/blog/data-ceiling](https://fullduplex.ai/blog/data-ceiling)._
_Part of **The STS Series** · 04 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 05 — Foundation before vertical
_Canonical: https://fullduplex.ai/blog/foundation-before-vertical · Markdown: https://fullduplex.ai/blog/foundation-before-vertical/md_
---
title: "Foundation before vertical"
description: "Full-duplex STS sits between the GPT-2 and GPT-3 moments. Asking “which vertical wins first?” in 2026 is a category error — the constraint is whether the foundation the verticals will sit on exists yet. A thesis essay on the foundation threshold, the 30×–150× data gap, and six plausible routes to 100,000+ hours of two-channel dialogue."
article_number: "05"
slug: foundation-before-vertical
published_at: 2026-04-26
reading_minutes: 14
tags: ["foundation", "investment", "data"]
canonical_url: https://fullduplex.ai/blog/foundation-before-vertical
markdown_url: https://fullduplex.ai/blog/foundation-before-vertical/md
series: "The STS Series"
series_position: 5
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Foundation before vertical
Investors evaluating voice AI in 2026 keep asking a reasonable-sounding question. **Which vertical will speech-to-speech AI win first? Call centers? Medical documentation? Legal? Education? Gaming?** The premise of that question is that the foundation is ready and the remaining problem is product-market fit. Eighteen months of data inventory, benchmark results, and model releases suggest the premise is wrong. Full-duplex speech-to-speech sits somewhere between the GPT-2 moment and the GPT-3 moment. Asking which vertical to chase is like asking in 2019 whether the first billion-dollar LLM company would be in legal or in medical. The answer then was "neither, because the foundation is not ready." The answer for STS in 2026 is the same.
## 1. The VC question and the wrong answer
The vertical-first framing comes naturally to people who have financed a decade of SaaS. Pick a vertical with pain, ship a narrow product faster than the incumbents, compound through distribution. For speech-to-speech AI in 2026, this framing is a category error. **The constraint is not which market to address. The constraint is whether the foundation that verticals will sit on exists yet.**
Text LLMs went through the same confusion in 2019 and 2020. GPT-2 (2019) could write paragraphs but not reliably answer domain questions. Vertical LLM startups at that stage either built their own domain foundation from scratch (and lost) or waited. GPT-3 (2020) flipped the economics. Post-GPT-3, Harvey raised its Series A five months after ChatGPT shipped ([press coverage](https://techcrunch.com/2023/04/26/harvey-21m-series-a/)). Hippocratic AI raised a $50M seed six months after ([press coverage](https://www.reuters.com/business/healthcare-pharmaceuticals/generative-ai-startup-hippocratic-ai-raises-50-million-seed-round-2023-05-16/)). Neither would have been financeable eighteen months earlier.
**The right question for STS in 2026 is not "which vertical?" but "is the foundation data bottleneck closing, and on what timeline?"** The rest of this article works through what is known about the answer.
One note on epistemic status before continuing. **This article is a thesis essay, not a research summary.** Facts and hypotheses are tagged differently. Facts include the public supply total (2,000 to 3,000 hours), Fisher's 1,960 hours, Abaka's vendor-claimed 20,000 hours, the LDC license structure, and the funding rounds cited in §2 and §6. Hypotheses include the 100,000 to 500,000 hour foundation threshold estimate (§3), the 30x to 150x supply gap that follows from it (§3), the 3x post-foundation compression factor applied to the ASR arc (§8), and the 2027 to 2029 sequencing reading (§8). Readers should hold the hypotheses loosely and update them against new data as it arrives.
*(Figure F1: two framings side by side. Left panel shows the SaaS-style vertical-first question. Right panel shows the foundation-first question with the current supply gap marked.)*
## 2. Foundation threshold, a concept worth naming
The **foundation threshold** is the data-and-parameter scale at which a single pretrained model generalizes well enough, zero-shot or with light fine-tuning, that domain-specialized products can be built as adapters on that model rather than from scratch. Below the threshold, each vertical must solve its own data, model, and product. Above it, the foundation is a commodity input and the vertical becomes a distribution problem.
The threshold is visible across three domains that have already crossed it.
**Text LLMs**. GPT-1 (2018) at 117M parameters and 0.8B tokens required fine-tuning for any task ([Radford et al. 2018](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)). GPT-2 (2019) at 1.5B and 10B tokens had zero-shot performance that was interesting but unreliable ([Radford et al. 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)). GPT-3 (2020) at 175B and 300B tokens had few-shot in-context learning robust enough that a vertical startup could build a product by prompting alone ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). Post-threshold vertical adapters confirmed the pattern: Med-PaLM scored 67.6% on MedQA ([Singhal et al. 2022](https://arxiv.org/abs/2212.13138)), Med-PaLM 2 scored 86.5% ([Singhal et al. 2023](https://arxiv.org/abs/2305.09617)), both built on PaLM and PaLM 2 respectively, not trained from scratch. Code Llama added 500B code tokens on top of Llama 2, roughly ten percent additional training ([Rozière et al. 2023](https://arxiv.org/abs/2308.12950)). Specialization was additive and cheap on top of a proven base.
**Computer vision**. CLIP (2021) trained on 400M image-text pairs crossed a zero-shot transfer threshold ([Radford et al. 2021](https://arxiv.org/abs/2103.00020)). MedSAM, a medical-imaging adapter on SAM, improved DICE by 22.51 points over zero-shot SAM across 86 of 86 internal tasks, using 1.57M medical mask annotations ([Ma et al. 2024](https://www.nature.com/articles/s41467-024-44824-z)). **The domain data needed post-threshold was two to three orders of magnitude smaller than the foundation data.** BiomedCLIP followed the same pattern with 15M medical image-text pairs on a CLIP base ([Zhang et al. 2023](https://arxiv.org/abs/2303.00915)).
**Automatic speech recognition**. Whisper (2022) trained on 680,000 hours crossed the threshold for zero-shot transfer across accents, domains, and languages ([Radford et al. 2022](https://arxiv.org/abs/2212.04356)). Pre-Whisper, Nuance Dragon Medical was used by 55% of US physicians and was built on per-domain specialized acoustic models. Post-Whisper, the vertical winner is Abridge ($5.3B valuation, June 2025), which sits on Whisper-class foundations plus LLMs rather than a from-scratch medical ASR stack ([press coverage](https://www.reuters.com/business/healthcare-pharmaceuticals/ai-medical-scribe-abridge-raises-250-million-series-d-2025-02-17/)). The specialized-acoustic-model moat largely evaporated.
**Counter-examples make the rule sharper, not weaker.** Two text-LLM projects tried to build domain foundations at sub-frontier scale. **BloombergGPT** (2023, 50B parameters, trained from scratch on 363B finance tokens plus 345B general) was matched or exceeded by GPT-4 on most finance tasks within twelve months of its release ([Wu et al. 2023](https://arxiv.org/abs/2303.17564)). **Galactica** (2022, 120B parameters, science-specialized) was publicly withdrawn after three days because its narrow corpus produced hallucinated citations that sounded plausible ([Taylor et al. 2022](https://arxiv.org/abs/2211.09085)). Neither case refutes the foundation-first pattern. Both refine it. **The operational rule is "foundation first, vertical as adapter on the foundation," not "vertical foundation at sub-frontier scale."** Attempts at the latter lose.
*(Figure F2: foundation threshold table. Columns: domain, foundation model, parameters, data volume, named post-threshold vertical winner. Rows: text (GPT-3, 175B, 300B tokens, Harvey / Hippocratic), vision (CLIP, 400M params, 400M pairs, MedSAM / BiomedCLIP), ASR (Whisper, 1.55B, 680k hours, Abridge), full-duplex STS (empty row, labeled "not yet crossed"). The empty last row motivates §3.)*
## 3. Where full-duplex STS actually sits now
STS is not pre-foundation. It is mid-foundation.
**Model side.** Moshi (September 2024) was the first open full-duplex STS model, at roughly 7B parameters ([Kyutai 2024](https://arxiv.org/abs/2410.00037)). PersonaPlex (January 2026, NVIDIA) is a Moshi-fine-tuned variant with persona control. J-Moshi (2025) is the Japanese language-specific variant ([Ohashi et al. 2025](https://arxiv.org/abs/2506.xxxx)). Parameter counts sit between GPT-2 scale and GPT-3 scale.
**Data side.** From Article 04's inventory, public full-duplex-ready speech totals roughly 2,000 to 3,000 hours. Fisher English is the anchor at 1,960 hours (LDC2004S13, LDC2005S13), gated by LDC license. AMI contributes 100 hours, ICSI 72 hours, CHiME-6 40 hours, CANDOR 850 hours under CC BY-NC, plus smaller contributions from InteractSpeech and DialogueSidon. **Total public full-duplex-ready hours under commercial license is close to zero.**
**On the scaling curve, Moshi is the GPT-2 analog.** It crossed the zero-shot full-duplex viability threshold. No public GPT-3 analog exists yet. This diagnosis is not oto's invention; it is visible in the models themselves, which are still brittle on long-context turn-taking, in the benchmarks, where Full-Duplex-Bench scores remain in a wide range, and in the data, where every scaling attempt hits the same Fisher-plus-scraps supply.
**The foundation threshold for full-duplex STS probably sits between 100,000 and 500,000 hours of two-channel dyadic conversational audio.** This is a hypothesis, not a measurement, and it depends on three load-bearing assumptions that deserve to be named.
First, the ASR scaling curve that runs from Switchboard (260 hours, 1991) through Fisher (1,960 hours, 2004) to LibriLight (60,000 hours, 2020) to Whisper (680,000 hours, 2022) is usable as an analogy for full-duplex. That is, the per-hour data efficiency transfers from single-channel to two-channel within roughly one order of magnitude. Second, full-duplex is strictly harder than single-channel (two tracks, natural overlap, backchannels, turn-taking), but the difficulty multiplier is bounded at one to two orders of magnitude, not more. Third, parameter scaling and data scaling co-move as they did in the LLM and ASR arcs, so the target model lands at 50B to 500B parameters, roughly 10x to 50x current Moshi scale.
**If any of the three assumptions breaks, the estimate breaks with it.** The 5x spread in the hour estimate (100k to 500k) is the honest range that absorbs these three uncertainties; it is not the statistical confidence interval of a measured quantity.
**If the threshold is 100-500k hours and current supply is 2-3k, closing the gap requires a 30x to 150x scale-up.** On the ASR arc's template, this is a multi-year project, not a six-month sprint.
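The arithmetic behind that range is worth writing down, because it uses only numbers already cited in this section. A minimal sketch in Python; the corpus hours and the 100,000 to 500,000 hour threshold are the article's own figures (the threshold a hypothesis, not a measurement), and the script is purely illustrative:

```python
# Public full-duplex-ready inventory cited above, in hours.
# (InteractSpeech and DialogueSidon are omitted: no hour counts are cited.)
public_supply = {
    "Fisher English (LDC2004S13 + LDC2005S13)": 1960,
    "CANDOR (CC BY-NC)": 850,
    "AMI": 100,
    "ICSI": 72,
    "CHiME-6": 40,
}
total_supply = sum(public_supply.values())  # ~3,000 hours

# Foundation-threshold hypothesis from this section, not a measurement.
threshold_low, threshold_high = 100_000, 500_000

print(f"public supply: ~{total_supply:,} h")
print(f"scale-up needed: {threshold_low / total_supply:.0f}x "
      f"to {threshold_high / total_supply:.0f}x")
# With ~3,000 h of supply this prints roughly 33x to 165x; the article's
# 30x-150x range uses the looser 2,000-3,000 h supply band.
```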
*(Figure F3: scaling curve with text LLMs (GPT-1 → GPT-2 → GPT-3 → GPT-4) plotted, full-duplex STS superimposed as a separate track. Moshi is positioned near GPT-2 on the y-axis. The foundation threshold band is shaded between GPT-3-equivalent and GPT-4-equivalent. Annotated as analogy, not identity.)*
## 4. Which verticals will need their own data, and why generic STS will not suffice
Even after the foundation threshold is crossed, some verticals will still require vertical-specific fine-tuning. The reasons differ by vertical.
- **Call center** uses scripted prompts, complaint register, sensitive data handling, QA review. The deployment reality is 8 kHz narrowband; the training reality needs natural overlap between agent and caller.
- **Medical** requires drug-name vocabulary, ICD-10 term accuracy, emotional-register control with patients, and HIPAA-compliant scribing.
- **Legal** runs on formal oral-argument register, citation-heavy vocabulary, and adversarial turn-taking that looks nothing like casual conversation.
- **Education** follows teacher-student Initiation-Response-Evaluation patterns, includes code-switching with minors, and is gated by FERPA and COPPA.
- **Multi-party meetings (three or more speakers)** require diarization, overlap resolution, and role attribution. **Current full-duplex models are dyadic.** This is a structural gap, not a data gap.
- **Gaming** requires sub-200ms latency, emotional-register matching, gaming-specific jargon, background noise handling, and interruptions as a feature, not a bug.
- **Casual everyday and companion** use cases need backchannels, laughter, emotional attunement, and long-context memory.
**Each of these is a different distribution of turn patterns, channel configurations, vocabulary, or regulatory constraints.** Generic STS can handle none of them at production quality.
### 4.1 The Japanese case, a short aside
Japanese full-duplex STS is a special case because **no Fisher-equivalent exists**. J-Moshi's fine-tune mixture totals 344 hours, of which only 143 hours come from publicly reproducible corpora ([Ohashi et al. 2025](https://arxiv.org/abs/2506.xxxx)). The remaining 201 hours are Nagoya University in-house recordings. J-CHAT provides 69,000 hours of Japanese audio but is mono single-speaker and cannot be used for the full-duplex fine-tune stage ([Nakata et al. 2024](https://arxiv.org/abs/2407.15828)). CEJC (200 hours), BTSJ (127 hours), and CSJ (650 hours, predominantly monologue) push the total dialogue audio toward 500-600 hours with mixed channel configurations and mixed licenses.
**The Japanese full-duplex community is training on a public floor of 143 hours.** English is not great. Japanese is worse.
*(Figure F4: domain divergence table. Seven vertical rows (call center, medical, legal, education, multi-party, gaming, casual). Four columns: vocabulary divergence, turn-pattern divergence, channel configuration, regulation. Japanese flagged in a callout row.)*
## 5. The three-pattern reality check
Across nine verticals inventoried (call center, medical, legal, education, multi-party meetings, gaming, brainstorming, casual everyday, Japanese), data for each falls into one of three categories.
- **Nonexistent**: no public corpus at any scale.
- **Too small**: under 200 hours in the largest public corpus.
- **Blocked**: 200+ hours exist but are gated by license, regulation, or channel configuration.
Plus a fourth pattern worth naming: **structurally wrong**. The data exists at scale but in the wrong configuration for full-duplex training, most commonly because it is mono-mixed rather than two-channel.
**Zero verticals have an existing, commercially usable, public full-duplex corpus over 1,000 hours.** Every vertical fails at least one of the three bars.
| Vertical | Category | Best public two-channel corpus | Hours | Primary blocker |
|---|---|---|---|---|
| Call center | Blocked (license) | Fisher English (LDC2004/2005S13) | 1,960 | LDC gated, 8 kHz narrowband, English only |
| Medical | Blocked (regulation) | PriMock57 + ACI-Bench (mocked) | <100 | HIPAA / GDPR; the real 14,000-hour Google Health corpus is internal-only |
| Legal | Structurally wrong (mono-mixed) | Oyez Supreme Court | ~5,000 | Single mixed track, not two-channel |
| Education | Blocked (regulation) | NCTE Classroom (gated) / SimClass (simulated) | 5,000 / 391 | FERPA / COPPA; minors on tape |
| Multi-party (3+) | Too small | AMI + ICSI + CHiME-6 + VoxConverse + MSDWild | ~360 | No single corpus >100h |
| Gaming | Too small (near-nonexistent) | OGVC (Japanese MMORPG) | <20 | No public English gaming corpus |
| Brainstorming | Too small | AMI scenario-driven subset | 65 | Subsumed into meetings |
| Casual everyday | Blocked (license) | CANDOR | 850 | CC BY-NC, non-commercial only |
| Japanese | Too small + blocked | J-Moshi public portion | 143 | No Fisher-equivalent |
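Read mechanically, the three categories plus the structural check form a short decision rule. A minimal sketch of that rule, using the thresholds stated above; the function and its arguments are invented here for illustration, not part of any published tooling:

```python
def classify_vertical(hours: int, two_channel: bool,
                      commercially_licensed: bool, regulation_cleared: bool) -> str:
    """Apply the bars from this section to the best public corpus for one vertical."""
    if hours == 0:
        return "nonexistent"
    if not two_channel:
        return "structurally wrong"   # mono-mixed, unusable for full-duplex labels
    if hours < 200:
        return "too small"
    if not (commercially_licensed and regulation_cleared):
        return "blocked"
    return "usable"

# Example: Oyez oral arguments — large and public domain, but mono-mixed.
print(classify_vertical(5000, two_channel=False,
                        commercially_licensed=True, regulation_cleared=True))
# -> "structurally wrong"
```

Run against the rows above, nothing returns "usable", which is the table's point.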
Three verticals deserve a closer look because they surprised the inventory.
**Legal is configuration-wrong, not data-poor.** The Oyez Project hosts more than 5,000 hours of US Supreme Court oral argument audio, which is public domain. But it is mono-mixed from the courtroom recording system; a well-funded effort could diarize and release per-speaker tracks, but the underlying acoustic recording has only one track ([Oyez.org](https://www.oyez.org/)). The domain is not scarce; the configuration is.
**Medical is structurally forced into synthetic data.** The two named public medical conversation corpora, PriMock57 ([Korfiatis et al. 2022](https://arxiv.org/abs/2204.00333)) and ACI-Bench ([Yim et al. 2023](https://arxiv.org/abs/2306.02022)), are both explicitly mocked with patient actors. The authors are explicit about this being a HIPAA workaround. The Google Health medical conversation dataset of 14,000 hours ([Chiu et al. 2017](https://arxiv.org/abs/1711.07274)) is institutional and has never been released. **Medical full-duplex STS training is structurally forced into synthetic data by regulation, not by effort.**
**The only commercial full-duplex corpus over 10,000 hours is vendor-claimed.** Abaka AI's 20,000-hour bidirectional release (2026) is the only named public precedent that clears the commercial-license-at-scale bar, but it is vendor-claimed and has not been independently audited for overlap rate, consent documentation, or per-language hour breakdown. One data point is not a distribution.
*(Figure F5: central traffic-light matrix. Nine verticals on rows. Four columns for size, channel configuration, license, regulation, each colored green / yellow / red. This is the article's anchor visual.)*
## 6. Why vertical-first investment is premature now
The simple version of this article's thesis: **committing vertical-specific STS capital in 2026 is not wrong in direction, but wrong in timing.** The vertical needs a foundation to sit on. That foundation does not yet exist for native full-duplex.
Three market positions implicitly take the vertical-first bet.
**Decagon** raised a Series D at a $4.5 billion valuation in January 2026, deploying customer-service STS agents for enterprises ([press coverage](https://techcrunch.com/2026/01/28/decagon-series-d/)). **Deepgram** raised a Series C at $1.3 billion for enterprise voice AI in the same month ([press coverage](https://www.reuters.com/technology/deepgram-raises-series-c-2026-01-13/)). **Vapi** raised a $20M Series A in late 2024 and has reportedly crossed $130M in valuation since, building a developer voice platform ([press coverage](https://techcrunch.com/2024/12/12/vapi-series-a/)).
Each of these is a **pipeline STS stack**, meaning ASR plus LLM plus TTS, not native full-duplex. The product works today. It ships to customers today. The full-duplex quality gap (natural overlap, true interruption handling, backchannel nuance) is real but not yet a deal-breaker for the customer-service and developer-tool use cases these companies address.
**The risk is not the business model. The risk is timing.** If the native full-duplex foundation threshold is crossed in 2026-2028, pipeline-based verticals face a transition cost: either migrate to native full-duplex (expensive) or maintain the pipeline stack against competitors who build natively on the new foundation (compounding disadvantage).
One honest counter-nuance: **not every vertical needs the foundation to be ready.** Retell has reportedly reached $50M ARR on roughly $5M total funding ([company page](https://www.retellai.com/)), which suggests pipeline STS can compound without foundation-level investment for certain use cases where the full-duplex quality gap is not the binding constraint. The foundation-first pattern is strongest for verticals where natural conversation quality is the bottleneck: companionship, emotional support, interactive gaming, long-context conversational agents.
The post-foundation compression factor is worth keeping in view. Harvey went from ChatGPT launch (November 2022) to Series A (April 2023) in five months. Nuance went from founding (1992) to Microsoft acquisition at $19.7 billion in thirty years. **Post-foundation verticals compound roughly 30x faster because the foundation is a commodity input.** STS unicorns built on a native full-duplex foundation, when it exists, will compound on that ratio, not Nuance's.
*(Figure F6: investment timing chart. X-axis foundation moment in each domain. Y-axis months to first $1B vertical. Points plotted: Harvey (text legal, 13 months post-ChatGPT), Abridge (medical ASR, 24 months post-Whisper), full-duplex STS (projected 2027-2028, conditional on threshold crossing).)*
## 7. Where the data could plausibly come from: six routes
If the foundation threshold requires 100,000 to 500,000 hours of two-channel dyadic audio, where does that data come from? Six routes are plausible. Each has a precedent; none has cleared the full bar of commercial-license at scale for full-duplex specifically.
**Route 3, BPO and commercial vendors.** Strongest current precedent. Abaka AI's 20,000 hours bidirectional commercial release (2026) is the only named public precedent that clears commercial license at scale. Caveat: vendor-claimed, not independently audited. Nexdata's 15k-hour multilingual conversational corpus is mono 8 kHz and fails the channel bar. Appen and TELUS Digital are project-based managed collection, not standing corpora. **One working route with one working data point.**
**Route 5, government-sponsored (DARPA template).** Strongest historical precedent for native two-channel conversational. Switchboard (1990-91, DARPA + Texas Instruments, 260 hours) and Fisher (2003-04, DARPA EARS + LDC, 1,960 hours) are the two canonical releases of the modern era. **Neither alone clears 10,000 hours, and no public 2024-2026 program is known to target full-duplex at tens-of-thousands-of-hours scale.** The ceiling is political and logistical, not technical.
**Route 4, academic consortium (LDC model).** LDC has the longest operational track record for two-channel conversational licensing and a working commercial tier ($34-40k annual for-profit membership) that yields in-year commercial rights. But LDC has not produced a new two-channel conversational corpus above 2,000 hours since Fisher in 2004-2005. **The institutional scaffolding is intact; the origination funding has not been there for twenty-two years.**
**Route 6, platform-gated licensing.** Infrastructure is mature. Reddit-Google ($60M per year, February 2024) and Reddit-OpenAI ($70M per year) prove platforms can monetize UGC corpora to AI labs. YouTube's December 2024 opt-in creator control and the RSL protocol (launched September 2025, 1,500+ publishers by late 2025) provide the opt-in plumbing for web-scale audio. But **no audio-platform-wide bulk licensing deal to an AI lab is publicly disclosed for STS training as of April 2026.** Spotify's May 2025 Developer Policy explicitly prohibits Spotify-content training. **The pipes are built. The deals are not signed.** Watch Route 6 for a surprise inflection.
**Route 2, crowdsourced opt-in.** Ceiling of known evidence: Mozilla Common Voice at 31,841 hours across 286 languages, all CC0. But **all of it is single-speaker read or monologue**. No crowdsourced precedent for full-duplex conversational speech at any scale has been identified. The structural reason is that crowdsourcing assumes one-person-per-device recording, and full-duplex requires paired speakers on isolated channels. **The route with the single largest structural gap.** Whoever builds the pairing, channel-isolation, and consent infrastructure at scale could become the Mozilla Common Voice of STS.
**Route 1, consumer companion app opt-in.** Highest volume, lowest marketability. Replika (since 2017), Character.AI (since 2021), Inflection Pi (2022-24), and Sesame (beta 2025) accumulate in-app conversational data in volumes that almost certainly exceed 10,000 hours per company. But **none has ever released, licensed, or sold a conversational corpus to a third party.** The Italian DPA's February 2023 provisional ban and April 2025 €5M fine against Luka Inc. (Replika) demonstrate the GDPR ceiling: privacy policies that conflate chatbot interaction with model development fail the lawful-basis test, making commercial redistribution impossible ([Garante decision, April 2025](https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/10085565)). **Companion-app data is structurally trapped inside the companion app.**
*(Figure F7: six-route comparison card. Columns: route, canonical precedent, precedent hours, license type, full-duplex applicability, current ceiling. Color-coded on precedent strength, scalability, and commercial viability.)*
## 8. Sequencing: what buys what
The ASR arc is the clearest template. **Automatic speech recognition went from Switchboard (1991, 260 hours, a reported DARPA budget of roughly $2M) to Whisper (2022, 680,000 hours, scraped from the web) in thirty-two years.** The per-hour cost of ASR training data collapsed from thousands of dollars to effectively zero-but-legally-contested. Each jump was enabled by a different collection route.
- **1990-91**: Switchboard, Route 5 (government).
- **2003-04**: Fisher, Route 5 continued.
- **2017**: Mozilla Common Voice launched, Route 2 (crowdsourced).
- **2020**: LibriLight, 60,000 hours public-domain audiobooks, Route 4-adjacent.
- **2022**: Whisper, 680,000 hours scraped, Route 6.
- **2024**: Moshi, 7M hours web speech pretraining, Route 6 continued.
**Full-duplex STS is roughly at the 1991 stage of the ASR arc.** Absolute hours of available training data (Abaka 20k + Fisher 2k + AMI / ICSI / CHiME-6 / CANDOR / DialogueSidon / InteractSpeech at roughly a thousand combined) approximate 1991 ASR in raw volume, and worse in relative terms because two-channel is a harder collection problem than mono.
**The likely sequence, treated as hypothesis rather than forecast.** Near term, 2026-2027: Route 3 scales as the proven route. One or two more commercial vendors cross 10,000 hours. Route 5 re-enters if a DARPA-equivalent or EU-equivalent program launches for full-duplex. Medium term, 2027-2029: Route 6 delivers a surprise inflection if an audio platform signs a bulk licensing deal, most likely a podcast distributor via RSL or a UGC platform with opt-in, not consumer companion apps. Longer term, 2028+: Route 2 becomes viable once someone builds the pairing, channel-isolation, and consent infrastructure for full-duplex crowdsourcing. No one has built it yet.
**The post-foundation compression factor should shorten this arc by roughly 3x relative to ASR's thirty-two years.** Infrastructure that did not exist in 1991 (cloud storage, consent UX primitives, opt-in protocols, commercial vendor markets for labeled data) exists now. **A 10-year arc from Switchboard-equivalent to Whisper-equivalent is plausible; a 30-year arc is not.**
The order of operations matters for investors. **Foundation data, then foundation model, then vertical product.** Reversing the order (betting on vertical before foundation) works for pipeline STS, but where the value ultimately concentrates at each layer is an open question that the LLM arc does not cleanly resolve for STS.
In text LLMs, the foundation model layer captured significant terminal value. STS is likely to play out differently in at least one structural way. Hyperscalers (Google with Gemini Live and OpenAI with GPT-4o, with Microsoft and Meta closer behind the frontline) already show a tendency to internalize foundation STS behind proprietary APIs. If that pattern holds, the independent-player opportunity shifts away from foundation model replication and toward four adjacent layers: **foundation data** (what oto is building), **evaluation and benchmarking** (the subject of Articles 07 and 08), **migration infrastructure** (pipeline-to-native adapters), and **vertical integration on top of closed foundations**. The conservative reading is that value will concentrate differently than in the LLM arc, with at least one of those four adjacent layers accruing a disproportionate share.
**The cleanest framing for investors is that foundation-first is a timing claim, not a value-capture claim.** What verticals are waiting for is foundation readiness. What pure-play foundation-model startups face is the possibility that the foundation is not where the terminal value sits.
*(Figure F8: ASR arc as a timeline. 1991 Switchboard, 2003 Fisher, 2017 Common Voice, 2020 LibriLight, 2022 Whisper, 2024 Moshi. Full-duplex STS annotated with a "you are here" arrow near the 1991-equivalent position. Compression-factor note explains the expected 10-year arc rather than 32-year.)*
## 9. Forward pointers
For investors, **the investable proposition in full-duplex STS today is not vertical-first.** It is foundation-data collection, if you believe the threshold is crossable, or pipeline STS verticals with a planned migration path to native full-duplex, if you do not. Both are reasonable positions; neither is the SaaS-style vertical bet.
For engineers and researchers, **two-channel conversational data at 100,000 to 500,000 hours is the binding input.** The technical problem most worth solving is not a new architecture. It is the pairing, channel-isolation, and consent infrastructure for crowdsourced full-duplex.
For frontier labs, **vertical fine-tuning is premature.** The 2024-2026 window is the foundation window. Article 04 covered the data-supply side of the same coin; Article 07 will cover how we will know when the threshold has been crossed, via benchmarks. Article 10 will cover the legal ceilings on Routes 1, 2, and 6 in detail.
**Data for full-duplex STS is not a vertical problem yet. It is a foundation problem still.**
---
_Originally published at [https://fullduplex.ai/blog/foundation-before-vertical](https://fullduplex.ai/blog/foundation-before-vertical)._
_Part of **The STS Series** · 05 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 06 — Mapping the benchmark landscape
_Canonical: https://fullduplex.ai/blog/benchmark-landscape · Markdown: https://fullduplex.ai/blog/benchmark-landscape/md_
---
title: "Mapping the benchmark landscape"
description: "Too many speech-to-speech benchmarks, each covering a different slice. The map, as of April 2026 — arena versus fixed test set, four capability axes, a coverage heatmap, and a Japanese gap."
article_number: "06"
slug: benchmark-landscape
published_at: 2026-04-20
reading_minutes: 18
tags: ["benchmarks", "evaluation", "STS"]
canonical_url: https://fullduplex.ai/blog/benchmark-landscape
markdown_url: https://fullduplex.ai/blog/benchmark-landscape/md
series: "The STS Series"
series_position: 6
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Mapping the STS benchmark landscape
Imagine trying to get faster at running without ever looking at a stopwatch. You can feel whether a run was hard or easy. You can guess whether you are improving. What you cannot do is tell a coach exactly how much you improved this month, or show someone in another city that your method actually works. The stopwatch is what turns effort into measurable progress.
AI research has a version of the same problem, and in the early 2010s DeepMind ran into it directly. They wanted to build AI that learned to play games on its own. Before that era, the field already had AI that played chess, AI that played checkers, AI that played backgammon. But each system was built for a single game, with its own inputs and its own scoring. There was no way to ask whether "AI that plays chess" and "AI that plays backgammon" represented the same kind of progress.
DeepMind's move was to pick something boring on purpose: 49 old Atari arcade games. Same controller layout, same pixel-based screen input, same visible score in the corner. One AI played all 49. The score was the measurement, the controls were standard, and anyone could run the experiment and compare results. The Atari suite (and later the DeepMind Control Suite) became what researchers call a benchmark: a shared task, a shared scoring rule, and a shared format. It did not make AI smarter by itself. What it did was let the whole field tell, from week to week and lab to lab, whether a new method was actually better.
This is what benchmarks do, and why the presence or absence of one shapes a whole field. Without one, a team produces impressive demos and cannot tell whether the next version is better or just different. With one, a team can run the same test on this month's model and last month's model, see the number move, and show that number to a skeptic. A benchmark is what turns a sandbox into an improvement engine.
Speech-to-speech AI (the category of models that listen and reply in voice, like OpenAI's GPT-4o voice mode, Google's Gemini Live, or Kyutai's open-source Moshi) is at the stage where the demos are impressive but the measurement layer is still being assembled. [Article 05](/blog/foundation-before-vertical) argued that full-duplex STS sits roughly where automatic speech recognition sat in 1991, before TIMIT and Switchboard normalized that field's own measurements. This article is the measurement-infrastructure half of that argument, and [the next](/blog/why-new-benchmarks) is its prescriptive companion.
A buyer who wants to compare two STS models today has no single score to rely on. OpenAI cites Big Bench Audio in its GPT-Realtime launch. StepFun cites the same score on the Artificial Analysis leaderboard. A full-duplex benchmark paper reports four numbers that the commercial leaderboards do not even use. A Japanese product team finds no benchmark in its language at all. The gap is not that benchmarks are missing. The gap is that there are too many, each measuring a different slice, and the buyer has to reassemble the picture by hand.
This article is that map, as of April 2026. It is a map, not an argument. The argument comes in [the next dispatch](/blog/why-new-benchmarks). The field's own researchers disagree about where the gaps are, and a shared labeled diagram is the fastest way to make those disagreements visible.
## Two questions, no single answer
Two different people ask two different questions when they look at an STS benchmark. A buyer asks, "can this model handle my product?" A researcher asks, "which axis is my model weakest on?" Neither question has a one-number answer today, and the reasons are different.
The buyer's problem is that **no benchmark scores a complete production voice agent.** Conversation quality, reasoning quality, tool use, safety behavior, and language coverage each live on a separate benchmark. There is no equivalent of MMLU for text LLMs that a non-specialist can point at and say "higher is better across the board."
The researcher's problem is that **the field is fragmented across four capability axes** and four *versions* of the same full-duplex benchmark. The 2024 atomic note that maps this fragmentation already listed four benchmark families covering different slices (representation, instruction following, voice-agent task competence, interaction dynamics) with no unified yardstick, and the 2026 landscape has only gotten wider, not narrower.
So the rest of this article is structured as a map that serves both readers. The next section establishes the two evaluation *styles* the field has produced. After that, we define the four capability *axes* the field actually measures, zoom into the two most-cited anchors, and surface the coverage heatmap. The closing sections cover language coverage, citation patterns, and the thin axes.
## Two evaluation styles — arena versus fixed test set
The first structural split in the landscape is not technical. It is methodological.
**Fixed-test-set benchmarks** run pre-recorded audio through a model and compute deterministic metrics. SUPERB (2021) was the pre-LLM prototype of this style: frozen-encoder representation quality measured against fixed probing tasks. Almost every benchmark covered in this article descends from that lineage — a fixed suite of stimuli, a fixed scoring rule, reproducible results.
**Arena benchmarks** do the opposite. They put two models into live conversation with a human judge, collect preference votes, and rank models by an Elo-like aggregation. Scale AI launched [Voice Showdown](https://scale.com/leaderboard/voice-showdown) on March 20, 2026 as the first full-scale voice arena. It is the speech-side counterpart to LMSYS Chatbot Arena on text. The text-side split between MMLU (fixed) and Chatbot Arena (preference) has been live since 2023. Speech arrived later and less completely.
Why this split matters: the two styles answer different questions. Fixed test sets answer "does the model meet a specification." Arenas answer "do users prefer this model over that one." A model can win an arena with warm prosody and lose a fixed benchmark by getting the Formal Fallacies items wrong. Both are true. Neither is the whole picture.
The split has a second axis worth drawing explicitly: whether the benchmark targets **general-purpose** capability (reasoning, conversation, agent tasks) or **task-specific** capability (one narrow behavior like emergency interruption or code-switching). Crossing these two dichotomies produces four quadrants.
*(Figure F1: the two-by-two quadrant map. Arena versus fixed test set crossed with general-purpose versus task-specific, with current benchmarks placed in each quadrant.)*
Voice Showdown occupies the arena × general-purpose quadrant. Almost everything else in this article is fixed-test-set — split between general-purpose (VoiceBench, URO-Bench, VocalBench, Big Bench Audio) and task-specific (the Full-Duplex-Bench family, FLEXI, HumDial, SID-Bench, FD-Bench, MTR-DuplexBench). The arena × task-specific quadrant is empty as of April 2026. Scale announced a full-duplex mode for Voice Showdown at launch, but that mode was not yet live when this article was written.
That empty quadrant matters. It means nothing currently tells a buyer, "over 500 arena votes, model A holds a conversation better than model B." Commercial comparisons of full-duplex behavior fall back to fixed-test-set scores or internal demos.
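For readers who have not met the arena style before, the "Elo-like aggregation" mentioned above can be sketched in a dozen lines. This is the generic logistic-Elo update, not Scale's actual Voice Showdown methodology (which is not published in detail); the model names and votes are placeholders:

```python
from collections import defaultdict

K = 32  # update step size; real arenas tune this, or use Bradley-Terry fits instead

def expected(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under a logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, a_won: bool) -> None:
    """Adjust both ratings after one human preference vote."""
    e_a = expected(ratings[a], ratings[b])
    score_a = 1.0 if a_won else 0.0
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

ratings = defaultdict(lambda: 1500.0)
votes = [("model_a", "model_b", True),   # placeholder preference votes
         ("model_b", "model_a", True),
         ("model_a", "model_b", True)]
for a, b, a_won in votes:
    update(ratings, a, b, a_won)
print(dict(ratings))
```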
## Four capability axes the field actually measures
Within fixed-test-set benchmarks, the coverage concentrates on four orthogonal capability axes.
**Axis A — Speech reasoning.** Can the model, given an audio question, apply logic, arithmetic, spatial reasoning, or multi-step inference to produce a correct answer? [Big Bench Audio](https://huggingface.co/blog/big-bench-audio-release) (HuggingFace, December 2024) is the anchor here. It covers 1,000 audio questions drawn from BIG-Bench text items, in four categories of 250 each: Formal Fallacies, Navigate, Object Counting, Web of Lies. Artificial Analysis implements it as the Speech Reasoning axis of its S2S leaderboard.
**Axis B — Conversational dynamics.** When does the model start talking? When does it stop? Does it yield to an interruption? Does it backchannel at the right moments? [Full-Duplex-Bench](https://arxiv.org/abs/2503.04721) v1 operationalized this as four automatable axes — pause handling, smooth turn-taking, backchanneling, user interruption — and became the field's reference point. The FDB family covered below is the deep spine of this axis.
**Axis C — Paralinguistic understanding and generation.** Does the model hear the user's emotion, and does its spoken reply match? [SD-Eval](https://arxiv.org/abs/2406.13340) scores whether a model uses paralinguistic input at all. [ProsAudit](https://arxiv.org/abs/2302.12057) scores whether the model can detect prosodic boundaries. [VocalBench](https://arxiv.org/abs/2505.15727) and [MTalk-Bench](https://arxiv.org/abs/2505.15524) push into the generation side, scoring whether the spoken reply carries the expected affect. This axis is the thinnest of the four in terms of joint input-output coverage.
**Axis D — Task competence.** Can the model book a flight by voice? Can it finish a support conversation? [VoiceBench](https://github.com/MatthewCYM/VoiceBench) (~6,783 instructions), [URO-Bench](https://arxiv.org/abs/2502.17810), [τ-Voice](https://sierra.ai/blog/tau-voice) (Sierra, 2025), and [AudioBench](https://arxiv.org/abs/2406.16020) anchor this axis. τ-Voice also introduced a direct voice-vs-text retention number — voice agents retain only 30-45% of the corresponding text agent's score on grounded tasks — which is one of the cleaner 2026 data points for "voice is hard at the task layer, not just the latency layer."
A fifth axis cross-cuts all four: language coverage. That is treated separately below because the pattern there is unusual.
## The Full-Duplex-Bench family as the conversational-dynamics backbone
Axis B deserves a dedicated section because the benchmark stack under it is deep, fast-moving, and often mis-cited.
The [Full-Duplex-Bench v1 paper](https://arxiv.org/abs/2503.04721) (March 2025) operationalized full-duplex behavior as four metrics computed over pre-recorded stimuli:
- **Pause handling.** Does the model stay quiet when the user pauses mid-thought? Scored by a Take-Over-Rate detector with a 1-second / 3-word threshold on the model's transcribed output. Lower is better.
- **Smooth turn-taking.** When the user finishes a turn, does the model start within a natural window? Same TOR detector, opposite polarity — higher is better.
- **Backchanneling.** Does the model say "mm-hm" at the right moments? Scored by Jensen-Shannon Divergence of the model's backchannel timing distribution against an ICC corpus ground truth.
- **User interruption.** When the user cuts in, does the model produce a relevant new response quickly? Scored by TOR, latency, and a GPT-4-turbo relevance rating.
Three of the four are fully automatic. Interruption is the only axis that calls a closed-source frontier judge, which raises cost and risks reproducibility drift across years.
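One plausible reading of that Take-Over-Rate detector, written out as code, makes the pause-handling metric concrete. This follows the description above rather than the released Full-Duplex-Bench implementation, and the data structures are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ModelSegment:
    start: float   # seconds from the onset of the user's mid-thought pause
    text: str      # transcribed model output for this segment

def took_over(segments: list[ModelSegment],
              min_gap_s: float = 1.0, min_words: int = 3) -> bool:
    """True if the model produced >= min_words within min_gap_s of the pause."""
    words_in_window = sum(
        len(seg.text.split()) for seg in segments if seg.start <= min_gap_s
    )
    return words_in_window >= min_words

def take_over_rate(episodes: list[list[ModelSegment]]) -> float:
    """Fraction of pause episodes where the model took over (lower is better)."""
    return sum(took_over(eps) for eps in episodes) / len(episodes)
```

For smooth turn-taking, the same detector is read with opposite polarity: taking the floor after a finished turn is the desired behavior, so higher is better.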
The v1 paper explicitly framed these four axes as a *first step* rather than a complete theory. The field took that invitation literally, and v1 has since spawned three peer-reviewed successors plus several adjacent benchmarks:
- **[FDB v1.5](https://arxiv.org/abs/2507.23159)** (July 2025) adds overlap scenarios: user interruption, user backchannel, talking to others, background speech. v1.5 is the first FDB extension where the model is scored on what happens when two voices are heard simultaneously.
- **[FDB v2](https://arxiv.org/abs/2510.07838)** (October 2025) replaces pre-recorded stimuli with a live WebRTC-style examiner that runs multi-turn tasks under Fast and Slow pacing. It also replaces threshold metrics with an automated LLM examiner.
- **[FDB v3](https://arxiv.org/abs/2604.04847)** (April 2026) reframes the evaluation around three task-level dimensions — Tool-use Performance, Turn-Taking Dynamic, Latency Breakdown — over real human audio annotated for five disfluency categories (fillers, pauses, hesitations, false starts, self-corrections). GPT-Realtime scores under 59% on self-correction scenarios in v3.
*(Figure F2: the Full-Duplex-Bench family, v1 through v3, laid out as a timeline with the axes each version adds.)*
Adjacent benchmarks fill specific sub-axes that FDB does not cover. [FLEXI](https://arxiv.org/abs/2509.22243) adds a model-initiated emergency interrupt axis — the model must barge in on the user during a safety-critical scenario. [HumDial](https://sites.google.com/view/humdial-2026) pairs emotional intelligence with full-duplex turn-taking in a single ICASSP 2026 grand challenge with 6,356 interruption and 4,842 rejection utterances. [FD-Bench](https://arxiv.org/abs/2507.19040) uses LLM-driven stimulus generation rather than fixed test sets. [SID-Bench](https://arxiv.org/abs/2603.24144) introduces an APT (Accurate and Prompt Termination) metric that penalizes both false alarms and late responses. [MTR-DuplexBench](https://arxiv.org/abs/2511.10262) targets multi-round dialogues.
### The single detail that matters
**Four distinct metrics now share the name "barge-in latency."** FDB v1 measures latency-to-next-response. SID-Bench's APT is a composite false-alarm-plus-late-response penalty. Chronological Thinking and SALM-Duplex measure the time from user interrupt start to agent stopping speech. SALM-Duplex also reports barge-in success rate as the percentage of cases where the agent stops within 1.5 seconds. A paper reporting "barge-in latency of 0.69s" could mean any of these four, and the numbers are not comparable.
For a buyer, the takeaway is: when a vendor says they post SOTA on full-duplex, ask *which version of Full-Duplex-Bench, which axis, which barge-in definition.*
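The incomparability is easiest to see with the definitions written out over a single invented episode. None of the numbers below come from any benchmark; they exist only to show that one interruption can yield three different "latencies":

```python
# Timestamps in seconds for one interruption episode, invented for illustration.
user_interrupt_start = 10.0   # user starts cutting in
agent_stop_speaking  = 10.9   # agent's audio actually stops
agent_next_response  = 11.7   # agent starts a relevant new response

# Definition 1 (FDB v1 style): latency to the next relevant response.
latency_to_response = agent_next_response - user_interrupt_start   # 1.7 s

# Definition 2 (Chronological Thinking / SALM-Duplex style): time to stop speaking.
latency_to_stop = agent_stop_speaking - user_interrupt_start       # 0.9 s

# Definition 3 (SALM-Duplex success rate): did the agent stop within 1.5 s?
barge_in_success = latency_to_stop <= 1.5                          # True

# Definition 4 (SID-Bench APT) is a composite false-alarm-plus-lateness penalty
# and is not a single timestamp difference at all.

# The same episode yields 1.7 s, 0.9 s, and "success" depending on the definition,
# which is why a bare "barge-in latency of 0.69 s" is not comparable across papers.
```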
## The reasoning anchor and the commercial bridge
Two benchmarks do most of the work in commercial STS launches: Big Bench Audio and the Artificial Analysis S2S leaderboard that implements it.
Big Bench Audio is straightforward. HuggingFace released it in December 2024 as a 1,000-item audio adaptation of existing BIG-Bench reasoning tasks. The judge is Claude 3.5 Sonnet (Oct '24), kept frozen so scores stay comparable.
The [Artificial Analysis S2S leaderboard](https://artificialanalysis.ai/speech-to-speech) is the bridge. Artificial Analysis is an independent analyst firm; its S2S product evaluates native audio models on two axes — Speech Reasoning (implementing Big Bench Audio) and Conversational Dynamics (a subset of FDB v1 + v1.5, run by Artificial Analysis rather than the FDB authors). It is the single most-cited commercial speech leaderboard as of April 2026.
The top of the Big Bench Audio ranking as of April 2026:
1. Step-Audio R1.1 (Realtime) — 97.0%
2. Gemini 3.1 Flash Live Preview (High) — 95.9%
3. Grok Voice Agent — 92.9%
4. Gemini 2.5 Flash Native Audio Dialog Thinking — 90.7%
5. Nova 2.0 Sonic (March 2026) — 88.1%
OpenAI posted its GPT-Realtime 83% number directly. Amazon posted Nova Sonic 87.1%. StepFun posted 97.0% for Step-Audio R1.1. Google posted 92% for Gemini 2.5 Native Audio Thinking. Each citation is a tweet from the Artificial Analysis account that the lab quoted. The pattern is unmistakable: commercial labs cite one number, Big Bench Audio via Artificial Analysis, when they launch a new STS model. That one number is doing a lot of work.
### Aside: three caveats
Three caveats are worth naming. The leaderboard is not reproducible without access to the proprietary runner and prompt templating. The exact weighting across Conversational Dynamics sub-axes is opaque. And the fixed Claude 3.5 Sonnet judge means the scores drift the day that judge is retired. These are not design failures — they are structural properties of a privately-run evaluation that every model lab treats as a public number.
*(Figure F3: the commercial bridge. Big Bench Audio scores flowing through the Artificial Analysis S2S leaderboard into launch posts, with the April 2026 top-five ranking.)*
## The coverage heatmap
The central visual of this article is a benchmark-by-capability-axis coverage heatmap. The rows are thirty speech-interaction benchmarks, grouped by which capability family they primarily serve. The columns are fifteen capability axes — the most fine-grained disaggregation of "what an STS model could be scored on" that the public literature currently supports.
The fifteen columns:
1. **Latency** (first-word time, end-to-end time)
2. **Turn-taking** (when to start)
3. **Backchannel** (short affirmative sounds at the right moment)
4. **Interruption handling** (yielding when cut in)
5. **Pause handling** (staying quiet during mid-thought pauses)
6. **Overlap** (simultaneous speech)
7. **Tool use** (chained API calls by voice)
8. **Multi-turn consistency** (entity tracking, correction, memory across turns)
9. **Instruction following** (doing what the user asked)
10. **Speech reasoning** (math, logic, structured reasoning)
11. **Paralinguistic input** (hearing emotion, intent, ambient sound)
12. **Paralinguistic output** (producing appropriate prosody / affect)
13. **Naturalness / MOS** (subjective listener quality)
14. **Safety / emergency** (model-initiated interrupt in safety-critical moments)
15. **Multilingual** (non-English coverage)
*(Figure F4: the coverage heatmap. Thirty benchmark rows against the fifteen capability columns above, colored green / yellow / empty.)*
The heatmap is readable in three passes.
**Pass one — rows.** Almost no row is fully green. Full-Duplex-Bench v2 lights up six columns (turn-taking, backchannel, interruption, pause, multi-turn, overlap) and stops. Big Bench Audio lights up one (speech reasoning). VoiceBench lights up three (task competence axes). Artificial Analysis S2S is the only row that spans both the reasoning column and the conversational-dynamics columns, which is why commercial labs cite it. Voice Showdown is unusual — it is scored subjectively, so every column it touches is yellow rather than green.
**Pass two — columns.** The thinnest columns are paralinguistic output, safety / emergency, and multilingual. Paralinguistic output is covered directly only by MOS-style scoring (VocalBench, MTalk-Bench, J-Moshi's subjective protocol); everything else is indirect. Safety / emergency is FLEXI alone; no other benchmark scores the behavior of a model that should barge in on a user for safety reasons. Multilingual is covered by VocalBench-zh and CS3-Bench for Mandarin, HumDial Track II for Chinese + English, and *nothing* for Japanese full-duplex — J-Moshi explicitly uses subjective MOS only.
**Pass three — diagonals.** If you draw a diagonal from FDB v1 (turn-taking, backchannel, interruption, pause) down through FDB v1.5 (adds overlap) to FDB v2 (adds multi-turn) to FDB v3 (adds tool use, latency breakdown), you can watch the full-duplex axis widen in real time across thirteen months of 2025-2026 publishing. The reasoning axis does not move in parallel. Big Bench Audio has no v2 or expansion on the public roadmap. That asymmetry — rapid widening on conversational dynamics, stasis on reasoning — is a structural feature of the field as of April 2026.
A buyer reading the heatmap can pick three or four benchmarks that jointly cover the capabilities their product needs, rather than hunting for one score that does all of it. A researcher reading the same heatmap can find a column with one green cell and treat that as a publishable gap. Both uses are legitimate. The map is designed to support both.
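The thin-column read is mechanical once the heatmap is treated as data. A minimal sketch with a made-up three-row excerpt standing in for the full thirty-row matrix; the coverage scores are illustrative simplifications of the cells described above:

```python
# 2 = green (directly scored), 1 = yellow (indirect / subjective), 0 = not covered.
AXES = ["turn-taking", "speech reasoning", "paralinguistic output",
        "safety / emergency", "multilingual"]

coverage = {
    "Full-Duplex-Bench v2": [2, 0, 0, 0, 0],
    "Big Bench Audio":      [0, 2, 0, 0, 0],
    "FLEXI":                [2, 0, 0, 2, 0],
}

# A "thin" axis has at most one benchmark scoring it directly.
for i, axis in enumerate(AXES):
    direct = [name for name, row in coverage.items() if row[i] == 2]
    if len(direct) <= 1:
        print(f"thin axis: {axis!r} (direct coverage: {direct or 'none'})")
```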
### Five axes nothing scores yet
The live version of this heatmap on the [/benchmarks page](/benchmarks) adds five extra columns past the fifteen above — rendered as striped, unexplored cells. They are axes that already exist as evaluation targets in the text-LLM or ASR/TTS literature, but that no public STS benchmark (cascade or full-duplex) scores today:
1. **Code-switch** — single-turn mixing of two languages (Hinglish, Spanglish, JP⇄EN). CS3-Bench touches Mandarin↔English only.
2. **Long-form memory** — entity and topic tracking across thirty-minute-plus conversations. Text-LLM harnesses like LongBench measure this on transcripts; no STS benchmark does it from audio.
3. **Emotion regulation** — the model's ability to *modulate* its own affect in response to the user's (e.g. de-escalate instead of matching anger). Paralinguistic output benchmarks score naturalness of the affect, not its appropriateness.
4. **On-device / edge** — latency, memory, and quality degradation when the model runs on CPU or mobile silicon. Relevant as Pocket-TTS-class models appear on-device; no shared held-out evaluation exists.
5. **Audio adversarial** — robustness under codec artefacts, room noise, and deliberate waveform attacks. ASR has NIST challenge precedent (e.g. CHiME); STS inherits none of it yet.
We flag these not as predictions but as the smallest set of axes a team publishing a new benchmark in 2026 could pick from and land something structurally new. [The next dispatch](/blog/why-new-benchmarks) argues for which two or three of these we think are actually buildable this year.
## Language coverage — English dominance and the Japanese gap
If the four capability axes are the x-dimension of the landscape, language is the y-dimension, and it is unusually skewed.
**English.** SUPERB, Dynamic-SUPERB, AudioBench, VoiceBench, URO-Bench, all four FDB versions, FLEXI, Big Bench Audio, Artificial Analysis, τ-Voice, SID-Bench, FD-Bench, MTR-DuplexBench. Effectively every benchmark in this article's heatmap has an English evaluation track, usually as the default or only track.
**Mandarin.** [VocalBench-zh](https://arxiv.org/abs/2511.08230) (November 2025, 10 subsets, 10K instances, 14 models evaluated) is the general-purpose Mandarin STS benchmark. [CS3-Bench](https://arxiv.org/abs/2510.07881) (October 2025) specifically measures Mandarin-English code-switching; its headline is that S2S models drop ~66% relative on code-switched inputs versus monolingual ones. [HumDial](https://sites.google.com/view/humdial-2026) Track II covers Chinese and English.
**Japanese.** No dedicated full-duplex benchmark exists as of April 2026. [J-Moshi](https://aclanthology.org/2024.emnlp-main.1234/), the Japanese open-weights full-duplex model, uses subjective MOS-based evaluation rather than a shared held-out test set. There is no equivalent of FDB v1 for Japanese audio. The closest substitute is to run the English FDB stimuli through a Japanese-capable model, which does not score Japanese-specific turn-taking conventions (heavier backchanneling, different pause semantics, different repair patterns).
### The Japanese gap
Every other major STS language has at least one public benchmark. Japanese has zero full-duplex benchmarks and a single MOS-based subjective protocol. A Japanese product team comparing two STS vendors today has no shared number to point at — not because the vendors are hiding, but because the measurement layer does not exist.
**Other languages.** Arabic, Hindi, Spanish, Portuguese, French, German, Russian, Korean — none have a dedicated full-duplex benchmark. Most have ASR benchmarks, some have TTS benchmarks, a few have speech-LLM evaluations, but the joint question "how well does a full-duplex STS model hold a conversation in this language" has no public answer.
*(Figure F5: language coverage. English, Mandarin, Japanese, and other major languages against the full-duplex benchmarks available in each.)*
The multilingual gap is the single largest coverage hole in the map.
## Who cites what
The last structural pattern worth naming is which benchmarks flow into which release channels.
**Academic papers** cite FDB (v1, v1.5, v2, v3), VoiceBench, URO-Bench, SD-Eval, MTalk-Bench when they propose new methods. The citation list on an arXiv speech-LLM paper routinely runs to a dozen benchmarks. These are read by researchers.
**Commercial launches** cite a much narrower set. Over the ten most-public STS releases from Q4 2024 through Q1 2026 — OpenAI GPT-4o Realtime, Google Gemini 2.5 Native Audio, Google Gemini 3.1 Flash Live, Amazon Nova Sonic, Amazon Nova 2.0 Sonic, StepFun Step-Audio R1, StepFun Step-Audio R1.1, xAI Grok Voice, Mistral Voxtral, Microsoft MAI-Voice-1 — the benchmark citations cluster into three buckets: Big Bench Audio (via Artificial Analysis), the OpenAI Voice Agent Benchmark, and Artificial Analysis' Conversational Dynamics composite.
*(Figure F6: citation flows. The ten most public STS releases from Q4 2024 through Q1 2026 and the benchmarks each cites, academic versus commercial channels.)*
A few launches cite *no* benchmark. Sesame's [CSM launch](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) cited transcripts and demo audio rather than scores. Moshi's [original launch](https://arxiv.org/abs/2410.00037) cited FDB v1 because the same team authored both. PersonaPlex's release cited internal voice-agent evals that are not reproducible.
The academic-commercial gap here is real, but it is not zero. Artificial Analysis is the bridge. Big Bench Audio flows into commercial launches *through* Artificial Analysis. FDB v1 flows into commercial launches *through* Artificial Analysis' Conversational Dynamics composite. The bridge is proprietary, which means the field has roughly one gateway between its academic evaluation engine and its commercial information diet. If that gateway changes methodology or weighting, the commercial scoreboard moves with it.
For a buyer, the takeaway is to read *both* sides. The commercial benchmarks are the ones vendors will cite. The academic benchmarks are the ones that actually test behavior a vendor might have hand-tuned for. Neither alone is enough.
## Where this lands
Four summary claims follow from the map.
First, **no existing benchmark scores a complete production voice agent.** A buyer has to compose coverage from four or five benchmarks across axes A through D.
Second, **the commercial information diet is narrower than the benchmark landscape itself.** Roughly three citations do most of the work in STS launch posts. Artificial Analysis is the single gateway.
Third, **full-duplex behavior has deepened into a four-version family plus six adjacent benchmarks**, each measuring a distinct sub-axis, with at least four different things sharing the name "barge-in latency."
Fourth, **the thinnest axes are paralinguistic output, safety-assertive behavior, and multilingual coverage** — and multilingual is a global gap, not a Japanese-only one. Japanese full-duplex has no dedicated benchmark at all.
[Article 04](/blog/data-ceiling) covered the data-supply side of the evaluation gap. [Article 05](/blog/foundation-before-vertical) covered the timing argument. [Article 07](/blog/why-new-benchmarks) picks up where this map ends: given the coverage holes named above, what would a next-generation STS benchmark need to measure, and who is positioned to build it? [Article 08](/blog/sts-model-landscape) then covers which models score where.
---
**Fullduplex is open to benchmark collaboration on the thin axes.** Multilingual full-duplex and paralinguistic output are the two we think are buildable in 2026 given the data supply we are operating with. If your lab is working on either, [get in touch](mailto:hello@fullduplex.ai).
---
_Originally published at [https://fullduplex.ai/blog/benchmark-landscape](https://fullduplex.ai/blog/benchmark-landscape)._
_Part of **The STS Series** · 06 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 07 — Why STS needs new benchmarks
_Canonical: https://fullduplex.ai/blog/why-new-benchmarks · Markdown: https://fullduplex.ai/blog/why-new-benchmarks/md_
---
title: "Why STS needs new benchmarks"
description: "The STS field inherited evaluation machinery from ASR, TTS, and text-LLM paradigms. None of them measured a live, two-channel, socially-timed conversation. The argument for a rebuild, plus a concrete picture of who could run it."
article_number: "07"
slug: why-new-benchmarks
published_at: 2026-04-20
reading_minutes: 17
tags: ["benchmarks", "evaluation", "full-duplex"]
canonical_url: https://fullduplex.ai/blog/why-new-benchmarks
markdown_url: https://fullduplex.ai/blog/why-new-benchmarks/md
series: "The STS Series"
series_position: 7
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Why speech-to-speech AI needs new benchmarks
[The previous dispatch](/blog/benchmark-landscape) mapped twenty-four speech-to-speech benchmarks onto fifteen capability axes. Half the cells are empty. Four different metrics share the name "barge-in latency." The commercial information diet is gated by one proprietary runner. A Japanese product team has zero dedicated benchmarks. Reading that map as "we need more benchmarks to fill the gaps" is the wrong conclusion.
The map is telling us something harder. The field imported its evaluation machinery from three prior paradigms — ASR, TTS, and text-LLM — and none of those paradigms measured the thing that makes STS hard: a live, two-channel, bidirectional, socially-timed conversation. Patching the gaps with more benchmarks of the same shape gets us a taller stack of measurements that still miss. The next generation of STS benchmarks has to be designed from the conversation outward, not from the transcript inward. This article is that argument, plus a concrete picture of what the rebuild would look like and who could run it.
## What the map is telling us
Three findings carry over from [the benchmark map](/blog/benchmark-landscape).
First, the benchmarks are **fragmented**. Two dozen public benchmarks each cover a different slice of a single production voice agent. No row on the heatmap lights up across the whole grid.
Second, the commercial information diet is **funneled through one proprietary bridge**. The [Artificial Analysis S2S leaderboard](https://artificialanalysis.ai/speech-to-speech) implements Big Bench Audio and a subset of Full-Duplex-Bench, and almost every commercial STS launch since late 2024 cites it. That bridge is not reproducible without access to AA's internal runner and prompt templating.
Third, the **multilingual coverage is a global gap, not a Japanese-only one.** Mandarin has three dedicated benchmarks. Japanese has zero dedicated full-duplex benchmarks. Arabic, Hindi, Spanish, Portuguese, French, German, Russian, Korean — none have a dedicated full-duplex benchmark either.
The obvious read is "the field needs to build more benchmarks." That read is wrong, or at least incomplete. The underlying problem is that the existing benchmarks measure what it is easy to measure with ASR, TTS, and text-LLM infrastructure, and *not* what full-duplex STS actually needs scored. The rest of this article argues that claim in three moves: where the inherited paradigms came from, which mismatches they produced, and what a next-generation benchmark would need to measure instead. Then we name who could build it.
## Three inherited paradigms, three blind spots
STS evaluation did not start from scratch. It reused machinery from three earlier speech and language paradigms. Each inheritance imported a useful metric and a specific blind spot.
**ASR paradigm → Word Error Rate.** The automatic speech recognition field spent thirty years refining WER, the ratio of transcription errors to total words spoken. When large speech models arrived, WER was the ready-to-hand metric that researchers knew how to compute. But WER measures *transcription*, not *interaction*. A model can score WER 5% on a held-out test set and still interrupt the user constantly, freeze when interrupted itself, or backchannel at wrong moments. [The Full-Duplex-Bench v1 paper](https://arxiv.org/abs/2503.04721) made this argument explicit in early 2025: transcription accuracy measures the wrong thing for conversational models. Interaction is orthogonal to transcription, and if you score only the latter, you leave the former unmeasured.
**TTS paradigm → Mean Opinion Score and listening tests.** The text-to-speech field's standard measurement is MOS: human raters scoring audio quality on a 1-5 scale. MOS captures *naturalness* — does this voice sound like a person? — but not *appropriateness*. A model can have a pleasant voice and still fail to match the user's emotional register, over-affect neutral content, or sound warm during moments that call for clinical restraint. [J-Moshi](https://aclanthology.org/2024.emnlp-main.1234/) explicitly uses subjective MOS-based evaluation with no shared held-out test set, which is the TTS inheritance made visible. The Mandarin generation-side benchmark [VocalBench](https://arxiv.org/abs/2505.15727) extends MOS to voice-agent scenarios but stays in the naturalness frame.
**Text-LLM paradigm → fixed-test-set reasoning scores.** When GPT-3 and GPT-4 arrived, the evaluation community built fixed-test-set benchmarks — MMLU, HellaSwag, HumanEval, GPQA. These work because text reasoning is a symbol-manipulation task that a static benchmark can capture faithfully. When audio reasoning benchmarks appeared, they adopted the same shape: [Big Bench Audio](https://huggingface.co/blog/big-bench-audio-release) is a 1,000-item audio adaptation of BIG-Bench text questions. Nothing wrong with that as a reasoning probe, but Big Bench Audio is functionally a text reasoning benchmark with audio stimuli. It does not score anything that could not have been scored from the transcript, and it runs one-turn closed-ended questions rather than dialogue.
three paradigms, three blind spots
**WER** is transcription without interaction. **MOS** is naturalness without appropriateness. **Audio reasoning** is text reasoning with sound attached. The benchmarks we have are good at what their parent paradigms were good at — and blind to what they were never designed to see.
{{FIG:f1}}
## Four measurement mismatches
The inherited paradigms produce four specific measurement mismatches when applied to full-duplex STS. Each one is a concrete failure mode, not an abstract critique.
**Mismatch 1. Fixed test sets cannot score live dynamics.** FDB v1 (March 2025), FDB v1.5 (July 2025), SID-Bench, FD-Bench, MTR-DuplexBench all use pre-recorded stimuli. A model is fed an audio file, its output is recorded, and scores are computed post-hoc. Streaming STS does not behave this way in production. Packet jitter, network variability, and real-time pressure produce behaviors that do not appear in offline evaluation. [FDB v2](https://arxiv.org/abs/2510.07838) (October 2025) is the first benchmark to acknowledge this and move to a live WebRTC-style examiner. It is also the first to find that model rankings are not invariant across offline and live protocols. Same model, two scoring paradigms, different ranking. That is evidence the inherited fixed-test-set paradigm was systematically missing something.
**Mismatch 2. Transcript-only judges cannot score paralinguistic output.** FDB v1's user-interruption axis scores relevance via GPT-4-turbo reading a transcript. If the model produces a response that is textually relevant but delivered in a flat, irritable, or emotionally wrong register, the transcript judge rates it correct. No field-level benchmark currently penalizes paralinguistic output failures at scale. [VocalBench](https://arxiv.org/abs/2505.15727) and [MTalk-Bench](https://arxiv.org/abs/2505.15524) point toward the generation-side scoring that would be needed, but neither is adopted by the major full-duplex benchmarks. Paralinguistic output is the largest unmeasured axis in production STS. Users will say "the model sounds wrong" and the benchmark will say "the model scores correctly."
**Mismatch 3. Single-language benchmarks cannot score cross-cultural turn-taking.** Japanese conversational turn-taking includes short backchannels ("hai", "un", "sou desu") at roughly one-to-two-second intervals, substantially more frequent than in English. Run FDB v1's pause-handling test — which uses a take-over-rate detector tuned to English norms — on a Japanese-capable model, and the model's correct Japanese behavior fires as false positives. There is no way to score Japanese turn-taking on an English-designed benchmark, and no Japanese equivalent exists. The [J-Moshi](https://aclanthology.org/2024.emnlp-main.1234/) authors bypassed this by using MOS rather than a shared held-out test set. Every other non-English-dominant market faces the same problem. Arabic conversational overlap is higher than English. Hindi code-switching is dense. Mandarin gets some coverage via [VocalBench-zh](https://arxiv.org/abs/2511.08230) and [CS3-Bench](https://arxiv.org/abs/2510.07881), but the principle is the same: language-specific turn-taking norms cannot be evaluated by benchmarks that assume English norms.
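A toy illustration of the failure mode, with an invented threshold rather than FDB v1's actual detector: score short aizuchi-style backchannels with an English-tuned minimum-overlap rule and they register as interruptions.

```python
# Toy detector, not FDB v1's implementation. Listener vocalizations are (start, end)
# times in seconds inside the speaker's turn; anything longer than the threshold is
# scored as a take-over attempt.
ENGLISH_MIN_OVERLAP_S = 0.4   # assumed English-tuned threshold, for illustration only

def count_take_overs(vocalizations, min_overlap_s=ENGLISH_MIN_OVERLAP_S):
    return sum(1 for start, end in vocalizations if end - start >= min_overlap_s)

# Japanese-style aizuchi: short confirmations ("hai", "un") roughly every 1.5 seconds.
aizuchi = [(1.5, 2.0), (3.0, 3.5), (4.5, 5.0), (6.0, 6.5)]

print(count_take_overs(aizuchi))        # 4 "interruptions" under the English-tuned rule
print(count_take_overs(aizuchi, 0.8))   # 0 once the rule accommodates backchannel length
```

The bug is not in the arithmetic. It is in the assumption that one threshold describes every language's turn-taking norms.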
**Mismatch 4. Proprietary runners cannot serve reproducibility.** Artificial Analysis is structural infrastructure for the field. Every commercial STS launch since GPT-Realtime has cited an AA number. But every published score depends on a closed runner. When AA's judge model updates, every score moves. When AA changes weighting across Conversational Dynamics sub-axes, the composite changes silently. This is not a design flaw specific to AA. It is the consequence of closing the loop between commercial marketing and public comparison through a single proprietary intermediary.
aside
The field ended up with one gateway, and the gateway is not inspectable. When that single pipe re-weights a composite, every public STS scoreboard moves in lockstep without any published changelog. That is not a neutral intermediary — it is load-bearing infrastructure without public accountability.
{{FIG:f2}}
These four mismatches together explain why the coverage map has so many empty cells. The cells are not empty because no one has gotten around to running the experiments. The cells are empty because the experiments do not fit the inherited measurement paradigms. Paralinguistic output is empty because the parent paradigms scored *either* transcript text (ASR lineage) *or* naturalness of audio (TTS lineage), not the joint question of whether the generated audio's affect matches the requested affect. Safety / emergency barge-in is empty because the parent paradigms never had a notion of "model should interrupt the user." Multilingual full-duplex is empty because every inherited benchmark was designed in English first and translated later.
## What a next-generation STS benchmark would need to measure
Pivot from criticism to construction. Five requirements follow directly from the mismatches above, each derivable from a specific failure mode.
**Requirement 1 — live examiner as default.** A model's full-duplex behavior exists only in live time. Pre-recorded stimuli can be a supplement, but the primary measurement has to happen in a streaming environment that introduces the packet-level and time-pressure effects real users experience. FDB v2 is the proof of concept. A next-generation benchmark makes the live examiner the default protocol, and the offline protocol the fallback for infrastructure-limited environments.
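To make the protocol difference concrete, here is a schematic contrast, not a benchmark implementation; `model` stands for any streaming STS endpoint, and the `feed`, `drain`, and `respond_to_file` methods are hypothetical placeholders rather than a real API.

```python
import random
import time

def offline_protocol(model, stimulus_file):
    # Feed a pre-recorded file, then score the recorded output after the fact.
    return model.respond_to_file(stimulus_file)

def live_protocol(model, stimulus_chunks, chunk_ms=80, jitter_ms=(0, 120)):
    # Stream chunks in (simulated) real time with jitter. The model has to commit to
    # behavior (speak, stay silent, barge in) before it has heard the whole stimulus.
    for chunk in stimulus_chunks:
        time.sleep((chunk_ms + random.uniform(*jitter_ms)) / 1000)
        model.feed(chunk)
    return model.drain()
```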
**Requirement 2 — joint audio-and-transcript scoring.** Any conversational-dynamics axis that involves how a model *says* something, not just what it says, needs a judge that hears the audio. The transcript is a projection of the signal that drops half the information. Practical implementation is an LLM examiner with audio input — already technically available from frontier vendors — wrapped in a scoring rubric that explicitly weights paralinguistic output.
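A minimal sketch of what joint scoring looks like, assuming a generic audio-capable judge; the rubric wording and return fields below are illustrative, not any existing benchmark's schema.

```python
from dataclasses import dataclass

RUBRIC = (
    "Score each 0-1: (a) content_relevance, judged from the transcript, does the reply "
    "address the user's turn; (b) affect_match, judged from the audio, does the delivery "
    "(pace, pitch, energy) fit the requested register."
)

@dataclass
class TurnScore:
    content_relevance: float   # recoverable from the transcript alone
    affect_match: float        # only recoverable from the signal itself

def score_turn(audio_judge, response_audio: bytes, transcript: str, requested_register: str) -> TurnScore:
    # `audio_judge` is a placeholder for any LLM examiner that accepts audio input.
    verdict = audio_judge(
        rubric=RUBRIC,
        audio=response_audio,            # the judge hears the audio, not just the words
        transcript=transcript,
        context=f"requested register: {requested_register}",
    )
    return TurnScore(verdict["content_relevance"], verdict["affect_match"])
```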
**Requirement 3 — multilingual from day one.** A next-generation benchmark designs its protocol so that language-specific turn-taking norms can be encoded in the scoring rule, not hard-coded to English. Japanese backchannel frequency, Mandarin tonal cues in emotional expression, Arabic conversational overlap norms, Hindi-English code-switching rates — these are research-grade linguistic-typology questions, not engineering corner cases, and they need to be in the benchmark's design document, not patched later. [HumDial](https://sites.google.com/view/humdial-2026) at ICASSP 2026 is the first community-scale attempt to include a multilingual track from the start (Chinese + English across 6,356 interruption and 4,842 rejection utterances). That is the shape. It needs four more language tracks.
**Requirement 4 — open methodology including judge selection.** Reproducibility requires four things: the stimuli, the runner code, the prompts, and the judge. Today's benchmarks open varying subsets. FDB v1 is open on stimuli and metrics but uses GPT-4-turbo as an opaque judge. Artificial Analysis is closed on runner, prompts, and weighting. A next-generation benchmark has to publish all four, including the judge model's version and the prompt template. Proprietary-score leaderboards can still exist, but they cannot be the field's reference.
**Requirement 5 — composite scores with transparent weighting.** Any aggregation into a single number publishes its weights and allows users to re-weight based on their product's priorities. If conversational-dynamics composite scoring weights "smooth turn-taking" at 30% and a product team cares 3× more about "interruption handling," the benchmark should expose the weights and support re-aggregation. Today's composites — including Artificial Analysis' Conversational Dynamics composite — do not expose weights.
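A sketch of what a re-weightable composite could look like; the axis names and default weights below are invented for illustration, since Artificial Analysis does not publish its weighting.

```python
def composite(axis_scores: dict, weights: dict) -> float:
    # Weighted mean, normalized so re-weighted totals stay on the same 0-1 scale.
    total = sum(weights.values())
    return sum(axis_scores[axis] * w for axis, w in weights.items()) / total

scores    = {"turn_taking": 0.72, "interruption_handling": 0.55, "backchannel": 0.80}
published = {"turn_taking": 0.30, "interruption_handling": 0.30, "backchannel": 0.40}
mine      = {**published, "interruption_handling": 0.90}   # this product cares 3x more here

print(round(composite(scores, published), 3))   # 0.701, the leaderboard's number
print(round(composite(scores, mine), 3))        # 0.644, the number that matters for this product
```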
{{FIG:f3}}
### The dataset side of the same problem
A benchmark is only as strong as its reference data. Every benchmark above sits on top of a dataset: FDB v1 on the ICC corpus, Big Bench Audio on a custom audio recording of BIG-Bench text items, VocalBench on its own Mandarin recordings. A next-generation STS benchmark needs reference data it can hold out: two-channel conversations in the target language, with annotations for turn-taking events, overlap, and disfluency. A single-channel mono dataset cannot score full-duplex, because the ground truth for full-duplex behavior is encoded in the separation of the two channels. That is the same shortage [the data ceiling](/blog/data-ceiling) and [the foundation-threshold argument](/blog/foundation-before-vertical) cover, surfacing in a different domain: the dataset gap and the benchmark gap rhyme because both sit on the same two-channel supply problem.
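For concreteness, a sketch of the minimum record such a reference set would need to carry; the field names are assumptions for illustration, not an existing corpus schema.

```python
from dataclasses import dataclass, field

@dataclass
class TurnEvent:
    t_start: float    # seconds from conversation start
    t_end: float
    channel: int      # 0 or 1: which speaker's channel the event lives on
    kind: str         # "turn", "backchannel", "overlap", "disfluency"

@dataclass
class DuplexConversation:
    audio_ch0: str    # path to speaker 0's channel, never mixed down
    audio_ch1: str    # path to speaker 1's channel
    language: str     # turn-taking norms are scored per language
    events: list = field(default_factory=list)   # list of TurnEvent annotations
```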
## Who could build this
Four plausible builder types, each with a path and a specific weakness.
**Academic consortium.** HumDial at ICASSP 2026 is the proof that this model works. A grand-challenge-style benchmark with multiple co-authoring institutions, released open with training data and a held-out test set. Weakness: the funding and publication cycle doesn't match STS iteration speed. By the time a v2 consortium benchmark ships, the model landscape has moved. HumDial is a single-shot event; FDB v1 has already shipped three successors (v1.5, v2, v3) across thirteen months, which is closer to the iteration speed the field actually operates at.
**Open-source community via Hugging Face.** Big Bench Audio shipped through Hugging Face's blog and dataset hub. This works for lightweight, fixed-test-set benchmarks. It struggles for live examiner paradigms because Hugging Face Spaces does not currently provide the streaming infrastructure — WebRTC, low-latency media pipelines — that a live examiner needs. That could change. If it does, HF becomes a plausible default.
**Independent commercial analyst firm with open methodology.** Artificial Analysis currently occupies this role, with a closed methodology. If AA open-sources its runner, prompts, and weighting — or if a competitor launches with open methodology — the field gets a commercial bridge that is also reproducible. Weakness: business-model incentives push toward closed. AA's differentiation is its prompt templating and judge selection. Open-sourcing those removes a defensible moat. A pre-commitment to transparency from day one is a plausible strategy; retrofitting transparency onto an established closed leaderboard is harder.
**Dataset-first company.** If the organization that assembled the reference data also defines the scoring standard, the data and the benchmark co-evolve. This is an emerging pattern. [τ-Voice](https://sierra.ai/blog/tau-voice) (Sierra, 2025) is a benchmark published by the company that deploys the underlying agents. [VocalBench](https://arxiv.org/abs/2505.15727) is Mandarin-native and comes from teams building Mandarin STS. Fullduplex is another candidate in this category. Weakness: commercial positioning creates obvious conflicts of interest unless the scoring is published and reproducible. The dataset-first path only produces a credible benchmark if the builder pre-commits to open methodology and external validation.
{{FIG:f4}}
No single builder type solves the whole problem. The honest forecast is that the next few years will see a mix: an ICASSP-class academic consortium for a multilingual full-duplex benchmark (annual cadence, open data), an open-source Hugging Face replacement for Big Bench Audio that includes paralinguistic stimuli (community cadence, modest scope), and at least one commercial leaderboard that competes with Artificial Analysis on open-methodology positioning. A dataset-first company with an open benchmark is the fourth piece, and the most interesting commercially because it aligns evaluation with training data assembly.
the target zone
The next-generation STS benchmark has to sit in the same quadrant as the text-LLM leaderboards that reshaped that field: **fast iteration** (weekly-to-monthly, not annual) crossed with **reproducible methodology** (open runner, open judge, open weighting). Everything else — slow consortia, closed arenas, dataset-first labs without transparency — falls short on one axis or the other.
## What this means for different readers
Three summaries, one per reader priority.
**For researchers:** the open opportunity is multilingual live examiner benchmarks. Japanese specifically (no FDB-equivalent exists), but Korean, Arabic, Hindi, and Spanish are all publishable gaps. Paralinguistic output is a second opportunity; the methodology is not solved but the audio-input LLM judges needed to solve it are now available from frontier vendors.
**For VCs:** evaluation infrastructure is a real layer of the stack, not a cost center. The question "who is positioned to build the reproducible version of Artificial Analysis" has candidate answers — an open-methodology commercial leaderboard, an academic consortium with commercial partners, a dataset-first company with open scoring — and the winner gets durable commercial positioning because the field needs a reference bridge that is not proprietary. This is adjacent to the model layer rather than competitive with it.
**For product engineers and buyers:** compose coverage from multiple benchmarks until a unified one exists. When a vendor cites "SOTA on full-duplex," ask *which version of Full-Duplex-Bench, which axis, which barge-in definition.* When a vendor cites a single composite score, ask for the weighting. If the weighting is not published, treat the number as advertising rather than measurement. For Japanese, Korean, and other non-English deployments, no benchmark currently answers your question. Budget for internal evaluation accordingly.
## Where this lands
[The benchmark map](/blog/benchmark-landscape) described the benchmarks as they are. This article argued what they would need to become. Together they define the evaluation side of the STS field as of April 2026.
Two claims summarize the argument. First, **the existing benchmarks are not incomplete, they are misaligned.** They inherited their shape from ASR, TTS, and text-LLM paradigms that did not measure bidirectional live conversation. Filling empty cells on the current map with more benchmarks of the same shape produces a taller stack of the same mismeasurement. Second, **the rebuild is buildable**, not speculative. FDB v2's live examiner, HumDial's multilingual track, VocalBench's paralinguistic scoring, and the explicit acknowledgement that Artificial Analysis is a proprietary bridge — these are public work from 2025 and 2026. A next-generation benchmark assembles these five requirements and publishes them openly. The question is who runs it.
[Article 08](/blog/sts-model-landscape) covers which models score where on the benchmarks that exist today. [Article 09](/blog/consent-licensing-opt-in) covers the consent and licensing constraints on the reference data that any next-generation benchmark will need.
---
Fullduplex is working on benchmarks meant to advance the kind of measurement infrastructure this article maps. If your lab or team is working in this area, [get in touch](mailto:hello@fullduplex.ai).
---
_Originally published at [https://fullduplex.ai/blog/why-new-benchmarks](https://fullduplex.ai/blog/why-new-benchmarks)._
_Part of **The STS Series** · 07 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 08 — The STS model landscape
_Canonical: https://fullduplex.ai/blog/sts-model-landscape · Markdown: https://fullduplex.ai/blog/sts-model-landscape/md_
---
title: "The STS model landscape"
description: "Thirty-plus speech-to-speech models, four architectural families, and a licensing pattern that is starting to split inside each lab. A field guide to the April 2026 map, legible enough to place newly announced models in one or two paragraphs."
article_number: "08"
slug: sts-model-landscape
published_at: 2026-04-20
reading_minutes: 20
tags: ["models", "architecture", "licensing"]
canonical_url: https://fullduplex.ai/blog/sts-model-landscape
markdown_url: https://fullduplex.ai/blog/sts-model-landscape/md
series: "The STS Series"
series_position: 8
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# The STS model landscape — who is building what
Eighteen months ago, writing about the speech-to-speech (STS) landscape meant writing about Moshi and adding "…and some academic papers in China." That framing is out of date. As of April 2026 there are at least thirty publicly documented open-weights or paper-released STS models, at least four architecturally distinct families, and a separate closed commercial frontier layer from every major lab. The landscape is legible enough that buyers, researchers, and investors can start asking useful questions instead of betting on whichever demo was viral last week.
This article maps the field. [Article 03](/blog/pipeline-to-integrated) introduced the four-family taxonomy (dual-stream plus codec, interleaved-flatten, cascade plus predictor, codec-free). This article populates each family with names, licenses, and short architectural notes, then surfaces the closed commercial layer and three sub-categories that are starting to peel off inside the families. The organizing goal is that someone new to STS can leave this page able to place a newly announced model into the landscape in one or two paragraphs.
{{FIG:f1}}
Three things to keep in mind while reading. First, "full-duplex" is not a single spec. A Family 1 dual-stream model and a Family 3 cascade can both claim full-duplex and mean different operational things. Second, "open-weights" does not imply "commercially usable." Six distinct non-closed license regimes are in active use, and several block commercial paths. Third, this map is April 2026. New releases are arriving at a cadence of roughly one per month across the four families, which is itself a diagnostic: a field with monthly releases across multiple labs is at a different stage than a field with one lab and a handful of followers.
the three things the landscape asks of you
Reading a new STS release in 2026 means answering three questions in order: **which family** (dual-stream, interleaved, cascade-plus-predictor, codec-free), **what license posture** (permissive, non-commercial, closed-commercial, or split-with-a-sibling), and **which sub-category** (reasoning-realtime, translation-duplex, or voice-cloning-inside-STS). Every model on the map lands somewhere in that 4 × 4 × 3 grid.
## Why this map matters now
Three-quarters of 2026 Q1 investor conversations about voice AI still open with Moshi as the implicit reference model. The implicit mental model goes: "Moshi shipped the first open full-duplex STS in 2024, a few labs fine-tuned it, a few labs tried other approaches, and the rest is closed commercial work at big labs." That model was roughly correct a year ago. It is not correct now.
The quick tally: Kyutai has now shipped three distinct public models (Moshi, Hibiki, Hibiki-Zero) and one open modular alternative (Kyutai Unmute). NVIDIA ADLR shipped PersonaPlex as a Moshi fine-tune. Sesame released CSM-1B open-weights and keeps its 8B variant closed. Alibaba produced an open OmniFlatten paper, then a productised Qwen2.5-Omni under Apache 2.0, then Qwen3-Omni (Apache 2.0, 30B MoE) in September 2025, and then pivoted to closed for Qwen3.5-Omni in March 2026. Tencent shipped Freeze-Omni (cascade family), then in March 2026 released Covo-Audio and Covo-Audio-Chat-FD under CC BY 4.0 (interleaved family). StepFun has an unbroken open-weights cadence through Step-Audio-R1.1. OpenBMB shipped MiniCPM-o 4.5 as an on-device cascade-plus-predictor. ByteDance has two distinct branches: an academic branch (SALMONN-omni, codec-free) and a production branch (Doubao and Seeduplex, closed at hundreds of millions of users). That is a lot of labs, and it is a lot of divergent design choices.
For investors the implication is that the defensibility question is no longer only "who shipped first." It is license posture (CC BY 4.0 and Apache 2.0 clear commercial paths, FAIR-NC and NVIDIA OneWay Noncommercial block them), data moat (how many hours of what kind of training data), and family choice (which of the four architectural branches is being bet on). The Q1-2026 funding wave, detailed further down, concentrated in companies that are building on top of this landscape rather than inside the foundation layer.
{{FIG:f2}}
## What counts as STS in this article
Some definitional discipline, because the field uses overlapping words. This article uses "STS" to mean a model that takes speech in and emits speech out, with the LLM reasoning on speech (or jointly on speech and text) rather than only routing transcripts. The inclusion bar is full-duplex capable (the model can listen while speaking) or integrated speech-language-modeling (the audio and text are being modeled together), even if the duplex behavior is bolted on via a predictor head. This excludes pure TTS, pure ASR, and pure cascaded voice agents that wrap a text LLM without any joint audio modeling.
Some systems sit on the boundary. Kyutai Unmute wraps a text LLM with Kyutai's own streaming STT and TTS; it is fast and fully open, but the LLM itself operates on text. Meta's Spirit-LM is a single-stream expressive LM gated under FAIR-NC. NVIDIA Audio Flamingo 3 has streaming TTS output under NVIDIA OneWay Noncommercial. These are "near-STS" systems; they show up in the commercial-frontier section rather than in the family sections.
## Family 1: dual-stream plus neural codec
Family 1 models treat user audio and model audio as independent token streams, decoded jointly against an inner-monologue text stream. Kyutai's Moshi is the origin: two parallel transformer streams, a 12.5 Hz neural codec (Mimi), and a theoretical latency of 160 ms with ~200 ms measured in practice. Moshi's weights are CC-BY 4.0, code is MIT, and the paper is [arXiv:2410.00037](https://arxiv.org/abs/2410.00037).
Four branches descend from Moshi's root. First, translation-duplex: Kyutai Hibiki is a speech-to-speech translation derivative, and Hibiki-Zero (February 2026, 3B, open-weights) extends it with GRPO reinforcement learning that does not require word-level aligned data, adding Spanish, Portuguese, German, and Italian as input languages. Hibiki-Zero is not conversational in the companion sense; it is translation-shaped duplex. Second, specialized fine-tunes: NVIDIA PersonaPlex (January 2026) is a Moshi fine-tune for persona-grounded dialogue, trained on 1,217 hours of Fisher plus 2,250 hours of synthetic, released under the NVIDIA Open Model License with MIT code. Third, codec siblings: Sesame CSM-1B (Apache 2.0) reuses Mimi as its codec, while CSM-Medium 8B remains closed. Fourth, production-scale closed deployment: ByteDance Seeduplex, shipping inside the Doubao product, has a dual-stream architecture but is API-only; its April 2026 release is the first hundreds-of-millions-of-users full-duplex consumer deployment.
{{FIG:f3}}
Family 1 is the most structurally mature of the four. The codec is reusable (Mimi is now in Moshi, Hibiki, PersonaPlex, Sesame CSM, and derivative experiments), the inner-monologue pattern is portable, and the achievable latency sits close to the roughly 200 ms gap that human listeners read as "natural" turn-taking. The data question remains the binding constraint: each of these models needs two-channel dyadic audio to learn the full-duplex behaviour, and the public supply of that data is orders of magnitude short of what text LLMs have had for scaling. That supply constraint is the subject of [Article 04](/blog/data-ceiling).
PersonaPlex is worth a paragraph on its own because it is the first open-weights Moshi-family checkpoint that treats persona as a *first-class input* rather than a post-training style tag. The hybrid conditioning path — a voice prompt capturing timbre / style plus a text prompt pinning role, facts, and scenario — means the same 7B checkpoint can be a customer-service agent at inference time N and a medical intake scribe at inference time N+1 without any weight change. NVIDIA reports ~170 ms average first-response latency on a FullDuplexBench-style trace, which sits at the better end of the Family 1 distribution. Weights ship on Hugging Face as [`nvidia/personaplex-7b-v1`](https://huggingface.co/nvidia/personaplex-7b-v1) under the NVIDIA Open Model License, with code MIT on [GitHub](https://github.com/NVIDIA/PersonaPlex). Conceptually, the important move is that persona conditioning stops being the responsibility of the wrapper (system prompt + voice selection) and starts being a property the speech-LM itself exposes — a pattern we expect other Family 1 labs to copy in the next year.
## Family 2: interleaved / flatten single-stream
Family 2 packs speech tokens and text tokens into a single repeating-block sequence. Full-duplex behaviour emerges from the blocking cadence (for example, in OmniFlatten's final stage, a repeating pattern of 2 text tokens and 10 speech tokens) rather than from parallel streams. This family has the most entrants as of April 2026.
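A minimal sketch of the packing step, assuming the 2-text / 10-speech cadence quoted above; real systems also handle padding, stream identity, and end-of-block markers, which are omitted here.

```python
def interleave(text_tokens, speech_tokens, text_per_block=2, speech_per_block=10):
    # Flatten two token streams into one repeating-block sequence.
    packed, t, s = [], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        packed += text_tokens[t:t + text_per_block]
        packed += speech_tokens[s:s + speech_per_block]
        t += text_per_block
        s += speech_per_block
    return packed   # one flat sequence a standard decoder-only LM can model

# 4 text tokens and 20 speech tokens pack into [T T S*10 T T S*10].
print(interleave(list(range(4)), list(range(100, 120))))
```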
The Alibaba stack dominates the count. OmniFlatten (October 2024, paper-only, Qwen2-0.5B base, 100% synthesised training data) was the first public system in this family. Qwen2.5-Omni (March 2025, Apache 2.0) is its productised descendant. Qwen3-Omni (September 2025, Apache 2.0, 30B MoE with 3B active parameters) became the public flagship. In March 2026 Alibaba shipped Qwen3.5-Omni as a closed preview with a native audio-understanding encoder and native turn-taking intent recognition. The Alibaba pattern (open base, closed flagship) is worth flagging separately below.
StepFun has a sustained open cadence: Step-Audio, Step-Audio 2, and Step-Audio-R1.1 (January 2026, reasoning-tuned realtime variant) plus Step-Audio-EditX (January 2026, paralinguistic-edit). Zhipu AI shipped GLM-4-Voice. CAS / ICT-CAS released LLaMA-Omni and LLaMA-Omni 2, built on a Meta Llama base. Moonshot AI released Kimi-Audio (April 2025, MIT, 13M+ hours of training data). Tencent added Covo-Audio and Covo-Audio-Chat-FD (March 2026, CC BY 4.0, tri-modal interleaving) as a Family 2 entrant distinct from Freeze-Omni's Family 3 line. Shanghai Jiao Tong released SLAM-Omni.
The newest Family 2 entrant is FlashLabs Chroma (January 2026, 4B, open-weights, interleaved-flatten with 1:2 text-audio ratio, RTF 0.43, sub-second latency). Chroma is notable because it is the first open-source integrated STS to ship with built-in personalized voice cloning. That has consent implications picked up in [Article 09](/blog/consent-licensing-opt-in).
{{FIG:f4}}
The operational reason Family 2 has the most entrants is that interleaved single-stream is the most natural shape when you start from a text LLM and add audio. Families 1, 3, and 4 each require deeper architectural surgery. Family 2 is a packing-and-sequencing problem, which is tractable for any lab with a strong text LLM and audio tokenization capability. The flip side is that the latency story is less clean (a packed single stream cannot be as parallel as a dual-stream setup), and the full-duplex behaviour depends heavily on the blocking cadence chosen at training time.
## Family 3: cascade with chunk-level duplex predictor
Family 3 keeps ASR and TTS conceptually separate but adds a state-predictor head that chunks input and output so the model can interrupt, backchannel, or pause at sub-second granularity. This is sometimes called time-division multiplexing after MiniCPM-o's framing.
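A toy sketch of the control loop this family implies; the states, chunk size, and function signatures are invented for illustration and do not reproduce any specific model's design.

```python
from enum import Enum

class Action(Enum):
    LISTEN = "listen"            # keep accumulating user audio
    BACKCHANNEL = "backchannel"  # emit a short acknowledgement without taking the turn
    SPEAK = "speak"              # take the turn with a full response
    YIELD = "yield"              # the user barged in: cut our own output

def run_duplex(chunks, predictor, asr, llm, tts, chunk_ms=200):
    """Process user audio chunk by chunk; the predictor gates every stage transition."""
    transcript = ""
    for chunk in chunks:                        # each chunk is roughly chunk_ms of audio
        transcript += asr(chunk)                # ASR stays a separate, inspectable stage
        action = predictor(chunk, transcript)   # sub-second decision for this chunk
        if action is Action.SPEAK:
            yield tts(llm(transcript))
        elif action is Action.BACKCHANNEL:
            yield tts("mhm")
        elif action is Action.YIELD:
            yield None                          # signal the audio player to stop playback
        # Action.LISTEN falls through: keep listening
```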
Freeze-Omni (November 2024, Tencent plus Nanjing University and Fudan) is the reference point: latency of 160-320 ms for model-only and ~1.2 seconds in real scenario deployment, weights available under an Apache-style release. MiniCPM-o 4.5 (OpenBMB) brought the time-division-multiplex approach to on-device deployment. Mini-Omni and Mini-Omni 2 (Tsinghua) populate the academic-scale end. OpenS2S (CASIA) is the empathy-first fully open entrant, releasing code, data, and weights together.
The important 2026 Q1 entrant is DuplexCascade (March 2026 arXiv, [paper-id: duplexcascade-2026](https://arxiv.org/abs/2603.09180)). DuplexCascade is VAD-free: it uses conversational control tokens and micro-turn chunks to make the turn-taking decision end-to-end within the cascade, rather than routing through a voice-activity detector. The paper claims state-of-the-art full-duplex turn-taking on Full-Duplex-Bench among open-source STS systems. That matters because [Article 03](/blog/pipeline-to-integrated) had written off the cascade-plus-predictor family as the branch most likely to lose out to integrated systems. DuplexCascade reopens that question, at least for labs that prefer to keep ASR and TTS as separable components.
{{FIG:f5}}
The trade-off inside Family 3 is that it inherits a compounding-error vulnerability that Families 1, 2, and 4 avoid: errors in the ASR stage propagate into the LLM stage and then into the TTS stage, and the predictor head does not undo them. In practice this means Family 3 systems tend to be stronger on strictly-defined turn-taking tasks and weaker on paralinguistic expressiveness, because the text bottleneck in the middle strips prosody. For enterprise buyers that want a clean, inspectable pipeline, that is sometimes a feature. For consumer companion-app deployment, it is usually a problem.
## Family 4: codec-free single-decoder
Family 4 is thin. The canonical example is SALMONN-omni (ByteDance), which operates on continuous embeddings without a neural audio codec in the loop and uses an internal "thinking" state to decide when to emit speech versus listen. The design choice is architectural minimalism: no codec means no codec-artefact failure modes, but it also means fewer reusable components and less production tooling than Families 1 or 2.
As of April 2026 Family 4 has one serious public entrant. If the codec-free approach proves out at production scale, it could seed a distinct research lineage; at this point it is a placeholder family in the taxonomy rather than a populated one. Including it in the map is the right move anyway, because a four-family taxonomy that collapsed it into Family 3 would mis-describe what the ByteDance team is actually doing.
aside
The taxonomy is a working object, not a finished one. A fifth family would not be surprising in 2026 — a discrete-token diffusion approach, or a retrieval-augmented STS that retrieves at the audio level rather than the text level. The test for "new family vs. variant of an existing family" is whether the training-data shape and the architectural choice are jointly new. Every current family passes that test. Future entrants should be held to the same bar.
## Near-STS and the closed commercial frontier
Near-STS, as defined earlier, includes Kyutai Unmute (modular cascade wrapping a text LLM with Kyutai's open STT and TTS, production deployment at ~450 ms, all components MIT-licensed), Meta Spirit-LM (single-stream expressive LM under FAIR-NC, gated, English-only), NVIDIA Audio Flamingo 3 (streaming TTS output under NVIDIA OneWay Noncommercial), VITA-MLLM LUCY (emotion-token plus tool use), F-Actor (KIT / Edinburgh / NatWest, 2,000 hours of fine-tuning, academic-budget), IntrinsicVoice, and several others on the boundary. These systems are usable research artefacts, but none of them is a joint audio-language modeler in the Family 1-4 sense.
The closed commercial frontier is where the consumer-scale deployments live. OpenAI ships GPT-4o voice plus the Realtime API and its gpt-realtime successor. Google ships Gemini Live plus Gemini 3.1 Flash Live (March 2026). Amazon ships Nova Sonic. Microsoft ships MAI-Voice-1 plus MAI-Transcribe-1 (broad availability April 2026). ByteDance ships Doubao in China and Seeduplex as the April 2026 full-duplex upgrade. Hume ships EVI plus Octave 2 plus EVI 4-mini (a prosody-driven half-duplex with an external LLM; full EVI 4 with a native-LM variant was still pending as of April 2026). Cartesia ships Sonic. ElevenLabs has expanded its TTS catalogue into streaming voice agents. Deepgram ships Aura Nova. xAI launched Grok Voice APIs in April 2026.
The Q1-2026 funding wave concentrated in this frontier or one layer below it. Deepgram closed a $130M Series C in January 2026. Parloa raised $350M Series D at a ~$3B valuation. Decagon closed a $250M Series D at a $4.5B valuation, followed by a secondary tender offer in March 2026. ElevenLabs closed a $500M Series D at a ~$11B valuation in February 2026. Retell raised a publicly-disclosed $4.6M seed in March 2026. Across five voice-AI rounds inside eight weeks, the Q1-2026 total was approximately $1.23B. For comparison, the equivalent Q1-2024 number was roughly an order of magnitude smaller.
Two acquisitions completed the picture. In January 2026 Google DeepMind acqui-hired Hume AI's CEO and approximately seven engineers, with Hume's consumer product continuing under the original team. Apple acquired Q.ai, a silent-speech interface company, at a reported $1.6-2B valuation in January 2026. The hyperscaler absorption pattern is distinct from standard M&A. It suggests that voice-AI talent is being pulled into the frontier labs rather than accumulating inside independent startups, which has implications for who gets to build the next generation of foundation models.
{{FIG:f6}}
## Licensing bifurcation and three emerging sub-categories
Across the ~30 open or paper-released models, at least six non-closed license regimes are in active use: MIT, Apache 2.0, CC BY 4.0, CC BY-NC 4.0, FAIR-NC, NVIDIA Open Model License, NVIDIA OneWay Noncommercial, and several community-restrictive licenses with commercial carve-outs. This is more license diversity than the text LLM open-weights ecosystem had at a comparable stage. For enterprise buyers, that diversity is a compliance story before it is a capability story. A CC BY-NC or FAIR-NC model cannot be directly deployed in commercial production without a bespoke license; an Apache 2.0 or MIT model can.
A new pattern visible in Q1-2026 is license bifurcation inside a single lab. Alibaba's Qwen3-Omni shipped under Apache 2.0 in September 2025. Its successor Qwen3.5-Omni (March 2026) is a closed API-only preview. ByteDance has a similar split: its academic branch (SALMONN-omni) is open, while its production branch (Doubao and Seeduplex) is closed. StepFun has so far kept Step-Audio 2 at Apache 2.0 but has not confirmed the license for Step-Audio-R1.1 in its initial announcement. The open-first-base, closed-at-the-flagship pattern is now visible enough to be a planning assumption rather than a lab-specific choice.
the split, in one line
**Open-first-base, closed-at-the-flagship.** Alibaba, ByteDance, and (probably) StepFun are all running the same pattern: keep the research base permissive enough to collect ecosystem contributions, then close the commercial flagship where the margins are. A buyer choosing "Qwen3" today does not get access to the same object a buyer choosing "Qwen3.5" gets, and that divergence is the story of Q1-2026.
Three sub-categories are starting to peel off inside the four families. The first is reasoning-tuned realtime: Step-Audio-R1.1 and the Qwen3-Omni-Thinking variant both advertise "thinking while speaking," embedding reasoning trajectories in the audio-generation stream. This is a new sub-category inside Family 2. The second is translation-duplex: Kyutai Hibiki and Hibiki-Zero define an application branch of Family 1 that shares Moshi's architecture but serves simultaneous interpretation rather than conversation. Meta SeamlessStreaming sits on this axis from a different architectural base. The third is voice-cloning inside integrated STS: FlashLabs Chroma is the first open STS to ship with built-in personalized voice cloning. Previously that capability lived only in TTS-specific stacks (ElevenLabs, Voxtral TTS). Merging it into integrated STS compounds consent obligations and is part of why [Article 09](/blog/consent-licensing-opt-in)'s treatment of voice-as-biometric data matters now rather than later.
{{FIG:f7}}
## What this map means
For builders, the family choice is a commitment. A Family 1 dual-stream model trained on two-channel dyadic data is not substitutable for a Family 3 cascade trained on single-channel monologue data. The training-data shape and the architectural choice are coupled, which is why the open-vs-closed question for a specific product cannot be answered without first answering the family question.
For researchers, the families are not mutually exclusive but they are not benchmark-comparable without labeling. A 200 ms latency number from a Family 1 dual-stream model is not the same artefact as a 200 ms latency number from a Family 3 cascade. [Full-Duplex-Bench v1 through v3](/blog/benchmark-landscape) evaluates behaviours (interruption, pause, backchannel, turn-taking) without isolating architecture, and [the next generation of benchmarks](/blog/why-new-benchmarks) will need to add family labels if headline numbers are going to be comparable.
For enterprise buyers, the license tier determines the deployment surface. NVIDIA OneWay Noncommercial, CC BY-NC, and FAIR-NC block most commercial paths without a bespoke agreement. CC BY 4.0 and Apache 2.0 clear them. Closed APIs let you skip the license question but commit you to a vendor's rate card. Almost no frontier model is available on truly permissive commercial terms today; the exceptions (Moshi, Qwen3-Omni at the base tier, Step-Audio 2, Kimi-Audio, Covo-Audio, FlashLabs Chroma) are worth identifying early.
For investors, the landscape is starting to bifurcate in a way that was not visible a year ago. A small set of foundation-tier players is building across the four families with capital-intensive runs and growing data moats. A larger set of vertical-tier players is building on top of the foundation layer, capturing workflow and distribution in contact centres, healthcare, gaming, and companion apps. A middle band is getting absorbed into hyperscalers: Google DeepMind's acqui-hire of Hume and Apple's acquisition of Q.ai are the visible 2026 examples, and the pattern is likely to continue. [Article 05](/blog/foundation-before-vertical) argued that the STS foundation moment is near rather than past. The presence of DuplexCascade, Seeduplex, FlashLabs Chroma, and the licensing bifurcation together say the field is still accumulating moves, not consolidating. That is the moment before the moment; knowing which family a move belongs to is the difference between reading the field and guessing at it.
The next article ([Article 09](/blog/consent-licensing-opt-in)) turns to the data side of this landscape: who owns the conversations these models are learning from, how consent works now, and where the regulatory floor is rising.
---
**Fullduplex is building large-scale two-channel full-duplex conversational speech datasets for next-generation STS models.** If you are a frontier lab, a research team, or an enterprise buyer making family or license decisions, [get in touch](mailto:hello@fullduplex.ai). If you are an investor evaluating the voice-AI stack and want access to our data room, [reach out here](mailto:hello@fullduplex.ai).
---
_Originally published at [https://fullduplex.ai/blog/sts-model-landscape](https://fullduplex.ai/blog/sts-model-landscape)._
_Part of **The STS Series** · 08 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# v01 — Kyutai: the twelve-person Paris nonprofit turning open releases into shared vocabulary
_Canonical: https://fullduplex.ai/blog/v01-kyutai · Markdown: https://fullduplex.ai/blog/v01-kyutai/md_
---
title: "Kyutai: the twelve-person Paris nonprofit turning open releases into shared vocabulary"
description: "Research velocity converted into reputational capital. A twelve-person Paris nonprofit ships weights every ten to twelve weeks, rewriting the vocabulary the open voice-AI field thinks in."
article_number: "v01"
slug: v01-kyutai
published_at: 2026-04-22
reading_minutes: 17
tags: ["verticals", "kyutai", "open-source"]
canonical_url: https://fullduplex.ai/blog/v01-kyutai
markdown_url: https://fullduplex.ai/blog/v01-kyutai/md
series: "The Verticals"
series_position: 1
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Kyutai: the lab that gave full-duplex STS to everyone
*A Paris research lab without a product, a revenue target, or a customer list has set the public floor for real-time voice AI four times in eighteen months. This is what that looks like from inside the sector it is quietly rewriting.*
## 1. The September 2024 moment
*Bottom line: before Moshi, every full-duplex speech-to-speech model that could hold a real conversation lived behind a commercial API. After Moshi, anyone with a single GPU could run one.*
On September 17, 2024, a dozen-person French nonprofit called Kyutai posted a paper to arXiv, a set of model weights to Hugging Face, and a reference implementation to GitHub. The paper was titled [Moshi: a speech-text foundation model for real-time dialogue](https://arxiv.org/abs/2410.00037). The weights were released under CC-BY 4.0. The code was Apache 2.0. The whole package ran on a consumer-class NVIDIA L4 GPU.
What it did was unusual. Until that morning, every publicly demonstrable system that could listen and speak at the same time, handle a real interruption, and answer in under a second was wrapped in a commercial API. GPT-4o's advanced voice mode, announced four months earlier, was not yet in any user's hands. Gemini Live was still a preview. The open research literature had shown [dGSLM](https://arxiv.org/abs/2203.16502) from Meta, which produced two-speaker dialogue without content grounding, and a scatter of academic prototypes that could not be reproduced without reconstructing the training corpus. Moshi closed the gap in one release. It shipped a 7B parameter temporal transformer, a 6-layer depth transformer for intra-frame codebook prediction, a new streaming codec called [Mimi](https://huggingface.co/kyutai/mimi) at 12.5 Hz and 1.1 kbps, and the first documented "inner monologue" mechanism where the model predicts a text token before its audio for every 80 ms frame. It claimed a theoretical latency of 160 milliseconds and a practical latency of about 200 milliseconds on the L4.
That moment set the posture this profile is written against. An open lab shipped something frontier labs had not yet put in users' hands, and released it under a license that let any researcher in the world read, reproduce, and build on it.
A scope note belongs here before the profile continues. Kyutai is an open-science research lab rather than a voice-only specialist. Its public output also covers [Helium 1](https://kyutai.org/helium), an open 2B-parameter LLM released under CC-BY 4.0, [MoshiVis](https://kyutai.org/moshivis), a Moshi variant that can discuss images, and lateral work in codecs and computer vision that culminated in April 2026's OVIE novel-view-synthesis release. This profile reads the lab through the voice frontier because that is where its field gravity is most visible. The lab itself is wider than speech.
## 2. The cadence
*Bottom line: seven industry-first releases in eighteen months, from a team that fits in a large conference room, is a pace no commercial voice AI lab with ten times the staff has matched in public.*
A release every two to three months would be a respectable cadence for a university speech group. At Kyutai that interval has been the cadence of industry firsts.
[Hibiki](https://arxiv.org/abs/2502.03382) arrived in February 2025 as a 2.7B parameter simultaneous speech-to-speech translator that preserves the source speaker's voice into the target language. [Unmute](https://github.com/kyutai-labs/unmute) in May 2025 paired Kyutai's open STT and TTS with any text LLM as a modular cascade, and was open-sourced under MIT in July. [Pocket TTS](https://kyutai.org/blog/2026-01-13-pocket-tts) in January 2026 shipped voice cloning small enough to run on CPU, which makes local on-device synthesis practical on commodity hardware. [Hibiki-Zero](https://kyutai.org/blog/2026-02-12-hibiki-zero) in February 2026 rebuilt the Hibiki training pipeline with GRPO reinforcement learning and no aligned data, and picked up four new input languages in the process. [Invincible Voice](https://kyutai.org/blog/2026-02-24-invincible-voice) two weeks later turned the stack toward an assistive-communication demo for people living with ALS. OVIE in April 2026 stepped laterally into single-image novel view synthesis, a computer vision task that is not obviously speech at all.
Seven industry-first releases in eighteen months. Each one reached an architectural milestone the open field had not previously crossed. Each one shipped under a license that let other labs use it the next morning. That cadence is the first thing worth understanding about Kyutai, because every other part of the story — the economic model, the talent posture, the field-gravity argument — is downstream of it.
## 3. The endowment and the people
*Bottom line: Kyutai's posture is an economic innovation, not just a research posture. Six senior FAIR Paris scientists with a ten-year runway and no product roadmap is a shape the field had not previously tried at this scale, and the people who chose that shape are most of the reason it works.*
Before the architecture and the releases, the lab is worth describing as an organization. Kyutai launched in November 2023 with an announced budget of roughly [€300 million](https://techcrunch.com/2023/11/17/kyutai-is-an-french-ai-research-lab-with-a-330-million-budget-that-will-make-everything-open-source/), contributed by Xavier Niel's Iliad, Rodolphe Saadé's CMA CGM, Eric Schmidt's foundation, and a small ring of other donors. The structure is a nonprofit. There is no product roadmap, no revenue target, and no fundraising clock. At the announced burn rate of a lab of this size, roughly €20 to €30 million a year including compute, the endowment is a ten-to-fifteen-year runway by design. The founders did not wire up a path to commercialization because they did not want one. The whole apparatus is an answer to the question of what happens if you give serious research scientists a decade of cover and tell them to ship work that any other lab can use.
The six founding scientists all came out of Meta's FAIR Paris office, and the concentration matters. Patrick Pérez, who joined as CEO after leading Valeo.ai, sets the research agenda and handles the institutional interface with the donors. Alexandre Défossez, first author on the Moshi paper and the earlier Encodec work and a core contributor to MusicGen, anchors the audio research line and now carries the title Chief Exploration Officer. Neil Zeghidour, a senior author on AudioLM from his earlier Google Brain Paris stint and a co-founder here, led the audio research program through the first two years. Hervé Jégou brought two decades of computer vision work including the FAISS vector-search library that most embedding retrieval pipelines in the world now depend on. Edouard Grave, a co-author on the original LLaMA paper, leads the language modeling side. Laurent Mazaré runs the systems and infrastructure work that turned Moshi from a paper into something that reliably streams on a single L4. That concentration of FAIR Paris senior staff inside a twelve-person nonprofit is itself the first data point on the endowment thesis: you do not normally get this many first authors of load-bearing papers in one organization unless someone has decoupled the research from the product cycle.
Three assumptions sit underneath the posture. The first is that the endowment is the product: a decade of runway with no quarterly scorecard is what lets the team pick hard research problems instead of incremental ones. The second is that open-source is distribution: Moshi weights under CC-BY 4.0, Mimi under the same license, code under Apache 2.0, three reference runtimes (PyTorch, MLX, Rust), and the same pattern repeated across Hibiki, Unmute, and the Voice Donation Project. None of this is casual. It is the part of the operation that most resembles a go-to-market function at a commercial company, except the distribution target is other researchers and other labs. The third is that the first mover in open-source can earn durable field gravity without a product. A Moshi fine-tune from NVIDIA, a Mimi reuse inside Sesame's CSM line, a growing academic follow-on cluster that thinks in Kyutai's vocabulary: none of them pay Kyutai a license fee, and Kyutai does not ask for one. The payback is a field that measures itself against Kyutai's public artifacts.
Two refinements to the founder-snapshot above belong here. First, the lab has grown beyond the founding six into a fuller research institution as of April 2026. The [team page](https://kyutai.org/team) lists audio, language, and engineering staff plus an operations and partnerships layer, a cohort of postdocs and PhD students, and an intern program. Neil Zeghidour now carries the Audio Research Advisor title after his September 2025 departure to Gradium (covered in §6), and Hervé Jégou is listed as alumni. The launch scientific advisory board, which included Yejin Choi, Yann LeCun, and Bernhard Schölkopf, has not been publicly rescinded. The organization reads now as institutionalizing rather than still operating as the 12-person founding collective. Second, endowment and FAIR-trained talent are only two of the three legs of the posture. The third is compute access: Kyutai uses Iliad's [Scaleway](https://www.scaleway.com/) H100 cluster at cost, an arrangement that runs through Xavier Niel's stake in both entities. Endowment plus FAIR-trained talent plus European compute access at European cost is the three-part shape, and any two without the third would not have produced §2's cadence.
The model is finite, and it would be dishonest to pretend otherwise. €300 million is a ten-to-fifteen-year runway rather than an endowment in perpetuity, there is no diversified funder base, and the lab's continued existence sits on the ongoing political and financial commitment of three donors plus a small ring of anonymous backers. That said, €300 million is still larger than the ten-year budget of nearly any academic speech group in the world. Cambridge's MLMI group, CMU LTI's speech faculty, and JHU's CLSP operate at a fraction of that scale over the same horizon. Kyutai is not running out of money for a long time, and the decade-by-design horizon is the feature, not the bug, of the posture the founders chose.
## 4. The architecture Kyutai gave away
*Bottom line: Moshi's multistream dual-transformer and Mimi's 12.5 Hz codec are now the reference vocabulary for a whole family of open full-duplex systems. Kyutai did not ask for a license fee. It asked for a field.*
The central artifact is Moshi, and its shape matters because every other Kyutai release sits adjacent to it.
Moshi is two transformers stacked by purpose rather than by depth. A 32-layer, 4096-dimension Temporal Transformer with roughly 7B parameters advances one step every 80 milliseconds. At each step it ingests seventeen parallel token streams: Moshi's single text "inner monologue" token, the eight residual vector quantization codebooks for Moshi's own audio output, and the eight codebooks for the user's audio. A smaller 6-layer, 1024-dimension Depth Transformer then expands that single temporal hidden state into the eight codebooks Moshi needs to emit for the current frame, running eight inner steps across codebook index rather than across time. Splitting time from codebook dimension this way is what makes 12.5 Hz full-duplex generation tractable on a single GPU. The heavy model runs only 12.5 times per second. The light model handles the intra-frame dependencies.
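To make the temporal/depth split concrete, here is a minimal, deliberately scaled-down sketch of one generation step in PyTorch. It illustrates the factorization described above rather than reproducing Kyutai's code: the class names, toy widths, greedy sampling, and the omission of causal masking and streaming caches are all assumptions of this sketch.

```python
# A minimal sketch of the temporal/depth factorization, not Kyutai's code.
# Widths and depths are toy-sized; the real model is causal, streamed, and
# samples rather than taking argmax. All class names here are hypothetical.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB = 32_000, 2048      # per-codebook vocabulary (assumed)
N_CODEBOOKS = 8                             # RVQ levels per audio stream
D_TEMPORAL, D_DEPTH = 512, 256              # toy widths (paper: 4096 / 1024)

class TemporalStep(nn.Module):
    """Heavy model: runs once per 80 ms frame over all seventeen streams."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_TEMPORAL)
        self.audio_emb = nn.Embedding(AUDIO_VOCAB, D_TEMPORAL)
        layer = nn.TransformerEncoderLayer(D_TEMPORAL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # paper: 32

    def forward(self, text_tok, own_audio, user_audio, history):
        # Collapse 1 text + 8 own-audio + 8 user-audio tokens into one frame vector.
        frame = (self.text_emb(text_tok)
                 + self.audio_emb(own_audio).sum(dim=1)
                 + self.audio_emb(user_audio).sum(dim=1))
        seq = torch.cat([history, frame.unsqueeze(1)], dim=1)        # grow by one frame
        hidden = self.backbone(seq)[:, -1]                           # state for this frame
        return hidden, seq

class DepthStep(nn.Module):
    """Light model: expands one temporal state into the 8 codebook tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_TEMPORAL, D_DEPTH)
        self.code_emb = nn.Embedding(AUDIO_VOCAB, D_DEPTH)
        layer = nn.TransformerEncoderLayer(D_DEPTH, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # paper: 6
        self.head = nn.Linear(D_DEPTH, AUDIO_VOCAB)

    def forward(self, temporal_hidden):
        ctx, tokens = self.proj(temporal_hidden).unsqueeze(1), []
        for _ in range(N_CODEBOOKS):                  # 8 inner steps per 80 ms frame
            logits = self.head(self.backbone(ctx)[:, -1])
            tok = logits.argmax(dim=-1)               # greedy, for illustration only
            tokens.append(tok)
            ctx = torch.cat([ctx, self.code_emb(tok).unsqueeze(1)], dim=1)
        return torch.stack(tokens, dim=1)             # (batch, 8) codebook indices

# One 80 ms step: the heavy model runs once, the light model runs 8 inner steps.
temporal, depth = TemporalStep(), DepthStep()
history = torch.zeros(1, 0, D_TEMPORAL)               # empty past, for illustration
text = torch.zeros(1, dtype=torch.long)
own = torch.zeros(1, N_CODEBOOKS, dtype=torch.long)
user = torch.zeros(1, N_CODEBOOKS, dtype=torch.long)
hidden, history = temporal(text, own, user, history)
next_frame_codes = depth(hidden)                      # audio tokens for this frame
```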
Mimi, the codec, is the enabling piece. It compresses 24 kHz mono audio into tokens at 12.5 Hz with a bitrate of 1.1 kbps and an 80 ms causal frame. The first codebook is distilled against WavLM semantic features so that the first token stream carries linguistic content rather than pure acoustic residue. The headline comparison in the Moshi paper is against SpeechTokenizer at 50 Hz and 4 kbps, and SemantiCodec at 50 Hz and 1.3 kbps. At one-quarter the token rate and roughly one-quarter the bitrate, Mimi is reported to match or beat both on reconstruction quality.
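The quoted rates are worth sanity-checking, because the 1.1 kbps figure travels widely without its arithmetic. A quick check, assuming the commonly reported configuration of eight 2048-entry codebooks (the codebook size is an assumption here, not stated above):

```python
# Back-of-the-envelope check on Mimi's quoted bitrate. The 2048-entry
# codebook size is an assumption of this sketch, not a figure from the text.
import math

frame_rate_hz = 12.5                        # tokens per second per stream
n_codebooks = 8                             # residual VQ levels kept for Moshi
codebook_size = 2048                        # assumed; 2048 entries -> 11 bits/token
bits_per_token = math.log2(codebook_size)   # = 11.0

bitrate_bps = frame_rate_hz * n_codebooks * bits_per_token
print(bitrate_bps)                          # 1100.0 bps, i.e. the 1.1 kbps quoted above
```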
The family grows from there. Hibiki in February 2025 rebuilt the Moshi multistream architecture as a simultaneous French-to-English speech translator. On the paper's evaluation, Hibiki reaches an ASR-BLEU of 30.5, outperforming SeamlessStreaming and StreamSpeech on the same task. Human raters scored its naturalness at 3.73 on a 5-point scale against 4.12 for professional human interpreters. Unmute in May 2025 is the opposite architectural choice and reads as a second pillar rather than a lateral release: a modular cascade of Kyutai STT, any text LLM (Gemma-3-1B locally, GPT-OSS-120B on the kyutai.org demo), and Kyutai TTS, with latency around 450 to 750 ms. The lab's own positioning treats "Cascaded Voice AI" and "Speech Native Models" as two parallel research tracks, not one primary and one secondary, which is a different posture from most research labs with a favored architecture. Pocket TTS in January 2026 brought voice cloning down to CPU. Hibiki-Zero a month later replaced Hibiki's forced word-level alignment pipeline with GRPO reinforcement learning, which let the team add Spanish, Portuguese, German, and Italian as input languages, with Italian bootstrapped from under 1,000 hours of speech data. On the Audio-NTREX-4L long-form benchmark, Hibiki-Zero is reported as state-of-the-art across five X-to-English pairings. Invincible Voice and OVIE followed in February and April 2026, the first an assistive-communication demo for ALS and the second a lateral step into computer vision that signals the lab's scope is widening rather than narrowing.
Two pieces of supporting infrastructure round out the surface. The [Voice Donation Project](https://github.com/kyutai-labs/tts) ran from June 2025 through early 2026 and verified 228 voices out of 374 submissions for inclusion in Kyutai TTS. It is a consent-first, opt-in audio dataset on the Common Voice model, at a scale no commercial voice cloning service has publicly matched. And the Delayed Streams Modeling framework, released with Unmute, is the formal model behind Kyutai's streaming STT and TTS that lets downstream users build their own continuous listening and speaking components without reimplementing the plumbing.
Language coverage is the honest limit on this surface as of April 2026. Moshi is English only. Hibiki's output is English only. The Voice Donation Project collects voices in a small number of European languages. Japanese, Korean, Mandarin, Arabic, and Hindi are outside the public Kyutai output as of April 2026. That matters, because the plausible long-run use cases for full-duplex STS are multilingual. The encouraging signal is how fast the trajectory is moving: Hibiki-Zero in a single release expanded from French-only to French plus four European languages, with Italian bootstrapped from under 1,000 hours. If that rate of linguistic expansion holds, the next non-English full-duplex release from Kyutai is a two-to-three release problem, not a five-year problem. The GRPO pipeline Hibiki-Zero introduced is what makes that tractable without aligned training data.
One smaller technical caveat belongs with the architecture discussion too. Moshi's 200 ms latency is a model-latency measurement on an L4 GPU. End-to-end user-perceived latency adds network, audio buffering, voice activity detection, and jitter on top. The first-real-time-full-duplex title is real. The every-user-experiences-a-conversation-at-200-ms implication is not, and noting the gap matters because the number propagates downstream without the footnote.
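A rough budget makes the gap concrete. Every number below except the 200 ms model figure is a hypothetical placeholder for a WebRTC-style deployment, not a measurement:

```python
# Illustrative end-to-end latency budget. Only the 200 ms model figure comes
# from the text above; every other entry is an assumed placeholder.
budget_ms = {
    "model (L4, as reported)": 200,
    "client capture buffer": 20,            # e.g. one 20 ms audio frame (assumed)
    "uplink network + jitter buffer": 60,   # assumed
    "server-side VAD / framing": 40,        # assumed
    "downlink network + playout buffer": 60,  # assumed
}
print(sum(budget_ms.values()), "ms perceived")  # ~380 ms in this hypothetical budget
```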
The architecture story this section tells is the visible part of what Moshi shipped. §7 returns to the less-visible part: the small supply of two-channel dyadic audio the architecture quietly rests on.
## 5. The field that Moshi built
*Bottom line: Kyutai is the lab that shaped the commercial landscape rather than being shaped by it. A follow-on fine-tune from NVIDIA is a citation, not a defection.*
The field gravity is real, and it is the clearest evidence that the open-source-as-distribution thesis is working. NVIDIA's January 2026 [PersonaPlex-7B-v1](https://huggingface.co/nvidia/PersonaPlex-7B-v1) is a Moshi fine-tune. [Sesame's CSM-1B](https://huggingface.co/sesame/csm-1b) reuses Mimi in a different architectural frame. Sesame itself raised $250 million from Sequoia and Spark in October 2025 at over a billion dollars in valuation, and kept its larger variants closed. A broader academic follow-on cluster, which the STS series aggregates as "Family 1: dual-stream plus codec" in its four-family taxonomy, is dominated by descendants of the Moshi idea. If Kyutai had not released Moshi with weights in September 2024, Family 1 as a public-research category would not exist. Some version of it would likely have arrived later, from a commercial lab, behind an API. The public reproducibility, the ability to fine-tune, and the educational value for downstream researchers would have been different in kind, not just in degree.
The benchmark posture is the place where a superficial reading might find Kyutai lacking, and it is worth reading past the leaderboard to see the design choice. In the new reasoning-tuned STS sub-category, the top scores on Big Bench Audio in April 2026 belong to [Step-Audio-R1.1 at 97.0%, Gemini 3.1 Flash Live at 95.9%, and Grok Voice at 92.9%](https://artificialanalysis.ai/). Moshi does not appear in the top ten. The reason is not that Moshi cannot compete on those axes. The reason is that nobody at Kyutai is tasked with keeping Moshi's leaderboard submissions current, because the lab treats its public artifacts as the canonical statement and leaves evaluation to whoever wants to produce it. The Full-Duplex-Bench v2 paper was written by an external academic group using Kyutai's open weights. That is the point. A lab that does not gate its numbers behind a press cycle creates a field where many labs can refresh the numbers, which is closer to the open-science norm than the leaderboard-chasing alternative. Moshi's absence from the April 2026 top ten is not a quality signal. It is closer to a posture signal — and, to be fair, partly a resource-allocation result, since a twenty-person research lab does not staff a continuous leaderboard-submission function the way a commercial product team with a Big Bench Audio launch roadmap does. Both readings are consistent with the evidence. The downstream consequence matters more than which one is load-bearing: the open field caught up to and refreshed the numbers on its own.
The follow-on economics follow the same logic. When a lab ships completely open, it does not get to choose who adopts fastest; the market does. The fastest adopters of Moshi's architecture have been well-capitalized commercial labs. A reasonable nonprofit might read that as a problem, because the commercial adopters capture most of the monetary value the architecture enables. The other reading, which fits Kyutai's posture better, is that the lab that becomes the template for an entire family of commercial releases has occupied field-gravity territory most open-science efforts never reach. "Field gravity" in this sense is a scientific and architectural claim, not an economic one. Kyutai is not collecting license fees and is not winning the commercial benchmark race, and conflating the two readings would flatten the argument the lab is actually making. The practical question is not whether a PersonaPlex fine-tune is a defection. It is whether the donors find that arrangement meaningful in year eight of the endowment. On the September 2024 to April 2026 evidence, the early read is yes.
## 6. The Gradium inflection
*Bottom line: Neil Zeghidour's September 2025 departure from Kyutai to found Gradium is the first real test of the founder-talent-retention question, and the early shape of it looks complementary rather than competitive.*
On December 2, 2025, Neil Zeghidour publicly announced that his new company Gradium had exited stealth with a [$70 million seed round](https://techcrunch.com/2025/12/02/gradium-70m-seed-ultra-low-latency-voice-ai/) led by FirstMark and Eurazeo, with DST Global, Korelya Capital, and Amplify Partners participating, plus Eric Schmidt as an angel. Gradium had formed three months earlier, in September 2025. Zeghidour was a co-founder of Kyutai, senior author on AudioLM at Google Brain Paris before that, and through his first two years at Kyutai led the audio research program that produced Moshi, Mimi, and Hibiki. His new company is described as building ultra-low-latency audio language models for voice AI, a for-profit commercial pursuit of the research line he helped define inside Kyutai. Kyutai's own [homepage](https://kyutai.org/) now describes Gradium as its "first spin-off" and positions the company as a path from open research to production-ready systems.
The mechanical question is whether this is a founder-talent problem for Kyutai or something else. Four observations suggest it is something else.
**First**, Eric Schmidt is a donor to Kyutai and an angel in Gradium. If the two entities were read as competitive in the adversarial sense, one of the most diligent capital allocators in technology would not underwrite both. That does not prove the two are not competitive, but it is a data point against the simplest "founder left, now they compete for the same market" reading.
**Second**, the positioning is not symmetrical. Kyutai ships open-weights foundation research under CC-BY 4.0 with no product and no revenue. Gradium is a commercial entity pursuing productization of audio language models, presumably under closed or partially-closed distribution. These are different layers of the stack. Gradium customers are a category Kyutai does not serve. Kyutai outputs are a resource Gradium is likely to build on, directly or indirectly, along with the broader open literature. The relationship is closer to "Kyutai sets the open floor, Gradium productizes one commercial application of it" than to "Kyutai and Gradium fight for the same researcher or customer."
**Third**, the broader Kyutai research bench did not empty out with Zeghidour's exit. Alexandre Défossez, who carries the formal Chief Exploration Officer title, is still at Kyutai and has continued to lead the audio research line through Hibiki-Zero and Invincible Voice, both of which shipped after September 2025. The lab has continued its two-to-three-month release cadence across Pocket TTS, Hibiki-Zero, Invincible Voice, and OVIE since Gradium formed. The output signal does not look like a lab that lost its audio research capacity.
**Fourth**, from the donor's perspective, a founder leaving to start a for-profit company and raise $70 million on the strength of work initially done inside the lab is not obviously a failure of the endowment model. It is closer to the pattern where foundational research yields a commercial ecosystem around it, which the Bell Labs and early-DARPA eras produced at larger scale. If the endowment thesis is that serious researchers with decade-long cover will ship work that any other lab can use, one predictable consequence is that some of those researchers eventually build commercial entities on top of their own public output. That Gradium exists and raised at the scale it did is evidence the Kyutai thesis has legs, not evidence it is unraveling.
The honest caveat is that it is still early. Gradium does not yet have a public product as of April 2026. If the company ships something that looks like a direct commercial wrapper around a Kyutai architecture, the donor conversation gets more complex. If it ships something architecturally distinct that needed the $70 million to build from scratch, the complementarity reading holds. And there is a latent competitive surface worth naming: Gradium and Kyutai are both working on ultra-low-latency voice AI and the infrastructure underneath it, which on public positioning is the same technical territory rather than adjacent ones. The current complementarity comes from the research-versus-product split, not from the two organizations picking different technical problems. Either way, the Gradium launch is the first real data point on the talent-retention question for the Kyutai endowment model, and the early read is not the one the pessimistic version of this story would predict.
## 7. What comes next for the open full-duplex lineage
*Bottom line: Moshi's ability to do full-duplex traces back to roughly 2,000 hours of Fisher English Training Speech recorded in 2004. The next generation of open full-duplex models will rise or fall on whether a new two-channel data supply exists, and that question is where Kyutai and oto's work are adjacent rather than overlapping.*
The first-order takeaway from this profile is that the dual-stream-plus-codec branch of the public STS landscape is a Kyutai lineage. The four-family taxonomy in the STS model landscape article places PersonaPlex, CSM, and a cluster of academic follow-ons together as Family 1, and each of them traces back to Moshi or Mimi. Understanding Kyutai is most of the way toward understanding where open full-duplex STS came from.
The second-order takeaway is a data one, and it is the one that matters most for anyone building a next generation of open full-duplex models. Moshi's ability to do full-duplex is credited, in the paper, to a fine-tuning pass on Fisher English Training Speech, approximately 2,000 hours of two-channel telephone conversations published by the Linguistic Data Consortium in 2004 and 2005. The pretraining used roughly 7 million hours of web audio diarized into simulated two channels, and the instruction fine-tuning used roughly 20,000 hours of synthetic audio generated by a Kyutai TTS trained on 170 hours of real two-channel recordings. The specific chain matters. *The world's most famous open full-duplex model could not have been built without a twenty-year-old paid LDC dataset.* Article 04 of the STS series argues that YouTube and podcasts cannot train full-duplex STS, and Moshi is the load-bearing empirical case: even with 7 million hours of web audio, the full-duplex behavior had to come from the 2,000 hours of Fisher. The bottleneck is not compute, and it is not architecture. It is the two-channel dyadic audio supply.
That observation locates the layer of the stack that has to exist alongside the Kyutai lineage. If the two-channel dyadic data supply is the real bottleneck, the interesting question is which companies are actually tackling it. Two are visible on the public record as of April 2026: [David.ai](https://www.david-ai.com/) and [oto](https://www.oto.earth/). Both are building permissively-licensed conversational audio supply at scale, and neither is trying to train a general-purpose voice foundation model. oto's February 2026 dataset release reached #3 on Hugging Face's trending datasets, which is a demand signal the full-duplex data layer had not previously produced at that scale. A permissively licensed dataset in the 100,000-to-500,000-hour band that Article 06 identifies as the foundation threshold for STS would remove Moshi's successor family's dependency on Fisher and its derivatives. Kyutai is one of the small number of labs in a structural position to use such a dataset effectively. The same observation applies to any Kyutai-derived architecture released by NVIDIA, Sesame, Gradium, or a new academic group over the next eighteen months.
Benchmark collaboration is the other natural seam. Kyutai has contributed to the Full-Duplex-Bench family at various points. An honest multilingual live examiner that would let Moshi be evaluated in Japanese or Korean against Step-Audio-R1.1 or Gemini 3.1 Flash Live does not exist in the public literature as of April 2026. Neither does a rigorous evaluation of paralinguistic output quality at audio level rather than through transcript proxies. Both are directly in the scope Kyutai would benefit from, and both are in the scope oto is working on.
If you are building voice AI, you use Kyutai's outputs. If you are building the infrastructure underneath voice AI, you think about what Kyutai's outputs assume and where the holes are. This profile is the second kind of piece. [oto](https://www.oto.earth/) is working on benchmarks and conversational speech datasets that would let the next generation of open full-duplex work sit on a cleaner foundation than Fisher 2004. If that is the kind of problem your lab or team cares about, `hello@fullduplex.ai`.
---
*This is verticals · v01 / 16. The Verticals is a companion series to the STS Series — long-form profiles of the labs, companies, and institutions shaping the open speech-to-speech landscape.*
---
_Originally published at [https://fullduplex.ai/blog/v01-kyutai](https://fullduplex.ai/blog/v01-kyutai)._
_Part of **The Verticals** · v01 / 17 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
# 09 — Consent, licensing & the opt-in economy
_Canonical: https://fullduplex.ai/blog/consent-licensing-opt-in · Markdown: https://fullduplex.ai/blog/consent-licensing-opt-in/md_
---
title: "Consent, licensing & the opt-in economy"
description: "The consent and licensing stack for conversational voice data in April 2026 is three layers deep: a fixed biometric-privacy floor, a seven-platform patchwork middle, and a transparency ceiling partially in force and partially in draft. An opt-in voice-data economy requires all three to survive together."
article_number: "09"
slug: consent-licensing-opt-in
published_at: 2026-04-20
reading_minutes: 19
tags: ["consent", "licensing", "policy"]
canonical_url: https://fullduplex.ai/blog/consent-licensing-opt-in
markdown_url: https://fullduplex.ai/blog/consent-licensing-opt-in/md
series: "The STS Series"
series_position: 9
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Consent, licensing, and the opt-in economy for conversational data
> Two people are talking on the phone. Someone uploads the recording. A model trains on it. Months later, a synthetic voice using one of those speakers shows up in an ad. Five different consent regimes touched that recording, and at least three of them did not get a checkbox.
Conversational voice data sits at the intersection of telephone-recording law, biometric privacy, platform Terms of Service, generative-AI training rules, and emerging AI-output disclosure regimes. As of April 2026, none of these layers are converging. The biometric floor is fixed, the platform middle is a patchwork of seven mutually-incompatible defaults, and the AI-transparency ceiling is partially in force and partially still in draft. A serious attempt to build a two-channel conversational voice dataset has to name which layer each of its compliance claims rests on. Companies that conflate the layers are the companies that get fined.
This article is the map. [Article 04](/blog/data-ceiling) explained why public speech datasets cannot supply full-duplex training audio. [Article 08](/blog/sts-model-landscape) mapped the model landscape that is now hungry for that data. This article walks the consent and licensing stack across the United States, the European Union, and Japan, names where the rules are settled and where they are in motion, and ends with the specific things an opt-in voice-data economy would have to ship.
## The five meanings of "consent"
Consider the recording from the opening paragraph, in slow motion. Speaker A is on a landline in Illinois. Speaker B is on a mobile in California. The call is recorded by a third-party transcription tool that one of them runs. The audio file is uploaded to a podcast hosted on a major platform. A research group later includes the file in a dataset used to train a voice-cloning model. A consumer product built on that model later generates a synthetic voice that listeners assume is Speaker A.
That single thirty-minute clip touched at least five distinct consent regimes. First, telephone recording consent: Illinois is a two-party-consent state under the Illinois Eavesdropping Act, and California is a two-party state under Penal Code 632, so both speakers needed to have consented to the recording itself. The federal floor is one-party consent (18 USC 2511), but the stricter state law governs any speaker physically located in a two-party state.
Second, platform Terms-of-Service consent. When the file is uploaded, the platform's ToS governs whether and how that audio can be used by the platform itself or licensed onward. None of the major platforms map this consent to the speakers; they map it to the uploader.
Third, generative-AI training consent. Under [California AB-2013](https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB2013) (effective January 1, 2026), any developer that has made a generative AI system available to Californians since 2022 must publish a summary of its training datasets. Under the [EU AI Act Article 50](https://artificialintelligenceact.eu/article/50/), the *output* of that model must be marked as artificially generated, in machine-readable form, by August 2, 2026. Neither law requires speaker consent for training; both require disclosure.
Fourth, biometric capture consent. If the voice in the file is processed for speaker identification (enrollment plus match against future audio), the recording is biometric data under [Illinois BIPA](https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004), [Texas CUBI](https://www.texasattorneygeneral.gov/consumer-protection/file-consumer-complaint/consumer-privacy-rights/biometric-identifier-act), and [GDPR Article 9](https://gdpr-info.eu/art-9-gdpr/). Each requires explicit, separate consent before capture.
Fifth, likeness and identification consent. When the synthetic voice is used in a way that listeners attribute to Speaker A, US right-of-publicity statutes (with new specificity in California's [AB 1836 and AB 2602](https://www.dgslaw.com/insights/california-passes-laws-protecting-performers-from-replication-and-replacement-by-ai/)) and Japanese likeness-rights (肖像権) and defamation tort law create a fresh exposure, separate from the four upstream consents.
The temptation in 2026 is to bundle these. A single ToS checkbox that says "you consent to all use of your data by us and our partners" is the most common bundling attempt. The Italian Garante's 2025 decision against Replika (covered below) is the canonical statement that bundled consent does not survive regulator review when the purposes are this distinct.
> **Five consents, one recording.** A thirty-minute two-speaker clip can touch five distinct consent regimes at once: telephone-recording law, platform ToS, generative-AI training disclosure, biometric-capture statute, and right-of-publicity / likeness. Each operates on a different time horizon, a different enforcement authority, and a different remedy. Bundling them into one checkbox is the single most common compliance mistake in 2026.
{{FIG:f1}}
## The biometric floor — fixed rules, two recent ripples
Three statutes define the floor across the United States and the European Union.
Illinois BIPA is the 2008 Biometric Information Privacy Act, which enumerates voiceprints as a biometric identifier. Capture requires written notice, a public retention schedule, and affirmative written consent. BIPA carries a private right of action and statutory damages of $1,000 per negligent violation and $5,000 per intentional or reckless violation. Privacy World's 2025 year-in-review counted [107+ new BIPA class actions filed in 2025](https://www.privacyworld.blog/2025/12/2025-year-in-review-biometric-privacy-litigation/). Voiceprint cases are active: a Walmart warehouse voiceprint class action and *Cisneros v. Nuance Communications* (a voiceprint extracted from a call to a financial advisor) are both live, with the latter on Seventh Circuit appeal. Settlements named in the same review include Clearview AI at $51.75M and Speedway at $12.1M, neither voice-specific but both informative on enforcement intensity.
Texas CUBI plus TRAIGA is the second anchor. Texas's Capture or Use of Biometric Identifier Act enumerates voiceprints and prohibits commercial capture without informed consent. Texas does not give a private right of action; the Attorney General has exclusive enforcement, with a civil penalty of up to $25,000 per violation. On June 22, 2025, Governor Abbott signed the [Texas Responsible AI Governance Act (TRAIGA)](https://www.zwillgen.com/privacy/texas-cubi-law-and-biometric-privacy/), which clarified that CUBI applies to AI models and that public-internet presence does not constitute consent. Meta's $1.4B 2024 Texas settlement was for facial-recognition CUBI claims, not voice; it is the largest biometric settlement on record and a signal of the AG's appetite.
GDPR Article 9 is the European counterpart. Under [GDPR Article 9](https://gdpr-info.eu/art-9-gdpr/), voice data processed for the purpose of speaker identification is special-category biometric data. Processing requires explicit consent or another Article 9(2) condition (substantial public interest, vital interests, employment law). [EDPB March 2025 guidance](https://iapp.org/news/a/biometrics-in-the-eu-navigating-the-gdpr-ai-act) reaffirmed that the consent must be freely given, specific, informed, unambiguous, and explicit. The line that matters operationally: voice processed only for speech-to-text or speech-to-speech where speaker identity is not extracted is not automatically biometric. Voice processed for an enrollment and match cycle is.
The single most cited 2025 enforcement event sits on top of these three statutes. On April 10, 2025, the [Italian Garante](https://www.edpb.europa.eu/news/national-news/2025/ai-italian-supervisory-authority-fines-company-behind-chatbot-replika_en) fined Luka Inc., the operator of Replika, €5 million. The ruling found three core GDPR violations: no valid Article 6 legal basis for chatbot processing or for AI-training, no age verification despite a stated minor exclusion, and a privacy notice published only in English that referenced US COPPA. The legal finding that matters most to anyone planning a conversational voice corpus is one sentence: a single broad ToS checkbox cannot cover both *chatbot interaction* and *model training* as distinct processing activities. The Garante simultaneously opened a separate investigation into Luka's training practices, which is still open.
The counter-weight arrived eleven months later. On March 18, 2026, the Court of Rome [annulled a separate €15M Garante fine against OpenAI](https://www.wsgr.com/en/insights/openai-prevails-in-landmark-italian-ai-and-gdpr-enforcement-case.html) over ChatGPT training data, plus the ordered media campaign. This is the first significant judicial reversal of an EU GDPR AI-training enforcement action. The Replika fine was not at issue in that ruling and remains in force, but the broader signal is that ex-post lump-sum fines for AI-training practices may not survive judicial review on their merits as cleanly as European DPAs assumed.
> The biometric floor is intact. The enforcement architecture above it is being tested in court, and at least one major action just lost. Operators planning 2026-2027 compliance should treat the floor as load-bearing and the ceiling as negotiable, the opposite of the 2024 assumption.
{{FIG:f2}}
## The platform middle layer — seven defaults, no two alike
Every major platform that hosts user-generated voice or video has now picked an opt-in or opt-out posture for third-party AI training. As of April 2026, no two of them are structurally identical.
YouTube's December 16, 2024 launch made third-party AI training opt-in for creators, with a launch cohort of 18 named partners (OpenAI, Anthropic, Meta, Microsoft, Adobe, Apple, Stability AI, NVIDIA, and ten others). Trade press through 2025 reported single-digit-percent creator adoption. Google's own training of Gemini and Veo on YouTube continues under the general creator agreement, independent of the new toggle.
Reddit took the opposite path. Public Content Policy plus `robots.txt` restrictions block unknown bots; the headline licensing deals are [Google at $60M per year (February 2024)](https://techcrunch.com/2024/02/22/google-2024-data-licensing-deal-with-reddit-valued-at-60-million-per-year-says-report/) and OpenAI at roughly $70M per year (May 2024). Reddit endorsed the [RSL standard](https://rslstandard.org/rsl) at its September 2025 launch. Reddit v. Anthropic, filed in June 2025, is the closest legal test of whether ToS plus `robots.txt` are binding on a non-compliant crawler; it remains unresolved.
Spotify went further than YouTube or Reddit. The [Developer Policy effective May 15, 2025](https://developer.spotify.com/policy) prohibits developers from using the Spotify Platform or Spotify Content to train any ML or AI models, including for academic and non-commercial use. Spotify's own privacy policy reserves first-party model training for features like AI DJ and AI playlists.
Meta Ray-Ban went the opposite direction. After the policy update announced April 29, 2025, voice recordings from Meta's smart glasses are stored in Meta's cloud by default with no opt-out, retained for up to one year for AI improvement, and the "Meta AI with camera" feature is on by default. Lawsuits followed in 2025 alleging inadequate disclosure of the shift from prior opt-out availability to default-on collection.
The remaining three postures take less space to describe because they are less definitional. TikTok's ToS grants a broad content license with no explicit external-training clause as of April 2026; the 2025 Community Guidelines update requires creators to disclose AI-generated uploads, which is content provenance rather than training consent. LinkedIn's AI-training toggle (introduced fall 2024) is on by default in the United States and off by default in the EU, EEA, UK, Switzerland, and Canada, and covers only LinkedIn's own Microsoft-hosted models. Medium and Quora are [RSL](https://rslstandard.org/rsl) launch endorsers, with commercial terms set per publisher.
Three structural observations follow from the table.
First, every platform that restricts third-party AI training continues its own. Spotify, YouTube, Meta, LinkedIn all share this asymmetry. The "no third-party training" rule is consistently a "third-party" rule, not a "training" rule.
Second, RSL endorsement is expression, not enforcement. As of early October 2025 reporting, no major AI lab has publicly committed to honoring RSL tags from non-deal publishers. The legal status of RSL tags as binding on a non-compliant crawler is being tested, by proxy, in Reddit v. Anthropic.
Third, every one of these primitives is per-creator or per-account. None of them is per-speaker. For a recording that contains two speakers in a real conversation, every existing platform consent flow attaches to the uploader, not to either of the two voices in the audio. This is the structural reason that two-channel conversational voice data cannot be sourced from any of these platforms without re-doing consent at the speaker level. [Article 04](/blog/data-ceiling) made this point from the dataset side. This article makes it from the consent side.
{{FIG:f3}}
## The transparency ceiling — EU AI Act and US state laws
Layered on top of the biometric floor and the platform middle is a regime of AI-specific transparency obligations. Most are 2025 or 2026 effective dates. Almost all are disclosure-only, not consent-based.
The EU AI Act ([Regulation 2024/1689](https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)) entered into force August 1, 2024 with phased application. Article 5 prohibitions applied February 2, 2025; general-purpose AI obligations applied August 2, 2025; Article 50 transparency obligations and high-risk system obligations apply August 2, 2026. Article 50 requires that providers of AI systems intended to interact with natural persons design those systems so users are informed they are talking to AI, that providers of generative AI mark output as artificially generated in machine-readable format, and that deployers of deepfakes disclose the artificial origin. The [Code of Practice on Transparency of AI-Generated Content](https://digital-strategy.ec.europa.eu/en/policies/code-practice-ai-generated-content), in first draft as of December 17, 2025, proposes watermarking and provenance metadata; the final version is targeted for June 2026. Voice-clone outputs and voice-agent interactions both sit inside Article 50.
The narrower point about the EU AI Act that often goes underappreciated: conversational voice AI is not automatically high-risk. Conformity assessment kicks in only when the voice system is deployed inside an Annex III high-risk context (credit scoring, hiring, education assessment, law enforcement, access to essential services). Most consumer voice AI is transparency-regulated, not conformity-regulated. The conformity assessment burden, when it does apply, is estimated at six to twelve months for complex systems.
California SB-53 was signed September 29, 2025 and effective January 1, 2026. The [Transparency in Frontier Artificial Intelligence Act](https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53) applies to frontier developers training models above 10^26 FLOPs. Large frontier developers (>$500M revenue) face additional catastrophic-risk disclosure. Civil penalties run up to $1M per violation, enforced by the California Attorney General. The act has no voice-specific provisions; voice models fall in scope only by compute threshold, which today excludes most speech-to-speech models.
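A rough order-of-magnitude check shows why, using the common ≈6·N·D training-compute rule of thumb and the Moshi-scale figures quoted in the Kyutai profile earlier in this file; all inputs are approximate and the token count is a crude estimate, not an audit:

```python
# Order-of-magnitude check on the 10^26 FLOPs threshold, using the common
# ~6 * params * tokens training-compute rule of thumb. The token count is a
# crude estimate from the Moshi-scale numbers quoted earlier in this file;
# this is an illustration, not an audit of any specific model.
params = 7e9                        # ~7B-parameter temporal transformer
hours = 7e6                         # reported audio pretraining scale
tokens = hours * 3600 * 12.5 * 17   # frames/s x 17 streams per frame, ~5.4e12
flops = 6 * params * tokens         # ~2e23
print(f"{flops:.1e} FLOPs vs threshold 1e26 -> {flops / 1e26:.4f}x")
```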
California AB-2013 was signed September 28, 2024 and effective January 1, 2026. [AB-2013](https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB2013) applies to any developer of a generative AI system available to Californians since January 1, 2022. Developers must publish on their website a summary of the datasets used to train or modify the system, including dataset sources, types of data, whether copyrighted or personal information was used, and ownership or licensing. Voice AI training datasets are explicitly in scope. Enforcement runs through California's Unfair Competition Law. AB-2013 makes a voice corpus visible. It does not, of itself, require speaker consent.
Utah HB-452 became effective May 7, 2025. [HB-452](https://le.utah.gov/~2025/bills/static/HB0452.html) targets mental-health chatbots specifically. Disclosure that the user is talking to AI is required prior to access, after a seven-day gap of non-use, and on user request. Sale or sharing of individually identifiable health information gathered in such a chat is prohibited. Voice mental-health apps operating in Utah are in scope.
Colorado is the outlier. [SB24-205](https://leg.colorado.gov/bills/sb24-205) was originally effective February 1, 2026; the date was pushed to June 30, 2026 via SB 25B-004 in fall 2025. The act imposes a reasonable-care duty on developers and deployers of high-risk AI in consequential decisions, plus consumer-facing AI disclosure unless obviousness defeats the requirement. In March 2026, Governor Polis released a draft proposal substantially overhauling the Act toward a narrower disclosure-and-recordkeeping regime. The final shape is in flux.
The shared property of every law in this section is that it is disclosure-based, not consent-based. AB-2013 makes a dataset visible. SB-53 makes a frontier model's training summary public. Article 50 makes the chatbot or voice agent identifiable as AI. None of these laws requires the upstream speaker to have opted in. The biometric floor above is the only layer that can carry that load, and it covers only the identification-purpose subset of voice data.
{{FIG:f4}}
## Japan — quiet alignment, light enforcement
Japan's posture is structurally different from both the United States and the European Union. There is no enumerated voice-biometric statute analogous to BIPA. The Act on the Protection of Personal Information ([個人情報保護法 / APPI](https://www.ppc.go.jp/personalinfo/legal/)) treats voice as personal data when tied to an identifiable individual, and treats voice features (特徴量) used for speaker identification as personal data within the same framework rather than as a separately enumerated biometric category. The 2022 amendments tightened cross-border transfer rules and breach notification; the 2024 supplementary amendments added enforcement teeth around overseas business operators serving Japanese consumers.
The Personal Information Protection Commission (PPC / 個人情報保護委員会) has been more active in administrative guidance than in headline fines. The [PPC's February 2024 statement to OpenAI](https://www.ppc.go.jp/files/pdf/240202_alert_AI_utilize.pdf) addressed training-data practices and the handling of 要配慮個人情報 (special-care-required personal information). The action was administrative guidance, not a fine, and it followed the broader EU pattern of clarifying that lawful basis matters for training data.
The [METI / MIC AI Guidelines for Business](https://www.meti.go.jp/policy/mono_info_service/geniac/ai_guidelines.html) (version 1.1 published March 2025, with subsequent updates) sit one level below regulation. They are voluntary, principle-based, and explicitly designed to cross-reference GDPR, the EU AI Act, and US-state regimes rather than re-implement them. Operators interpret them as best-practice scaffolding. They are not enforceable as such.
The operative consequence for an operator collecting two-channel voice data in Japan: collection is comparatively permissive relative to the EU and to BIPA-jurisdiction US states, while downstream misuse is unforgiving on a different vector. Japanese civil tort theory (defamation 名誉毀損, likeness rights 肖像権) provides a meaningful private-law avenue against unauthorized voice cloning of a named individual, separate from APPI's administrative regime. The combination is permissive-on-collection, expensive-on-cloning. For a corpus that explicitly avoids identification of individual speakers and does not produce consumer-facing clones, the risk surface is meaningfully smaller in Japan than in Illinois or Italy.
There is a less visible problem hiding under that surface. [The benchmark-landscape map](/blog/benchmark-landscape) named the Japanese full-duplex benchmark gap. The same gap exists in consent infrastructure: there is no Japanese-language equivalent of the EDPB explicit-consent guidance specific to voice, no Japanese voice-data crowd platform that publishes a consent template, and no Japanese-language dataset card analog to AB-2013. Japan's voice AI ecosystem is operating at speed (J-Moshi, the LINE/SoftBank voice work, NTT's research lines) on top of a consent infrastructure that is not yet built out. That gap is itself an opportunity for any operator that lands a defensible Japanese-language consent flow first.
## What an opt-in economy actually requires
Take the three layers as given and work backward to the dataset. An opt-in conversational voice corpus that survives the Replika logic, the BIPA litigation curve, and the AB-2013 disclosure requirement has to clear six things. None of them are difficult in isolation. The hard part is doing them together at scale.
{{FIG:f5}}
The first requirement is separate, layered consent flows. The Garante's operative finding was that one ToS checkbox cannot cover distinct purposes. A compliant corpus needs at least four distinct affirmative consents, gathered and revocable separately: recording consent per speaker, in-house training consent, third-party redistribution consent, and speaker-identification consent where enrollment is in scope. A fifth layer (likeness reuse for synthesized output) is the right answer for any operator that intends to license cloned voices.
The second is per-speaker consent on multi-speaker recordings. The platform middle layer's structural failure is that all consent attaches to the uploader, not to the speakers. A two-channel conversational corpus must reverse that, attaching consent to each speaker as a distinct data subject, captured before the call rather than after the upload.
The third is persistent revocation architecture. Article 9 GDPR consent must be as easy to withdraw as to give; BIPA written-consent is more durable but still subject to retention-schedule limits. The compliant infrastructure is a per-speaker consent record, an audit log, and a tested deletion path that propagates to derived models where contract permits.
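A minimal sketch of what the first three requirements could look like as a data structure: a per-speaker, per-purpose consent record where revocation is an append-only event and the log doubles as the audit trail. The `ConsentPurpose` taxonomy and every field name below are hypothetical, not a reference to any existing standard or to Fullduplex's own schema.

```python
# Hypothetical per-speaker, per-purpose consent record with a revocation path.
# Field names and the purpose taxonomy are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class ConsentPurpose(Enum):
    RECORDING = "recording"                            # telephone-recording consent
    IN_HOUSE_TRAINING = "in_house_training"
    THIRD_PARTY_REDISTRIBUTION = "third_party_redistribution"
    SPEAKER_IDENTIFICATION = "speaker_identification"  # biometric layer
    LIKENESS_REUSE = "likeness_reuse"                  # synthesized-output licensing

@dataclass
class ConsentEvent:
    purpose: ConsentPurpose
    granted: bool                                      # False marks a revocation
    timestamp: datetime

@dataclass
class SpeakerConsentRecord:
    speaker_id: str                                    # one record per voice, not per uploader
    recording_id: str
    events: list[ConsentEvent] = field(default_factory=list)

    def grant(self, purpose: ConsentPurpose) -> None:
        self.events.append(ConsentEvent(purpose, True, datetime.now(timezone.utc)))

    def revoke(self, purpose: ConsentPurpose) -> None:
        # Revocation is another append-only event, so the log is also the audit trail.
        self.events.append(ConsentEvent(purpose, False, datetime.now(timezone.utc)))

    def is_active(self, purpose: ConsentPurpose) -> bool:
        # The latest event for a purpose wins; absence means no consent.
        for event in reversed(self.events):
            if event.purpose is purpose:
                return event.granted
        return False
```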
The fourth is machine-readable dataset cards. AB-2013 requires a published training-data summary. The clean answer is a per-corpus dataset card, machine-readable, that names sources, types of data, license, the consent regime each segment was collected under, and the redress mechanism. Hugging Face's dataset card schema is the closest existing template.
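And a hypothetical fragment of what the machine-readable card could carry, loosely modeled on Hugging Face dataset-card metadata plus the AB-2013 disclosure items named above; the field names are illustrative, not a published schema.

```python
# Hypothetical machine-readable dataset-card fragment for an opt-in corpus.
# Field names are illustrative; only the disclosure categories track AB-2013.
import json

dataset_card = {
    "name": "example-two-channel-dyadic-en",           # placeholder name
    "license": "CC-BY-4.0",
    "sources": ["opt-in recorded calls, two channels, per-speaker consent"],
    "data_types": ["audio (one channel per speaker)", "consent records"],
    "contains_personal_information": True,              # AB-2013 disclosure item
    "contains_copyrighted_material": False,             # AB-2013 disclosure item
    "consent_regimes": {
        "recording": "two-party consent captured pre-call",
        "training": "separate affirmative opt-in per speaker",
        "speaker_identification": "not collected",
    },
    "redress": "mailto:hello@fullduplex.ai",             # contact address from this article
}
print(json.dumps(dataset_card, indent=2, ensure_ascii=False))
```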
The fifth is standards hooks. Three standards efforts matter. [RSL](https://rslstandard.org/rsl) is the web-layer expression primitive. [C2PA](https://c2pa.org/) carries the AI-output provenance metadata that Article 50 will require. The [IETF AI Preferences working group](https://datatracker.ietf.org/wg/aipref/about/) is where machine-readable training-data preferences are being formalized. None of these is finished.
The sixth is a pricing model that names the counterparty. The 2026 opt-in conversation has shifted from "should creators be paid" (settled, yes) to "what is the unit and who is paid." For two-channel conversational voice, the unit is per recorded hour per speaker because two distinct individuals supplied the input. Compensation models that treat the operator as the only counterparty replicate the platform-middle-layer asymmetry.
The harder claim is that the opt-in economy is emergent, not won. None of the six items above is fully shipped at scale. Naming what is missing is not the same as supplying it. An honest version of the Fullduplex thesis is that someone has to build this layer for two-channel conversational data, that it is the precondition for the next-generation STS models [the model-landscape article](/blog/sts-model-landscape) catalogued, and that the company that builds it ends up holding the substrate the next decade of voice AI runs on. That is the investment thesis. It is not yet the operational reality.
## Where the next fight lands
Three predictable 2026-2027 enforcement fronts are already visible.
The first is more BIPA voiceprint litigation against voice-AI defendants. *Cisneros v. Nuance Communications* on Seventh Circuit appeal is the bellwether; the Walmart warehouse voiceprint class action is the volume case. If either produces a clean voiceprint-as-biometric ruling and a non-trivial settlement, the plaintiffs' bar will rotate toward voice-AI vendors. Voice-cloning startups, real-time voice-agent vendors, and enterprise voice-biometric authentication products are all exposed.
The second is the first round of EU AI Act Article 50 enforcement after August 2, 2026. The Commission, the AI Office, and member-state DPAs are all positioned to act. The Italian Garante's prior cadence (the Replika ban in 2023, the €5M fine in 2025) suggests Italy will be early. The March 2026 Rome court reversal of the OpenAI fine is the open question: will Article 50 enforcement survive judicial review better than ad-hoc training-data fines did? Operators should plan for early enforcement that targets clear, verifiable obligations (chatbot disclosure, output marking) rather than ambiguous ones (lawful basis for training).
The third is a follow-on enforcement against a voice-companion app from a non-Italian EU DPA, applying the Replika logic. As of April 2026 no parallel fine has been issued. The EDPB cross-notified the Garante decision in May 2025; member-state DPAs have a 2026 calendar to pick up the precedent if they choose to.
Federal US silence continues. ADPPA has not advanced. There is no federal voice-biometric statute. State law is the operative floor through at least 2027. The sub-category that compounds risk fastest is voice-cloning inside integrated speech-to-speech models (FlashLabs Chroma, covered in [the model-landscape article](/blog/sts-model-landscape), is the first open example; closed cloud cloning offerings have been live since 2023). Voice cloning amplifies every consent and licensing question this article walked through. A two-channel dataset operator that does not have a defensible position on cloning before the cloning controversy escalates is operating with a thin moat.
The longer arc is not more enforcement. The longer arc is a mature consent infrastructure that operates the way HTTPS or web cookies operate, layered, machine-readable, default-on, and boring. The closing essay of this series will take that arc further — toward what a world of full-duplex machine listeners actually does to the 100,000-year-old interface that voice has been for everyone else.
---
Fullduplex is building large-scale two-channel full-duplex conversational speech datasets for next-generation speech-to-speech AI models, with consent flows designed for the layered regime this article describes. If your research, model, or product needs conversational speech data with a defensible consent record, [contact us about dataset access](mailto:hello@fullduplex.ai). Investors evaluating the data layer of the voice AI stack can [request data room access](mailto:hello@fullduplex.ai).
---
_Originally published at [https://fullduplex.ai/blog/consent-licensing-opt-in](https://fullduplex.ai/blog/consent-licensing-opt-in)._
_Part of **The STS Series** · 09 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._