Fullduplex
the sts series · 02 / 10 · #full-duplex #threshold · 09 sections · 07 figures

The full-duplex threshold.

For ten years voice assistants have felt like walkie-talkies. One side talks, releases, the other talks, releases. In 2024 a small cluster of systems quietly stopped working that way. The shift has a name, a number, and a reason.

fig.00 · two channels, both always open — the 200 ms floor (fullduplex / synthesized)

The transaction and the conversation

Every time you have used a voice assistant you have probably noticed the same small rhythm. You say something. The device falls silent for a moment. A beep. A pause. A canned reply. Across ten years and several trillion dollars of product development, the rhythm has not really changed, because it is not a product decision. It is a structural one.

Voice assistants have been walkie-talkies. One side talks, releases, the other talks, releases. You are not having a conversation. You are filing a short request. Siri, Alexa, the airline's phone tree, the smart speaker in your kitchen. They all accept one utterance at a time, think, then reply with one utterance back. It is a transaction, not a conversation.

In 2024 a small cluster of voice systems stopped working that way. They started behaving like a telephone. You can interrupt them mid-sentence. They can hum agreement while you are still talking. They can begin speaking before you have finished. If you try one of them on a good day and then immediately go back to Siri, the older system feels broken.

That shift has a name. It is the crossing of the full-duplex threshold. This article is about what the threshold actually is, what it takes to cross it, why the timing is bounded by a number about human biology rather than engineering effort, and what conversations above the threshold unlock that transactions never could.

Walkie-talkie and telephone, formalized

The words come from telecommunications. A half-duplex channel carries signal in only one direction at a time. Push to talk. Release to listen. A full-duplex channel carries signal both ways simultaneously. Talking and listening share the wire.

Every voice interface you have used sits somewhere on that axis. An airline IVR, a push-to-talk radio, a 2013-era Siri: walkie-talkie. A telephone call, a face-to-face conversation, the best 2024–2026 speech-to-speech systems: telephone. The push-to-talk button on a 2000s Nokia and the wake-word-then-utterance pattern on a 2020 Alexa are doing the same thing, one more explicitly than the other. Both are half-duplex.

The distinction is not stylistic. Half-duplex systems force you to serialize listening and speaking, the two halves of a conversation that humans overlap constantly. Once you notice the pattern you cannot un-notice it. The cadence of a walkie-talkie voice assistant is the sound of a channel that can only carry one direction at a time.

fig.01 · half-duplex vs full-duplex
[diagram: half-duplex (walkie-talkie, push-to-talk, 2013-era Siri): one channel, one direction at a time, A and B alternate, never both at once · full-duplex (telephone, in-person, 2024–2026 STS): two channels, both directions at once, with overlap and backchannel]
Half-duplex systems (push-to-talk radio, legacy IVR, first-generation Siri) force speakers to take strict turns on a single channel. Full-duplex systems (a telephone call, an in-person conversation, 2024–2026 speech-to-speech models) carry two independent channels, so overlap, barge-in, and backchannel all become possible without breaking the exchange.

What humans actually do, measured in milliseconds

Start with a number most people have never heard. The modal gap between one person finishing a turn and the next person starting is about 200 milliseconds. Stivers et al. measured this in 2009 across 10 languages, from English and Japanese to Yélî Dnye, a language with about 4,000 speakers on a small island off Papua New Guinea. The distribution is remarkably stable. Wherever humans speak to each other, the gap clusters near 200 ms.

That number is small. It is small in a way that forces a specific conclusion about how the brain is organized.

Producing a single spoken word takes at least 600 milliseconds, measured end-to-end from intention to articulation (Levinson & Torreira 2015). Formulating a sentence takes much longer. If you waited until the other person had stopped talking to begin planning what to say, you would not be answering at 200 ms. You would be answering at somewhere between 800 and 2,000 ms. You would sound like Siri.

So humans must be doing something else. A series of EEG studies has begun to fill in the mechanism. Bögels and colleagues recorded brain activity while people answered spoken questions and found that the brain begins drafting the reply within about 500 ms of having enough information to answer. In most real conversations that moment arrives seconds before the other speaker finishes their sentence. Think of it as a kitchen during dinner rush: the listener is not waiting for the order to finish, then cooking. They are chopping, searing, plating, listening for the next order, all at once. By the time the current speaker falls silent, the reply is already half-built and waiting at the lips.

The 200 ms gap is not a politeness target. It is the shortest possible window to verify that the other person really did finish, release the staged response, and articulate. Below that window, overlap starts to happen. Heldner and Edlund 2010 measured actual gap distributions in three spontaneous speech corpora and found that overlaps account for roughly 40 percent of all transitions between speakers; the rest are gaps, with the modal cluster near 200 ms. Human conversation is not a sequence of tidy alternating turns. It is a tightly interleaved stream in which "your turn" and "my turn" are often literally true at the same time.
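The signed-offset arithmetic behind these categories is simple enough to sketch. Here is a minimal, illustrative classifier for speaker transitions; the function name and cutoffs are ours, keyed to the ~200 ms modal gap discussed above, not taken from either cited paper:

```python
def classify_transition(prev_end_ms: float, next_start_ms: float) -> str:
    """Classify a floor transfer by the signed offset between turns.

    Negative offsets mean the next speaker started before the prior
    speaker finished (overlap); positive offsets are gaps.  Cutoffs
    are illustrative, keyed to the ~200 ms modal gap in the text.
    """
    offset = next_start_ms - prev_end_ms
    if offset < 0:
        return "overlap"
    if offset <= 200:
        return "modal gap"   # near the human ~200 ms mode
    return "long gap"        # walkie-talkie territory

# A half-duplex assistant replying a full second late:
print(classify_transition(prev_end_ms=5000, next_start_ms=6000))  # long gap
```

On corpus data, a human conversation would scatter results across all three labels; a serial pipeline almost never leaves the third.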

That is the baseline any voice system is being measured against. Not an aspiration. A biological fact about how humans coordinate speech.

fig.02 · turn-gap across 10 languages
[chart: modal floor-transfer offsets on a 0–1500 ms axis, human modal ≈ 200 ms; languages: Japanese, Korean, Italian, English, Dutch, Danish, Lao, Tzeltal, ‡Ākhoe, Yélî Dnye]
Approximate modal floor-transfer offsets from Stivers et al. 2009 (PNAS), measuring the gap between one turn ending and the next beginning across 10 unrelated languages. The modal value clusters tightly near 200 ms, with a maximum around 300 ms. Positions illustrative; the paper reports distributions, not single modal points.

Four behaviors that live above the threshold

Call them the four micro-behaviors that a walkie-talkie cannot do.

The first is barge-in. You are talking to the assistant. It is halfway through its answer. You already have what you need and you cut in. A full-duplex system stops speaking mid-word and re-centers on you. A half-duplex system keeps going until it reaches the end of its utterance, then processes you.

The second is backchannel. You are telling it a story. It says "mm-hm" while you talk, not to take the floor, but to signal that it is tracking. A full-duplex system emits these short acknowledgements without taking the turn. A half-duplex system cannot emit sound while listening, so it stays silent, and the silence reads as indifference.

The third is overlap and recovery. You both start at the same moment. One yields, the other continues, the rhythm recovers within a second. In a half-duplex system only one of you has a microphone open at a time and both of you end up restarting.

The fourth is co-completion. You are searching for a word and the listener supplies it. "The place we went last year, with the, the," "the tree house?" "yes, the tree house." Humans do this constantly. Full-duplex systems are only beginning to.

The first three of these are what Full-Duplex-Bench (Lin et al. 2025) turns into measurable axes. The benchmark measures pause handling, backchanneling, turn-taking, and interruption. The trick inside is a single simple rule, applied twice with the sign flipped. The rule: the model has "taken over" the floor if it produces more than 3 words or keeps speaking for more than 1 second. That rule is scored in two different situations. When the user has clearly finished and is waiting for a reply, you want the rule to trigger (good turn-taking, higher is better). When the user has only paused mid-thought and is about to keep going, you want the rule to stay quiet (good pause handling, lower is better). Same detector, opposite expectation, depending on what the user was actually doing. One tool, two sides of the same mistake.
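As a sketch, the rule and its sign flip fit in a few lines. This is a paraphrase of the rule as described above, not the benchmark's actual evaluation code; the names and signatures are illustrative:

```python
def floor_taken(words: int, speech_ms: int) -> bool:
    """The takeover rule as described in the text: the model has taken
    the floor if it says more than 3 words or speaks for more than 1 s."""
    return words > 3 or speech_ms > 1000

def correct(user_finished: bool, words: int, speech_ms: int) -> bool:
    """Same detector, opposite expectation.  If the user finished, taking
    the floor is right (turn-taking, higher is better); if the user only
    paused, staying quiet is right (pause handling, lower is better)."""
    took = floor_taken(words, speech_ms)
    return took if user_finished else not took

# User finished, model replies with a full sentence: good turn-taking.
print(correct(user_finished=True, words=12, speech_ms=2500))   # True
# User only paused mid-thought, model stays silent: good pause handling.
print(correct(user_finished=False, words=0, speech_ms=0))      # True
# User only paused, model grabs the floor anyway: the mistake.
print(correct(user_finished=False, words=10, speech_ms=1500))  # False
```

One detector, two scoring contexts: the entire difference between the two axes is the value of `user_finished`.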

Backchanneling is scored by chopping time into 200 ms buckets and comparing how often the model says "mm-hm" in each bucket against how often a real human does in the same spot, using natural conversation recordings as the ground truth (the Full-Duplex-Bench repository ships the evaluation code). The closer the two timing patterns match, the better. Interruption, the most open-ended axis, asks a separate language model, GPT-4-turbo, to read the exchange and grade the response on a 0-to-5 scale. That last one is a tell: even a serious benchmark for full-duplex behavior cannot reduce "did the model respond reasonably when I interrupted?" to a deterministic rule. It needs a reader.
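The bucket comparison can be sketched the same way. The 200 ms bucketing follows the description above, but the similarity measure below is a deliberately crude stand-in for whatever the shipped evaluation code computes:

```python
def backchannel_histogram(onsets_ms, bucket_ms=200, horizon_ms=2000):
    """Count backchannel onsets per 200 ms bucket over a fixed horizon.
    Bucket size matches the text; the horizon is an arbitrary example."""
    counts = [0] * (horizon_ms // bucket_ms)
    for t in onsets_ms:
        if 0 <= t < horizon_ms:
            counts[int(t // bucket_ms)] += 1
    return counts

def timing_similarity(model_ms, human_ms):
    """Crude histogram-overlap score in [0, 1]: 1.0 means the model's
    mm-hms land in exactly the buckets where humans put theirs."""
    a = backchannel_histogram(model_ms)
    b = backchannel_histogram(human_ms)
    total = sum(a) + sum(b)
    if total == 0:
        return 1.0
    matched = sum(min(x, y) for x, y in zip(a, b))
    return 2 * matched / total

# Model backchannels at 150 ms and 900 ms; the human reference at
# 180 ms and 950 ms.  Same 200 ms buckets, so the score is perfect.
print(timing_similarity([150, 900], [180, 950]))  # 1.0
```

The point of the bucketing is exactly this forgiveness: a backchannel 30 ms off from the human one is the same bucket, while one 500 ms late is not.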

The fourth behavior, co-completion, is not yet in the benchmark. That is worth sitting with. Co-completion is arguably the most conversational thing humans do together. No public evaluation of voice AI in 2026 measures it directly. The benchmarks can see three of the four threshold behaviors. The fourth one is waiting.

a number worth pinning
The successor benchmark, Full-Duplex-Bench v3, extends the test set with real human disfluency and tool-use scenarios. On it the current commercial leader, OpenAI's gpt-realtime, scores Pass@1 ≈ 0.600. In plain terms: under realistic speech conditions the best-in-class commercial system still fails roughly four out of every ten tool-using conversations. That is the single highest public number on a current full-duplex eval. It is also a long way from reliable.
fig.03 · four threshold behaviors
[diagram: barge-in (user cuts model mid-sentence) · backchannel (model mm-hms while user talks) · overlap and recovery (both start, one yields, rhythm resumes) · co-completion (listener supplies the word the speaker is searching for)]
Darker blocks mark full-voiced speech, lighter blocks mark trailing speech after the overlap starts. Full-Duplex-Bench v1 operationalizes the first three as evaluation axes (barge-in as interruption, backchannel as backchanneling, overlap as pause handling plus turn-taking). Co-completion is not measured by any public 2026 benchmark.

Why walkie-talkie systems could not cross it

A 2023-era voice assistant was a pipeline. The user spoke. Speech recognition converted audio to text. A language model read the text and generated a reply. Text-to-speech spoke the reply aloud. Each stage ran in strict order. Nothing began until the previous stage finished.

That architecture has a floor. The cheapest modern streaming ASR reports first-token latency in the 100 to 500 millisecond range. A typical LLM needs 350 to 1,000 milliseconds to generate a usable first token of text response. Text-to-speech adds another 75 to 200 ms. Add glue, routing, and network jitter and you cannot drive the end-to-end latency much below one second. One second is five times the human modal turn gap. It is the cadence of a walkie-talkie.
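The floor falls out of simple addition. Using the stage ranges quoted above, plus a flat 150 ms allowance for glue, routing, and network, a back-of-envelope sum:

```python
# First-token latency budgets quoted in the text, in milliseconds.
# The glue figure is a flat illustrative allowance, not a measurement.
STAGES = {
    "asr_first_token":  (100, 500),
    "llm_first_token":  (350, 1000),
    "tts_first_audio":  (75, 200),
    "glue_and_network": (150, 150),
}

best = sum(lo for lo, _ in STAGES.values())
worst = sum(hi for _, hi in STAGES.values())
print(f"serial pipeline floor: {best}-{worst} ms")  # 675-1850 ms
print(f"multiple of the 200 ms human gap: "
      f"{best / 200:.1f}x to {worst / 200:.1f}x")
```

Even the best case is more than three times the human modal gap, and that is before any of the stages misbehaves.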

Some vendors shaved this by running the stages in tight concurrent streams. The first tokens of the LLM's output are handed to TTS while the LLM is still generating later tokens. This gets serious production cascades down into the 300 to 800 millisecond range when everything goes right. It is honest engineering and it makes voice products noticeably more responsive. It does not change the underlying architecture.
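The concurrency trick is easy to see in miniature: hand tokens to synthesis as they arrive instead of waiting for the full reply. Everything below is a toy stand-in (no real LLM or TTS is involved), but the shape is the point:

```python
import time

def llm_tokens():
    """Hypothetical streaming LLM: yields tokens one at a time."""
    for tok in ["Sure,", " the", " oven", " is", " preheated."]:
        time.sleep(0.01)   # simulated per-token generation latency
        yield tok

def tts_stream(tokens):
    """Hand each token to synthesis as it arrives, instead of
    buffering the whole sentence.  Real cascades chunk by phrase,
    not by token; this is the idea, not the practice."""
    for tok in tokens:
        yield f"<audio:{tok.strip()}>"

# First audio is available after one token's latency, not five.
first_audio = next(tts_stream(llm_tokens()))
print(first_audio)   # <audio:Sure,>
```

The pipeline is still serial per token; only the stage boundaries overlap. That is why this shaves the latency without changing the half-duplex character of the system.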

The deeper limitation is that the pipeline cannot listen while it is speaking. The microphone is either open for input or closed for output. A user who tries to interrupt the assistant is ignored until the speech synthesizer finishes the current sentence. Imagine a person with only one ear and one mouth active at a time: while the mouth is working, the ears are shut off. That is the half-duplex mind. This is not a bug in a specific product. It is a property of the architecture. The machine can only be in one state at a time, and "I am speaking" and "I am listening" are different states.
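The one-state-at-a-time property can be stated as a toy state machine. This is a caricature for illustration, not any product's architecture:

```python
from enum import Enum, auto

class State(Enum):
    LISTEN = auto()      # mic open
    TRANSCRIBE = auto()  # ASR
    THINK = auto()       # LLM
    GENERATE = auto()    # TTS
    SPEAK = auto()       # mic closed

class HalfDuplexAssistant:
    """Caricature of the half-duplex pipeline: exactly one state is
    active at a time, and input only exists in one of them."""
    def __init__(self) -> None:
        self.state = State.LISTEN

    def accepts_input(self) -> bool:
        # The microphone is open in exactly one of the five states.
        return self.state is State.LISTEN

bot = HalfDuplexAssistant()
bot.state = State.SPEAK
print(bot.accepts_input())   # False: a barge-in during Speak is dropped
```

A full-duplex system has no such predicate, because "listening" is not a state it can leave.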

There is a subtler loss in the cascade. Speech recognition throws away the melody of speech on its way in. Linguists call this prosody, the rise and fall of pitch, the stretched syllable, the micro-pauses that warn a sentence is not done yet. "Wait, I meant the green one, not the red," has a falling-then-rising pitch and a tiny pause right after "wait." Speech recognition flattens all of that into a plain string of text tokens. The language model then works only from the text. The prosodic cues a human uses to judge "is it my turn yet?", the shape of the pitch and the micro-timing of the phrase boundary, have been deleted before the model ever sees them. De Ruiter et al. 2006 showed in a Dutch button-press study that the words and grammar of a sentence alone are enough to predict where a turn will end, while pitch alone is not; it is the combination that gives precise launch timing. A system that sees only the text is working with half the evidence.

All of that means "just make it faster" was not going to work. The walkie-talkie feel was built into the shape of the architecture, not its speed.

fig.04 · the half-duplex state machine
[diagram: Listen (mic open) → Transcribe (ASR) → Think (LLM) → Generate (TTS) → Speak (mic closed); the user tries to interrupt during Speak]
Listen, Transcribe, Think, Generate, Speak, repeat. Exactly one state is active at a time. When the user tries to interrupt during the Speak state, the system cannot accept the input because the microphone is closed. Timing ranges: ASR 100–500 ms, LLM 350–1,000 ms, TTS 75–200 ms, glue 150 ms. Minimum end-to-end latency is the sum of the stages.

The systems that crossed it, and what to listen for

Three systems, three labs, one summer. In May 2024 OpenAI demonstrated GPT-4o voice, a closed multimodal model that handles voice natively, with an average response latency reported around 320 milliseconds and live on-stage interruption handling. Google's Gemini Live arrived soon after with bidirectional streaming on Vertex. Then in September the French lab Kyutai released Moshi, an open-weights speech-to-speech model that listens and speaks on the same audio frame. These are the first systems that visibly cleared the full-duplex threshold in a commercial or open-weights form.

note
Of the three, only Moshi ships with an academic paper describing the internals. GPT-4o and Gemini Live are closed commercial systems, and their architectural details are not public. The rest of this section leans on Moshi for specific numbers because it is the only one with public numbers, not because it is the only one doing the work.

Moshi is not a cascade. Think of it the way video plays on a screen, as a sequence of frames, 12.5 slices per second. At each slice the same model is simultaneously deciding two things: what the user is saying, and what it wants to say next. There is no separate "listening" phase and "speaking" phase. On a single Nvidia L4 GPU the end-to-end lag is about 200 milliseconds. The paper reports a theoretical floor of 160 milliseconds, made up of the 80 ms frame duration and an 80 ms acoustic look-ahead (the amount of future audio the model peeks at before committing to an output).
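The paper's arithmetic is worth redoing by hand. From the 12.5 Hz frame rate and the 80 ms acoustic look-ahead quoted above:

```python
# Moshi's timing, per the numbers quoted in the text.
FRAME_RATE_HZ = 12.5                    # audio frames per second
frame_ms = 1000 / FRAME_RATE_HZ         # 80 ms per frame
lookahead_ms = 80                       # acoustic look-ahead
theoretical_floor_ms = frame_ms + lookahead_ms

measured_ms = 200                       # reported on a single Nvidia L4
compute_overhead_ms = measured_ms - theoretical_floor_ms

print(f"frame: {frame_ms:.0f} ms, structural floor: "
      f"{theoretical_floor_ms:.0f} ms")
print(f"compute overhead on an L4: {compute_overhead_ms:.0f} ms")
```

The split matters for the caveat that follows: 160 ms of the 200 is architecture, the remaining 40 is hardware.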

That 200 ms figure is worth a caveat. It is a measurement on specific hardware under specific load. The 160 ms floor is structural, but the extra 40 ms that shows up in practice is compute time on an L4. On other hardware the picture changes, sometimes for the worse. Treating 200 ms as a hardware-independent property of Moshi is one of the common mistakes in the current discourse.

The second wave has already changed the shape of the landscape. Kyutai has split its product line into three systems, each above the threshold in a different sense. Moshi is the true full-duplex dialogue model. Hibiki is full-duplex in one direction only: it streams simultaneous French-to-English speech translation, keeping the rhythm of the original speaker intact. Unmute is a modular cascade that wraps any LLM with Kyutai's streaming ASR and TTS, running at 450 to 750 ms end-to-end. That does not cross the 200 ms threshold, but it is well below the latency most enterprises are used to. Three products, three different answers to "how simultaneous should the conversation be?" OpenAI and Google have made similar product-line choices between integrated and cascaded voice systems, but less transparently. Kyutai is the clearest example to point at because each tier is documented publicly.

Research labs have added several more integrated full-duplex systems in 2025. SyncLLM from NVIDIA trains a full-duplex model on a Llama-3-8B base, using 212,000 hours of synthetic two-channel dialogue plus only about 2,000 hours of real Fisher conversations. That 100-to-1 ratio of synthetic to real is itself a datapoint: it tells you how scarce real two-channel conversation data is when you actually sit down to train one of these models. OmniFlatten takes a different approach, flattening the user and model streams into a single interleaved sequence so a standard autoregressive decoder can learn the full-duplex pattern without architecture changes.

A third category sits just below the threshold. Freeze-Omni and Mini-Omni2 claim "duplex" capability but achieve it by running interrupt detection on top of a half-duplex generator. Picture a walkie-talkie with a motion sensor bolted to the front: it is still a walkie-talkie, but it notices faster when the user starts to speak and switches direction sooner. The model is still only ever in one state at a time. This works well enough to feel responsive in most interactions, and is cheaper to train. It is also not the same thing as a native dual-stream model, and the distinction becomes visible whenever the conversation requires a backchannel or a co-completion.
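The "motion sensor bolted to the front" pattern is easy to sketch: a half-duplex playback loop that polls a voice-activity probe between output chunks. Both callables below are hypothetical stand-ins, not a real API:

```python
def speak_with_barge_in(chunks, user_is_speaking):
    """Interrupt detection on top of a half-duplex generator.
    The model is still only ever in one state; it just checks the
    other direction between chunks and aborts playback sooner."""
    spoken = []
    for chunk in chunks:
        if user_is_speaking():        # the bolted-on motion sensor
            return spoken, "interrupted"
        spoken.append(chunk)          # otherwise keep speaking
    return spoken, "finished"

# The user starts talking after the second chunk of the reply:
events = iter([False, False, True, True])
out, status = speak_with_barge_in(["a", "b", "c", "d"],
                                  lambda: next(events))
print(out, status)   # ['a', 'b'] interrupted
```

Notice what the loop cannot do: emit an "mm-hm" while listening, or co-complete a phrase. The probe only ever ends the speaking state early; it never lets two states run at once.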

The honest test is the one from § 4. Try barge-in. Try a backchannel. Try starting at the same time. A system that handles all three without breaking rhythm is above the threshold. A system that handles only barge-in is using an interrupt detector.

fig.05 · systems above the line
[cards, each with a "listen for" cue: Moshi (Kyutai, Sept 2024): talk over it mid-sentence and see it yield within a few hundred ms · GPT-4o voice (OpenAI, May 2024): natural pacing and tone shifts that track your emotional tone · Gemini Live (Google, 2024–2026): continuous streaming in both directions with vision input · Hibiki (Kyutai, Feb 2025): simultaneous French-to-English translation with the original rhythm · SyncLLM (NVIDIA, 2024): full-duplex atop Llama-3, trained on 100:1 synthetic data · OmniFlatten (Alibaba, Oct 2024): full-duplex via flattened interleaved streams on a vanilla decoder]
Six systems above or adjacent to the full-duplex threshold. Top row: the 2024 commercial or research entrants that first crossed it. Bottom row: the 2024–2025 wave, including Kyutai's translation-specific Hibiki and NVIDIA's SyncLLM. Freeze-Omni and Mini-Omni2 are not shown; they sit in a separate “partial” category (interrupt detection over a half-duplex generator).

Where crossing the threshold changes what voice is useful for

Below the threshold, voice is good for transactions. Above it, voice becomes viable for conversation. Those are different markets.

The transactional applications were always going to work on a walkie-talkie. "Set a timer for ten minutes." "What is the weather." "Cancel my 3 pm." These are slot-fills. You have one thing to say, the assistant has one thing to say back. A 1,000 ms gap is fine. In fact, for commands, a deliberate pause is often reassuring: it tells you the system registered your intent.

The conversational applications were not. Any use case where the user speaks more than a sentence at a time, or where the system needs to react mid-utterance, or where pauses carry meaning, broke on half-duplex. Some of those use cases are now plausible products for the first time.

  1. Hands busy, eyes busy. A driver, a surgeon, a factory worker, a cook with flour on both hands. A system that can be interrupted when the context shifts, and can hum acknowledgement without forcing a formal turn, is usable in environments where a transactional assistant was a liability.
  2. Accessibility. A voice interface that cannot be interrupted is a worse interface for a user with motor or vision impairment, not a better one. The WHO estimates 2.2 billion people with some form of vision impairment. A meaningful slice of them depends on voice interfaces. Full-duplex is the difference between a tool that respects their pace and one that forces them into a rigid listen-and-wait ritual.
  3. Language learning. A practice partner that corrects you mid-sentence, the way a good human tutor does, not one that waits politely for you to finish your wrong sentence and then restart. The backchannel behavior matters here too: a learner knows they are being heard, not processed.
  4. Companionship and emotional conversation. Journaling, therapy-adjacent reflection, light conversation for lonely hours. Pauses carry meaning in these contexts. A system that cannot respect a hesitant pause, or cannot say "mm-hm" while the user is working something out, reads as clinical even when its text content is good. This is one of the places where we at oto believe voice genuinely carries feelings and not only tasks, and a walkie-talkie rhythm undercuts that directly.
  5. Children and elders. Groups for whom typing is the wrong affordance, and for whom a robotic turn-based voice feels even more alien than it does to working-age adults. A grandchild who has just learned to talk does not naturally understand the concept of a beep.

None of these is a small market. Taken together, they are the difference between voice being a convenience layer on your phone and voice being a primary interface for a large share of software.

fig.06 · what it unlocks
[cards: hands busy (driving, surgery, cooking, factory) · accessibility (vision and motor impairment users) · language (correction in the middle of the phrase) · companion (journaling, emotional reflection, company) · kids and elders (wrong affordance for typing) · turn-aware agent (voice agents that can be interrupted)]
Use categories that become plausible products once voice is above the full-duplex threshold. Top row: productivity and accessibility contexts. Bottom row: emotional, generational, and agentic use cases where walkie-talkie rhythm was the structural blocker.

What full-duplex does not fix

Crossing the threshold is not the same as finishing the job. An honest reader should close this article knowing what is still broken.

The most important limitation is that having the capability in the model is not the same as having good behavior. A system that technically can listen and speak at the same time still has to decide when to jump in and when to hold back. That judgment is learned from data, and the amount of real multi-channel conversation data available for that kind of training is small. The CANDOR corpus, a 1,656-pair naturalistic two-channel collection, is the single best publicly documented two-channel resource at scale. It ships under CC BY-NC, which means academic labs can use it but commercial systems cannot train on it. The next tier is SyncLLM's 2,000 real Fisher hours and its 212,000 synthetic hours. Synthetic data is not nothing, but it is not the same thing either.

Second, capability is not reliability. In April 2026 two new benchmarks landed that let you calibrate how much of the work is actually done. τ-Voice, from Sierra, ran 278 voice-agent tasks and found that voice agents retain only 30 to 45 percent of the equivalent text agent's task score. In plain terms: if you take the same underlying model and give it a voice instead of a keyboard, it loses most of its competence. Full-Duplex-Bench v3, as noted above, has gpt-realtime at around 0.600 Pass@1 on tool use under real human disfluency. Voice AI, in short, is not yet voice-first AI.

Third, integrated systems have their own ceilings. Moshi's context window — the amount of conversation the model can hold in its working memory at any moment — is roughly four minutes of audio. That is a property of the architecture, not a knob the developer can turn up. Four minutes is long enough for a quick exchange and short enough that a therapy session, a tutoring hour, or a long customer support call will run out of memory partway through. Long-horizon voice conversation is an open problem that full-duplex alone does not touch.
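The ceiling is easy to put in frame terms. Using the 12.5 Hz frame rate quoted earlier (the exact token budget per frame varies by configuration, so this is indicative only):

```python
FRAME_RATE_HZ = 12.5        # Moshi's audio frame rate, per the paper
window_s = 4 * 60           # ~4 minutes of context, per the text
frames_in_window = int(FRAME_RATE_HZ * window_s)

session_s = 60 * 60         # a one-hour tutoring or support call
frames_needed = int(FRAME_RATE_HZ * session_s)

print(frames_in_window)                  # 3000 frames in the window
print(frames_needed / frames_in_window)  # 15.0x over budget
```

A one-hour session needs fifteen times the context the architecture can hold, which is why long-horizon memory is listed below as an open problem rather than a tuning exercise.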

Fourth, the quality of the threshold crossing depends on the language. Every benchmark and evaluation named in this article is dominated by English, with some coverage of Chinese and French. Stivers 2009 covered 10 languages and found the ~200 ms gap to be stable across all of them, but the systems are not stable across all of them. A Moshi trained on English conversation data does not automatically generalize to Japanese turn-taking, which has somewhat different rhythms around how speakers hand the floor back and forth than Indo-European languages. Consumer voice assistants in 2026 commonly ship in 100+ locales. Full-duplex voice AI does not yet have the training data to do that.

Fifth, production latency is usually worse than demo latency. Network jitter, load balancing, cold starts, and shared inference clusters push even an integrated model into the 300 to 500 ms range on a busy day. Above the threshold on a best-case test is not the same thing as above the threshold in a user's kitchen.

what this adds up to
None of this is a reason to discount the shift. The threshold is crossed. The walkie-talkie rhythm is no longer a structural necessity. But the systems that crossed it are brittle in ways the demos do not surface.
fig.07 · scorecard, april 2026
[scorecard: works above the threshold: simultaneous listen + speak, barge-in, backchannel, short overlap recovery, human-adjacent turn gaps · does not yet work: co-completion at scale, long-horizon memory, non-English turn-taking, tool-use reliability, production latency drift]
Scorecard for full-duplex voice AI as of April 2026. Left column: conversational behaviors the architecture now unlocks. Right column: what a full-duplex model alone does not fix, calibrated against τ-Voice 2026 (30–45% task-score retention) and Full-Duplex-Bench v3 (0.600 Pass@1 for gpt-realtime on tool use with human disfluency).

What it feels like

Come back to the opening scene. The kitchen, the smart speaker, the beep. That is what below the threshold sounds like. Command. Pause. Reply. Command. Pause. Reply.

Now imagine the same device, but you can start your second sentence before the first reply has finished, and it just stops and recenters on you. You pause in the middle of an instruction, and it says "mm-hm" quietly, and waits. You ask it a complicated question, and it answers at your pace, not its own. You forget that it is a machine for a few seconds, because the small rhythms of the conversation work. The device has not become smarter. It has become present.

That is what the full-duplex threshold actually is. Not faster, exactly. Not smarter. Closer to the shape of human conversation. Once you have had it, the walkie-talkie feels like a regression.

The threshold is the easy part, in retrospect. What comes next is harder. A system that crosses the threshold still has to have something worth saying, has to remember you from one conversation to the next, has to work in the language you grew up speaking, has to respect when you do not want to be interrupted. The architecture has gotten out of the way. Everything else now has to be earned.

The engineering work is real. The product space is just beginning. The series continues with a deeper look at what actually has to be true about training data for systems above the threshold to keep improving.

■ ■ ■
Fullduplex
2026

We index the STS / full-duplex / audio foundation-model landscape so you don't have to.

Benchmarks, models, datasets — curated and kept current. If you want the next piece when it goes up, or want to help keep the catalogs honest, the community is the place to be. Series updates also go out on the oto newsletter.

#full-duplex #sts-series #latency #turn-taking #moshi · filed under: the latent · sts 02