---
title: "The full-duplex threshold"
description: "A number, a biology fact, and a small cluster of systems. What the full-duplex threshold actually is, what it takes to cross it, and what conversations above it unlock."
article_number: "02"
slug: full-duplex-threshold
published_at: 2026-04-09
reading_minutes: 13
tags: ["full-duplex", "latency", "conversation"]
canonical_url: https://fullduplex.ai/blog/full-duplex-threshold
markdown_url: https://fullduplex.ai/blog/full-duplex-threshold/md
series: "The STS Series"
series_position: 2
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# The full-duplex threshold: when AI stops sounding like a walkie-talkie

## 1. The transaction and the conversation

Every time you have used a voice assistant you have probably noticed the same small rhythm. You say something. The device falls silent for a moment. A beep. A pause. A canned reply. Across ten years and many billions of dollars of product development, the rhythm has not really changed, because it is not a product decision. It is a structural one.

Voice assistants have been walkie-talkies. One side talks, releases, the other talks, releases. You are not having a conversation. You are filing a short request. Siri, Alexa, the airline's phone tree, the smart speaker in your kitchen. They all accept one utterance at a time, think, then reply with one utterance back. It is a transaction, not a conversation.

In 2024 a small cluster of voice systems stopped working that way. They started behaving like a telephone. You can interrupt them mid-sentence. They can hum agreement while you are still talking. They can begin speaking before you have finished. If you try one of them on a good day and then immediately go back to Siri, the older system feels broken.

That shift has a name. It is the crossing of the full-duplex threshold. This article is about what the threshold actually is, what it takes to cross it, why the timing is bounded by a number about human biology rather than engineering effort, and what conversations above the threshold unlock that transactions never could.

## 2. Walkie-talkie and telephone, formalized

The words come from telecommunications. A [half-duplex](https://en.wikipedia.org/wiki/Duplex_(telecommunications)) channel carries signal in only one direction at a time. Push to talk. Release to listen. A [full-duplex](https://en.wikipedia.org/wiki/Duplex_(telecommunications)#Full_duplex) channel carries signal both ways simultaneously. Talking and listening share the wire.

Every voice interface you have used sits somewhere on that axis. An airline IVR, a push-to-talk radio, a 2013-era Siri: walkie-talkie. A telephone call, a face-to-face conversation, the best 2024-2026 speech-to-speech systems: telephone. The push-to-talk button on a 2000s Nokia and the wake-word-then-utterance pattern on a 2020 Alexa are doing the same thing, one more explicitly than the other. Both are half-duplex.

The distinction is not stylistic. Half-duplex systems force you to serialize listening and speaking, the two halves of a conversation that humans overlap constantly. Once you notice the pattern you cannot un-notice it. The cadence of a walkie-talkie voice assistant is the sound of a channel that can only carry one direction at a time.

*Figure F1: two small side-by-side diagrams. On the left, a walkie-talkie with a push-to-talk button and a dashed line showing one-direction-at-a-time transmission. On the right, a telephone with two overlapping waveforms, labeled both-at-once. Inside-SVG text limited to the labels "one at a time" and "both at once." Detailed caption with historical attribution in the figcaption.*

## 3. What humans actually do, measured in milliseconds

Start with a number most people have never heard. The modal gap between one person finishing a turn and the next person starting is about 200 milliseconds. Stivers et al. [measured this in 2009](https://www.pnas.org/doi/10.1073/pnas.0903616106) across 10 languages, from English and Japanese to Yélî Dnye, a language with about 4,000 speakers on a small island off Papua New Guinea. The distribution is remarkably stable. Wherever humans speak to each other, the gap clusters near 200 ms.

That number is small. It is small in a way that forces a specific conclusion about how the brain is organized.

Producing a single spoken word takes at least 600 milliseconds, measured end-to-end from intention to articulation [[Levinson & Torreira 2015](https://doi.org/10.3389/fpsyg.2015.00731)]. Formulating a sentence takes much longer. If you waited until the other person had stopped talking to begin planning what to say, you would not be answering at 200 ms. You would be answering at somewhere between 800 and 2,000 ms. You would sound like Siri.

So humans must be doing something else. A series of EEG studies has begun to fill in the mechanism. Bögels and colleagues [recorded brain activity](https://www.nature.com/articles/srep12881) while people answered spoken questions and found that the brain begins drafting the reply within about 500 ms of having enough information to answer. In most real conversations that moment arrives seconds before the other speaker finishes their sentence. Think of it as a kitchen during dinner rush: the listener is not waiting for the order to finish, then cooking. They are chopping, searing, plating, listening for the next order, all at once. By the time the current speaker falls silent, the reply is already half-built and waiting at the lips.

The 200 ms gap is not a politeness target. It is the shortest possible window to verify that the other person really did finish, release the staged response, and articulate. Below that window, overlap starts to happen. [Heldner and Edlund 2010](https://doi.org/10.1016/j.wocn.2010.08.002) measured actual gap distributions in three spontaneous speech corpora and found that overlaps account for roughly 40 percent of all transitions between speakers. Roughly another 40 percent are gaps longer than 200 ms, and the remainder are shorter gaps, with the modal value sitting near 200 ms. Human conversation is not a sequence of tidy alternating turns. It is a tightly interleaved stream in which "your turn" and "my turn" are often literally true at the same time.

That is the baseline any voice system is being measured against. Not an aspiration. A biological fact about how humans coordinate speech.

*Figure F2: horizontal strip chart, 0 to 1500 ms on the x-axis. Ten language bars clustering around 200 ms. One filled dot at 200 ms labelled "human modal." Inside-SVG text limited to axis label and language labels. Numeric precision and source in figcaption.*

## 4. Four behaviors that live above the threshold

Call them the four micro-behaviors that a walkie-talkie cannot do.

The first is **barge-in**. You are talking to the assistant. It is halfway through its answer. You already have what you need and you cut in. A full-duplex system stops speaking mid-word and re-centers on you. A half-duplex system keeps going until it reaches the end of its utterance, then processes you.

The second is **backchannel**. You are telling it a story. It says "mm-hm" while you talk, not to take the floor, but to signal that it is tracking. A full-duplex system emits these short acknowledgements without taking the turn. A half-duplex system cannot emit sound while listening, so it stays silent, and the silence reads as indifference.

The third is **overlap and recovery**. You both start at the same moment. One yields, the other continues, the rhythm recovers within a second. In a half-duplex system only one of you has a microphone open at a time and both of you end up restarting.

The fourth is **co-completion**. You are searching for a word and the listener supplies it. "The place we went last year, with the, the," "the tree house?" "yes, the tree house." Humans do this constantly. Full-duplex systems are only beginning to.

The first three of these are what [Full-Duplex-Bench](https://arxiv.org/abs/2503.04721) (Lin et al. 2025) turns into measurable axes. The benchmark measures pause handling, backchanneling, turn-taking, and interruption. The trick inside is a single simple rule, applied twice with the sign flipped. The rule: the model has "taken over" the floor if it produces more than 3 words or keeps speaking for more than 1 second. That rule is scored in two different situations. When the user has clearly finished and is waiting for a reply, you want the rule to trigger (good turn-taking, higher is better). When the user has only paused mid-thought and is about to keep going, you want the rule to stay quiet (good pause handling, lower is better). Same detector, opposite expectation, depending on what the user was actually doing. One tool, two sides of the same mistake.
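To make the rule concrete, here is a minimal sketch of that one detector applied in both directions. The class and function names are illustrative, not the benchmark's actual API; only the "more than 3 words or more than 1 second" rule comes from the paper.

```python
from dataclasses import dataclass

@dataclass
class ModelResponse:
    words: int              # words the model produced after the user stopped
    speech_seconds: float   # how long the model kept speaking

def took_the_floor(resp: ModelResponse) -> bool:
    """The single detector: the model has taken over if it produced
    more than 3 words or kept speaking for more than 1 second."""
    return resp.words > 3 or resp.speech_seconds > 1.0

def is_correct_behavior(resp: ModelResponse, user_was_finished: bool) -> bool:
    """Same detector, opposite expectation.

    user_was_finished=True  -> taking the floor is what you want (turn-taking).
    user_was_finished=False -> the user only paused; taking the floor is the
                               failure the pause-handling axis penalizes.
    """
    took_over = took_the_floor(resp)
    return took_over if user_was_finished else not took_over
```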

Backchanneling is scored by chopping time into 200 ms buckets and comparing how often the model says "mm-hm" in each bucket against how often a real human does in the same spot, using natural conversation recordings as the ground truth (the [Full-Duplex-Bench repository](https://github.com/DanielLin94144/Full-Duplex-Bench) ships the evaluation code). The closer the two timing patterns match, the better. Interruption, the most open-ended axis, asks a separate language model, GPT-4-turbo, to read the exchange and grade the response on a 0-to-5 scale. That last one is a tell: even a serious benchmark for full-duplex behavior cannot reduce "did the model respond reasonably when I interrupted?" to a deterministic rule. It needs a reader.
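The bucket comparison is easy to picture in code. The sketch below assumes a simple histogram-overlap score to show the shape of the computation; the benchmark's exact distance metric may differ, and the function names are illustrative.

```python
import numpy as np

def backchannel_histogram(onset_times_s, duration_s, bucket_ms=200):
    """Count backchannel onsets ("mm-hm" starts) per 200 ms bucket."""
    edges = np.arange(0.0, duration_s + bucket_ms / 1000, bucket_ms / 1000)
    counts, _ = np.histogram(onset_times_s, bins=edges)
    return counts.astype(float)

def timing_similarity(model_onsets, human_onsets, duration_s):
    """Overlap between the model's and a real human's backchannel timing.
    Returns 1.0 when the two normalized timing profiles match exactly."""
    m = backchannel_histogram(model_onsets, duration_s)
    h = backchannel_histogram(human_onsets, duration_s)
    if m.sum() == 0 or h.sum() == 0:
        return 0.0
    return float(np.minimum(m / m.sum(), h / h.sum()).sum())
```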

The fourth behavior, co-completion, is not yet in the benchmark. That is worth sitting with. Co-completion is arguably the most conversational thing humans do together. No public evaluation of voice AI in 2026 measures it directly. The benchmarks can see three of the four threshold behaviors. The fourth one is waiting.

The successor benchmark, [Full-Duplex-Bench v3](https://arxiv.org/abs/2604.04847), pushes further. It extends the test set with real human disfluency (the "um"s, false starts, and restarts that real speakers produce) and tool-use scenarios (the agent has to actually look something up mid-conversation). On that benchmark the current commercial leader, OpenAI's [gpt-realtime](https://platform.openai.com/docs/guides/realtime), scores Pass@1 around 0.600. In plain terms: under realistic speech conditions the best-in-class commercial system still fails roughly four out of every ten tool-using conversations. That is the single highest public number on a current full-duplex eval. It is also a long way from reliable.

*Figure F3: four compact timeline snippets arranged in a 2x2 grid. Two horizontal tracks per snippet, one for each speaker, with audio blocks colored differently. No numbers. Three-word labels: "barge in," "backchannel," "overlap," "co-completion." The shape of each interaction should be readable at a glance.*

## 5. Why walkie-talkie systems could not cross the threshold

A 2023-era voice assistant was a pipeline. The user spoke. Speech recognition converted audio to text. A language model read the text and generated a reply. Text-to-speech spoke the reply aloud. Each stage ran in strict order. Nothing began until the previous stage finished.

That architecture has a floor. The cheapest modern streaming ASR reports first-token latency in the [100 to 500 millisecond range](https://openai.com/index/introducing-our-next-generation-audio-models/). A typical LLM needs 350 to 1,000 milliseconds to generate a usable first token of text response. Text-to-speech adds another 75 to 200 ms. Add glue, routing, and network jitter and you cannot drive the end-to-end latency much below one second. One second is five times the human modal turn gap. It is the cadence of a walkie-talkie.
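The arithmetic is worth doing explicitly. The glue-and-network figure below is an assumption for illustration; the other ranges are the ones quoted above.

```python
# Strictly serial cascade: each stage waits for the previous one.
# All values in milliseconds.
asr_first_token  = (100, 500)    # streaming speech recognition
llm_first_token  = (350, 1000)   # language model, first usable token
tts_first_audio  = (75, 200)     # text-to-speech, first audio chunk
glue_and_network = (50, 300)     # assumed: routing, buffering, jitter

stages = [asr_first_token, llm_first_token, tts_first_audio, glue_and_network]
low, high = sum(s[0] for s in stages), sum(s[1] for s in stages)
print(f"serial cascade: {low} to {high} ms")  # ~575 to 2000 ms, against a ~200 ms human gap
```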

Some vendors shaved this by running the stages in tight concurrent streams. The first tokens of the LLM's output are handed to TTS while the LLM is still generating later tokens. This gets serious production cascades down into the 300 to 800 millisecond range when everything goes right. It is honest engineering and it makes voice products noticeably more responsive. It does not change the underlying architecture.
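The streaming version changes what the user actually waits for: not the sum of full stage times, but the first usable chunk from each stage, because later tokens are generated while earlier audio is already playing. The single-point numbers below are illustrative, not vendor measurements.

```python
# Stages stream into each other; time-to-first-audio is roughly the sum of
# each stage's first-chunk latency, not its full runtime (illustrative values, ms).
asr_endpoint_detect = 200   # deciding the user has actually finished
llm_first_tokens    = 350   # enough text for TTS to start speaking
tts_first_chunk     = 100   # first synthesized audio frame out

time_to_first_audio = asr_endpoint_detect + llm_first_tokens + tts_first_chunk
print(time_to_first_audio)  # ~650 ms: noticeably better, still about 3x the human modal gap
```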

The deeper limitation is that the pipeline cannot listen while it is speaking. The channel is either open for the user's input or occupied by the system's output, never both. A user who tries to interrupt the assistant is ignored until the speech synthesizer finishes the current sentence. Imagine a person with only one ear and one mouth active at a time: while the mouth is working, the ears are shut off. That is the half-duplex mind. This is not a bug in a specific product. It is a property of the architecture. The machine can only be in one state at a time, and "I am speaking" and "I am listening" are different states.

There is a subtler loss in the cascade. Speech recognition throws away the melody of speech on its way in. Linguists call this *prosody*, the rise and fall of pitch, the stretched syllable, the micro-pauses that warn a sentence is not done yet. "Wait, I meant the green one, not the red," has a falling-then-rising pitch and a tiny pause right after "wait." Speech recognition flattens all of that into a plain string of text tokens. The language model then works only from the text. The single strongest cue a human uses to judge "is it my turn yet?" — the shape of the pitch and the micro-timing of the phrase boundary — has been deleted before the model ever sees it. [De Ruiter et al. 2006](https://doi.org/10.1353/lan.2006.0130) showed in a Dutch button-press study that the words and grammar of a sentence alone are enough to predict where a turn will end. Pitch alone is not. Both cues matter together for precise launch timing. A system that sees only the text is working with half the evidence.

All of that means "just make it faster" was not going to work. The walkie-talkie feel was built into the shape of the architecture, not its speed.

*Figure F4: half-duplex state machine. Five states in a circular layout: Listen, Transcribe, Think, Generate Voice, Speak. Clean arrows between them. One dashed red arrow labeled "user interrupts here" pointing at the Speak state and bouncing off. No numbers inside the SVG. Timing ranges live in the figcaption.*

## 6. The systems that crossed it, and what to listen for

Three systems, three labs, one summer. In May 2024 OpenAI demonstrated [GPT-4o voice](https://openai.com/index/hello-gpt-4o/), a closed multimodal model that handles voice natively, with an average response latency reported around 320 milliseconds and live on-stage interruption handling. Google's [Gemini Live](https://deepmind.google/technologies/gemini/) arrived soon after with bidirectional streaming on Vertex. Then in September the French lab Kyutai released [Moshi](https://moshi.chat), an open-weights speech-to-speech model that listens and speaks on the same audio frame. These are the first systems that visibly cleared the full-duplex threshold in a commercial or open-weights form.

A note on what is citable from here on. Of the three, only Moshi ships with an academic paper describing the internals. GPT-4o and Gemini Live are closed commercial systems, and their architectural details are not public. The rest of this section leans on Moshi for specific numbers because it is the only one with public numbers, not because it is the only one doing the work.

Moshi is not a cascade. Think of it the way video plays on a screen, as a sequence of frames, 12.5 slices per second. At each slice the same model is simultaneously deciding two things: what the user is saying, and what it wants to say next. There is no separate "listening" phase and "speaking" phase. On a single Nvidia L4 GPU the end-to-end lag is about 200 milliseconds. The paper reports a theoretical floor of 160 milliseconds, made up of the 80 ms frame duration and an 80 ms acoustic look-ahead (the amount of future audio the model peeks at before committing to an output).
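A cartoon of that loop makes the structural difference from a cascade visible. This is a sketch of the idea, not Kyutai's implementation; `model`, `mic`, and `speaker` are hypothetical objects standing in for the real components.

```python
FRAME_MS = 80        # 12.5 frames per second
LOOKAHEAD_MS = 80    # acoustic look-ahead before committing to output

def conversation_loop(model, mic, speaker):
    """Frame-synchronous full-duplex loop: one shared model advances both
    streams every frame. There is no listen mode or speak mode to switch."""
    state = model.initial_state()
    while True:
        user_frame = mic.read(FRAME_MS)              # always listening
        state, model_frame = model.step(state, user_frame)
        speaker.play(model_frame)                    # always speaking; "silence"
                                                     # is just a silent frame
```

The structural floor falls out of the loop: a reply can never surface sooner than one frame plus the look-ahead, 80 + 80 = 160 ms, and everything beyond that is compute.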

That 200 ms figure is worth a caveat. It is a measurement on specific hardware under specific load. The 160 ms floor is structural, but the extra 40 ms that shows up in practice is compute time on an L4. On other hardware the picture changes, sometimes for the worse. Treating 200 ms as a hardware-independent property of Moshi is one of the common mistakes in the current discourse.

The second wave has already changed the shape of the landscape. Kyutai has split its product line into three systems, each above the threshold in a different sense. [Moshi](https://github.com/kyutai-labs/moshi) is the true full-duplex dialogue model. [Hibiki](https://github.com/kyutai-labs/hibiki) is full-duplex in one direction only: it streams simultaneous French-to-English speech translation, keeping the rhythm of the original speaker intact. [Unmute](https://kyutai.org/unmute) is a modular cascade that wraps any LLM with Kyutai's streaming ASR and TTS, running at 450 to 750 ms end-to-end. That does not clear the 200 ms threshold, but it is well below the latency most enterprises are used to. Three products, three different answers to "how simultaneous should the conversation be." OpenAI and Google have made similar product-line choices between integrated and cascaded voice systems, but less transparently. Kyutai is the clearest example to point at because each tier is documented publicly.

Research labs have added several more integrated full-duplex systems in 2025. [SyncLLM](https://arxiv.org/abs/2409.15594) trains a full-duplex model on a Llama-3-8B base, using 212,000 hours of synthetic two-channel dialogue plus only about 2,000 hours of real Fisher conversations. That 100-to-1 ratio of synthetic to real is itself a datapoint: it tells you how scarce real two-channel conversation data is when you actually sit down to train one of these models. [OmniFlatten](https://arxiv.org/abs/2410.17799) takes a different approach, flattening the user and model streams into a single interleaved sequence so a standard autoregressive decoder can learn the full-duplex pattern without architecture changes.
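The flattening idea is simple enough to show with a toy. Real systems interleave audio-codec tokens in fixed time chunks rather than words, so treat this as an illustration of the sequence shape, not the paper's tokenization.

```python
def flatten_streams(user_tokens, model_tokens, chunk=5):
    """Interleave two synchronized streams into one sequence so a standard
    autoregressive decoder can model both sides, chunk by chunk."""
    flat = []
    for i in range(0, max(len(user_tokens), len(model_tokens)), chunk):
        flat += ["<user>"] + user_tokens[i:i + chunk]
        flat += ["<model>"] + model_tokens[i:i + chunk]
    return flat
```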

A third category sits just below the threshold. [Freeze-Omni](https://arxiv.org/abs/2411.00774) and [Mini-Omni2](https://arxiv.org/abs/2410.11190) claim "duplex" capability but achieve it by running interrupt detection on top of a half-duplex generator. Picture a walkie-talkie with a motion sensor bolted to the front: it is still a walkie-talkie, but it notices faster when the user starts to speak and switches direction sooner. The model is still only ever in one state at a time. This works well enough to feel responsive in most interactions, and is cheaper to train. It is also not the same thing as a native dual-stream model, and the distinction becomes visible whenever the conversation requires a backchannel or a co-completion.
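The bolted-on pattern is easy to write down, which is part of why it is popular. The control flow below is a generic sketch of the interrupt-detector approach, not Freeze-Omni's or Mini-Omni2's actual code; `vad`, `generator`, and `listen_until_silence` are hypothetical components.

```python
def half_duplex_with_interrupts(vad, generator, mic, speaker):
    """A half-duplex core with an interrupt detector bolted on. The system is
    still in exactly one state at a time; it just leaves 'speaking' sooner."""
    while True:
        utterance = listen_until_silence(mic, vad)      # state: listening
        reply_chunks = generator.respond(utterance)     # state: thinking
        for chunk in reply_chunks:                      # state: speaking
            speaker.play(chunk)
            if vad.user_started_speaking(mic.peek()):   # the "motion sensor"
                speaker.stop()                          # cut the reply short,
                break                                   # go back to listening
```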

The honest test is the one from §4. Try barge-in. Try a backchannel. Try starting at the same time. A system that handles all three without breaking rhythm is above the threshold. A system that handles only barge-in is using an interrupt detector.

*Figure F5: six demo cards arranged in two rows of three. Card header: model name, developer, release date. Card body: one-line "listen for" prompt. Cards for Moshi, GPT-4o voice, Gemini Live, Kyutai Hibiki, SyncLLM, and OmniFlatten. Links live in the figcaption and surrounding HTML, not in the SVG.*

## 7. Where the threshold changes what voice is useful for

Below the threshold, voice is good for transactions. Above it, voice becomes viable for conversation. Those are different markets.

The transactional applications were always going to work on a walkie-talkie. "Set a timer for ten minutes." "What's the weather." "Cancel my 3 pm." These are slot-fills. You have one thing to say, the assistant has one thing to say back. A 1,000 ms gap is fine. In fact, for commands, a deliberate pause is often reassuring: it tells you the system registered your intent.

The conversational applications were not. Any use case where the user speaks more than a sentence at a time, or where the system needs to react mid-utterance, or where pauses carry meaning, broke on half-duplex. Some of those use cases are now plausible products for the first time.

**Hands busy, eyes busy.** A driver, a surgeon, a factory worker, a cook with flour on both hands. A system that can be interrupted when the context shifts, and can hum acknowledgement without forcing a formal turn, is usable in environments where a transactional assistant was a liability.

**Accessibility.** A voice interface that cannot be interrupted is a worse interface for a user with motor or vision impairment, not a better one. The [WHO estimates 2.2 billion people with some form of vision impairment](https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment). A meaningful slice of them depends on voice interfaces. Full-duplex is the difference between a tool that respects their pace and one that forces them into a rigid listen-and-wait ritual.

**Language learning.** A practice partner that corrects you mid-sentence, the way a good human tutor does, not one that waits politely for you to finish your wrong sentence and then restart. The backchannel behavior matters here too: a learner knows they are being heard, not processed.

**Companionship and emotional conversation.** Journaling, therapy-adjacent reflection, light conversation for lonely hours. Pauses carry meaning in these contexts. A system that cannot respect a hesitant pause, or cannot say "mm-hm" while the user is working something out, reads as clinical even when its text content is good. This is one of the places where the team at oto believes voice genuinely carries feelings and not only tasks, and a walkie-talkie rhythm undercuts that directly. The primer article for this series [discusses the design argument at more length](./01-sts-primer).

**Children and elders.** Groups for whom typing is the wrong affordance, and for whom a robotic turn-based voice feels even more alien than it does to working-age adults. A grandchild who has just learned to talk does not naturally understand the concept of a beep.

None of these is a small market. Taken together, they are the difference between voice being a convenience layer on your phone and voice being a primary interface for a large share of software.

*Figure F6: 2x3 grid of use-case cards. Small, recognizable icon per cell. Two-word labels. Numbers and context in the figcaption.*

## 8. What full-duplex does not fix

Crossing the threshold is not the same as finishing the job. An honest reader should close this article knowing what is still broken.

The most important limitation is that having the capability in the model is not the same as having good behavior. A system that technically *can* listen and speak at the same time still has to decide *when* to jump in and *when* to hold back. That judgment is learned from data, and the amount of real multi-channel conversation data available for that kind of training is small. The [CANDOR corpus](https://www.betterup.com/research/candor-research), a 1,656-pair naturalistic two-channel collection, is the single best publicly documented two-channel resource at scale. It ships under [CC BY-NC](https://creativecommons.org/licenses/by-nc/4.0/), which means academic labs can use it but commercial systems cannot train on it. The next tier is SyncLLM's 2,000 real Fisher hours and its 212,000 synthetic hours. Synthetic data is not nothing, but it is not the same thing either.

Second, capability is not reliability. In April 2026 two new benchmarks landed that let you calibrate how much of the work is actually done. [τ-Voice](https://arxiv.org/abs/2603.13686), from Sierra, ran 278 voice-agent tasks and found that voice agents retain only 30 to 45 percent of the equivalent text agent's task score. In plain terms: if you take the same underlying model and give it a voice instead of a keyboard, it loses most of its competence. Full-Duplex-Bench v3, as noted above, has gpt-realtime at around 0.600 Pass@1 on tool use under real human disfluency. Voice AI, in short, is not yet voice-first AI.

Third, integrated systems have their own ceilings. Moshi's context window — the amount of conversation the model can hold in its working memory at any moment — is roughly four minutes of audio. That is a property of the architecture, not a knob the developer can turn up. Four minutes is long enough for a quick exchange and short enough that a therapy session, a tutoring hour, or a long customer support call will run out of memory partway through. Long-horizon voice conversation is an open problem that full-duplex alone does not touch.
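The four-minute figure follows from the frame rate in §6 once you fix the attention window. The window size below is an assumption chosen to match the stated ceiling, not a published Moshi number.

```python
# Rough arithmetic, using the 12.5 frames/second rate from §6.
frames_per_second = 12.5
context_frames    = 3000          # assumed fixed attention window, for illustration
minutes = context_frames / frames_per_second / 60
print(f"{minutes:.1f} minutes")   # 4.0; once the window fills, the oldest audio falls out
```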

Fourth, the quality of the threshold crossing depends on the language. Every benchmark and evaluation named in this article is dominated by English, with some coverage of Chinese and French. Stivers 2009 covered 10 languages and found the ~200 ms gap to be stable across all of them, but the systems are not stable across all of them. A Moshi trained on English conversation data does not automatically generalize to Japanese turn-taking, whose rhythms for handing the floor back and forth differ somewhat from those of Indo-European languages. Consumer voice assistants in 2026 commonly ship in 100+ locales. Full-duplex voice AI does not yet have the training data to do that.

Fifth, production latency is usually worse than demo latency. Network jitter, load balancing, cold starts, and shared inference clusters push even an integrated model into the 300 to 500 ms range on a busy day. Above the threshold on a best-case test is not the same thing as above the threshold in a user's kitchen.

None of this is a reason to discount the shift. The threshold is crossed. The walkie-talkie rhythm is no longer a structural necessity. But the systems that crossed it are brittle in ways the demos do not surface.

*Figure F7: two-column card. Left column "What works above the threshold," five green check-mark rows: simultaneous listen + speak, barge-in, backchannel, short overlap recovery, human-adjacent turn gaps. Right column "What does not yet," five question-mark rows: co-completion at scale, long-horizon memory, non-English turn-taking, tool-use reliability, production-latency consistency. Short phrases only inside the SVG.*

## 9. What it feels like

Come back to the opening scene. The kitchen, the smart speaker, the beep. That is what below the threshold sounds like. Command. Pause. Reply. Command. Pause. Reply.

Now imagine the same device, but you can start your second sentence before the first reply has finished, and it just stops and recenters on you. You pause in the middle of an instruction, and it says "mm-hm" quietly, and waits. You ask it a complicated question, and it answers at your pace, not its own. You forget that it is a machine for a few seconds, because the small rhythms of the conversation work. The device has not become smarter. It has become present.

That is what the full-duplex threshold actually is. Not faster, exactly. Not smarter. Closer to the shape of human conversation. And once you have had a conversation in that shape, returning to a walkie-talkie feels like a regression.

The threshold is the easy part, in retrospect. What comes next is harder. A system that crosses the threshold still has to have something worth saying, has to remember you from one conversation to the next, has to work in the language you grew up speaking, has to respect when you do not want to be interrupted. The architecture has gotten out of the way. Everything else now has to be earned.

The engineering work is real. The product space is just beginning. The series continues with a deeper look at what actually has to be true about training data for systems above the threshold to keep improving. If you want the next piece when it goes up, the [oto newsletter](https://oto.earth/newsletter) is the easiest way to get it.

---

### Notes, corrections, and sources


Atomic notes directly supporting this draft: `2026-04-18-half-vs-full-duplex-analogy`, `2026-04-18-cascade-latency-math`, `2026-04-18-latency-threshold-crossed`, `2026-04-19-moshi-two-transformer-split`, `2026-04-19-moshi-200ms-hardware-dependent`, `2026-04-19-moshi-inner-monologue-text-first`, `2026-04-19-moshi-two-channel-training-is-the-trick`, `2026-04-19-fdb-pause-handling-definition`, `2026-04-19-fdb-backchanneling`, `2026-04-19-fdb-turn-taking`, `2026-04-19-fdb-interruption-definition`, `2026-04-19-fdb-four-axes-synthesis`.

Paper notes: `defossez-2024-moshi`, `stivers-2009-turn-taking`, `bogels-2015-neural-signatures`, `levinson-2015-timing-turn-taking`, `heldner-2010-pauses-gaps-overlaps`, `deruiter-2006-projecting-turn-end`.

Model notes used: `moshi`, `gpt-4o-voice`, `openai-realtime-api`, `gemini-live`, `kyutai-unmute`, `kyutai-hibiki`, `syncllm`, `omniflatten`, `freeze-omni`, `mini-omni2`.

Benchmark notes used: `full-duplex-bench`, `full-duplex-bench-v3`, `tau-voice`, `uro-bench`.

---

_Originally published at [https://fullduplex.ai/blog/full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold)._
_Part of **The STS Series** · 02 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
