From pipeline to integrated.
“Integrated” sounds like one architecture. It is at least four. As of April 2026 the open speech-to-speech landscape has fractured into four architecturally distinct families, each with its own latency math, data bets, and license exposure. Knowing which family a model belongs to is the prerequisite for reading any of its numbers.
Moshi as the landmark
In September 2024 a small French lab, Kyutai, released Moshi: a 7-billion-parameter model that listened and spoke at the same time, shipped as public weights under CC-BY 4.0, with the Mimi codec released alongside it under MIT. You could download Moshi, run it on a laptop with a recent GPU, and have a conversation that did not feel like Siri. The release paper claimed 160 milliseconds of theoretical latency and 200 milliseconds measured on an NVIDIA L4.
That single release turned a distinction that had lived inside research papers for years into a product fact. On one side of the line were cascades. Recognize the user's speech, run a language model, synthesize the reply, play it back. On the other side was something else: speech in, speech out, one model, continuous inference. The research community had sometimes called this split “pipeline versus end-to-end” or “modular versus integrated.” Before Moshi, the only systems anyone could point to on the integrated side were closed commercial ones (GPT-4o voice, which OpenAI had demoed in May 2024). After Moshi, you could clone a repository and read the architecture.
Eighteen months later, the integrated side has fractured. As of April 2026 there are at least four architecturally distinct families of full-duplex STS, plus a growing closed commercial layer above them. This article maps the families and explains why the distinctions matter, especially when comparing latency, training-data needs, and licensing exposure across products.
The pipeline ancestor
The cascade is worth naming in one paragraph so the contrast later has something to push against. You already saw the detailed version in Article 02. In short, the legacy voice pipeline is a five-stage loop. The device captures audio, an automatic speech recognition (ASR) model turns it into text, a language model reads the text and produces a reply, a text-to-speech (TTS) model renders that reply back into audio, and the speaker plays it. Each stage has to finish before the next can start, so the minimum achievable latency is the sum of the stage latencies. Once you add network hops, model routing, and cold starts, a well-tuned cascade lands near one second end-to-end on a typical day.
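To make the additive math concrete, here is a minimal sketch. Every stage timing and hop count in it is an illustrative assumption, not a measurement of any named system; the only point is that serial stages and network hops sum.

```python
# Back-of-envelope: why a cascade's floor latency is additive.
# All numbers below are illustrative assumptions, not measurements.

CASCADE_STAGES_MS = {
    "capture_buffer": 100,    # audio chunking before ASR can start
    "asr_final": 250,         # ASR settles on a final transcript
    "llm_first_token": 300,   # language model begins the reply
    "tts_first_audio": 150,   # TTS renders the first audio frame
    "playback_start": 50,     # device output buffering
}

def cascade_floor_ms(stages, network_hops=3, per_hop_ms=40):
    """Serial stages sum; each network hop between stages adds on top."""
    return sum(stages.values()) + network_hops * per_hop_ms

print(cascade_floor_ms(CASCADE_STAGES_MS))  # 970 -> near one second end-to-end
```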
To be fair, modern cascades have kept improving. Deepgram's Aura-2 and Aura-Nova lines quote sub-second end-to-end latency for their agent stack, and Cartesia's Sonic is now one of the fastest commercial TTS engines at roughly 90 milliseconds time-to-first-audio. A cascade with excellent components can now dip well under a second under favorable conditions. The structural property that still holds is that a cascade cannot, by its own design, listen while it speaks. Every full-duplex claim you read is a claim that some part of the system breaks that constraint.
Four families under one label
Call the other side “integrated” and it looks like one thing. In practice it is at least four. The cleanest way to keep them straight is not an architecture diagram. It is a kitchen.
Imagine a single cook who has to take orders from the dining room and plate dishes at the counter at the same time. Four kitchens solve this problem four different ways, and the four families of integrated STS map onto them almost exactly.
- Family 1 is two intercoms at once. One channel carries the incoming order, another carries the outgoing plating call. Both are open simultaneously. The cook also scribbles mental notes on a whiteboard to keep the thread straight across both lines. Moshi, PersonaPlex, and Sesame CSM are wired this way.
- Family 2 is one intercom, alternating very fast. Only one line is available, but orders and plating calls take very short turns on it, alternating so quickly that from the dining room it sounds like both directions are happening at once. OmniFlatten, Qwen2.5-Omni, Covo-Audio, and Kimi-Audio are wired this way.
- Family 3 is a relay with a supervisor. The classic kitchen pipeline (take order, prep, plate) is still there, stage by stage. A supervisor stands behind the line and shouts “cut in now!” every half second, so the relay overlaps at short time scales instead of waiting for each handoff. Freeze-Omni and MiniCPM-o are wired this way.
- Family 4 is no tickets at all. The cook never writes anything down, never converts the conversation into printable text. The whole kitchen runs on continuous hand signals, and an internal “speak or listen” instinct decides which way the signal flows. SALMONN-omni is the only public example.
The cheat-sheet table below compresses the same split into one line per family; the last column previews why identical-looking latency numbers mean different things family by family.
| Family | Everyday intuition | Example models | What “200 ms” means here |
|---|---|---|---|
| 1. Dual-stream + codec | Two intercoms at once | Moshi, PersonaPlex, CSM-1B | Theoretical floor on the single transformer's forward pass |
| 2. Interleaved / flatten | One intercom, very fast alternation | Qwen2.5-Omni, Covo-Audio-Chat-FD, Kimi-Audio | The length of one alternating block, not the response time |
| 3. Cascade + predictor | Relay with a supervisor | Freeze-Omni, MiniCPM-o 4.5 | The supervisor signal alone. Full pipeline is 5–10× slower |
| 4. Codec-free / thinking | No tickets, only hand signals | SALMONN-omni | Not yet standardized; the family is too young |
The reason the distinction matters is not aesthetic. Each family sets a different bound on what a published number actually means. A “200 millisecond” result from Family 1 is a theoretical floor on how long the single transformer takes to produce its reply. The same number from Family 2 is the length of one alternating block, not the response time: a Family 2 system still has to watch several blocks go by before it has formed a full reply. A Family 3 “200 ms” usually describes only the supervisor signal, while the full relay behind it reports separately, and that number is typically five to ten times higher. Three families, three different phenomena, one column header.
Family 1: Dual-stream with a neural codec
The reference system for this family is Moshi itself. Kyutai's decision that made the architecture tractable was the Mimi codec, a streaming neural audio codec operating at 12.5 hertz. Think of the codec as a smart compressor in the same spirit as MP3, but designed specifically so a language model can read its output the way it reads words. Every 80 milliseconds Mimi emits a small handful of “sound tokens” that capture tone and rhythm as well as content.
In Moshi, three streams of these tokens run side by side: one for the user's audio, one for the model's audio, and a third text stream for the model's “inner monologue,” which is a running text draft of what the model is about to say. A single transformer reads all three streams together. The model's audio stream is fed back out through Mimi in reverse to become sound.
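To make the lockstep concrete, here is a minimal sketch of the layout with made-up token values. The stream names and the two-tokens-per-frame detail are illustrative, not Moshi's exact token format; the point is that at every 80-millisecond step the transformer reads one time-aligned slice of all three streams, so listening and speaking advance together rather than in turns.

```python
# Sketch of the Family 1 dual-stream layout. Token values are fake.

FRAME_MS = 1000 / 12.5  # Mimi's 12.5 Hz frame rate -> 80 ms per step

steps = [
    # (user audio tokens, model audio tokens, inner-monologue text)
    (["u41", "u42"], ["m17", "m18"], "I"),
    (["u43", "u44"], ["m19", "m20"], "think"),
    (["u45", "u46"], ["m21", "m22"], "so."),
]
for t, (user, model, text) in enumerate(steps):
    print(f"t={t * FRAME_MS:.0f}ms  user={user}  model={model}  text={text!r}")
```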
That architecture has now been adopted by more than one lab. NVIDIA PersonaPlex-7B-v1, released by NVIDIA ADLR in January 2026, initializes its weights directly from Moshi and fine-tunes on a corpus of 1,840 hours of synthetic customer-service dialogue and 410 hours of question-answering dialogue, generated by Qwen3-32B and GPT-OSS-120B as transcripts and rendered by Chatterbox TTS as speech, plus the Fisher English corpus for casual dialogue. PersonaPlex's contribution is a hybrid system-prompt mechanism that conditions the model on a role (via text) and a voice (via a short audio sample). Code is MIT, weights are under the NVIDIA Open Model License, a bespoke license that permits commercial use with conditions. Sesame's CSM-1B, released under Apache 2.0, uses the same Mimi codec and a Llama backbone, trained on roughly one million hours of English conversational audio. Sesame's larger CSM-Medium at 8B parameters remains closed; only the 1B tier is public.
Note: “Open weights” is not the same as “open data.” Moshi's training corpus is known to include Fisher and undisclosed scraped English conversational audio. CSM publishes almost nothing about its 1-million-hour corpus. PersonaPlex is explicit about its synthetic corpus but inherits whatever Moshi was trained on at the base. Family 1 has the cleanest license stories for model weights, and the least transparent stories for the audio those weights learned from. That distinction is most of what Article 10 of this series will be about.
Family 2: Interleaved single-stream
The second family makes a different bet. Instead of running two audio streams in parallel, it packs speech and text into a single timeline with repeating blocks. The model reads a small block of text, then a small block of speech, then text, then speech, and so on. Full-duplex behavior is not really parallel here; it is very fast alternation. If the blocks are small enough, the outside observer cannot tell the difference.
The paper that named the design is OmniFlatten, published by Alibaba's Tongyi Lab in October 2024. OmniFlatten is built on Qwen2-0.5B (yes, half a billion parameters, not seven) and uses a staged training recipe that progressively shrinks the interleaving grain: from a four-stream layout to three streams to two streams, with a final configuration of text-chunk size two and speech-chunk size ten. OmniFlatten was trained on 2,000 hours of synthetic dialogue rendered by CosyVoice, making it one of the first full-duplex STS systems trained entirely on generated audio. The weights were not released; the productized descendant is Qwen2.5-Omni, which ships under Apache 2.0.
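The serialization is simple enough to sketch in a few lines. The helper below is an illustration of the flatten idea, not OmniFlatten's actual code; the chunk sizes match the paper's reported final configuration, and the token values are fake.

```python
# Illustrative "flatten" serialization: one token timeline of alternating
# fixed-size chunks (text=2, speech=10, per the OmniFlatten final config).

from itertools import islice

def flatten(text_tokens, speech_tokens, text_chunk=2, speech_chunk=10):
    """Interleave two token streams into one sequence of alternating blocks."""
    text_it, speech_it = iter(text_tokens), iter(speech_tokens)
    out = []
    while True:
        t = list(islice(text_it, text_chunk))
        s = list(islice(speech_it, speech_chunk))
        if not t and not s:
            return out
        out.extend(t + s)

seq = flatten([f"T{i}" for i in range(4)], [f"S{i}" for i in range(20)])
print(seq)  # ['T0', 'T1', 'S0', ..., 'S9', 'T2', 'T3', 'S10', ..., 'S19']
```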
The family has grown rapidly. Step-Audio 2 and GLM-4-Voice use close variants of the flatten idea. LLaMA-Omni 2 is a Meta-LLaMA-based reimplementation. Moonshot's Kimi-Audio, released under MIT, claims 13 million hours of speech pretraining. Tencent's Covo-Audio and Covo-Audio-Chat-FD, released in March 2026 under CC BY 4.0, extend the pattern by adding a third kind of block (images) to the alternation, and ship a dedicated full-duplex variant alongside. That last release is worth calling out separately: as of April 2026, Covo-Audio-Chat-FD is the most permissively licensed full-duplex STS weight release in public. CC BY 4.0 is genuinely commercial-safe with attribution, which Family 2 has otherwise struggled to offer at full-duplex scale.
The tradeoff is that serialization has a built-in cadence. Turn-taking in Family 2 is a blocking pattern, not a concurrent behavior. The smallest unit of responsiveness is the block. This is why the fact that OmniFlatten achieves full-duplex at 0.5B parameters is interesting (the architecture scales down gracefully), and also why Family 2 latency numbers should always be read together with the block size. A 200-millisecond chunk cadence is not the same object as a 200-millisecond end-to-end response.
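A hedged back-of-envelope makes the distinction concrete. Both numbers below are assumptions chosen for round arithmetic, not published measurements for any model.

```python
# Why a chunk cadence is not a response time. Both numbers are illustrative.
block_ms = 200         # duration of one alternating block (the headline number)
blocks_to_reply = 4    # blocks the model watches before a full reply has formed

print(block_ms * blocks_to_reply)  # 800 ms floor behind a "200 ms" headline
```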
Family 3: Cascade with a chunk-level duplex predictor
Family 3 is the family that looks like a cascade and behaves, at short time scales, like a full-duplex system. Think back to the kitchen with the supervisor. The ASR, LLM, and TTS stages are still a line cook working stage by stage. The supervisor standing behind the line is a small extra model, a “state predictor,” whose only job is to decide every fraction of a second whether the system should be listening, speaking, or cutting in. The predictor breaks the input and output into tiny slices so the three stages can overlap inside each slice rather than waiting for each other.
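Before turning to the concrete systems, a toy sketch of the control loop may help. Everything in it is illustrative: the state names, the hypothetical asr/llm/tts interfaces, and the stub predictor, which in a real system is a trained model, not a hand-written rule.

```python
# Toy sketch of a Family 3 control loop. All interfaces are hypothetical.

from enum import Enum

class State(Enum):
    LISTEN = 1     # keep feeding ASR
    SPEAK = 2      # keep streaming the reply
    INTERRUPT = 3  # barge-in detected: stop TTS, hand the chunk back to ASR

def predict_state(chunk) -> State:
    return State.LISTEN  # stub; a trained chunk-level classifier runs here

def duplex_loop(mic_chunks, asr, llm, tts):
    """Per audio chunk, the predictor decides which stage of the relay advances."""
    for chunk in mic_chunks:
        state = predict_state(chunk)
        if state is State.INTERRUPT:
            tts.stop()
            asr.feed(chunk)
        elif state is State.LISTEN:
            asr.feed(chunk)
        elif state is State.SPEAK:
            tts.stream(llm.continue_reply(asr.partial()))
```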
Freeze-Omni, released by Tencent AI Lab and collaborators at NJU, Fudan, and NPU in November 2024, is the cleanest example. It pairs a Qwen2-7B-Instruct language model with a CTC-pretrained speech encoder and a TiCodec-based autoregressive speech decoder, then trains a chunk-level state predictor on 60,000 question-answer pairs using eight GPUs. The paper reports a model-only latency of 160 to 320 milliseconds and a real-scenario latency of roughly 1.2 seconds. That gap is the honest number: the state predictor itself runs in hundreds of milliseconds, but the full pipeline behind it takes more than a second to come back with a response on real hardware. The 160-millisecond figure is not a like-for-like comparison against Moshi's 200-millisecond number, and careful readers should not treat them as competitors on a single axis.
OpenBMB's MiniCPM-o 4.5, a 9-billion-parameter on-device model, takes the same basic idea and applies it aggressively. It composes SigLIP2 vision, Whisper ASR, CosyVoice2 TTS, and Qwen3-8B language modeling into a single on-device multimodal system, using time-division multiplexing to interleave the modalities. MiniCPM-o is the clearest evidence that Family 3 can be made to run on consumer hardware, since the entire stack fits on a Mac Studio or a modern NVIDIA RTX 4090.
The family's value proposition is pragmatic. If your organization already has a strong ASR model and a strong LLM and a strong TTS model, Family 3 lets you reach full-duplex behavior without retraining the whole stack end-to-end. If you do not have those components, Family 3 is no cheaper than Family 1 or Family 2 and imports their latency and coordination problems at the same time.
Family 4: Codec-free with a thinking mechanism
The smallest of the four families, but architecturally distinct enough to deserve its own slot, is the codec-free line. To return to the kitchen image: the cook never writes a ticket, never reads an order aloud, never turns the conversation into printable text at any stage.
A second picture is the difference between sheet music and humming. The first three families all “write down” speech before doing anything with it. They turn audio into a sequence of discrete codec tokens that a transformer can read the way it reads words. Family 4 never writes it down. It keeps the audio as a continuous signal throughout, the way a person hums along to a melody without ever naming the notes.
ByteDance's SALMONN-omni is the representative release. SALMONN-omni does not use a neural audio codec at all. The model takes continuous audio embeddings from a SALMONN encoder, runs them through a transformer, and emits an internal “thinking state” that decides when to speak versus listen, all without ever converting the audio into discrete codes.
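The shape of the idea fits in a few lines of PyTorch. The sketch below is a toy, not SALMONN-omni's architecture: the module names, layer counts, and two-way head are assumptions; the only faithful part is that audio stays a continuous tensor end to end, with no discrete codec tokens anywhere.

```python
# Hedged sketch of the codec-free idea. Shapes and modules are illustrative.

import torch
import torch.nn as nn

class CodecFreeDuplex(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.think_head = nn.Linear(d_model, 2)  # logits for [listen, speak]

    def forward(self, audio_embeddings):
        # audio_embeddings: (batch, frames, d_model) continuous features,
        # straight from an audio encoder; nothing is ever tokenized.
        h = self.backbone(audio_embeddings)
        return self.think_head(h[:, -1])  # "thinking state" at the newest frame

model = CodecFreeDuplex()
frames = torch.randn(1, 25, 512)      # ~2 s of features at an assumed 12.5 Hz
print(model(frames).softmax(-1))      # e.g. tensor([[0.47, 0.53]])
```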
The reason this matters even at a minority share is that it is the clearest counter-example to “integrated STS means a neural audio codec.” Family 4 argues that speech tokenization is a design choice, not a requirement, and that the full-duplex behavior can emerge from continuous-space attention alone. The public tooling around SALMONN-omni is thinner than the other three families. Weights exist but the surrounding evaluation and benchmark plumbing is younger. Whether the family grows into a major line or stays a single-example curiosity is one of the live questions in STS research in 2026.
Closed commercial STS, alongside the families
The cleanest picture for the gap between the four open families and the closed commercial layer is a shop window. Behind the window is an open kitchen, which is the four families from the previous sections. All code visible. All architecture documented in papers. All weights downloadable with some license attached. In front of the window is the sales floor, where finished voice products are sold by the minute or by the token. You can watch a GPT-4o or Nova Sonic response come out the door. You can time it. You can compare it to a competitor. But the prep line that produced it is on the other side of the glass, and the vendor keeps the blinds drawn.
Most of the STS minutes actually used in production in 2026 are not spoken through any of the four open families above. They are spoken through closed commercial systems. GPT-4o voice, introduced by OpenAI in May 2024, reports an average latency of 320 milliseconds and a minimum of 232 milliseconds. Gemini Live has shipped on Google's apps since 2024, and the Gemini 3.1 Flash Live API on Vertex AI, released in March 2026, reports roughly 320 milliseconds first-token p50. Amazon Nova Sonic, Microsoft MAI-Voice-1, ByteDance's Doubao voice line, Hume EVI, and Cartesia Sonic all ship production voice stacks with latency, pricing, and demos published, but architectural internals kept private.
For readers trying to place these systems in the taxonomy, the honest reverse-engineering discipline is this: you can infer some things from behavior, but not family membership. A system that can barge in gracefully, backchannel without hallucination, and recover from overlap without dropping state is very likely running something architecturally closer to Family 1 or Family 2 than a plain cascade. A system whose interruption behavior feels chunky and whose overlap handling falls back to a “please wait” cue is likely a Family 3 descendant. But vendors rarely confirm this in public, and behavior varies even within a single product across load conditions and model revisions. The short version: you can see the dish, but not the recipe.
| Visible from the window | Behind the blinds |
|---|---|
| Latency numbers (p50, p95) | Which of the four families the model belongs to |
| Pricing per minute or per token | Training data provenance and licensing |
| Interruption and backchannel feel | Parameter count and serving hardware |
| Supported languages and voices | Whether synthetic or real audio dominated training |
| Demo behavior on scripted examples | Internal license terms for audio used in training |
The gap between the closed commercial layer and the open-weights layer is itself a business-model signal. Most progress on STS evaluation, on licensing clarity, and on architectural variety is happening in public, through research labs and Chinese academic groups. Most production usage is happening in private, behind a small number of commercial APIs from a small number of US hyperscalers and scaleups. Any investor thesis on voice AI in 2026 has to take a view on how that gap is going to close: whether the open side will catch the closed systems (Kyutai's Moshi, Sesame's CSM, Qwen2.5-Omni, and Covo-Audio-Chat-FD each argue yes, at different scales and under different licenses), or whether the closed layer will stay ahead on quality and data scale indefinitely.
Why the taxonomy matters, and what comes next
Three consequences follow from the four-family split, and each one directly shapes a decision that matters outside the architecture community.
The first is that latency numbers are not cross-comparable. A 200-millisecond number from Moshi (Family 1) describes the theoretical floor of a dual-stream transformer on an L4. A 200-millisecond chunk cadence from OmniFlatten (Family 2) describes the size of an interleaving block. A 200-millisecond number from Freeze-Omni (Family 3) describes the state predictor alone, while the full pipeline's real-scenario number is closer to 1.2 seconds. Reading these three as if they are the same quantity is the most common way that product evaluations and benchmark tables mislead.
The second is that training-data implications differ wildly per family. Family 1 is hungriest for clean two-channel conversational speech. Moshi trained on Fisher and undisclosed English conversational corpora; PersonaPlex added 2,250 hours of synthetic dialogue on top. Family 2 can be trained largely or entirely on synthetic interleaved data, as OmniFlatten's 2,000-hour CosyVoice corpus demonstrated at 0.5B parameters. Family 3 leans on pretrained ASR and TTS corpora plus a small fine-tuning set for the state predictor. Family 4 needs continuous-embedding conversational audio, a kind of data the field has not yet tried to collect at scale. Picking a family is, in practice, picking a data bet. The forward pointer here is to Article 04 on what public datasets actually contain, and to Article 05 on why two-channel matters (forthcoming in this series).
The third is that licensing exposure differs. Family 1 has the cleanest license stories for weights: Moshi under CC-BY 4.0, CSM-1B under Apache 2.0, PersonaPlex under the NVIDIA Open Model License. Family 2 ranges from MIT (Kimi-Audio) and CC BY 4.0 (Covo-Audio) at the permissive end to custom community licenses with commercial carve-outs (Baichuan-Omni) in the middle to paper-only with no weight release (OmniFlatten) at the far end. Families 3 and 4 are younger and less license-consistent. An enterprise buyer running a procurement review has to read at least five different license regimes to cover the open layer alone. That story will be the subject of Article 10 on consent and licensing.
Picking a family is, in practice, picking three bets at once. The cheat sheet below collapses the section into one table.
| If you are... | Likely family | Why |
|---|---|---|
| A lab chasing the shortest transformer forward pass | Family 1 | Fewest moving parts, one joint model reads all streams together |
| A startup minimizing training data cost | Family 2 | Can train on 2,000 hours of synthetic interleaved audio, as OmniFlatten showed |
| An enterprise with a strong ASR + LLM + TTS stack already | Family 3 | Adds full-duplex behavior without retraining the pipeline end-to-end |
| A research group testing whether codec tokens are necessary | Family 4 | Only family that skips the neural codec entirely |
Four families, more than twenty-five public model releases, and a steady cadence of new arXiv papers mean that picking a model is now a question of fit, not availability. What remains scarce is the training audio those models learn from, and what truly separates a demo from a production system is the data. That is where the rest of this series goes.
We index the STS / full-duplex / audio foundation-model landscape so you don't have to.
Benchmarks, models, datasets — curated and kept current. If you are evaluating STS architectures and want to know which data shape each family actually needs, the catalog is the starting point. oto also builds two-channel STS datasets — get in touch or access the investor data room.