From pipeline to integrated.
“Integrated” sounds like one architecture. It is at least four. As of April 2026 the open speech-to-speech landscape has fractured into four architecturally distinct families, each with its own latency math, data bets, and license exposure. Knowing which family a model belongs to is the prerequisite for reading any of its numbers.
Moshi as the landmark
In September 2024 a small French lab, Kyutai, released Moshi: a 7-billion-parameter model that listened and spoke at the same time, shipped as public weights under CC-BY 4.0, with the Mimi codec released alongside it under MIT. You could download Moshi, run it on a laptop with a recent GPU, and have a conversation that did not feel like Siri. The release paper claimed 160 milliseconds of theoretical latency and 200 milliseconds measured on an NVIDIA L4.
That single release turned a distinction that had lived inside research papers for years into a product fact. On one side of the line were cascades. Recognize the user's speech, run a language model, synthesize the reply, play it back. On the other side was something else: speech in, speech out, one model, continuous inference. The research community had sometimes called this split “pipeline versus end-to-end” or “modular versus integrated.” Before Moshi, the only systems anyone could point to on the integrated side were closed commercial ones (GPT-4o voice, which OpenAI had demoed in May 2024). After Moshi, you could clone a repository and read the architecture.
Eighteen months later, the integrated side has fractured. As of April 2026 there are at least four architecturally distinct families of full-duplex STS, plus a growing closed commercial layer above them. This article maps the families and explains why the distinctions matter, especially when comparing latency, training-data needs, and licensing exposure across products.
The pipeline ancestor
The cascade is worth naming in one paragraph so the contrast later has something to push against. You already saw the detailed version in Article 02. In short, the legacy voice pipeline is a five-stage loop. The device captures audio, an automatic speech recognition (ASR) model turns it into text, a language model reads the text and produces a reply, a text-to-speech (TTS) model renders that reply back into audio, and the speaker plays it. Each stage has to finish before the next can start, so the minimum achievable latency is the sum of the stage latencies. Once you add network hops, model routing, and cold starts, a well-tuned cascade lands near one second end-to-end on a typical day.
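To make the additive math concrete, here is a minimal sketch. Every stage timing and hop count in it is an illustrative assumption, not a measurement of any named system; the only point is that serial stages and network hops sum.

```python
# Back-of-envelope: why a cascade's floor latency is additive.
# All numbers below are illustrative assumptions, not measurements.

CASCADE_STAGES_MS = {
    "capture_buffer": 100,    # audio chunking before ASR can start
    "asr_final": 250,         # ASR settles on a final transcript
    "llm_first_token": 300,   # language model begins the reply
    "tts_first_audio": 150,   # TTS renders the first audio frame
    "playback_start": 50,     # device output buffering
}

def cascade_floor_ms(stages, network_hops=3, per_hop_ms=40):
    """Serial stages sum; each network hop between stages adds on top."""
    return sum(stages.values()) + network_hops * per_hop_ms

print(cascade_floor_ms(CASCADE_STAGES_MS))  # 970 -> near one second end-to-end
```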
To be fair, modern cascades have kept improving. Deepgram's Aura-2 and Aura-Nova lines quote sub-second end-to-end latency for their agent stack, and Cartesia's Sonic is now one of the fastest commercial TTS engines at roughly 90 milliseconds time-to-first-audio. A cascade with excellent components can now dip well under a second under favorable conditions. The structural property that still holds is that a cascade cannot, by its own design, listen while it speaks. Every full-duplex claim you read is a claim that some part of the system breaks that constraint.
Four families under one label
Call the other side “integrated” and it looks like one thing. In practice it is at least four. The cleanest way to keep them straight is not an architecture diagram. It is a kitchen.
Imagine a single cook who has to take orders from the dining room and plate dishes at the counter at the same time. Four kitchens solve this problem four different ways, and the four families of integrated STS map onto them almost exactly.
- Family 1 is two intercoms at once. One channel carries the incoming order, another carries the outgoing plating call. Both are open simultaneously. The cook also scribbles mental notes on a whiteboard to keep the thread straight across both lines. Moshi, PersonaPlex, and Sesame CSM are wired this way.
- Family 2 is one intercom, alternating very fast. Only one line is available, but orders and plating calls take very short turns on it, alternating so quickly that from the dining room it sounds like both directions are happening at once. OmniFlatten, Qwen2.5-Omni, Covo-Audio, and Kimi-Audio are wired this way.
- Family 3 is a relay with a supervisor. The classic kitchen pipeline (take order, prep, plate) is still there, stage by stage. A supervisor stands behind the line and shouts “cut in now!” every half second, so the relay overlaps at short time scales instead of waiting for each handoff. Freeze-Omni and MiniCPM-o are wired this way.
- Family 4 is no tickets at all. The cook never writes anything down, never converts the conversation into printable text. The whole kitchen runs on continuous hand signals, and an internal “speak or listen” instinct decides which way the signal flows. SALMONN-omni is the only public example.
The cheat-sheet table below compresses the same split into one line per family; the last column previews why identical-looking latency numbers mean different things family by family.
| Family | Everyday intuition | Example models | What “200 ms” means here |
|---|---|---|---|
| 1. Dual-stream + codec | Two intercoms at once | Moshi, PersonaPlex, CSM-1B | Theoretical floor on the single transformer's forward pass |
| 2. Interleaved / flatten | One intercom, very fast alternation | Qwen2.5-Omni, Covo-Audio-Chat-FD, Kimi-Audio | The length of one alternating block, not the response time |
| 3. Cascade + predictor | Relay with a supervisor | Freeze-Omni, MiniCPM-o 4.5 | The supervisor signal alone. Full pipeline is 5–10× slower |
| 4. Codec-free / thinking | No tickets, only hand signals | SALMONN-omni | Not yet standardized; the family is too young |
The reason the distinction matters is not aesthetic. Each family sets a different bound on what a published number actually means. A “200 millisecond” result from Family 1 is a theoretical floor on how long the single transformer takes to produce its reply. The same number from Family 2 is the length of one alternating block, not the response time: a Family 2 system still has to watch several blocks go by before it has formed a full reply. A Family 3 “200 ms” usually describes only the supervisor signal, while the full relay behind it reports separately, and that number is typically five to ten times higher. Three families, three different phenomena, one column header.
Family 1: Dual-stream with a neural codec
The reference system for this family is Moshi itself. Kyutai's decision that made the architecture tractable was the Mimi codec, a streaming neural audio codec operating at 12.5 hertz. Think of the codec as a smart compressor in the same spirit as MP3, but designed specifically so a language model can read its output the way it reads words. Every 80 milliseconds Mimi emits a small handful of “sound tokens” that capture tone and rhythm as well as content.
In Moshi, three streams of these tokens run side by side: one for the user's audio, one for the model's audio, and a third text stream for the model's “inner monologue,” which is a running text draft of what the model is about to say. A single transformer reads all three streams together. The model's audio stream is fed back out through Mimi in reverse to become sound.
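To make the lockstep concrete, here is a minimal sketch of the layout with made-up token values. The stream names and the two-tokens-per-frame detail are illustrative, not Moshi's exact token format; the point is that at every 80-millisecond step the transformer reads one time-aligned slice of all three streams, so listening and speaking advance together rather than in turns.

```python
# Sketch of the Family 1 dual-stream layout. Token values are fake.

FRAME_MS = 1000 / 12.5  # Mimi's 12.5 Hz frame rate -> 80 ms per step

steps = [
    # (user audio tokens, model audio tokens, inner-monologue text)
    (["u41", "u42"], ["m17", "m18"], "I"),
    (["u43", "u44"], ["m19", "m20"], "think"),
    (["u45", "u46"], ["m21", "m22"], "so."),
]
for t, (user, model, text) in enumerate(steps):
    print(f"t={t * FRAME_MS:.0f}ms  user={user}  model={model}  text={text!r}")
```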
That architecture has now been adopted by more than one lab. NVIDIA PersonaPlex-7B-v1, released by NVIDIA ADLR in January 2026, initializes its weights directly from Moshi and fine-tunes on a corpus of 1,840 hours of synthetic customer-service dialogue and 410 hours of question-answering dialogue, generated by Qwen3-32B and GPT-OSS-120B as transcripts and rendered by Chatterbox TTS as speech, plus the Fisher English corpus for casual dialogue. PersonaPlex's contribution is a hybrid system-prompt mechanism that conditions the model on a role (via text) and a voice (via a short audio sample). Code is MIT, weights are under the NVIDIA Open Model License, a bespoke license that permits commercial use with conditions. Sesame's CSM-1B, released under Apache 2.0, uses the same Mimi codec and a Llama backbone, trained on roughly one million hours of English conversational audio. Sesame's larger CSM-Medium at 8B parameters remains closed; only the 1B tier is public.
Note: “Open weights” is not the same as “open data.” Moshi's training corpus is known to include Fisher and undisclosed scraped English conversational audio. CSM publishes almost nothing about its 1-million-hour corpus. PersonaPlex is explicit about its synthetic corpus but inherits whatever Moshi was trained on at the base. Family 1 has the cleanest license stories for model weights, and the least transparent stories for the audio those weights learned from. That distinction is most of what Article 10 of this series will be about.
Family 2: Interleaved single-stream
The second family makes a different bet. Instead of running two audio streams in parallel, it packs speech and text into a single timeline with repeating blocks. The model reads a small block of text, then a small block of speech, then text, then speech, and so on. Full-duplex behavior is not really parallel here; it is very fast alternation. If the blocks are small enough, the outside observer cannot tell the difference.
The paper that named the design is OmniFlatten, published by Alibaba's Tongyi Lab in October 2024. OmniFlatten is built on Qwen2-0.5B (yes, half a billion parameters, not seven) and uses a staged training recipe that progressively shrinks the interleaving grain: from a four-stream layout to three streams to two streams, with a final configuration of text-chunk size two and speech-chunk size ten. OmniFlatten was trained on 2,000 hours of synthetic dialogue rendered by CosyVoice, making it one of the first full-duplex STS systems trained entirely on generated audio. The weights were not released; the productized descendant is Qwen2.5-Omni, which ships under Apache 2.0.
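The serialization is simple enough to sketch in a few lines. The helper below is an illustration of the flatten idea, not OmniFlatten's actual code; the chunk sizes match the paper's reported final configuration, and the token values are fake.

```python
# Illustrative "flatten" serialization: one token timeline of alternating
# fixed-size chunks (text=2, speech=10, per the OmniFlatten final config).

from itertools import islice

def flatten(text_tokens, speech_tokens, text_chunk=2, speech_chunk=10):
    """Interleave two token streams into one sequence of alternating blocks."""
    text_it, speech_it = iter(text_tokens), iter(speech_tokens)
    out = []
    while True:
        t = list(islice(text_it, text_chunk))
        s = list(islice(speech_it, speech_chunk))
        if not t and not s:
            return out
        out.extend(t + s)

seq = flatten([f"T{i}" for i in range(4)], [f"S{i}" for i in range(20)])
print(seq)  # ['T0', 'T1', 'S0', ..., 'S9', 'T2', 'T3', 'S10', ..., 'S19']
```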
The family has grown rapidly. Step-Audio 2 and GLM-4-Voice use close variants of the flatten idea. LLaMA-Omni 2 is a Meta-LLaMA-based reimplementation. Moonshot's Kimi-Audio, released under MIT, claims 13 million hours of speech pretraining. Tencent's Covo-Audio and Covo-Audio-Chat-FD, released in March 2026 under CC BY 4.0, extend the pattern by adding a third kind of block (images) to the alternation, and ship a dedicated full-duplex variant alongside. That last release is worth calling out separately: as of April 2026, Covo-Audio-Chat-FD is the most permissively licensed full-duplex STS weight release in public. CC BY 4.0 is genuinely commercial-safe with attribution, which Family 2 has otherwise struggled to offer at full-duplex scale.
The tradeoff is that serialization has a built-in cadence. Turn-taking in Family 2 is a blocking pattern, not a concurrent behavior. The smallest unit of responsiveness is the block. This is why the fact that OmniFlatten achieves full-duplex at 0.5B parameters is interesting (the architecture scales down gracefully), and also why Family 2 latency numbers should always be read together with the block size. A 200-millisecond chunk cadence is not the same object as a 200-millisecond end-to-end response.
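A hedged back-of-envelope makes the distinction concrete. Both numbers below are assumptions chosen for round arithmetic, not published measurements for any model.

```python
# Why a chunk cadence is not a response time. Both numbers are illustrative.
block_ms = 200         # duration of one alternating block (the headline number)
blocks_to_reply = 4    # blocks the model watches before a full reply has formed

print(block_ms * blocks_to_reply)  # 800 ms floor behind a "200 ms" headline
```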
Family 3: Cascade with a chunk-level duplex predictor
Family 3 is the family that looks like a cascade and behaves, at short time scales, like a full-duplex system. Think back to the kitchen with the supervisor. The ASR, LLM, and TTS stages are still a line cook working stage by stage. The supervisor standing behind the line is a small extra model, a “state predictor,” whose only job is to decide every fraction of a second whether the system should be listening, speaking, or cutting in. The predictor breaks the input and output into tiny slices so the three stages can overlap inside each slice rather than waiting for each other.
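Before turning to the concrete systems, a toy sketch of the control loop may help. Everything in it is illustrative: the state names, the hypothetical asr/llm/tts interfaces, and the stub predictor, which in a real system is a trained model, not a hand-written rule.

```python
# Toy sketch of a Family 3 control loop. All interfaces are hypothetical.

from enum import Enum

class State(Enum):
    LISTEN = 1     # keep feeding ASR
    SPEAK = 2      # keep streaming the reply
    INTERRUPT = 3  # barge-in detected: stop TTS, hand the chunk back to ASR

def predict_state(chunk) -> State:
    return State.LISTEN  # stub; a trained chunk-level classifier runs here

def duplex_loop(mic_chunks, asr, llm, tts):
    """Per audio chunk, the predictor decides which stage of the relay advances."""
    for chunk in mic_chunks:
        state = predict_state(chunk)
        if state is State.INTERRUPT:
            tts.stop()
            asr.feed(chunk)
        elif state is State.LISTEN:
            asr.feed(chunk)
        elif state is State.SPEAK:
            tts.stream(llm.continue_reply(asr.partial()))
```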
Freeze-Omni, released by Tencent AI Lab and collaborators at NJU, Fudan, and NPU in November 2024, is the cleanest example. It pairs a Qwen2-7B-Instruct language model with a CTC-pretrained speech encoder and a TiCodec-based autoregressive speech decoder, then trains a chunk-level state predictor on 60,000 question-answer pairs using eight GPUs. The paper reports a model-only latency of 160 to 320 milliseconds and a real-scenario latency of roughly 1.2 seconds. That gap is the honest number: the state predictor itself runs in hundreds of milliseconds, but the full pipeline behind it takes more than a second to come back with a response on real hardware. The 160-millisecond figure is not a like-for-like comparison against Moshi's 200-millisecond number, and careful readers should not treat them as competitors on a single axis.
OpenBMB's MiniCPM-o 4.5, a 9-billion-parameter on-device model, takes the same basic idea and applies it aggressively. It composes SigLIP2 vision, Whisper ASR, CosyVoice2 TTS, and Qwen3-8B language modeling into a single on-device multimodal system, using time-division multiplexing to interleave the modalities. MiniCPM-o is the clearest evidence that Family 3 can be made to run on consumer hardware, since the entire stack fits on a Mac Studio or a modern NVIDIA RTX 4090.
The family's value proposition is pragmatic. If your organization already has a strong ASR model and a strong LLM and a strong TTS model, Family 3 lets you reach full-duplex behavior without retraining the whole stack end-to-end. If you do not have those components, Family 3 is no cheaper than Family 1 or Family 2 and imports their latency and coordination problems at the same time.
Family 4: Codec-free with a thinking mechanism
The smallest of the four families, but architecturally distinct enough to deserve its own slot, is the codec-free line. To return to the kitchen image: the cook never writes a ticket, never reads an order aloud, never turns the conversation into printable text at any stage.
A second picture is the difference between sheet music and humming. The first three families all “write down” speech before doing anything with it. They turn audio into a sequence of discrete codec tokens that a transformer can read the way it reads words. Family 4 never writes it down. It keeps the audio as a continuous signal throughout, the way a person hums along to a melody without ever naming the notes.
ByteDance's SALMONN-omni is the representative release. SALMONN-omni does not use a neural audio codec at all. The model takes continuous audio embeddings from a SALMONN encoder, runs them through a transformer, and emits an internal “thinking state” that decides when to speak versus listen, all without ever converting the audio into discrete codes.
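The shape of the idea fits in a few lines of PyTorch. The sketch below is a toy, not SALMONN-omni's architecture: the module names, layer counts, and two-way head are assumptions; the only faithful part is that audio stays a continuous tensor end to end, with no discrete codec tokens anywhere.

```python
# Hedged sketch of the codec-free idea. Shapes and modules are illustrative.

import torch
import torch.nn as nn

class CodecFreeDuplex(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.think_head = nn.Linear(d_model, 2)  # logits for [listen, speak]

    def forward(self, audio_embeddings):
        # audio_embeddings: (batch, frames, d_model) continuous features,
        # straight from an audio encoder; nothing is ever tokenized.
        h = self.backbone(audio_embeddings)
        return self.think_head(h[:, -1])  # "thinking state" at the newest frame

model = CodecFreeDuplex()
frames = torch.randn(1, 25, 512)      # ~2 s of features at an assumed 12.5 Hz
print(model(frames).softmax(-1))      # e.g. tensor([[0.47, 0.53]])
```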
The reason this matters even at a minority share is that it is the clearest counter-example to “integrated STS means a neural audio codec.” Family 4 argues that speech tokenization is a design choice, not a requirement, and that the full-duplex behavior can emerge from continuous-space attention alone. The public tooling around SALMONN-omni is thinner than the other three families. Weights exist but the surrounding evaluation and benchmark plumbing is younger. Whether the family grows into a major line or stays a single-example curiosity is one of the live questions in STS research in 2026.
Closed commercial STS, alongside the families
The cleanest picture for the gap between the four open families and the closed commercial layer is a shop window. Behind the window is an open kitchen, which is the four families from the previous sections. All code visible. All architecture documented in papers. All weights downloadable with some license attached. In front of the window is the sales floor, where finished voice products are sold by the minute or by the token. You can watch a GPT-4o or Nova Sonic response come out the door. You can time it. You can compare it to a competitor. But the prep line that produced it is on the other side of the glass, and the vendor keeps the blinds drawn.
Most of the STS minutes actually used in production in 2026 are not spoken through any of the four open families above. They are spoken through closed commercial systems. GPT-4o voice, introduced by OpenAI in May 2024, reports an average latency of 320 milliseconds and a minimum of 232 milliseconds. Gemini Live has shipped on Google's apps since 2024, and the Gemini 3.1 Flash Live API on Vertex AI, released in March 2026, reports roughly 320 milliseconds first-token p50. Amazon Nova Sonic, Microsoft MAI-Voice-1, ByteDance's Doubao voice line, Hume EVI, and Cartesia Sonic all ship production voice stacks with latency, pricing, and demos published, but architectural internals kept private.
For readers trying to place these systems in the taxonomy, the honest reverse-engineering discipline is this: you can infer some things from behavior, but not family membership. A system that can barge in gracefully, backchannel without hallucination, and recover from overlap without dropping state is very likely running something architecturally closer to Family 1 or Family 2 than a plain cascade. A system whose interruption behavior feels chunky and whose overlap handling falls back to a “please wait” cue is likely a Family 3 descendant. But vendors rarely confirm this in public, and behavior varies even within a single product across load conditions and model revisions. The short version: you can see the dish, but not the recipe.
| Visible from the window | Behind the blinds |
|---|---|
| Latency numbers (p50, p95) | Which of the four families the model belongs to |
| Pricing per minute or per token | Training data provenance and licensing |
| Interruption and backchannel feel | Parameter count and serving hardware |
| Supported languages and voices | Whether synthetic or real audio dominated training |
| Demo behavior on scripted examples | Internal license terms for audio used in training |
The gap between the closed commercial layer and the open-weights layer is itself a business-model signal. Most progress on STS evaluation, on licensing clarity, and on architectural variety is happening in public, through research labs and Chinese academic groups. Most production usage is happening in private, behind a small number of commercial APIs from a small number of US hyperscalers and scaleups. Any investor thesis on voice AI in 2026 has to take a view on how that gap is going to close: whether the open side will catch the closed systems (Kyutai's Moshi, Sesame's CSM, Qwen2.5-Omni, and Covo-Audio-Chat-FD each argue yes, at different scales and under different licenses), or whether the closed layer will stay ahead on quality and data scale indefinitely.
Why the taxonomy matters, and what comes next
Three consequences follow from the four-family split, and each one directly shapes a decision that matters outside the architecture community.
The first is that latency numbers are not cross-comparable. A 200-millisecond number from Moshi (Family 1) describes the theoretical floor of a dual-stream transformer on an L4. A 200-millisecond chunk cadence from OmniFlatten (Family 2) describes the size of an interleaving block. A 200-millisecond number from Freeze-Omni (Family 3) describes the state predictor alone, while the full pipeline's real-scenario number is closer to 1.2 seconds. Reading these three as if they are the same quantity is the most common way that product evaluations and benchmark tables mislead.
The second is that training-data implications differ wildly per family. Family 1 is hungriest for clean two-channel conversational speech. Moshi trained on Fisher and undisclosed English conversational corpora; PersonaPlex added 2,250 hours of synthetic dialogue on top. Family 2 can be trained largely or entirely on synthetic interleaved data, as OmniFlatten's 2,000-hour CosyVoice corpus demonstrated at 0.5B parameters. Family 3 leans on pretrained ASR and TTS corpora plus a small fine-tuning set for the state predictor. Family 4 needs continuous-embedding conversational audio, a kind of data the field has not yet tried to collect at scale. Picking a family is, in practice, picking a data bet. The forward pointer here is to Article 04 on what public datasets actually contain, and to Article 05 on why two-channel matters (forthcoming in this series).
The third is that licensing exposure differs. Family 1 has the cleanest license stories for weights: Moshi under CC-BY 4.0, CSM-1B under Apache 2.0, PersonaPlex under the NVIDIA Open Model License. Family 2 ranges from MIT (Kimi-Audio) and CC BY 4.0 (Covo-Audio) at the permissive end to custom community licenses with commercial carve-outs (Baichuan-Omni) in the middle to paper-only with no weight release (OmniFlatten) at the far end. Families 3 and 4 are younger and less license-consistent. An enterprise buyer running a procurement review has to read at least five different license regimes to cover the open layer alone. That story will be the subject of Article 10 on consent and licensing.
Picking a family is, in practice, picking three bets at once. The cheat sheet below collapses the section into one table.
| If you are... | Likely family | Why |
|---|---|---|
| A lab chasing the shortest transformer forward pass | Family 1 | Fewest moving parts, one joint model reads all streams together |
| A startup minimizing training data cost | Family 2 | Can train on 2,000 hours of synthetic interleaved audio, as OmniFlatten showed |
| An enterprise with a strong ASR + LLM + TTS stack already | Family 3 | Adds full-duplex behavior without retraining the pipeline end-to-end |
| A research group testing whether codec tokens are necessary | Family 4 | Only family that skips the neural codec entirely |
Four families, more than twenty-five public model releases, and a steady cadence of new arXiv papers mean that picking a model is now a question of fit, not availability. What remains scarce is the training audio those models learn from, and what truly separates a demo from a production system is the data. That is where the rest of this series goes.
We index the STS / full-duplex / audio foundation-model landscape so you don't have to.
Benchmarks, models, datasets — curated and kept current. If you are evaluating STS architectures and want to know which data shape each family actually needs, the catalog is the starting point. oto also builds two-channel STS datasets — get in touch or access the investor data room.