---
title: "Speech-to-speech AI, a primer"
description: "What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions."
article_number: "01"
slug: sts-primer
published_at: 2026-04-02
reading_minutes: 11
tags: ["STS", "primer", "foundation-models"]
canonical_url: https://fullduplex.ai/blog/sts-primer
markdown_url: https://fullduplex.ai/blog/sts-primer/md
series: "The STS Series"
series_position: 1
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Speech-to-Speech AI: A Primer

*What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions.*

---

## 1. The telephone moment

In a natural conversation between two humans, the gap between one person finishing and the other starting averages about 200 milliseconds. That is roughly a quarter of a blink. It is also one of the most stable numbers in human behavior, measured the same way across [ten very different languages from Japanese to Yélî Dnye](https://www.pnas.org/doi/10.1073/pnas.0903616106). Until 2024, voice assistants needed about a full second to do the same thing. That difference is the difference between a conversation and a transaction, and it is why the voice AI demos of the last eighteen months feel qualitatively different from anything before.

Here is a simple way to hold the shift in your head. Old voice assistants worked like a walkie-talkie. One side presses the button, speaks a complete thought, releases, and waits. The other side does the same. Interruptions break it. Overlaps break it. Listening and speaking are separate modes and only one happens at a time. The new systems work like a telephone. Two people, two open channels, both able to listen and speak at once, able to interrupt and be interrupted, able to murmur *mhm* while the other person is still talking.

This is a primer on what changed, what the new models are actually doing, and why the terms you are about to see (speech-to-speech, full-duplex, audio foundation model) are worth distinguishing carefully.

<div class="diagram" id="d1"><!-- latency ruler --></div>

## 2. Four words you will see everywhere

Four phrases do most of the work in this field, and they overlap in ways that quietly trip people up.

A **speech-to-speech (STS) model** is a model that takes audio in and emits audio out, without converting to text as an intermediate step. Audio is the input, audio is the output, and the model does its thinking in a representation that lives closer to sound than to written language.

**Full-duplex** describes how the conversation flows. A full-duplex system can listen and speak at the same time, the way a telephone can. A half-duplex system has to finish one before starting the other, the way a walkie-talkie does. Full-duplex is a property of the interaction pattern, not of the model architecture, though certain architectures make it much easier.

An **audio foundation model** is a big pretrained model that understands and generates audio. *Foundation* is a borrowed word from the text world, where it means the model was pretrained on a very large, broad corpus and can be adapted to many tasks. An audio foundation model does the same thing but with waveforms as its native material.

A **speech language model** (or SpeechLM) is a large model that treats speech the way GPT treats text: as a sequence of discrete tokens, predicted one after another. SpeechLMs are usually built on top of a neural audio codec that converts waveforms into tokens, which we will come to in a moment.

These terms overlap but are not interchangeable. Moshi, the open-source system Kyutai released in late 2024, is all four at once: a speech-to-speech model, full-duplex, a foundation model for audio, and a speech language model. VALL-E, an earlier Microsoft system, is a SpeechLM but only for text-to-speech, not STS. A traditional cascade of ASR plus LLM plus TTS is speech-in and speech-out at the system level, but there is no STS model at its core, and it is usually half-duplex in practice.

Before looking at the full landscape, it helps to separate three different kinds of voice AI product that sometimes get lumped together. A *speech-to-text* service turns audio into a transcript. A *text-to-speech* service turns text into audio. A *conversational AI* system does both, in a loop, and has to decide what to say. The first two are components. The third is the system you actually talk to.

<div class="diagram" id="d2"><!-- three modes of voice AI (STT / TTS / conversational) with vendors --></div>

That distinction matters because the same brand can appear in more than one layer. ElevenLabs sells a TTS service, an STT service, and a conversational AI product built on its own components. VAPI and Retell do not train speech models at all. They orchestrate Deepgram plus an LLM plus ElevenLabs into a voice agent. Moshi and OpenAI's Realtime API sit in a different place on the map. They are the model itself, not a pipeline of third-party components.

<div class="diagram" id="d3"><!-- 2x2 landscape matrix: integrated vs cascade × half vs full duplex, populated with companies --></div>

## 3. How audio becomes tokens *(optional)*

A language model works on discrete tokens, not on raw audio. Before any of the approaches above can work on speech, there has to be a way to turn a waveform into a sequence of discrete units and back again, without losing too much of what made the audio sound human.

That job falls to a **neural audio codec**. Think of it as MP3 encoding with one extra trick. Like MP3, it compresses a waveform into a much smaller representation. Unlike MP3, the compressed representation is a sequence of integers that a language model can read and write directly.

The number that matters most is the *frame rate*. Kyutai's Mimi codec, released with Moshi, emits tokens at 12.5 Hz, which is close to the rate at which word-like text tokens arrive in normal speech. That alignment is what lets audio sit side by side with text inside one model without overwhelming it. If that single detail is all you take from this section, you have the point.

<div class="diagram" id="d4"><!-- codec flow: waveform → encoder → frame rate → decoder → waveform --></div>

*Aside for technically curious readers: the trick inside most modern codecs is called **residual vector quantization** (RVQ): a stack of small dictionaries in which each layer encodes what the previous layer missed. Five layers with a vocabulary of 320 each can describe more acoustic variation than a single flat vocabulary of a billion entries. If that is interesting, the SoundStream and Moshi papers walk through it. If not, skip ahead.*
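
For those who stayed, a minimal numpy sketch of that residual loop, using the aside's toy dimensions (five levels of 320 codes) rather than any real codec's configuration. The codebooks here are random, so only the mechanism, not the reconstruction quality, is meaningful:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook quantizes
    whatever the previous one missed. Toy sketch, not Mimi."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                    # cb has shape (vocab, dim)
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        i = int(np.argmin(dists))           # nearest code to the residual
        indices.append(i)
        residual = residual - cb[i]         # pass the error to the next layer
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is just the sum of the chosen codes.
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
dim = 16
codebooks = [rng.normal(size=(320, dim)) for _ in range(5)]  # 5 layers x 320 codes
x = rng.normal(size=dim)

codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print("codes:", codes)
print("reconstruction error:", round(float(np.linalg.norm(x - x_hat)), 3))
```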

## 4. How we got here, in four years

The new wave of voice AI did not fall out of the sky in late 2024. It is the visible end of a research arc that began quietly around 2021 and gathered pace each year since.

In early 2021, a team at Meta published a paper called [**Generative Spoken Language Modeling**, or GSLM](https://arxiv.org/abs/2102.01192). It showed something that, at the time, felt almost heretical: you could train a language model on raw speech with no text at all, by clustering speech features into pseudo-words and then modeling the sequence of those units. The speech did not have to pass through writing to be learnable.

Later that year, Google released [**SoundStream**](https://arxiv.org/abs/2107.03312), the neural audio codec that delivered the RVQ trick described above. Together, GSLM and SoundStream were the grammar and the alphabet for a future speech language model.

In 2022, Google combined the two with its [**AudioLM**](https://arxiv.org/abs/2209.03143) system, which introduced a hierarchy of semantic tokens and acoustic tokens. Semantic tokens carried the content, acoustic tokens carried the voice. AudioLM could continue a short audio clip in the speaker's own voice for many seconds, with linguistic coherence and acoustic realism people had not quite seen before.

Also in 2022, Meta's follow-up [**dGSLM**](https://arxiv.org/abs/2203.16502) extended GSLM from monologue to two-speaker dialogue, trained on the Fisher corpus, and produced the first textless model with natural turn-taking behavior, including overlaps and backchannels. The pieces for a conversational speech model were on the table.

In 2023, two systems generalized the approach in different directions. Microsoft's [**VALL-E**](https://arxiv.org/abs/2301.02111) used the codec-plus-language-model recipe for high-quality text-to-speech, cloning a voice from a three-second sample. Fudan's [**SpeechGPT**](https://arxiv.org/abs/2305.11000) plugged speech tokens into a text LLM's vocabulary and produced one of the first models that could take a spoken instruction and answer in speech, end to end.

Then, in September 2024, Kyutai released [**Moshi**](https://arxiv.org/abs/2410.00037). Open weights under a CC-BY 4.0 license, code under Apache 2.0, running on a single GPU. The first real-time, full-duplex, speech-text foundation model available to anyone who wanted to study it. That is the moment the research arc met the demo stage, and it is why the second half of 2024 felt different from the first half.

A parallel thread, worth knowing so you do not mistake this for the only story, runs through Google's **Translatotron** ([2019](https://arxiv.org/abs/1904.06037) and [2021](https://arxiv.org/abs/2107.08661)), which did direct speech-to-speech translation without text. It sat outside the LLM lineage but proved the broader point that text is not a mandatory intermediate step for voice.

<div class="diagram" id="d5"><!-- timeline 2021-2024 with 6 papers + Moshi highlighted --></div>

## 5. What the new architecture actually does

Moshi is the clearest public example of how these models are put together. Understanding its shape helps make the whole category concrete.

Moshi models two audio streams at once. One is the user's channel, the audio coming in. The other is the model's own channel, the audio going out. Both streams are represented in the same kind of Mimi tokens, and both are predicted by the same network. That is what gives the model the structural ability to listen and speak simultaneously. There is no push-to-talk state, no moment when the model stops hearing in order to speak.

Alongside the two audio streams, Moshi maintains a third stream: a time-aligned text transcript of what the model itself is saying. At each 80-millisecond frame, the model first predicts a text token, then predicts the audio tokens for that frame. The text token acts as a kind of inner monologue, a semantic handle that lets the model reason linguistically while generating audio. This technique, which Kyutai calls **Inner Monologue**, is the training detail that keeps the spoken output coherent over long turns.

It is worth pausing on what this buys you. Earlier speech language models, including SpeechGPT, followed a pattern where the model first produces a complete text response, then synthesizes audio for that text. Real-time conversation is almost impossible in that arrangement, because the audio cannot begin until the text is done. Moshi's frame-by-frame interleaving means text and audio are generated together. Each word the model says earns its place in the same forward pass.
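
A schematic way to see the difference between the two generation orders. Everything below is a placeholder to show the control flow, not Moshi's or SpeechGPT's real API:

```python
# Schematic contrast of the two generation orders. All names are
# placeholders, not Moshi's or SpeechGPT's actual interfaces.

class DummyModel:
    """Stub standing in for a speech language model."""
    def generate_full_text_response(self, user_audio):
        return "the complete answer, all of it, before any sound"
    def synthesize_audio(self, text):
        return [f"audio({w})" for w in text.split()]
    def observe(self, user_frame):
        pass                                  # user stream keeps flowing in
    def predict_text_token(self):
        return "word"                         # the inner-monologue token
    def predict_audio_tokens(self, text_token):
        return [0] * 8                        # e.g. 8 RVQ tokens per frame
    def decode_frame(self, audio_tokens):
        return "80 ms of audio"

def chained_turn(user_audio, model):
    # Chain-of-modality (SpeechGPT-style): ALL text first, then audio.
    # Nothing can be played until the full text response exists.
    text = model.generate_full_text_response(user_audio)
    return model.synthesize_audio(text)

def interleaved_turn(incoming_frames, model):
    # Moshi-style: per 80 ms frame, predict one text token, then that
    # frame's audio tokens, while still consuming the user's channel.
    # Playable audio exists from the very first frame.
    for user_frame in incoming_frames:
        model.observe(user_frame)
        text_token = model.predict_text_token()
        audio_tokens = model.predict_audio_tokens(text_token)
        yield model.decode_frame(audio_tokens)

model = DummyModel()
print(chained_turn("user audio", model)[:2])
print(next(interleaved_turn(iter(["frame 0"]), model)))
```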

That is why every part of the label *real-time, full-duplex, speech-text foundation model* is literal. Real-time because generation is frame by frame. Full-duplex because two audio channels are modeled at once. Speech-text because text and audio are co-generated, not stage-separated. Foundation model because the whole thing is pretrained at scale on conversational data, then aligned for dialogue.

<div class="diagram" id="d6"><!-- frame interleaving: Moshi per-frame vs SpeechGPT chain-of-modality --></div>

## 6. What the cascade cannot do

The old pipeline, ASR to text LLM to TTS, still works, and in many narrow domains it works very well. [OpenAI's own developer documentation](https://platform.openai.com/docs/guides/voice-agents) frames voice as two valid tracks: chained pipelines, which remain reliable and easier to debug, and speech-to-speech models, which aim for lower latency and more natural conversation. The argument here is narrower than a dismissal of the cascade. It is that on two specific dimensions, the cascade is structurally disadvantaged by design.

The first is **paralinguistic loss**. Speech carries two kinds of information. There are the words themselves, which a transcript captures, and there is everything about how the words were said, which a transcript throws away. Pitch, prosody, emotion, timbre, rate, breath. When an ASR model converts speech to text, it throws away this second channel entirely. A text LLM reasoning on the transcript cannot recover information that was never passed to it. The TTS that speaks the answer has to invent prosody from scratch, based only on the words, with no clue about the user's mood, urgency, or register. Sarcasm becomes sincerity. A panicked question comes back at conversational pace. A sardonic *sure* gets a chipper *absolutely*. OpenAI made the same observation in its [Realtime API release notes](https://openai.com/index/introducing-gpt-realtime/), where the team acknowledged that traditional stitched pipelines tend to lose emotion, emphasis, and accents. That admission from a player whose first voice product was itself a cascade is a useful primary-source signal that the loss is a property of the architecture, not of any one implementation.

The second is **error propagation**. Each stage of the pipeline is independently trained on its own task, and none of them sees the full audio. An ASR mistake on a homophone (*knight* for *night*, *ate* for *eight*) changes the meaning the LLM reasons about, and the error cannot be corrected downstream because the downstream stages never saw the original waveform. Accented speech, which many ASR models still handle unevenly, compounds the same problem. The TTS can pronounce the wrong answer with perfect clarity, which is actually worse than a garbled one, because it sounds confident.

It is worth being honest about what this does not mean. Cascades are not dead. For high-accuracy, highly constrained domains, a cascade with domain-tuned ASR still often wins on task accuracy, and modular pipelines remain easier to debug. A recent line of work, including the [**X-Talk**](https://arxiv.org/abs/2512.18706) survey on modular systems, argues that well-engineered modular designs with paralinguistic side-channels can close much of the gap. The claim here is narrower. Cascades hit a structural ceiling on the naturalness of conversation. Latency below 300 milliseconds and faithful paralinguistic preservation are not problems the cascade architecture is shaped to solve.
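
To see why the latency floor sits where it does, add up the stages. A toy budget, with illustrative numbers rather than measurements of any particular vendor:

```python
# Toy latency budget for a chained voice pipeline. Every number
# here is an assumption for the sake of arithmetic, not a
# benchmark of any real system.
cascade_ms = {
    "endpointing (wait to confirm the user stopped)": 300,
    "ASR final transcript": 150,
    "LLM time-to-first-token": 350,
    "TTS time-to-first-audio": 200,
}

total = sum(cascade_ms.values())
print(f"{total} ms before the caller hears anything")  # ~1000 ms

# The stages run in sequence, so the floor is their SUM. An
# integrated STS model is one stage generating audio frame by
# frame, so its floor is a single model's per-frame latency.
```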

<div class="diagram" id="d7"><!-- failure mode 1: paralinguistic loss. Sarcastic "sure" → ASR drops tone → LLM reads flat → TTS invents cheerful prosody → mismatch --></div>

<div class="diagram" id="d8"><!-- failure mode 2: error propagation. "update my address" → ASR mishears "update my dress" → LLM replies "Sure. Describe the dress?" → each stage saw only the previous stage's output, not the original waveform --></div>

## 7. What STS actually solves

Pulling the threads together, an integrated speech-to-speech model is structurally better positioned on three capabilities where the cascade is disadvantaged by design.

First, it brings latency closer to the conversational threshold. Around 200 milliseconds instead of around 1,000 in the best reported measurements. That is the difference between an exchange and a conversation, and it is now being reported from real systems rather than only from research papers.

Second, it preserves paralinguistic signal through the pipeline. The prosody, emotion, rate, and affect of what the user said are carried through rather than discarded and reinvented. That is why the best demos from this generation sound like they are responding to *how* you spoke, not just *what* you said.

Third, it supports natural turn-taking. Because the architecture models two audio channels at once, overlaps, interruptions, and backchannels behave the way they do in human conversation. Duplex is no longer a product feature bolted on top. It is built into the model.

<div class="diagram" id="d9"><!-- three capability cards: latency under 300ms, tone preserved, natural turn-taking --></div>

The reason latency, tone, and turn-taking matter beyond demos is the size of the categories voice is the natural interface for. Start with headcount and call volume rather than TAM. The [US Bureau of Labor Statistics](https://www.bls.gov/ooh/office-and-administrative-support/customer-service-representatives.htm) counts roughly 2.8 million customer service representatives in 2024, at a median wage near $20 an hour, with the category projected to shrink through 2034. In the UK, [NHS 111 logged 1.68 million calls](https://www.england.nhs.uk/statistics/statistical-work-areas/iuc-ccas/) in a single month in 2025. The [World Health Organization estimates that at least 2.2 billion people](https://www.who.int/publications/i/item/9789241516570) have near or distance vision impairment, a population for which voice is not a convenience but the primary interface. [UNESCO](https://www.unesco.org/gem-report/en/inclusion) has repeatedly flagged that hundreds of millions of learners are taught in a language they do not speak at home. These are not TAM slides. They are problem sizes for which a conversational interface that preserves tone, handles interruption, and runs under a human turn-taking threshold is a credible lever.

Market research puts the same pressure on the supply side. [Gartner projects that conversational AI in contact centers will save roughly $80B](https://www.gartner.com/en/newsroom/press-releases/2022-08-31-gartner-predicts-conversational-ai-will-reduce-contact-centre-labour-costs-by-80-billion-in-2026) in agent labor cost by 2026. [Grand View Research sizes the AI voice agents market at $2.54B in 2025](https://www.grandviewresearch.com/industry-analysis/ai-voice-agents-market-report) with a projected 39% CAGR through 2033. ElevenLabs alone [reported more than $330M in ARR at the end of 2025, then raised $500M in February 2026 at an $11B valuation](https://www.sequoiacap.com/article/partnering-with-elevenlabs-series-d/), roughly three times the valuation it carried a year earlier. Estimates from different research firms vary by a factor of two or three depending on scope, so treat each individual figure as directional. The direction, though, is not ambiguous.

The market becomes easier to read once you split it into three layers, each with a distinct revenue model and a distinct KPI.

<div class="diagram" id="m1"><!-- three-layer stack: platform (OpenAI Realtime, Gemini Live, Nova Sonic, Moshi, Sesame) / enterprise (ElevenLabs, Decagon, Deepgram, VAPI, Retell) / consumer (Character.AI, Replika, companion apps) --></div>

Underneath those layers, the capital market has already repriced voice as a standalone primitive. One cluster of rounds inside a single 14-month window, from January 2025 through February 2026, sets a rough floor for how investors read the category.

<div class="diagram" id="m2"><!-- funding timeline scatter: ElevenLabs Series C $180M Jan 2025 at $3.3B, Sesame Series B $250M Oct 2025 at ~$1B, Gradium $70M seed Dec 2025, Deepgram $130M Series C early 2026, Decagon $250M Series D, ElevenLabs $500M Feb 2026 at $11B --></div>

Most of these rounds fund companies selling into existing markets: contact centers, customer service, outbound sales, clinical intake, enterprise note-taking. These are categories where voice is already the interface and the KPI is task effectiveness. How many calls are handled, how many minutes are saved, how many issues are resolved on first contact. The math on these markets is well understood, and the capital on the chart above is largely betting on winning them. The one exception is [Sesame's $250M Series B](https://techcrunch.com/2025/10/21/sesame-the-conversational-ai-startup-from-oculus-founders-raises-250m-and-launches-beta/), led by Sequoia and Spark in October 2025, which took the company above a $1B valuation on the strength of a voice-companion product (Maya and Miles) and a smart-glasses roadmap, not a contact-center pitch. That round is the first billion-dollar price tag inside this window that is pointed at the second market below rather than the first.

What is more interesting, if less easily sized today, is the second market STS quietly opens up. Because the model can both read and write paralinguistic signal (pitch, prosody, rate, the shape of a breath), the interface becomes capable of carrying feelings, not just instructions. Text never could. The early consumer signals are already visible: companion apps crossed $120M in mobile revenue in 2025, and 48% of users report using their AI companion for mental-health support. These are small numbers today, and some of the usage patterns are known to be fragile. The novel part is the shape: a category where the KPI is presence rather than task completion.

<div class="diagram" id="m3"><!-- two-column split: existing markets (task-effectiveness KPI) vs emergent market (emotion KPI). Left: Call Center AI $2.4B 2025, Gartner $80B savings, healthcare, enterprise copilots. Right: AI Companion apps $120M 2025, Character.AI 20M MAU, Replika 2M MAU, Mental Health AI $1.71B, 48% mental-health use, 72% of US teens have tried AI companions. Banner: voice is the first interface that can read and write feelings. --></div>

> **oto perspective**
>
> The obvious wins for STS are in the left column. Customer support, clinical intake, enterprise voice copilots. These are categories where the KPI is task effectiveness and voice is already the interface. Capital and product teams are correctly racing into them.
>
> What we find structurally more interesting is the right column. Because STS can read and write paralinguistic signal (pitch, prosody, rate, the shape of a breath), it is the first interface a computer has ever had that can carry feelings. Picture the clock on your bedside table with STS inside it. At the start and end of each day, instead of opening TikTok, you spend ten minutes talking to a companion that actually understands how you sounded yesterday and the day before, and that helps you journal. The product KPI is how much better you feel, not tasks completed. The societal KPI, if this ever works at scale, is measured in suicide rates, mental-health incidence, and daily stress. Those categories are enormous, and they do not exist as products today for one reason: text cannot carry the signal. STS might.
>
> That is the market we think is worth building toward, and it is the reason we care about the quality of the data going in. A model that hallucinates a task is annoying. A model that hallucinates an emotion is something else.

The three capabilities above land differently across these two markets. Sub-conversational latency matters most where interruption and back-and-forth are constant. Paralinguistic preservation matters most where tone carries information the words do not. Full-duplex turn-taking matters most where the interaction is long and unstructured. A model that clears all three is a candidate default interface for most of the categories above, which makes the next question a question about inputs.

None of this means the category is finished. STS models still hallucinate, and when they do, there is no intermediate text transcript to point to, so debugging is harder. Specialized ASR and TTS still beat foundation models in narrow, high-accuracy domains. On evaluation, 2025 was the year the first STS-native benchmarks appeared: [Full-Duplex-Bench](https://arxiv.org/abs/2503.04721) (arXiv 2503.04721, March 2025) focuses on turn-taking and interruption behavior, and [URO-Bench](https://arxiv.org/abs/2502.17810) (arXiv 2502.17810, February 2025, EMNLP 2025) is the first S2S benchmark to score paralinguistic understanding and response. The stack is still fragmented, with no single dominant end-to-end standard for *is this a good STS agent*. Those are the threads later articles in this series pick up.

One final observation about where the bottleneck now sits. With [gpt-realtime generally available](https://openai.com/index/introducing-gpt-realtime/), [Gemini Live on Vertex](https://cloud.google.com/vertex-ai/generative-ai/docs/live-api), and open-weight models like [Moshi](https://github.com/kyutai-labs/moshi) and [Sesame CSM](https://github.com/SesameAILabs/csm) downloadable, the architecture side of STS is rapidly becoming a commodity. What separates a demo from a product that works across accents, emotional registers, and full conversational turns is not the model graph anymore. It is the data the model was trained on. Which leads to the next article.

## 8. What comes next: data

Full-duplex models have to learn from conversations that actually look like conversations. Two channels, one per speaker. Overlap left intact. Paralinguistic signals preserved. Not read speech, not scripted dialog, not bulked-up monologue transcripts.
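
What "conversations that look like conversations" means is checkable. Here is a minimal sketch for auditing a candidate file, assuming one speaker per channel of a stereo WAV; the file name and energy threshold are placeholders, and a real pipeline would use a proper VAD rather than a crude RMS gate:

```python
import numpy as np
import soundfile as sf

# Minimal audit of a two-channel conversation (one speaker per
# channel). "conversation_stereo.wav" and the 0.02 threshold are
# placeholders; substitute a real VAD in production.
audio, sr = sf.read("conversation_stereo.wav")   # shape: (samples, 2)

frame = int(0.030 * sr)                          # 30 ms frames
n = (len(audio) // frame) * frame
frames = audio[:n].reshape(-1, frame, 2)         # (n_frames, frame, 2)

rms = np.sqrt((frames ** 2).mean(axis=1))        # per-frame energy, per channel
active = rms > 0.02                              # crude speech/silence gate

overlap = (active[:, 0] & active[:, 1]).mean()   # both speakers at once
speech = (active[:, 0] | active[:, 1]).mean()    # anyone speaking
print(f"overlap in {overlap / max(speech, 1e-9):.1%} of speech frames")
```

Natural dialogue shows substantial overlap; read or scripted speech shows almost none, which makes this a quick way to tell the two apart before any training run.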

What is scarce is not speech data in general, but clean speaker-wise full-duplex conversational audio at scale. Most in-the-wild dialogue still exists as monaural mixtures, not separate channels, so overlap has to be reconstructed rather than observed. That conversational speech itself can scale is no longer in doubt. [J-CHAT](https://arxiv.org/abs/2407.15828), published in 2024, is a 76,000-hour Japanese dialogue speech corpus assembled from the public web. Recent work on full-duplex specifically, such as [InteractSpeech](https://aclanthology.org/2025.findings-emnlp.424/) (2025) and [DialogueSidon](https://arxiv.org/abs/2604.09344) (2026), is still measured in the low hundreds of hours, and the open ceiling for clean two-channel conversation remains [Fisher](https://catalog.ldc.upenn.edu/LDC2004S13), a 1,960-hour corpus collected by LDC in 2004. Moshi trained on it. Nearly every serious full-duplex effort does. Frontier models are already operating at scales where 2,000 hours of two-channel dialogue is a starting point, not a ceiling. The gap between what the next generation of STS models needs and what is actually available, licensed, and channel-separated, is the practical bottleneck the rest of this series looks at.

That is where we go next. What is in the public datasets, what is missing from them, what a full-duplex training set actually has to contain, and what it takes to build one at the scale the models now demand.

---

*oto builds large-scale two-channel full-duplex conversational speech datasets for next-generation speech-to-speech models. If you are training an STS model and running into the data ceiling described above, get in touch.*

---

_Originally published at [https://fullduplex.ai/blog/sts-primer](https://fullduplex.ai/blog/sts-primer)._
_Part of **The STS Series** · 01 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
