Fullduplex
Series · 01 / 10 · Where speech-to-speech came from, what the jargon means, and why audio is finally a first-class language rather than a pipeline of text conversions.

STS · 01 / 10 · #primer · 11 min read

Speech-to-speech AI, a primer.

What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions.


The STS Series

A weekly dispatch mapping speech-to-speech, full-duplex, and audio foundation models. Ten articles, each with an honest status label.

09 published · 01 coming · 10 dispatches
  1. 01 · Speech-to-speech AI, a primer (published)
     What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions.
  2. 02 · The full-duplex threshold (published)
     A number, a biology fact, and a small cluster of systems. What the full-duplex threshold actually is, what it takes to cross it, and what conversations above it unlock.
  3. 03 · From pipeline to integrated (published)
     “Integrated” sounds like one architecture. It is at least four. A field guide to the 2026 full-duplex STS landscape — four families under one label, their latency math, their data bets, and their license exposure.
  4. 04 · The data ceiling (published)
     Full-duplex conversational recordings at internet scale do not exist. The two escape hatches engineers reach for first — better separation AI and bigger YouTube scrapes — do not escape. Full-duplex STS still leans on a 2004 telephone corpus for its post-training recipe.
  5. 05 · Foundation before vertical (published)
     Full-duplex STS sits between the GPT-2 and GPT-3 moments. Asking “which vertical wins first?” in 2026 is a category error — the constraint is whether the foundation the verticals will sit on exists yet. A thesis essay on the foundation threshold, the 30×–150× data gap, and six plausible routes to 100,000+ hours of two-channel dialogue.
  6. 06 · Mapping the benchmark landscape (published)
     Too many speech-to-speech benchmarks, each covering a different slice. The map, as of April 2026 — arena versus fixed test set, four capability axes, a coverage heatmap, and a Japanese gap.
  7. 07 · Why STS needs new benchmarks (published)
     The STS field inherited evaluation machinery from the ASR, TTS, and text-LLM paradigms. None of them measured a live, two-channel, socially timed conversation. The argument for a rebuild, plus a concrete picture of who could run it.
  8. 08 · The STS model landscape (published)
     Thirty-plus speech-to-speech models, four architectural families, and a licensing pattern that is starting to split inside each lab. A field guide to the April 2026 map, legible enough to place newly announced models in one or two paragraphs.
  9. 09 · Consent, licensing & the opt-in economy (published)
     The consent and licensing stack for conversational voice data in April 2026 is three layers deep: a fixed biometric-privacy floor, a seven-platform patchwork middle, and a transparency ceiling partially in force and partially in draft. An opt-in voice-data economy requires all three to survive together.
  10. 10 · What comes after STS (Q3 2026 · coming soon)
      The open questions the first nine dispatches leave behind — multimodal, on-device, and the next evaluation moat.

Explore Fullduplex

tracked catalogs, updated as the field moves