Fullduplex/blog
§B · evaluations

Benchmarks.

How the field measures voice models — not just ASR and TTS, but conversation, turn-taking, emotion, instruction-following, and task completion. Scope is deliberately a little wider than strict STS: this page tracks 42 speech-interaction and adjacent benchmarks. Every entry carries a tier (native / component / adjacent / legacy) and a setting (lab / arena / live / vertical), and links are split into site, paper, code, and leaderboard slots, so each link is exactly what it claims to be. Found a stale score or a missing entry? report it to the community.
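
If you want to script against the directory, each entry implies roughly the shape below. A minimal sketch: the type and field names are ours for illustration, not the observatory's published schema.

```ts
// Rough shape of a directory entry, for scripting against this page.
// Field names are illustrative, not the observatory's actual schema.
type Tier = "native" | "component" | "adjacent" | "legacy";
type Setting = "lab" | "arena" | "live" | "vertical";

interface BenchmarkEntry {
  name: string;
  tier: Tier;
  setting: Setting;
  // One slot per link kind, so a "leaderboard" link is never a paper in disguise.
  links: Partial<{ site: string; paper: string; code: string; leaderboard: string }>;
  verified?: string; // e.g. "2026-04", set when a score is re-checked
}

// Hypothetical entry, shape only:
const entry: BenchmarkEntry = {
  name: "Example-Bench",
  tier: "native",
  setting: "lab",
  links: { site: "https://example.com" },
  verified: "2026-04",
};
```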

25 native (core STS) · 10 component (ASR · TTS · SLU) · 04 adjacent (text-only · TTS-only) · 03 legacy
observatory refresh · 2026-04 · verified this month · 23 / 42 · awaiting re-check · 19
§B0 · coverage × capability

What each benchmark actually measures.

read the full map →

30 speech-interaction benchmarks on the rows (the core STS subset of the 42-entry directory below), grouped by capability family, against 15 capability axes that today's benchmarks actually score. Toggle the +5 unexplored axes button above the grid to expose five structurally uncovered columns (code-switch, long-form memory, emotion regulation, on-device, audio adversarial) — axes that exist in text-LLM or ASR/TTS evaluation but have no public STS benchmark as of April 2026. Each row carries two runs-on chips — cas for cascade stacks and fd for full-duplex STS models — plus year / setting / license metadata. Click a group header to fold its rows, or a column header to rank benchmarks by that axis. The cards below this grid add 12 component / adjacent / legacy entries that are out of the grid's scope but still part of the measurement landscape (TTS arenas, ASR WER, S2ST, text-only verticals, and the historical baselines).

Click a column header to sort · click a row to see benchmark detail.
direct score · indirect / subjective · no coverage · cascade-runnable · FD-STS-runnable · partial · unexplored axis
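
For the curious, here is roughly the row shape and the column-sort behaviour the grid implies, as a minimal TypeScript sketch. The names (GridRow, sortByAxis, the rank table) are our assumptions, not this page's source code.

```ts
// Illustrative sketch of one grid row and column ranking.
type Coverage = "direct" | "indirect" | "partial" | "none" | "unexplored";

interface GridRow {
  benchmark: string;
  group: string;                         // capability family, e.g. "conversation timing"
  year: number;
  setting: "lab" | "arena" | "live" | "vertical";
  license: string;
  runsOn: { cas: boolean; fd: boolean }; // cascade stacks / full-duplex STS
  cells: Record<string, Coverage>;       // one cell per capability axis
}

// Ranking a column: direct sorts above indirect, indirect above partial, etc.
const rank: Record<Coverage, number> = {
  direct: 0,
  indirect: 1,
  partial: 2,
  none: 3,
  unexplored: 4,
};

function sortByAxis(rows: GridRow[], axis: string): GridRow[] {
  return [...rows].sort(
    (a, b) => rank[a.cells[axis] ?? "none"] - rank[b.cells[axis] ?? "none"],
  );
}
```
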
§01 · conversation timing

Full-duplex & interactive

Benchmarks that score turn-taking, back-channeling, overlap, and interruption — the axes that separate real voice agents from LLM-plus-TTS.

14 entries
§02 · lab · native

Speech-LM & Audio Foundation Models

Lab-style offline evals for speech-LMs and audio foundation models — knowledge, reasoning, safety, instruction following, multilingual dialogue, paralinguistic awareness.

8 entries
§03 · human votes

Arena & preference

Blind A/B human-preference arenas. The only evaluators that reliably catch 'this just sounds off' — and the only ones that scale across languages and accents.

4 entries
§04 · domain KPIs

Vertical & task

Task-completion benchmarks in customer service, outbound calling, and healthcare. Where the rubber meets the phone line — and where text-first agents collapse.

3 entries
§05 · S2ST · TTS · paralinguistic

Components

Component benchmarks that aren't speech-LM native but score the parts every STS system is built from — translation, synthesis, non-verbal delivery.

10 entries
§06 · text-only · TTS-only · duplex ASR

Adjacent & transferable

Benchmarks that aren't STS in the strict sense — TTS-only arenas, text-only clinical rubrics, duplex ASR for verticals — but that any serious voice system will ultimately need to pass on top of its speech stack. Kept separate from Components because they don't live inside the STS pipeline itself.

0 entries

Nothing here yet.

§07 · classic

Legacy & context

Saturated or representation-era benchmarks. Rarely used to rank modern speech-LMs, but still the scaffolding every new speech encoder reports against.

3 entries
§N · observatory notes

Three gaps we're watching.

Based on everything catalogued above, the measurement landscape has three structural holes. They are where new benchmarks — and new companies — are likely to emerge in 2026.

  1. gap · 01

    Full-duplex × vertical × real audio.

    τ³-Bench shows voice-reasoning scores collapsing by ~50 points once real-time interaction is introduced, but its acoustic layer is simulated. No public benchmark ties real recordings, full-duplex timing, and a vertical task (customer service, sales, health) together.
  2. gap · 02

    Non-English, non-Chinese full-duplex.

    URO-Bench ships EN + ZH; no turn-taking leaderboard exists for most other high-speaker-count languages. Whoever builds one at scale first will write the standard for that language.
  3. gap · 03

    Latency as a first-class axis.

    First-response-ms numbers are vendor-reported and unreplicated across Moshi / GPT-4o Voice / Gemini Live / Sesame. Full-Duplex-Bench v2 makes this measurable; we expect a dedicated latency leaderboard to land within the year. A sketch of the core measurement follows this list.
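
To make the replication point in gap 03 concrete: a vendor-neutral probe only has to compute one number, the time from end-of-user-speech to the model's first audible output. The sketch below does that under illustrative assumptions (20 ms frames, a fixed energy floor); it is not Full-Duplex-Bench's actual harness.

```ts
// Minimal sketch of a replicable first-response probe.
const FRAME_MS = 20;       // analysis frame length (assumption)
const ENERGY_FLOOR = 1e-4; // mean squared amplitude counted as "audible" (assumption)

function frameEnergy(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return sum / frame.length;
}

// `modelAudio` is the model's output channel, recorded from the moment the
// user's utterance ends (t = 0). Returns latency in ms, or null if the model
// never produces audible audio.
function firstResponseMs(modelAudio: Float32Array, sampleRate: number): number | null {
  const frameLen = Math.round((sampleRate * FRAME_MS) / 1000);
  for (let start = 0; start + frameLen <= modelAudio.length; start += frameLen) {
    if (frameEnergy(modelAudio.subarray(start, start + frameLen)) > ENERGY_FLOOR) {
      return (start / sampleRate) * 1000;
    }
  }
  return null;
}
```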

Fullduplex's editorial bias: any new benchmark that closes one of these three gaps gets flagged on the blog and added here within the week.

New benchmark or revised score? submit an entry