Fullduplex/blog
§B · evaluations

Benchmarks.

How the field measures voice models — not just ASR and TTS, but conversation, turn-taking, emotion, instruction-following, and task completion. Scope is deliberately a little wider than strict STS: this page tracks 42 speech-interaction and adjacent benchmarks. Every entry carries a tier (native / component / adjacent / legacy) and a setting (lab / arena / live / vertical), and links are split into site, paper, code, and leaderboard slots, so each link is exactly what it claims to be. Found a stale score or a missing entry? report it to the community.
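
If you want to script against the directory, each entry implies roughly the shape below. A minimal sketch: the type and field names are ours for illustration, not the observatory's published schema.

```ts
// Rough shape of a directory entry, for scripting against this page.
// Field names are illustrative, not the observatory's actual schema.
type Tier = "native" | "component" | "adjacent" | "legacy";
type Setting = "lab" | "arena" | "live" | "vertical";

interface BenchmarkEntry {
  name: string;
  tier: Tier;
  setting: Setting;
  // One slot per link kind, so a "leaderboard" link is never a paper in disguise.
  links: Partial<{ site: string; paper: string; code: string; leaderboard: string }>;
  verified?: string; // e.g. "2026-04", set when a score is re-checked
}

// Hypothetical entry, shape only:
const entry: BenchmarkEntry = {
  name: "Example-Bench",
  tier: "native",
  setting: "lab",
  links: { site: "https://example.com" },
  verified: "2026-04",
};
```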

25 native (core STS) · 10 component (ASR · TTS · SLU) · 04 adjacent (text-only · TTS-only) · 03 legacy
observatory refresh · 2026-04 · verified this month · 23 / 42 · awaiting re-check · 19
§B0 · coverage × capability

What each benchmark actually measures.

read the full map →

30 speech-interaction benchmarks on the rows (the core STS subset of the 42-entry directory below), grouped by capability family, against 15 capability axes that today's benchmarks actually score. Toggle the +5 unexplored axes button above the grid to expose five structurally uncovered columns (code-switch, long-form memory, emotion regulation, on-device, audio adversarial) — axes that exist in text-LLM or ASR/TTS evaluation but have no public STS benchmark as of April 2026. Each row carries two runs-on chips — cas for cascade stacks and fd for full-duplex STS models — plus year / setting / license metadata. Click a group header to fold its rows, or a column header to rank benchmarks by that axis. The cards below this grid add 12 component / adjacent / legacy entries that are out of the grid's scope but still part of the measurement landscape (TTS arenas, ASR WER, S2ST, text-only verticals, and the historical baselines).

Click a column header to sort · click a row to see benchmark detail.
direct score · indirect / subjective · no coverage · cascade-runnable · FD-STS-runnable · partial · unexplored axis
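
For the curious, here is roughly the row shape and the column-sort behaviour the grid implies, as a minimal TypeScript sketch. The names (GridRow, sortByAxis, the rank table) are our assumptions, not this page's source code.

```ts
// Illustrative sketch of one grid row and column ranking.
type Coverage = "direct" | "indirect" | "partial" | "none" | "unexplored";

interface GridRow {
  benchmark: string;
  group: string;                         // capability family, e.g. "conversation timing"
  year: number;
  setting: "lab" | "arena" | "live" | "vertical";
  license: string;
  runsOn: { cas: boolean; fd: boolean }; // cascade stacks / full-duplex STS
  cells: Record<string, Coverage>;       // one cell per capability axis
}

// Ranking a column: direct sorts above indirect, indirect above partial, etc.
const rank: Record<Coverage, number> = {
  direct: 0,
  indirect: 1,
  partial: 2,
  none: 3,
  unexplored: 4,
};

function sortByAxis(rows: GridRow[], axis: string): GridRow[] {
  return [...rows].sort(
    (a, b) => rank[a.cells[axis] ?? "none"] - rank[b.cells[axis] ?? "none"],
  );
}
```
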
§01 · conversation timing

Full-duplex & interactive

Benchmarks that score turn-taking, back-channeling, overlap, and interruption — the axes that separate real voice agents from LLM-plus-TTS.

14 entries
§02 · lab · native

Speech-LM & Audio Foundation Models

Lab-style offline evals for speech-LMs and audio foundation models — knowledge, reasoning, safety, instruction following, multilingual dialogue, paralinguistic awareness.

8 entries
§03 · human votes

Arena & preference

Blind A/B human-preference arenas. The only evaluators that reliably catch 'this just sounds off' — and the only ones that scale across languages and accents.

4 entries
§04 · domain KPIs

Vertical & task

Task-completion benchmarks in customer service, outbound calling, and healthcare. Where the rubber meets the phone line — and where text-first agents collapse.

3 entries
§05 · S2ST · TTS · paralinguistic

Components

Component benchmarks that aren't speech-LM native but score the parts every STS system is built from — translation, synthesis, non-verbal delivery.

10 entries
§06 · text-only · TTS-only · duplex ASR

Adjacent & transferable

Benchmarks that aren't STS in the strict sense — TTS-only arenas, text-only clinical rubrics, duplex ASR for verticals — but that any serious voice system will ultimately need to pass on top of its speech stack. Kept separate from Components because they don't live inside the STS pipeline itself.

0 entries

Nothing here yet.

§07 · classic

Legacy & context

Saturated or representation-era benchmarks. Rarely used to rank modern speech-LMs, but still the scaffolding every new speech encoder reports against.

3 entries
§N · observatory notes

Three gaps we're watching.

Based on everything catalogued above, the measurement landscape has three structural holes. They are where new benchmarks — and new companies — are likely to emerge in 2026.

  1. gap · 01

    Full-duplex × vertical × real audio.

    τ³-Bench shows voice-reasoning scores collapsing by ~50 points once real-time interaction is introduced, but its acoustic layer is simulated. No public benchmark ties real recordings, full-duplex timing, and a vertical task (customer service, sales, health) together.
  2. gap · 02

    Non-English, non-Chinese full-duplex.

    URO-Bench ships EN + ZH; no turn-taking leaderboard exists for most other high-speaker-count languages. Whoever builds one at scale first will write the standard for that language.
  3. gap · 03

    Latency as a first-class axis.

    First-response-ms numbers are vendor-reported and unreplicated across Moshi / GPT-4o Voice / Gemini Live / Sesame. Full-Duplex-Bench v2 makes this measurable; we expect a dedicated latency leaderboard to land within the year. A sketch of the core measurement follows this list.
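
To make the replication point in gap 03 concrete: a vendor-neutral probe only has to compute one number, the time from end-of-user-speech to the model's first audible output. The sketch below does that under illustrative assumptions (20 ms frames, a fixed energy floor); it is not Full-Duplex-Bench's actual harness.

```ts
// Minimal sketch of a replicable first-response probe.
const FRAME_MS = 20;       // analysis frame length (assumption)
const ENERGY_FLOOR = 1e-4; // mean squared amplitude counted as "audible" (assumption)

function frameEnergy(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return sum / frame.length;
}

// `modelAudio` is the model's output channel, recorded from the moment the
// user's utterance ends (t = 0). Returns latency in ms, or null if the model
// never produces audible audio.
function firstResponseMs(modelAudio: Float32Array, sampleRate: number): number | null {
  const frameLen = Math.round((sampleRate * FRAME_MS) / 1000);
  for (let start = 0; start + frameLen <= modelAudio.length; start += frameLen) {
    if (frameEnergy(modelAudio.subarray(start, start + frameLen)) > ENERGY_FLOOR) {
      return (start / sampleRate) * 1000;
    }
  }
  return null;
}
```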

Fullduplex's editorial bias: any new benchmark that closes one of these three gaps gets flagged on the blog and added here within the week.

New benchmark or revised score? submit an entry