---
title: "Why STS needs new benchmarks"
description: "The STS field inherited evaluation machinery from ASR, TTS, and text-LLM paradigms. None of them measured a live, two-channel, socially-timed conversation. The argument for a rebuild, plus a concrete picture of who could run it."
article_number: "07"
slug: why-new-benchmarks
published_at: 2026-04-20
reading_minutes: 17
tags: ["benchmarks", "evaluation", "full-duplex"]
canonical_url: https://fullduplex.ai/blog/why-new-benchmarks
markdown_url: https://fullduplex.ai/blog/why-new-benchmarks/md
series: "The STS Series"
series_position: 7
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Why speech-to-speech AI needs new benchmarks

[The previous dispatch](/blog/benchmark-landscape) mapped twenty-four speech-to-speech benchmarks onto fifteen capability axes. Half the cells are empty. Four different metrics share the name "barge-in latency." The commercial information diet is gated by one proprietary runner. A Japanese product team has zero dedicated benchmarks. Reading that map as "we need more benchmarks to fill the gaps" is the wrong conclusion.

The map is telling us something harder. The field imported its evaluation machinery from three prior paradigms — ASR, TTS, and text-LLM — and none of those paradigms measured the thing that makes STS hard: a live, two-channel, bidirectional, socially-timed conversation. Patching the gaps with more benchmarks of the same shape gets us a taller stack of measurements that still miss. The next generation of STS benchmarks has to be designed from the conversation outward, not from the transcript inward. This article is that argument, plus a concrete picture of what the rebuild would look like and who could run it.

## What the map is telling us

Three findings carry over from [the benchmark map](/blog/benchmark-landscape).

First, the benchmarks are **fragmented**. Two dozen public benchmarks each cover a different slice of a single production voice agent. No row on the heatmap lights up across the whole grid.

Second, the commercial information diet is **funneled through one proprietary bridge**. The [Artificial Analysis S2S leaderboard](https://artificialanalysis.ai/speech-to-speech) implements Big Bench Audio and a subset of Full-Duplex-Bench, and almost every commercial STS launch since late 2024 cites it. That bridge is not reproducible without access to AA's internal runner and prompt templating.

Third, **multilingual coverage is a global gap, not a Japanese-only one.** Mandarin has three dedicated benchmarks. Japanese has zero dedicated full-duplex benchmarks. Arabic, Hindi, Spanish, Portuguese, French, German, Russian, Korean — none have a dedicated full-duplex benchmark either.

The obvious read is "the field needs to build more benchmarks." That read is wrong, or at least incomplete. The underlying problem is that the existing benchmarks measure what it is easy to measure with ASR, TTS, and text-LLM infrastructure, and *not* what full-duplex STS actually needs scored. The rest of this article argues that claim in three moves: where the inherited paradigms came from, which mismatches they produced, and what a next-generation benchmark would need to measure instead. Then we name who could build it.

## Three inherited paradigms, three blind spots

STS evaluation did not start from scratch. It reused machinery from three earlier speech and language paradigms. Each inheritance imported a useful metric and a specific blind spot.

**ASR paradigm → Word Error Rate.** The automatic speech recognition field spent thirty years refining WER, the ratio of transcription errors to total words spoken. When large speech models arrived, WER was the ready-to-hand metric that researchers knew how to compute. But WER measures *transcription*, not *interaction*. A model can score WER 5% on a held-out test set and still interrupt the user constantly, freeze when interrupted itself, or backchannel at wrong moments. [The Full-Duplex-Bench v1 paper](https://arxiv.org/abs/2503.04721) made this argument explicit in early 2025: transcription accuracy measures the wrong thing for conversational models. Interaction is orthogonal to transcription, and if you score only the latter, you reward the former by accident.
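
Concretely, WER is a pure edit distance over transcripts. A minimal sketch (illustrative strings, not a real test set) makes the blind spot visible: nothing about timing, overlap, or delivery can enter the computation.

```python
# Minimal WER: Levenshtein distance over word sequences.
# The inputs are transcripts only, so interruptions, freezes,
# and mistimed backchannels are invisible to the metric by construction.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```

A model that talks over the user on every turn can still score a perfect 0.0 here, which is the whole point.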

**TTS paradigm → Mean Opinion Score and listening tests.** The text-to-speech field's standard measurement is MOS: human raters scoring audio quality on a 1-5 scale. MOS captures *naturalness* — does this voice sound like a person? — but not *appropriateness*. A model can have a pleasant voice and still fail to match the user's emotional register, over-affect neutral content, or sound warm during moments that call for clinical restraint. [J-Moshi](https://aclanthology.org/2024.emnlp-main.1234/) explicitly uses subjective MOS-based evaluation with no shared held-out test set, which is the TTS inheritance made visible. The Mandarin generation-side benchmark [VocalBench](https://arxiv.org/abs/2505.15727) extends MOS to voice-agent scenarios but stays in the naturalness frame.

**Text-LLM paradigm → fixed-test-set reasoning scores.** When GPT-3 and GPT-4 arrived, the evaluation community built fixed-test-set benchmarks — MMLU, HellaSwag, HumanEval, GPQA. These work because text reasoning is a symbol-manipulation task that a static benchmark can capture faithfully. When audio reasoning benchmarks appeared, they adopted the same shape: [Big Bench Audio](https://huggingface.co/blog/big-bench-audio-release) is a 1,000-item audio adaptation of BIG-Bench text questions. Nothing wrong with that as a reasoning probe, but Big Bench Audio is functionally a text reasoning benchmark with audio stimuli. It does not score anything that could not have been scored from the transcript, and it runs one-turn closed-ended questions rather than dialogue.

<div class="callout">
<span class="label">three paradigms, three blind spots</span>

**WER** is transcription without interaction. **MOS** is naturalness without appropriateness. **Audio reasoning** is text reasoning with sound attached. The benchmarks we have are good at what their parent paradigms were good at — and blind to what they were never designed to see.

</div>

{{FIG:f1}}

## Four measurement mismatches

The inherited paradigms produce four specific measurement mismatches when applied to full-duplex STS. Each one is a concrete failure mode, not an abstract critique.

**Mismatch 1. Fixed test sets cannot score live dynamics.** FDB v1 (March 2025), FDB v1.5 (July 2025), SID-Bench, FD-Bench, and MTR-DuplexBench all use pre-recorded stimuli. A model is fed an audio file, its output is recorded, and scores are computed post-hoc. Streaming STS does not behave this way in production. Packet jitter, network variability, and real-time pressure produce behaviors that do not appear in offline evaluation. [FDB v2](https://arxiv.org/abs/2510.07838) (October 2025) is the first benchmark to acknowledge this and move to a live WebRTC-style examiner. It is also the first to find that model rankings are not invariant across offline and live protocols. Same model, two scoring paradigms, different ranking. That is evidence the inherited fixed-test-set paradigm was systematically missing something.
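
The ranking-invariance finding reduces to a simple check: score the same models under both protocols and test whether the orderings agree. A sketch with hypothetical scores (the `kendall_tau` helper and every number below are illustrative, not FDB data):

```python
# Rank-invariance check between an offline and a live protocol.
# If the two protocols measured the same thing, tau would be 1.0.

def kendall_tau(a, b):
    """Naive Kendall rank correlation between two equal-length score lists."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

offline = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.69}  # hypothetical
live    = {"model_a": 0.62, "model_b": 0.71, "model_c": 0.58}  # hypothetical

models = sorted(offline)
tau = kendall_tau([offline[m] for m in models], [live[m] for m in models])
print(f"tau = {tau:.2f}")  # anything below 1.0 means the protocols disagree on ordering
```

In this toy case model_a and model_b swap places under the live protocol, which is exactly the shape of disagreement FDB v2 reported.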

**Mismatch 2. Transcript-only judges cannot score paralinguistic output.** FDB v1's user-interruption axis scores relevance via GPT-4-turbo reading a transcript. If the model produces a response that is textually relevant but delivered in a flat, irritable, or emotionally wrong register, the transcript judge rates it correct. No field-level benchmark currently penalizes paralinguistic output failures at scale. [VocalBench](https://arxiv.org/abs/2505.15727) and [MTalk-Bench](https://arxiv.org/abs/2505.15524) point toward the generation-side scoring that would be needed, but neither is adopted by the major full-duplex benchmarks. Paralinguistic output is the largest unmeasured axis in production STS. Users will say "the model sounds wrong" and the benchmark will say "the model scores correctly."

**Mismatch 3. Single-language benchmarks cannot score cross-cultural turn-taking.** Japanese conversational turn-taking includes short backchannels ("hai", "un", "sou desu") at roughly one-to-two-second intervals, substantially more frequent than in English. Run FDB v1's pause-handling test — which uses a take-over-rate detector tuned to English norms — on a Japanese-capable model, and the model's correct Japanese behavior fires as false positives. There is no way to score Japanese turn-taking on an English-designed benchmark, and no Japanese equivalent exists. The [J-Moshi](https://aclanthology.org/2024.emnlp-main.1234/) authors bypassed this by using MOS rather than a shared held-out test set. Every other non-English-dominant market faces the same problem. Arabic conversational overlap is higher than English. Hindi code-switching is dense. Mandarin gets some coverage via [VocalBench-zh](https://arxiv.org/abs/2511.08230) and [CS3-Bench](https://arxiv.org/abs/2510.07881), but the principle is the same: language-specific turn-taking norms cannot be evaluated by benchmarks that assume English norms.
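
The false-positive mechanism can be sketched in a few lines. The lexicons and the duration threshold below are illustrative stand-ins, not taken from any benchmark's actual detector:

```python
# A take-over detector whose backchannel lexicon is tuned to one language
# misclassifies normal behavior in another. Lexicons and the 1.0 s
# threshold are illustrative assumptions, not FDB's real parameters.

BACKCHANNELS = {
    "en": {"uh-huh", "mm-hmm", "yeah", "right"},
    "ja": {"hai", "un", "sou", "sou desu", "ee"},
}

def is_take_over(token: str, duration_s: float, lang: str = "en") -> bool:
    """Count an overlapping vocalization as a turn grab unless it is a
    recognized backchannel for this language and short enough."""
    backchannel = token.lower() in BACKCHANNELS.get(lang, set()) and duration_s < 1.0
    return not backchannel

# The same Japanese overlap, scored under two language assumptions:
print(is_take_over("hai", 0.4, lang="en"))  # True: false positive under English norms
print(is_take_over("hai", 0.4, lang="ja"))  # False: correct backchannel under Japanese norms
```

The fix is not a bigger English lexicon; it is making the language-specific norms a first-class parameter of the scoring rule, which is requirement 3 below.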

**Mismatch 4. Proprietary runners cannot serve reproducibility.** Artificial Analysis is structural infrastructure for the field. Every commercial STS launch since GPT-Realtime has cited an AA number. But every published score depends on a closed runner. When AA's judge model updates, every score moves. When AA changes weighting across Conversational Dynamics sub-axes, the composite changes silently. This is not a design flaw specific to AA. It is the consequence of closing the loop between commercial marketing and public comparison through a single proprietary intermediary.

<p class="aside-inline">
<span class="aside-lbl">aside</span>
The field ended up with <b>one gateway</b>, and the gateway is not inspectable. When that single pipe re-weights a composite, every public STS scoreboard moves in lockstep without any published changelog. That is not a neutral intermediary — it is load-bearing infrastructure without public accountability.
</p>

{{FIG:f2}}

These four mismatches together explain why the coverage map has so many empty cells. The cells are not empty because no one has gotten around to running the experiments. The cells are empty because the experiments do not fit the inherited measurement paradigms. Paralinguistic output is empty because the parent paradigms scored *either* transcript text (ASR lineage) *or* naturalness of audio (TTS lineage), not the joint question of whether the generated audio's affect matches the requested affect. Safety / emergency barge-in is empty because the parent paradigms never had a notion of "model should interrupt the user." Multilingual full-duplex is empty because every inherited benchmark was designed in English first and translated later.

## What a next-generation STS benchmark would need to measure

Pivot from criticism to construction. Five requirements follow directly from the mismatches above, each derivable from a specific failure mode.

**Requirement 1 — live examiner as default.** A model's full-duplex behavior exists only in live time. Pre-recorded stimuli can be a supplement, but the primary measurement has to happen in a streaming environment that introduces the packet-level and time-pressure effects real users experience. FDB v2 is the proof of concept. A next-generation benchmark makes the live examiner the default protocol, and the offline protocol the fallback for infrastructure-limited environments.
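
The offline/live contrast can be made concrete with a toy simulation: no real model or transport, just a stand-in compute delay and injected transport jitter (every name and number below is illustrative):

```python
# Toy contrast between offline and live protocols: the live examiner's
# measured latency includes simulated transport delay that a file-based
# offline run never sees. No real model or WebRTC stack is involved.
import random
import time

def model_respond(stimulus: str) -> str:
    time.sleep(0.05)  # stand-in for model compute
    return f"reply to {stimulus!r}"

def offline_run(stimuli):
    """File-in, file-out: measured latency is compute only."""
    latencies = []
    for s in stimuli:
        t0 = time.perf_counter()
        model_respond(s)
        latencies.append(time.perf_counter() - t0)
    return latencies

def live_run(stimuli, jitter_s=(0.0, 0.08), seed=0):
    """Streaming examiner: each exchange rides on a jittery transport."""
    rng = random.Random(seed)
    latencies = []
    for s in stimuli:
        t0 = time.perf_counter()
        time.sleep(rng.uniform(*jitter_s))  # simulated packet delay
        model_respond(s)
        latencies.append(time.perf_counter() - t0)
    return latencies

stimuli = ["turn-1", "turn-2", "turn-3"]
print(f"offline p50 = {sorted(offline_run(stimuli))[1]:.3f}s")
print(f"live    p50 = {sorted(live_run(stimuli))[1]:.3f}s")
```

A real live examiner measures more than latency, of course; the point is that the live protocol's measurement distribution is simply not reachable from an offline run.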

**Requirement 2 — joint audio-and-transcript scoring.** Any conversational-dynamics axis that involves how a model *says* something, not just what it says, needs a judge that hears the audio. The transcript is a projection of the signal that drops half the information. Practical implementation is an LLM examiner with audio input — already technically available from frontier vendors — wrapped in a scoring rubric that explicitly weights paralinguistic output.

**Requirement 3 — multilingual from day one.** A next-generation benchmark designs its protocol so that language-specific turn-taking norms can be encoded in the scoring rule, not hard-coded to English. Japanese backchannel frequency, Mandarin tonal cues in emotional expression, Arabic conversational overlap norms, Hindi-English code-switching rates — these are research-grade linguistic-typology questions, not engineering corner cases, and they need to be in the benchmark's design document, not patched later. [HumDial](https://sites.google.com/view/humdial-2026) at ICASSP 2026 is the first community-scale attempt to include a multilingual track from the start (Chinese + English across 6,356 interruption and 4,842 rejection utterances). That is the shape. It needs four more language tracks.

**Requirement 4 — open methodology including judge selection.** Reproducibility requires four things: the stimuli, the runner code, the prompts, and the judge. Today's benchmarks open varying subsets. FDB v1 is open on stimuli and metrics but uses GPT-4-turbo as an opaque judge. Artificial Analysis is closed on runner, prompts, and weighting. A next-generation benchmark has to publish all four, including the judge model's version and the prompt template. Proprietary-score leaderboards can still exist, but they cannot be the field's reference.

**Requirement 5 — composite scores with transparent weighting.** Any aggregation into a single number must publish its weights and allow users to re-weight based on their product's priorities. If a conversational-dynamics composite weights "smooth turn-taking" at 30% and a product team cares 3× more about "interruption handling," the benchmark should expose the weights and support re-aggregation. Today's composites — including Artificial Analysis' Conversational Dynamics composite — do not expose weights.
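
What requirement 5 asks for is small in code terms. A sketch with hypothetical axis names, scores, and weights (none of these numbers come from any real leaderboard):

```python
# A composite whose weights are published data, so a product team can
# re-aggregate to its own priorities. All axis names, scores, and
# weights below are illustrative.

def composite(scores: dict, weights: dict) -> float:
    """Weighted average over published, user-replaceable axis weights."""
    total = sum(weights.values())
    return sum(scores[axis] * w for axis, w in weights.items()) / total

scores = {"smooth_turn_taking": 0.80, "interruption_handling": 0.55, "backchannel": 0.70}

published = {"smooth_turn_taking": 0.30, "interruption_handling": 0.40, "backchannel": 0.30}
mine      = {"smooth_turn_taking": 0.10, "interruption_handling": 0.80, "backchannel": 0.10}

print(round(composite(scores, published), 3))  # 0.67 under the published weights
print(round(composite(scores, mine), 3))       # 0.59 under a buyer's own weights
```

Same model, same axis scores, two honest composites. A closed leaderboard collapses that difference into one silent number.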

{{FIG:f3}}

### The dataset side of the same problem

Benchmarks without reference data are impossible. Every benchmark above sits on top of a dataset: FDB v1 on the ICC corpus, Big Bench Audio on a custom audio recording of BIG-Bench text items, VocalBench on its own Mandarin recordings. A next-generation STS benchmark needs reference data it can hold out: two-channel conversations in the target language, with annotations for turn-taking events, overlap, and disfluency. A single-channel mono dataset cannot score full-duplex, because the ground truth for full-duplex behavior is encoded in the separation of the two channels. That is the same shortage [the data ceiling](/blog/data-ceiling) and [the foundation-threshold argument](/blog/foundation-before-vertical) describe, surfacing in a different domain: the dataset gap and the benchmark gap rhyme because both sit on the same two-channel supply problem.
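
A sketch of why the two channels *are* the ground truth: given per-channel voice-activity decisions, overlap events fall out of the channel intersection. The frame labels below are illustrative 20 ms VAD decisions, not a real corpus format:

```python
# Turn-taking ground truth from a two-channel recording: overlap spans
# are simply the frames where both channels are active. A mono mix
# destroys exactly this information.

def overlap_events(vad_a, vad_b, frame_s=0.02):
    """Return (start_s, end_s) spans where both channels are speaking."""
    events, start = [], None
    for i, (a, b) in enumerate(zip(vad_a, vad_b)):
        if a and b and start is None:
            start = i
        elif not (a and b) and start is not None:
            events.append((start * frame_s, i * frame_s))
            start = None
    if start is not None:
        events.append((start * frame_s, len(vad_a) * frame_s))
    return events

# user speaks frames 0-5; agent barges in at frame 4 and keeps talking.
user  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
agent = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(overlap_events(user, agent))  # one ~40 ms overlap span around 0.08-0.12 s
```

Run the same logic on a mono mixdown and the barge-in is unrecoverable: one active channel, no intersection, no event.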

## Who could build this

Four plausible builder types, each with a path and a specific weakness.

**Academic consortium.** HumDial at ICASSP 2026 is the proof that this model works. A grand-challenge-style benchmark with multiple co-authoring institutions, released open with training data and a held-out test set. Weakness: the funding and publication cycle does not match STS iteration speed. By the time a v2 consortium benchmark ships, the model landscape has moved. HumDial is a single-shot event; FDB v1 has already shipped three successors (v1.5, v2, v3) across thirteen months, which is closer to the iteration speed the field actually operates at.

**Open-source community via Hugging Face.** Big Bench Audio shipped through Hugging Face's blog and dataset hub. This works for lightweight, fixed-test-set benchmarks. It struggles for live examiner paradigms because Hugging Face Spaces does not currently provide the streaming infrastructure — WebRTC, low-latency media pipelines — that a live examiner needs. That could change. If it does, HF becomes a plausible default.

**Independent commercial analyst firm with open methodology.** Artificial Analysis is the current version of this role, closed. If AA open-sources its runner, prompts, and weighting — or if a competitor launches with open methodology — the field gets a commercial bridge that is also reproducible. Weakness: business-model incentives push toward closed. AA's differentiation is its prompt templating and judge selection. Open-sourcing those removes a defensible moat. A pre-commitment to transparency from day one is a plausible strategy; retrofitting transparency onto an established closed leaderboard is harder.

**Dataset-first company.** If the organization that assembled the reference data also defines the scoring standard, the data and the benchmark co-evolve. This is an emerging pattern. [τ-Voice](https://sierra.ai/blog/tau-voice) (Sierra, 2025) is a benchmark published by the company that deploys the underlying agents. [VocalBench](https://arxiv.org/abs/2505.15727) is Mandarin-native and comes from teams building Mandarin STS. Fullduplex is another candidate in this category. Weakness: commercial positioning creates obvious conflicts of interest unless the scoring is published and reproducible. The dataset-first path only produces a credible benchmark if the builder pre-commits to open methodology and external validation.

{{FIG:f4}}

No single builder type solves the whole problem. The honest forecast is that the next few years will see a mix: an ICASSP-class academic consortium for a multilingual full-duplex benchmark (annual cadence, open data), an open-source Hugging Face replacement for Big Bench Audio that includes paralinguistic stimuli (community cadence, modest scope), and at least one commercial leaderboard that competes with Artificial Analysis on open-methodology positioning. A dataset-first company with an open benchmark is the fourth piece, and the most interesting commercially because it aligns evaluation with training data assembly.

<div class="callout dark">
<span class="label">the target zone</span>

The next-generation STS benchmark has to sit in the same quadrant as the text-LLM leaderboards that reshaped that field: **fast iteration** (weekly-to-monthly, not annual) crossed with **reproducible methodology** (open runner, open judge, open weights). Everything else — slow consortia, closed arenas, dataset-first labs without transparency — falls short on one axis or the other.

</div>

## What this means for different readers

Three summaries, one per reader priority.

**For researchers:** the open opportunity is multilingual live-examiner benchmarks. Japanese is the clearest gap (no FDB equivalent exists), and Korean, Arabic, Hindi, and Spanish are all publishable gaps as well. Paralinguistic output is a second opportunity; the methodology is not solved, but the audio-input LLM judges needed to solve it are now available from frontier vendors.

**For VCs:** evaluation infrastructure is a real layer of the stack, not a cost center. The question "who is positioned to build the reproducible version of Artificial Analysis" has candidate answers — an open-methodology commercial leaderboard, an academic consortium with commercial partners, a dataset-first company with open scoring — and the winner gets durable commercial positioning because the field needs a reference bridge that is not proprietary. This is adjacent to the model layer rather than competitive with it.

**For product engineers and buyers:** compose coverage from multiple benchmarks until a unified one exists. When a vendor cites "SOTA on full-duplex," ask *which version of Full-Duplex-Bench, which axis, which barge-in definition.* When a vendor cites a single composite score, ask for the weighting. If the weighting is not published, treat the number as advertising rather than measurement. For Japanese, Korean, and other non-English deployments, no benchmark currently answers your question. Budget for internal evaluation accordingly.

## Where this lands

[The benchmark map](/blog/benchmark-landscape) described the benchmarks as they are. This article argued what they would need to become. Together they define the evaluation side of the STS field as of April 2026.

Two claims summarize the argument. First, **the existing benchmarks are not incomplete, they are misaligned.** They inherited their shape from ASR, TTS, and text-LLM paradigms that did not measure bidirectional live conversation. Filling empty cells on the current map with more benchmarks of the same shape produces a taller stack of the same mismeasurement. Second, **the rebuild is buildable**, not speculative. FDB v2's live examiner, HumDial's multilingual track, VocalBench's paralinguistic scoring, and the explicit acknowledgement that Artificial Analysis is a proprietary bridge — these are public work from 2025 and 2026. A next-generation benchmark assembles the five requirements laid out above and publishes them openly. The question is who runs it.

[Article 08](/blog/sts-model-landscape) covers which models score where on the benchmarks that exist today. [Article 09](/blog/consent-licensing-opt-in) covers the consent and licensing constraints on the reference data that any next-generation benchmark will need.

---

Fullduplex is working on benchmarks meant to advance the kind of measurement infrastructure this article maps. If your lab or team is working in this area, [get in touch](mailto:hello@fullduplex.ai).

---

_Originally published at [https://fullduplex.ai/blog/why-new-benchmarks](https://fullduplex.ai/blog/why-new-benchmarks)._
_Part of **The STS Series** · 07 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
