Fullduplex
the sts series · 07 / 10 · #benchmarks #evaluation #full-duplex

Why STS needs new benchmarks

The STS field inherited evaluation machinery from ASR, TTS, and text-LLM paradigms. None of them measured a live, two-channel, socially-timed conversation. The argument for a rebuild, plus a concrete picture of who could run it.

The previous dispatch mapped twenty-four speech-to-speech benchmarks onto fifteen capability axes. Half the cells are empty. Four different metrics share the name "barge-in latency." The commercial information diet is gated by one proprietary runner. A Japanese product team has zero dedicated benchmarks. Reading that map as "we need more benchmarks to fill the gaps" is the wrong conclusion.

The map is telling us something harder. The field imported its evaluation machinery from three prior paradigms — ASR, TTS, and text-LLM — and none of those paradigms measured the thing that makes STS hard: a live, two-channel, bidirectional, socially-timed conversation. Patching the gaps with more benchmarks of the same shape gets us a taller stack of measurements that still miss. The next generation of STS benchmarks has to be designed from the conversation outward, not from the transcript inward. This article is that argument, plus a concrete picture of what the rebuild would look like and who could run it.

What the map is telling us

Three findings carry over from the benchmark map.

First, the benchmarks are fragmented. Two dozen public benchmarks each cover a different slice of a single production voice agent. No row on the heatmap lights up across the whole grid.

Second, the commercial information diet is funneled through one proprietary bridge. The Artificial Analysis S2S leaderboard implements Big Bench Audio and a subset of Full-Duplex-Bench, and almost every commercial STS launch since late 2024 cites it. That bridge is not reproducible without access to AA's internal runner and prompt templating.

Third, the multilingual coverage is a global gap, not a Japanese-only one. Mandarin has three dedicated benchmarks. Japanese has zero dedicated full-duplex benchmarks. Arabic, Hindi, Spanish, Portuguese, French, German, Russian, Korean — none have a dedicated full-duplex benchmark either.

The obvious read is "the field needs to build more benchmarks." That read is wrong, or at least incomplete. The underlying problem is that the existing benchmarks measure what is easy to measure with ASR, TTS, and text-LLM infrastructure, not what full-duplex STS actually needs scored. The rest of this article argues that claim in three moves: where the inherited paradigms came from, which mismatches they produced, and what a next-generation benchmark would need to measure instead. Then we name who could build it.

Three inherited paradigms, three blind spots

STS evaluation did not start from scratch. It reused machinery from three earlier speech and language paradigms. Each inheritance imported a useful metric and a specific blind spot.

ASR paradigm → Word Error Rate. The automatic speech recognition field spent thirty years refining WER, the ratio of transcription errors to total words spoken. When large speech models arrived, WER was the ready-to-hand metric that researchers knew how to compute. But WER measures transcription, not interaction. A model can score WER 5% on a held-out test set and still interrupt the user constantly, freeze when interrupted itself, or backchannel at wrong moments. The Full-Duplex-Bench v1 paper made this argument explicit in early 2025: transcription accuracy measures the wrong thing for conversational models. Interaction is orthogonal to transcription, and a benchmark that scores only the latter improves the former only by accident.
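The blind spot is visible in the computation itself. A minimal sketch of the metric (word-level edit distance; production scoring adds text normalization, which changes nothing here):

```python
# Minimal WER: word-level Levenshtein distance over reference length.
# Illustrative only; the point is that the metric never sees timing.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# A model that talked over the user for the entire session still scores 0.0:
print(wer("turn left at the next light", "turn left at the next light"))
```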

TTS paradigm → Mean Opinion Score and listening tests. The text-to-speech field's standard measurement is MOS: human raters scoring audio quality on a 1-5 scale. MOS captures naturalness — does this voice sound like a person? — but not appropriateness. A model can have a pleasant voice and still fail to match the user's emotional register, over-affect neutral content, or sound warm during moments that call for clinical restraint. J-Moshi explicitly uses subjective MOS-based evaluation with no shared held-out test set, which is the TTS inheritance made visible. The Mandarin generation-side benchmark VocalBench extends MOS to voice-agent scenarios but stays in the naturalness frame.

Text-LLM paradigm → fixed-test-set reasoning scores. When GPT-3 and GPT-4 arrived, the evaluation community built fixed-test-set benchmarks — MMLU, HellaSwag, HumanEval, GPQA. These work because text reasoning is a symbol-manipulation task that a static benchmark can capture faithfully. When audio reasoning benchmarks appeared, they adopted the same shape: Big Bench Audio is a 1,000-item audio adaptation of BIG-Bench text questions. Nothing wrong with that as a reasoning probe, but Big Bench Audio is functionally a text reasoning benchmark with audio stimuli. It does not score anything that could not have been scored from the transcript, and it runs one-turn closed-ended questions rather than dialogue.

three paradigms, three blind spots

WER is transcription without interaction. MOS is naturalness without appropriateness. Audio reasoning is text reasoning with sound attached. The benchmarks we have are good at what their parent paradigms were good at — and blind to what they were never designed to see.

F1

Parent paradigm | Imported metric | Blind spot imported
ASR (automatic speech recognition) | WER (transcription error rate) | Interaction (not scored by transcript)
TTS (text-to-speech) | MOS (subjective naturalness, 1-5) | Appropriateness (affect match, register, context)
Text LLM | Fixed-test-set reasoning (MMLU / HellaSwag / GPQA; one-turn closed questions) | Live bidirectional dialogue (no notion of streaming or overlap)
Inherited paradigm map. Each metric in the middle column was imported into STS because its parent paradigm made it the default. Each blind spot in the right column is what that metric was never designed to capture.

Four measurement mismatches

The inherited paradigms produce four specific measurement mismatches when applied to full-duplex STS. Each one is a concrete failure mode, not an abstract critique.

Mismatch 1. Fixed test sets cannot score live dynamics. FDB v1 (March 2025), FDB v1.5 (July 2025), SID-Bench, FD-Bench, MTR-DuplexBench all use pre-recorded stimuli. A model is fed an audio file, its output is recorded, and scores are computed post-hoc. Streaming STS does not behave this way in production. Packet jitter, network variability, and real-time pressure produce behaviors that do not appear in offline evaluation. FDB v2 (October 2025) is the first benchmark to acknowledge this and move to a live WebRTC-style examiner. It is also the first to find that model rankings are not invariant across offline and live protocols. Same model, two scoring paradigms, different ranking. That is evidence the inherited fixed-test-set paradigm was systematically missing something.
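The instability is easy to state precisely. A sketch of the check, with invented scores rather than FDB v2's published numbers:

```python
# Invented scores for three models under the two protocols; these are not
# FDB v2's published numbers.
offline = {"model_a": 0.81, "model_b": 0.76, "model_c": 0.69}
live    = {"model_a": 0.74, "model_b": 0.78, "model_c": 0.70}

def ranking(scores: dict[str, float]) -> list[str]:
    return sorted(scores, key=scores.get, reverse=True)

def kendall_tau(r1: list[str], r2: list[str]) -> float:
    # Pairwise agreement between two rankings: +1 identical, -1 reversed.
    pairs = [(a, b) for i, a in enumerate(r1) for b in r1[i + 1:]]
    agree = sum(1 if r2.index(a) < r2.index(b) else -1 for a, b in pairs)
    return agree / len(pairs)

print(ranking(offline))                              # ['model_a', 'model_b', 'model_c']
print(ranking(live))                                 # ['model_b', 'model_a', 'model_c']
print(kendall_tau(ranking(offline), ranking(live)))  # ~0.33: the rankings disagree
```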

Mismatch 2. Transcript-only judges cannot score paralinguistic output. FDB v1's user-interruption axis scores relevance via GPT-4-turbo reading a transcript. If the model produces a response that is textually relevant but delivered in a flat, irritable, or emotionally wrong register, the transcript judge rates it correct. No field-level benchmark currently penalizes paralinguistic output failures at scale. VocalBench and MTalk-Bench point toward the generation-side scoring that would be needed, but neither is adopted by the major full-duplex benchmarks. Paralinguistic output is the largest unmeasured axis in production STS. Users will say "the model sounds wrong" and the benchmark will say "the model scores correctly."

Mismatch 3. Single-language benchmarks cannot score cross-cultural turn-taking. Japanese conversational turn-taking includes short backchannels ("hai", "un", "sou desu") at roughly one-to-two-second intervals, substantially more frequent than in English. Run FDB v1's pause-handling test — which uses a take-over-rate detector tuned to English norms — on a Japanese-capable model, and the model's correct Japanese behavior registers as a string of false positives. There is no way to score Japanese turn-taking on an English-designed benchmark, and no Japanese equivalent exists. The J-Moshi authors bypassed this by using MOS rather than a shared held-out test set. Every other non-English-dominant market faces the same problem. Arabic conversational overlap is higher than English. Hindi code-switching is dense. Mandarin gets some coverage via VocalBench-zh and CS3-Bench, but the principle is the same: language-specific turn-taking norms cannot be evaluated by benchmarks that assume English norms.
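A toy detector shows the mechanism. Both norms below are invented for illustration (real take-over detection, and the real linguistics, are more involved), but the failure is structural: cutoffs tuned to English misread correct Japanese behavior wholesale.

```python
from dataclasses import dataclass

# Toy take-over detector. Both norms are invented for illustration.
@dataclass
class TurnNorms:
    max_backchannel_s: float  # vocalizations at or under this length are backchannels
    max_rate_per_min: float   # expected backchannel frequency in this language

NORMS = {
    "en": TurnNorms(max_backchannel_s=0.4, max_rate_per_min=8.0),
    "ja": TurnNorms(max_backchannel_s=0.7, max_rate_per_min=60.0),  # "hai"/"un" every 1-2 s
}

def take_overs(vocalization_lengths_s: list[float], minutes: float, lang: str) -> int:
    """Count model vocalizations during user speech that read as take-over attempts."""
    norms = NORMS[lang]
    short = [v for v in vocalization_lengths_s if v <= norms.max_backchannel_s]
    flagged = [v for v in vocalization_lengths_s if v > norms.max_backchannel_s]
    if len(short) / minutes > norms.max_rate_per_min:
        flagged += short  # even short vocalizations, if far above the rate norm
    return len(flagged)

session = [0.5] * 40  # forty "hai"-length vocalizations across one minute
print(take_overs(session, 1.0, "en"))  # 40: every backchannel misread as a take-over
print(take_overs(session, 1.0, "ja"))  # 0: correct Japanese behavior under Japanese norms
```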

Mismatch 4. Proprietary runners cannot serve reproducibility. Artificial Analysis is structural infrastructure for the field. Every commercial STS launch since GPT-Realtime has cited an AA number. But every published score depends on a closed runner. When AA's judge model updates, every score moves. When AA changes weighting across Conversational Dynamics sub-axes, the composite changes silently. This is not a design flaw specific to AA. It is the consequence of closing the loop between commercial marketing and public comparison through a single proprietary intermediary.

aside

The field ended up with one gateway, and the gateway is not inspectable. When that single pipe re-weights a composite, every public STS scoreboard moves in lockstep without any published changelog. That is not a neutral intermediary — it is load-bearing infrastructure without public accountability.

F2

Matrix of the four measurement mismatches (rows: 1. fixed test sets miss live dynamics; 2. transcript judges miss paralinguistics; 3. English-only misses turn-taking norms; 4. proprietary runners miss reproducibility) against the three parent paradigms (columns: ASR / WER, TTS / MOS, text-LLM / fixed test set). Each cell marks either direct inheritance or partial inheritance of the mismatch.
Measurement mismatches and which paradigms inherit them. Every mismatch has at least one direct inheritance; three of the four are inherited from all three paradigms.

These four mismatches together explain why the coverage map has so many empty cells. The cells are not empty because no one has gotten around to running the experiments. The cells are empty because the experiments do not fit the inherited measurement paradigms. Paralinguistic output is empty because the parent paradigms scored either transcript text (ASR lineage) or naturalness of audio (TTS lineage), not the joint question of whether the generated audio's affect matches the requested affect. Safety / emergency barge-in is empty because the parent paradigms never had a notion of "model should interrupt the user." Multilingual full-duplex is empty because every inherited benchmark was designed in English first and translated later.

What a next-generation STS benchmark would need to measure

Pivot from criticism to construction. Five requirements follow directly from the mismatches above, each derivable from a specific failure mode.

Requirement 1 — live examiner as default. A model's full-duplex behavior exists only in live time. Pre-recorded stimuli can be a supplement, but the primary measurement has to happen in a streaming environment that introduces the packet-level and time-pressure effects real users experience. FDB v2 is the proof of concept. A next-generation benchmark makes the live examiner the default protocol, and the offline protocol the fallback for infrastructure-limited environments.
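Stripped to the protocol, a live examiner streams the stimulus at real-time pace, injects a barge-in mid-playback, and timestamps the model's reaction. The sketch below simulates the transport with asyncio queues and a stand-in model; a real examiner runs the same loop over WebRTC or a comparable low-latency transport.

```python
import asyncio
import time

# Protocol skeleton only: 'toy_model' is a stand-in coroutine, not any
# vendor's streaming interface.
async def live_barge_in_probe(model, n_frames=150, interrupt_frame=75, frame_s=0.02):
    """Stream a stimulus in real time, barge in mid-playback, and measure
    seconds from the barge-in to the model going silent."""
    inbox: asyncio.Queue = asyncio.Queue()
    events: list[tuple[str, float]] = []  # (event, timestamp) appended by the model
    task = asyncio.create_task(model(inbox, events))
    t_interrupt = 0.0
    for i in range(n_frames):
        if i == interrupt_frame:
            t_interrupt = time.monotonic()
        await inbox.put("BARGE_IN" if i == interrupt_frame else "user_speech")
        await asyncio.sleep(frame_s)  # real-time pacing; offline replay has none
    await inbox.put(None)
    await task
    silences = [t for ev, t in events if ev == "went_silent" and t >= t_interrupt]
    return min(silences) - t_interrupt if silences else float("inf")

async def toy_model(inbox, events):
    """Stand-in model: talks over everything, goes silent a few frames after a barge-in."""
    countdown = -1
    while (frame := await inbox.get()) is not None:
        events.append(("model_speech", time.monotonic()))
        if frame == "BARGE_IN":
            countdown = 3
        countdown -= 1
        if countdown == 0:
            events.append(("went_silent", time.monotonic()))

print(f"{asyncio.run(live_barge_in_probe(toy_model)):.3f}s")  # ~0.040s for the toy model
```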

Requirement 2 — joint audio-and-transcript scoring. Any conversational-dynamics axis that involves how a model says something, not just what it says, needs a judge that hears the audio. The transcript is a projection of the signal that drops half the information. Practical implementation is an LLM examiner with audio input — already technically available from frontier vendors — wrapped in a scoring rubric that explicitly weights paralinguistic output.
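A sketch of that scoring contract, with the judge call injected as a parameter (the rubric axes and weights are illustrative, not a published standard):

```python
from typing import Callable

# Rubric axes and weights are illustrative, not a published standard.
RUBRIC = {
    "content_relevance": 0.4,  # recoverable from the transcript alone
    "affect_match": 0.3,       # requires hearing the audio
    "prosodic_register": 0.2,  # requires hearing the audio
    "timing": 0.1,             # requires the two-channel stream
}

def judge_response(audio_path: str, transcript: str,
                   ask_judge: Callable[[str, str, str], float]) -> float:
    """Weighted rubric score from an audio-capable judge.

    'ask_judge' wraps whatever multimodal-LLM API is in use (an assumption,
    not a specific vendor call); the contract this sketch fixes is that the
    judge receives the audio, not only the text.
    """
    return sum(w * ask_judge(audio_path, transcript, axis)
               for axis, w in RUBRIC.items())

# The degenerate case this requirement rules out: a stub judge that never
# hears the audio scores a flat, wrong-register delivery as perfect.
print(judge_response("resp.wav", "Sure, turning left now.", lambda a, t, x: 1.0))  # 1.0
```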

Requirement 3 — multilingual from day one. A next-generation benchmark designs its protocol so that language-specific turn-taking norms can be encoded in the scoring rule, not hard-coded to English. Japanese backchannel frequency, Mandarin tonal cues in emotional expression, Arabic conversational overlap norms, Hindi-English code-switching rates — these are research-grade linguistic-typology questions, not engineering corner cases, and they need to be in the benchmark's design document, not patched later. HumDial at ICASSP 2026 is the first community-scale attempt to include a multilingual track from the start (Chinese + English across 6,356 interruption and 4,842 rejection utterances). That is the shape. It needs four more language tracks.

Requirement 4 — open methodology including judge selection. Reproducibility requires four things: the stimuli, the runner code, the prompts, and the judge. Today's benchmarks open varying subsets. FDB v1 is open on stimuli and metrics but uses GPT-4-turbo as an opaque judge. Artificial Analysis is closed on runner, prompts, and weighting. A next-generation benchmark has to publish all four, including the judge model's version and the prompt template. Proprietary-score leaderboards can still exist, but they cannot be the field's reference.
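In practice "publish all four" reduces to a release manifest that pins every surface in the scoring path. A sketch with hypothetical field names and placeholder values:

```python
# A release manifest that pins all four reproducibility surfaces. Field
# names and values are hypothetical; the point is that nothing in the
# scoring path is left floating.
MANIFEST = {
    "stimuli": {"dataset": "example-stimuli-v1",        # hypothetical name
                "sha256": "..."},                        # content hash of the audio archive
    "runner":  {"repo": "https://example.org/runner",    # hypothetical URL
                "commit": "..."},
    "prompts": {"judge_template": "prompts/judge_v3.txt",
                "sha256": "..."},
    "judge":   {"model": "...",                          # exact model identifier
                "version": "...",                        # pinned snapshot, never 'latest'
                "temperature": 0.0},                     # deterministic judging
}
```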

Requirement 5 — composite scores with transparent weighting. Any aggregation into a single number publishes its weights and allows users to re-weight based on their product's priorities. If conversational-dynamics composite scoring weights "smooth turn-taking" at 30% and a product team cares 3× more about "interruption handling," the benchmark should expose the weights and support re-aggregation. Today's composites — including Artificial Analysis' Conversational Dynamics composite — do not expose weights.
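The mechanics of re-aggregation are trivial once per-axis scores and default weights are published, which is the point. A sketch with illustrative axis names and numbers:

```python
# Re-weightable composite: the benchmark publishes per-axis scores and its
# default weights; any consumer can re-aggregate. All values are illustrative.
DEFAULT_WEIGHTS = {"smooth_turn_taking": 0.30, "interruption_handling": 0.25,
                   "backchannel_timing": 0.25, "latency": 0.20}

def composite(axis_scores: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(axis_scores[a] * w for a, w in weights.items()) / total

scores = {"smooth_turn_taking": 0.82, "interruption_handling": 0.61,
          "backchannel_timing": 0.74, "latency": 0.90}

print(composite(scores, DEFAULT_WEIGHTS))          # the published default rank input
# A team that cares 3x more about interruption handling just re-weights:
mine = DEFAULT_WEIGHTS | {"interruption_handling": 0.75}
print(composite(scores, mine))                     # same data, different priorities
```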

F3

Requirement | Today (imported paradigm) | Next generation
Measurement protocol | Fixed test set, pre-recorded | Live examiner default, pre-recorded fallback
Judge modality | Transcript read by text LLM | Audio heard by multimodal LLM
Language coverage | English-first, translated later | Multi-track at design time, with per-language norms
Reproducibility | Some subset of stimuli / metrics / runner / judge open | All four open, with judge version pinned
Composite scoring | Opaque weighting, single rank | Published weighting, user-reweightable
Current versus next-generation benchmark requirements. Each row is derived from a specific mismatch in Figure F2. No single benchmark today meets all five; several 2025-2026 releases hit a subset.

The dataset side of the same problem

Benchmarks without reference data are impossible. Every benchmark above sits on top of a dataset: FDB v1 on the ICC corpus, Big Bench Audio on a custom audio recording of BIG-Bench text items, VocalBench on its own Mandarin recordings. A next-generation STS benchmark needs reference data it can hold out: two-channel conversations in the target language, with annotations for turn-taking events, overlap, and disfluency. A single-channel mono dataset cannot score full-duplex, because the ground truth for full-duplex behavior is encoded in the separation of the two channels. This is the same shortage the data ceiling and the foundation-threshold argument describe, surfacing in a different domain: the dataset gap and the benchmark gap rhyme because both sit on the same two-channel supply problem.
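One possible shape for those annotations, sketched as a schema rather than any existing standard:

```python
from dataclasses import dataclass
from typing import Literal

# A possible annotation schema for a two-channel reference corpus; a sketch,
# not an existing standard. Ground truth for full-duplex behavior lives in
# the timing relationship between the channels, which a mono mix destroys.
@dataclass
class TurnEvent:
    channel: Literal["user", "agent"]
    kind: Literal["turn", "backchannel", "overlap", "disfluency", "barge_in"]
    start_s: float
    end_s: float
    overlaps_other_channel: bool  # derivable only when both channels are kept

conversation = [
    TurnEvent("user", "turn", 0.00, 3.20, False),
    TurnEvent("agent", "backchannel", 1.40, 1.65, True),  # "mm-hm" during the user's turn
    TurnEvent("agent", "turn", 3.35, 6.10, False),
    TurnEvent("user", "barge_in", 4.80, 5.90, True),      # user interrupts mid-turn
]
```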

Who could build this

Four plausible builder types, each with a path and a specific weakness.

Academic consortium. HumDial at ICASSP 2026 is the proof that this model works. A grand-challenge-style benchmark with multiple co-authoring institutions, released open with training data and a held-out test set. Weakness: the funding and publication cycle doesn't match STS iteration speed. By the time a v2 consortium benchmark ships, the model landscape has moved. HumDial is a single-shot event; FDB v1 has already shipped three successors (v1.5, v2, v3) across thirteen months, which is closer to the iteration speed the field actually operates at.

Open-source community via Hugging Face. Big Bench Audio shipped through Hugging Face's blog and dataset hub. This works for lightweight, fixed-test-set benchmarks. It struggles with live-examiner protocols because Hugging Face Spaces does not currently provide the streaming infrastructure — WebRTC, low-latency media pipelines — that a live examiner needs. That could change. If it does, HF becomes a plausible default.

Independent commercial analyst firm with open methodology. Artificial Analysis is the current version of this role, closed. If AA open-sources its runner, prompts, and weighting — or if a competitor launches with open methodology — the field gets a commercial bridge that is also reproducible. Weakness: business-model incentives push toward closed. AA's differentiation is its prompt templating and judge selection. Open-sourcing those removes a defensible moat. A pre-commitment to transparency from day one is a plausible strategy; retrofitting transparency onto an established closed leaderboard is harder.

Dataset-first company. If the organization that assembled the reference data also defines the scoring standard, the data and the benchmark co-evolve. This is an emerging pattern. τ-Voice (Sierra, 2025) is a benchmark published by the company that deploys the underlying agents. VocalBench is Mandarin-native and comes from teams building Mandarin STS. Fullduplex is another candidate in this category. Weakness: commercial positioning creates obvious conflicts of interest unless the scoring is published and reproducible. The dataset-first path only produces a credible benchmark if the builder pre-commits to open methodology and external validation.

F4

Quadrant chart: speed of iteration (slow to fast) against reproducibility commitment (low to high). Plotted builders: academic consortium (HumDial, FDB), open-source / Hugging Face (Big Bench Audio), closed commercial analyst (Artificial Analysis), dataset-first company (τ-Voice, Fullduplex, ...). The target zone is fast + reproducible: an open-methodology commercial leaderboard.
Four builder types on iteration speed × reproducibility. No current builder occupies the top-right quadrant (fast and fully reproducible). An open-methodology commercial leaderboard, or a dataset-first company that pre-commits to open scoring, is the most plausible path into that zone.

No single builder type solves the whole problem. The honest forecast is that the next few years will see a mix: an ICASSP-class academic consortium for a multilingual full-duplex benchmark (annual cadence, open data), an open-source Hugging Face replacement for Big Bench Audio that includes paralinguistic stimuli (community cadence, modest scope), and at least one commercial leaderboard that competes with Artificial Analysis on open-methodology positioning. A dataset-first company with an open benchmark is the fourth piece, and the most interesting commercially because it aligns evaluation with training data assembly.

the target zone

The next-generation STS benchmark has to sit in the same quadrant as the text-LLM leaderboards that reshaped that field: fast iteration (weekly-to-monthly, not annual) crossed with reproducible methodology (open runner, open judge, open weights). Everything else — slow consortia, closed arenas, dataset-first labs without transparency — falls short on one axis or the other.

What this means for different readers

Three summaries, one per reader priority.

For researchers: the open opportunity is multilingual live examiner benchmarks. Japanese specifically (no FDB-equivalent exists), but Korean, Arabic, Hindi, and Spanish are all publishable gaps. Paralinguistic output is a second opportunity; the methodology is not solved but the audio-input LLM judges needed to solve it are now available from frontier vendors.

For VCs: evaluation infrastructure is a real layer of the stack, not a cost center. The question "who is positioned to build the reproducible version of Artificial Analysis" has candidate answers — an open-methodology commercial leaderboard, an academic consortium with commercial partners, a dataset-first company with open scoring — and the winner gets durable commercial positioning because the field needs a reference bridge that is not proprietary. This is adjacent to the model layer rather than competitive with it.

For product engineers and buyers: compose coverage from multiple benchmarks until a unified one exists. When a vendor cites "SOTA on full-duplex," ask which version of Full-Duplex-Bench, which axis, which barge-in definition. When a vendor cites a single composite score, ask for the weighting. If the weighting is not published, treat the number as advertising rather than measurement. For Japanese, Korean, and other non-English deployments, no benchmark currently answers your question. Budget for internal evaluation accordingly.

Where this lands

The benchmark map described the benchmarks as they are. This article argued what they would need to become. Together they define the evaluation side of the STS field as of April 2026.

Two claims summarize the argument. First, the existing benchmarks are not incomplete, they are misaligned. They inherited their shape from ASR, TTS, and text-LLM paradigms that did not measure bidirectional live conversation. Filling empty cells on the current map with more benchmarks of the same shape produces a taller stack of the same mismeasurement. Second, the rebuild is buildable, not speculative. FDB v2's live examiner, HumDial's multilingual track, VocalBench's paralinguistic scoring, and the explicit acknowledgement that Artificial Analysis is a proprietary bridge — these are public work from 2025 and 2026. A next-generation benchmark assembles those pieces into the five requirements above and publishes its methodology openly. The question is who runs it.

Article 08 covers which models score where on the benchmarks that exist today. Article 09 covers the consent and licensing constraints on the reference data that any next-generation benchmark will need.


Fullduplex is working on benchmarks meant to advance the kind of measurement infrastructure this article maps. If your lab or team is working in this area, get in touch.

■ ■ ■