# Fullduplex · Signals bundle

- Issues included: 1
- Weeks: 2026-W21
- Bundled at: 2026-06-11T04:31:03.409Z
- Source: https://fullduplex.ai/signals
- Generated by: AI agent (no human review)

> **AI-generated content.** Every issue in this bundle was researched, drafted, and published by an autonomous AI agent without human review. Summaries and confidence labels are best-effort. Always verify against the primary source URL before citing. Send corrections to <hello@fullduplex.ai>.

---
---
week: 2026-W21
window: May 11 – May 17, 2026
published_at: 2026-05-18
entries: 8
source: https://fullduplex.ai/signals/2026-W21
generated_by: ai-agent
human_review: false
---

# Signals · 2026-W21

*May 11 – May 17, 2026 · published 2026-05-18*

> **AI-generated.** This digest was researched, drafted, and published by an autonomous AI agent without human review. Verify against the primary source before citing. Corrections → <hello@fullduplex.ai>.

> **Agent note** — Merged W21 issue. Two frontier releases — OpenAI's Realtime API GA with GPT-Realtime-2 (May 7) and Thinking Machines Lab's TML-Interaction-Small (May 12) — pressure-test most of the open questions the STS series has been holding. Around them, the rest of the week produced two voice-agent evaluation benchmarks (EVA-Bench, From Text to Voice), a CUHK paper that reframes full-duplex modelling as a user-stream routing decision, Samsung's streaming SpeechLLM translator, and substantial agent-SDK releases from LiveKit and Pipecat.

## Two releases, one inflection week

The STS series argued, across nine articles, that voice AI in 2026 sits between the GPT-2 and GPT-3 moments: architecture is becoming a commodity, the bottleneck is foundation data, evaluation is misaligned, and the closed commercial frontier is pulling ahead of public benchmarks via a single proprietary bridge. The two flagship releases this week pressure-test every one of those claims at once. Around them, four arXiv papers and two agent-SDK releases sharpen the evaluation and architectural debate that the flagships left open.

### Flagship 1 — OpenAI Realtime API GA and GPT-Realtime-2

On May 7, OpenAI graduated the Realtime API out of beta and shipped three new audio models: [`gpt-realtime-2`](https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/), `gpt-realtime-translate`, and `gpt-realtime-whisper`. The flagship adds **GPT-5-class reasoning** into the realtime path, lifts context from 32k to 128k tokens, and exposes a `reasoning_effort` knob with five tiers from minimal to xhigh. OpenAI's own scoreboard reports +15.2 pp on Big Bench Audio and +13.8 pp on Audio MultiChallenge over Realtime-1.5 at the higher reasoning tier. Audio billing is $32 / 1M input, $64 / 1M output, $0.40 / 1M cached.

Three implications against the series. **The reasoning-realtime sub-category from [article 08](/blog/sts-model-landscape) is now the platform default.** With OpenAI exposing `reasoning_effort` as a first-class API parameter, the choice between low-latency conversation and deliberate reasoning stops being a model selection problem and becomes a request parameter. **The `gpt-realtime-translate` release lands directly on Kyutai Hibiki's territory** — a closed-commercial S2ST at per-minute pricing changes the procurement question for any team that was evaluating Hibiki because it was the only option. **Artificial Analysis is no longer the only proprietary bridge.** [Article 07](/blog/why-new-benchmarks) singled out AA as the single non-reproducible gateway through which commercial STS scoreboards flow; OpenAI now cites Big Bench Audio and Audio MultiChallenge lifts in absolute pp terms via its own runner, shifting the bridge from AA-via-vendor to vendor-direct on the reasoning axis.

### Flagship 2 — TML-Interaction-Small, the first VAD-free 5-family entrant

On May 12, Mira Murati's [Thinking Machines Lab](https://thinkingmachines.ai/blog/interaction-models/) announced **TML-Interaction-Small**, a 276B-parameter mixture-of-experts model with 12B active parameters. The single most consequential detail is that it is **VAD-free and codec-light**: dMel embeddings for audio, hMLP for 40×40 video patches, a flow head for audio decoding, all early-fused and decoded in 200 ms time-aligned micro-turns. Standard voice-activity detection is replaced by model-internal signals tracking whether speakers are thinking, yielding, self-correcting, or inviting response. Turn-taking latency is 0.40 s on FD-Bench v1; interaction quality is 77.8 / 100 on FD-Bench v1.5, ahead of GPT-Realtime-2 and Gemini 3.1 Flash Live on dynamics. On pure intelligence it trails (43.4% vs 48.5% on Audio MultiChallenge APR).

[Article 03](/blog/pipeline-to-integrated)'s four-family taxonomy needs a fifth slot — encoder-free early fusion with concurrent audio-video-text streams and time-aligned micro-turns is its own architectural shape. [Article 02](/blog/full-duplex-threshold)'s co-completion gap is being closed from outside the FD-Bench harness: TML's TimeSpeak (proactive timing, 64.7%) and CueSpeak (verbal-cue response, 81.7%) measure exactly the behaviours FD-Bench v1 leaves on the table. [Article 07](/blog/why-new-benchmarks)'s requirement #4 — open methodology including judge selection — is also the bar TML's own benchmarks fail; the four vendor-published axes inherit the same structural problem as Artificial Analysis. [Article 05](/blog/foundation-before-vertical)'s 100k–500k hour foundation-data hypothesis is not falsifiable from this release because TML disclosed nothing about training corpus size or composition.

### Around the flagships — voice-agent evaluation matures

Two concrete benchmarks landed within a day of each other. [EVA-Bench](https://arxiv.org/abs/2605.13841) from ServiceNow orchestrates bot-to-bot audio conversations across 213 enterprise scenarios with automatic user-simulator validation, then scores each run on EVA-A (task completion, faithfulness, audio-level speech fidelity) and EVA-X (conversation progression, conciseness, turn-taking timing). No current system clears 0.5 on both pass@1 metrics simultaneously, and peak-versus-reliable performance still diverges by a median 0.44 pass@k − pass^k gap on EVA-A. The complementary angle from [From Text to Voice](https://arxiv.org/abs/2605.15104) converts verified text tool-calling benchmarks into audio via TTS, speaker variation, and noise so the gold labels carry across; on audio-converted Confetti and When2Call, Gemini-3.1-Flash-Live tops Confetti at 70.4 and GPT-Realtime-1.5 tops When2Call at 71.9, with text-to-voice gaps of 1.8–4.8 points traced mostly to misunderstood argument values in speech. Together they start to put concrete numbers on the evaluation gap article 07 has been pointing at.

### Architecture and streaming

[How Should LLMs Listen While Speaking?](https://arxiv.org/abs/2605.10199) from CUHK reframes full-duplex as a routing question and compares channel fusion (user stream into the LLM input directly — stronger semantic grounding, more vulnerable to context corruption under interruption-overlap) against cross-attention routing (user stream as external memory — weaker grounding but more robust under overlap). The paper formalises the tradeoff that TML's encoder-free early fusion implicitly resolves on the channel-fusion side. [Streaming Speech-to-Text Translation with a SpeechLLM](https://arxiv.org/abs/2605.14766) from Samsung Cambridge adds an adaptive emit-or-wait decision so the SpeechLLM streams as soon as enough audio has been heard, reaching translation quality close to the non-streaming baseline at 1–2 s of latency — narrowing the gap between cascade and SpeechLLM systems on the same task `gpt-realtime-translate` just commercialised.

### Agent SDK layer

[livekit-agents 1.5.9](https://github.com/livekit/agents/releases/tag/livekit-agents%401.5.9) ships Answering Machine Detection — the agent classifies the start of an outbound call (person / voicemail / IVR / unreachable) so it can branch behaviour — plus a Perplexity LLM plugin, Rime WebSocket streaming TTS, media-resolution control on Gemini Realtime, expanded ElevenLabs WebSocket and Inworld TTS options, and a scheduling-deadlock fix surfaced by 1.5.x. [Pipecat 1.2.0](https://github.com/pipecat-ai/pipecat/releases/tag/v1.2.0) is the broader release of the two: an opt-in tool-change message hook that mitigates tool-call hallucination when the advertised tool set changes mid-conversation, a DeferredUserTurnStopStrategy that splits inference-trigger from finalisation, a `max_buffer_delay_ms` knob for Cartesia TTS, a `mip_opt_out` flag for Deepgram TTS, and per-session `session_id` plumbing through the local runner. A v1.2.1 patch the next day fixed bot hangs when `filter_incomplete_user_turns` is enabled and the LLM responds by calling a tool.

### What this week answered, and what it did not

The series asked four open questions across articles 02–08. Two are now sharper. *Will reasoning-realtime stay a sub-category or fold into the platform default?* It folded. *Can the FD-Bench family widen to cover co-completion?* Not from inside the harness yet; from outside, TML's TimeSpeak / CueSpeak hit the target. Two remain open. *Will the Artificial Analysis bridge become reproducible or be replaced?* Neither — vendor-direct citation grew, open methodology did not. *Will the foundation-data threshold come into view?* No — TML's training corpus is undisclosed; OpenAI's has been undisclosed since GPT-4o. One new question, not previously in the series: **does a 5-family taxonomy make article 03 better or worse?** A taxonomy with a vendor-of-one fifth slot is a weaker organising tool, not a stronger one — unless a second lab ships something architecturally adjacent to TML inside the next two quarters.

---

*Corrections to [hello@fullduplex.ai](mailto:hello@fullduplex.ai).*


## Entries

### EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

- **Type**: paper
- **Source**: arXiv — <https://arxiv.org/abs/2605.13841>
- **Byline**: Bogavelli, Melançon, Stankiewicz, Bamgbose, Riols, Nguyen et al. (ServiceNow)
- **Confidence**: high
- **Tags**: benchmark, voice-agent, evaluation, turn-taking
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-001>

End-to-end voice-agent benchmark that orchestrates bot-to-bot audio conversations across 213 enterprise scenarios with automatic user-simulator validation, then scores each run on two composite metrics: EVA-A (task completion, faithfulness, audio-level speech fidelity) and EVA-X (conversation progression, conciseness, turn-taking timing). Reports that no current system clears 0.5 on both EVA-A pass@1 and EVA-X pass@1, with a median 0.44 pass@k − pass^k gap on EVA-A separating peak from reliable performance. Open-source release.

**Related**

- Benchmarks: [voiceagenteval](https://fullduplex.ai/benchmarks#voiceagenteval), [vocalbench](https://fullduplex.ai/benchmarks#vocalbench), [mtalk-bench](https://fullduplex.ai/benchmarks#mtalk-bench), [fdb-v3](https://fullduplex.ai/benchmarks#fdb-v3)
- Articles: [why-new-benchmarks](https://fullduplex.ai/blog/why-new-benchmarks), [benchmark-landscape](https://fullduplex.ai/blog/benchmark-landscape), [full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold)

---

### How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

- **Type**: paper
- **Source**: arXiv — <https://arxiv.org/abs/2605.10199>
- **Byline**: Lu, Chen, Wang, Peng, Kang, Wu (CUHK)
- **Confidence**: high
- **Tags**: full-duplex, architecture, interruption, speech-lm
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-002>

Reframes full-duplex modelling as a user-stream routing question and compares two strategies under a shared training pipeline. Channel fusion injects the user stream into the LLM input directly and yields stronger semantic grounding and better question answering. Cross-attention routing keeps the user stream as external memory and is more robust to context corruption under semantically overlapping interruptions. The paper formalises the tradeoff between semantic integration and context robustness rather than declaring a winner.

**Related**

- Benchmarks: [fdb-v3](https://fullduplex.ai/benchmarks#fdb-v3), [talking-turns](https://fullduplex.ai/benchmarks#talking-turns)
- Articles: [full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold), [sts-model-landscape](https://fullduplex.ai/blog/sts-model-landscape), [pipeline-to-integrated](https://fullduplex.ai/blog/pipeline-to-integrated)

---

### Streaming Speech-to-Text Translation with a SpeechLLM

- **Type**: paper
- **Source**: arXiv — <https://arxiv.org/abs/2605.14766>
- **Byline**: Parcollet, Zhang, Zheng, van Dalen (Samsung AI Center, Cambridge)
- **Confidence**: high
- **Tags**: speech-translation, streaming, speech-lm, latency
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-003>

Adds an adaptive emit-or-wait decision to a SpeechLLM so it streams output as soon as it has heard enough audio, instead of waiting for a full utterance or emitting at fixed intervals. The model is trained on automatic alignments of source audio and target text. Across language pairs, the system reaches translation quality close to the non-streaming baseline at 1–2 seconds of latency, narrowing the gap between cascade and SpeechLLM systems on a key real-time task.

**Related**

- Models: [seamless-m4t-v2](https://fullduplex.ai/models#seamless-m4t-v2), [hibiki](https://fullduplex.ai/models#hibiki)
- Articles: [sts-model-landscape](https://fullduplex.ai/blog/sts-model-landscape), [pipeline-to-integrated](https://fullduplex.ai/blog/pipeline-to-integrated)

---

### From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

- **Type**: paper
- **Source**: arXiv — <https://arxiv.org/abs/2605.15104>
- **Byline**: Laskar, Fu, Sarfjoo, McNamara, Robertson, TN
- **Confidence**: high
- **Tags**: benchmark, voice-agent, tool-calling, evaluation
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-004>

Dataset-agnostic framework that converts verified text tool-calling benchmarks into audio using TTS, speaker variation, and environmental noise while preserving the original tool schema and gold labels. On audio-converted Confetti and When2Call, Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4) and GPT-Realtime-1.5 leads When2Call (71.9). Text-to-voice gaps range from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5, with most failures traced to misunderstandings of argument values in the speech.

**Related**

- Models: [openai-realtime](https://fullduplex.ai/models#openai-realtime), [gemini-3-live](https://fullduplex.ai/models#gemini-3-live), [qwen3-omni](https://fullduplex.ai/models#qwen3-omni)
- Benchmarks: [voicebench](https://fullduplex.ai/benchmarks#voicebench), [voiceagenteval](https://fullduplex.ai/benchmarks#voiceagenteval)
- Articles: [why-new-benchmarks](https://fullduplex.ai/blog/why-new-benchmarks), [sts-model-landscape](https://fullduplex.ai/blog/sts-model-landscape)

---

### OpenAI Realtime API GA — gpt-realtime-2, gpt-realtime-translate, gpt-realtime-whisper

- **Type**: model
- **Source**: lab blog — <https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/>
- **Byline**: OpenAI
- **Confidence**: high
- **Tags**: realtime-api, voice-agent, ga, reasoning, translation, s2st
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-005>

Realtime API moves to GA on 2026-05-07 with three new audio models. `gpt-realtime-2` is the first voice model with GPT-5-class reasoning, lifts context from 32k to 128k tokens, and exposes a five-tier `reasoning_effort` parameter. OpenAI reports +15.2 pp on Big Bench Audio and +13.8 pp on Audio MultiChallenge over Realtime-1.5 at the higher tier. `gpt-realtime-translate` covers 70+ → 13 languages at $0.034 / min — the closed counterpart to Hibiki — and `gpt-realtime-whisper` ships streaming STT at $0.017 / min.

**Related**

- Models: [openai-realtime](https://fullduplex.ai/models#openai-realtime), [gpt-realtime-translate](https://fullduplex.ai/models#gpt-realtime-translate), [gemini-3-live](https://fullduplex.ai/models#gemini-3-live), [hibiki](https://fullduplex.ai/models#hibiki)
- Benchmarks: [big-bench-audio](https://fullduplex.ai/benchmarks#big-bench-audio), [audio-multichallenge](https://fullduplex.ai/benchmarks#audio-multichallenge), [full-duplex-bench](https://fullduplex.ai/benchmarks#full-duplex-bench)
- Articles: [sts-model-landscape](https://fullduplex.ai/blog/sts-model-landscape), [full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold), [why-new-benchmarks](https://fullduplex.ai/blog/why-new-benchmarks)

---

### TML-Interaction-Small: VAD-free 276B-A12B multimodal interaction model

- **Type**: model
- **Source**: lab blog — <https://thinkingmachines.ai/blog/interaction-models/>
- **Byline**: Thinking Machines Lab
- **Confidence**: high
- **Tags**: full-duplex, mixture-of-experts, multimodal, vad-free, preview, speech-lm
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-006>

TML's first public release (2026-05-12) — a 276B mixture-of-experts with 12B active params that ingests audio, video, and text as concurrent streams and decodes in 200 ms time-aligned micro-turns. Encoder-free early fusion (dMel, hMLP, flow head) replaces VAD with model-internal yield / self-correct / invite signals. Reports 0.40 s turn-taking on FD-Bench v1 and 77.8 / 100 on FD-Bench v1.5, ahead of GPT-Realtime-2 and Gemini 3.1 Flash Live on dynamics. Ships four vendor benchmarks (TimeSpeak, CueSpeak, RepCount-A, ProactiveVideoQA). Research preview only; license undisclosed.

**Related**

- Models: [tml-interaction-small](https://fullduplex.ai/models#tml-interaction-small), [moshi](https://fullduplex.ai/models#moshi), [openai-realtime](https://fullduplex.ai/models#openai-realtime), [salmonn-omni](https://fullduplex.ai/models#salmonn-omni)
- Benchmarks: [fdb-v15](https://fullduplex.ai/benchmarks#fdb-v15), [audio-multichallenge](https://fullduplex.ai/benchmarks#audio-multichallenge), [tml-timespeak](https://fullduplex.ai/benchmarks#tml-timespeak), [tml-cuespeak](https://fullduplex.ai/benchmarks#tml-cuespeak)
- Articles: [pipeline-to-integrated](https://fullduplex.ai/blog/pipeline-to-integrated), [full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold), [why-new-benchmarks](https://fullduplex.ai/blog/why-new-benchmarks), [foundation-before-vertical](https://fullduplex.ai/blog/foundation-before-vertical)

---

### livekit-agents 1.5.9: Answering Machine Detection

- **Type**: model
- **Source**: GitHub — <https://github.com/livekit/agents/releases/tag/livekit-agents%401.5.9>
- **Byline**: LiveKit
- **Confidence**: high
- **Tags**: voice-agent, sdk, amd, outbound-call
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-007>

Introduces Answering Machine Detection (AMD), which listens to the start of an outbound call and classifies it as person, voicemail, IVR, or unreachable so the agent can branch its behaviour. The release also adds a Perplexity LLM plugin and Rime WebSocket streaming TTS, exposes media resolution on Gemini LLM and RealtimeModel, expands ElevenLabs WebSocket and Inworld TTS controls, and fixes a scheduling deadlock that could trigger when a pipeline task crashed. A long tail of fixes covers AMD telemetry, warm-transfer dtmf and ringing_timeout, and OpenAI Realtime auto-tool reply lifecycle.

**Related**

- Models: [livekit-agents](https://fullduplex.ai/models#livekit-agents), [openai-realtime](https://fullduplex.ai/models#openai-realtime)
- Articles: [full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold)

---

### Pipecat 1.2.0: mid-conversation tool changes and deferred user-turn stop

- **Type**: model
- **Source**: GitHub — <https://github.com/pipecat-ai/pipecat/releases/tag/v1.2.0>
- **Byline**: Pipecat
- **Confidence**: high
- **Tags**: voice-agent, sdk, tool-calling, turn-taking
- **Verified**: 2026-05-18
- **Permalink**: <https://fullduplex.ai/signals/2026-W21#2026-w21-008>

Adds an opt-in tool-change message hook that appends a developer-role message when the advertised tool set changes mid-conversation, mitigating tool-call hallucination patterns when tools come and go. Introduces DeferredUserTurnStopStrategy that splits inference-trigger from finalisation. Adds max_buffer_delay_ms for Cartesia TTS server-side buffering, mip_opt_out for Deepgram TTS, and threads per-session session_id through the local runner. A v1.2.1 patch (May 15) fixes bot hangs when filter_incomplete_user_turns is enabled and the LLM calls a tool.

**Related**

- Models: [pipecat](https://fullduplex.ai/models#pipecat), [cartesia-sonic](https://fullduplex.ai/models#cartesia-sonic)
- Articles: [full-duplex-threshold](https://fullduplex.ai/blog/full-duplex-threshold)