Two releases, one inflection week
The STS series argued, across nine articles, that voice AI in 2026 sits between the GPT-2 and GPT-3 moments: architecture is becoming a commodity, the bottleneck is foundation data, evaluation is misaligned, and the closed commercial frontier is pulling ahead of public benchmarks via a single proprietary bridge. The two flagship releases this week pressure-test every one of those claims at once. Around them, four arXiv papers and two agent-SDK releases sharpen the evaluation and architectural debate that the flagships left open.
Flagship 1 — OpenAI Realtime API GA and GPT-Realtime-2
On May 7, OpenAI graduated the Realtime API out of beta and shipped three new audio models: gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper. The flagship adds GPT-5-class reasoning into the realtime path, lifts context from 32k to 128k tokens, and exposes a reasoning_effort knob with five tiers from minimal to xhigh. OpenAI's own scoreboard reports +15.2 pp on Big Bench Audio and +13.8 pp on Audio MultiChallenge over Realtime-1.5 at the higher reasoning tier. Audio billing is $32 per 1M input tokens, $64 per 1M output tokens, and $0.40 per 1M cached tokens.
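To make the pricing concrete, here is a minimal sketch of the arithmetic. It assumes the per-1M unit refers to audio tokens and that cached tokens are counted separately from fresh input tokens; the function name and the example token counts are illustrative, not from OpenAI.

```python
# Rates quoted in the text, in USD per 1M units (assumed here to be audio tokens).
PRICE_PER_MILLION = {"input": 32.00, "output": 64.00, "cached": 0.40}


def audio_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate audio spend for one session from token counts.

    Assumption: cached tokens are passed separately and billed only at the
    cached rate, not additionally at the input rate.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * PRICE_PER_MILLION["input"]
        + output_tokens * PRICE_PER_MILLION["output"]
        + cached_tokens * PRICE_PER_MILLION["cached"]
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical session: 1M fresh input, 0.5M output, 2M cached tokens.
    print(f"${audio_cost_usd(1_000_000, 500_000, 2_000_000):.2f}")
```

Under these assumptions, output audio dominates the bill quickly: each output token costs twice an input token and 160 times a cached one.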
Three implications against the series. First, the reasoning-realtime sub-category from article 08 is now the platform default: with OpenAI exposing reasoning_effort as a first-class API parameter, the choice between low-latency conversation and deliberate reasoning stops being a model-selection problem and becomes a request parameter. Second, the gpt-realtime-translate release lands directly on Kyutai Hibiki's territory: a closed-commercial S2ST at per-minute pricing changes the procurement question for any team that was evaluating Hibiki because it was the only option. Third, Artificial Analysis is no longer the only proprietary bridge. Article 07 singled out AA as the single non-reproducible gateway through which commercial STS scoreboards flow; OpenAI now cites Big Bench Audio and Audio MultiChallenge lifts in absolute pp terms via its own runner, shifting the bridge from AA-via-vendor to vendor-direct on the reasoning axis.
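The shift from model selection to request parameter can be pictured with a small sketch. Everything here is hypothetical: the TurnRequest shape and build_turn_request helper are illustrations, not OpenAI's schema, and only the two endpoint tier names given in the text are used because the three tiers between them are not named there.

```python
from dataclasses import dataclass

# Only the endpoint tier names appear in the text; the middle tiers are omitted.
FAST_TIER = "minimal"
DELIBERATE_TIER = "xhigh"


@dataclass(frozen=True)
class TurnRequest:
    """Hypothetical request shape, not the actual Realtime API payload."""
    model: str
    reasoning_effort: str


def build_turn_request(needs_deliberation: bool, model: str = "gpt-realtime-2") -> TurnRequest:
    """Keep the model fixed and vary only the effort tier per request."""
    effort = DELIBERATE_TIER if needs_deliberation else FAST_TIER
    return TurnRequest(model=model, reasoning_effort=effort)


if __name__ == "__main__":
    print(build_turn_request(needs_deliberation=False))
    print(build_turn_request(needs_deliberation=True))
```

The point of the sketch is that both requests name the same model; the latency-versus-deliberation decision lives entirely in one field.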
Flagship 2 — TML-Interaction-Small, the first VAD-free 5-family entrant
On May 12, Mira Murati's Thinking Machines Lab announced TML-Interaction-Small, a 276B-parameter mixture-of-experts model with 12B active parameters. The single most consequential detail is that it is VAD-free and codec-light: dMel embeddings for audio, hMLP for 40×40 video patches, a flow head for audio decoding, all early-fused and decoded in 200 ms time-aligned micro-turns. Standard voice-activity detection is replaced by model-internal signals tracking whether speakers are thinking, yielding, self-correcting, or inviting response. Turn-taking latency is 0.40 s on FD-Bench v1; interaction quality is 77.8 / 100 on FD-Bench v1.5, ahead of GPT-Realtime-2 and Gemini 3.1 Flash Live on dynamics. On pure intelligence it trails (43.4% vs 48.5% on Audio MultiChallenge APR).
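As a rough illustration of what 200 ms time-aligned micro-turns mean at the sample level, the sketch below slices an audio buffer into fixed 200 ms windows with their start times. It is not TML's pipeline: the 16 kHz sample rate and the micro_turns helper are assumptions, and nothing here models dMel embeddings or the flow head.

```python
from typing import Iterator, List, Tuple

MICRO_TURN_MS = 200  # window length quoted in the text


def micro_turns(
    samples: List[float], sample_rate_hz: int = 16_000
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (start_ms, chunk) pairs covering the buffer in 200 ms windows.

    The final chunk may be shorter than a full window.
    The 16 kHz default is an assumption, not a disclosed TML setting.
    """
    chunk_len = sample_rate_hz * MICRO_TURN_MS // 1000
    for start in range(0, len(samples), chunk_len):
        start_ms = start * 1000 // sample_rate_hz
        yield start_ms, samples[start:start + chunk_len]


if __name__ == "__main__":
    one_second = [0.0] * 16_000
    for start_ms, chunk in micro_turns(one_second):
        print(start_ms, len(chunk))
```

Because every modality would be cut on the same 200 ms grid, a decision such as yielding or continuing can in principle be revisited at each window boundary rather than waiting for an external end-of-speech signal.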
Article 03's four-family taxonomy needs a fifth slot — encoder-free early fusion with concurrent audio-video-text streams and time-aligned micro-turns is its own architectural shape. Article 02's co-completion gap is being closed from outside the FD-Bench harness: TML's TimeSpeak (proactive timing, 64.7%) and CueSpeak (verbal-cue response, 81.7%) measure exactly the behaviours FD-Bench v1 leaves on the table. Article 07's requirement #4 — open methodology including judge selection — is also the bar TML's own benchmarks fail; the four vendor-published axes inherit the same structural problem as Artificial Analysis. Article 05's 100k–500k hour foundation-data hypothesis is not falsifiable from this release because TML disclosed nothing about training corpus size or composition.
Around the flagships — voice-agent evaluation matures
Two concrete benchmarks landed within a day of each other. EVA-Bench from ServiceNow orchestrates bot-to-bot audio conversations across 213 enterprise scenarios with automatic user-simulator validation, then scores each run on EVA-A (task completion, faithfulness, audio-level speech fidelity) and EVA-X (conversation progression, conciseness, turn-taking timing). No current system clears 0.5 on both pass@1 metrics simultaneously, and peak-versus-reliable performance still diverges by a median 0.44 pass@k − pass^k gap on EVA-A. The complementary paper, From Text to Voice, converts verified text tool-calling benchmarks into audio via TTS, speaker variation, and noise so the gold labels carry across; on audio-converted Confetti and When2Call, Gemini-3.1-Flash-Live tops Confetti at 70.4 and GPT-Realtime-1.5 tops When2Call at 71.9, with text-to-voice gaps of 1.8–4.8 points traced mostly to misunderstood argument values in speech. Together they start to put concrete numbers on the evaluation gap article 07 has been pointing at.
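The pass@k − pass^k gap is easier to read with a worked example. The sketch below uses the common combinatorial estimators: pass@k as the chance that at least one of k sampled runs succeeds, and pass^k as the chance that all k succeed. Whether EVA-Bench computes its metrics exactly this way is an assumption; the run counts are made up for illustration.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs drawn from n (c successes) passes."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k runs drawn from n (c successes) pass."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical scenario: 5 runs, 3 of them successful, k = 2.
    n, c, k = 5, 3, 2
    gap = pass_at_k(n, c, k) - pass_hat_k(n, c, k)
    print(f"pass@{k}={pass_at_k(n, c, k):.2f} pass^{k}={pass_hat_k(n, c, k):.2f} gap={gap:.2f}")
```

A system that succeeds on 3 of 5 runs looks strong on pass@2 (0.9) and weak on pass^2 (0.3); the gap is a measure of how much the headline number depends on retrying.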
Architecture and streaming
How Should LLMs Listen While Speaking? from CUHK reframes full-duplex as a routing question and compares channel fusion (user stream into the LLM input directly — stronger semantic grounding, more vulnerable to context corruption under interruption-overlap) against cross-attention routing (user stream as external memory — weaker grounding but more robust under overlap). The paper formalises the tradeoff that TML's encoder-free early fusion implicitly resolves on the channel-fusion side. Streaming Speech-to-Text Translation with a SpeechLLM from Samsung Cambridge adds an adaptive emit-or-wait decision so the SpeechLLM streams as soon as enough audio has been heard, reaching translation quality close to the non-streaming baseline at 1–2 s of latency — narrowing the gap between cascade and SpeechLLM systems on the same task gpt-realtime-translate just commercialised.
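The two routing options can be contrasted in a toy sketch. This is not the CUHK paper's implementation: it assumes channel fusion can be approximated by adding time-aligned user and agent frame vectors, and cross-attention routing by a single-head scaled dot-product read over user frames held as memory. All function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def channel_fusion(agent_frames: List[Vector], user_frames: List[Vector]) -> List[Vector]:
    """Mix the user stream directly into the LLM input, frame by frame."""
    if len(agent_frames) != len(user_frames):
        raise ValueError("streams must be time-aligned")
    return [[a + u for a, u in zip(af, uf)] for af, uf in zip(agent_frames, user_frames)]


def attention_weights(query: Vector, memory: List[Vector]) -> List[float]:
    """Softmax over scaled dot products between the query and each memory slot."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, slot)) / scale for slot in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_read(query: Vector, user_memory: List[Vector]) -> Vector:
    """Keep the user stream outside the input and read a weighted summary of it."""
    weights = attention_weights(query, user_memory)
    dim = len(user_memory[0])
    return [sum(w * slot[i] for w, slot in zip(weights, user_memory)) for i in range(dim)]


if __name__ == "__main__":
    agent = [[1.0, 0.0], [0.0, 1.0]]
    user = [[0.5, 0.5], [0.2, 0.1]]
    print(channel_fusion(agent, user))
    print(cross_attention_read(agent[-1], user))
```

In the fused version every user frame alters the input sequence itself, which is why overlap can corrupt context; in the cross-attention version the input stays clean and the model decides how much of the user memory to read.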
Agent SDK layer
livekit-agents 1.5.9 ships Answering Machine Detection, in which the agent classifies the start of an outbound call (person / voicemail / IVR / unreachable) so it can branch behaviour, plus a Perplexity LLM plugin, Rime WebSocket streaming TTS, media-resolution control on Gemini Realtime, expanded ElevenLabs WebSocket and Inworld TTS options, and a fix for a scheduling deadlock surfaced by 1.5.x. Pipecat 1.2.0 is the broader release of the two: an opt-in tool-change message hook that mitigates tool-call hallucination when the advertised tool set changes mid-conversation, a DeferredUserTurnStopStrategy that splits inference-trigger from finalisation, a max_buffer_delay_ms knob for Cartesia TTS, a mip_opt_out flag for Deepgram TTS, and per-session session_id plumbing through the local runner. A v1.2.1 patch the next day fixed bot hangs when filter_incomplete_user_turns is enabled and the LLM responds by calling a tool.
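The branching that Answering Machine Detection enables might look like the sketch below. It does not use the livekit-agents API: the CallStart enum mirrors the four outcomes named in the text, and the action names are hypothetical placeholders for whatever an application would actually do.

```python
from enum import Enum


class CallStart(Enum):
    """The four call-start outcomes named in the text."""
    PERSON = "person"
    VOICEMAIL = "voicemail"
    IVR = "ivr"
    UNREACHABLE = "unreachable"


# Hypothetical application-level actions, one per outcome.
ACTIONS = {
    CallStart.PERSON: "start_conversation",
    CallStart.VOICEMAIL: "leave_message",
    CallStart.IVR: "navigate_menu",
    CallStart.UNREACHABLE: "schedule_retry",
}


def next_action(outcome: CallStart) -> str:
    """Pick the agent's next step from the detected call-start outcome."""
    return ACTIONS[outcome]


if __name__ == "__main__":
    for outcome in CallStart:
        print(outcome.value, "->", next_action(outcome))
```

Keeping the mapping in one table makes it obvious when a new outcome is added without a corresponding behaviour.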
What this week answered, and what it did not
The series asked four open questions across articles 02–08. Two are now sharper. Will reasoning-realtime stay a sub-category or fold into the platform default? It folded. Can the FD-Bench family widen to cover co-completion? Not from inside the harness yet; from outside, TML's TimeSpeak / CueSpeak hit the target. Two remain open. Will the Artificial Analysis bridge become reproducible or be replaced? Neither — vendor-direct citation grew, open methodology did not. Will the foundation-data threshold come into view? No — TML's training corpus is undisclosed; OpenAI's has been undisclosed since GPT-4o. One new question, not previously in the series: does a 5-family taxonomy make article 03 better or worse? A taxonomy with a vendor-of-one fifth slot is a weaker organising tool, not a stronger one — unless a second lab ships something architecturally adjacent to TML inside the next two quarters.
Corrections to hello@fullduplex.ai.