Speech-to-speech AI, a primer.
What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions.
The telephone moment
In a natural conversation between two humans, the gap between one person finishing and the other starting averages about 200 milliseconds. That is roughly the length of a blink. It is also one of the most stable numbers in human behavior, measured the same way across ten very different languages from Japanese to Yélî Dnye. Until 2024, voice assistants needed about a full second to do the same thing. That difference is the difference between a conversation and a transaction, and it is why the voice AI demos of the last eighteen months feel qualitatively different from anything before.
Here is a simple way to hold the shift in your head. Old voice assistants worked like a walkie-talkie. One side presses the button, speaks a complete thought, releases, and waits. Interruptions break it. Overlaps break it. Listening and speaking are separate modes and only one happens at a time. The new systems work like a telephone. Two people, two open channels, both able to listen and speak at once, able to interrupt and be interrupted, able to murmur mhm while the other person is still talking.
This is a primer on what changed, what the new models are actually doing, and why the terms you are about to see — speech-to-speech, full-duplex, audio foundation model — are worth distinguishing carefully.
Four words you will see everywhere
Four phrases do most of the work in this field, and they overlap in ways that quietly trip people up.
- speech-to-speech (STS)
- A model that takes audio in and emits audio out, without converting to text as an intermediate step. The model does its thinking in a representation that lives closer to sound than to written language.
- full-duplex
- Describes how the conversation flows. A full-duplex system can listen and speak at the same time, the way a telephone can. Half-duplex has to finish one before starting the other, the way a walkie-talkie does.
- audio foundation model
- A big pretrained model that understands and generates audio. Foundation is borrowed from the text world: pretrained on a broad corpus, adaptable to many tasks. An audio foundation model does the same thing but with waveforms as its native material.
- speech language model (SpeechLM)
- A large model that treats speech the way GPT treats text: a sequence of discrete tokens, predicted one after another. SpeechLMs are usually built on top of a neural audio codec that converts waveforms into tokens.
These terms overlap but are not interchangeable. Moshi, the open-source system Kyutai released in late 2024, is all four at once: a speech-to-speech model, full-duplex, a foundation model for audio, and a speech language model. VALL-E, an earlier Microsoft system, is a SpeechLM but only for text-to-speech, not STS. A traditional cascade of ASR plus LLM plus TTS is speech-in and speech-out at the system level, but there is no STS model at its core.
Before looking at the full landscape, it helps to separate three different kinds of voice AI product that sometimes get lumped together. A speech-to-text service turns audio into a transcript. A text-to-speech service turns text into audio. A conversational AI system does both, in a loop, and has to decide what to say. The first two are components. The third is the system you actually talk to.
That distinction matters because the same brand can appear in more than one layer. ElevenLabs sells a TTS service, an STT service, and a conversational AI product built on its own components. VAPI and Retell do not train speech models at all. They orchestrate Deepgram plus an LLM plus ElevenLabs into a voice agent. Moshi and OpenAI's Realtime API sit in a different place on the map. They are the model itself, not a pipeline of third-party components.
How audio becomes tokens (optional)
A language model works on discrete tokens, not on raw audio. Before any of the approaches above can work on speech, there has to be a way to turn a waveform into a sequence of discrete units and back again, without losing too much of what made the audio sound human.
That job falls to a neural audio codec. Think of it as MP3 encoding with one extra trick. Like MP3, it compresses a waveform into a much smaller representation. Unlike MP3, the compressed representation is a sequence of integers that a language model can read and write directly.
Aside: the trick inside most modern codecs is called residual vector quantization (RVQ). It is a stack of small dictionaries in which each layer encodes what the previous layer missed. Five layers with a vocabulary of 320 each can describe more acoustic variation than a single flat vocabulary of a billion entries. If that is interesting, the SoundStream and Moshi papers walk through it. If not, skip ahead.
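For readers who kept going, here is a minimal sketch of RVQ in Python with NumPy. The layer count and codebook size come from the aside above; the random codebooks stand in for the learned ones a real codec would use, so the numbers are purely illustrative.

```python
# A minimal sketch of residual vector quantization (RVQ).
# Codebooks are random here; in a real codec they are learned.
import numpy as np

rng = np.random.default_rng(0)
n_layers, codebook_size, dim = 5, 320, 64
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_layers)]

def rvq_encode(frame_embedding):
    """Quantize one frame embedding into n_layers integer tokens.

    Each layer quantizes whatever the previous layers failed to capture
    (the residual), which is why a stack of small codebooks covers far
    more acoustic detail than one flat codebook of the same total size.
    """
    residual = frame_embedding
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest entry
        tokens.append(idx)
        residual = residual - cb[idx]  # pass on what this layer missed
    return tokens

def rvq_decode(tokens):
    """Reconstruct the embedding by summing the chosen entries."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

frame = rng.normal(size=dim)
tokens = rvq_encode(frame)   # five small integers, one per layer
approx = rvq_decode(tokens)
print(tokens, float(np.linalg.norm(frame - approx)))
```

Five layers of 320 entries can address 320^5 combinations, over three trillion, while transmitting only five small integers per frame. With published codecs running at somewhere between roughly 12.5 and 75 frames per second, a second of audio becomes on the order of a hundred to a few hundred integers rather than the tens of thousands of raw samples it started as.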
How we got here, in four years
The new wave of voice AI did not fall out of the sky in late 2024. It is the visible end of a research arc that began quietly around 2021 and gathered pace each year since.
In early 2021, Meta published Generative Spoken Language Modeling (GSLM). It showed something that, at the time, felt almost heretical: you could train a language model on raw speech with no text at all, by clustering speech features into pseudo-words and then modeling the sequence of those units. The speech did not have to pass through writing to be learnable.
Later that year, Google released SoundStream, the neural audio codec that delivered the RVQ trick. Together, GSLM and SoundStream were the grammar and the alphabet for a future speech language model.
In 2022, Google combined the two with its AudioLM system, which introduced a hierarchy of semantic tokens and acoustic tokens. Semantic tokens carried content, acoustic tokens carried voice. Also in 2022, Meta's follow-up dGSLM extended GSLM from monologue to two-speaker dialogue, trained on the Fisher corpus, and produced the first textless model with natural turn-taking behavior.
In 2023, two systems generalized the approach in different directions. Microsoft's VALL-E used the codec-plus-language-model recipe for high-quality text-to-speech, cloning a voice from a three-second sample. Fudan's SpeechGPT plugged speech tokens into a text LLM's vocabulary and produced one of the first models that could take a spoken instruction and answer in speech, end to end.
Then, in September 2024, Kyutai released Moshi. Open weights, Apache license, running on a single GPU. The first real-time, full-duplex, speech-text foundation model available to anyone who wanted to study it. That is the moment the research arc met the demo stage, and it is why the second half of 2024 felt different from the first half.
A parallel thread runs through Google's Translatotron (2019 and 2021), which did direct speech-to-speech translation without text. It sat outside the LLM lineage but proved that text is not a mandatory intermediate step for voice.
What the new architecture actually does
Moshi is the clearest public example of how these models are put together. Understanding its shape helps make the whole category concrete.
Moshi models two audio streams at once. One is the user's channel, the audio coming in. The other is the model's own channel, the audio going out. Both streams are represented as the same kind of tokens, produced by Kyutai's Mimi codec, and both are predicted by the same network. That is what gives the model the structural ability to listen and speak simultaneously. There is no push-to-talk state, no moment when the model stops hearing in order to speak.
Alongside the two audio streams, Moshi maintains a third stream: a time-aligned text transcript of what the model itself is saying. At each 80 millisecond frame, the model first predicts a text token, then predicts the audio tokens for that frame. The text token acts as a kind of inner monologue, a semantic handle that lets the model reason linguistically while generating audio. This technique, which Kyutai calls Inner Monologue, keeps the spoken output coherent over long turns.
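A rough structural sketch of that per-frame loop follows. The method names on `model` are hypothetical placeholders, not Moshi's real interface; the ordering, and the fact that listening never stops while speaking, is the point.

```python
# A rough structural sketch of one 80 ms step of a Moshi-style model.
# Method names on `model` are hypothetical placeholders, not Moshi's API.

FRAME_MS = 80  # one model step per 80 ms frame of audio

def conversation_step(model, state, user_audio_frame):
    # 1. Tokenize the incoming user audio and feed it in. There is no
    #    push-to-talk state: this happens on every frame, even while the
    #    model is producing its own speech.
    user_tokens = model.encode_audio(user_audio_frame)
    state = model.observe(state, user_tokens)

    # 2. Inner monologue: predict the text token for what the model is
    #    about to say in this frame, giving it a linguistic handle on its
    #    own output before committing to sound.
    text_token, state = model.predict_text_token(state)

    # 3. Predict this frame's audio tokens for the model's own stream,
    #    conditioned on that text token.
    audio_tokens, state = model.predict_audio_tokens(state, text_token)

    # 4. Decode the tokens back to a waveform chunk to be played out.
    return model.decode_audio(audio_tokens), state
```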
Every part of the label real-time, full-duplex, speech-text foundation model is literal. Each word earns its place.
What the cascade cannot do
The old pipeline — ASR to text LLM to TTS — still works, and in many narrow domains it works very well. OpenAI's developer documentation on voice agents frames voice as two valid tracks: chained pipelines (reliable, easier to debug) and speech-to-speech models (lower latency, more natural conversation). The argument here is narrower than a dismissal of the cascade: on two specific dimensions, it is structurally disadvantaged by design.
Those two dimensions concern the quality of natural conversation, not just its speed, and faster components do not close either of them.
- Paralinguistic loss. Speech carries two kinds of information. The words, and everything about how the words were said: pitch, prosody, emotion, timbre, rate, breath. When ASR converts speech to text, the second channel is thrown away entirely. A text LLM cannot recover information that was never passed to it. Sarcasm becomes sincerity; a panicked question comes back at conversational pace. OpenAI made a similar point in its Realtime API release notes: stitched pipelines tend to lose emotion, emphasis, and accents.
- Error propagation. Each stage is independently trained on its own task, and none of them sees the full audio. An ASR mistake on a homophone changes the meaning the LLM reasons about, and the error cannot be corrected downstream because the downstream stages never saw the original waveform. The TTS pronounces the wrong answer with perfect clarity, which is worse than a garbled one, because it sounds confident. The sketch after this list reduces the cascade to its data flow to make both failure modes concrete.
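Here is that argument reduced to code. The three stub components below are placeholders, not any vendor's API; what matters is that only a string crosses each boundary, so anything not expressible in that string is gone before the next stage runs.

```python
# The cascade reduced to its data flow. The stub components are placeholders,
# not any vendor's API; the point is the type that crosses each boundary.

def asr(audio: bytes) -> str:
    # Stand-in recognizer. Pitch, pace, emotion, and breath in `audio` are
    # discarded at this boundary, because only a string comes out.
    return "the words, and nothing but the words"

def llm(transcript: str) -> str:
    # Stand-in language model. It reasons over the transcript alone; it has
    # no access to the original waveform, so an upstream mistake sticks.
    return f"a reply to: {transcript}"

def tts(reply: str) -> bytes:
    # Stand-in synthesizer. It invents a delivery, since the user's actual
    # delivery never made it this far.
    return reply.encode()

def cascade_turn(user_audio: bytes) -> bytes:
    return tts(llm(asr(user_audio)))
```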
What STS actually solves
Pulling the threads together, an integrated speech-to-speech model is structurally better positioned on three capabilities where the cascade is disadvantaged by design.
Three capabilities, summarized.
- Latency at the conversational threshold. Around 200 ms instead of around 1,000 ms in the best reported measurements, and those numbers now come from shipping systems rather than only from research papers.
- Paralinguistic signal preserved. Prosody, emotion, rate, and affect are carried through rather than discarded and reinvented. The best demos from this generation sound like they are responding to how you spoke, not just what you said.
- Natural turn-taking. Because the architecture models two audio channels at once, overlaps, interruptions, and backchannels behave the way they do in human conversation. Duplex is built into the model, not bolted on.
The reason the gap between 1,000 ms and 200 ms matters beyond demos is the size of the categories voice is the natural interface for. Contact centers and customer service. Healthcare intake and clinical triage. Language tutoring and test preparation. Real-time accessibility. Enterprise tools for meetings, outbound calls, and note-taking. Each of these has tens to hundreds of millions of daily users who would rather speak than type, and for whom the cascade era of voice AI has been serviceable without being enjoyable.
None of this means the category is finished. STS models still hallucinate, and when they do, there is no intermediate text transcript to point to, so debugging is harder. Specialized ASR and TTS still beat foundation models in narrow, high-accuracy domains. On evaluation, the first STS-native benchmarks have appeared: Full-Duplex-Bench (turn-taking and interruption) and URO-Bench (paralinguistic understanding and response). There is still no single dominant end-to-end standard for judging whether an STS agent is good. Those are the threads later articles in this series pick up.
With gpt-realtime generally available, Gemini Live on Vertex, and open-weight models like Moshi and Sesame CSM downloadable, the architecture side of STS is rapidly becoming a commodity. What separates a demo from a product that works across accents, emotional registers, and full conversational turns is the data the model was trained on.
What comes next: data
Full-duplex models have to learn from conversations that actually look like conversations. Two channels, one per speaker. Overlap left intact. Paralinguistic signals preserved. Not read speech, not scripted dialog, not bulked-up monologue transcripts.
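To make "two channels, overlap left intact" concrete, here is a small sketch that estimates how much of a stereo conversation, one speaker per channel, contains simultaneous speech. The soundfile dependency, the 50 ms window, and the -40 dB activity threshold are illustrative choices, not a standard.

```python
# Estimate how much of a two-channel conversation is overlapping speech.
# Assumes a stereo file with one speaker per channel; thresholds are
# illustrative, not a standard.
import numpy as np
import soundfile as sf  # external dependency, used here for brevity

def overlap_ratio(path: str, win_s: float = 0.05, thresh_db: float = -40.0) -> float:
    audio, sr = sf.read(path)                       # shape: (samples, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expects one speaker per channel"
    win = int(sr * win_s)
    n = audio.shape[0] // win
    frames = audio[: n * win].reshape(n, win, 2)
    rms = np.sqrt((frames ** 2).mean(axis=1))       # energy per window, per channel
    active = 20 * np.log10(rms + 1e-9) > thresh_db  # speech-like energy per channel
    both = (active[:, 0] & active[:, 1]).mean()     # windows where both speak
    either = (active[:, 0] | active[:, 1]).mean()   # windows where anyone speaks
    return float(both / max(either, 1e-9))

# e.g. overlap_ratio("two_channel_call.wav"). Natural conversation tends to show
# a noticeable fraction of simultaneous speech; a dataset that reports zero
# overlap was probably segmented into clean turns before release.
```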
What is scarce is not speech data in general, but clean speaker-wise full-duplex conversational audio at scale. Conversational speech can scale: J-CHAT (2024) is a 76,000-hour Japanese dialogue corpus from the public web. Full-duplex-specific work such as InteractSpeech (2025) and DialogueSidon (2026) is still measured in the low hundreds of hours. The open ceiling for clean two-channel conversation remains Fisher, a 1,960-hour corpus collected by LDC in 2004. Moshi trained on it. Nearly every serious full-duplex effort does. Frontier models are already operating at scales where 2,000 hours of conversational audio is a starting point, not a ceiling. The gap between what the next generation of STS models needs and what is actually available, licensed, and two-channel, is the practical bottleneck the rest of this series looks at.
That is where we go next. What is in the public datasets, what is missing from them, what a full-duplex training set actually has to contain, and what it takes to build one at the scale the models now demand.
We index the STS / full-duplex / audio foundation-model landscape so you don't have to.
Benchmarks, models, datasets — curated and kept current. Jump to any of the three lists below, or join the community if you want to help us keep it honest.