Speech-to-speech AI, a primer.
What changed in 2024, what the words mean, and why a new class of models treats speech as a first-class language rather than a pipeline of text conversions.
The telephone moment
In a natural conversation between two humans, the gap between one person finishing and the other starting averages about 200 milliseconds. That is roughly the length of a blink. It is also one of the most stable numbers in human behavior, measured the same way across ten very different languages from Japanese to Yélî Dnye. Until 2024, voice assistants needed about a full second to do the same thing. That difference is the difference between a conversation and a transaction, and it is why the voice AI demos of the last eighteen months feel qualitatively different from anything before.
Here is a simple way to hold the shift in your head. Old voice assistants worked like a walkie-talkie. One side presses the button, speaks a complete thought, releases, and waits. Interruptions break it. Overlaps break it. Listening and speaking are separate modes and only one happens at a time. The new systems work like a telephone. Two people, two open channels, both able to listen and speak at once, able to interrupt and be interrupted, able to murmur mhm while the other person is still talking.
This is a primer on what changed, what the new models are actually doing, and why the terms you are about to see — speech-to-speech, full-duplex, audio foundation model — are worth distinguishing carefully.
Four words you will see everywhere
Four phrases do most of the work in this field, and they overlap in ways that quietly trip people up.
- speech-to-speech (STS)
- A model that takes audio in and emits audio out, without converting to text as an intermediate step. The model does its thinking in a representation that lives closer to sound than to written language.
- full-duplex
- Describes how the conversation flows. A full-duplex system can listen and speak at the same time, the way a telephone can. Half-duplex has to finish one before starting the other, the way a walkie-talkie does.
- audio foundation model
- A big pretrained model that understands and generates audio. Foundation is borrowed from the text world: pretrained on a broad corpus, adaptable to many tasks. An audio foundation model does the same thing but with waveforms as its native material.
- speech language model (SpeechLM)
- A large model that treats speech the way GPT treats text: a sequence of discrete tokens, predicted one after another. SpeechLMs are usually built on top of a neural audio codec that converts waveforms into tokens.
These terms overlap but are not interchangeable. Moshi, the open-source system Kyutai released in late 2024, is all four at once: a speech-to-speech model, full-duplex, a foundation model for audio, and a speech language model. VALL-E, an earlier Microsoft system, is a SpeechLM but only for text-to-speech, not STS. A traditional cascade of ASR plus LLM plus TTS is speech-in and speech-out at the system level, but there is no STS model at its core.
Before looking at the full landscape, it helps to separate three different kinds of voice AI product that sometimes get lumped together. A speech-to-text service turns audio into a transcript. A text-to-speech service turns text into audio. A conversational AI system does both, in a loop, and has to decide what to say. The first two are components. The third is the system you actually talk to.
That distinction matters because the same brand can appear in more than one layer. ElevenLabs sells a TTS service, an STT service, and a conversational AI product built on its own components. VAPI and Retell do not train speech models at all. They orchestrate Deepgram plus an LLM plus ElevenLabs into a voice agent. Moshi and OpenAI's Realtime API sit in a different place on the map. They are the model itself, not a pipeline of third-party components.
How audio becomes tokens (optional)
A language model works on discrete tokens, not on raw audio. Before any of the approaches above can work on speech, there has to be a way to turn a waveform into a sequence of discrete units and back again, without losing too much of what made the audio sound human.
That job falls to a neural audio codec. Think of it as MP3 encoding with one extra trick. Like MP3, it compresses a waveform into a much smaller representation. Unlike MP3, the compressed representation is a sequence of integers that a language model can read and write directly.
Aside: the trick inside most modern codecs is called residual vector quantization (RVQ). It is a stack of small dictionaries in which each layer encodes what the previous layer missed. Five layers with a vocabulary of 320 each can describe more acoustic variation than a single flat vocabulary of a billion entries. If that is interesting, the SoundStream and Moshi papers walk through it. If not, skip ahead.
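For readers who kept going, here is a minimal sketch of RVQ in Python with NumPy. The layer count and codebook size come from the aside above; the random codebooks stand in for the learned ones a real codec would use, so the numbers are purely illustrative.

```python
# A minimal sketch of residual vector quantization (RVQ).
# Codebooks are random here; in a real codec they are learned.
import numpy as np

rng = np.random.default_rng(0)
n_layers, codebook_size, dim = 5, 320, 64
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_layers)]

def rvq_encode(frame_embedding):
    """Quantize one frame embedding into n_layers integer tokens.

    Each layer quantizes whatever the previous layers failed to capture
    (the residual), which is why a stack of small codebooks covers far
    more acoustic detail than one flat codebook of the same total size.
    """
    residual = frame_embedding
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest entry
        tokens.append(idx)
        residual = residual - cb[idx]  # pass on what this layer missed
    return tokens

def rvq_decode(tokens):
    """Reconstruct the embedding by summing the chosen entries."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

frame = rng.normal(size=dim)
tokens = rvq_encode(frame)   # five small integers, one per layer
approx = rvq_decode(tokens)
print(tokens, float(np.linalg.norm(frame - approx)))
```

Five layers of 320 entries can address 320^5 combinations, over three trillion, while transmitting only five small integers per frame. With published codecs running at somewhere between roughly 12.5 and 75 frames per second, a second of audio becomes on the order of a hundred to a few hundred integers rather than the tens of thousands of raw samples it started as.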
How we got here, in four years
The new wave of voice AI did not fall out of the sky in late 2024. It is the visible end of a research arc that began quietly around 2021 and gathered pace each year since.
In early 2021, Meta published Generative Spoken Language Modeling (GSLM). It showed something that, at the time, felt almost heretical: you could train a language model on raw speech with no text at all, by clustering speech features into pseudo-words and then modeling the sequence of those units. The speech did not have to pass through writing to be learnable.
Later that year, Google released SoundStream, the neural audio codec that delivered the RVQ trick. Together, GSLM and SoundStream were the grammar and the alphabet for a future speech language model.
In 2022, Google combined the two with its AudioLM system, which introduced a hierarchy of semantic tokens and acoustic tokens. Semantic tokens carried content, acoustic tokens carried voice. Also in 2022, Meta's follow-up dGSLM extended GSLM from monologue to two-speaker dialogue, trained on the Fisher corpus, and produced the first textless model with natural turn-taking behavior.
In 2023, two systems generalized the approach in different directions. Microsoft's VALL-E used the codec-plus-language-model recipe for high-quality text-to-speech, cloning a voice from a three-second sample. Fudan's SpeechGPT plugged speech tokens into a text LLM's vocabulary and produced one of the first models that could take a spoken instruction and answer in speech, end to end.
Then, in September 2024, Kyutai released Moshi. Open weights, Apache license, running on a single GPU. The first real-time, full-duplex, speech-text foundation model available to anyone who wanted to study it. That is the moment the research arc met the demo stage, and it is why the second half of 2024 felt different from the first half.
A parallel thread runs through Google's Translatotron (2019 and 2021), which did direct speech-to-speech translation without text. It sat outside the LLM lineage but proved that text is not a mandatory intermediate step for voice.
What the new architecture actually does
Moshi is the clearest public example of how these models are put together. Understanding its shape helps make the whole category concrete.
Moshi models two audio streams at once. One is the user's channel, the audio coming in. The other is the model's own channel, the audio going out. Both streams are represented as the same kind of tokens, produced by Kyutai's Mimi codec, and both are predicted by the same network. That is what gives the model the structural ability to listen and speak simultaneously. There is no push-to-talk state, no moment when the model stops hearing in order to speak.
Alongside the two audio streams, Moshi maintains a third stream: a time-aligned text transcript of what the model itself is saying. At each 80 millisecond frame, the model first predicts a text token, then predicts the audio tokens for that frame. The text token acts as a kind of inner monologue, a semantic handle that lets the model reason linguistically while generating audio. This technique, which Kyutai calls Inner Monologue, keeps the spoken output coherent over long turns.
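A rough structural sketch of that per-frame loop follows. The method names on `model` are hypothetical placeholders, not Moshi's real interface; the ordering, and the fact that listening never stops while speaking, is the point.

```python
# A rough structural sketch of one 80 ms step of a Moshi-style model.
# Method names on `model` are hypothetical placeholders, not Moshi's API.

FRAME_MS = 80  # one model step per 80 ms frame of audio

def conversation_step(model, state, user_audio_frame):
    # 1. Tokenize the incoming user audio and feed it in. There is no
    #    push-to-talk state: this happens on every frame, even while the
    #    model is producing its own speech.
    user_tokens = model.encode_audio(user_audio_frame)
    state = model.observe(state, user_tokens)

    # 2. Inner monologue: predict the text token for what the model is
    #    about to say in this frame, giving it a linguistic handle on its
    #    own output before committing to sound.
    text_token, state = model.predict_text_token(state)

    # 3. Predict this frame's audio tokens for the model's own stream,
    #    conditioned on that text token.
    audio_tokens, state = model.predict_audio_tokens(state, text_token)

    # 4. Decode the tokens back to a waveform chunk to be played out.
    return model.decode_audio(audio_tokens), state
```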
Every part of the label real-time, full-duplex, speech-text foundation model is literal. Each word earns its place.
What the cascade cannot do
The old pipeline — ASR to text LLM to TTS — still works, and in many narrow domains it works very well. OpenAI's developer documentation on voice agents frames voice as two valid tracks: chained pipelines (reliable, easier to debug) and speech-to-speech models (lower latency, more natural conversation). The argument here is narrower than a dismissal of the cascade: on two specific dimensions, it is structurally disadvantaged by design.
Those two dimensions concern the quality of natural conversation, not just its speed, and faster components do not close either of them.
- Paralinguistic loss. Speech carries two kinds of information. The words, and everything about how the words were said: pitch, prosody, emotion, timbre, rate, breath. When ASR converts speech to text, the second channel is thrown away entirely. A text LLM cannot recover information that was never passed to it. Sarcasm becomes sincerity; a panicked question comes back at conversational pace. OpenAI made a similar point in its Realtime API release notes: stitched pipelines tend to lose emotion, emphasis, and accents.
- Error propagation. Each stage is independently trained on its own task, and none of them sees the full audio. An ASR mistake on a homophone changes the meaning the LLM reasons about, and the error cannot be corrected downstream because the downstream stages never saw the original waveform. The TTS pronounces the wrong answer with perfect clarity, which is worse than a garbled one, because it sounds confident. The sketch after this list reduces the cascade to its data flow to make both failure modes concrete.
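Here is that argument reduced to code. The three stub components below are placeholders, not any vendor's API; what matters is that only a string crosses each boundary, so anything not expressible in that string is gone before the next stage runs.

```python
# The cascade reduced to its data flow. The stub components are placeholders,
# not any vendor's API; the point is the type that crosses each boundary.

def asr(audio: bytes) -> str:
    # Stand-in recognizer. Pitch, pace, emotion, and breath in `audio` are
    # discarded at this boundary, because only a string comes out.
    return "the words, and nothing but the words"

def llm(transcript: str) -> str:
    # Stand-in language model. It reasons over the transcript alone; it has
    # no access to the original waveform, so an upstream mistake sticks.
    return f"a reply to: {transcript}"

def tts(reply: str) -> bytes:
    # Stand-in synthesizer. It invents a delivery, since the user's actual
    # delivery never made it this far.
    return reply.encode()

def cascade_turn(user_audio: bytes) -> bytes:
    return tts(llm(asr(user_audio)))
```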
What STS actually solves
Pulling the threads together, an integrated speech-to-speech model is structurally better positioned on three capabilities where the cascade is disadvantaged by design.
Three capabilities, summarized.
- Latency at the conversational threshold. Around 200 ms instead of around 1,000 ms in the best reported measurements, and those numbers now come from shipping systems rather than only from research papers.
- Paralinguistic signal preserved. Prosody, emotion, rate, and affect are carried through rather than discarded and reinvented. The best demos from this generation sound like they are responding to how you spoke, not just what you said.
- Natural turn-taking. Because the architecture models two audio channels at once, overlaps, interruptions, and backchannels behave the way they do in human conversation. Duplex is built into the model, not bolted on.
The reason the gap between 1,000 ms and 200 ms matters beyond demos is the size of the categories voice is the natural interface for. Contact centers and customer service. Healthcare intake and clinical triage. Language tutoring and test preparation. Real-time accessibility. Enterprise tools for meetings, outbound calls, and note-taking. Each of these has tens to hundreds of millions of daily users who would rather speak than type, and for whom the cascade era of voice AI has been serviceable without being enjoyable.
None of this means the category is finished. STS models still hallucinate, and when they do, there is no intermediate text transcript to point to, so debugging is harder. Specialized ASR and TTS still beat foundation models in narrow, high-accuracy domains. On evaluation, the first STS-native benchmarks have appeared: Full-Duplex-Bench (turn-taking and interruption) and URO-Bench (paralinguistic understanding and response). There is still no single dominant end-to-end standard for judging whether an STS agent is good. Those are the threads later articles in this series pick up.
With gpt-realtime generally available, Gemini Live on Vertex, and open-weight models like Moshi and Sesame CSM downloadable, the architecture side of STS is rapidly becoming a commodity. What separates a demo from a product that works across accents, emotional registers, and full conversational turns is the data the model was trained on.
What comes next: data
Full-duplex models have to learn from conversations that actually look like conversations. Two channels, one per speaker. Overlap left intact. Paralinguistic signals preserved. Not read speech, not scripted dialog, not bulked-up monologue transcripts.
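To make "two channels, overlap left intact" concrete, here is a small sketch that estimates how much of a stereo conversation, one speaker per channel, contains simultaneous speech. The soundfile dependency, the 50 ms window, and the -40 dB activity threshold are illustrative choices, not a standard.

```python
# Estimate how much of a two-channel conversation is overlapping speech.
# Assumes a stereo file with one speaker per channel; thresholds are
# illustrative, not a standard.
import numpy as np
import soundfile as sf  # external dependency, used here for brevity

def overlap_ratio(path: str, win_s: float = 0.05, thresh_db: float = -40.0) -> float:
    audio, sr = sf.read(path)                       # shape: (samples, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expects one speaker per channel"
    win = int(sr * win_s)
    n = audio.shape[0] // win
    frames = audio[: n * win].reshape(n, win, 2)
    rms = np.sqrt((frames ** 2).mean(axis=1))       # energy per window, per channel
    active = 20 * np.log10(rms + 1e-9) > thresh_db  # speech-like energy per channel
    both = (active[:, 0] & active[:, 1]).mean()     # windows where both speak
    either = (active[:, 0] | active[:, 1]).mean()   # windows where anyone speaks
    return float(both / max(either, 1e-9))

# e.g. overlap_ratio("two_channel_call.wav"). Natural conversation tends to show
# a noticeable fraction of simultaneous speech; a dataset that reports zero
# overlap was probably segmented into clean turns before release.
```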
What is scarce is not speech data in general, but clean speaker-wise full-duplex conversational audio at scale. Conversational speech can scale: J-CHAT (2024) is a 76,000-hour Japanese dialogue corpus from the public web. Full-duplex-specific work such as InteractSpeech (2025) and DialogueSidon (2026) is still measured in the low hundreds of hours. The open ceiling for clean two-channel conversation remains Fisher, a 1,960-hour corpus collected by LDC in 2004. Moshi trained on it. Nearly every serious full-duplex effort does. Frontier models are already operating at scales where 2,000 hours of conversational audio is a starting point, not a ceiling. The gap between what the next generation of STS models needs and what is actually available, licensed, and two-channel, is the practical bottleneck the rest of this series looks at.
That is where we go next. What is in the public datasets, what is missing from them, what a full-duplex training set actually has to contain, and what it takes to build one at the scale the models now demand.
We index the STS / full-duplex / audio foundation-model landscape so you don't have to.
Benchmarks, models, datasets — curated and kept current. Jump to any of the three lists below, or join the community if you want to help us keep it honest.