Fullduplex/
the sts series · references

References & further reading

Every external source cited across the Fullduplex STS Series — papers, benchmarks, repositories, platforms, and corpora. Grouped by article, then by kind. Walk the graph.

05 articles indexed · 154 unique references · last updated apr 2026
article · 01

Speech-to-speech AI, a primer

22 references

The research arc that produced Moshi — six public papers over four years, plus the benchmarks and open-weight repos that ship with the modern voice stack.

Research papers · 16

  1. Stivers et al. (2009) — Universals in turn-taking. PNAS. Ten languages, same ~200 ms turn-gap. The foundational claim that the conversational threshold is a biological constant. [pnas.org · paper]
  2. GSLM (Meta, 2021). Generative Spoken Language Modeling — language modeling on raw speech, with no text at all. [arxiv.org · paper]
  3. SoundStream (Google, 2021). End-to-end neural audio codec. Introduced residual vector quantization (RVQ) as the alphabet for audio LMs; a toy sketch of the idea follows this list. [arxiv.org · paper]
  4. AudioLM (Google, 2022). A hierarchy of semantic + acoustic tokens. Bridged GSLM and SoundStream into a single audio language model. [arxiv.org · paper]
  5. dGSLM (Meta, 2022). Two-speaker dialogue extension of GSLM, trained on Fisher. The first textless model with natural turn-taking. [arxiv.org · paper]
  6. VALL-E (Microsoft, 2023). Codec + language-model recipe for high-quality TTS. Voice cloning from a three-second sample. [arxiv.org · paper]
  7. SpeechGPT (Fudan, 2023). Speech tokens plugged into an LLM vocabulary. Early end-to-end spoken-instruction-in, spoken-answer-out. [arxiv.org · paper]
  8. Translatotron (Google, 2019). Direct speech-to-speech translation without text — a parallel thread proving text is not a mandatory intermediate. [arxiv.org · paper]
  9. Translatotron 2 (Google, 2021). Follow-up to Translatotron with improved quality and robustness. [arxiv.org · paper]
  10. Moshi paper (Kyutai, 2024). First real-time, full-duplex speech-text foundation model, released under Apache with open weights. [arxiv.org · paper]
  11. X-Talk survey. Survey of modular voice systems with paralinguistic side-channels — the steel-man for the cascade approach. [arxiv.org · paper]
  12. Full-Duplex-Bench. The first benchmark for turn-taking and interruption handling in STS models. [arxiv.org · paper]
  13. URO-Bench. Paralinguistic understanding and response evaluation for speech-to-speech systems. [arxiv.org · paper]
  14. J-CHAT (2024). 76,000-hour Japanese dialogue corpus drawn from the public web. [arxiv.org · paper]
  15. InteractSpeech (EMNLP Findings, 2025). Full-duplex dataset work targeting interactive speech. [aclanthology.org · paper]
  16. DialogueSidon (2026). Recent dialogue dataset / model release, cited as a 2026 data point in the primer. [arxiv.org · paper]
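SoundStream's core trick is easiest to see in code. The sketch below is a toy numpy illustration of the RVQ encode/decode loop — the codebook sizes, frame dimension, and random codebooks are placeholders for illustration, not SoundStream's trained values.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each codebook stage quantizes the
    residual the previous stage left behind, yielding one index per stage."""
    residual = frame.copy()
    codes = []
    for cb in codebooks:                              # cb: (K, D) centroids
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]                 # pass the residual down
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is just the sum of the selected centroids."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

# Toy shapes: 8 stages of 1024 codes over a 128-dim frame embedding.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]
frame = rng.normal(size=128)
recon = rvq_decode(rvq_encode(frame, codebooks), codebooks)
```

The point for audio LMs: each frame of audio collapses into a handful of integers from fixed vocabularies — the alphabet the rest of the series builds on.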

Repositories & open weights · 02

  1. Moshi (Kyutai). Open-weights reference implementation of the Moshi full-duplex model. [github.com · repo]
  2. Sesame CSM. Open-weights conversational speech model from Sesame AI Labs. [github.com · repo]

Platforms & documentation · 03

  1. OpenAI — Voice Agents guide. Official framing of two valid tracks: chained pipelines vs. speech-to-speech. [platform.openai.com · platform]
  2. OpenAI — gpt-realtime release. Realtime API announcement; cites the loss of emotion, emphasis, and accents in stitched pipelines. [openai.com · platform]
  3. Gemini Live on Vertex AI. Google Cloud documentation for the Gemini Live API. [cloud.google.com · platform]

Corpora · 01

  1. Fisher English (LDC2004S13). 1,960-hour two-channel conversational English corpus collected by the LDC in 2004. Still the workhorse for dialogue training. [catalog.ldc.upenn.edu · corpus]
article · 02

The full-duplex threshold

27 references

Where the ~200 ms number comes from, and the small cluster of systems that have actually crossed it — plus the first benchmarks that can tell you so. A back-of-envelope latency budget follows the research-papers list.

Research papers · 12

  1. Stivers et al. (2009). PNAS — measured the ~200 ms turn-gap across ten languages. Reused here as the biological anchor. [pnas.org · paper]
  2. Levinson & Torreira (2015). Frontiers in Psychology — predictive processing of upcoming turn ends. [doi.org · paper]
  3. Magyari et al. (2015). Scientific Reports — brain activity during turn prediction in conversation. [nature.com · paper]
  4. Heldner & Edlund (2010). Journal of Phonetics — the distribution of silences and overlaps in conversation. [doi.org · paper]
  5. De Ruiter et al. (2006). Language — projecting the end of a speaker's turn. [doi.org · paper]
  6. Full-Duplex-Bench. The benchmark that made the threshold measurable on modern STS systems. [arxiv.org · paper]
  7. Full-Duplex-Bench v3. Latest iteration of the benchmark, with expanded coverage. [arxiv.org · paper]
  8. SyncLLM. Synchronous speech-text LLM approach to full-duplex. [arxiv.org · paper]
  9. OmniFlatten (Alibaba Tongyi, 2024). The paper that named the flattened-token architecture family. [arxiv.org · paper]
  10. Freeze-Omni (Tencent AI Lab et al.). Adapter-based approach that freezes the backbone LLM while unlocking duplex speech. [arxiv.org · paper]
  11. Mini-Omni2. Lightweight end-to-end speech-in, speech-out model. [arxiv.org · paper]
  12. τ-Voice. 2026 paper cited in the threshold discussion. [arxiv.org · paper]
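The promised latency budget. The cascade figures below are loose assumptions for a fast modern pipeline (they vary widely by vendor); the ~90 ms TTS figure and Moshi's 200 ms come from entries elsewhere in this index.

```python
# All figures in milliseconds. Cascade numbers are illustrative
# assumptions, not benchmarks; Moshi's 200 ms is the measured figure
# cited in article 03.
cascade = {
    "ASR endpointing + final transcript": 250,   # assumption
    "LLM first token":                    300,   # assumption
    "TTS first audio (Cartesia-class)":    90,   # per the platforms lists
}
integrated = {"Moshi end-to-end": 200}           # measured, per the Moshi paper

print(f"cascade:    ~{sum(cascade.values())} ms")     # ~640 ms
print(f"integrated: ~{sum(integrated.values())} ms")  # at the ~200 ms threshold
```

Even with generous per-stage numbers, a serial pipeline sums to several times the ~200 ms human turn-gap; an integrated model only pays its own step latency once.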

Repositories & open weights · 03

  1. Moshi (Kyutai). The reference full-duplex model that first crossed the threshold in a reproducible open release. [github.com · repo]
  2. Hibiki (Kyutai). Kyutai's follow-up work on simultaneous translation. [github.com · repo]
  3. Full-Duplex-Bench repo. Evaluation code and harness for Full-Duplex-Bench. [github.com · repo]

Platforms & models · 06

  1. OpenAI — Realtime API. Documentation for gpt-realtime, OpenAI's production STS endpoint. [platform.openai.com · platform]
  2. OpenAI — next-generation audio models. Announcement quoting sub-500 ms latency targets for the stack. [openai.com · platform]
  3. Hello GPT-4o. Launch post; the first consumer-grade STS demo at the threshold. [openai.com · platform]
  4. Google DeepMind — Gemini. Overview page covering Gemini Live and multimodal capabilities. [deepmind.google · platform]
  5. Moshi demo. Public hosted demo of the Moshi model. [moshi.chat · platform]
  6. Kyutai — Unmute. Lab project page for Kyutai's real-time voice work. [kyutai.org · platform]

Background & reference · 05

  1. Duplex (Wikipedia). Telecom background on half- vs. full-duplex. [en.wikipedia.org · reference]
  2. Full duplex (Wikipedia anchor). The section specifically defining simultaneous bi-directional transmission. [en.wikipedia.org · reference]
  3. WHO — vision impairment fact sheet. 2.2 billion people with some form of vision impairment — accessibility sizing. [who.int · reference]
  4. BetterUp — CANDOR research. The CANDOR corpus: 1,600 English conversations with rich metadata. [betterup.com · corpus]
  5. CC BY-NC 4.0 license. License referenced for several corpora and open releases. [creativecommons.org · reference]

From oto · 01

  1. oto newsletter. Weekly dispatch tracking STS, full-duplex, and audio foundation models. [oto.earth · oto]
article · 03

From pipeline to integrated

35 references

The four architectural families of integrated STS as of April 2026 — every model, codec, and backbone cited in the field guide, grouped by kind.

Research papers · 09

  1. Moshi — measured latency. Source for the figure of 200 ms measured end-to-end on an NVIDIA L4. [arxiv.org · paper]
  2. Mimi codec (Moshi technical report). Streaming neural audio codec at 12.5 Hz — the enabling piece for a joint full-duplex model; the frame-rate arithmetic is spelled out after this list. [kyutai.org · paper]
  3. NVIDIA PersonaPlex-7B-v1. NVIDIA ADLR, Jan 2026 — initializes from a Moshi-family checkpoint. [arxiv.org · paper]
  4. OmniFlatten (Alibaba Tongyi, 2024). The paper that named the flattened-token architectural family. [arxiv.org · paper]
  5. LLaMA-Omni 2. Reimplementation of the flatten idea on a Meta LLaMA backbone. [arxiv.org · paper]
  6. Moonshot — Kimi-Audio. Claims 13M hours of speech training; released under MIT. [arxiv.org · paper]
  7. Tencent — Covo-Audio / Covo-Audio-Chat-FD. Tencent's entry in the flatten / adapter family. [huggingface.co · paper]
  8. Freeze-Omni. Adapter approach — frozen backbone + streaming speech adapters. [arxiv.org · paper]
  9. SALMONN-omni (ByteDance). Representative entry in the no-codec family: continuous speech features fed straight into the LLM. [arxiv.org · paper]
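The 12.5 Hz figure in the Mimi entry does a lot of work, so the arithmetic is worth spelling out. A minimal sketch — the eight-RVQ-level depth is an assumed value for illustration, not a figure from this index:

```python
# Frame-rate arithmetic for a streaming codec running at 12.5 Hz.
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz      # 80.0 ms of audio per codec step
print(frame_ms)

# A model that emits one step per codec frame pays at least one frame
# (80 ms) of algorithmic latency before any compute or network cost --
# which is why a 12.5 Hz codec leaves headroom under the ~200 ms turn-gap.

# If each frame carries, say, 8 RVQ levels (assumed depth), the model
# consumes and produces:
levels = 8
print(frame_rate_hz * levels)        # 100 codes per second per audio stream
```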

Repositories & open weights · 13

  1. Moshi (Kyutai). Reference implementation for both the model and the Mimi codec (MIT). [github.com · repo]
  2. Chatterbox TTS (Resemble AI). Cited alongside CSM as an open-weights TTS/voice release. [github.com · repo]
  3. Sesame CSM-1B. Open-weights conversational speech model. [github.com · repo]
  4. CosyVoice (FunAudioLLM). Streaming TTS used inside several flattened-token stacks. [github.com · repo]
  5. Qwen2.5-Omni. Alibaba's omni-modal Qwen release. [github.com · repo]
  6. Step-Audio 2 (StepFun). Open-weights member of the flatten family. [github.com · repo]
  7. GLM-4-Voice (THUDM). Tsinghua's GLM-family speech model. [github.com · repo]
  8. Qwen2-7B-Instruct. The frozen backbone used by Freeze-Omni. [huggingface.co · repo]
  9. MiniCPM-o 4.5 (OpenBMB). Compact open-weights omni model. [github.com · repo]
  10. SigLIP2 (Google). Vision encoder cited as a component of modern omni stacks. [huggingface.co · repo]
  11. Whisper (OpenAI). The ASR backbone that many pipeline and adapter systems still call. [github.com · repo]
  12. Qwen3-8B. Backbone LLM used inside several 2025–2026 speech stacks. [huggingface.co · repo]
  13. SALMONN (ByteDance). Repository for the SALMONN line, including the omni variant. [github.com · repo]

Platforms & models · 10

  1. Kyutai. French non-profit AI lab; home of Moshi, Hibiki, Unmute, and the Mimi codec. [kyutai.org · platform]
  2. OpenAI — GPT-4o audio. Consumer-grade pipeline cited at roughly one second end-to-end on a typical day. [openai.com · platform]
  3. Deepgram Voice Agent (Aura). Commercial agent stack quoting sub-second end-to-end latency. [deepgram.com · platform]
  4. Cartesia Sonic. One of the fastest commercial TTS engines (~90 ms to first audio). [cartesia.ai · platform]
  5. Hello GPT-4o. OpenAI's launch of the integrated GPT-4o voice stack. [openai.com · platform]
  6. Google DeepMind — Gemini. Gemini Live umbrella page. [deepmind.google · platform]
  7. Gemini Live API on Vertex. Documentation for the Gemini 3.1 Flash Live API on Google Cloud. [cloud.google.com · platform]
  8. Amazon Nova Sonic. AWS Bedrock Nova family — the cloud provider's STS entry. [aws.amazon.com · platform]
  9. Microsoft AI Services (MAI-Voice-1). Azure AI Services — Microsoft's production voice stack. [azure.microsoft.com · platform]
  10. Hume EVI. Empathic Voice Interface — emotional and prosodic voice agent. [hume.ai · platform]

Corpora · 01

  1. Fisher English (LDC2004S13). Two-channel, 1,960-hour corpus — still the default starting point for duplex dialogue training. [catalog.ldc.upenn.edu · corpus]

From oto · 02

  1. Contact oto. Get in touch about STS datasets and partnerships. [oto.earth · oto]
  2. oto investor data room. Materials for investors exploring the STS / full-duplex category. [oto.earth · oto]
article · 04

The data ceiling

27 references

The post-training data problem — separation and diarization ceilings, license and content-shape filters, the phase-fit matrix, and the public corpus catalog that still pivots on a 2004 telephone corpus.

Research papers · 15

  1. Cieri et al. (2004) — Fisher corpus design. LREC 2004. The original paper describing the Fisher corpus — 1,960 hours, 11,699 dyadic conversations, each speaker recorded on a separate channel at collection time. [ldc.upenn.edu · paper]
  2. Moshi — Défossez et al. (2024). Kyutai's Moshi paper. ~7M hours of mono pre-training, Fisher for the full-duplex fine-tune, 200 ms measured latency. [arxiv.org · paper]
  3. OmniFlatten (2024). A 0.5B-parameter STS model trained on ~2,000 hours of 100% TTS-synthesized dialogue. Proof that the synthetic ceiling is above zero. [arxiv.org · paper]
  4. SepFormer (2021). Transformer-based monaural source separation. ~22.3 dB SI-SDRi on WSJ0-2mix; the metric is defined in the sketch after this list. [arxiv.org · paper]
  5. Conv-TasNet (2019). Fully convolutional time-domain audio separation network. The baseline the rest of the field references. [arxiv.org · paper]
  6. TDANet (2023). Top-down attention network for separation; a mid-2020s high-water mark alongside SepFormer. [arxiv.org · paper]
  7. MossFormer2 (2024). State-of-the-art separation on WSJ0-2mix (~24.1 dB SI-SDRi). Strong on synthetic mixes, collapses on LibriCSS at 30% overlap. [arxiv.org · paper]
  8. pyannote 3.x (2023). The current research-default diarization system. ~22% DER on AMI, ~11% on VoxConverse. [arxiv.org · paper]
  9. EEND-EDA (2020). End-to-end neural diarization with encoder-decoder attractors. The family pyannote and NeMo descend from. [arxiv.org · paper]
  10. LibriCSS (Chen et al., 2020). Conversation-style benchmark with controlled overlap rates. Evidence that WER at 30–40% overlap stays above 18% even with a seven-channel array. [arxiv.org · paper]
  11. Raj et al. (2021) — integration of separation + ASR. The study that caught compounding error on record: a separation front-end helps on overlapped speech but slightly hurts on clean audio. [arxiv.org · paper]
  12. CHiME-8 DASR overview (Cornell et al., 2024). The benchmark organizers state the ceiling plainly: neural SSE techniques still can't reliably handle complex multi-speaker scenarios. [arxiv.org · paper]
  13. HuBERT (2021). Self-supervised speech representation learning. The LARGE model is pre-trained on ~60,000 hours of Libri-Light mono audiobook audio. [arxiv.org · paper]
  14. wav2vec 2.0 (2020). The other canonical self-supervised speech backbone; same source pool as HuBERT. [arxiv.org · paper]
  15. Freeze-Omni (2024). A Family-3 STS model with a 110,000-hour ASR mid-training corpus. The mono-audio entry point into dialogue-shaped training. [arxiv.org · paper]
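The SI-SDRi numbers in the SepFormer and MossFormer2 entries use scale-invariant SDR (Le Roux et al., 2019). A minimal numpy implementation for orientation — the function name and epsilon are our own:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio, in dB.
    est, ref: 1-D arrays of equal length (estimate, clean reference)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to isolate the target component.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) /
                         (np.dot(e_noise, e_noise) + eps))

# The "-i" (improvement) variant is measured against the unprocessed mix:
#   si_sdri = si_sdr(separated, ref) - si_sdr(mixture, ref)
```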

Repositories & models · 01

  1. pyannote speaker-diarization-3.1 (Hugging Face). The model card behind the DER numbers cited in §2.4 — what a production diarization deployment actually uses. A minimal usage sketch follows. [huggingface.co · repo]
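For orientation, this is roughly what calling the 3.1 pipeline looks like, following the model card's documented usage at the time of writing; argument names have shifted across pyannote versions, so treat this as a sketch rather than a pinned recipe.

```python
from pyannote.audio import Pipeline

# Gated model: requires accepting the license on Hugging Face and an
# access token (placeholder below).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN_HERE",
)

# Run diarization on a single audio file and print speaker turns.
diarization = pipeline("conversation.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```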

Platforms & reporting · 06

  1. CHiME-6 challenge. Dinner-party audio challenge. Track 1 (oracle diarization) vs. Track 2 (system diarization) quantifies the cost of building the label table yourself. [chimechallenge.org · platform]
  2. YouTube Terms of Service. The legal ceiling: an explicit prohibition on automated extraction and unauthorized ML training. [youtube.com · platform]
  3. YouTube — third-party AI training opt-in controls. The opt-in control surface creators must actively switch on before their content is a legitimate training input. [support.google.com · platform]
  4. Millette v. OpenAI (TechCrunch). Class-action suit over OpenAI's scraping of YouTube creator transcripts. One of the cases pressure-testing the terms through 2024–25. [techcrunch.com · platform]
  5. RSL: RSS-for-AI-licensing protocol (TechCrunch). The RSS co-creator's 2025 protocol for declaring training-license intent in podcast feeds. Evidence that the current ecosystem lacks the field. [techcrunch.com · platform]
  6. Abaka AI. March 2026 vendor of a 20,000-hour commercial full-duplex corpus. Direct-to-enterprise pricing, seven languages, 100% real human-to-human. [abaka.ai · platform]

Corpora · 03

  1. Switchboard (LDC97S62). 1997. ~260 hours of two-channel telephone conversation. The smaller predecessor to Fisher, still in use. [catalog.ldc.upenn.edu · corpus]
  2. CANDOR (BetterUp). 2023. ~850 hours of two-channel natural conversation. CC BY-NC 4.0, so unavailable for commercial training. [betterup.com · corpus]
  3. Emilia dataset. 2024–25. ~216,000 hours of mono web-scraped speech. The headline number hides a license split between an NC core and a YODAS extension. [emilia-dataset.github.io · corpus]

From oto · 02

  1. oto — dataset inquiry. Two-channel capture at source, per-speaker consent, commercial redistribution, phase-fit labeling. The post-training column, built to order. [oto.earth · oto]
  2. oto investor data room. Materials for investors exploring the STS / full-duplex data market. [oto.earth · oto]
article · 05

Foundation before vertical

43 references

A thesis essay on the foundation threshold — the concept, the three domains that have already crossed it, the 30×–150× gap that full-duplex STS still has to close, and the six plausible routes to 100k+ hours of two-channel conversational data. A back-of-envelope reconstruction of the multiple follows.
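One way to reproduce the 30×–150× band from corpus sizes catalogued in this index. The pool and target figures below are illustrative assumptions, not the essay's own derivation:

```python
# Hours of two-channel conversational audio named in this index.
fisher, switchboard, candor = 1_960, 260, 850

pool_all = fisher + switchboard + candor   # ~3,070 h, licenses ignored
pool_commercial = fisher + switchboard     # ~2,220 h (CANDOR is CC BY-NC)

# Assumed foundation-scale targets: the essay's 100k+ hour floor, plus
# an illustrative 300k upper bound.
for target in (100_000, 300_000):
    print(target // pool_all, target // pool_commercial)
# -> roughly 32x-45x at the floor, ~97x-135x at the upper bound:
#    the same order of magnitude as the 30x-150x gap named above.
```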

Research papers · 17

  1. Radford et al. (2018) — GPT-1. Improving Language Understanding by Generative Pre-Training. 117M params, ~0.8B tokens; still required task-specific fine-tuning. [cdn.openai.com · paper]
  2. Radford et al. (2019) — GPT-2. Language Models are Unsupervised Multitask Learners. 1.5B params, ~10B tokens; zero-shot was interesting but unreliable. [cdn.openai.com · paper]
  3. Brown et al. (2020) — GPT-3. Language Models are Few-Shot Learners. 175B params, 300B tokens. The text-LLM foundation-threshold crossing. [arxiv.org · paper]
  4. Singhal et al. (2022) — Med-PaLM. Large language models encode clinical knowledge. 67.6% on MedQA; an adapter on PaLM rather than a from-scratch medical LLM. [arxiv.org · paper]
  5. Singhal et al. (2023) — Med-PaLM 2. Towards expert-level medical question answering. 86.5% on MedQA, built on PaLM 2. The vertical-adapter pattern, matured. [arxiv.org · paper]
  6. Rozière et al. (2023) — Code Llama. Open foundation models for code. 500B additional code tokens on Llama 2 — roughly 10% extra training for a specialized vertical. [arxiv.org · paper]
  7. Radford et al. (2021) — CLIP. Learning Transferable Visual Models From Natural Language Supervision. 400M image-text pairs; the vision zero-shot threshold. [arxiv.org · paper]
  8. Ma et al. (2024) — MedSAM. Nature Communications. +22.51 DICE over zero-shot SAM across 86/86 internal tasks, using 1.57M medical mask annotations. [nature.com · paper]
  9. Zhang et al. (2023) — BiomedCLIP. 15M biomedical image-text pairs on a CLIP base. Confirms the two-to-three-orders-of-magnitude-smaller adapter pattern. [arxiv.org · paper]
  10. Radford et al. (2022) — Whisper. Robust Speech Recognition via Large-Scale Weak Supervision. 680,000 hours; the ASR foundation-threshold crossing. [arxiv.org · paper]
  11. Wu et al. (2023) — BloombergGPT. A large language model for finance. 50B params, 363B finance + 345B general tokens. Matched or exceeded by GPT-4 within twelve months. [arxiv.org · paper]
  12. Taylor et al. (2022) — Galactica. A large language model for science. Withdrawn after three days — narrow-corpus hallucinations that sounded plausible. [arxiv.org · paper]
  13. Défossez et al. (2024) — Moshi. Kyutai. ~7B parameters, the first open full-duplex STS model. The GPT-2 analog in the STS scaling arc. [arxiv.org · paper]
  14. Nakata et al. (2024) — J-CHAT. 69,000 hours of Japanese audio — mono single-speaker, so unusable for a full-duplex fine-tune despite the headline volume. [arxiv.org · paper]
  15. Korfiatis et al. (2022) — PriMock57. A primary-care mock-consultation dataset, recorded with patient actors as a HIPAA workaround. [arxiv.org · paper]
  16. Yim et al. (2023) — ACI-Bench. Ambient Clinical Intelligence benchmark. Mocked medical dialogues for evaluation; the authors are explicit about the regulatory constraint. [arxiv.org · paper]
  17. Chiu et al. (2017) — Google Health medical dialogue. 14,000 hours of institutional medical conversations, never released — institutional corpora trapped by regulation. [arxiv.org · paper]

Companies & platforms · 21

  1. Harvey — Series A announcement (TechCrunch). Five months after ChatGPT. The post-foundation-compression benchmark for a vertical LLM. [techcrunch.com · platform]
  2. Hippocratic AI — $50M seed (Reuters). Six months after ChatGPT. Would not have been financeable eighteen months earlier. [reuters.com · platform]
  3. Abridge. $5.3B valuation (June 2025). The post-Whisper vertical winner for medical scribing. [abridge.com · platform]
  4. Decagon. $4.5B Series D (January 2026). Customer-service STS agents — a pipeline stack, not native full-duplex. [decagon.ai · platform]
  5. Deepgram. $1.3B Series C (January 2026). Enterprise voice AI. [deepgram.com · platform]
  6. Vapi. Developer voice platform. ~$130M valuation reported, on $20M of Series A capital. [vapi.ai · platform]
  7. Retell AI. The honest counter-nuance: $50M ARR on ~$5M of funding suggests some verticals compound without foundation-level investment. [retellai.com · platform]
  8. Abaka AI. 20,000-hour bidirectional commercial release (2026). The single data point above 10k h for Route 3. Vendor-claimed, not independently audited. [abaka.ai · platform]
  9. Nexdata. 15k-hour multilingual conversational corpus — mono 8 kHz, fails the two-channel bar. [nexdata.ai · platform]
  10. Appen. Managed crowdsourced data collection. Project-based, not standing corpora. [appen.com · platform]
  11. TELUS Digital. Digital customer-experience vendor offering managed audio collection at enterprise scale. [telusdigital.com · platform]
  12. Linguistic Data Consortium (LDC). The academic consortium behind Switchboard and Fisher. The commercial tier yields in-year redistribution rights. [ldc.upenn.edu · platform]
  13. Reddit–Google licensing deal (TechCrunch). $60M/year (February 2024). Proof that platforms can monetize UGC corpora to AI labs. [techcrunch.com · platform]
  14. YouTube — third-party training opt-in. December 2024 creator control surface. The opt-in plumbing for Route 6 audio licensing exists. [blog.youtube · platform]
  15. RSL — Really Simple Licensing. Launched September 2025; 1,500+ publishers by late 2025. Watch Route 6 for a surprise inflection. [rslstandard.org · platform]
  16. Spotify — Developer Policy (May 2025). Explicit prohibition on training models from Spotify content. One vendor's stated closure of the audio-licensing door. [developer.spotify.com · platform]
  17. Mozilla Common Voice. 31,841 hours across 286 languages, CC0. The crowdsourced ceiling — but all single-speaker read or monologue speech. [commonvoice.mozilla.org · platform]
  18. Replika. Since 2017. Luka Inc. was hit with a €5M Italian DPA fine (April 2025) after a provisional ban — Route 1 under a regulatory ceiling. [replika.com · platform]
  19. Character.AI. Since 2021. Consumer companion app with >10k h of in-app conversation. The corpus has never been released. [character.ai · platform]
  20. Sesame. Beta 2025. Companion STS app; another volume-rich source structurally trapped inside its app. [sesame.com · platform]
  21. Oyez Project. 5,000+ hours of public-domain US Supreme Court oral-argument audio. Configuration-wrong (mono-mixed), not data-poor. [oyez.org · platform]

Corpora & references · 02

  1. Switchboard (LDC97S62). 1991. DARPA + Texas Instruments. 260 hours of two-channel telephone conversation. The Route-5 origin point. [catalog.ldc.upenn.edu · corpus]
  2. Fisher corpus (Cieri et al., 2004). LREC 2004. 1,960 hours, 11,699 dyadic conversations. Still the default full-duplex fine-tune corpus twenty-two years later. [ldc.upenn.edu · corpus]

From Fullduplex · 03

  1. Article 04 — The data ceiling. The data-supply side of the same coin: why separation AI and YouTube scraping don't rescue the gap. [/blog/data-ceiling · oto]
  2. Article 01 — Speech-to-speech AI, a primer. Sets the vocabulary — what an STS system is, and what full-duplex changes. [/blog/sts-primer · oto]
  3. Fullduplex — datasets. The curated index of conversational speech datasets underlying every article in the series. [/datasets · oto]