Datasets.
Corpora for training and evaluating audio foundation models — grouped by role (pretrain / dialog-interactive / eval / specialty), tagged with interactivity (how useful for full-duplex modeling) and license class (permissive / share-alike / gated / non-commercial / custom). Always verify with the primary source before shipping. Open the channel if you spot an error.
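For anyone scripting against this catalog, the card fields reduce to a small record type. A minimal Python sketch — the field names and enum values below are paraphrased from the cards on this page, not an official export schema:

    from dataclasses import dataclass, field
    from typing import Literal

    Role = Literal["pretrain", "dialog", "eval", "finetune", "specialty"]
    Interactivity = Literal["high", "medium", "low", "n/a"]
    LicenseClass = Literal["permissive", "share-alike", "gated", "non-commercial", "custom"]

    @dataclass
    class DatasetCard:
        name: str
        hours: float | None               # None where a card shows "hours —"
        langs: list[str]                  # language codes as the cards print them
        role: Role
        interactivity: Interactivity
        license_class: LicenseClass
        channel_separated: bool = False   # per-speaker tracks
        synthetic: bool = False           # TTS-generated, no real recordings
        tags: list[str] = field(default_factory=list)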
Which dataset for which job?
Rows are the 42 public STS-adjacent corpora, split into Full-duplex (channel-separated) and Others. Columns are the training / eval roles they actually support. The full-duplex band opens by default; click a group header to fold it, hover a column for its definition, click a row to jump to the full card below.
Dialog-interactive hours by language
Scope: the 19 real-recorded dialog-interactive public corpora we track (synthetic / TTS-generated corpora like Behavior-SD are excluded so they don't inflate acoustic totals), mapped onto the world's ten most-spoken languages. For multi-lingual corpora, hours are split evenly across every declared language — languages outside the top ten land in Other. Dashed rows = zero public hours in this scope; linear X axis leaves the English dominance literal. Toggle channel-separated only to isolate the corpora whose audio is actually usable for full-duplex acoustic training (per-speaker tracks, not mixed-channel).
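The split rule is mechanical enough to sketch. Assuming corpora arrive as (hours, declared-languages) pairs — toy figures below, and an illustrative top-ten set rather than the exact list the chart uses:

    from collections import defaultdict

    TOP10 = {"en", "zh", "hi", "es", "fr", "ar", "bn", "pt", "ru", "ur"}  # illustrative

    # (hours, declared language codes) — toy stand-ins, not the tracked index
    corpora = [(3500.0, ["hi", "bn", "ta", "te"]), (2000.0, ["en"]), (15.0, ["zh", "en"])]

    hours_by_lang: dict[str, float] = defaultdict(float)
    for hours, langs in corpora:
        share = hours / len(langs)        # even split across every declared language
        for lang in langs:
            hours_by_lang[lang if lang in TOP10 else "Other"] += share

    # hours_by_lang["hi"] -> 875.0; hours_by_lang["Other"] -> 1750.0 (the ta + te shares)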
How full-duplex ready are public corpora?
Every public STS-adjacent corpus with a known release month and hour count (1993 – 2026), plotted by release date and scale. The default lens is channel-separated only — the per-speaker-track subset that is actually usable for full-duplex acoustic training — so the view opens with the corpora that matter most for FD builders. Untick the chip to see the full pool. Fill colour scores full-duplex fit (FD-ready = high interactivity and/or explicit overlap / turn-taking / back-channel labels); a red ring marks channel-separated audio; a dashed outline marks synthetic corpora (TTS-generated, useful for behaviour supervision but not a substitute for real recordings). Chips narrow by FD fit, licence class, commercial use, or channel separation. Click any dataset name for the one-screen briefing.
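In code, the caption's rules collapse to two predicates over the DatasetCard sketch above — the tag names are the ones used on the cards, and this is a paraphrase of the stated rule, not the page's actual scoring function:

    def fd_ready(card: DatasetCard) -> bool:
        # FD-ready = high interactivity and/or explicit overlap / turn-taking /
        # back-channel labels (approximated here via card tags).
        fd_labels = {"#overlap", "#turn-taking", "#backchannel", "#full-duplex"}
        return card.interactivity == "high" or bool(fd_labels & set(card.tags))

    def default_lens(cards: list[DatasetCard]) -> list[DatasetCard]:
        # The chart opens on the channel-separated subset; untick the chip to widen.
        return [c for c in cards if c.channel_separated]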
Full-duplex & interactive
Dyadic, multi-party, or meeting audio where overlap and turn boundaries matter. The raw material for full-duplex models.
IndicVoices (conversational slice)
hours ~3,500 h (conv.) · lang hi · bn · ta · te · … · role dialog · released 2024-03 · verified 2026-04
~3,500 hours of conversational speech across all 22 scheduled Indic languages — the single largest permissive source for Hindi/Bengali/Tamil/Telugu dialogue.
interactivity · medium · commercial ok · permissive · CC-BY-4.0
IndicVoices as a whole is a 23.7K-hour natural-speech corpus spanning 22 languages from 51K speakers across 400+ districts; 15% of that is conversational, giving ~3,500 hours of dialog-interactive audio under CC-BY-4.0 (commercial use permitted). 11.2K hours transcribed as of Dec 2025. This row counts only the conversational slice; the read/extempore majority sits upstream as pretraining fodder. Because hours are split evenly across many languages, the bucket columns on D-B show how even the largest Indic corpus is thinly spread per language.
#indic #multilingual #hindi #bengali #conversational
Fisher English
hours 2,000 h · lang en · role dialog · released 2004-12 · verified 2026-04
2,000 hours of 10-minute telephone conversations — a classic for conversational ASR and turn-taking baselines.
interactivity · high · restricted · gated · LDC (paid)
Ten-minute English phone calls between randomly paired US speakers, with transcripts. Distributed through LDC (paid), so mind the commercial terms.
#telephone #conversation #paid #channel-separated
KsponSpeech
hours 969 h · lang ko · role dialog · released 2020-03 · verified 2026-04
969 hours of Korean spontaneous dyadic conversation — the canonical Korean analogue of Fisher / Switchboard.
interactivity · high · restricted · gated · AI Hub (registration)
About 2,000 native Korean speakers recorded freely conversing in pairs on open-domain topics in a clean indoor environment. Dual transcription (orthography + pronunciation) plus disfluency tags for fillers, repetitions, and fragments. Distributed through the Korean government's AI Hub under a registration-gated licence; free for research on application.
#korean #spontaneous #dyadic #disfluency
CANDOR
hours 850 h · lang en · role dialog · released 2023-03 · verified 2026-04
850 hours of naturalistic video-chat conversation across 1,656 dyads — the go-to open corpus for full-duplex turn-taking research.
interactivity · high · research only · non-commercial · CC-BY-NC-4.0
1,656 dyadic video-chat conversations recorded over TokBox OpenTok between 1,456 US adults (2020). Ships per-speaker audio + video tracks, three turn parses (Audiophile / Cliffhanger / Backbiter), backchannels, prosody, facial-action features, and rich pre/post surveys. The Science Advances (2023) paper is CC BY-NC 4.0 — the corpus inherits the non-commercial clause, so commercial redistribution is not permitted. Free for research on signed data-request.
#turn-taking #dialog #full-duplex #video #channel-separated
Switchboard-1 (LDC97S62)
hours 260 h · lang en · role dialog · released 1993-01 · verified 2026-04
260 hours across 2,400 U.S. telephone conversations — the original turn-taking corpus on which the field was built.
interactivity · high · restricted · gated · LDC (paid)
Collected by Texas Instruments in 1990-91 under DARPA. 543 speakers; no pair talks twice and no speaker repeats a topic. 8 kHz 2-channel ulaw with orthographic transcripts. Distributed through LDC (paid).
#telephone #classic #turn-taking #paid #channel-separated
SpokenWOZ
hours 249 h · lang en · role dialog · released 2023-05 · verified 2026-04
5.7K human-to-human spoken task-oriented dialogues (203K turns, 249 h) — the canonical spoken task-oriented corpus.
interactivity · medium · research only · non-commercial · CC-BY-NC-4.0
Multi-domain task-oriented dialogues collected by crowdsourcing pairs of human speakers. 5,700 dialogues, 203,000 turns, ~249 hours across eight domains including restaurant, hotel, and taxi. Each turn ships audio, transcript, and task-oriented dialog-state labels (goals, slots, intents), plus a new 'cross-turn slot' annotation for values that span multiple utterances. Fills the 'spoken task-oriented' drawer that was missing between open-domain chat corpora and ASR benchmarks. CC-BY-NC-4.0, so research-only.
#task-oriented #dialog-state #multi-domain #spoken-dst
MagicData-RAMC
hours 180 h · lang zh · role dialog · released 2022-03 · verified 2026-04
180 hours of Mandarin spontaneous two-party conversation — the canonical CN-side analogue of Switchboard, free for academic use.
interactivity · high · research only · non-commercial · CC-BY-NC-ND-4.0
663 speakers from different accent regions of China, freely conversing in pairs over mobile-phone recordings in a quiet indoor environment. Manually transcribed and proofed, with speaker and topic metadata. Train/val/test split 15:1:2. Released on OpenSLR as SLR-123. Licensed CC-BY-NC-ND-4.0 — explicitly free for academic work but not for redistribution or commercial derivatives.
#mandarin #chinese #mobile-phone #turn-taking #overlap
CALLHOME (JA / ZH / ES / DE / AR …)
hours ~120 h · lang ja · zh · es · de · … · role dialog · released 1996-01 · verified 2026-04
Native-speaker telephone conversations across Japanese, Mandarin, Spanish, German, Arabic, and English — the classic non-English turn-taking set.
interactivity · high · restricted · gated · LDC (paid, per-release)
The CALLHOME series is a family of 6 LDC catalogs (English, Japanese, Mandarin, Spanish, German, Egyptian Arabic). Each release contains ~120 unscripted phone calls of up to 30 min between native-speaker family members or close friends. Channel-separated 8 kHz two-channel ulaw with orthographic transcripts; the Japanese, Mandarin, and Arabic releases also include aligned Romanisation. One of the only ways to get dyadic native-speaker telephone conversation outside English at scale. Distributed through LDC (paid).
#telephone #classic #turn-taking #paid #channel-separated #multilingual
AMI Meeting Corpus
hours 100 h · lang en · role dialog · released 2006-06 · verified 2026-04
100 hours of multi-party meeting recordings with close-talking mics, video, and overlap annotation — one of the few CC-BY corpora rich in overlapping speech.
interactivity · high · commercial ok · permissive · CC-BY-4.0
~2/3 of the corpus is 4-person fictional design-team meetings; the rest is mixed meeting types. Ships synchronized close-talking and far-field microphone audio, video, word-level transcripts, dialog acts, topic segmentation, and gesture and gaze annotations.
#multi-party #overlap #meetings #full-duplex
CHiME-6
hours 40 h · lang en · role dialog · released 2020-04 · verified 2026-04
40+ hours of real dinner-party conversation captured across 20 parties with multi-array + close-talk mics — overlap-rich multi-party speech.
interactivity · high · research only · non-commercial · Research use (CHiME Challenge terms)
Twenty genuine dinner parties held in private homes, each recorded with six distributed Kinect-style microphone arrays plus binaural close-talking references on each participant. ~40 hours total, 4 participants per session, unscripted conversation over ~2 hours. High backchannel / overlap density makes it one of the best real-recorded targets for multi-party full-duplex modelling — it is not dyadic, but it is the reference corpus for distant conversational ASR and for overlap-aware diarisation. Audio and transcripts are released under a research-use license by the CHiME organisers.
#multi-party #overlap #dinner-party #far-field #backchannel
DailyTalk
hours 20 h · lang en · role dialog · released 2022-07 · verified 2026-04
20 hours of two-speaker daily conversations with emotion and dialog-act labels — a staple for dialogue TTS.
interactivity · medium · research only · non-commercial · CC-BY-NC-SA-4.0
2,541 dialogues. Every utterance carries emotion and dialog-act labels, which is useful when you want to train TTS that preserves conversational context.
#dialog-TTS #emotion
Large-scale pretraining
Bulk read / broadcast / web-scraped speech used for self-supervised and supervised pretraining — multilingual bases alongside high-volume single-language corpora.
YODAS / YODAS2
hours 500k h · lang 100+ · role pretrain · released 2024-06 · verified 2026-04
500k hours of Creative-Commons YouTube speech across 100+ languages — currently the largest open multilingual corpus.
interactivity · low · restricted · share-alike · CC-BY / CC-BY-SA (per-source)
A massive automated multilingual speech dataset curated from CC-licensed YouTube. YODAS2 tightens quality filters. Downstream use inherits the CC-BY-SA share-alike clause.
#massive #multilingual #youtube
VoxPopuli
hours 400k h · lang en · de · fr · es · … · role pretrain · released 2021-01 · verified 2026-04
400k hours of unlabeled European Parliament speech in 23 languages + 1.8k hours transcribed + 17.3k hours of interpreter audio — data released CC0.
interactivity · low · commercial ok · permissive · CC0 (data) / CC BY-NC 4.0 (code & models)
Collected from 2009-2020 plenary sessions. The unlabeled pool is ~9-18k hours per language. Includes 29 hours of accented non-native English. The official README splits the license stack clearly: VoxPopuli data is CC0 (see the European Parliament's legal notice for the raw recordings), while the repository's code and pre-trained models are CC BY-NC 4.0. Commercial derivatives trained on the audio alone are straightforward; re-using the released checkpoints is not.
#massive #multilingual #parliament #interpretation #cc0 · trains SeamlessM4T v2
Emilia
hours 101k h · lang en · zh · ja · ko · … · role pretrain · released 2024-07 · verified 2026-04
101k hours of in-the-wild multilingual speech — a de-facto standard for large TTS/S2S training.
interactivity · low · restricted · share-alike · CC-BY-SA-4.0 (research)
Built from YouTube and podcast audio cleaned with VAD, denoising, and speaker separation. 101k hours across English, Chinese, Japanese, Korean, German, and French. Used by CosyVoice and Emilia-TTS.
#large-scale #multilingual #TTS
Multilingual LibriSpeech (MLS)
hours 50k h · lang en · de · nl · fr · … · role pretrain · released 2020-12 · verified 2026-04
50k hours of LibriSpeech-style audiobooks across 8 languages — the permissive multilingual pretraining baseline.
interactivity · low · commercial ok · permissive · CC-BY-4.0
Extends the LibriSpeech recipe to English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. One of the largest CC-BY-4.0 multilingual speech corpora — friction-free for commercial use.
#multilingual #large-scale #permissive · trains SeamlessM4T v2
ReazonSpeech
hours 35k h · lang ja · role pretrain · released 2023-03 · verified 2026-04
35,000 hours of Japanese, pseudo-labelled from TV broadcasts with Whisper — the largest practical JP training pool.
interactivity · medium · restricted · share-alike · CDLA-Sharing-1.0
Live-broadcast diversity combined with a Japanese-only focus. Labels are Whisper-generated, so expect noise at the tail. CDLA-Sharing means derivative datasets inherit the share-alike clause.
#japanese #broadcast #large-scale
Common Voice 18
hours 30k+ h · lang 120+ · role pretrain · released 2019-06 · verified 2026-04
Crowdsourced multilingual read speech — 120+ languages and 30k+ hours under CC0.
interactivity · low · commercial ok · permissive · CC0-1.0
Volunteers around the world donate their voice. Coverage is uneven, but for many low-resource languages this remains the only public option for initial ASR / TTS work. CC0 makes it the most portable multilingual corpus.
#crowdsourced #multilingual #cc0 · trains SeamlessM4T v2
People's Speech
hours 30k+ h · lang en · role pretrain · released 2021-11 · verified 2026-04
30k+ hours of diverse English speech harvested from Archive.org — the largest commercially usable English ASR corpus.
interactivity · low · commercial ok · share-alike · CC-BY-SA-4.0 / CC-BY-4.0
23.7 million examples in FLAC with auto-matched transcripts, built from public-domain / CC-BY / CC-BY-SA sources. A Baidu / Harvard / Intel / Landing AI / NVIDIA collaboration. Commercial use is explicitly supported within the share-alike constraint.
#english #large-scale #commercial-ok
GigaSpeech
hours 10k h · lang en · role pretrain · released 2021-06 · verified 2026-04
10k hours of English sourced from audiobooks, podcasts, and YouTube — broad acoustic coverage but research-only audio access.
interactivity · medium · research only · gated · Research-only (gated, Apache-2.0 code)
Designed to offset LibriSpeech's read-speech bias with conversation, lectures, and news. Released in XS / S / M / L / XL splits. The repository code is Apache-2.0, but the distributed audio requires agreement to SpeechColab's Terms of Access (Tsinghua-hosted) limiting use to non-commercial research and educational purposes. SpeechColab does not own copyright on the underlying audio.
#english #large-scale #podcast #youtube
LibriSpeech
hours 960 h · lang en · role pretrain · released 2015-04 · verified 2026-04
960 hours of read English audiobooks — the canonical ASR benchmark since 2015.
interactivity · low · commercial ok · permissive · CC-BY-4.0
Derived from LibriVox audiobooks. Traditionally split into train-clean-100, train-clean-360, and train-other-500. Still the baseline any new ASR system reports against.
#classic #english #ASR
Corpus of Spontaneous Japanese (CSJ)
hours 661 h · lang ja · role pretrain · released 2004-01 · verified 2026-04
661 hours of Japanese spontaneous speech — the canonical academic JP corpus for prosody, POS, and dependency.
interactivity · medium · restricted · gated · NINJAL (tiered)
3,302 recordings at 16 kHz / 16 bit. ~90% monologue (academic presentations, simulated public speaking), ~10% dialog. Ships 7.5M-word transcripts, morphology, 500k prosodic-label units, and dependency annotations. Tiered pricing — academic / general / commercial — administered by NINJAL.
#japanese #prosody #academic
Fine-tune & TTS-grade
Smaller, curated corpora used for TTS and instruction-tuning — typically parallel-read, multi-speaker, single-language by construction.
JVS Corpus
hours 30 h · lang ja · role finetune · released 2019-08 · verified 2026-04
30 hours of Japanese parallel readings across 100 speakers — the staple for multi-speaker JP TTS.
interactivity · low · restricted · custom license · CC-BY-SA-4.0 (with conditions)
Each of 100 speakers reads the same set of prompts. 30 hours of parallel speech that remains the default choice for Japanese multi-speaker TTS and voice conversion.
#japanese #multi-speaker #TTS
Paralinguistic & expressive
Laughter, whisper, emotion — the data that teaches speech-LMs the non-verbal layer.
Expresso
hours 40 h · lang en · role specialty · released 2023-08 · verified 2026-04
40 hours of studio-quality expressive speech — 4 speakers, 8 read styles, 26 improvised dialogue styles including laughter and whisper.
interactivity · medium · research only · non-commercial · CC-BY-NC-4.0
11 h of read speech plus 30 h of improvised dialogues recorded at 48 kHz / 24-bit. Styles include laughter, whisper, confused, enunciated, happy, sad, angry, and others. Accompanies the Interspeech 2023 benchmark for textless expressive resynthesis.
#expressive #prosody #laughter #whisper
IEMOCAP
hours 12 h · lang en · role specialty · released 2008-12 · verified 2026-04
12 hours of scripted and improvised dyadic acting — the emotion-recognition classic.
interactivity · medium · research only · gated · USC SAIL (request)
Actors perform improvised and scripted scenes, captured as synchronized video, audio, and motion. Labelled with six emotion classes; a staple baseline for emotion benchmarks.
#emotion #classic #dyadic
Multilingual eval & translation
Evaluation-grade corpora for cross-lingual ASR and S2ST. Small, curated, and properly licensed.
MuST-C
hours ~4k h · lang en → 14 · role eval · released 2019-06 · verified 2026-04
English TED-talk speech translated into 14 target languages — hundreds of hours per pair, the reference training corpus for ST.
interactivity · n/a · research only · non-commercial · CC-BY-NC-ND-4.0
Multilingual Speech Translation Corpus built from English TED Talks. Each English audio segment is paired with aligned text translations in 14 languages (German, Spanish, French, Italian, Dutch, Portuguese, Romanian, Russian, Arabic, Farsi, Turkish, Vietnamese, Chinese, Japanese). Per-direction sizes range from ~385 h (DE) down to ~100 h (smaller pairs). Unlike CVSS, all source audio is real human speech — the target is text only — so MuST-C is the standard supervised base for speech-to-text translation, and an upstream for cascaded S2ST. Released under CC-BY-NC-ND-4.0 (research-only).
#st #ted-talks #multilingual #aligned
CVSS
hours 3,809 h · lang 21 → en · role eval · released 2022-01 · verified 2026-04
Massively multilingual S2ST corpus — 21 source languages into English, with synthetic target speech — CC-BY-4.0.
interactivity · n/a · commercial ok · permissive · CC-BY-4.0
Common Voice-based Speech-to-Speech. Source audio comes from Common Voice; English target speech is synthesised by Google's Parallel-Tacotron with a single-speaker voice (CVSS-C) and with speaker-transfer voices (CVSS-T). 3,809 hours across 21 source languages. Because target speech is synthetic, CVSS is best read as 'paired supervision for S2ST models' rather than naturalistic target speech — the permissive license makes it the portable baseline for speech-to-speech translation research.
#s2st #translation #synthetic-target #multilingual
CoVoST-2
hours 2,880 h · lang 21 + en · role eval · released 2020-07 · verified 2026-04
2,880 hours of multilingual S2T translation built on top of Common Voice — the standard supervised speech-translation base.
interactivity · n/a · commercial ok · permissive · CC0-1.0
21-to-English and English-to-15 speech-to-text translation pairs. Frequently used for indirect S2ST evaluation and SeamlessM4T reproductions. CC0 makes it the most portable translation corpus.
#translation #multilingual #cc0 · trains SeamlessM4T v2
FLEURS
hours ~1k h · lang 102+ · role eval · released 2022-05 · verified 2026-04
Few-shot Learning Evaluation of Universal Representations of Speech — 102 languages, ~10 h each, built on FLoRes translation pairs.
interactivity · n/a · commercial ok · permissive · CC-BY-4.0
2,009 n-way parallel sentences recorded by native speakers. Supports ASR, speech-to-text translation, language ID, and speech-text retrieval. FLEURS-R (2024) restores audio quality for TTS-grade evaluation.
#multilingual #eval #translation #low-resource · trains SeamlessM4T v2
Frontier (2024 – 2026)
Recent full-duplex / speech-LM oriented corpora, including niche releases. Skewed toward research licenses and fresh paper drops — sorted newest first.
otoSpeech-full-duplex-280h
hours 280 h · lang en · role dialog · released 2026-02 · verified 2026-04
280 hours of channel-separated two-speaker English conversation at 48 kHz under CC-BY-4.0 — among the largest permissive, commercially-usable real-recorded FD corpora currently indexed on Hugging Face (as of 2026-04).
interactivity · high · commercial ok · permissive · CC-BY-4.0
Dyadic English conversation recorded in diverse real-world conditions. Stereo FLAC: channel 0 = speaker A, channel 1 = speaker B. Preserves natural overlaps, interruptions, and laughter. Every sample ships audio + session metadata + speaker profiles + redaction intervals + participant surveys, packaged as WebDataset tar shards. Released under CC-BY-4.0, so commercial use is permitted with attribution. A 141-hour human-reviewed + denoised variant (otoSpeech-full-duplex-processed-141h) is also available.
#channel-separated #48khz #overlap #laughter #full-duplex #dyadic
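Because each speaker sits on their own stereo channel, per-speaker streams fall out of a plain FLAC read. A minimal sketch with numpy and soundfile — the file name is hypothetical, and the 10 ms energy gate is just an illustrative way to eyeball overlap density, not anything shipped with the release:

    import numpy as np
    import soundfile as sf

    audio, sr = sf.read("session_0001.flac")   # hypothetical member file; shape (n_samples, 2)

    speaker_a = audio[:, 0]   # per the card: channel 0 = speaker A
    speaker_b = audio[:, 1]   # channel 1 = speaker B

    # Crude overlap estimate: both channels active within the same 10 ms frame.
    frame = sr // 100
    def active(x: np.ndarray, thr: float = 1e-3) -> np.ndarray:
        trimmed = x[: len(x) // frame * frame]
        energy = np.square(trimmed).reshape(-1, frame).mean(axis=1)
        return energy > thr

    overlap_ratio = float((active(speaker_a) & active(speaker_b)).mean())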
otoSpeech-full-duplex-processed-141h
hours 141 h · lang en · role dialog · released 2026-02 · verified 2026-04
141 hours of human-reviewed, denoised, channel-separated English FD conversation — the processed sibling of otoSpeech-280h.
interactivity · high · commercial ok · permissive · CC-BY-4.0
Curated from otoSpeech-full-duplex-280h: high-quality conversations selected by human review, processed with noise reduction and speech enhancement, plus new samples collected after the 280h release. 44.1 kHz channel-separated FLAC with session metadata, speaker profiles, redaction intervals, and surveys. Same CC-BY-4.0 license as the raw variant, so commercial use is permitted.
#channel-separated #44.1khz #denoised #curated #full-duplex #dyadic
InteractSpeech
hours 150 h · lang en · role dialog · released 2025-11 · verified 2026-04
150 hours of English speech-interaction data — synthesised + filtered real-world dialogues with precise speaker timestamps for interruptions and backchannels.
interactivity · high · restricted · custom license · Research (see repo)
Combines synthetic interactive dialogues with interactive segments mined from real speech corpora. Provides a formal framework for interaction dynamics and demonstrates a fine-tuned LLaMA-3 8B that classifies interactional events from audio. Findings of EMNLP 2025.
#interruption #backchannel #synthetic #full-duplex
OleSpeech-IV
hours 100 h (open subset) · lang en · role dialog · released 2025-09 · verified 2026-04
English conversational speech with accents from all world regions — human-refined speaker turns and transcripts, 100-hour open subset.
interactivity · medium · research only · non-commercial · Non-commercial research
The IV tier of the Olewave dataset series. OleSpeech-IV-2025-EN-AR-100 is 100 hours of English-only conversation (EN = English, AR = accents from all regions — not Arabic). Drawn from public podcasts, talk shows, and teleconferences; ships FLAC mono 16 kHz with speaker labels, turn info, timestamps, and confidence scores. The underlying Tier-IV collection holds 5,000+ hours across G20 languages, but only this English subset is publicly downloadable.
#multi-speaker #podcast #teleconference #diarization #accent-diverse
MLC-SLM (Interspeech 2025)
hours 1,604 h · lang en · fr · de · it · … · role dialog · released 2025-09 · verified 2026-04
1,604 hours of two-speaker conversational speech in 11 languages — the first genuinely multilingual public conversational speech-LM benchmark.
interactivity · high · restricted · gated · Challenge access (CC-BY-SA-4.0 for eval)
Released alongside the 1st Multilingual Conversational Speech LM challenge at Interspeech 2025. Covers English (500 h across 5 regional accents), French, German, Italian, Portuguese, Spanish, Japanese, Korean, Russian, Thai, and Vietnamese (100 h each). Every recording is a natural two-speaker conversation on assigned topics, 16 kHz mobile-phone indoor capture with oracle segmentation + speaker labels. Eval-1/Eval-2 ground truth (96 h) is openly on Hugging Face under CC-BY-SA-4.0; the 1,507-hour training set is distributed to registered challenge participants. The 2nd challenge (2026) expands to 14 languages / 2,100 hours.
#multilingual #diarization #turn-taking #challenge #interspeech
MM-F2F
hours 210 h · lang en · role dialog · released 2025-07 · verified 2026-04
210 hours of multi-modal face-to-face conversation with turn-taking and backchannel labels at word level.
interactivity · high · restricted · custom license · Research (see repo)
Collected via an automatic pipeline from human conversation video, de-identified by replacing faces and perturbing voiceprints. 1.5M words and ~20M frames. The trained end-to-end predictor reaches +10% F1 on turn-taking and +33% on backchannel prediction over previous SOTA. ACL 2025.
#multimodal #turn-taking #backchannel #video
VoxDialogue
hours — · lang en · role eval · released 2025-05 · verified 2026-04
4,500 multi-turn spoken dialog samples × 12 acoustic attributes — probing whether spoken dialog systems catch what text can't.
interactivity · n/a · restricted · custom license · Research (see repo)
Benchmarks speech-LMs on 12 acoustic attributes (speech rate, volume, emphasis, background sound, intonation, rhythm, gender, accent, emotion …). Shows that direct speech models pick up cues that ASR pipelines lose. Presented at ICLR 2025; data + code open-sourced.
#paralinguistic #eval #speech-lm #multi-turn
Behavior-SD
hours 2,164 h · lang en · role dialog · released 2025-04 · verified 2026-04
108K LLM-synthesised full-duplex dialogues (2,164 h) with explicit backchannel / interruption / filler labels — the largest publicly-downloadable FD-labeled corpus.
interactivity · high · commercial ok · permissive · CC-BY-4.0
Behavior-driven Spoken Dialogues: natural-language narratives rendered to speech via TTS, conditioned on speaker-wise behavioural traits (talkativeness, backchannelling, interruption rate, filler frequency). Every utterance carries turn-level timing and behaviour annotations. Explicitly synthetic — no real recordings — so it complements rather than replaces real-audio FD corpora. Useful for supervised FD behaviour learning; not for acoustic pretraining. The Hugging Face dataset card tags CC-BY-4.0.
#synthetic #full-duplex #backchannel #interruption #behavior-labels
Multi-stream Spontaneous Conversation (zh+en)
hours 15 h (10 zh + 5 en) · lang zh · en · role dialog · released 2024-11 · verified 2026-04
15 hours of dual-track two-speaker conversation (10 h Mandarin + 5 h English) with per-speaker audio channels — a rare public FD-native open corpus.
interactivity · high · research only · non-commercial · CC-BY-NC-ND-4.0
Released as an open-source sample of MagicData's commercial multi-stream corpus. Each speaker is recorded on their own channel, so natural interruptions, overlaps, and backchannels are preserved in a form a full-duplex model can actually learn from. 16 kHz mobile-phone recordings, CC-BY-NC-ND-4.0. Although small in absolute terms, this is one of the cleanest public examples of the channel-separated setup most FD systems need — and crucially covers both zh and en.
#channel-separated #multi-stream #dual-track #full-duplex #overlap #interruption #mandarin
InstructS2S-200K
hours — · lang en · role finetune · released 2024-09 · verified 2026-04
200k multi-turn speech-to-speech conversations tailored for instruction-following speech models.
interactivity · medium · research only · non-commercial · CC-BY-NC-4.0
Synthesised dialogues constructed to match speech-interaction characteristics (concise turns, spoken style). The core SFT corpus for LLaMA-Omni. An extended multi-turn version, Multiturn-Speech-Conversations, landed in May 2025.
#instruction #s2s #sft #synthetic · trains LLaMA-Omni 2
J-CHAT
hours 76k h · lang ja · role pretrain · released 2024-07 · verified 2026-04
~76,000 hours of Japanese spoken dialogue scraped from podcasts and YouTube — the first JP corpus targeted at dialogue-oriented speech-LMs.
interactivity · medium · research only · non-commercial · Research-only (JP Copyright Art. 30-4)
Built with a language-independent automatic pipeline for acoustic cleanliness and spontaneity. Used to pretrain J-Moshi, the first JP full-duplex system. Available on Hugging Face but restricted to non-commercial use under Japanese Copyright Act Art. 30-4.
#japanese #podcast #youtube #speech-lm
MultiDialog
hours 340 h · lang en · role dialog · released 2024-06 · verified 2026-04
340 hours across 9k audio-visual dialogues between 12 fluent English speakers, with emotion annotations on top of TopicalChat.
interactivity · medium · research only · non-commercial · CC-BY-NC-SA-4.0
Professional-studio recordings in parallel audio+video, built from the TopicalChat open-domain dialogue dataset. Supports multimodal dialogue generation, ASR, and TTS. Released at ACL 2024.
#multimodal #audio-visual #emotion #dialog
SD-Eval
hours 8.76 h · lang en · role eval · released 2024-06 · verified 2026-04
8.76 hours of curated evaluation across emotion, accent, age, and environment — non-commercial eval-only release under CC-BY-NC-4.0.
interactivity · n/a · research only · non-commercial · CC-BY-NC-4.0
Aggregates 7,303 utterances from 8 public sources (RAVDESS, JL Corpus, MEAD, VCTK, Common Voice, MyST, …). NeurIPS 2024. A 1,052-hour training split is also provided for fine-tuning. The Hugging Face dataset card is licensed CC-BY-NC-4.0, so commercial productisation requires a separate arrangement with the upstream sources; use this corpus for research evaluation only.
#paralinguistic #eval #emotion #accent #non-commercial
What the corpus map is telling us.
- gap · 01
Non-English interactive data is thin
Classic-drawer corpora outside English are overwhelmingly monologue, broadcast, or parallel-reading — few are dyadic conversational at CANDOR / AMI scale. The frontier has only started to fill the gap (J-CHAT for pretraining; MagicHub duplex sets as early dialog entries), but a permissive, large-scale, non-English conversational corpus remains unclaimed for most languages.
- gap · 02
Commercial-friendly turn-taking is thin
The classical turn-taking corpora (Fisher, Switchboard) are gated LDC paid releases; CANDOR (850 h) is free for research but CC-BY-NC-4.0, so not directly deployable. Among the truly permissive "high interactivity" sets, AMI (CC-BY-4.0, 100 h) and otoSpeech-full-duplex-280h (CC-BY-4.0, 280 h) stand out. The rest of the frontier wave (InteractSpeech, MM-F2F, MagicHub duplex) still ships under research / custom licenses.
- gap · 03
Scale vs. license is a Pareto frontier
YODAS (500k h) is CC-BY-SA, VoxPopuli (400k h audio) is CC0 (code & pretrained models CC-BY-NC-4.0), Emilia (101k h) is share-alike, People's Speech (30k h) is CC-BY-SA, Common Voice (30k+ h) is CC0. For a strictly permissive and large multilingual base, MLS (50k h, CC-BY-4.0) remains the cleanest option — with the tradeoff that it is still audiobook-flavoured and English-heavy.
- frontier watch
Interactive data is finally picking up
In the 2024–2026 window, 14 FD-oriented corpora landed, including J-CHAT, MultiDialog, InstructS2S-200K, SD-Eval, VoxDialogue, MM-F2F, InteractSpeech, OleSpeech-IV, the MagicHub duplex set, and otoSpeech-280h. Modalities are diversifying (audio-only → audio+video), and channel-separated stereo is becoming standard. Licenses remain the bottleneck.
Want to add a dataset or update a license note? Submit an entry.