Datasets.
Corpora for training and evaluating audio foundation models — grouped by role (pretrain / dialog-interactive / eval / specialty), tagged with interactivity (how useful for full-duplex modeling) and license class (permissive / share-alike / gated / non-commercial / custom). Always verify with the primary source before shipping. Open the channel if you spot an error.
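For anyone scripting against this catalog, the card fields reduce to a small record type. A minimal Python sketch — the field names and enum values below are paraphrased from the cards on this page, not an official export schema:

    from dataclasses import dataclass, field
    from typing import Literal

    Role = Literal["pretrain", "dialog", "eval", "finetune", "specialty"]
    Interactivity = Literal["high", "medium", "low", "n/a"]
    LicenseClass = Literal["permissive", "share-alike", "gated", "non-commercial", "custom"]

    @dataclass
    class DatasetCard:
        name: str
        hours: float | None               # None where a card shows "hours —"
        langs: list[str]                  # language codes as the cards print them
        role: Role
        interactivity: Interactivity
        license_class: LicenseClass
        channel_separated: bool = False   # per-speaker tracks
        synthetic: bool = False           # TTS-generated, no real recordings
        tags: list[str] = field(default_factory=list)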
Which dataset for which job?
Rows are the 42 public STS-adjacent corpora, split into Full-duplex (channel-separated) and Others. Columns are the training / eval roles they actually support. The full-duplex band opens by default; click a group header to fold it, hover a column for its definition, click a row to jump to the full card below.
Dialog-interactive hours by language
Scope: the 19 real-recorded dialog-interactive public corpora we track (synthetic / TTS-generated corpora like Behavior-SD are excluded so they don't inflate acoustic totals), mapped onto the world's ten most-spoken languages. For multi-lingual corpora, hours are split evenly across every declared language — languages outside the top ten land in Other. Dashed rows = zero public hours in this scope; linear X axis leaves the English dominance literal. Toggle channel-separated only to isolate the corpora whose audio is actually usable for full-duplex acoustic training (per-speaker tracks, not mixed-channel).
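The split rule is mechanical enough to sketch. Assuming corpora arrive as (hours, declared-languages) pairs — toy figures below, and an illustrative top-ten set rather than the exact list the chart uses:

    from collections import defaultdict

    TOP10 = {"en", "zh", "hi", "es", "fr", "ar", "bn", "pt", "ru", "ur"}  # illustrative

    # (hours, declared language codes) — toy stand-ins, not the tracked index
    corpora = [(3500.0, ["hi", "bn", "ta", "te"]), (2000.0, ["en"]), (15.0, ["zh", "en"])]

    hours_by_lang: dict[str, float] = defaultdict(float)
    for hours, langs in corpora:
        share = hours / len(langs)        # even split across every declared language
        for lang in langs:
            hours_by_lang[lang if lang in TOP10 else "Other"] += share

    # hours_by_lang["hi"] -> 875.0; hours_by_lang["Other"] -> 1750.0 (the ta + te shares)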
How full-duplex ready are public corpora?
Every public STS-adjacent corpus with a known release month and hour count (1993 – 2026), plotted by release date and scale. The default lens is channel-separated only — the per-speaker-track subset that is actually usable for full-duplex acoustic training — so the view opens with the corpora that matter most for FD builders. Untick the chip to see the full pool. Fill colour scores full-duplex fit (FD-ready = high interactivity and/or explicit overlap / turn-taking / back-channel labels); a red ring marks channel-separated audio; a dashed outline marks synthetic corpora (TTS-generated, useful for behaviour supervision but not a substitute for real recordings). Chips narrow by FD fit, licence class, commercial use, or channel separation. Click any dataset name for the one-screen briefing.
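In code, the caption's rules collapse to two predicates over the DatasetCard sketch above — the tag names are the ones used on the cards, and this is a paraphrase of the stated rule, not the page's actual scoring function:

    def fd_ready(card: DatasetCard) -> bool:
        # FD-ready = high interactivity and/or explicit overlap / turn-taking /
        # back-channel labels (approximated here via card tags).
        fd_labels = {"#overlap", "#turn-taking", "#backchannel", "#full-duplex"}
        return card.interactivity == "high" or bool(fd_labels & set(card.tags))

    def default_lens(cards: list[DatasetCard]) -> list[DatasetCard]:
        # The chart opens on the channel-separated subset; untick the chip to widen.
        return [c for c in cards if c.channel_separated]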
Full-duplex & interactive
Dyadic, multi-party, or meeting audio where overlap and turn boundaries matter. The raw material for full-duplex models.
IndicVoices (conversational slice)
hours ~3,500 h (conv.) · lang hi · bn · ta · te · … · role dialog · released 2024-03 · verified 2026-04
~3,500 hours of conversational speech across all 22 scheduled Indic languages — the single largest permissive source for Hindi/Bengali/Tamil/Telugu dialogue.
interactivity · medium · commercial ok · permissive · CC-BY-4.0
IndicVoices as a whole is a 23.7K-hour natural-speech corpus spanning 22 languages from 51K speakers across 400+ districts; 15% of that is conversational, giving ~3,500 hours of dialog-interactive audio under CC-BY-4.0 (commercial use permitted). 11.2K hours transcribed as of Dec 2025. This row counts only the conversational slice; the read/extempore majority sits upstream as pretraining fodder. Because hours are split evenly across many languages, the bucket columns on D-B show how even the largest Indic corpus is thinly spread per language.
#indic #multilingual #hindi #bengali #conversational
Fisher English
hours 2,000 h · lang en · role dialog · released 2004-12 · verified 2026-04
2,000 hours of 10-minute telephone conversations — a classic for conversational ASR and turn-taking baselines.
interactivity · high · restricted · gated · LDC (paid)
Ten-minute English phone calls between randomly paired US speakers, with transcripts. Distributed through LDC (paid), so mind the commercial terms.
#telephone #conversation #paid #channel-separated
KsponSpeech
hours 969 h · lang ko · role dialog · released 2020-03 · verified 2026-04
969 hours of Korean spontaneous dyadic conversation — the canonical Korean analogue of Fisher / Switchboard.
interactivity · high · restricted · gated · AI Hub (registration)
About 2,000 native Korean speakers recorded freely conversing in pairs on open-domain topics in a clean indoor environment. Dual transcription (orthography + pronunciation) plus disfluency tags for fillers, repetitions, and fragments. Distributed through the Korean government's AI Hub under a registration-gated licence; free for research on application.
#korean #spontaneous #dyadic #disfluency
CANDOR
hours 850 h · lang en · role dialog · released 2023-03 · verified 2026-04
850 hours of naturalistic video-chat conversation across 1,656 dyads — the go-to open corpus for full-duplex turn-taking research.
interactivity · high · research only · non-commercial · CC-BY-NC-4.0
1,656 dyadic video-chat conversations recorded over TokBox OpenTok between 1,456 US adults (2020). Ships per-speaker audio + video tracks, three turn parses (Audiophile / Cliffhanger / Backbiter), backchannels, prosody, facial-action features, and rich pre/post surveys. The Science Advances (2023) paper is CC BY-NC 4.0 — the corpus inherits the non-commercial clause, so commercial redistribution is not permitted. Free for research on signed data-request.
#turn-taking #dialog #full-duplex #video #channel-separated
Switchboard-1 (LDC97S62)
hours 260 h · lang en · role dialog · released 1993-01 · verified 2026-04
260 hours across 2,400 U.S. telephone conversations — the original turn-taking corpus on which the field was built.
interactivity · high · restricted · gated · LDC (paid)
Collected by Texas Instruments in 1990-91 under DARPA. 543 speakers; no pair talks twice and no speaker repeats a topic. 8 kHz 2-channel ulaw with orthographic transcripts. Distributed through LDC (paid).
#telephone #classic #turn-taking #paid #channel-separated
SpokenWOZ
hours 249 h · lang en · role dialog · released 2023-05 · verified 2026-04
5.7K human-to-human spoken task-oriented dialogues (203K turns, 249 h) — the canonical spoken task-oriented corpus.
interactivity · medium · research only · non-commercial · CC-BY-NC-4.0
Multi-domain task-oriented dialogues collected by crowdsourcing pairs of human speakers. 5,700 dialogues, 203,000 turns, ~249 hours across eight domains including restaurant, hotel, and taxi. Each turn ships audio, transcript, and task-oriented dialog-state labels (goals, slots, intents), plus a new 'cross-turn slot' annotation for values that span multiple utterances. Fills the 'spoken task-oriented' drawer that was missing between open-domain chat corpora and ASR benchmarks. CC-BY-NC-4.0, so research-only.
#task-oriented #dialog-state #multi-domain #spoken-dst
MagicData-RAMC
hours 180 h · lang zh · role dialog · released 2022-03 · verified 2026-04
180 hours of Mandarin spontaneous two-party conversation — the canonical CN-side analogue of Switchboard, free for academic use.
interactivity · high · research only · non-commercial · CC-BY-NC-ND-4.0
663 speakers from different accent regions of China, freely conversing in pairs over mobile-phone recordings in a quiet indoor environment. Manually transcribed and proofed, with speaker and topic metadata. Train/val/test split 15:1:2. Released on OpenSLR as SLR-123. Licensed CC-BY-NC-ND-4.0 — explicitly free for academic work but not for redistribution or commercial derivatives.
#mandarin #chinese #mobile-phone #turn-taking #overlap
CALLHOME (JA / ZH / ES / DE / AR …)
hours ~120 h · lang ja · zh · es · de · … · role dialog · released 1996-01 · verified 2026-04
Native-speaker telephone conversations across Japanese, Mandarin, Spanish, German, Arabic, and English — the classic non-English turn-taking set.
interactivity · high · restricted · gated · LDC (paid, per-release)
The CALLHOME series is a family of 6 LDC catalogs (English, Japanese, Mandarin, Spanish, German, Egyptian Arabic). Each release contains ~120 unscripted phone calls of up to 30 min between native-speaker family members or close friends. Channel-separated 8 kHz two-channel ulaw with orthographic transcripts; the Japanese, Mandarin, and Arabic releases also include aligned Romanisation. One of the only ways to get dyadic native-speaker telephone conversation outside English at scale. Distributed through LDC (paid).
#telephone #classic #turn-taking #paid #channel-separated #multilingual
AMI Meeting Corpus
hours 100 h · lang en · role dialog · released 2006-06 · verified 2026-04
100 hours of multi-party meeting recordings with close-talking mics, video, and overlap annotation — one of the few CC-BY corpora rich in overlapping speech.
interactivity · high · commercial ok · permissive · CC-BY-4.0
~2/3 of the corpus is 4-person fictional design-team meetings; the rest is mixed meeting types. Ships synchronized close-talking and far-field microphone audio, video, word-level transcripts, dialog acts, topic segmentation, and gesture and gaze annotations.
#multi-party #overlap #meetings #full-duplex
CHiME-6
hours 40 h · lang en · role dialog · released 2020-04 · verified 2026-04
40+ hours of real dinner-party conversation captured across 20 parties with multi-array + close-talk mics — overlap-rich multi-party speech.
interactivity · high · research only · non-commercial · Research use (CHiME Challenge terms)
Twenty genuine dinner parties held in private homes, each recorded with six distributed Kinect-style microphone arrays plus binaural close-talking references on each participant. ~40 hours total, 4 participants per session, unscripted conversation over ~2 hours. High backchannel / overlap density makes it one of the best real-recorded targets for multi-party full-duplex modelling — it is not dyadic, but it is the reference corpus for distant conversational ASR and for overlap-aware diarisation. Audio and transcripts are released under a research-use license by the CHiME organisers.
#multi-party #overlap #dinner-party #far-field #backchannel
DailyTalk
hours 20 h · lang en · role dialog · released 2022-07 · verified 2026-04
20 hours of two-speaker daily conversations with emotion and dialog-act labels — a staple for dialogue TTS.
interactivity · medium · research only · non-commercial · CC-BY-NC-SA-4.0
2,541 dialogues. Every utterance carries emotion and dialog-act labels, which is useful when you want to train TTS that preserves conversational context.
#dialog-TTS #emotion
Large-scale pretraining
Bulk read / broadcast / web-scraped speech used for self-supervised and supervised pretraining — multilingual bases alongside high-volume single-language corpora.
YODAS / YODAS2
hours 500k h · lang 100+ · role pretrain · released 2024-06 · verified 2026-04
500k hours of Creative-Commons YouTube speech across 100+ languages — currently the largest open multilingual corpus.
interactivity · low · restricted · share-alike · CC-BY / CC-BY-SA (per-source)
A massive automated multilingual speech dataset curated from CC-licensed YouTube. YODAS2 tightens quality filters. Downstream use inherits the CC-BY-SA share-alike clause.
#massive #multilingual #youtube
VoxPopuli
hours 400k h · lang en · de · fr · es · … · role pretrain · released 2021-01 · verified 2026-04
400k hours of unlabeled European Parliament speech in 23 languages + 1.8k hours transcribed + 17.3k hours of interpreter audio — data released CC0.
interactivity · low · commercial ok · permissive · CC0 (data) / CC BY-NC 4.0 (code & models)
Collected from 2009-2020 plenary sessions. The unlabeled pool is ~9-18k hours per language. Includes 29 hours of accented non-native English. The official README splits the license stack clearly: VoxPopuli data is CC0 (see the European Parliament's legal notice for the raw recordings), while the repository's code and pre-trained models are CC BY-NC 4.0. Commercial derivatives trained on the audio alone are straightforward; re-using the released checkpoints is not.
#massive #multilingual #parliament #interpretation #cc0 · trains SeamlessM4T v2
Emilia
hours 101k h · lang en · zh · ja · ko · … · role pretrain · released 2024-07 · verified 2026-04
101k hours of in-the-wild multilingual speech — a de-facto standard for large TTS/S2S training.
interactivity · low · restricted · share-alike · CC-BY-SA-4.0 (research)
Built from YouTube and podcast audio cleaned with VAD, denoising, and speaker separation. 101k hours across English, Chinese, Japanese, Korean, German, and French. Used by CosyVoice and Emilia-TTS.
#large-scale #multilingual #TTS
Multilingual LibriSpeech (MLS)
hours 50k h · lang en · de · nl · fr · … · role pretrain · released 2020-12 · verified 2026-04
50k hours of LibriSpeech-style audiobooks across 8 languages — the permissive multilingual pretraining baseline.
interactivity · low · commercial ok · permissive · CC-BY-4.0
Extends the LibriSpeech recipe to English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. One of the largest CC-BY-4.0 multilingual speech corpora — friction-free for commercial use.
#multilingual #large-scale #permissive · trains SeamlessM4T v2
ReazonSpeech
hours 35k h · lang ja · role pretrain · released 2023-03 · verified 2026-04
35,000 hours of Japanese, pseudo-labelled from TV broadcasts with Whisper — the largest practical JP training pool.
interactivity · medium · restricted · share-alike · CDLA-Sharing-1.0
Live-broadcast diversity combined with a Japanese-only focus. Labels are Whisper-generated, so expect noise at the tail. CDLA-Sharing means derivative datasets inherit the share-alike clause.
#japanese #broadcast #large-scale
Common Voice 18
hours 30k+ h · lang 120+ · role pretrain · released 2019-06 · verified 2026-04
Crowdsourced multilingual read speech — 120+ languages and 30k+ hours under CC0.
interactivity · low · commercial ok · permissive · CC0-1.0
Volunteers around the world donate their voice. Coverage is uneven, but for many low-resource languages this remains the only public option for initial ASR / TTS work. CC0 makes it the most portable multilingual corpus.
#crowdsourced #multilingual #cc0 · trains SeamlessM4T v2
People's Speech
hours 30k+ h · lang en · role pretrain · released 2021-11 · verified 2026-04
30k+ hours of diverse English speech harvested from Archive.org — the largest commercially usable English ASR corpus.
interactivity · low · commercial ok · share-alike · CC-BY-SA-4.0 / CC-BY-4.0
23.7 million examples in FLAC with auto-matched transcripts, built from public-domain / CC-BY / CC-BY-SA sources. A Baidu / Harvard / Intel / Landing AI / NVIDIA collaboration. Commercial use is explicitly supported within the share-alike constraint.
#english #large-scale #commercial-ok
GigaSpeech
hours 10k h · lang en · role pretrain · released 2021-06 · verified 2026-04
10k hours of English sourced from audiobooks, podcasts, and YouTube — broad acoustic coverage but research-only audio access.
interactivity · medium · research only · gated · Research-only (gated, Apache-2.0 code)
Designed to offset LibriSpeech's read-speech bias with conversation, lectures, and news. Released in XS / S / M / L / XL splits. The repository code is Apache-2.0, but the distributed audio requires agreement to SpeechColab's Terms of Access (Tsinghua-hosted) limiting use to non-commercial research and educational purposes. SpeechColab does not own copyright on the underlying audio.
#english #large-scale #podcast #youtube
LibriSpeech
hours 960 h · lang en · role pretrain · released 2015-04 · verified 2026-04
960 hours of read English audiobooks — the canonical ASR benchmark since 2015.
interactivity · low · commercial ok · permissive · CC-BY-4.0
Derived from LibriVox audiobooks. Traditionally split into train-clean-100, train-clean-360, and train-other-500. Still the baseline any new ASR system reports against.
#classic #english #ASR
Corpus of Spontaneous Japanese (CSJ)
hours 661 h · lang ja · role pretrain · released 2004-01 · verified 2026-04
661 hours of Japanese spontaneous speech — the canonical academic JP corpus for prosody, POS, and dependency.
interactivity · medium · restricted · gated · NINJAL (tiered)
3,302 recordings at 16 kHz / 16 bit. ~90% monologue (academic presentations, simulated public speaking), ~10% dialog. Ships 7.5M-word transcripts, morphology, 500k prosodic-label units, and dependency annotations. Tiered pricing — academic / general / commercial — administered by NINJAL.
#japanese #prosody #academic
Fine-tune & TTS-grade
Smaller, curated corpora used for TTS and instruction-tuning — typically parallel-read, multi-speaker, single-language by construction.
JVS Corpus
hours 30 h · lang ja · role finetune · released 2019-08 · verified 2026-04
30 hours of Japanese parallel readings across 100 speakers — the staple for multi-speaker JP TTS.
interactivity · low · restricted · custom license · CC-BY-SA-4.0 (with conditions)
Each of 100 speakers reads the same set of prompts. 30 hours of parallel speech that remains the default choice for Japanese multi-speaker TTS and voice conversion.
#japanese #multi-speaker #TTS
Paralinguistic & expressive
Laughter, whisper, emotion — the data that teaches speech-LMs the non-verbal layer.
Expresso
hours 40 h · lang en · role specialty · released 2023-08 · verified 2026-04
40 hours of studio-quality expressive speech — 4 speakers, 8 read styles, 26 improvised dialogue styles including laughter and whisper.
interactivity · medium · research only · non-commercial · CC-BY-NC-4.0
11 h of read speech plus 30 h of improvised dialogues recorded at 48 kHz / 24-bit. Styles include laughter, whisper, confused, enunciated, happy, sad, angry, and others. Accompanies the Interspeech 2023 benchmark for textless expressive resynthesis.
#expressive #prosody #laughter #whisper
IEMOCAP
hours 12 h · lang en · role specialty · released 2008-12 · verified 2026-04
12 hours of scripted and improvised dyadic acting — the emotion-recognition classic.
interactivity · medium · research only · gated · USC SAIL (request)
Actors perform improvised and scripted scenes, captured as synchronized video, audio, and motion. Labelled with six emotion classes; a staple baseline for emotion benchmarks.
#emotion #classic #dyadic
Multilingual eval & translation
Evaluation-grade corpora for cross-lingual ASR and S2ST. Small, curated, and properly licensed.
MuST-C
hours ~4k h · lang en → 14 · role eval · released 2019-06 · verified 2026-04
English TED-talk speech translated into 14 target languages — hundreds of hours per pair, the reference training corpus for ST.
interactivity · n/a · research only · non-commercial · CC-BY-NC-ND-4.0
Multilingual Speech Translation Corpus built from English TED Talks. Each English audio segment is paired with aligned text translations in 14 languages (German, Spanish, French, Italian, Dutch, Portuguese, Romanian, Russian, Arabic, Farsi, Turkish, Vietnamese, Chinese, Japanese). Per-direction sizes range from ~385 h (DE) down to ~100 h (smaller pairs). Unlike CVSS, all source audio is real human speech — the target is text only — so MuST-C is the standard supervised base for speech-to-text translation, and an upstream for cascaded S2ST. Released under CC-BY-NC-ND-4.0 (research-only).
#st #ted-talks #multilingual #aligned
CVSS
hours 3,809 h · lang 21 → en · role eval · released 2022-01 · verified 2026-04
Massively multilingual S2ST corpus — 21 source languages into English, with synthetic target speech — CC-BY-4.0.
interactivity · n/a · commercial ok · permissive · CC-BY-4.0
Common Voice-based Speech-to-Speech. Source audio comes from Common Voice; English target speech is synthesised by Google's Parallel-Tacotron with a single-speaker voice (CVSS-C) and with speaker-transfer voices (CVSS-T). 3,809 hours across 21 source languages. Because target speech is synthetic, CVSS is best read as 'paired supervision for S2ST models' rather than naturalistic target speech — the permissive license makes it the portable baseline for speech-to-speech translation research.
#s2st #translation #synthetic-target #multilingual
CoVoST-2
hours 2,880 h · lang 21 + en · role eval · released 2020-07 · verified 2026-04
2,880 hours of multilingual S2T translation built on top of Common Voice — the standard supervised speech-translation base.
interactivity · n/a · commercial ok · permissive · CC0-1.0
21-to-English and English-to-15 speech-to-text translation pairs. Frequently used for indirect S2ST evaluation and SeamlessM4T reproductions. CC0 makes it the most portable translation corpus.
#translation #multilingual #cc0 · trains SeamlessM4T v2
FLEURS
hours ~1k h · lang 102+ · role eval · released 2022-05 · verified 2026-04
Few-shot Learning Evaluation of Universal Representations of Speech — 102 languages, ~10 h each, built on FLoRes translation pairs.
interactivity · n/a · commercial ok · permissive · CC-BY-4.0
2,009 n-way parallel sentences recorded by native speakers. Supports ASR, speech-to-text translation, language ID, and speech-text retrieval. FLEURS-R (2024) restores audio quality for TTS-grade evaluation.
#multilingual #eval #translation #low-resource · trains SeamlessM4T v2
Frontier (2024 – 2026)
Recent full-duplex / speech-LM oriented corpora, including niche releases. Skewed toward research licenses and fresh paper drops — sorted newest first.
otoSpeech-full-duplex-280h
hours 280 h · lang en · role dialog · released 2026-02 · verified 2026-04
280 hours of channel-separated two-speaker English conversation at 48 kHz under CC-BY-4.0 — among the largest permissive, commercially-usable real-recorded FD corpora currently indexed on Hugging Face (as of 2026-04).
interactivity · high · commercial ok · permissive · CC-BY-4.0
Dyadic English conversation recorded in diverse real-world conditions. Stereo FLAC: channel 0 = speaker A, channel 1 = speaker B. Preserves natural overlaps, interruptions, and laughter. Every sample ships audio + session metadata + speaker profiles + redaction intervals + participant surveys, packaged as WebDataset tar shards. Released under CC-BY-4.0, so commercial use is permitted with attribution. A 141-hour human-reviewed + denoised variant (otoSpeech-full-duplex-processed-141h) is also available.
#channel-separated #48khz #overlap #laughter #full-duplex #dyadic
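Because each speaker sits on their own stereo channel, per-speaker streams fall out of a plain FLAC read. A minimal sketch with numpy and soundfile — the file name is hypothetical, and the 10 ms energy gate is just an illustrative way to eyeball overlap density, not anything shipped with the release:

    import numpy as np
    import soundfile as sf

    audio, sr = sf.read("session_0001.flac")   # hypothetical member file; shape (n_samples, 2)

    speaker_a = audio[:, 0]   # per the card: channel 0 = speaker A
    speaker_b = audio[:, 1]   # channel 1 = speaker B

    # Crude overlap estimate: both channels active within the same 10 ms frame.
    frame = sr // 100
    def active(x: np.ndarray, thr: float = 1e-3) -> np.ndarray:
        trimmed = x[: len(x) // frame * frame]
        energy = np.square(trimmed).reshape(-1, frame).mean(axis=1)
        return energy > thr

    overlap_ratio = float((active(speaker_a) & active(speaker_b)).mean())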
otoSpeech-full-duplex-processed-141h
hours 141 h · lang en · role dialog · released 2026-02 · verified 2026-04
141 hours of human-reviewed, denoised, channel-separated English FD conversation — the processed sibling of otoSpeech-280h.
interactivity · high · commercial ok · permissive · CC-BY-4.0
Curated from otoSpeech-full-duplex-280h: high-quality conversations selected by human review, processed with noise reduction and speech enhancement, plus new samples collected after the 280h release. 44.1 kHz channel-separated FLAC with session metadata, speaker profiles, redaction intervals, and surveys. Same CC-BY-4.0 license as the raw variant, so commercial use is permitted.
#channel-separated #44.1khz #denoised #curated #full-duplex #dyadic
InteractSpeech
hours 150 h · lang en · role dialog · released 2025-11 · verified 2026-04
150 hours of English speech-interaction data — synthesised + filtered real-world dialogues with precise speaker timestamps for interruptions and backchannels.
interactivity · high · restricted · custom license · Research (see repo)
Combines synthetic interactive dialogues with interactive segments mined from real speech corpora. Provides a formal framework for interaction dynamics and demonstrates a fine-tuned LLaMA-3 8B that classifies interactional events from audio. Findings of EMNLP 2025.
#interruption #backchannel #synthetic #full-duplex
OleSpeech-IV
hours 100 h (open subset) · lang en · role dialog · released 2025-09 · verified 2026-04
English conversational speech with accents from all world regions — human-refined speaker turns and transcripts, 100-hour open subset.
interactivity · medium · research only · non-commercial · Non-commercial research
The IV tier of the Olewave dataset series. OleSpeech-IV-2025-EN-AR-100 is 100 hours of English-only conversation (EN = English, AR = accents from all regions — not Arabic). Drawn from public podcasts, talk shows, and teleconferences; ships FLAC mono 16 kHz with speaker labels, turn info, timestamps, and confidence scores. The underlying Tier-IV collection holds 5,000+ hours across G20 languages, but only this English subset is publicly downloadable.
#multi-speaker #podcast #teleconference #diarization #accent-diverse
MLC-SLM (Interspeech 2025)
hours 1,604 h · lang en · fr · de · it · … · role dialog · released 2025-09 · verified 2026-04
1,604 hours of two-speaker conversational speech in 11 languages — the first genuinely multilingual public conversational speech-LM benchmark.
interactivity · high · restricted · gated · Challenge access (CC-BY-SA-4.0 for eval)
Released alongside the 1st Multilingual Conversational Speech LM challenge at Interspeech 2025. Covers English (500 h across 5 regional accents), French, German, Italian, Portuguese, Spanish, Japanese, Korean, Russian, Thai, and Vietnamese (100 h each). Every recording is a natural two-speaker conversation on assigned topics, 16 kHz mobile-phone indoor capture with oracle segmentation + speaker labels. Eval-1/Eval-2 ground truth (96 h) is openly on Hugging Face under CC-BY-SA-4.0; the 1,507-hour training set is distributed to registered challenge participants. The 2nd challenge (2026) expands to 14 languages / 2,100 hours.
#multilingual #diarization #turn-taking #challenge #interspeech
MM-F2F
hours 210 h · lang en · role dialog · released 2025-07 · verified 2026-04
210 hours of multi-modal face-to-face conversation with turn-taking and backchannel labels at word level.
interactivity · high · restricted · custom license · Research (see repo)
Collected via an automatic pipeline from human conversation video, de-identified by replacing faces and perturbing voiceprints. 1.5M words and ~20M frames. The trained end-to-end predictor reaches +10% F1 on turn-taking and +33% on backchannel prediction over previous SOTA. ACL 2025.
#multimodal #turn-taking #backchannel #video
VoxDialogue
hours — · lang en · role eval · released 2025-05 · verified 2026-04
4,500 multi-turn spoken dialog samples × 12 acoustic attributes — probing whether spoken dialog systems catch what text can't.
interactivity · n/a · restricted · custom license · Research (see repo)
Benchmarks speech-LMs on 12 acoustic attributes (speech rate, volume, emphasis, background sound, intonation, rhythm, gender, accent, emotion …). Shows that direct speech models pick up cues that ASR pipelines lose. Presented at ICLR 2025; data + code open-sourced.
#paralinguistic #eval #speech-lm #multi-turn
Behavior-SD
hours 2,164 h · lang en · role dialog · released 2025-04 · verified 2026-04
108K LLM-synthesised full-duplex dialogues (2,164 h) with explicit backchannel / interruption / filler labels — the largest publicly-downloadable FD-labeled corpus.
interactivity · high · commercial ok · permissive · CC-BY-4.0
Behavior-driven Spoken Dialogues: natural-language narratives rendered to speech via TTS, conditioned on speaker-wise behavioural traits (talkativeness, backchannelling, interruption rate, filler frequency). Every utterance carries turn-level timing and behaviour annotations. Explicitly synthetic — no real recordings — so it complements rather than replaces real-audio FD corpora. Useful for supervised FD behaviour learning; not for acoustic pretraining. The Hugging Face dataset card tags CC-BY-4.0.
#synthetic #full-duplex #backchannel #interruption #behavior-labels
Multi-stream Spontaneous Conversation (zh+en)
hours 15 h (10 zh + 5 en) · lang zh · en · role dialog · released 2024-11 · verified 2026-04
15 hours of dual-track two-speaker conversation (10 h Mandarin + 5 h English) with per-speaker audio channels — a rare public FD-native open corpus.
interactivity · high · research only · non-commercial · CC-BY-NC-ND-4.0
Released as an open-source sample of MagicData's commercial multi-stream corpus. Each speaker is recorded on their own channel, so natural interruptions, overlaps, and backchannels are preserved in a form a full-duplex model can actually learn from. 16 kHz mobile-phone recordings, CC-BY-NC-ND-4.0. Although small in absolute terms, this is one of the cleanest public examples of the channel-separated setup most FD systems need — and crucially covers both zh and en.
#channel-separated #multi-stream #dual-track #full-duplex #overlap #interruption #mandarin
InstructS2S-200K
hours — · lang en · role finetune · released 2024-09 · verified 2026-04
200k multi-turn speech-to-speech conversations tailored for instruction-following speech models.
interactivity · medium · research only · non-commercial · CC-BY-NC-4.0
Synthesised dialogues constructed to match speech-interaction characteristics (concise turns, spoken style). The core SFT corpus for LLaMA-Omni. An extended multi-turn version, Multiturn-Speech-Conversations, landed in May 2025.
#instruction #s2s #sft #synthetic · trains LLaMA-Omni 2
J-CHAT
hours 76k h · lang ja · role pretrain · released 2024-07 · verified 2026-04
~76,000 hours of Japanese spoken dialogue scraped from podcasts and YouTube — the first JP corpus targeted at dialogue-oriented speech-LMs.
interactivity · medium · research only · non-commercial · Research-only (JP Copyright Art. 30-4)
Built with a language-independent automatic pipeline for acoustic cleanliness and spontaneity. Used to pretrain J-Moshi, the first JP full-duplex system. Available on Hugging Face but restricted to non-commercial use under Japanese Copyright Act Art. 30-4.
#japanese #podcast #youtube #speech-lm
MultiDialog
hours 340 h · lang en · role dialog · released 2024-06 · verified 2026-04
340 hours across 9k audio-visual dialogues between 12 fluent English speakers, with emotion annotations on top of TopicalChat.
interactivity · medium · research only · non-commercial · CC-BY-NC-SA-4.0
Professional-studio recordings in parallel audio+video, built from the TopicalChat open-domain dialogue dataset. Supports multimodal dialogue generation, ASR, and TTS. Released at ACL 2024.
#multimodal #audio-visual #emotion #dialog
SD-Eval
hours 8.76 h · lang en · role eval · released 2024-06 · verified 2026-04
8.76 hours of curated evaluation across emotion, accent, age, and environment — non-commercial eval-only release under CC-BY-NC-4.0.
interactivity · n/a · research only · non-commercial · CC-BY-NC-4.0
Aggregates 7,303 utterances from 8 public sources (RAVDESS, JL Corpus, MEAD, VCTK, Common Voice, MyST, …). NeurIPS 2024. A 1,052-hour training split is also provided for fine-tuning. The Hugging Face dataset card is licensed CC-BY-NC-4.0, so commercial productisation requires a separate arrangement with the upstream sources; use this corpus for research evaluation only.
#paralinguistic #eval #emotion #accent #non-commercial
What the corpus map is telling us.
- gap · 01
Non-English interactive data is thin
Classic-drawer corpora outside English are overwhelmingly monologue, broadcast, or parallel-reading — few are dyadic conversational at CANDOR / AMI scale. The frontier has only started to fill the gap (J-CHAT for pretraining; MagicHub duplex sets as early dialog entries), but a permissive, large-scale, non-English conversational corpus remains unclaimed for most languages.
- gap · 02
Commercial-friendly turn-taking is thin
The classical turn-taking corpora (Fisher, Switchboard) are gated LDC paid releases; CANDOR (850 h) is free for research but CC-BY-NC-4.0, so not directly deployable. Among the truly permissive "high interactivity" sets, AMI (CC-BY-4.0, 100 h) and otoSpeech-full-duplex-280h (CC-BY-4.0, 280 h) stand out. The rest of the frontier wave (InteractSpeech, MM-F2F, MagicHub duplex) still ships under research / custom licenses.
- gap · 03
Scale vs. license is a Pareto frontier
YODAS (500k h) is CC-BY-SA, VoxPopuli (400k h audio) is CC0 (code & pretrained models CC-BY-NC-4.0), Emilia (101k h) is share-alike, People's Speech (30k h) is CC-BY-SA, Common Voice (30k+ h) is CC0. For a strictly permissive and large multilingual base, MLS (50k h, CC-BY-4.0) remains the cleanest option — with the tradeoff that it is still audiobook-flavoured and English-heavy.
- frontier watch
Interactive data is finally picking up
In the 2024–2026 window, 14 FD-oriented corpora landed, including J-CHAT, MultiDialog, InstructS2S-200K, SD-Eval, VoxDialogue, MM-F2F, InteractSpeech, OleSpeech-IV, the MagicHub duplex set, and otoSpeech-280h. Modalities are diversifying (audio-only → audio+video), and channel-separated stereo is becoming standard. Licenses remain the bottleneck.
Want to add a dataset or update a license note? Submit an entry.