Fullduplex/blog
§D · corpora

Datasets.

Corpora for training and evaluating audio foundation models — grouped by role (pretrain / dialog-interactive / eval / specialty), tagged with interactivity (how useful for full-duplex modeling) and license class (permissive / share-alike / gated / non-commercial / custom). Always verify with the primary source before shipping. Open the channel if you spot an error.
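To make the tags concrete, here is a minimal Python sketch of the record shape each card below implies. The `Corpus` class and its field names are illustrative assumptions, not the observatory's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Corpus:
    # Field names are assumptions for illustration, not the real schema.
    name: str                  # e.g. "Fisher English"
    role: str                  # "pretrain" | "dialog-interactive" | "eval" | "specialty"
    interactivity: str         # "high" | "medium" | "low"
    license_class: str         # "permissive" | "share-alike" | "gated" | "non-commercial" | "custom"
    hours: float
    languages: list[str] = field(default_factory=list)
    channel_separated: bool = False  # per-speaker tracks vs mixed-channel
    synthetic: bool = False          # TTS-generated vs real recordings

# Example card, values taken from the matrix below:
fisher = Corpus(
    name="Fisher English", role="dialog-interactive", interactivity="high",
    license_class="gated", hours=2000, languages=["en"], channel_separated=True,
)
```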

total tracked 42 · commercial ok 13 · high interactivity 15 · frontier (2024–26) 14
observatory refresh · 2026-04 · verified this month 0 / 42 · awaiting re-check 42
legend: interactivity · high / permissive / share-alike / non-commercial / gated / custom license
D-A · use-case matrix

Which dataset for which job?

Rows are the 42 public STS-adjacent corpora, split into Full-duplex (channel-separated) and Others. Columns are the training / eval roles they actually support. The full-duplex band opens by default; click a group header to fold it, hover a column for its definition, click a row to jump to the full card below.

Columns: Hours · Quality · Ch-sep · Pretrain · SFT / Instr. · Dialog · Eval · High int. · Conv. · Emotion · Multi-lang · Comm.-OK. Per-role support cells render only in the interactive matrix; the Others band is folded by default.

Full-duplex (channel-separated) band:

Dataset | Provider | Lang | Hours | Quality
Fisher English | LDC | EN | 2.0k h | n/a
CANDOR | BetterUp Labs (Reece, Cooney et al.) | EN | 850 h | n/a
otoSpeech-full-duplex-280h | otoearth | EN | 280 h | 48 kHz
Switchboard-1 (LDC97S62) | LDC | EN | 260 h | 8 kHz
otoSpeech-full-duplex-processed-141h | otoearth | EN | 141 h | 44.1 kHz
CALLHOME (JA / ZH / ES / DE / AR …) | LDC | JA / ZH / ES / DE / AR +1 | 120 h | 8 kHz
Multi-stream Spontaneous Conversation (zh+en) | Magic Data Technology | ZH / EN | 15 h | 16 kHz
Japanese Duplex Conversation Dataset | MagicData | JA | 10 h | 16 kHz
legend: supported · partial / implicit · not a fit
D-B · dialog corpora only

Dialog-interactive hours by language

Scope: the 19 real-recorded dialog-interactive public corpora we track (synthetic / TTS-generated corpora like Behavior-SD are excluded so they don't inflate acoustic totals), mapped onto the world's ten most-spoken languages. For multi-lingual corpora, hours are split evenly across every declared language — languages outside the top ten land in Other. Dashed rows = zero public hours in this scope; the linear X axis keeps the English dominance at true scale. Toggle channel-separated only to isolate the corpora whose audio is actually usable for full-duplex acoustic training (per-speaker tracks, not mixed-channel). A minimal sketch of the even-split rule follows the chart legend below.

[bar chart · dialog-interactive hours, linear axis] English 5k h · Mandarin 208 h · Hindi 159 h · Spanish 166 h · French 146 h · Arabic 20 h · Bengali 159 h · Portuguese 146 h · Russian 146 h · Urdu 159 h · Other 5k h
legend: permissive · share-alike · gated · non-commercial · custom
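The even-split rule is simple enough to state in code. This sketch reuses the illustrative `Corpus` record from the top of the page and is an assumption about the chart's bookkeeping, not its actual implementation:

```python
from collections import defaultdict

TOP_TEN = {"en", "zh", "hi", "es", "fr", "ar", "bn", "pt", "ru", "ur"}

def dialog_hours_by_language(corpora):
    """Aggregate dialog-interactive hours per language.

    Synthetic / TTS-generated corpora are excluded so they don't inflate
    acoustic totals; a multi-lingual corpus's hours are split evenly across
    every declared language; languages outside the top ten land in 'other'.
    """
    totals = defaultdict(float)
    for c in corpora:
        if c.synthetic or c.role != "dialog-interactive" or not c.languages:
            continue
        share = c.hours / len(c.languages)  # even split per declared language
        for lang in c.languages:
            totals[lang if lang in TOP_TEN else "other"] += share
    return dict(totals)
```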
D-C · release × FD fit

How full-duplex ready are public corpora?

Every public STS-adjacent corpus with a known release month and hour count (1993 – 2026), plotted by release date and scale. The default lens is channel-separated only — the per-speaker-track subset that is actually usable for full-duplex acoustic training — so the view opens with the corpora that matter most for FD builders. Untick the chip to see the full pool. Fill colour scores full-duplex fit (FD-ready = high interactivity and/or explicit overlap / turn-taking / back-channel labels); red ring marks channel-separated audio; dashed outline marks synthetic corpora (TTS-generated, useful for behaviour supervision but not a substitute for real recordings). Chips narrow by FD fit, licence class, commercial use, or channel separation. Click any dataset name for the one-screen briefing. A toy version of the fill-colour rule is sketched after the legend below.

filters: FD fit · Licence · Commercial · Channel (showing 8 / 40)
[scatter plot · hours (log) × release date, 1993–2026] visible with channel-separated only: Switchboard-1, CANDOR, Fisher English, CALLHOME, Japanese Duplex Conversation, Multi-stream Spontaneous Conversation, otoSpeech-full-duplex-280h, otoSpeech-full-duplex-processed-141h
legend: FD-ready (high interactivity / FD tags) · partial (dialog-interactive, medium) · non-FD (pretrain / eval / SFT) · channel-separated · synthetic (TTS)
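The fill-colour rule reads as a two-step classifier. A toy version, assuming the illustrative `Corpus` record above plus a hypothetical `fd_tags` set of annotation labels (the observatory's real tag vocabulary may differ):

```python
def fd_fit(c, fd_tags=frozenset()):
    """Score a corpus for the chart's fill colour.

    fd_tags is a hypothetical set of explicit annotation labels, e.g.
    {"overlap", "turn-taking", "back-channel"}.
    """
    if c.interactivity == "high" or fd_tags & {"overlap", "turn-taking", "back-channel"}:
        return "FD-ready"
    if c.role == "dialog-interactive" and c.interactivity == "medium":
        return "partial"
    return "non-FD"  # pretrain / eval / SFT roles
```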
§01 · turn-taking rich

Full-duplex & interactive

Dyadic, multi-party, or meeting audio where overlap and turn boundaries matter. The raw material for full-duplex models.

§02 · scale corpus

Large-scale pretraining

Bulk read / broadcast / web-scraped speech used for self-supervised and supervised pretraining — multilingual bases alongside high-volume single-language corpora.

§03 · fine-tune corpus

Fine-tune & TTS-grade

Smaller, curated corpora used for TTS and instruction-tuning — typically parallel-read, multi-speaker, single-language by construction.

§04 · prosody & affect

Paralinguistic & expressive

Laughter, whisper, emotion — the data that teaches speech-LMs the non-verbal layer.

§05 · curated eval

Multilingual eval & translation

Evaluation-grade corpora for cross-lingual ASR and S2ST. Small, curated, and properly licensed.

§06 · bleeding edge

Frontier (2024 – 2026)

Recent full-duplex / speech-LM oriented corpora, including niche releases. Skewed toward research licenses and fresh paper drops — sorted newest first.

§N · observatory notes

What the corpus map is telling us.

  1. gap · 01

    Non-English interactive coverage is thin

    Classic-drawer corpora outside English are overwhelmingly monologue, broadcast, or parallel-reading — few are dyadic, conversational recordings at CANDOR / AMI scale. The frontier has only started to fill the gap (J-CHAT for pretraining; MagicHub duplex sets as early dialog entries), but a permissive, large-scale, non-English conversational corpus remains unclaimed for most languages.
  2. gap · 02

    Commercial-friendly turn-taking is thin

    The classical turn-taking corpora (Fisher, Switchboard) are gated LDC paid releases; CANDOR (850 h) is free for research but CC-BY-NC-4.0, so not directly deployable. Among the truly permissive "high interactivity" sets, AMI (CC-BY-4.0, 100 h) and otoSpeech-full-duplex-280h (CC-BY-4.0, 280 h) stand out. The rest of the frontier wave (InteractSpeech, MM-F2F, MagicHub duplex) still ships under research / custom licenses.
  3. gap · 03

    Scale vs. license is a Pareto frontier

    YODAS (500k h) is CC-BY-SA, VoxPopuli (400k h audio) is CC0 (code & pretrained models CC-BY-NC-4.0), Emilia (101k h) is share-alike, People's Speech (30k h) is CC-BY-SA, Common Voice (30k+ h) is CC0. For a strictly permissive and large multilingual base, MLS (50k h, CC-BY-4.0) remains the cleanest option — with the tradeoff that it is still audiobook-flavoured and English-heavy. A minimal dominance check over (hours, license class) is sketched after these notes.
  4. frontier watch

    Interactive data is finally picking up

    In the 2024–2026 window, 14 FD-oriented corpora landed, among them J-CHAT, MultiDialog, InstructS2S-200K, SD-Eval, VoxDialogue, MM-F2F, InteractSpeech, OleSpeech-IV, the MagicHub duplex set, and otoSpeech-280h. Modalities are diversifying (audio-only → audio+video), and channel-separated stereo is becoming standard. Licenses remain the bottleneck.
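As promised in note 3, the scale-vs-license claim can be checked mechanically as Pareto dominance. A sketch, again over the illustrative `Corpus` record; the license ranking is a judgment call, not anything the observatory publishes:

```python
LICENSE_RANK = {  # freer / more deployable = higher; an assumed ordering
    "permissive": 4, "share-alike": 3, "gated": 2, "custom": 1, "non-commercial": 0,
}

def scale_license_frontier(corpora):
    """Return corpora on the Pareto frontier of (hours, license rank).

    A corpus is dominated if another has at least as many hours AND an
    equal-or-freer license, with a strict improvement on at least one axis.
    """
    def rank(c):
        return LICENSE_RANK[c.license_class]

    frontier = []
    for a in corpora:
        dominated = any(
            b.hours >= a.hours and rank(b) >= rank(a)
            and (b.hours > a.hours or rank(b) > rank(a))
            for b in corpora if b is not a
        )
        if not dominated:
            frontier.append(a)
    return frontier
```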

Want to add a dataset, or update a license note? Submit an entry.