Fullduplex
the sts series · 04 / 10 · #data #full-duplex · 09 sections · 08 figures · 01 matrix

The data ceiling.

Full-duplex conversational recordings at internet scale do not exist. The two escape hatches engineers reach for first — better separation AI and bigger YouTube scrapes — do not escape. Full-duplex speech-to-speech still leans on a 2004 telephone corpus for its post-training recipe. This article explains why, and what a workable data map looks like in April 2026.

fig.00 · the full-duplex training data problem, at a glance

Terms and thesis

Full-duplex conversational recordings at internet scale do not exist.

That is the sentence this article defends. Before the defense, two terms need pinning down.

Full-duplex describes a conversation in which both parties can speak and listen at the same time, the way humans actually talk. It is the opposite of a walkie-talkie, where one side speaks and the other side waits. A full-duplex speech-to-speech (STS) model has to handle overlap, barge-in, backchannel, and pause without pretending a conversation is strictly turn-by-turn. Article 02 treats this threshold in depth.

Full-duplex training data is recorded conversation that preserves the information a model needs to learn full-duplex behavior. The minimum bar is speaker isolation at the source: each participant written to a separate audio track, so an overlap between two people is two clearly attributed events rather than one acoustic blur. In the speech-research literature this property is almost always called “two-channel” or “dyadic two-track.” This article uses “full-duplex,” “full-duplex-ready,” and “two-channel” interchangeably to mean the same thing: recordings from which a full-duplex model can actually learn turn-taking.

Now the thesis. The largest open corpus of full-duplex conversational speech on the internet was collected in 2004. It is called Fisher, it was collected and distributed by the Linguistic Data Consortium, or LDC (Cieri et al. 2004), it contains approximately 1,960 hours of English telephone speech across 11,699 conversations, and each speaker was written to a separate disk track at collection time. No downstream separation was ever needed. That is not a historical footnote. As of April 2026, Fisher is still the default post-training corpus for state-of-the-art open-weights full-duplex STS models.

Everything the public internet has released since then is one of four things: smaller, or mono, or synthetic, or not legally redistributable for AI training. Switchboard (1997) is smaller at ~260 hours. CANDOR (2023) is CC BY-NC, which forbids commercial training use. Emilia (2024-2025) reaches 216,000 hours but is mono with source-level license ambiguity. OmniFlatten's internal training corpus (2024) is 100% TTS-synthesized. Abaka AI's 20,000-hour full-duplex corpus (March 2026) is commercial and direct-to-enterprise with gated pricing. Not one of these satisfies all three of (full-duplex at source, public, commercially usable at scale).

The asymmetry this leaves is categorical, not gradual. Modern STS models pre-train on millions of hours of mono audio. Moshi's backbone sees roughly 7,000,000 hours of web-scraped English speech (Défossez et al. 2024). Sesame CSM scales past 1,000,000 hours. Both then fine-tune their actual full-duplex behavior on a few thousand hours of Fisher. Twenty-two-year-old telephone calls are carrying the load the rest of the pipeline cannot.

This article defends the thesis in three moves. First, it shows why the two obvious escape hatches out of full-duplex scarcity do not escape. Can separation AI rescue mono recordings? Can YouTube and podcasts supply the conversational data we need? Both are the first questions a careful engineer asks. Both have answers that require engaging with the counterargument fairly rather than dismissing it. Second, it walks through what open corpora actually contain, corpus by corpus. Third, it proposes a map of what data is fit for what training phase, which turns the scarcity into an operational requirements list.

fig.01 · the twenty-two-year peak
[scatter: hours (log scale, 100 h to 10 M h) by release year (2000-2026). Points: Switchboard, Fisher, AMI, LibriSpeech, LibriLight, CANDOR (NC), GigaSpeech, Emilia (mixed NC), Moshi pre-train, Sesame CSM, Abaka (commercial). Legend: 2-channel dyadic · mono · non-commercial license.]
Public speech corpora by release year and scale. Two-channel dyadic (blue) peaks at Fisher in 2004 and Abaka in 2026. The giants to the upper right are mono pre-training corpora. The gap between the blue column and the gray column is the full-duplex training data problem.

Why mono isn't enough, and why separation AI can't rescue it

2.1 The mono collapse, briefly

Full-duplex training data breaks the moment two speakers collapse onto a single channel. A 200 ms overlap that was two clearly attributed events becomes one acoustic blur; backchannels get absorbed into the main speaker's envelope; turn boundaries move from deterministic timestamps to inferred ones. Article 05 (The two-channel imperative) gives the signal-processing treatment of this collapse. The rest of this section takes it for granted and engages the strongest counterargument: can better separation AI undo the mono collapse after the fact?
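Before taking that question up, a toy numpy sketch of what the collapse destroys. It assumes nothing about any real corpus or pipeline; the tracks, thresholds, and timings are invented purely for illustration.

```python
# A toy illustration (not a corpus pipeline): two per-speaker tracks with a
# known 200 ms overlap, and what survives the collapse to mono.
import numpy as np

sr = 16_000                      # sample rate (Hz)
t = np.arange(0, 3.0, 1 / sr)    # 3 seconds of conversation

# Speaker A talks from 0.0-1.6 s, speaker B from 1.4-3.0 s: a 200 ms overlap.
a = np.sin(2 * np.pi * 140 * t) * ((t >= 0.0) & (t < 1.6))
b = np.sin(2 * np.pi * 220 * t) * ((t >= 1.4) & (t < 3.0))

def active(track, frame=0.02):
    """Frame-level activity from a simple energy threshold (20 ms frames)."""
    n = int(frame * sr)
    frames = track[: len(track) // n * n].reshape(-1, n)
    return (frames ** 2).mean(axis=1) > 1e-4

act_a, act_b = active(a), active(b)

# Two-channel: the overlap is two attributed events with exact boundaries.
overlap_frames = act_a & act_b
print("overlap (s), attributed to both A and B:", overlap_frames.sum() * 0.02)

# Mono: one envelope. Activity says "someone is talking" but not who,
# so the same 200 ms is no longer a labeled joint event.
mono = a + b
act_mono = active(mono)
print("mono active (s), speaker unknown:", act_mono.sum() * 0.02)
```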

2.2 The separation counterargument, stated fairly

A natural reply is: source separation AI is improving fast, and diarization is a mature field. If mono recordings can be split into per-speaker tracks with high accuracy, the gap between mono web audio and clean full-duplex recordings should close. Current best-in-class separation models include SepFormer (2021), Conv-TasNet (2019), TDANet (2023), and MossFormer2 (2024). Current best-in-class diarization includes pyannote 3.x, NVIDIA NeMo, and EEND-EDA. These systems exist, they are public, and they are actively used in production pipelines. The question is not whether separation AI works. It is whether it works well enough, under the conditions real conversation contains, to produce training labels clean enough for a model that has to place a turn onset within ±50 ms of the right moment.

2.3 Where the ceiling actually is

Two scoreboards appear in this section, so it is worth pausing on what each one means. SI-SDRi (scale-invariant signal-to-distortion ratio improvement) measures how cleanly a separation model pulls one voice out of a mix. It is reported in decibels on a log scale. As a rule of thumb: 10 dB is “cleaner than raw noise,” 20 dB is “close to a clean studio voice,” 25 dB is “indistinguishable from the original by ear.” WER (Word Error Rate) measures how often the downstream speech-to-text system gets words wrong, as a percentage where 0% is perfect and 50% is garbage. The first score is the separation stage; the second is the transcription stage of the same pipeline.
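For readers who want the definitions pinned down in code, here is a minimal sketch of both metrics. These are the standard textbook formulas, not the exact implementations any cited benchmark used.

```python
import numpy as np

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare the projected 'target' energy to the residual energy."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def wer(ref_words, hyp_words):
    """Word error rate via edit distance: (sub + del + ins) / len(ref)."""
    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref_words) + 1)
    d[0, :] = np.arange(len(hyp_words) + 1)
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[-1, -1] / len(ref_words)

# SI-SDRi is then si_sdr(ref, separated) - si_sdr(ref, mixture):
# the improvement the separator delivers over doing nothing.
```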

WSJ0-2mix is the canonical benchmark and the research ideal: two studio recordings added together in software, then asked to be pulled apart again. On it, SepFormer scores about 22.3 dB and MossFormer2 about 24.1 dB. Near-clean by the rule of thumb above. Move to WHAMR!, which folds in noise and reverberation, and the best numbers fall to roughly 14 to 17 dB. Move to a benchmark recorded in an actual room at natural overlap rates and the numbers collapse.

The most load-bearing evidence sits in LibriCSS (Chen et al. 2020), a benchmark designed to measure WER after separation on recordings with controlled overlap rates. At 30% overlap, the condition closest to natural conversation, single-channel ASR with no separation produces a 34.6% WER. Roughly one word in three is wrong. A 7-channel microphone array with neural masking brings that down to 18.4%, still roughly one word in five. At 40% overlap the pair is 43.2% and 21.6%. These are not error rates that support supervised fine-tuning of a model whose job is to place a turn onset within ±50 ms of the correct moment.

fig.02 · where separation collapses
WER on LibriCSS vs overlap rate (lower is better):

overlap   single channel, no separation   7-channel + neural mask
0%        11.5%                           8.3%
10%       18.3%                           11.6%
20%       26.4%                           16.0%
30%       34.6%                           18.4%
40%       43.2%                           21.6%
WER on LibriCSS by overlap rate. Even with a 7-channel microphone array and neural masking, WER at 30–40% overlap (the natural conversation range) stays between 18% and 22%. Single-channel mono with no separation doubles that. Neither number supports supervised training of turn-taking at ±50 ms precision.

2.4 Diarization error on real conversation

Separation is only half of the pipeline. Even if you have two clean tracks, you still have to answer: which track belongs to which speaker, turn by turn. That is diarization, the “who spoke when” stage. Its error metric is DER (Diarization Error Rate), the fraction of audio time labeled with the wrong speaker, a missed speaker, or a hallucinated one.
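A minimal sketch of the metric; the numbers in it are invented for illustration only.

```python
def der(false_alarm, missed, confusion, total_speech):
    """Diarization Error Rate: non-speech labeled as speech (false alarm),
    speech the system missed, and time attributed to the wrong speaker
    (confusion), as a fraction of total reference speech time. Units: seconds."""
    return (false_alarm + missed + confusion) / total_speech

# Illustrative numbers only: a 1,000-second conversation where the system
# hallucinates 40 s of speech, misses 60 s, and swaps speakers for 124 s
# lands near the DER level reported below for meeting-style audio.
print(der(false_alarm=40, missed=60, confusion=124, total_speech=1000))  # 0.224
```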

pyannote 3.1, the current research default, reports a DER of about 22.4% on AMI (meeting audio from a single distant microphone) and 11.3% on VoxConverse (YouTube-style interviews). These are good numbers for research purposes.

At 22% DER, roughly one turn in four is mis-attributed to the wrong speaker.

A full-duplex model trained on labels at this quality learns a world where “the model” and “the user” swap voices at random one time out of four. That is not the kind of label noise that averages out at scale. It corrupts the exact structure, who-speaks-when, that the model is supposed to learn.

fig.03 · what 22% DER looks like
[diagram: four conversation turns. Turn 1 labeled user (correct), turn 2 labeled model (correct), turn 3 labeled model but actually user (flipped), turn 4 labeled model (correct). The model trains on the flipped label as if it were true; the voices flip once every four turns.]
DER visualization. A diarization system running at 22% DER, projected onto a four-turn window, places one turn on the wrong speaker. The training pipeline cannot tell the difference, so the downstream model inherits a world in which speaker identity flips every fourth turn. This is the structure a full-duplex model is supposed to model cleanly.

The CHiME-6 dinner-party challenge makes the same point from a different angle. CHiME-6 is a dinner-party audio set: four participants per party, a real home, real background noise, the everyday hard case. The challenge has two tracks. Track 1 hands the system a perfect speaker label table (a human wrote it). Track 2 asks the same system to build its own label table first and then transcribe. On Track 1, the baseline reaches ~51% WER. On Track 2, WER rises into the high 60s to 80s depending on the setup. The 15 to 30 percentage-point gap is the cost of building the label table yourself. It is the price every real-world pipeline pays. Oracle-label quality is what training needs. System-label quality is what training pipelines actually produce.

2.5 Compounding error is the real argument

Accuracy does not add across a pipeline. It multiplies.

Picture an assembly line with three stations. Station 1 is separation, which splits the mono mix into per-speaker tracks. Station 2 is diarization, which labels which track is which speaker at which time. Station 3 is ASR, which transcribes what was said. Each station has its own error rate. The labeled training data that drops off the end of the belt inherits the combined error of all three. And fixing one station can damage another.
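A back-of-the-envelope sketch of that multiplication. The per-stage numbers below are assumptions chosen for illustration, not measurements from any of the cited benchmarks.

```python
# Illustrative only: per-stage probabilities that a given turn boundary survives
# each station with a usable label. The values are assumptions for the sake of
# the arithmetic, not benchmark results.
stages = {
    "separation": 0.85,   # overlap region separated cleanly enough to segment
    "diarization": 0.78,  # turn attributed to the correct speaker
    "asr": 0.80,          # words near the boundary transcribed correctly
}

survival = 1.0
for name, p in stages.items():
    survival *= p
    print(f"after {name:<12} {survival:.2%} of turn labels still usable")

# after separation   85.00%
# after diarization  66.30%
# after asr          53.04%
# Roughly half the labeled turn boundaries are trustworthy by the end of the
# belt, even though no single station looks catastrophic on its own.
```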

Raj et al. (2021) put this composition on record. They added a separation front-end to a LibriCSS pipeline. On the hard sections, where speakers overlap, it dropped concatenated-minimum-permutation WER from 27.1% to 13.4%. A clear win. On the easy sections, where only one person was talking, the same front-end raised WER from 11.2% to 12.4%. A small loss, but a loss. The separator introduced artifacts the downstream ASR had never been trained on. Fix the overlap, and a fraction of the non-overlap breaks.

The CHiME-8 DASR organizers in 2024 state the ceiling plainly. Cornell et al. write that “accurate speaker counting in the first diarization pass is crucial to avoid compounding errors,” and that “all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios.” That is the peer-reviewed position of the people running the benchmark in 2024.

The takeaway is not “separation is broken.” It is that separation plus diarization plus ASR, applied to mono web audio, produces training labels of a quality that the downstream model cannot tolerate. The model's job is to place a turn onset within tens of milliseconds. The upstream pipeline cannot deliver labels at that precision from a mono source. That is the ceiling.

fig.04 · oracle vs real
[paired bars. Top: CHiME-6 WER with oracle diarization ~51% vs system diarization ~80%. Bottom: LibriCSS 0%-overlap WER (Raj 2021) with no separation 11.2% vs with separation 12.4%; the separator hurts clean audio.]
Top: the CHiME-6 gap between Track 1 (oracle diarization) and Track 2 (system diarization). Bottom: on LibriCSS zero-overlap audio, adding a separation front-end slightly worsens WER because the ASR was not trained on separator artifacts. Both illustrate that pipeline composition is non-linear.

Why YouTube and podcasts do not supply full-duplex data

3.1 The apparent abundance, and the scale illusion

YouTube hosts billions of hours of speech. Podcasts, tens of millions. Two-speaker interview formats are a well-established genre in both. At first glance this supply dwarfs Fisher by four or five orders of magnitude, and the careful engineer's second question follows naturally: if mono is hard to rescue, isn't there enough already-two-track content on the internet to skip the problem entirely?

The answer is no, and the gap is not a rounding difference. Once the license filter, the content-shape filter, and the training-phase filter are applied to the apparent abundance, the fraction that can legitimately be used to teach full-duplex turn-taking collapses by roughly seven orders of magnitude. The rest of this section walks each filter in turn.
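The "seven orders of magnitude" figure is nothing more exotic than a log-ratio. A two-line check, using the same order-of-magnitude estimates as the figure below:

```python
import math

# Order-of-magnitude estimates only (matching fig.05, not measurements).
apparent_hours = 10e9        # YouTube-scale total speech content
usable_hours = 2_500         # public, full-duplex at source, commercially usable

gap = math.log10(apparent_hours / usable_hours)
print(f"{gap:.1f} orders of magnitude")   # ~6.6, i.e. the "roughly seven" above
```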

fig.05 · the scale illusion
[horizontal bars, hours on a log scale (1k to 10B): public full-duplex-ready ~2-3k h (Fisher + Switchboard + small sets) · podcast catalog ~100 M h (order of magnitude) · YouTube total content ~10 B h (order of magnitude).]
The apparent supply of two-speaker audio on the internet and the supply that can legitimately be used to teach full-duplex turn-taking are separated by roughly seven orders of magnitude. The difference is categorical, not quantitative. YouTube and podcast totals are order-of-magnitude estimates for visualization only.

3.2 License, consent, and content shape

The top filter is legal. YouTube's Terms of Service explicitly prohibit automated extraction and forbid training machine learning models on YouTube content without authorization. YouTube has added opt-in third-party AI training controls that default to off, so a creator must actively grant permission before their audio is a legitimate training input. Litigation such as Millette v. OpenAI, and YouTube CEO Neal Mohan's public statement that scraping would be a "clear violation" of its terms, pressure-tested this boundary through 2024 and 2025.

Podcasts are a softer problem that is still a problem. RSS delivers the audio but does not license it. One of the co-creators of the RSS standard launched a separate protocol, RSL, in 2025 precisely because RSS contains no training-license field. Interview guests and background participants are rarely under any contract that allows their voices to be used for model training.

Beneath the license filter is a content-shape filter. The majority of YouTube and podcast audio is not spontaneous dyadic conversation. It is monologue, scripted interview, edited panel, sports commentary, or audiobook narration. The shows with two speakers at once tend to be professionally produced, which means cross-talk has been cut out in post. Editing removes the backchannel, the repair, the hesitation at the turn-transition point, which are precisely the phenomena a full-duplex model has to learn.

3.3 What each training phase actually needs

Even if licensing and content-shape were somehow resolved, a third filter applies. STS models do not train in one phase. They train in at least three: a self-supervised pre-training phase that learns audio representations from raw waveforms, a mid-training phase that teaches the model to handle dialogue-shaped inputs, and a post-training phase that shapes the actual turn-taking behavior. Each phase tolerates different defects in its input.

HuBERT pre-trained its LARGE model on LibriLight, roughly 60,000 hours of mono audiobook audio. Wav2Vec 2.0 used the same source. Neither model needed full-duplex data, dialogue structure, or turn boundaries to learn useful representations. For this phase, scraped-quality mono works. Moshi's backbone is pre-trained on about 7 million hours of web-scale mono speech with Whisper-generated pseudo-labels. Sesame CSM pre-trains on about 1 million hours of similar material. The pre-training phase consumes mono audio by the million-hour. That is what YouTube-like corpora are fit for.

The problem is that pre-training is not where a model learns to listen while it speaks. Sesame CSM, with its 1 million hours of mono pre-training, does not have a native full-duplex mode. Scaling mono pre-training is not sufficient. The full-duplex behavior is learned downstream, at the post-training stage where both Moshi and NVIDIA PersonaPlex converge on the same answer: Fisher, or small samples of it. PersonaPlex in particular fine-tunes on 1,217 hours of Fisher alongside 2,250 hours of synthetic. Real full-duplex dialogue carries 35% of the fine-tuning weight in a state-of-the-art open-weights STS model from January 2026.

YouTube-grade mono audio is fit for pre-training, partly fit for mid-training with careful curation, and structurally unfit for post-training. The post-training phase is the one the full-duplex behavior lives in, and no amount of YouTube scraping fills it.

The training phase × data type matrix

Put the findings from §2 and §3 together and the picture is a grid, not a list. Three training phases. Four data types. Twelve cells, each with a different fitness answer.

The phases are pre-training (self-supervised representation learning on unlabeled audio), mid-training (continued pre-training and modality alignment on dialogue-shaped inputs), and post-training (supervised fine-tuning plus RLHF or DPO, where turn-taking behavior is actually shaped). The data types are web mono (YouTube, podcast, audiobook), public full-duplex dyadic (Fisher, CANDOR, AMI), synthetic dialogue (OmniFlatten's 2,000-hour CosyVoice-generated corpus, PersonaPlex's Chatterbox-rendered dialogs), and commercial full-duplex (Abaka AI's 20,000 hours, in-house collections like Kyutai's 170-hour seed).

fig.06 · phase × data type

Pre-training
· Web mono: fit (HuBERT-LL 60kh, Moshi 7Mh, CSM 1Mh)
· Public full-duplex dyadic: unfit (too small to drive self-supervised scale)
· Synthetic dialogue: unfit (no public STS recipe pre-trains on synthetic)
· Commercial full-duplex: unfit (no recipe at pre-training scale yet)

Mid-training
· Web mono: fit (Freeze-Omni 110kh ASR; Moshi diarized multi-stream)
· Public full-duplex dyadic: partial (used sparingly; size not dominant)
· Synthetic dialogue: partial (OmniFlatten Stage 1 modality-alignment pairs)
· Commercial full-duplex: unfit (rarely documented at mid-training scale)

Post-training
· Web mono: unfit (turn-taking not learnable from mono)
· Public full-duplex dyadic: fit (Moshi Fisher fine-tune; PersonaPlex 1,217h Fisher)
· Synthetic dialogue: partial (PersonaPlex 2,250h, OmniFlatten 2kh synthetic)
· Commercial full-duplex: fit (Abaka 20kh; Kyutai 170h seed, closed)

Legend: fit = cited model recipe · partial = works additively / hybrid · unfit = structurally unfit or not documented
Training phase × data type matrix. Reading the bottom row: the phase where full-duplex turn-taking is actually learned is served by exactly one column of public data that exists in meaningful quantities (Fisher), plus a small commercial tier (Abaka), plus synthetic data that works only when a real seed is available.

Reading the grid row by row. At pre-training, web mono is the default across the field. HuBERT LARGE uses 60,000 hours of LibriLight. Moshi's backbone uses 7,000,000 hours of web-scale speech. Sesame CSM uses 1,000,000 hours. Public full-duplex dyadic corpora are not used at this phase because they are too small; Fisher's 2,000 hours is a rounding error against 7 million. Synthetic pre-training is rare in the STS literature because it offers no scale advantage. Commercial full-duplex at pre-training scale does not exist as a public recipe.

At mid-training, the picture diversifies. Freeze-Omni's 110,000-hour ASR corpus sits in the web-mono column. Moshi's diarization-simulated multi-stream pre-training is a hybrid that treats mono web audio as if it were full-duplex by labeling speaker activity. Synthetic mid-training appears in OmniFlatten's Stage 1 modality-alignment pairs. Public full-duplex dyadic at mid-training scale is mostly absent from the recipes we have read.

At post-training, the grid collapses toward two columns. Web mono is not used, because turn-taking behavior is not learned from content without speaker separation. Public full-duplex dyadic is where Fisher and its relatives carry the day: Moshi's full-duplex fine-tune, PersonaPlex's 1,217-hour Fisher portion, every top-performing open-weights full-duplex STS that has published a recipe. Synthetic post-training is used additively, never as a replacement. PersonaPlex's 2,250 hours of synthetic sit alongside, not instead of, Fisher. Commercial full-duplex is the column that is structurally available but not yet dominant in published recipes. Abaka AI's March 2026 announcement proves the tier exists commercially. No peer-reviewed open-weights model has yet been published with commercial full-duplex as the majority post-training source.

The phase where full-duplex behavior is actually learned is served by exactly one column of data that exists in meaningful quantities today, and that column is full-duplex dyadic conversation at real overlap rates with redistribution rights.

Fisher is ~2,000 hours of it. Abaka AI claims 20,000 hours of it commercially. Everything else is either additive or unfit. This is the scarcity the rest of the article develops.
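For teams that want to operationalize the grid, here is one way to encode it as a lookup. The structure mirrors fig.06; the type and function names are illustrative, not any published tooling.

```python
# The fig.06 grid as data. "fit" / "partial" / "unfit" follow the matrix above;
# the routing rule at the end is the operational takeaway, not a library API.
FITNESS = {
    ("pre-training",  "web_mono"):                "fit",
    ("pre-training",  "public_full_duplex"):      "unfit",
    ("pre-training",  "synthetic_dialogue"):      "unfit",
    ("pre-training",  "commercial_full_duplex"):  "unfit",
    ("mid-training",  "web_mono"):                "fit",
    ("mid-training",  "public_full_duplex"):      "partial",
    ("mid-training",  "synthetic_dialogue"):      "partial",
    ("mid-training",  "commercial_full_duplex"):  "unfit",
    ("post-training", "web_mono"):                "unfit",
    ("post-training", "public_full_duplex"):      "fit",
    ("post-training", "synthetic_dialogue"):      "partial",
    ("post-training", "commercial_full_duplex"):  "fit",
}

def usable_sources(phase: str) -> list[str]:
    """Data types that can carry (fit) or supplement (partial) a given phase."""
    return [dtype for (p, dtype), grade in FITNESS.items()
            if p == phase and grade != "unfit"]

print(usable_sources("post-training"))
# ['public_full_duplex', 'synthetic_dialogue', 'commercial_full_duplex']
# Of these, only public_full_duplex exists at meaningful public scale today.
```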

Public corpus walkthrough

What exists publicly is worth naming specifically, because the catalog drives the scarcity argument.

fig.07 · the catalog at a glance

Corpus          Hours       Channel     License         Commercial?
Fisher          ~1,960      2-channel   LDC fee         tier
Switchboard     ~260        2-channel   LDC fee         tier
CANDOR          ~850        2-channel   CC BY-NC 4.0    no
AMI             ~100        multi       CC BY 4.0       yes
ICSI            ~72         multi       mixed           check
Emilia (core)   ~101,000    mono        CC BY-NC 4.0    no
Emilia-YODAS    ~114,000    mono        CC BY 4.0*      source-dependent
Abaka AI 2-ch   ~20,000     2-channel   commercial      yes
Public and commercial speech corpora at a glance. The column that matters for full-duplex post-training is “2-channel” AND “commercial yes”. Two rows qualify: Fisher (paid LDC license) and Abaka AI (commercial procurement). The ~216,000-hour Emilia headline hides a license split between the NC core and the YODAS extension, which inherits upstream YouTube CC-BY tags.

Across the public catalog, the intersection of (full-duplex at source) and (commercial redistribution permitted) and (sufficient scale for fine-tuning) contains essentially Fisher, Switchboard, and a handful of smaller academic sets.

That intersection has not meaningfully expanded in twenty years of public dataset releases. Several 2025-2026 academic dyadic sets (InterActSpeech, DialogueSidon, MultiDialog, DeepDialogue, MLC-SLM) add full-duplex or dialogue-shaped material at smaller scale. Useful as research anchors. Licenses mixed, hour counts mostly in the tens to low hundreds. Too small individually to carry a fine-tune, potentially useful in aggregate.

What synthetic data can and can't do

Synthetic dialogue is the third escape hatch the literature has explored, and it is the most interesting one because it works, partially, in ways that clarify what real data is for.

The strongest evidence that synthetic data works is OmniFlatten, which trained a 0.5B-parameter model to usable full-duplex behavior on roughly 2,000 hours of dialogue that was 100% generated by the CosyVoice TTS system. No real full-duplex recordings at any stage. The result was not state-of-the-art, but it crossed the threshold of “the model does the behavior.” So the ceiling on synthetic is not “zero.”

The ceiling argument is more subtle. Synthetic dialogue is bounded in distribution by the TTS model that generates it. Prosody collapses toward the TTS's prior. Backchannel frequency becomes rule-based because the generator has to be told when to emit a backchannel. Disfluencies are either absent or scripted. Overlap structure reflects the script's turn-taking, not the spontaneous timing humans produce. You cannot learn a behavior that was not in the generator's output distribution.

The honest working pattern in state-of-the-art recipes is additive. Moshi's 20,000 hours of synthetic instruction data are generated by Kyutai's own multi-stream TTS, which was itself trained on 170 hours of real full-duplex Kyutai recordings. That is a 100× amplification from a real seed to a synthetic extension, but the real seed is not removable. PersonaPlex's 2,250 hours of synthetic customer-service and QA dialogs sit alongside 1,217 hours of real Fisher. The mix is roughly 35% real and 65% synthetic. The real portion is not the larger half, but it is the half that carries the in-distribution anchor.
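The arithmetic behind those two mixes is worth making explicit, using only the hour counts reported above:

```python
# The two published mixes, as plain arithmetic on the hours reported in the text.
kyutai_seed, kyutai_synth = 170, 20_000          # hours: real seed -> synthetic extension
print(f"Moshi amplification: ~{kyutai_synth / kyutai_seed:.0f}x from the real seed")

fisher_real, persona_synth = 1_217, 2_250        # PersonaPlex post-training mix
real_share = fisher_real / (fisher_real + persona_synth)
print(f"PersonaPlex real share: {real_share:.0%} of fine-tuning hours")

# With a zero-hour real seed, the first ratio is undefined and the second is 0%:
# synthetic data multiplies a seed, it does not conjure one.
```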

Synthetic dialogue shifts the training curve. It does not replace the need for real full-duplex data at the post-training stage. It multiplies the real seed.

The scarcity economics therefore sit on the size of the seed, not on the total hours fed to the model. A lab with 170 hours of real full-duplex can produce 20,000 hours of synthetic. A lab with zero hours of real full-duplex produces zero hours of useful synthetic. Articles 05 and 06 develop this further.

Commercial full-duplex data market

Outside the public corpus catalog sits a commercial tier that has begun to price full-duplex conversational data directly. Abaka AI announced a 20,000-hour commercial corpus in March 2026, described as “100% real human-to-human” and delivered with full-duplex physical source isolation across seven languages. Pricing is direct-to-enterprise and not public. Adjacent suppliers in the call-center outsourcing industry have sold recorded audio for ASR training for years, but the full-duplex requirement is a newer ask and the supply is thinner than the headline numbers suggest.

The terms worth verifying per vendor are redistribution rights, consent documentation for every speaker, language coverage, and true channel isolation at source (as opposed to post-hoc separation). Article 10 covers the consent and licensing layer in detail. The public corpus is not the complete picture, and a commercial procurement path exists for the full-duplex post-training phase.

Eight requirements preview and scarcity economics

The §4 matrix produces a natural requirements list for training data intended for full-duplex post-training. Article 06 develops each in depth. In one sentence each: full-duplex capture at source, spontaneous dyadic structure, realistic overlap rate distribution, multi-register coverage across topics and emotions, speaker diversity sufficient to generalize, documented per-speaker consent for AI training use, commercial redistribution rights, and phase-fit labeling so the data can be routed to the training stage it actually helps. This is a long list because each item independently blocks usability. A corpus that satisfies six of eight is not 75% useful. It is zero percent useful for whichever training run needs the two it fails.
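One way to make the all-or-nothing point concrete is to treat the list as a gate rather than a score. A sketch, with hypothetical field names standing in for the eight requirements:

```python
from dataclasses import dataclass

# Hypothetical field names; the eight requirements themselves come from the list above.
@dataclass
class CorpusAudit:
    full_duplex_at_source: bool
    spontaneous_dyadic: bool
    realistic_overlap: bool
    multi_register: bool
    speaker_diversity: bool
    per_speaker_consent: bool
    commercial_rights: bool
    phase_fit_labeling: bool

def usable_for_post_training(audit: CorpusAudit) -> bool:
    """All-or-nothing gate: any single failed requirement blocks the training run
    that depends on it, so 6 of 8 is not 75% useful; it is unusable for that run."""
    return all(vars(audit).values())

almost = CorpusAudit(True, True, True, True, True, True,
                     commercial_rights=False, phase_fit_labeling=True)
print(usable_for_post_training(almost))   # False
```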

Measured in hours available for legitimate commercial training as of April 2026, the intersection of all eight requirements runs to the low thousands in the public domain plus whatever the commercial tier will sell.

For comparison, language-model pre-training corpora are measured in trillions of tokens. The asymmetry is the investment thesis. Full-duplex speech-to-speech is architecturally solved enough that four distinct model families are shipping public weights. The bottleneck is not architecture. It is how many hours of full-duplex dyadic audio exist with the right rights attached.

fig.08 · the scarcity venn
[Venn: full-duplex at source ∩ commercial rights ∩ scale > 1,000 h = fit for full-duplex post-training. In the intersection: Fisher ~2 kh and Abaka ~20 kh, comparable to one Saturday of YouTube uploads.]
The scarcity Venn. The intersection of two-channel capture at source, commercial redistribution rights, and scale sufficient for fine-tuning is measured in the low tens of thousands of hours as of April 2026. For comparison, YouTube ingests on the order of that many hours every day.

Forward pointers

This article framed the data side of the speech-to-speech stack. Three other articles go deeper on the pieces it touched.

Article 05 (the two-channel imperative) is the longer argument for why mono audio breaks full-duplex training specifically at the label level. §2 of this article references it, and readers who want the signal-processing treatment of the mono-vs-stereo question should read that piece next.

Article 06 (eight requirements for next-gen STS training data) takes the one-sentence list in §8 and turns it into a spec sheet with operational definitions. If §4's matrix is the map, Article 06 is the site survey.

Article 10 (consent, licensing, and the opt-in economy for conversational data) takes the legal layer in §3 and the commercial market in §7 and treats them as one market-design problem. That piece is where the AI training lawsuit landscape and the opt-in economics get the full treatment.

Data for full-duplex STS is a phase-fit problem, not a total-hours problem.

The bottleneck is how many hours of full-duplex dyadic conversation exist with commercial redistribution rights, for the post-training phase where turn-taking is actually learned. Everything else in the stack is already moving.

The bottleneck is not architecture. It is how many hours of full-duplex dyadic audio exist with the right rights attached. Pre-training scales. Post-training is where the behavior lives, and where the data disappears.
— the 2026 full-duplex data map, in one sentence
■ ■ ■

If the §4 matrix looks familiar, the post-training column is where oto collects.

Two-channel capture at source, per-speaker consent, commercial redistribution, phase-fit labeling. oto builds full-duplex dyadic corpora for teams shipping production STS — talk to us about data, or access the investor data room.

#data #full-duplex #sts-series #fisher #licensing · filed under: the latent · sts 04