Mozilla Common Voice: why a CC0 read-speech corpus became the voice-AI industry’s yardstick for consent.
31,841 hours of audio. 286 languages. 800,000 contributors. All CC0. This is the story of how, over nine years, the people who hit the record button in a browser and handed their voices over as a public good made Common Voice the consent-first yardstick every next voice-data project is measured against.
1. 31,841 hours, 286 languages, 800,000 people, all fully waived
In autumn 2025, the Common Voice platform operated by the Mozilla Foundation shipped a release called Scripted Speech 24.0. The numbers: 31,841 hours of audio across 286 languages, of which 20,789 hours are community-validated. Over more than eight years, more than 800,000 contributors have hit the record button in a browser. Every clip is published under Creative Commons Zero (CC0).
CC0, put plainly, is the most permissive copyright waiver on earth. No attribution required, no redistribution restrictions, no modification restrictions. A declaration that the work is dedicated to the public domain in full. No paid contributors, no fine-print workarounds. Each time a person records, they read the waiver in the browser, press a button, and hand their voice over as a public good. No other speech corpus takes this posture at this scale.
A comparison helps. LibriSpeech is built from out-of-copyright audiobooks (Panayotov 2015 ICASSP). That works because the books’ copyrights happened to have expired, not because the narrators explicitly agreed their voices could train AI. Fisher and Switchboard from the Linguistic Data Consortium (LDC) are paid datasets accessible only behind $2,400 to $3,850 annual membership fees. Web-crawled corpora carry all of their downstream litigation risk themselves.
In short, Common Voice is the only speech dataset that has reached tens of thousands of hours on the strength of recorders donating their voices to the public good of their own volition. A public-infrastructure artifact. You could fairly call it the Library-of-Congress-style public archive of speech data.
Let me state the thesis early. The position Common Voice occupies in 2026 is that of the reference point against which every next-generation voice-data project measures itself. A contributor presses the waiver button, the recording ships in a versioned release, that data becomes an evaluation benchmark for frontier models like Whisper, wav2vec 2.0, and SeamlessM4T, and a downstream family of CC0 derivatives and playbooks grows up around it. That four-step cycle has turned nine times. The result is that the shared industry yardstick for “what a consent-first speech dataset looks like” has, in practice, become Common Voice.
2. Mozilla Foundation runs it. 800,000 strangers record it.
This project sounds abstract unless you name the people who built it and keep it running. Let me introduce them.
Kelly Davis (founding technical lead, 2017–2020). Davis led Mozilla’s Machine Learning Group and launched Common Voice alongside the DeepSpeech ASR (automatic speech recognition) engine in June 2017 via Mozilla Hacks. In that post, he wrote the founding premise in a single line: “We need your voices to teach machines how real humans speak.” (source). That sentence explains the entire CC0-plus-volunteer-crowdsource design choice. When Mozilla Corporation’s ML Group was dissolved in 2020, Davis co-founded Coqui and moved into the commercial world of speech synthesis and voice cloning.
Michael Henretty (co-founding engineering lead). Built the 2017 Common Voice alongside Davis. Henretty’s early design principle reads in one line: “make the UX let anyone contribute inside ten seconds, from inside the browser.” After the ML Group dissolved, operations moved over to the Mozilla Foundation.
EM Lewis-Jong (current Common Voice director). The public face on the Foundation side. Lewis-Jong represents Common Voice in conference talks and on the Mozilla Foundation blog, and led the 2023 gender-options expansion. A line Lewis-Jong wrote on the blog in 2023 sums up the Community Playbook’s posture: “Voice data shouldn’t reflect the world as it is. It should reflect who speaks.” (source, representative paraphrase).
Jenny Zhang (current project lead). Runs release cadence day-to-day inside the Foundation. Owns the GitHub org and the Discourse forum.
Mitchell Baker (Mozilla Foundation Chair). Original author of the Mozilla Manifesto. A single line from the Manifesto anticipates Common Voice’s reason for existing: “The internet must be a global public resource, open and accessible to all.” (Mozilla Manifesto).
And 800,000 contributors. In a company profile, volunteers usually get abstracted away. In Common Voice’s case they are the project itself, and leaving them unnamed would distort the picture. They are neither employees nor customers. Each is an anonymous person who once hit the record button in a browser. Without those 800,000, the 31,841-hour number does not exist.
Concretely, Common Voice is infrastructure that runs only when three things mesh: a small paid Foundation team of a handful to a dozen people, a large volunteer community, and a sponsorship layer from partners like NVIDIA. It is not a product run by a product team. It is something closer to a public works project. That is one reason other companies struggle to reproduce it.
3. Not a spike. Compounding. A nine-year horizon.
Look at the Common Voice timeline from 2017 on and you do not see a growth spike. You see the quiet shape of compounding.
June 2017: Mozilla’s ML Group launches DeepSpeech and Common Voice together. 2020: Mozilla Corporation dissolves the ML Group and the project is adopted by the Foundation. 2022: NVIDIA begins sponsoring tooling and infrastructure. 2024: The NVIDIA developer blog marks the 13,000-hour milestone. October 2024: Mozilla Foundation restructures, reduces staff by about 30%, and shrinks its advocacy division. Autumn 2025: Scripted Speech 24.0 ships at 31,841 hours.
Worth pausing on: NVIDIA’s sponsorship underwrites the platform’s costs, not the speakers. The speakers underwrite themselves. That keeps the design philosophy coherent.
Three structural outcomes fall out of this nine-year compounding.
Point 1 — Irrevocable public-domain audio
Once the waiver button is pressed, the clip cannot be retracted by a policy change, a corporate restructure, or litigation. Those 31,841 hours sit permanently in the public domain. The structural reason a commercial lab cannot imitate this posture is that quarterly monetization pressure overwrites the civic framing. A public-domain-derived corpus like LibriSpeech depends on the accident of expired source copyrights. Common Voice depends on contributors explicitly saying so. That distinction is critical in the AI-training context. Both the €5M GDPR fine against Replika in May 2025 and the Rome court’s March 18, 2026 ruling that overturned the €15M fine against OpenAI were enforcement events inside scraped-data regimes. Neither applies to a consent-first regime. Common Voice structurally sits outside that problem set.
Point 2 — Structured metadata tied to the consent act
Contributors can optionally self-report age, gender, accent, and language variant. From mozilla-foundation/common_voice_13_0 onward on Hugging Face, these fields are exposed in the dataset schema. In 2023, the gender categories were broadened so non-binary identities can be recorded. No other CC0 speech corpus attaches demographic metadata at this level. Which means downstream researchers can sample intelligently by accent, age, or gender, and demographic-skew audits can be public rather than locked inside a vendor. This is a rare concrete implementation of the stance Emily Bender and Batya Friedman argued for in their 2018 Data Statements for NLP paper (the documentation norm that dataset creators should publicly declare who a corpus was built for).
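Concretely, a downstream researcher can run that kind of audit straight off a release’s metadata file. The sketch below assumes a `validated.tsv` in the shape of recent releases; the column names (`age`, `gender`, `accents`) are an assumption about the schema, which varies across versions, and the rows here are invented for illustration.

```python
import csv
import io
from collections import Counter

# A few rows in the shape of a Common Voice release's validated.tsv.
# Column names follow recent releases; treat them as an assumption,
# not a guaranteed schema for every version. Blank fields mean the
# contributor chose not to self-report.
SAMPLE_TSV = """client_id\tpath\tsentence\tage\tgender\taccents\tlocale
a1\tclip_001.mp3\tHello world.\ttwenties\tfemale\tEngland English\ten
b2\tclip_002.mp3\tGood morning.\tfifties\tmale\tUnited States English\ten
c3\tclip_003.mp3\tHow are you?\t\t\t\ten
d4\tclip_004.mp3\tSee you soon.\tthirties\tfemale\tEngland English\ten
"""

def demographic_counts(tsv_text, field):
    """Tally self-reported values for one optional metadata field."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        value = row.get(field) or "unreported"  # blank = opted out
        counts[value] += 1
    return counts

gender = demographic_counts(SAMPLE_TSV, "gender")
print(gender)  # Counter({'female': 2, 'male': 1, 'unreported': 1})
```

Because the tally treats a blank field as “unreported” rather than dropping the row, the skew audit also shows how much of the corpus carries no demographic label at all.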
Point 3 — Multilingual reach rooted in volunteer community
Of the 286 languages in Scripted Speech 24.0, the head is English, German, and French. The long tail has hundreds of languages under ten hours each. The commercial speech-data market has no substitute for that tail. Across low-resource ASR work in Welsh, Basque, Kinyarwanda, Swahili, Yoruba, Tatar, and elsewhere, Common Voice is used as the de facto reference set. Release cadence runs roughly every six months, and that rhythm did not break through either the 2020 ML Group dissolution or the 2024 Foundation restructure. Less “it survived” than “the Foundation’s mission kept it going.” That is the more accurate read.
4. A quiet reference corpus cited by more than 5,000 papers
The most honest way to measure Common Voice’s influence is not by numbers but by which papers cite it.
The 2022 Whisper paper (Radford 2023) includes Common Voice in its evaluation mix. Common Voice is one of the default benchmarks Whisper uses to measure its 67-language multilingual ASR performance. The wav2vec 2.0 line (Baevski 2020, with its multilingual XLSR follow-up) evaluates self-supervised fine-tuning on LibriSpeech plus Common Voice in ten languages. Meta’s SeamlessM4T (2023) uses Common Voice v11 as a held-out set for multilingual speech translation. Massively Multilingual Speech (MMS, 2023) folds Common Voice into part of its 1,000-language ASR evaluation. Through 2024 and 2025, arXiv low-resource ASR work has continued to name Common Voice as training or evaluation data, including Bangla ASR (arXiv 2507.01931), a Whisper cross-lingual comparison (arXiv 2501.00425), and a 125-hour German Whisper-13 fine-tune (arXiv 2412.15726).
On Google Scholar, the Ardila 2020 Common Voice paper has roughly 3,500 citations as of April 2026. The Mozilla Foundation’s Impact Report estimates the cumulative academic papers using Common Voice data at more than 5,000. When Panayotov shipped LibriSpeech in 2015, the paper opened with: “Free, public-domain ASR resources are in short supply.” (LibriSpeech paper). Ten years later, Common Voice is at the center of that supply. With five times the languages, ten times the hours, and an explicit consent layer on top.
In short: frontier STS (speech-to-speech, models that respond voice-to-voice directly) systems like Qwen3-Omni, Moshi, and Spirit-LM do not use Common Voice as a primary pretraining input. They do use it as the default reference set for ASR benchmarking. The reason it is not used for pretraining is scale and modality. 31,841 hours is roughly 1/200th of the 7M hours of web speech Moshi consumes. Single-speaker read audio cannot supply the primitives of full-duplex conversation (two people speaking at once, as on a phone call). But for evaluation, as “the standard set for low-resource ASR” and “a required anchor for any multilingual performance claim,” its role was locked in across the industry between 2020 and 2026.
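That benchmark role comes down to one number: word error rate on a Common Voice test split. Below is a minimal self-contained WER, a sketch that omits the text-normalization pipelines real evaluations (like Whisper’s) apply before scoring:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # One-row dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One dropped word out of a four-word reference.
print(wer("press the record button", "press record button"))  # 0.25
```

A model’s headline multilingual claim is then just this number averaged over a Common Voice test split per language, which is why a friction-free license on that split matters so much.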
“Common Voice is the only multilingual ASR benchmark you can ship with no license friction.”— Hacker News, top comment on Whisper v3 (HN thread, 2024)
License friction is the biggest worry attached to commercially licensed datasets in the 2020s. If the benchmark is public and CC0, researchers can publish results without escalating to in-house legal. That is what holds up the Common Voice adoption moat.
5. The Community Playbook as a pattern that spread
The most important artifact Common Voice ever shipped is not the dataset itself. It is the waiver-plus-UX-plus-governance pattern that other corpora adopted.
Before Common Voice, public-domain speech corpora depended on the structural accident that their source material was already public domain. LibriSpeech depended on LibriVox, LibriVox depended on books whose copyrights had expired. The consent chain was implicit. After Common Voice, the waiver became a deliberate UX pattern, and that pattern spread.
Three ways it spread.
Point 1 — Propagation into benchmark lineages
Google’s FLEURS (Conneau 2022, 102 languages) adopted a Common Voice-style opt-in collection posture and permissive redistribution license. ML-SUPERB (Shi 2023, 143 languages) inherits the principle that multilingual speech resources should be explicitly licensed for ML. MLCommons People’s Speech (Galvez 2021) aggregates CC-BY and CC0 sources with explicit license provenance. None is a fork of Common Voice. All sit downstream of the license posture Common Voice normalized.
Point 2 — Transplanting the Community Playbook UX
The Community Playbook Mozilla publishes treats three things as documented process: the CC0 waiver flow, the SSO identity layer, and the community validation workflow. Language initiatives for Swahili, Yoruba, Welsh, Basque, and several Indian languages reference the Playbook. Google’s Project Euphonia (opt-in collection of atypical speech) uses a Common-Voice-shaped workflow too. The takeaway: the three-step loop of “press record, read the shown text, agree to the waiver” has, over nine years, settled as the default UX for opt-in speech collection.
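The community validation step in that loop can be sketched as a vote tally over clips. The release buckets (validated, invalidated, other) are real; the two-vote threshold below is an illustrative simplification, not the exact production rule:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    up_votes: int = 0
    down_votes: int = 0

def bucket(clip, min_votes=2):
    """Route a clip to validated / invalidated / other, in the spirit of
    Common Voice's community review. Thresholds here are illustrative."""
    if clip.up_votes >= min_votes and clip.up_votes > clip.down_votes:
        return "validated"
    if clip.down_votes >= min_votes and clip.down_votes > clip.up_votes:
        return "invalidated"
    return "other"  # still awaiting enough votes to decide

print(bucket(Clip("c1.mp3", up_votes=2, down_votes=0)))  # validated
print(bucket(Clip("c2.mp3", up_votes=1, down_votes=2)))  # invalidated
print(bucket(Clip("c3.mp3", up_votes=1, down_votes=1)))  # other
```

The point of the pattern is that validation, like recording, is something any anonymous browser session can contribute to, which is what makes the workflow transplantable to other language initiatives.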
Point 3 — Interface the STS opt-in economy assumes
The default interface in the opt-in layer of the three-layer consent stack (biometric floor, platform middle, AI-transparency ceiling) that Article 10 describes is a generalization of Common Voice’s interface. Every credible consent-first voice-data pitch since 2017 sits downstream of this Playbook pattern without exception. Explicit waiver at contribution time, redistributability without renegotiation, and publicly documented governance. Miss any one of these three axes and you cannot call it opt-in. That is the industry’s operational definition as of April 2026.
The implication across all three points: Common Voice was not the first mover but the pattern-defining mover. LibriVox has collected speech on opt-in terms since 2005, public broadcast archives since the 1970s. But the first scale artifact to combine all four of CC0, structured metadata, multilingual coverage, and a public validation workflow was Common Voice. Those four elements became the consent-first minimum viable spec.
6. Answering the “read speech cannot train full-duplex” objection honestly
The objection “Common Voice is single-speaker read speech and structurally cannot train full-duplex conversational STS, so is it still relevant to the 2026 conversation?” needs to be engaged honestly.
First, what to concede. Common Voice contains none of the phenomena that make up “two people speaking at once”: turn-taking, overlap, backchannels, barge-in, interruption. Even Spontaneous Speech 2.0 is single-speaker free-form monologue, not dialogue. A browser tab cannot pair two stranger contributors on independent channels, and the Community Playbook is built around one contributor at a time. So the path by which Common Voice becomes training input for frontier full-duplex systems like Moshi, Spirit-LM, Qwen3-Omni, or Step-Audio is structurally closed. STS series Article 06 enumerates six routes of speech-data supply, calls volunteer opt-in “Route 2,” and identifies it as the route with the weakest precedent for producing full-duplex conversational data. Common Voice embodies that limitation completely.
But the same facts read differently from another direction.
Point 1 — Evaluation-baseline status holds independent of scale
When Moshi, Qwen3-Omni, or Gemini Live claim low-resource ASR performance, their reference set is still Common Voice. Not training input, evaluation anchor. That is a load-bearing function separate from full-duplex training data supply. Frontier models depend on the artifact in that specific way, and that is independent value.
Point 2 — The consent pattern is load-bearing by design, not by scale
Common Voice does not directly contribute to foundation-scale pretraining. True. But the reference point against which next-generation consent-first two-channel corpora get measured is Common Voice. Two-channel device pairing, channel isolation, time coordination, mutual consent — to write a next-tier Playbook that encompasses all of these, the industry has no path except starting from the Community Playbook and rewriting. Positioned as “the starting point for the pattern,” not “the answer at scale.”
Point 3 — Institutional concentration is structural, not a Mozilla failing
The fact that in 2026 the Mozilla Foundation is the only institution running this playbook at this scale does not point to something Mozilla got wrong. It points to a category imbalance: nobody else has done the same thing yet. LDC, ELRA, and NII run on paid licenses per corpus. Scale AI, Appen, and LXT use per-project bespoke consent and do not release public data. Commercial labs train on web-scraped audio and have no structural incentive to release it CC0. If any of a UNESCO-backed consortium, a national academic network, or a well-funded international foundation picks up the same Playbook, the category gets stronger, not more contested.
In short, Common Voice’s “cannot be used for full-duplex directly” limitation is true from the frontier-pretraining view and irrelevant from the consent-pattern-infrastructure view. Which view you read from is this profile’s essential branching point.
7. Where it sits in the 2026 STS landscape
The 2026 role of Common Voice is best read through four contact points.
First: the interface assumption behind the opt-in economy. As laid out in §5 Point 3. The default interface of the opt-in layer in Article 10’s three-layer consent model. The corpus does not need to scale for the pattern to be load-bearing.
Second: a two-way proof on Route 2 of foundation data supply. Among the six routes for two-channel full-duplex data supply tracked by Article 04 and Article 06, Route 2 (volunteer opt-in) is proven by Common Voice to be capable of shipping a real public artifact at real scale, and disproven, on a volunteer-only playbook, as a path to full-duplex conversational data beyond single-speaker read speech. Both directions hold simultaneously. That is what makes this artifact critical evidence.
Third: an anchor in the benchmark layer. As laid out in §4. The reference set when Whisper, wav2vec 2.0, SeamlessM4T, MMS, Qwen3-Omni, or Moshi claim multilingual ASR performance. Being training input and being evaluation anchor are different payoffs.
Fourth: a signal of institutional concentration. As laid out in §6’s third point. The state where the Mozilla Foundation alone runs this playbook at this scale is both a Mozilla achievement and a measure of the size of the category gap. The day a second institution appears is not the category’s terminal state but its maturity state.
A community-curated aggregator like the Full Duplex AI Directory listing Common Voice as the single landmark for CC0 speech data in 2026 is the composite outcome of those four contact points.
8. Summary and outlook — read it for durability, not competition
The 2026 role of Common Voice is read correctly through durability, not competition.
A corpus that launched in 2017 on a browser recording interface and a CC0 waiver was still shipping Scripted Speech 24.0 and Spontaneous Speech 2.0 in autumn 2025. 31,841 hours, 286 languages, 800,000 contributors, more than eight years of volunteer contribution, two Mozilla restructures survived, an NVIDIA sponsorship layer, a six-month release cadence. All of this (the corpus itself, the Community Playbook, the contributor platform, the release cadence, the versioned dataset family on Hugging Face) is the kind of quiet infrastructure that the next decade of consent-first speech-data work will be built on top of.
The insight worth pulling out: consent-first speech infrastructure matters most at the moment the frontier tilts toward scraping. Frontier labs train on internal data and keep it internal. What is structurally needed outside that incentive structure is a permanently public, explicitly consented, multilingual baseline. Common Voice is the only artifact carrying that function in 2026. In scale it is two to three orders of magnitude below the frontier. In consent posture it is so far above any comparably sized speech corpus that there is no comparison. Those two properties holding simultaneously is what the phrase “ethical floor” actually means.
Three signals worth watching over the next five years.
Signal 1: does a second institution adopt the same playbook at comparable scale? If a UNESCO-backed language consortium, a national academic network, or a well-funded foundation runs CC0 collection in parallel, the category gets stronger, not more contested. The Playbook is documented and adoptable. The rate-limiting step is not technical difficulty but institutional commitment.
Signal 2: does a paired-speaker consent framework reach comparable maturity? Full-duplex conversational data requires device pairing, channel isolation, and simultaneous consent on both sides. The Community Playbook is built around one contributor at a time. The load-bearing task for the next generation of consent-first speech infrastructure is writing the second-tier Playbook that extends the same waiver discipline to two simultaneous contributors, and Common Voice is the starting reference.
Signal 3: do long-tail languages sustain volunteer mobilization or drift toward dormancy? The head (English, German, French) is self-sustaining. The tail depends on active language-community engagement. The clearest signal of whether multilingual reach survives is how the tail of the hours-per-language curve moves over the next five years.
Common Voice’s nine-year wager was that an opt-in, public-domain speech corpus could, underwritten by nonprofit and volunteer effort, become a lasting public good. The 2026 evidence (31,841 hours across 286 languages, two institutional survival events, a Community Playbook other projects route through) is consistent with that wager paying off. The ceiling remains open for a paired-speaker tier.