Cartesia: why AWS put a non-transformer voice AI on its own shelf.
$191M cumulative, 62% blind-test preference, and a shelf next to Amazon Nova Sonic — earned by the only voice-AI company commercially competing without a transformer. The people who wrote the SSM papers are the people running the company.
1. Why AWS put a competing voice model on its own shelf
In February 2026, Amazon added a voice model called Sonic-3, from a startup called Cartesia, to its AI marketplace SageMaker JumpStart. At a glance this looks like a routine listing. One fact, though, sticks out. AWS promotes its own voice model, Amazon Nova Sonic, in exactly that category, and a competing model from another company now sits on the same shelf.
Putting a competitor on your own shelf is like Toyota stocking Suzuki kei cars inside its dealer network and selling them there. The only reason you put a rival product into your own distribution channel is that you have conceded there is a customer segment your own product cannot cover.
Why did that happen? Because as of April 2026, Cartesia is the only voice AI company competing commercially without using a transformer. The transformer, the dominant architecture of modern AI, is the foundation under GPT-4, Gemini Live, and Amazon Nova Sonic alike. A company that deliberately turned away from the mainstream and bet on a different design, the state-space model (SSM), reached the AWS shelf roughly thirty months after founding. That is the oddest scene in the voice AI industry in spring 2026.
Cumulative funding is about $191M (seed $27M + Series A $64M + Series A-II ~$100M). In independent blind testing, 62% of participants preferred Sonic-3 over the competition (mostly ElevenLabs). 42 languages. 90 ms model latency, 190 ms end to end. Voice cloning from a 3-second sample. More than 60 emotion and prosody tags. This bundle of numbers is the product spec that landed on the AWS shelf in February 2026.
The single question this piece wants to chase is this: why was a company that bet on an architecture outside the mainstream able to earn these numbers and that shelf? The short answer has three parts. First, the founders are the people who wrote the SSM papers themselves. Second, the SSM cost curve matches the shape of voice workloads. Third, those two met at a lucky moment. Let us take them in order.
2. Transformer as encyclopedia, SSM as flowing river
A short technical detour. No equations.
When an AI generates text or audio, it needs to “remember” what words or sounds it has already produced. When the memory mechanism differs, the operating cost becomes an entirely different story.
A transformer’s memory works like an encyclopedia. Every time it emits one token, it writes that token onto a new page, and before emitting the next token it reads back through every prior page. That encyclopedia is called the KV cache (Key-Value cache, the intermediate memory used by attention). One minute of speech adds one minute of pages. One hour of speech adds one hour of pages. GPU memory has to hold the whole encyclopedia.
An SSM’s memory works more like a flowing river. Every time it emits one token, it overwrites “the current state of the river”, and for the next token it only looks at that overwritten state. The past dissolves into the flow rather than stacking up as pages. In short, the memory size is fixed.
The metaphor is not strict, but it is enough to understand the cost-curve difference. The thicker an encyclopedia, the heavier it is to carry. A river, whether it has flowed for one hour or ten, uses the same amount of memory at any moment. The Mamba paper (Gu & Dao 2023) reported the difference as “roughly 5× inference throughput versus transformers and linear scaling in sequence length”.
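To make the metaphor concrete before leaving it, here is a toy sketch of the two memory mechanisms. It illustrates the metaphor only, not anyone’s production code: real attention stores per-layer key/value tensors, and real SSMs (S4, Mamba) update a structured state with learned dynamics.

```python
import numpy as np

d = 8  # toy model width

class EncyclopediaMemory:
    """Transformer-style: memory grows by one 'page' per emitted token."""
    def __init__(self):
        self.kv_cache = []                     # one (key, value) pair per past token

    def step(self, x):
        self.kv_cache.append((x, x))           # append a new page
        keys = np.stack([k for k, _ in self.kv_cache])
        return keys.mean(axis=0)               # next token re-reads every prior page

    def memory_items(self):
        return len(self.kv_cache)              # linear in sequence length


class RiverMemory:
    """SSM-style: a fixed-size state is overwritten at every step."""
    def __init__(self, decay=0.9):
        self.state = np.zeros(d)               # fixed size, forever
        self.decay = decay

    def step(self, x):
        self.state = self.decay * self.state + x   # overwrite; the past dissolves
        return self.state                      # the next token reads only this

    def memory_items(self):
        return 1                               # constant, regardless of length


enc, river = EncyclopediaMemory(), RiverMemory()
for _ in range(1_000):                         # 20 seconds of speech at 50 tokens/s
    x = np.random.randn(d)
    enc.step(x)
    river.step(x)
print(enc.memory_items(), river.memory_items())    # -> 1000 1
```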
In voice AI, this difference hits hard. Speech produces tens of tokens per second (Kyutai’s Mimi codec runs at 12.5 Hz, Meta’s Encodec and Google’s SoundStream cluster in the 25–75 Hz range). A one-hour contact-center call tokenized at 50 Hz is 180,000 tokens — the length at which a transformer’s KV cache hits the memory wall.
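The 180,000 figure is just the frame rate multiplied out. A quick sketch using the codec rates cited above (the session lengths are arbitrary):

```python
# Token counts per session at the codec frame rates cited above.
codec_rates_hz = {
    "Mimi (Kyutai)": 12.5,
    "50 Hz mid-band codec": 50.0,
    "Encodec/SoundStream upper band": 75.0,
}
for name, hz in codec_rates_hz.items():
    for minutes in (1, 10, 60):
        tokens = int(hz * minutes * 60)
        print(f"{name:<32} {minutes:>3} min -> {tokens:>9,} tokens")
# At 50 Hz, one hour is 50 * 3600 = 180,000 tokens sitting in a transformer's
# KV cache, versus a fixed-size state for an SSM.
```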
Cartesia’s bet is that “the future of voice goes toward much longer sessions”. Not a few seconds of synthesis demo, but a one-hour call, thousands of concurrent connections, inference on a smartwatch. Across those three axes, the river mechanism is structurally more favorable than the encyclopedia mechanism.
3. The founders — the people who wrote the SSM papers are the people running the company
Whether an architectural bet can actually be executed comes down, in the end, to who is running it. Cartesia’s founding team is a little unusual.
Karan Goel (CEO) was first author on the SaShiMi paper (2022) during his Stanford PhD. SaShiMi was the first paper to apply SSMs to raw audio generation, and it achieved higher quality and lower memory than WaveNet and prior attention-based models. The first public evidence that “SSMs are a natural fit for audio” was written by Goel himself. In Index Ventures’ investment memo, Goel put it concisely.
“we believe the next generation of AI requires a phase shift in how we think about model architectures” — Karan Goel, Cartesia CEO (Index Ventures)
Albert Gu (co-founder) co-authored the Mamba paper (2023) with Tri Dao. He holds a parallel appointment as assistant professor at CMU (Carnegie Mellon University). On the Cognitive Revolution podcast, he framed it this way.
“SSMs give you a fundamentally different compute primitive” — Albert Gu, Cartesia co-founder and CMU assistant professor (The Cognitive Revolution, 2024)
Gu’s own CMU research is published, while the Cartesia research goes into the product unpublished. It is a deliberate two-track setup.
Arjun Desai and Brandon Yang are the other technical co-founders, both from the Stanford cohort that wrote S4 (2021) and H3 (2022), the papers that preceded Mamba. Advisor Chris Ré is a MacArthur Fellow, Stanford professor, and principal investigator of Hazy Research, the lab that oversaw the entire SSM research lineage.
In short, Cartesia is a company where the first authors and co-authors of the foundational papers are the founders running it and writing the product architecture. That shape is very rare in the voice-AI category. ElevenLabs is a duo of ex-Palantir and ex-Google Research. Sesame comes from Oculus founders. Moshi came out of a French research institute. Hume started from the emotion-research line. Research-to-commercialization continuity this tight inside a single group has few peers in 2020s generative AI; Anthropic and Cohere are arguably the only stronger examples.
What does that continuity mean? The pool of engineers in the world who can write SSM code is on the order of a few hundred, and most of them trace back to Stanford Hazy Research or CMU. Within that narrow pool, Cartesia has had a hiring asymmetry in its favor from day one. A latecomer who wanted to build a copycat SSM voice lab would, structurally, not be able to staff it.
4. A five-year research line, an eighteen-month product line
Here is a chronological rundown of Cartesia’s releases across the 22 months from May 2024 to February 2026.
Sonic v1 (May 2024) was the first commercial SSM voice model, landing at TTFA (time-to-first-audio) of 90 ms. TechCrunch summarized it as “efficient enough to run pretty much anywhere.” That summary was sharp because in 2024 every other TTS vendor was positioning quality-first, and Cartesia was the only one positioning efficiency-first.
Sonic 2 (2025) tightened component-level TTFB (time to first byte) into the 40 ms range. Sonic 2 plus Deepgram Nova-3 plus a fast LLM could land a voice agent end to end under 250 ms, the threshold where a conversation feels like a phone call rather than a walkie-talkie. The budget sketched below shows how those components add up.
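The ASR and LLM figures here are illustrative round numbers (assumptions, not measured values); only the 40 ms TTS figure comes from the text above.

```python
# Illustrative voice-agent latency budget. Only the TTS number comes from
# the text above; the ASR and LLM numbers are assumed round figures.
budget_ms = {
    "streaming ASR endpointing (Nova-3 class)": 100,  # assumed round number
    "LLM time-to-first-token (a fast model)":   100,  # assumed round number
    "TTS time-to-first-audio (Sonic 2 class)":   40,  # the figure cited above
}
total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<44} {ms:>4} ms")
verdict = "under" if total < 250 else "over"
print(f"{'end to end':<44} {total:>4} ms  ({verdict} the 250 ms threshold)")
```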
Sonic-3 (October 2025) is the current flagship. 42 languages, voice cloning from a 3-second sample, more than 60 emotion and prosody tags, 90 ms model latency, 190 ms end to end. That feature set competes directly in the price band ElevenLabs v3 set for itself in 2025. The 62% blind-test preference rate matters here: the transition from “fast but lower quality” in 2024 to “fast and higher quality” in late 2025 closed with Sonic-3.
Ink-Whisper is Cartesia’s streaming ASR, also built on the SSM backbone. What this unlocks: when you build a full-duplex STS (speech-to-speech) system, the TTS head and the ASR head can sit on the same substrate. As of April 2026 Cartesia has not shipped a full-duplex STS, but the Sonic + Ink-Whisper pair is most of the infrastructure a full-duplex implementation would need.
Kleiner Perkins wrote “Real-time voice AI is here” in its March 2025 investment memo. The three milestones (Series A, Sonic-3, Series A-II) trace a release-to-funding cadence: product evidence arrives first, capital follows. A healthy order.
5. Customers — a strategy to sit behind the phone switchboard
Cartesia’s customer list shares one distinctive pattern.
Not end-user voice apps, but the developers who build end-user voice apps and the companies that sell voice agents, with Cartesia sitting behind them. The same shape that let Twilio (the API for phone lines) or Stripe (the API for payments) grow as “not the service to the end user, but the plumbing behind the companies that build those services.”
Point 1 · Default engine for horizontal integrators. Vapi, Cartesia’s earliest design partner (working together since late 2023), places Cartesia as the default TTS engine of its voice-agent platform. Of the four horizontal integrators covered elsewhere (Retell / Vapi / Bland / ElevenLabs CAI), the two without their own TTS, Retell and Vapi, both default to Cartesia in practice.
Point 2 · Vertical integrators on the contact-center side. Cresta, a leading AI contact-center platform, has Sonic built in. ServiceNow integrates Sonic into its enterprise-IT AI Voice Agents. Regal and Maven AGI use Sonic in high-volume contact centers. Contact-center procurement puts per-minute unit economics under a microscope — the front line where architectural advantages show up on the P&L.
Point 3 · Shelf space at the infrastructure layer. Together AI names Cartesia its dedicated voice-model partner. Rasa makes Sonic the default voice for its enterprise conversational AI. And Amazon SageMaker JumpStart — the fact flagged in §1 — landed Cartesia on the same shelf as AWS’s own Nova Sonic in a category AWS itself competes in.
This three-layer customer base has been reached by only two companies in the voice-AI category at once: ElevenLabs and Cartesia. ElevenLabs runs a three-front strategy (consumer brand for creators, developer API, Fortune 500). Cartesia skipped the consumer brand and adopted a “sit behind the integrators” strategy. The two companies at the same voice-foundation layer have, by 2026, branched: ElevenLabs toward B2C, Cartesia toward infrastructure.
6. The funding wave — what it means that Kleiner Perkins led twice and NVIDIA joined
The funding line runs almost in parallel to the product line.
The $27M seed in December 2024 was led by Index Ventures, seven months after the Sonic v1 launch. The $64M Series A in March 2025 was led by Kleiner Perkins. The ~$100M Series A-II in October 2025 was led again by Kleiner Perkins, with Index Ventures and Lightspeed participating pro-rata and NVIDIA joining as a new investor, the same week as the Sonic-3 launch. Cumulative: about $191M.
Kleiner Perkins leading twice in sequence is the standard VC doubling-down pattern: re-invest once the product thesis is validated. Bet at Series A, hit the mark with Sonic 2 and Sonic-3, bet again at Series A-II.
NVIDIA’s participation has a different meaning. NVIDIA runs NVentures, one of the most active corporate venture programs of the AI era, and its portfolio holds a vast number of transformer-based startups. That company put capital into a non-transformer inference lab. That is a selective move, not a routine one. Saying “NVIDIA bet on SSMs” would overclaim. The minimum reading that holds is that “NVIDIA’s corporate-development team saw evidence that SSMs are cost-structurally viable.”
Four months later, AWS listed Sonic-3 on SageMaker JumpStart. NVIDIA and AWS did not coordinate. But the fact remains that the two largest gatekeepers of the commercial ecosystem sent signals in the same architectural direction in the same window.
7. The counterargument — what if transformers catch up?
Here is the strongest counterargument to the thesis, taken head-on.
The counter runs as follows. “The transformer side has also closed the latency gap with FlashAttention, grouped-query attention (GQA), and speculative decoding. ElevenLabs Flash v2.5 hits sub-75 ms TTFA, and OpenAI TTS and Deepgram Aura have both descended into the sub-100 ms band. On short-session TTS, Sonic-3’s latency edge is already gone. Once transformer training recipes converge, the SSM advantage disappears. Doesn’t it?”
The response has two steps.
First, the latency advantage is only measured under the narrowest definition. Sub-100 ms TTFA is the “synthesize one sentence, time to first audio out” benchmark, measured under single-turn, English, low-concurrency conditions. Transformer TTS has essentially caught up to SSMs on that. But that is not the latency Cartesia is betting on. The real workload for an enterprise voice agent is defined by three axes: (a) sessions longer than an hour, (b) thousands of concurrent connections, (c) edge inference. The cost-curve divergence in §2 only becomes visible across those three axes. Transformer TTS caught up on the one-minute demo. At one-hour calls, at 1000 concurrent connections, or on a smartwatch, the slope is different; the sketch below puts rough numbers on it.
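All model dimensions in this sketch are assumptions for illustration (neither Sonic-3’s real shape nor any competitor’s is public): a 24-layer decoder with 16 KV heads of dimension 64, fp16 precision, a 50 Hz codec, and a generously sized SSM state.

```python
# Fleet memory for one-hour sessions: growing KV cache vs fixed SSM state.
# Every dimension here is an assumption for illustration only.
layers, kv_heads, head_dim = 24, 16, 64        # assumed decoder shape
bytes_fp16 = 2
tokens = 50 * 3600                             # one-hour call at a 50 Hz codec

# Transformer: a key and a value per layer, per KV head, per token.
kv_per_token = layers * kv_heads * head_dim * 2 * bytes_fp16
kv_per_session = kv_per_token * tokens         # grows with session length

# SSM: one fixed state per layer; assume a generous 16x expansion of d_model.
d_model = kv_heads * head_dim
ssm_per_session = layers * d_model * 16 * bytes_fp16   # constant per session

for n in (1, 1_000):
    print(f"{n:>5} concurrent 1h sessions | "
          f"KV cache: {n * kv_per_session / 2**30:>9.1f} GiB | "
          f"SSM state: {n * ssm_per_session / 2**20:>9.1f} MiB")
```

Under these assumptions a single one-hour transformer session carries roughly 16 GiB of KV cache before any quantization or eviction tricks, while the SSM state stays under a megabyte. The absolute numbers move with the assumptions; the slopes do not.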
Second, whoever owns the different curve wins the price war. OpenAI TTS sits around $15 per 1M characters, Deepgram Aura 2 at enterprise volume around $0.015 per 1K characters. Those prices already price in the cost reductions from FlashAttention and GQA. The SSM side can run the same inference load with less GPU memory and higher throughput, which pulls the price floor lower still. “After the transformer catches up, the curves diverge again” is the second stage of Cartesia’s bet.
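Converted to price per minute of synthesized speech, the two quoted list prices land on the same floor. The only assumption in the arithmetic is the speaking rate (roughly 150 words per minute at 5 characters per word):

```python
# Per-minute conversion of the list prices quoted above. The speaking rate
# is an assumption; the prices are the ones in the text.
chars_per_minute = 150 * 5        # assumed: ~150 words/min, ~5 chars/word

prices_per_char = {
    "OpenAI TTS ($15 / 1M chars)":         15.0 / 1_000_000,
    "Deepgram Aura 2 ($0.015 / 1K chars)": 0.015 / 1_000,
}
for name, per_char in prices_per_char.items():
    print(f"{name:<38} ~${per_char * chars_per_minute:.3f} per minute of speech")
# Both land at ~$0.011/min: the transformer-era floor the SSM cost curve
# is betting it can undercut.
```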
That said, the risks the counterargument points at are real. Hybrid architectures (Jamba, Zamba, and Samba are evolving by mixing SSM and attention layers), hyperscaler absorption (the January 2026 Google DeepMind / Hume move and the Apple / Q.ai $1.6–2B acquisition), and a possible delay in extending to full-duplex STS. These three do not cancel out. They sit alongside one another as signals worth tracking from the outside.
8. Landscape — contact points and three signals
Overlaying this profile onto the STS omnibus series surfaces four contact points.
First, architectural diversity. Every full-duplex STS in the four-families taxonomy (dual-stream+codec / interleaved-flatten / cascade+predictor / codec-free) is currently transformer-based. That is not because SSMs are a poor fit for voice, but because SSM commercialization lagged transformer commercialization by roughly three years. If Cartesia ships a full-duplex STS on an SSM backbone, a fifth family is possible.
Second, the data-layer problem. The supply of full-duplex dialogue data sits 30 to 150 times below the foundation threshold. Whether you train with an SSM or a transformer, feeding the same data yields the same gap. Whether Cartesia’s bet ultimately pays off depends on progress at the data layer, not the architecture. This is the point that overlaps most directly with Fullduplex.ai’s domain.
Third, the integrator layer. As the horizontal integrators get commoditized, Cartesia’s position sitting behind them strengthens in relative terms.
Fourth, the hyperscaler absorption environment. The January 2026 Google / Hume and Apple / Q.ai pattern points toward independent frontier voice labs being absorbed. Against that backdrop, Cartesia sits in a position that could go either way: with NVIDIA and AWS as ecosystem partners while still independent by capital, it could be the next absorption candidate, or the exception that stays independent.
Three signals to track over the next three years. Signal 1, whether Cartesia ships a full-duplex STS on an SSM backbone. Signal 2, whether a hyperscaler moves Cartesia from ecosystem partnership to internalization. Signal 3, whether a second commercial voice lab emerges in the SSM category. One occupant is an exception. Two makes it a category.
Cartesia’s 2026 has produced five pieces of evidence for the hypothesis that “SSMs hold selective advantage over transformers on workload shape”: $191M cumulative funding, 62% blind-test preference, the AWS procurement shelf, NVIDIA’s corporate entry, and the integrator layer it sits behind. The distance from a 2021 Stanford Hazy Research paper to a listing on the AWS shelf has turned out to be strikingly short. If any of signals 1 through 3 moves in the next three years, this bet will be the event that defines architectural diversity in the voice-AI category.
hello@fullduplex.ai.