Artificial Analysis: how a two-person Sydney side project became the AI industry’s independent scoreboard.
In September 2025, OpenAI quietly cited an outside firm’s measurement inside its GPT-Realtime launch materials. That firm was Artificial Analysis, a benchmarking company run out of Sydney and built by two former McKinsey consultants. This piece walks through the structure behind that fact: the founders, the design choices, the speech-to-speech extension, and the kind of neutral position in AI infrastructure the firm has locked down.
1. Why did OpenAI borrow an outsider’s scores?
In September 2025, when OpenAI announced its new voice model GPT-Realtime, an outside firm’s measurement quietly appeared inside the launch materials. The cited company was Artificial Analysis (AA), an independent benchmarking firm based in Australia and the United States. For a vendor to borrow another firm’s scores at its own product launch is quite unusual in this industry.
Why does that happen? The short answer is that AA is the company building fuel economy labels for AI models.
When buying a car, few people take the manufacturer’s claimed mileage at face value. That is why governments and consumer groups run their own testing regimes. The EPA in the United States, the WLTP cycle in Europe, the WLTC mode in Japan. Only when a shared yardstick exists, meaning “measured under these conditions with this method produces this number,” can a Toyota and a Honda be compared on the same footing. In the AI world, no such neutral testing body existed. Every model developer kept announcing “world best performance” using numbers measured under conditions that happened to flatter their own model.
What AA brought in is refreshingly simple. Put more than 100 AI models through the exact same set of test questions, grade them with the exact same method, and publish everything on a single leaderboard. In short, a standardized common test for AI.
What is more surprising is that AA started in 2023 as a weekend side project by Micah Hill-Smith and George Cameron, two former McKinsey consultants. Their first office was a one-room apartment in Sydney. Two years later, OpenAI was citing them in product launches.
One question drives this piece. How did two consultants working out of a Sydney apartment lock down the most influential neutral position in AI infrastructure? The answer is not in what they measure. It is in three design choices they baked into how they measure: transparency, reproducibility, and neutrality.
2. What a benchmark actually is
Comparing AI models is in fact harder than comparing car fuel economy.
With cars, the yardstick already exists. The WLTC driving cycle is defined by regulation, every automaker drives the same route under the same conditions, and the resulting number goes on the spec sheet. You can drop a Toyota next to a Honda and just compare.
AI never had that. When OpenAI announces “our new model is 50% smarter than the previous generation,” Google responds with “we beat OpenAI by 15%.” But each vendor decides its own definition of smart and measures it in settings that happen to flatter its own model. Comparison breaks down because the playing fields are never aligned.
What AA rebuilt is the playing field itself. Unpack the word benchmark and it has only three parts. A common task, meaning every model solves the same problems. A common grading rule, meaning scoring uses the same criteria. A common format, meaning results are published in the same shape. The same idea as a standardized college entrance exam. The same idea as the nutrition label on a food package.
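To make the three parts concrete, here is a minimal sketch of such a harness in Python. It is illustrative only, not AA’s pipeline: the questions, the grading rule, and the model stub are placeholders, but the shape is the point. One task list, one grader, one result format for every model.

```python
from dataclasses import dataclass
from typing import Callable

# Common task: every model answers the exact same questions.
TASKS = [
    {"question": "What is 17 * 24?", "reference": "408"},
    {"question": "Name the capital of Australia.", "reference": "Canberra"},
]

# Common grading rule: one scoring function, applied to every answer.
def grade(answer: str, reference: str) -> bool:
    return reference.lower() in answer.lower()

# Common format: every run is reported with the same fields.
@dataclass
class Result:
    model: str
    accuracy: float
    n_tasks: int

def run_benchmark(model_name: str, ask: Callable[[str], str]) -> Result:
    correct = sum(grade(ask(t["question"]), t["reference"]) for t in TASKS)
    return Result(model=model_name, accuracy=correct / len(TASKS), n_tasks=len(TASKS))

# "ask" would wrap a real model API call; a trivial stub stands in here.
print(run_benchmark("toy-model", lambda q: "I think it is 408" if "17" in q else "Canberra"))
```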
In short, AA built an independent fuel economy testing ground for the AI industry. No one asked them to. They just went and did it.
3. Two founders, a Sydney apartment, a weekend side project
A brief word about the two founders.
Micah Hill-Smith is from New Zealand. He was dux, meaning first in his class, at Rongotai College, then studied law and computer science at the University of Auckland. After graduation he joined McKinsey’s Tokyo office, and a few years later struck out on his own.
George Cameron is from Australia. He studied at the Australian National University, interned at Google, and then joined McKinsey himself. The two met on a McKinsey project.
They started the AI benchmarking project in 2023 as a weekend hobby. Their first office was a one-room apartment in Sydney. OpenAI had just released GPT-4, every vendor’s “world best model” claim was accelerating, and the two turned their consultant instinct, the habit of breaking things into a framework and comparing them, into an AI model comparison site.
“Whoever controls how we measure controls the infrastructure.” (fullduplex research, on the AA thesis)
Team size today is about 20 by public counts, still small by analyst industry standards. Even so, by September 2025 OpenAI was citing them in product launches. The fairest read is that the direction the two founders picked and the empty space in AI at that moment lined up. That alignment is what gave the project this speed.
4. The three design choices
What AA really locked down is not what they measure. It is how they measure.
Point 1 — Transparency
AA disclosed that a single evaluation of OpenAI’s reasoning model o1 cost them $2,767. Publishing the reasoning budget, meaning the amount of compute an AI is allowed to spend thinking about one problem, is a rare practice. It matters because the same model’s score changes depending on how much compute it can spend. Give it ten times the budget and the score goes up. If you do not disclose how much was spent for that score, the comparison does not hold up.
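A small sketch of why that disclosure matters, with placeholder names and numbers (nothing below is a real AA figure): a score is published together with the budget it was measured under, and two scores measured under different budgets simply refuse to compare.

```python
from dataclasses import dataclass

@dataclass
class DisclosedScore:
    model: str
    score: float                  # benchmark accuracy, 0..1
    reasoning_budget_tokens: int  # how much "thinking" the model was allowed per question
    eval_cost_usd: float          # what the full evaluation run cost

def comparable(a: DisclosedScore, b: DisclosedScore) -> bool:
    # Scores only belong on the same leaderboard row if they were bought
    # with the same reasoning budget.
    return a.reasoning_budget_tokens == b.reasoning_budget_tokens

# Illustrative numbers only: same model, ten times the budget, higher score.
low_budget = DisclosedScore("example-reasoning-model", 0.60, 4_000, 250.0)
high_budget = DisclosedScore("example-reasoning-model", 0.70, 40_000, 2_500.0)

print(comparable(low_budget, high_budget))  # False: not the same test, so not the same row
```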
Point 2 — Reproducibility
AA’s flagship benchmark, AA-Omniscience (6,000 questions across 42 topics), is published as a paper on arXiv. In other words, the methodology is open to peer review. Their other flagship, Big Bench Audio (1,000 questions, 23 voices, graded by Claude 3.5 Sonnet), follows the same pattern. That anyone should be able to rerun a test and get the same result, the basic principle of science, had been surprisingly absent from AI benchmarking.
Point 3 — Neutrality
AA uses a mystery shopper approach, the same logic an undercover diner uses at a restaurant. They do not notify model providers in advance. They access the models only through the same public API that any other user would use. That removes the temptation for a vendor to quietly serve an “optimized for evaluation” version of the model.
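Here is a sketch of that principle, not AA’s actual harness. The endpoint, key, and payload shape are placeholders; the point is that the client is built only from what any paying customer has, with no advance notice and nothing that lets the vendor recognize a benchmark run.

```python
import json
import time
import urllib.request

PUBLIC_ENDPOINT = "https://api.example-vendor.com/v1/chat"  # placeholder, not a real URL
API_KEY = "an-ordinary-customer-key"                        # nothing the vendor could single out

def mystery_shop(prompt: str) -> dict:
    """Query the model exactly the way any public customer would."""
    payload = json.dumps({"model": "vendor-model", "input": prompt}).encode()
    req = urllib.request.Request(
        PUBLIC_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            # Deliberately no "this is a benchmark" header and no advance notice:
            # the vendor cannot route the request to an evaluation-tuned variant.
        },
    )
    started = time.time()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return {"answer": body, "latency_s": time.time() - started, "measured_at": started}
```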
All three have to be present at once. Concretely, AA is the firm that publishes “how much we spent,” “how we measured,” and “that we did not call ahead.”
5. The same three choices, now in speech-to-speech
Everything above was about text-based large language models (LLMs). What makes AA genuinely interesting is that they are bringing the same three design choices into speech-to-speech (AI that responds directly from audio to audio, STS for short).
AA’s Speech-to-Speech Leaderboard measures along two main axes.
The first is Big Bench Audio. It is the speaking and listening version of the reasoning test. The same 1,000 problems, delivered in 23 voice variations, graded by Claude 3.5 Sonnet. In short, the same kinds of questions they were measuring in text, now solved through audio.
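A sketch of that setup, with stubs standing in for every real component: the fixed question set is rendered in the voice variants, the model under test answers through audio, and one fixed judge grades every transcript, so the only thing that changes between leaderboard rows is the model. The question-to-voice pairing shown here is invented, not AA’s.

```python
# Stand-ins for the fixed 1,000-question set and the 23 voice variants.
TRIALS = [
    ("q-001", "voice-a"),
    ("q-002", "voice-b"),
    ("q-003", "voice-a"),
]

def synthesize(question_id: str, voice: str) -> bytes:
    """Placeholder: render the question as audio in the given voice."""
    return f"{voice}:{question_id}".encode()

def answer_in_audio(model: str, audio: bytes) -> str:
    """Placeholder: the speech-to-speech model's spoken answer, transcribed."""
    return "placeholder answer"

def judge(question_id: str, answer_text: str) -> bool:
    """Placeholder for the single fixed LLM judge that grades every model."""
    return False

def big_bench_audio_style_score(model: str) -> float:
    correct = sum(judge(q, answer_in_audio(model, synthesize(q, v))) for q, v in TRIALS)
    return correct / len(TRIALS)

print(big_bench_audio_style_score("example-sts-model"))  # 0.0 with these stubs
```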
The second is Conversational Dynamics on FDB. FDB (Full-Duplex Bench) is a measurement standard developed in academia for full-duplex communication, meaning two-way, phone-like, real-time audio. It evaluates turn-taking (who speaks when) and barge-in (speaking while the other side is still talking). AA took this benchmark and reshaped it into a form that works for comparing commercial models.
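FDB’s exact scoring formulas are not reproduced here, so the sketch below is only an approximation of the two behaviors named above: turn-taking measured as the gap between the end of a user turn and the start of the model’s reply, and barge-in measured as how long the model keeps talking after the user starts speaking over it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    speaker: str   # "user" or "model"
    start: float   # seconds from session start
    end: float

def turn_taking_gaps(turns: list) -> list:
    """Gap between the end of a user turn and the start of the model reply that follows."""
    gaps = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker == "user" and nxt.speaker == "model":
            gaps.append(nxt.start - prev.end)
    return gaps

def barge_in_yield_time(model_turn: Turn, user_interrupt_start: float) -> Optional[float]:
    """How long the model kept talking after the user started speaking over it."""
    if model_turn.speaker != "model":
        return None
    if not (model_turn.start < user_interrupt_start < model_turn.end):
        return None
    return model_turn.end - user_interrupt_start

session = [
    Turn("user", 0.0, 2.1),
    Turn("model", 2.6, 6.0),   # replies roughly half a second after the user stops
    Turn("user", 4.8, 5.5),    # barges in while the model is still talking
]
print(turn_taking_gaps(session))                          # about 0.5 s
print(barge_in_yield_time(session[1], session[2].start))  # about 1.2 s before the model yields
```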
Measuring on these two axes surfaces the most interesting finding. OpenAI’s GPT-4o can solve 92% of these problems in text mode, but drops to 66% in STS mode. A 26-point gap. Gemini and Claude show similar gaps.
Why the drop? The reasoning chain (the steps an AI takes to reach a conclusion) gets shorter. The audio codec (the translator that compresses audio for transmission) loses information. Latency constraints (response time budgets) leave no room to reconsider an answer. Multiple causes stack.
When OpenAI cited AA’s Big Bench Audio at the GPT-Realtime launch in September 2025, it was because this was one of the few independent metrics that looked at this gap directly. Said differently, “voice mode is still weaker than text mode” is an inconvenient truth the AI industry has to acknowledge. AA’s leaderboard became the public venue where that acknowledgement happened.
6. Strengths and fragilities
AA’s strength is that the three design choices function as a reputation moat, meaning a structural advantage that late entrants cannot easily match. Once you carry more than 100 evaluated models in your archive, anyone trying to do the same thing has to fund an equivalently large re-test. The arXiv publication and the public-API measurement methodology pile on further barriers.
The fragilities are worth naming honestly. The team is about 20 people by public count, which is small for the analyst industry. Evaluation targets grow every month, and whether the team can hold quality depends on staffing and process. Revenue reportedly depends on subscriptions and advisory engagements. If AA ever took sponsorship from a large cloud provider, its neutrality itself would come into question.
7. Three audiences, and a proposal from Fullduplex.ai
The people reading AA’s leaderboard every day split into three groups.
Venture investors want external competitive reads on the models they have invested in. Model labs want to know where they rank globally, and where the gap to the leaders actually sits. Enterprise buyers want inputs for procurement decisions. Three groups look at the same leaderboard and care about different columns.
| Reader | Main purpose | Columns they watch | Decision |
|---|---|---|---|
| Venture investor | External read on portfolio competition | Rank trajectory, cost, speed | Investment call |
| Model lab | Own rank and gap to frontier | Per-category scores | R&D direction |
| Enterprise buyer | Vendor selection | Overall score, latency, price | Procurement |
From Fullduplex.ai’s point of view, the STS gap that AA is working to measure — the performance drop in voice mode — traces back to training data that is not full-duplex conversational, meaning real two-way real-time dialogue audio. In short, many models are weak in speech because they were never given enough genuine conversational audio at training time. Fullduplex.ai is the company working on that gap.
Three collaborations feel natural. First, a multilingual expansion of Big Bench Audio, especially into Japanese and Korean, which would complement today’s English-dominant coverage. Second, FDB-JA (a Japanese full-duplex bench) or an industrial call-center FDB extension, which would cover the thinnest corners of the current map. Third, AA as the citation venue for benchmarks that Fullduplex.ai releases independently. A two-way relationship.
Whoever controls how we measure controls the infrastructure. AA was the first to do this in AI.