Artificial Analysis: how a two-person Sydney side project became the AI industry’s independent scoreboard.
In September 2025, OpenAI quietly cited an outside firm’s measurement inside its GPT-Realtime launch materials. That firm was Artificial Analysis, a benchmarking company run out of Sydney and built by two former McKinsey consultants. This piece walks through the structure behind that fact: the founders, the design choices, the speech-to-speech extension, and the kind of neutral position in AI infrastructure the firm has locked down.
1. Why did OpenAI borrow an outsider’s scores?
In September 2025, when OpenAI announced its new voice model GPT-Realtime, an outside firm’s measurement quietly appeared inside the launch materials. The cited company was Artificial Analysis (AA), an independent benchmarking firm based in Australia and the United States. For a vendor to borrow another firm’s scores at its own product launch is quite unusual in this industry.
Why does that happen? The short answer is that AA is the company building fuel economy labels for AI models.
When buying a car, few people take the manufacturer’s claimed mileage at face value. That is why governments and consumer groups run their own testing regimes. The EPA in the United States, the WLTP cycle in Europe, the WLTC mode in Japan. Only when a shared yardstick exists, meaning “measured under these conditions with this method produces this number,” can a Toyota and a Honda be compared on the same footing. In the AI world, no such neutral testing body existed. Every model developer kept announcing “world best performance” using numbers measured under conditions that happened to flatter their own model.
What AA brought in is refreshingly simple. Put more than 100 AI models through the exact same set of test questions, grade them with the exact same method, and publish everything on a single leaderboard. In short, a standardized common test for AI.
What is more surprising is that AA started in 2023 as a weekend side project by Micah Hill-Smith and George Cameron, two former McKinsey consultants. Their first office was a one-room apartment in Sydney. Two years later, OpenAI was citing them in product launches.
One question drives this piece. How did two consultants working out of a Sydney apartment lock down the most influential neutral position in AI infrastructure? The answer is not in what they measure. It is in three design choices they baked into how they measure: transparency, reproducibility, and neutrality.
2. What a benchmark actually is
Comparing AI models is in fact harder than comparing car fuel economy.
With cars, the yardstick already exists. The WLTC driving cycle is defined by regulation, every automaker drives the same route under the same conditions, and the resulting number goes on the spec sheet. You can drop a Toyota next to a Honda and just compare.
AI never had that. When OpenAI announces “our new model is 50% smarter than the previous generation,” Google responds with “we beat OpenAI by 15%.” But each vendor decides its own definition of smart and measures it in settings that happen to flatter its own model. Comparison breaks down because the playing fields are never aligned.
What AA rebuilt is the playing field itself. Unpack the word benchmark and it has only three parts. A common task, meaning every model solves the same problems. A common grading rule, meaning scoring uses the same criteria. A common format, meaning results are published in the same shape. The same idea as a standardized college entrance exam. The same idea as the nutrition label on a food package.
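To make the three parts concrete, here is a minimal sketch of such a harness in Python. It is illustrative only, not AA’s pipeline: the questions, the grading rule, and the model stub are placeholders, but the shape is the point. One task list, one grader, one result format for every model.

```python
from dataclasses import dataclass
from typing import Callable

# Common task: every model answers the exact same questions.
TASKS = [
    {"question": "What is 17 * 24?", "reference": "408"},
    {"question": "Name the capital of Australia.", "reference": "Canberra"},
]

# Common grading rule: one scoring function, applied to every answer.
def grade(answer: str, reference: str) -> bool:
    return reference.lower() in answer.lower()

# Common format: every run is reported with the same fields.
@dataclass
class Result:
    model: str
    accuracy: float
    n_tasks: int

def run_benchmark(model_name: str, ask: Callable[[str], str]) -> Result:
    correct = sum(grade(ask(t["question"]), t["reference"]) for t in TASKS)
    return Result(model=model_name, accuracy=correct / len(TASKS), n_tasks=len(TASKS))

# "ask" would wrap a real model API call; a trivial stub stands in here.
print(run_benchmark("toy-model", lambda q: "I think it is 408" if "17" in q else "Canberra"))
```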
In short, AA built an independent fuel economy testing ground for the AI industry. No one asked them to. They just went and did it.
3. Two founders, a Sydney apartment, a weekend side project
A brief word about the two founders.
Micah Hill-Smith is from New Zealand. He was dux, meaning first in his class, at Rongotai College, then studied law and computer science at the University of Auckland. After graduation he joined McKinsey’s Tokyo office, and a few years later struck out on his own.
George Cameron is from Australia. He studied at the Australian National University, interned at Google, and then joined McKinsey himself. The two met on a McKinsey project.
They started the AI benchmarking project in 2023 as a weekend hobby. Their first office was a one-room apartment in Sydney. OpenAI had just released GPT-4, every vendor’s “world best model” claim was accelerating, and the two turned their consultant instinct, the habit of breaking things into a framework and comparing them, into an AI model comparison site.
“Whoever controls how we measure controls the infrastructure.” (fullduplex research, on the AA thesis)
Team size today is about 20 by public counts, still small by analyst industry standards. Even so, by September 2025 OpenAI was citing them in product launches. The fairest read is that the direction the two founders picked and the empty space in AI at that moment lined up. That alignment is what gave the project this speed.
4. The three design choices
What AA really locked down is not what they measure. It is how they measure.
Point 1 — Transparency
AA disclosed that a single evaluation of OpenAI’s reasoning model o1 cost them $2,767. Publishing the reasoning budget, meaning the amount of compute an AI is allowed to spend thinking about one problem, is a rare practice. It matters because the same model’s score changes depending on how much compute it can spend. Give it ten times the budget and the score goes up. If you do not disclose how much was spent for that score, the comparison does not hold up.
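A small sketch of why that disclosure matters, with placeholder names and numbers (nothing below is a real AA figure): a score is published together with the budget it was measured under, and two scores measured under different budgets simply refuse to compare.

```python
from dataclasses import dataclass

@dataclass
class DisclosedScore:
    model: str
    score: float                  # benchmark accuracy, 0..1
    reasoning_budget_tokens: int  # how much "thinking" the model was allowed per question
    eval_cost_usd: float          # what the full evaluation run cost

def comparable(a: DisclosedScore, b: DisclosedScore) -> bool:
    # Scores only belong on the same leaderboard row if they were bought
    # with the same reasoning budget.
    return a.reasoning_budget_tokens == b.reasoning_budget_tokens

# Illustrative numbers only: same model, ten times the budget, higher score.
low_budget = DisclosedScore("example-reasoning-model", 0.60, 4_000, 250.0)
high_budget = DisclosedScore("example-reasoning-model", 0.70, 40_000, 2_500.0)

print(comparable(low_budget, high_budget))  # False: not the same test, so not the same row
```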
Point 2 — Reproducibility
AA’s flagship benchmark, AA-Omniscience (6,000 questions across 42 topics), is published as a paper on arXiv. In other words, the methodology is open to peer review. Their other flagship, Big Bench Audio (1,000 questions, 23 voices, graded by Claude 3.5 Sonnet), follows the same pattern. That anyone should be able to rerun a test and get the same result, the basic principle of science, had been surprisingly absent from AI benchmarking.
Point 3 — Neutrality
AA uses a mystery shopper approach, the same logic an undercover diner uses at a restaurant. They do not notify model providers in advance. They access the models only through the same public API that any other user would use. That removes the temptation for a vendor to quietly serve an “optimized for evaluation” version of the model.
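Here is a sketch of that principle, not AA’s actual harness. The endpoint, key, and payload shape are placeholders; the point is that the client is built only from what any paying customer has, with no advance notice and nothing that lets the vendor recognize a benchmark run.

```python
import json
import time
import urllib.request

PUBLIC_ENDPOINT = "https://api.example-vendor.com/v1/chat"  # placeholder, not a real URL
API_KEY = "an-ordinary-customer-key"                        # nothing the vendor could single out

def mystery_shop(prompt: str) -> dict:
    """Query the model exactly the way any public customer would."""
    payload = json.dumps({"model": "vendor-model", "input": prompt}).encode()
    req = urllib.request.Request(
        PUBLIC_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
            # Deliberately no "this is a benchmark" header and no advance notice:
            # the vendor cannot route the request to an evaluation-tuned variant.
        },
    )
    started = time.time()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return {"answer": body, "latency_s": time.time() - started, "measured_at": started}
```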
All three have to be present at once. Concretely, AA is the firm that publishes “how much we spent,” “how we measured,” and “that we did not call ahead.”
5. The same three choices, now in speech-to-speech
Everything above was about text-based large language models (LLMs). What makes AA genuinely interesting is that they are bringing the same three design choices into speech-to-speech (AI that responds directly from audio to audio, STS for short).
AA’s Speech-to-Speech Leaderboard measures along two main axes.
The first is Big Bench Audio. It is the speaking and listening version of the reasoning test. The same 1,000 problems, delivered in 23 voice variations, graded by Claude 3.5 Sonnet. In short, the same kinds of questions they were measuring in text, now solved through audio.
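A sketch of that setup, with stubs standing in for every real component: the fixed question set is rendered in the voice variants, the model under test answers through audio, and one fixed judge grades every transcript, so the only thing that changes between leaderboard rows is the model. The question-to-voice pairing shown here is invented, not AA’s.

```python
# Stand-ins for the fixed 1,000-question set and the 23 voice variants.
TRIALS = [
    ("q-001", "voice-a"),
    ("q-002", "voice-b"),
    ("q-003", "voice-a"),
]

def synthesize(question_id: str, voice: str) -> bytes:
    """Placeholder: render the question as audio in the given voice."""
    return f"{voice}:{question_id}".encode()

def answer_in_audio(model: str, audio: bytes) -> str:
    """Placeholder: the speech-to-speech model's spoken answer, transcribed."""
    return "placeholder answer"

def judge(question_id: str, answer_text: str) -> bool:
    """Placeholder for the single fixed LLM judge that grades every model."""
    return False

def big_bench_audio_style_score(model: str) -> float:
    correct = sum(judge(q, answer_in_audio(model, synthesize(q, v))) for q, v in TRIALS)
    return correct / len(TRIALS)

print(big_bench_audio_style_score("example-sts-model"))  # 0.0 with these stubs
```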
The second is Conversational Dynamics on FDB. FDB (Full-Duplex Bench) is a measurement standard developed in academia for full-duplex communication, meaning two-way, phone-like, real-time audio. It evaluates turn-taking (who speaks when) and barge-in (speaking while the other side is still talking). AA took this benchmark and reshaped it into a form that works for comparing commercial models.
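FDB’s exact scoring formulas are not reproduced here, so the sketch below is only an approximation of the two behaviors named above: turn-taking measured as the gap between the end of a user turn and the start of the model’s reply, and barge-in measured as how long the model keeps talking after the user starts speaking over it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    speaker: str   # "user" or "model"
    start: float   # seconds from session start
    end: float

def turn_taking_gaps(turns: list) -> list:
    """Gap between the end of a user turn and the start of the model reply that follows."""
    gaps = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker == "user" and nxt.speaker == "model":
            gaps.append(nxt.start - prev.end)
    return gaps

def barge_in_yield_time(model_turn: Turn, user_interrupt_start: float) -> Optional[float]:
    """How long the model kept talking after the user started speaking over it."""
    if model_turn.speaker != "model":
        return None
    if not (model_turn.start < user_interrupt_start < model_turn.end):
        return None
    return model_turn.end - user_interrupt_start

session = [
    Turn("user", 0.0, 2.1),
    Turn("model", 2.6, 6.0),   # replies roughly half a second after the user stops
    Turn("user", 4.8, 5.5),    # barges in while the model is still talking
]
print(turn_taking_gaps(session))                          # about 0.5 s
print(barge_in_yield_time(session[1], session[2].start))  # about 1.2 s before the model yields
```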
Measuring on these two axes surfaces the most interesting finding. OpenAI’s GPT-4o can solve 92% of these problems in text mode, but drops to 66% in STS mode. A 26-point gap. Gemini and Claude show similar gaps.
Why the drop? The reasoning chain (the steps an AI takes to reach a conclusion) gets shorter. The audio codec (the translator that compresses audio for transmission) loses information. Latency constraints (response time budgets) leave no room to reconsider an answer. Multiple causes stack.
When OpenAI cited AA’s Big Bench Audio at the GPT-Realtime launch in September 2025, it was because this was one of the few independent metrics that looked at this gap directly. Said differently, “voice mode is still weaker than text mode” is an inconvenient truth the AI industry has to acknowledge. AA’s leaderboard became the public venue where that acknowledgement happened.
6. Strengths and fragilities
AA’s strength is that the three design choices function as a reputation moat, meaning a structural advantage that late entrants cannot easily match. Once you carry more than 100 evaluated models in your archive, anyone trying to do the same thing has to fund an equivalently large re-test. The arXiv publication and the public-API measurement methodology pile on further barriers.
The fragilities are worth naming honestly. The team is about 20 people by public count, which is small for the analyst industry. Evaluation targets grow every month, and whether the team can hold quality depends on staffing and process. Revenue reportedly depends on subscriptions and advisory engagements. If AA ever took sponsorship from a large cloud provider, its neutrality itself would come into question.
7. Three audiences, and a proposal from Fullduplex.ai
The people reading AA’s leaderboard every day split into three groups.
Venture investors want external competitive reads on the models they have invested in. Model labs want to know where they rank globally, and where the gap to the leaders actually sits. Enterprise buyers want inputs for procurement decisions. Three groups look at the same leaderboard and care about different columns.
| Reader | Main purpose | Columns they watch | Decision |
|---|---|---|---|
| Venture investor | External read on portfolio competition | Rank trajectory, cost, speed | Investment call |
| Model lab | Own rank and gap to frontier | Per-category scores | R&D direction |
| Enterprise buyer | Vendor selection | Overall score, latency, price | Procurement |
From Fullduplex.ai’s point of view, the STS gap that AA is working to measure — the performance drop in voice mode — traces back to training data that is not full-duplex conversational, meaning real two-way real-time dialogue audio. In short, many models are weak in speech because they were never given enough genuine conversational audio at training time. Fullduplex.ai is the company working on that gap.
Three collaborations feel natural. First, a multilingual expansion of Big Bench Audio, especially into Japanese and Korean, which would complement today’s English-dominant coverage. Second, FDB-JA (a Japanese full-duplex bench) or an industrial call-center FDB extension, which would cover the thinnest corners of the current map. Third, AA as the citation venue for benchmarks that Fullduplex.ai releases independently. A two-way relationship.
Whoever controls how we measure controls the infrastructure. AA was the first to do this in AI.