On measuring AI

Nikil Ravi — Sun, 07 Jun 2026 06:22:17 GMT

Of late, the number of AI model releases per month has been growing very rapidly. Each model’s release is accompanied by the usual benchmark scores that show how well the model performs. Usually, all the charts released by the model provider show their model beating its predecessor and competitors on every reported benchmark. Everything trends up and to the right, this gets reported by various media outlets and reposted by prominent social media accounts, and within a few days, the benchmark scores have either directly or indirectly succeeded in shaping the opinion of most people about the model.

This is despite the fact that if you ask people, particularly AI researchers, how much they trust benchmarks, they will groan and tell you that they don't trust them at all. As someone who works on (third-party) evaluations, it’s been quite strange to watch this happen every day, and I’ve come to reflect more on this phenomenon. How is it that something that people don’t fully trust still gets reported by them and consumed by the world in a way that shapes stock markets, national security decisions, policymaking, and in some sense, the economy?

How evals have shaped AI so far

If you zoom out and study the history of AI, you’ll notice that evaluations have played a pivotal role in shaping the field. For decades, AI research moved forward through a conversation between benchmarks and papers that attempted to “solve” them. A benchmark was used as a way to organize a complaint about the capabilities of current models. The conversation in the research community was simple: someone would make a benchmark, and people would tinker with architectures, loss functions, hyper-parameters, and training paradigms until something topped the leaderboard. This would get published as a paper, and the cycle would repeat. From this hodgepodge of papers, many new techniques with staying power emerged.

This process has shaped the field for decades, and in some sense built modern AI. CIFAR, ImageNet, GLUE, SQuAD, MMLU, SWE-bench, HLE: all of these benchmarks forced the formation of entire subfields of AI, and the methods that beat them became the techniques the field now takes for granted. Convolutional networks, diffusion models, flow matching, transformers, instruction tuning, neural scaling laws, the modern AI training stack: all of these arose as a consequence of the benchmarks the people behind them were trying to score well on.

The natural question, then, is: what comes next? What concrete thing can’t models or agents do? Is this even the right way to frame the question?

The evaluator’s burden

A belief I’ve come to hold is that at least some of the burden of progress in AI is now on the evaluators, and I include the skeptics in this category.

There are many researchers who believe today's systems are missing something fundamental (and some of their arguments are, in my personal opinion, obviously correct). But my argument is the following: if you believe AI systems are plateauing and missing crucial capabilities, the burden of effort is now on you. Surely you must be able to create reproducible evidence for the claimed deficit in capabilities?

I use the phrase “reproducible evidence” deliberately. I do not think this evidence has to take the form of a traditional benchmark. It could be a more creative evaluation mechanism: a game, for instance, or a careful elicitation procedure with human raters. It just has to be the kind of thing another person could run and come to roughly the same conclusions.

It takes genuine insight and ability to produce this kind of evidence today. We’re now at the point where models are becoming so good that it is difficult to come up with good tests for them.

Super-human benchmarking

I think of the phase we’re entering as necessitating superhuman measurement. It requires us to design measurement techniques and tests for systems that, at least on a per-domain or per-task basis (and likely in the more general sense), have superhuman capabilities. I’m not asserting that such tests don’t exist at the moment; they just take much more work to design than they did even a few years ago, when nearly any task made for a good eval set.

Here’s a recent post echoing this sentiment from David Rein, a researcher at METR and creator of a very impactful benchmark called GPQA:

In 2023 I made GPQA, and it saturated in about two years. Here, the benchmark I was working on probably saturated while we were making it. People have said this before but it bears repeating: AI capabilities are improving faster than our ability to measure them is.
I don't really know where the field of AI capability evaluations should go from here. Good benchmarks typically consist of tasks that current AIs can't do, but that AI systems 2-3 generations from now might be able to—and this sets a north star for what the field aims towards.
But I'm not excited about most new benchmark ideas, because I expect them to be saturated weeks to months after release (or in our case, likely before). I could be wrong—maybe there are still fundamental bottlenecks—but that seems less and less likely to me over time.

This raises the question: how will measurement work in the superhuman regime that all the labs are promising? I think that no matter how capable models become, we (humans) will want to know how capable they are, what they are like, what they can and cannot accomplish, and what their values are. Ideally, we would have measurement mechanisms that scale with the capabilities of the models, the high-level idea being that if we (as dumb humans) are incapable of measuring a superhuman model, perhaps we could leverage other superhuman models to measure it. One could imagine various setups of this kind, with models evaluating each other, probing each other, and competing with each other, and I plan to write more about this in the future. My larger point here is that the community has not yet woken up to the fact that we have already moved into this regime. We need to become creative in the way we measure capabilities, and do it in a way that will not fall flat in a few months.

Measurement is also, I suspect, a bottleneck on recursive self-improvement. While different people mean different things when they talk about RSI (see ’s exceptional posts on recursive self-improvement for more color on this), the point is almost tautological: ‘improvement’ implies a metric, and a metric implies measurement.

The interesting question then becomes whether a sufficiently capable system can invent its own metric, whether through a conjecture-proof loop, a self-play mechanism, or a way of generating better evals than we can. It is also possible that model developers may (continue to) simply target a proxy metric, like monorepo loss, perplexity on a good held-out test set, data efficiency, or other such metrics. But I think it is fair to say that good measurement techniques are, at the very least, a practical bottleneck to RSI in the longer run.
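To make one of those proxy metrics concrete, here is a minimal sketch of perplexity on a held-out set. It assumes you already have per-token log-probabilities in natural log from whatever model you are scoring; the function name and the toy values are mine, for illustration only, and do not reflect any particular lab's setup.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` is assumed to hold natural-log probabilities the
    model assigned to each held-out token (hypothetical input format).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy held-out set: the model assigns probability 0.25 to every token,
# i.e. it is as uncertain as a uniform guess over four options.
uniform_logprobs = [math.log(0.25)] * 8
toy_perplexity = perplexity(uniform_logprobs)
```

Lower is better here, which is exactly what makes it attractive as a target: it is cheap, continuous, and easy to compare across runs, even though it says little about what the model can actually do.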

What is it we want to measure, and how?

Even if we return to today’s practical measurement (of fixed models), it is worth noting that running evaluations today is much harder than it used to be. It requires serious engineering effort and serious money, and it takes ongoing work, both to update results as new models are released and to communicate those updates to a public that increasingly relies on them. Evaluations must also grapple with reporting results across different dimensions. A model can be faster yet more expensive, or more accurate yet more verbose, all at once; which of those is ‘better’?
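One way to see why ‘better’ is underdetermined is to stop asking for a single winner and ask only which models are not beaten on every axis at once (a Pareto front). The sketch below is illustrative: the model names, the three axes, and all the numbers are made up, and a real report would have to choose its own dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str
    accuracy: float   # higher is better
    latency_s: float  # lower is better
    cost_usd: float   # lower is better


def dominates(a, b):
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    at_least_as_good = (
        a.accuracy >= b.accuracy
        and a.latency_s <= b.latency_s
        and a.cost_usd <= b.cost_usd
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_s < b.latency_s
        or a.cost_usd < b.cost_usd
    )
    return at_least_as_good and strictly_better


def pareto_front(results):
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical numbers, for illustration only.
results = [
    EvalResult("model-a", accuracy=0.82, latency_s=1.2, cost_usd=0.40),
    EvalResult("model-b", accuracy=0.78, latency_s=0.6, cost_usd=0.90),
    EvalResult("model-c", accuracy=0.75, latency_s=1.5, cost_usd=0.95),
]
front = [r.model for r in pareto_front(results)]
```

With these made-up numbers, model-c drops out because model-a beats it on all three axes, but model-a and model-b both survive: one is more accurate and cheaper, the other is faster. Picking between them requires a preference that the evaluation itself cannot supply.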

It is not clear what we even want to measure. We do not have even a semblance of a working framework for thinking about this issue. We have some threads of theoretical work (along the lines of AIXI and Solomonoff induction, or psychometrics, to pick two examples) that offer ways to reason about it, but we are far from being able to operationalize these notions in any meaningful way. This is one of the reasons we keep seeing debates about whether models are “truly intelligent”, or whether they’re “merely pattern-matching”. These are not well-posed questions, since we are essentially arguing about a thing we have never agreed how to measure.

All of this is to say that we as a community must pay more attention to measurement. Given that measurement impacts society this much, it deserves a lot more discussion, financial and engineering investment, and scrutiny than it gets. What are we measuring? Are we measuring whether models can do economically valuable tasks, or whether they generalize past them? Should we be evaluating the systems and companies behind the models, not just the models themselves? It is not feasible to design policies, have rational conversations, design governance mechanisms, or work on technical solutions to AI safety, if we don’t have a good way of understanding even roughly what it is we are talking about. Beyond the practical case for measurement, there is also the scientific question, which in my mind is arguably far more interesting: how do we measure intelligence, and is that even a well-posed question?

This is not an accusation aimed at any individuals or communities in particular. The people who build and run evals, and I am one of them, have been doing this work without quite asking what we are measuring. But I hope we can at least be clearer about what we don’t yet know, and what we are working towards.