How much progress have we made in AI, and how close are we to a general human-level intelligence (AGI)? It's tempting to pick some benchmarks that surely only something intelligent could pass. Superhuman chess, Go, or the ability to hold a conversation have all been given as examples, among many others. Those three have arguably been beaten (by Deep Blue, AlphaGo, and ELIZA), but there's no AGI in sight yet. What's going on?
I claim the problem is that benchmarks fail to measure something as general as "intelligence". Not just these particular benchmarks, but all benchmarks. And it's not obvious how a given benchmark will fail until you see a clearly non-AGI system beat it.
It was once thought outplaying humans at chess would require real "intelligence". Deep Blue beating Kasparov was a dramatic occasion! But superhuman chess clearly turned out to be a much easier problem than intelligence. It's now mundane. You can even run it on your smartphone, and if you call the computer opponent "the AI" at all, it's as a generic name for a computer opponent, not because you think it's "intelligent".
Computer vision has long been another hard task for computers. A lot of "AI" work went into getting computers to recognize objects in pictures that we'd find easy to identify. Computers progressed from useless, to passable, to arguably superhuman on major benchmarks like ImageNet. Even my car - not a Tesla - quietly does optical image processing for automated steering and braking. I haven't found the words "AI" anywhere in the manual or the marketing, just bland terms like "lane keeping assistance".
> A lot of cutting edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it's not labelled AI anymore.
>
> - Nick Bostrom
This sentiment is so common it has its own name, the "AI effect". It's easy to underestimate how much progress computers have made into domains once thought to be the preserve of human intelligence. Tasks we've succeeded at addressing with computers come to seem mundane, mere advances in some other field, not true AI. We miss that it was work in AI that led to them.
This might suggest AGI is closer than we think. We retroactively reduce the significance of benchmarks AI attains, and so underestimate the progress made in AI so far. The remaining distance might look much shorter once we account for quite how far we've come already. If you want to be able to convince others - for any reason, from attracting investment to raising concerns about AI risk - you might want some way we could all agree that AGI is near.
One way to address this issue is to ask people to commit now to some benchmark(s) that they'll accept as a marker of progress towards AGI. That protects against this revisionism. You can ensure that when a proto-AGI passes the test, the result can't be dismissed.
The AI effect has another side though. Perhaps the benchmarks were always flawed, because we set them as measures of a general system, forgetting that the first systems to break through might be specialized to the task. You only see how "hackable" the test was after you see it "passed" by a system that clearly isn't "intelligent". Taking previous benchmarks at face value might falsely suggest progress towards AGI from task-specific systems.
An obvious solution is to define better benchmarks. We've seen previous benchmarks that failed. We can sit down and come up with new ones, being careful that they're not vulnerable to any of the non-generalizable systems or approaches we've learned exist. Then we can commit to these new benchmarks.
I claim not only that these new benchmarks will turn out to have the same issue, but that this is a fundamental problem with defining precise benchmarks for something as general as intelligence.
We falsely extrapolate correlations that hold within one class of system - humans - to other classes. Generally speaking, smarter humans are better at chess, so we saw "being good at chess" as a sign of intelligence. It seemed natural to expect that a system beating humans at chess would likewise have human-level intelligence. Yet two decades after Deep Blue beat Kasparov, with top computers now unambiguously better than any human at chess, those systems remain far from general intelligence. It seems strange now that many once saw chess as equivalent to general intelligence for computers. Easy to forget.
Of course, this shouldn't be surprising. Consider the limited domain of "motion" instead of "intelligence". As with intelligence, humans are relatively general systems: humans who are faster on flat ground are generally also faster on rough ground, or in water. You might conjecture that anything much faster than humans on the flat would surely also beat them in water. But then you see that a car is much faster than a human on flat ground, yet useless in water, or on rough enough ground. It's a system specialized to a smaller range of problems.
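To make the extrapolation failure concrete, here's a toy simulation (all numbers invented for illustration): within the "human" class a single fitness factor drives both land and water speed, so a trend fit on humans confidently predicts that anything fast on land is fast in water - a prediction the car cheerfully breaks.

```python
# Toy illustration: a within-class correlation that fails to extrapolate
# to a specialized system. All quantities are made up for the sketch.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical humans: one general "fitness" factor drives both speeds.
fitness = rng.normal(0, 1, size=1000)
land_speed = 5 + fitness + rng.normal(0, 0.3, size=1000)        # toy m/s
water_speed = 1 + 0.3 * fitness + rng.normal(0, 0.1, size=1000)  # toy m/s

print("Correlation among humans:", np.corrcoef(land_speed, water_speed)[0, 1])

# Fit the within-class trend, then extrapolate it to a specialized system.
slope, intercept = np.polyfit(land_speed, water_speed, 1)
car_land_speed = 30.0   # far beyond any human on flat ground
predicted = slope * car_land_speed + intercept
print(f"Extrapolated water speed for a car: {predicted:.1f} m/s")
print("Actual water speed for a car: ~0 m/s (it sinks)")
```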
Similarly, among a class of general intelligences, it's easy to find measures that correlate with how intelligent they are. But that's not what you want to measure for AI benchmarks! It's much harder to construct precise benchmarks that measure whether a system is generally intelligent.
It gets worse. Whenever you do define a precise benchmark that gets widely accepted by the community, it becomes a target for AI researchers to beat. Everyone agreed that this benchmark measures intelligence, so improving benchmark results must be good work. Against such relentless optimization, by individuals and by the community as a whole, any decoupling between the new benchmark and AGI progress will be found and exploited.
This is Goodhart's law in action. The very act of agreeing on a benchmark for AGI can make it useless in that role! If you use it as an early warning, then long before the first proto-AGI passes the test, countless false alarms will have been raised and the benchmark long since ignored.
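Here's a minimal sketch of that decoupling, under invented assumptions: "true" capability spans many task dimensions, the benchmark samples only a few of them, and a system optimized directly against the benchmark aces it while remaining narrow.

```python
# Toy Goodhart's law: the benchmark measures a small sample of tasks,
# so optimizing it directly need not improve generality. Numbers invented.
import numpy as np

rng = np.random.default_rng(1)

N_TASKS = 1000  # dimensions of "true" capability
benchmark_tasks = rng.choice(N_TASKS, size=10, replace=False)  # what the benchmark sees

def benchmark_score(capability):
    return capability[benchmark_tasks].mean()

def true_generality(capability):
    return capability.mean()

# A "generally improving" system: capability rises roughly uniformly.
general_system = np.full(N_TASKS, 0.4) + rng.normal(0, 0.05, N_TASKS)

# A "benchmark-targeting" system: effort poured only into the measured tasks.
targeted_system = np.full(N_TASKS, 0.05)
targeted_system[benchmark_tasks] = 0.99

for name, system in [("general", general_system), ("targeted", targeted_system)]:
    print(f"{name:>8}: benchmark={benchmark_score(system):.2f}, "
          f"true generality={true_generality(system):.2f}")
# The targeted system wins on the benchmark while being far less general.
```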
It's all but impossible for a benchmark to be both precise and quantifiable enough that we can easily agree on what progress has been made, and general and flexible enough not to succumb to Goodhart's law.
The Turing Test highlights this tension. Well-defined, easily measurable and quantifiable versions - like the ability to fool ~50% of random people in a short conversation - have arguably been passed already by chatbots (and GPT-3 even more so). But the old chatbots relied heavily on specialized countermeasures to certain lines of questioning, and even GPT-3 is specialized at "producing plausible-looking text". None are AGI.
The most general version is that an expert with an arbitrarily long time to discuss any topic can't tell the difference better than chance. The chatbots, and even GPT-3, fail this one. But it's impractical to run and impossible to replicate - was the expert expert enough, and did they spend long enough?
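For illustration, here's how mechanically simple the quantifiable end of that spectrum is to score - a minimal sketch with hypothetical judge counts, checking whether the fooled fraction is distinguishable from coin-flip guessing. Everything the gaming exploits (judge expertise, conversation length, topic) sits outside this protocol.

```python
# Sketch of scoring a precise, quantifiable Turing-Test variant:
# n judges each guess "human" or "machine" after a short chat, and the
# system "passes" if the fooled fraction is statistically indistinguishable
# from the 50% expected by chance. Judge counts below are hypothetical.
from math import comb

def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value (doubled smaller tail, capped at 1)."""
    lower = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))
    upper = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    return min(1.0, 2 * min(lower, upper))

judges = 100   # hypothetical number of interrogators
fooled = 47    # hypothetical number who guessed wrong

p_value = two_sided_binomial_p(fooled, judges)
print(f"Fooled {fooled}/{judges} judges, p = {p_value:.2f} vs. chance")
print("PASS" if p_value > 0.05 else "FAIL")
```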
Where does this leave us for measuring AI progress? No option seems great:
- Benchmarks known at the time turn out to be poor measures of general progress (as above)
- New benchmarks applied retroactively might not be computable for old systems, and risk bias from our knowing the history
- Expert surveys have substantial flaws (I'll write about this in a later post)
Still, we can try to piece together a view from the available evidence, taking its flaws into account and attempting to adjust for them. I'll write more later about my personal conclusions on AI progress from this.