A few years before anyone was talking about ChatGPT, a research team built something that should have changed how we talk about AI.
They had trained a model to detect skin cancer from clinical photographs. It performed well in testing. High accuracy on the validation set. The kind of result that looks, on paper, like a genuine medical advance.
Then someone looked more carefully at what the model had actually learned. It had learned to detect rulers.
In the clinical training data, photographs of malignant lesions often included a ruler for scale. Doctors measured suspicious growths as part of standard practice. Benign lesions were typically photographed without one. The correlation was consistent across thousands of images, and the model found it. From the model's perspective, this was a perfectly reasonable discovery. Rulers reliably predicted the correct label. The model used what it found.
On real-world clinical data, where no such correlation existed, it failed.
Andreas Møgelmose shared this story in our recent conversation as an illustration of the explainability problem in AI. That framing is accurate. But it undersells how fundamental the problem is.
The Mistake Before the Mistake
The model didn't create a problem. It revealed one that was already there.
The training data contained a systematic bias, and nobody caught it. Not because they weren't careful, but because the bias wasn't in any individual image. It was in the pattern across thousands of images: malignant lesions photographed one way, benign ones another. A human reviewer looking at individual photos would likely miss it. A model trained on all of them at once would find it immediately.
The model did what it was designed to do. It found a strong, consistent correlation in the data and used it to make predictions. The question of whether that correlation was causally related to the thing being predicted was never asked, because machine learning systems don't ask that question. They find patterns that work in the available data and assume those patterns will generalise.
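The dynamic is easy to reproduce. Here is a minimal sketch in Python, with purely synthetic numbers standing in for the clinical photographs: a weak but genuine "lesion" signal, plus a "ruler" flag that tracks the label almost perfectly in training and not at all in deployment. None of this is the original study's data or model; it's the failure mode in miniature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Synthetic stand-ins: a genuinely predictive but noisy lesion signal,
# and a ruler flag that agrees with the label 98% of the time.
y = rng.integers(0, 2, n)
lesion = y + rng.normal(scale=2.0, size=n)        # real signal, weak
ruler = np.where(rng.random(n) < 0.98, y, 1 - y)  # spurious, near-perfect
X = np.column_stack([lesion, ruler])

model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # ~0.98 on the data it trained on

# Deployment: the disease signal is unchanged, but rulers no longer
# track the label.
y_new = rng.integers(0, 2, n)
X_new = np.column_stack([y_new + rng.normal(scale=2.0, size=n),
                         rng.integers(0, 2, n)])
print(model.score(X_new, y_new))  # ~0.5, a coin flip
```

The first number looks like a breakthrough. The second is a coin flip. Nothing in the first number warned us about the second.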
This is Goodhart's Law, running at machine speed. The economist Charles Goodhart formulated the principle in the context of monetary policy: when a measure becomes a target, it ceases to be a good measure. The observation has turned out to apply far beyond economics. Schools evaluated on test scores teach to the test. Hospitals evaluated on readmission rates find ways to manage readmissions that don't always involve making patients well. Financial models optimise for whatever signals predicted defaults in historical data, and those signals shift when underlying conditions change.
In each case, a measurable proxy gets substituted for the thing we actually care about. The proxy and the goal start out correlated enough to be useful. Then the system optimises for the proxy, and the correlation breaks down.
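None of this requires bad faith, or even a sophisticated optimiser. Selecting hard on a noisy proxy is enough, as a toy simulation shows; the quantities below are arbitrary, not drawn from any of the domains above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
quality = rng.normal(size=n)          # the thing we actually care about
proxy = quality + rng.normal(size=n)  # a noisy measure of it

# Before targeting, the proxy is genuinely informative.
print(np.corrcoef(proxy, quality)[0, 1])  # ~0.71

# Make it a target: select the top 1% by proxy score.
top = proxy >= np.quantile(proxy, 0.99)
print(proxy[top].mean())    # ~3.8, spectacular on paper
print(quality[top].mean())  # ~1.9, half of that was selection on noise
```

The proxy was honest until we optimised for it. Selection did the rest.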
AI makes this process faster, more opaque, and substantially harder to catch.

Why Scale Changes the Stakes
A human clinician who develops a flawed mental model of what malignant lesions look like will make errors. Those errors are observable over time. Colleagues notice patterns. Feedback accumulates. The flawed model tends to get corrected.
A deep learning model trained on biased data and deployed at scale makes the same error millions of times before the pattern becomes detectable. The errors are distributed, not clustered. The feedback mechanisms that correct a human's reasoning either don't exist or are too slow to matter before the harm accumulates.
"We have fully explainable models," Andreas acknowledges, "but they're far weaker than top neural networks. There's a trade-off."
This trade-off is the current working condition of applied AI. In safety-critical contexts, including medical diagnosis, autonomous vehicles, credit decisions, and criminal risk assessment, we're regularly deploying systems we can't fully audit, at scales that make auditing necessary, under the assumption that after-the-fact error detection will be sufficient.
The skin cancer case suggests that assumption deserves more scrutiny than it typically gets.
The Counterargument and Its Limits
The standard counterargument runs roughly like this: even an imperfect AI may outperform the alternative. If the model detects cancer at a higher rate than average clinicians, even with the ruler bias, it might be net beneficial. Demanding perfect explainability before deployment delays tools that could save lives.
This argument has real force, and in some cases it's correct. But stopping there concedes too much, too quickly.
The question isn't whether AI outperforms humans on some average task under controlled conditions. The question is whether we understand the systems we deploy well enough to know when they will fail, and how badly. A model that outperforms clinicians on average while failing systematically on the patients who most need it is a different thing from a model that performs uniformly better. Without explainability, we can't tell the difference.
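The distinction only shows up if you slice the evaluation. A toy example in Python, with entirely made-up numbers, illustrates how a strong headline figure can coexist with near-chance performance on the subgroup that matters most:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Illustrative assumptions, not real clinical data: 5% of patients
# are severe cases, and the model is far less reliable on exactly them.
severe = rng.random(n) < 0.05
correct = np.where(severe,
                   rng.random(n) < 0.55,   # severe cases: barely above chance
                   rng.random(n) < 0.95)   # routine cases: excellent

print(f"overall accuracy: {correct.mean():.2f}")          # ~0.93
print(f"severe subgroup:  {correct[severe].mean():.2f}")  # ~0.55
```

Both numbers are true. Only one of them tends to get reported.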
The EU AI Act's push for explainability in high-risk uses is a regulatory acknowledgment of this. But regulation tends to follow disasters, and in high-stakes domains, disasters are costly in ways that go well beyond money.

What This Actually Requires
The explainability problem won't be solved by better explainability tools alone, though better tools would help. It requires a different relationship between the people building AI systems and the people deploying them. One that asks not just whether the model performed well on the test set, but what the model actually learned to measure.
That question is uncomfortable because answering it takes time and expertise that deployment timelines don't always accommodate. It's easier to ship the system that appears to work and investigate failures later.
Rulers aren't always rulers. Sometimes the spurious correlation is subtler, buried in how data was collected, in who was represented in the training set, in what counted as a positive label and under what conditions. These correlations are invisible until the failure mode surfaces.
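One crude but genuinely useful probe is an intervention test: scramble one input at a time and watch what happens to the predictions. The sketch below reuses the synthetic ruler setup from earlier; if severing a feature from the label flips the model's answers, that feature is what the model learned. On images, the same idea appears as occlusion tests and saliency maps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rebuild the synthetic ruler dataset from the earlier sketch.
rng = np.random.default_rng(0)
n = 5_000
y = rng.integers(0, 2, n)
X = np.column_stack([y + rng.normal(scale=2.0, size=n),          # lesion signal
                     np.where(rng.random(n) < 0.98, y, 1 - y)])  # ruler flag
model = LogisticRegression().fit(X, y)

# Intervene on one feature at a time.
baseline = model.predict(X)
for i, name in enumerate(["lesion signal", "ruler flag"]):
    X_perm = X.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])  # sever it from the label
    changed = (model.predict(X_perm) != baseline).mean()
    print(f"{name}: {changed:.0%} of predictions flip")

# Typical output: lesion signal ~0%, ruler flag ~50%.
# The ruler was doing essentially all of the work.
```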
The question we should be asking before deployment is always the same: what did this model actually learn to optimise for?
The answer is rarely as obvious as a ruler in a photograph. That's precisely what makes it worth asking.
