Most of the AI conversation right now is about language. Models that read and write, answer questions and generate text. It's easy to forget the revolution didn't start there.
In 2012, a team of researchers entered the ImageNet image recognition competition. Their model, a deep neural network trained on more than a million labelled photographs, recognised objects with accuracy that was, by the standards of the time, striking. They were working on a problem that had occupied researchers for decades. How do you make a computer understand what it's looking at?
What happened next reshaped not just computer vision, but the entire field. The techniques that cracked image recognition turned out to work across speech recognition, language understanding, and generative art. The AI tools we interact with today are, in a real sense, descendants of a vision paper from 2012.
Understanding why that matters, and what it actually means for a machine to see, requires going back to basics. What are neural networks, and why do they have almost nothing to do with the brain? What does it mean for a system to learn from data? And why does the same approach that identifies cats in photos sometimes learn to find rulers instead of cancer?
A Sense Computers Were Missing
"Vision is probably our most important sense," Andreas says. "If computers can see, they can do things that would otherwise be impossible."
It's a claim worth sitting with. We tend to talk about AI in terms of language right now, but the ability to see, and to understand what's seen, is foundational to a much wider range of capabilities. Self-driving vehicles, medical imaging, robotic manipulation, satellite analysis, industrial inspection. Each of these depends on computer vision at its core. Language models are impressive, but a robot that can only read is of limited use in a warehouse.
There's something interesting here that goes beyond engineering. Vision evolved independently in at least 40 different animal lineages. Vertebrates and invertebrates, molluscs and arthropods, organisms about as different from one another as complex life gets, all eventually developed eyes. That's strong evidence that vision solves something so fundamental about navigating and understanding the world that almost every branch of complex life converged on it. The fact that AI researchers found themselves circling the same problem isn't coincidental. It points at something important about what it means to be an intelligent agent in a physical environment.
How the Subfields Found Each Other
For most of AI's history, researchers working on images barely talked to researchers working on sound, and neither group talked much to the people working on language. Each subfield had developed its own methods, its own benchmarks, its own conferences and journals.
"Over the last 10 to 15 years, really since 2012, these subfields converged," Andreas explains, "because neural networks that broke through in vision turned out to work across domains."
This convergence has an interesting structural parallel in the history of biology. When researchers established the double helix structure of DNA in the 1950s, they provided a single underlying mechanism that unified what had been separate fields: genetics, biochemistry, evolutionary biology, microbiology. Each had been building useful knowledge independently. The breakthrough was realising they were all describing different aspects of the same underlying process.
Something similar happened after 2012. Vision researchers had cracked a problem that, it turned out, wasn't specific to vision. They'd found an approach to learning representations from raw data that generalised to any problem involving pattern recognition at scale. Language, speech, and image processing weren't different problems requiring different solutions. They were the same class of problem, and one approach was powerful enough for all of them.

What Neural Networks Actually Are
When people hear "neural network," they tend to assume it's a digital version of the brain. Andreas is direct about this: "People hear 'neural' and think we replicate the brain. We don't."
Artificial neural networks are loosely inspired by biology in the sense that they consist of many interconnected units passing signals to each other. But the individual units don't do anything like what biological neurons do. Each one performs basic arithmetic: it multiplies its inputs by weights, adds the results, and applies a simple threshold. Connect enough of them, and you get a system capable of approximating complex functions. Feed in numbers (an image is just a grid of pixel values) and get back a decision.
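To make "basic arithmetic" concrete, here is a minimal sketch of a single unit in Python. The inputs, weights, and bias are made up purely for illustration:

```python
import numpy as np

# One artificial "neuron": multiply, add, squash. This is the entire
# repertoire of a single unit; nothing here resembles the electrochemical
# behaviour of a biological neuron.
def unit(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias  # multiplications and additions
    return max(0.0, z)                  # ReLU: pass positive values, zero out the rest

# Hypothetical values, purely to show the arithmetic.
pixels = np.array([0.2, 0.8, 0.5])       # three pixel intensities
weights = np.array([0.4, -0.1, 0.7])
print(unit(pixels, weights, bias=0.05))  # 0.08 - 0.08 + 0.35 + 0.05 = 0.40
```

A network is nothing more than millions of these wired together, the output of one feeding the inputs of the next.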
What makes this work is learning, and learning happens through a process called backpropagation. A network starts with millions of parameters, called weights, set to random values. You show it an example and it produces a prediction. You compare that prediction to the correct answer, then calculate which adjustments to the weights would have produced a slightly better prediction. You make those adjustments, then repeat with the next example. Millions of times.
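The loop is easier to see in miniature than in prose. Here is a sketch with a single-weight "network" and a made-up dataset. Backpropagation proper is the chain-rule bookkeeping that computes step 3 efficiently across many layers of weights, but the shape of the loop is the same:

```python
import numpy as np

# Toy task: learn a weight w so that the prediction w * x matches y = 3 * x.
# A real network has millions of weights; the loop is the same shape.
rng = np.random.default_rng(0)
w = rng.normal()                    # start from a random value

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs                       # the "correct answers"

learning_rate = 0.01
for step in range(1000):
    for x, y in zip(xs, ys):
        pred = w * x                # 1. produce a prediction
        error = pred - y            # 2. compare to the correct answer
        grad = 2 * error * x        # 3. which adjustment would help? (derivative of error**2)
        w -= learning_rate * grad   # 4. nudge the weight, then repeat

print(w)                            # converges to ~3.0
```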
No human decides what the network should learn. The internal representations emerge from the training process itself. The network discovers them. This is part of why neural networks are hard to explain: we didn't design what they learned. We designed a process that produced learning, and then observed the results.
Building Vision in Layers
Standard neural networks run into a practical problem when applied to images. A high-resolution image has millions of pixels, and connecting every pixel to every part of the network creates a system that's nearly impossible to train effectively.
Convolutional neural networks solve this by constraining the connections. Each unit looks at only a small patch of nearby pixels, and the same small filter is reused at every position in the image, which cuts the number of parameters dramatically. This forces the network to build understanding incrementally.
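The arithmetic behind that constraint is worth spelling out. The layer sizes below are illustrative, not taken from any particular model:

```python
# Back-of-the-envelope parameter counts for one layer on a colour image.
height, width, channels = 1000, 1000, 3
inputs = height * width * channels           # 3,000,000 input numbers

# Fully connected: every input wired to every one of, say, 1,000 units.
dense_params = inputs * 1000                 # 3,000,000,000 weights

# Convolutional: each unit sees a 3x3 patch, and the same filter is
# reused at every position. 64 filters scan the whole image.
conv_params = (3 * 3 * channels + 1) * 64    # 1,792 weights (the +1 is a bias)

print(f"{dense_params:,} vs {conv_params:,}")  # 3,000,000,000 vs 1,792
```

Six orders of magnitude fewer parameters, which is roughly the difference between a network that can be trained and one that can't.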
"Early layers learn simple patterns," Andreas explains. "Brightness, edges. Deeper layers compose edges into shapes, later into parts like eyes and mouths, and arrangements like faces. It's a hierarchy that builds understanding."
This layered organisation wasn't designed to mimic the brain, but it turns out to resemble, in rough outline, how the visual cortex processes information. The primary visual cortex responds to orientations and edges. Higher areas respond to more complex features. Even higher areas respond to faces and objects. The engineers weren't copying neuroscience. They were building what worked, and what worked happened to echo a structure that evolution had discovered independently.
When the Model Learns the Wrong Thing
The most striking moment in our conversation is a single story about skin cancer and a ruler.
A research team had trained a model to classify skin lesions as malignant or benign. The model achieved high accuracy on its test data, and the team considered it a success. Then someone looked more carefully.
"The model learned to look for rulers, not cancer," Andreas says. "On real-world data, it failed."
Here's what had happened. Clinical photographs of malignant lesions often included a ruler for scale. Doctors measured suspicious growths as part of standard practice. Benign lesions were photographed without one. The training data reflected this correlation consistently. The model found it, learned it, and used it. It had no way to know the ruler was irrelevant to the diagnosis. It just found a pattern that reliably predicted the right answer in the data it had seen.
On data from the real world, where that correlation didn't hold, the model failed. Confidently.
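The failure mode is easy to reproduce in miniature. The following is a synthetic illustration of shortcut learning in general, not the study's actual data or model: a "ruler" feature tracks the label perfectly in the training and test sets, then stops tracking it in deployment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, ruler_tracks_label):
    labels = rng.integers(0, 2, n)                # 1 = malignant, 0 = benign
    signal = labels + rng.normal(0, 2.0, n)       # the real signal: present but noisy
    ruler = labels if ruler_tracks_label else rng.integers(0, 2, n)
    return np.column_stack([signal, ruler]), labels

X_train, y_train = make_data(2000, ruler_tracks_label=True)
X_test, y_test = make_data(500, ruler_tracks_label=True)    # test data shares the bias
X_real, y_real = make_data(500, ruler_tracks_label=False)   # deployment doesn't

model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))        # ~1.0: looks like a success
print("real-world accuracy:", model.score(X_real, y_real))  # near chance: the shortcut is gone
```

Nothing in the training process flags the problem. The ruler column is, as far as the optimisation is concerned, simply the best feature available.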
This is sometimes called the explainability problem, but that framing undersells how fundamental it is. The model's error wasn't an anomaly. It was the system working exactly as designed. Finding correlations at scale is what these models do. Whether those correlations are the ones we actually need is a question the training process can't answer on its own.
The same dynamic shows up throughout systems that use data to make decisions. Goodhart's Law, the economist Charles Goodhart's observation that when a measure becomes a target it ceases to be a good measure, applies wherever a proxy is substituted for the thing we actually care about. Schools optimise for test scores. Hospitals optimise for readmission rates. Financial models optimise for whatever signals predicted defaults in historical data. In each case, the system finds the correlation in the available data and runs with it.

What the Field Still Can't Do
Andreas identifies three open problems at the frontier of computer vision research, and they're worth taking seriously.
The first is fine-grained understanding. Current models can identify broad categories reliably, but struggle with the kind of contextual, nuanced recognition that humans find easy. Distinguishing between similar breeds of dog, understanding what's happening in a scene rather than just what objects are present, recognising when context changes the meaning of what you're seeing. These remain genuinely hard problems.
The second is explainability. "We have fully explainable models," Andreas notes, "but they're far weaker than top neural networks. There's a trade-off." In safety-critical applications, including medical diagnosis, autonomous vehicles, and credit decisions, knowing why a model made a decision isn't optional. The EU AI Act is pushing for explainability in high-risk AI uses. The problem is that our most capable models are, at their core, opaque.
The third is machine unlearning: how to remove information from a trained model after the fact. If a model was trained on data that later turns out to be sensitive, or that should be deleted under privacy law, how do you actually remove it? Re-training from scratch is expensive. Simply flagging the data doesn't remove its influence. This is becoming an increasingly important problem as data rights questions move from policy to enforcement.
None of these are technical glitches or edge cases. They're fundamental tensions in how these systems work.
The Pattern Behind the Progress
The 2012 breakthrough didn't just improve image recognition. It demonstrated something more general: that learning representations from data at scale could work across problems that had resisted hand-crafted solutions for decades. The same recipe of data, compute, backpropagation, and the right architecture turned out to be more transferable than anyone had fully anticipated.
But the story of the skin cancer classifier is a useful counterweight to any simple narrative of progress. Capability and reliability don't advance together automatically. The same properties that let a model find patterns invisible to the human eye also let it find the wrong patterns, with total confidence. The challenge isn't just building more capable systems. It's building systems we can trust to be capable in the ways that actually matter.
The three open problems Andreas identifies (fine-grained understanding, explainability, and machine unlearning) aren't at the edges of the field. They're at its centre. The models we have are remarkable. Whether we understand them well enough to deploy them wisely is a question we're still working out.
What does it actually mean for a machine to see? The answer, it turns out, is still incomplete. And the gap between what these systems can do and what we can explain about how they do it is where the most important work in the field is now happening.
