Jiajun Wu, a PhD candidate at MIT, likes to use a thought experiment to explain why machines still struggle with something children take for granted. A child who has never seen a pink elephant can describe one without hesitation, instantly combining the elephant from memory with the color from experience to create something new.
A computer trained on millions of photos of gray elephants cannot do this. It has no concept of color separate from the animal, and no concept of the animal apart from its color. "The ability to generalize and recognize something you've never seen before," Wu says, "is very hard for machines."
| Category | Details |
|---|---|
| Research Topic | Machine Reasoning & Hybrid AI (Symbolic + Statistical Learning) |
| Lead Researchers | Jiajun Wu (MIT PhD), Joshua Tenenbaum (MIT), Jiayuan Mao (Tsinghua University) |
| Affiliated Institutions | MIT, MIT-IBM Watson AI Lab, DeepMind |
| Conference Presentation | International Conference on Learning Representations (ICLR) |
| Key AI Model Type | Hybrid: Neural Networks + Symbolic Programming |
| Training Dataset | CLEVR Image Comprehension Test — developed at Stanford University |
| Core Capability | Concept transfer, object-relation reasoning, visual question answering |
| Data Efficiency | Achieved top results with a fraction of data used by competing models |
| Expert Commentary | Dan Roth, Glandt Distinguished Professor, University of Pennsylvania & Science Lead, AWS AI |
| Historical Context | Symbolic AI originated in 1950s–60s; deep learning dominated last decade |
| Key Research Challenge | Generalizing to unseen scenarios — the “pink elephant” problem |
| Publication Reference | Khemlani & Johnson-Laird, KI – Künstliche Intelligenz, 2019 |
One of the most contentious issues in artificial intelligence is the gap between what a child can comprehend and what a machine can compute. For years, the predominant method of narrowing it was deep learning: systems that sort through massive datasets, find statistical patterns, and become exceptionally skilled at particular tasks. Alexa forecasting the weather. Facebook tagging your friends. Google surfacing a response before you've finished typing.
These are all products of statistical learning, and they are genuinely impressive. But they are also, in a specific and increasingly obvious way, brittle. They collapse when you ask them a question that is just a little bit different from what they were trained on.

It's possible that the field has been approaching its natural ceiling, and researchers know it. That's why, in labs at MIT, DeepMind, and elsewhere, there's a growing movement back toward something older: symbolic AI. The approach, popular in the 1950s and 1960s, encodes rules and logical relationships directly into a system. It learns through structured steps rather than from raw patterns.
It can justify its choices. It needs far less data. And when you combine it with modern neural networks, something interesting happens: the hybrid model starts outperforming either approach on its own.
A team led by Wu and Joshua Tenenbaum, a professor in MIT's Department of Brain and Cognitive Sciences, published results showing exactly this. Their model, developed alongside researchers at the MIT-IBM Watson AI Lab and DeepMind, was trained on image-question-answer pairs drawn from CLEVR, a rigorous visual comprehension benchmark developed at Stanford University. The questions begin simple: "What color is that object?" They get harder: "How many objects are both to the right of the green cylinder and made of the same material as the small blue ball?" What the model demonstrated, on progressively harder questions and with only a fraction of the training data used by competing systems, was something close to conceptual transfer. It wasn't just pattern-matching. It was something that, at least loosely, looked like reasoning.
The architecture behind it is worth understanding, because it’s clever in a particular way. A perception module built from neural networks scans each image and maps the objects within it. A language module — also neural — parses the question and converts it into a symbolic program, a kind of logical instruction set. A third reasoning module then executes those instructions against the scene and produces an answer.
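To make the pipeline concrete, here is a deliberately simplified sketch of the final stage: a symbolic program executed against a parsed scene. Everything below, the scene representation, the operation names, the example question, is invented for illustration and is not the MIT team's code.

```python
# Illustrative sketch: a question like "What color is the object left of
# the cube?" becomes a small symbolic program, a chain of operations
# executed step by step against the objects the perception module found.
scene = [
    {"id": 0, "shape": "cube",     "color": "gray", "x": 3},
    {"id": 1, "shape": "cylinder", "color": "red",  "x": 1},
]

def filter_shape(objs, shape):
    # Keep only objects of the requested shape.
    return [o for o in objs if o["shape"] == shape]

def relate_left(objs, anchor):
    # Keep objects positioned to the left of the anchor object.
    return [o for o in objs if o["x"] < anchor["x"]]

def query_color(obj):
    return obj["color"]

# Program: filter(cube) -> relate(left_of) -> query(color)
cube = filter_shape(scene, "cube")[0]
left_of_cube = relate_left(scene, cube)
answer = query_color(left_of_cube[0])
print(answer)  # red
```

The point of the design is that each step is inspectable: unlike a monolithic network, the system can show which objects it filtered and why, which is what lets it justify its answers.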
What makes this work, and what distinguishes it from similar attempts, is something the team calls curriculum learning: feeding the model increasingly difficult examples in a deliberate sequence, rather than randomly. It turns out that orderly learning is important. The model learns faster and more accurately when the data is structured logically, which is, when you think about it, exactly how children are taught.
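Curriculum learning amounts to ordering the training stream rather than changing the model. A minimal sketch, assuming each example carries a difficulty proxy (here, a hypothetical symbolic-program length; the examples and scores are invented):

```python
# Sketch of curriculum ordering: simple questions (short symbolic
# programs) are presented before complex ones, instead of randomly.
examples = [
    {"question": "How many cubes are left of the red ball?", "program_len": 4},
    {"question": "Is the sphere metal?",                     "program_len": 3},
    {"question": "What color is the ball?",                  "program_len": 2},
]

# Sort by difficulty so training proceeds from easy to hard.
curriculum = sorted(examples, key=lambda e: e["program_len"])

for stage, example in enumerate(curriculum, start=1):
    # A real system would run a training step here; omitted in this sketch.
    print(stage, example["question"])
```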
“One way children learn concepts is by connecting words with images,” says Jiayuan Mao, an undergraduate at Tsinghua University who worked on the project as a visiting fellow at MIT. “A machine that can learn the same way needs much less data, and is better able to transfer its knowledge to new scenarios.”
That’s the part that tends to get underemphasized in coverage of AI breakthroughs: not just what the system gets right, but how much it needed to learn it. The MIT model outperformed competitors at Stanford and MIT Lincoln Laboratory using significantly less data. That’s not a marginal improvement. That’s a different kind of intelligence.
Dan Roth has been watching this shift for a long time. As science lead for natural-language processing at Amazon Web Services' AI organization and a professor at the University of Pennsylvania, he chairs sessions at major AI conferences and reads more papers in a month than most people read in a year. When he recently reviewed submissions to the Association for Computational Linguistics conference, one pattern stood out immediately. "I looked at some statistics of papers in ACL," he says, "and I saw that there are dozens of papers now that have 'reasoning' in the title. The title 'learning to reason' is now becoming sort of hot." Roth's phrasing is dry, but the observation is significant. The field is reorienting.
Roth defines machine reasoning as "the ability to make inferences, especially in sparse situations that are unlikely to have been observed before." The textbook example is deduction: Sappho is a woman, and all women are mortal, so a reasoning system ought to conclude that Sappho is mortal. But textbook problems are not the real challenge. Roth describes a question you might ask a virtual assistant: "Are we going to make it to dinner before the movie?"
To answer that, the system needs your current location, the restaurant’s location, estimated travel time, a typical duration for dinner — information you never provided — and parking, an event you didn’t mention at all but that any human would account for automatically. No single model can handle all of that. It requires modularity: separate systems that know how to compute specific things, coordinated by a higher-order model that knows how to combine them.
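A toy sketch of the modularity Roth describes: small specialized estimators combined by a coordinator. Every function, number, and name below is a hypothetical stand-in invented for illustration, not a real assistant's API.

```python
# Each module computes one thing; a coordinator composes them to answer
# "Are we going to make it to dinner before the movie?"
def travel_minutes(origin, destination):
    return 20  # stand-in for a routing service

def typical_dinner_minutes():
    return 60  # stand-in for a learned prior about dinner length

def parking_minutes():
    return 10  # the step a human adds without being asked

def makes_movie(now_min, showtime_min, origin, restaurant):
    # Coordinator: combine the modules' estimates into one inference.
    total = (travel_minutes(origin, restaurant)
             + parking_minutes()
             + typical_dinner_minutes())
    return now_min + total <= showtime_min

print(makes_movie(18 * 60, 20 * 60, "home", "bistro"))  # True: 90 min fits
```

The design choice matters: no single module knows the answer, and the coordinator needs no knowledge of routing or parking, only of how to combine the pieces.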
That’s where symbolic reasoning re-enters the picture. Roth is direct about it: “The growth and excitement around neural networks has left symbols behind. Some people think that symbols are an evil invention of the old AI people. But symbols were invented because they’re useful, necessary abstractions.” He points to time as an example. If A happens before B and B happens before C, then A happens before C.
This is transitive reasoning, and it will never appear explicitly in training data. It has to be built in. In some cases that can be encoded into a neural network’s architecture. In others, it can only be applied as a declarative constraint after the model’s primary choice has been made.
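The "declarative constraint" idea can be illustrated with a transitive closure applied after a model's direct predictions. A minimal sketch (the predicted pairs are invented; a real system would attach scores and resolve conflicts):

```python
# Even if a model only predicts A<B and B<C directly, applying
# transitivity as a post-hoc constraint supplies A<C.
def transitive_closure(pairs):
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

predicted = {("A", "B"), ("B", "C")}
print(("A", "C") in transitive_closure(predicted))  # True
```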
As this develops, there's a sense that the field has arrived not at artificial general intelligence, which remains a distant and contentious concept, but at a more accurate description of what machines truly lack. They can pattern-match at inhuman speed. They struggle to imagine a pink elephant. The researchers working on hybrid models aren't claiming to have closed that gap. But they are, carefully and with accumulating evidence, beginning to describe its shape.
