AI Just Passed a Test That Was Never Meant for Machines

Watching a chatbot pass a test that was created more than 70 years ago by a man who wasn’t quite sure what he was designing has a subtly strange quality. It was dubbed the imitation game by Alan Turing. Eventually, everyone else referred to it as the Turing test.

A language model that resides in a data center and generates text by predicting the next most likely word is said to have passed a version of it in a lab at the University of California, San Diego, rather than a humanoid robot with synthetic skin or some science fiction supercomputer.

Category	Details
Full Name	Alan Mathison Turing
Born	23 June 1912, Maida Vale, London, England
Died	7 June 1954, Wilmslow, Cheshire, England
Nationality	British
Profession	Mathematician, Computer Scientist, Cryptanalyst, Philosopher
Known For	Turing Test, Turing Machine, Codebreaking at Bletchley Park
Education	King’s College, Cambridge; Princeton University
Key Publication	Computing Machinery and Intelligence (1950)
Affiliated Institution	University of Manchester
Test Concept Origin	Originally called the “Imitation Game” — proposed to answer: Can machines think?
Recent Study	GPT-4.5 by OpenAI judged human 73% of the time in 2024 UC San Diego preprint
Study Authors	Cameron Jones & Benjamin Bergen, University of California San Diego
Models Tested	ELIZA, GPT-4o, LLaMa-3.1-405B, GPT-4.5
Peer Review Status	Not yet peer-reviewed (preprint as of March 2024)

Participants in the study judged OpenAI’s GPT-4.5 to be human 73% of the time. That number is compelling enough to be repeated. When sitting in front of a split-screen chat interface and exchanging messages with an AI and a real person for five minutes, nearly three out of four people made the incorrect decision. “That one’s the human,” they said, pointing to the machine.

The study was carried out by cognitive scientists Cameron Jones and Benjamin Bergen, and it was published as a preprint in March. This indicates that it has not yet undergone peer review, which is something to consider before getting too excited. However, it is hard to completely discount the results. The performance of the other tested models varied. 56% of participants were persuaded by LLaMa-3.1-405B. The notoriously archaic chatbot from the 1960s, ELIZA, only achieved 23%.

GPT-4o scored 21%, which is oddly lower than its more recent sibling. This begs the question of what specifically makes GPT-4.5 feel more human in brief conversations. Is the wording warmer? An improved sense of timing? A readiness to be a little vague? No one has yet provided a complete response to that.

Since the popular version of the test’s history has strayed significantly from the original, it is worthwhile to return to the test’s original source. Turing did not suggest a simple human-machine test. He suggested something more bizarre: a three-way game in which an interrogator used only written answers to try and identify the sex of two hidden participants, one man and one woman.

Then he questioned what would happen if the woman was replaced by a machine. Is there anything worse the interrogator could do? Fooling people wasn’t really the question. It was about substituting one type of ambiguity for another. Even Turing appeared unsure if he had put forth a real intelligence test or merely an intriguing thought experiment.

That uncertainty has never been completely cleared up. Decades later, philosopher John Searle put forth the Chinese Room argument, which claimed that a system could pass any language-based test without truly comprehending anything—processing symbols without meaning, imitating intelligence without having it. There is still no clear solution to this problem.

It’s difficult to ignore the significance of Searle’s argument as you watch this most recent study develop. Although GPT-4.5 may have persuaded people that it was human, thinking and convincing are two different behaviors carried out by different types of entities.

The authors of the study appear to be aware of this. According to them, the Turing test assesses “substitutability”—that is, the ability of a system to act in place of an individual without anyone noticing. That framing is important. The assertion that GPT-4.5 has attained human-level intelligence is untrue. It is said to be able to generate responses that are identical to what a human might type in a five-minute text exchange. Many of the breathless headlines are the result of confusing those two very different things.

There are also valid concerns regarding the conditions of the study. Five minutes is not much time. The participants were aware that they were taking a test, which can alter behavior in unpredictable ways. Some may have been searching for glaring mistakes, while others may have been primed to see humanity everywhere. Additionally, each AI model was assigned a particular persona to adopt; however, the specifics of those personas were not made public, so it is unclear how much that influenced the outcomes.

A model that is told to be a little hesitant, a little self-deprecating, and possibly prone to small typos might perform significantly differently than one that is left to its defaults.

It’s possible that GPT-4.5 has mastered something more akin to social texture than intelligence. People are not always rational when conversing. They use hedging. They float. They either make minor mistakes and then fix them or they don’t.

When trained on billions of human-generated words, a model that has sufficiently absorbed that texture can replicate it in a way that seems genuine even when nothing genuine is going on beneath the surface. This seems more in line with what happened here than with any kind of real thought.

The moment is not completely diminished by any of this. Something has changed. The difference between machine-generated and human-sounding text has become so small that it is impossible for regular people to distinguish between the two when they are paying attention in a controlled environment.

That has real-world implications that go far beyond philosophy, including misinformation, trust, and how we read and comprehend online content. Even though the Turing test has always been a somewhat flawed tool, now that a machine has finally pointed back, the thing it was loosely pointing at turns out to matter more than ever.

AI Just Passed a Test That Was Never Meant for Machines

Anthropic’s $10 Billion Mistake: What the Claude Source Code Leak Means for AI Safety

Apple’s iOS 18 DarkSword Patch Is a Rare Emergency Fix — Here’s What It Means for Your iPhone

Japan’s New AI Robot Just Did Something Completely on Its Own

This Breakthrough Could Change How Humans Think About Intelligence

NASA Scientists Detect Strange Energy Pattern in Deep Space

Anthropic’s $10 Billion Mistake: What the Claude Source Code Leak Means for AI Safety

Apple’s iOS 18 DarkSword Patch Is a Rare Emergency Fix — Here’s What It Means for Your iPhone

Japan’s New AI Robot Just Did Something Completely on Its Own

The Discovery That Has Physicists Questioning Everything

Scientists Say Earth May Be Changing Faster Than Expected

AI Just Passed a Test That Was Never Meant for Machines

AI Just Passed a Test That Was Never Meant for Machines

Related Posts