Watching a chatbot pass a test that was created more than 70 years ago by a man who wasn’t quite sure what he was designing has a subtly strange quality. It was dubbed the imitation game by Alan Turing. Eventually, everyone else referred to it as the Turing test.
A language model that resides in a data center and generates text by predicting the next most likely word is said to have passed a version of it in a lab at the University of California, San Diego, rather than a humanoid robot with synthetic skin or some science fiction supercomputer.
| Category | Details |
|---|---|
| Full Name | Alan Mathison Turing |
| Born | 23 June 1912, Maida Vale, London, England |
| Died | 7 June 1954, Wilmslow, Cheshire, England |
| Nationality | British |
| Profession | Mathematician, Computer Scientist, Cryptanalyst, Philosopher |
| Known For | Turing Test, Turing Machine, Codebreaking at Bletchley Park |
| Education | King’s College, Cambridge; Princeton University |
| Key Publication | Computing Machinery and Intelligence (1950) |
| Affiliated Institution | University of Manchester |
| Test Concept Origin | Originally called the “Imitation Game” — proposed to answer: Can machines think? |
| Recent Study | GPT-4.5 by OpenAI judged human 73% of the time in 2024 UC San Diego preprint |
| Study Authors | Cameron Jones & Benjamin Bergen, University of California San Diego |
| Models Tested | ELIZA, GPT-4o, LLaMa-3.1-405B, GPT-4.5 |
| Peer Review Status | Not yet peer-reviewed (preprint as of March 2024) |
Participants in the study judged OpenAI’s GPT-4.5 to be human 73% of the time. That number is compelling enough to be repeated. When sitting in front of a split-screen chat interface and exchanging messages with an AI and a real person for five minutes, nearly three out of four people made the incorrect decision. “That one’s the human,” they said, pointing to the machine.
The study was carried out by cognitive scientists Cameron Jones and Benjamin Bergen, and it was published as a preprint in March. This indicates that it has not yet undergone peer review, which is something to consider before getting too excited. However, it is hard to completely discount the results. The performance of the other tested models varied. 56% of participants were persuaded by LLaMa-3.1-405B. The notoriously archaic chatbot from the 1960s, ELIZA, only achieved 23%.

GPT-4o scored 21%, which is oddly lower than its more recent sibling. This begs the question of what specifically makes GPT-4.5 feel more human in brief conversations. Is the wording warmer? An improved sense of timing? A readiness to be a little vague? No one has yet provided a complete response to that.
Since the popular version of the test’s history has strayed significantly from the original, it is worthwhile to return to the test’s original source. Turing did not suggest a simple human-machine test. He suggested something more bizarre: a three-way game in which an interrogator used only written answers to try and identify the sex of two hidden participants, one man and one woman.
Then he questioned what would happen if the woman was replaced by a machine. Is there anything worse the interrogator could do? Fooling people wasn’t really the question. It was about substituting one type of ambiguity for another. Even Turing appeared unsure if he had put forth a real intelligence test or merely an intriguing thought experiment.
That uncertainty has never been completely cleared up. Decades later, philosopher John Searle put forth the Chinese Room argument, which claimed that a system could pass any language-based test without truly comprehending anything—processing symbols without meaning, imitating intelligence without having it. There is still no clear solution to this problem.
It’s difficult to ignore the significance of Searle’s argument as you watch this most recent study develop. Although GPT-4.5 may have persuaded people that it was human, thinking and convincing are two different behaviors carried out by different types of entities.
The authors of the study appear to be aware of this. According to them, the Turing test assesses “substitutability”—that is, the ability of a system to act in place of an individual without anyone noticing. That framing is important. The assertion that GPT-4.5 has attained human-level intelligence is untrue. It is said to be able to generate responses that are identical to what a human might type in a five-minute text exchange. Many of the breathless headlines are the result of confusing those two very different things.
There are also valid concerns regarding the conditions of the study. Five minutes is not much time. The participants were aware that they were taking a test, which can alter behavior in unpredictable ways. Some may have been searching for glaring mistakes, while others may have been primed to see humanity everywhere. Additionally, each AI model was assigned a particular persona to adopt; however, the specifics of those personas were not made public, so it is unclear how much that influenced the outcomes.
A model that is told to be a little hesitant, a little self-deprecating, and possibly prone to small typos might perform significantly differently than one that is left to its defaults.
It’s possible that GPT-4.5 has mastered something more akin to social texture than intelligence. People are not always rational when conversing. They use hedging. They float. They either make minor mistakes and then fix them or they don’t.
When trained on billions of human-generated words, a model that has sufficiently absorbed that texture can replicate it in a way that seems genuine even when nothing genuine is going on beneath the surface. This seems more in line with what happened here than with any kind of real thought.
The moment is not completely diminished by any of this. Something has changed. The difference between machine-generated and human-sounding text has become so small that it is impossible for regular people to distinguish between the two when they are paying attention in a controlled environment.
That has real-world implications that go far beyond philosophy, including misinformation, trust, and how we read and comprehend online content. Even though the Turing test has always been a somewhat flawed tool, now that a machine has finally pointed back, the thing it was loosely pointing at turns out to matter more than ever.
