There is an almost uncomfortable moment in the study. In front of a split screen, participants converse with what they thought were two people at the same time—one human, one machine—and are then asked to determine which was which. The majority of them made a mistake. Not every now and then. People pointed at GPT-4.5 and said, “That one’s human,” 73% of the time when it was on the other side of the conversation.
Cameron Jones and Benjamin Bergen, two cognitive scientists at UC San Diego, conducted the study, which was released as a preprint in March. It’s important to keep in mind that it hasn’t yet undergone peer review. However, the numbers are hard to ignore even with that asterisk firmly in place.
| Category | Details |
|---|---|
| Test Name | Turing Test (Imitation Game) |
| Originally Proposed By | Alan Turing — British mathematician and computer scientist |
| Year Introduced | 1950, in the paper “Computing Machinery and Intelligence” |
| Study Authors | Cameron Jones & Benjamin Bergen, UC San Diego |
| Study Status | Preprint — not yet peer-reviewed |
| AI Model Tested | GPT-4.5 (OpenAI), GPT-4o, LLaMa-3.1-405B (Meta), ELIZA |
| GPT-4.5 Human Rating | 73% of participants judged it as human |
| LLaMa-3.1-405B Human Rating | 56% of participants judged it as human |
| GPT-4o Human Rating | 21% — lower than ELIZA’s 23% |
| Participants | ~284 people assigned as interrogators or witnesses |
| Test Format | Split-screen, simultaneous five-minute conversations |
| Key Variable | Persona prompting significantly improved AI performance |
| Without Persona Prompt | GPT-4.5 dropped to 36% human identification rate |
| Reference | OpenAI Research |
A modernized version of the Turing test, which has been used in philosophical and scientific circles since 1950, was applied to four AI models. OpenAI’s GPT-4.5, one of them, did not simply pass. In the study, it performed better than the real humans, being recognized as human even more frequently.
Most people are unaware of the longer and more bizarre history of the Turing test itself. It was not initially intended by Alan Turing as a direct confrontation between humans and machines. It wasn’t until his 1950 publication that the imitation game—a three-person setup in which an interrogator attempts to ascertain, solely through text, whether they are speaking with a man or a woman—to take shape.

His 1948 paper introduced a chess-based thought experiment. According to Turing’s framing, the machine was supposed to replace one of them. Over the course of several decades, the public version of this test deviated greatly from Turing’s initial goals and may not have had his full support.
In 2023, Google software engineer François Chollet stated unequivocally that the test was never intended to be taken literally. It was an experiment in thought. A method disguised as a philosophical challenge. When headlines start announcing that machines have crossed some sort of finish line, it’s easy to forget that context.
Nevertheless, despite the philosophical disagreements, something occurred in that study that merits consideration. Everything was altered by persona prompting. GPT-4.5’s success rate fell to 36% when given simple instructions, such as “convince the interrogator you are human,” compared to 73% when instructed to assume the role of a young, culturally competent internet user.
It’s possible that people aren’t actually detecting intelligence at all. Perhaps it’s personality. familiarity. the sound of a particular voice type that they have come to identify with actual people.
ELIZA, a chatbot that was created about eighty years ago and is a technological relic, managed to achieve a twenty-three percent success rate. GPT-4o, which presently powers the mainstream version of ChatGPT and is regarded as one of the more capable models available to the public, is one percentage point behind that. Though it’s still unclear exactly what it reveals, there’s a dark humor in that comparison. Because ELIZA is so crude, it completely deviates from the tone that users have trained themselves to expect from sophisticated AI.
It’s difficult to ignore the fact that the Turing test has always generated more controversy than the test itself merits. Four persistent objections are raised by critics: passing the test demonstrates behavioral mimicry rather than thought; the brain may not function like a machine at all; AI reasoning processes are not comparable to human ones; and testing a single behavior cannot establish general intelligence.
Jones himself appeared wary of making too many claims. The question of whether LLMs are as intelligent as humans is “very complicated,” he tweeted, and the findings should be “one among many other pieces of evidence.”
However, Jones did pursue a more sensible and possibly more pressing course of action. The automation of conversational jobs, more convincing social engineering, and the erosion of the presumption that you know who you’re talking to are all possibilities that most people haven’t fully considered if AI can convincingly replace humans in brief interactions. These are no longer science fiction issues. These are inquiries about design. questions about policy. the type that requires more time to complete than a split-screen test lasting five minutes.
Additionally, Jones concludes with an odd loop: the Turing test measures more than just machines. We are measured by it. Our presumptions, our expectations, and our perception of what constitutes a “real” conversation. People may become more adept at identifying or accepting AI systems as they spend more time interacting with them. It’s really hard to tell where that will go.
For the time being, a chatbot persuaded almost 75% of its interrogators that it was one of them. It’s probably worth more than a preprint and a news cycle to determine whether this indicates that machines are capable of thinking or just that thinking isn’t quite what we thought it was.
