OpenAI’s GPT-4.5 has achieved a milestone once considered decades away: convincing a majority of participants in a Turing Test-style evaluation that it was human.
In a recent study by the University of California, San Diego, which assessed whether large language models can pass the classical three-party Turing test, GPT-4.5 was judged to be the human in 73% of text-based conversations.
The study showed the latest large language model outperforming earlier iterations, such as GPT-4o, and other systems, including ELIZA and LLaMa-3.1-405B.
GPT-4.5, launched by OpenAI in February, was able to detect subtle language cues, making it appear more human, according to Cameron Jones, a postdoctoral researcher at UC San Diego.
“If you ask them what it’s like to be human, the models tend to answer well and can convincingly pretend to have emotional and sexual experiences,” Jones told Decrypt. “But they struggle with things like real-time information or current events.”
The Turing Test, proposed by British mathematician Alan Turing in 1950, evaluates whether a machine can mimic human conversation convincingly enough to fool a human judge. If the judge can’t reliably distinguish the machine from the human, the machine is considered to have passed.
To evaluate the AI models’ performance, researchers tested two prompt types: a baseline prompt with minimal instruction and a more detailed prompt that directed the model to adopt the voice of an introverted, internet-savvy young person who uses slang.
“We selected these witnesses on the basis of an exploratory study where we evaluated five different prompts and seven different LLMs and found that LLaMa-3.1-405B, GPT-4.5, and this persona prompt performed best,” researchers in the study said.
The study also addressed the broader social and economic implications of large language models passing the Turing Test, including potential misuse.
“Some risks include misinformation, like astroturfing, where bots pretend to be people to inflate interest in a cause,” Jones said. “Others involve fraud or social engineering—if a model emails someone over time and seems real, it might persuade them to share sensitive information or access bank accounts.”
On Monday, OpenAI announced the launch of the next iteration of its flagship GPT model, GPT-4.1. The new model is more capable and can process much longer inputs, such as extensive documents, codebases, or even novels. OpenAI said it would sunset GPT-4.5 and replace it with GPT-4.1 this summer.
While Turing never witnessed today’s AI landscape, Jones noted that the test he proposed in 1950 remains relevant.
“The Turing Test is still relevant in the way Turing intended,” he said. “In his paper, he talks about learning machines and suggests the way to build something that passes the Turing Test is by creating a computational child that learns from lots of data. That’s essentially how modern machine learning models work.”
When asked about criticism of the study, Jones acknowledged its value while clarifying what the Turing Test does and doesn’t measure.
“The main thing I’d say is the Turing Test isn’t a perfect test of intelligence—or even of human-likeness,” he said. “But it is valuable for what it measures: whether a machine can convince a person it’s human. That’s worth measuring and has real implications.”
Edited by Sebastian Sinclair