
A study by the Universitat Politècnica de València, ValgrAI, and the University of Cambridge reveals an alarming trend: in certain respects, reliability has worsened in more recent AI models compared with earlier ones (GPT-4 versus GPT-3, for example).

The study is published today in the journal Nature.

Recent advances in artificial intelligence (AI) have made the use of large language models widespread in our society, in areas such as education, science, medicine, art, and finance, among many others. These models are increasingly present in our daily lives. However, they are not as reliable as users expect. This is the conclusion of a study led by a team from the VRAIN Institute of the Universitat Politècnica de València (UPV) and the Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI), together with the University of Cambridge, published today in the journal Nature.

The work reveals an “alarming” trend: in certain respects, reliability has worsened in the most recent models compared with the earliest ones (GPT-4 compared to GPT-3, for example).

According to José Hernández-Orallo, researcher at the Valencian Research Institute for Artificial Intelligence (VRAIN) of the UPV and at ValgrAI, one of the main concerns about the reliability of language models is that their performance does not align with the human perception of task difficulty. In other words, there is a mismatch between where humans expect models to fail, based on how difficult a task seems, and where the models actually fail. “Models can solve tasks that humans rate as highly complex, yet fail at simple tasks in the same domain. For example, they can solve several doctoral-level mathematical problems, but get a simple addition wrong,” points out Hernández-Orallo.

In 2022, Ilya Sutskever, the scientist behind some of the biggest advances in artificial intelligence in recent years (from the ImageNet breakthrough to AlphaGo) and co-founder of OpenAI, predicted that “perhaps over time that discrepancy will diminish.”

However, the study by the UPV, ValgrAI, and University of Cambridge team shows that this has not been the case. To demonstrate this, they investigated three key aspects that affect the reliability of language models from a human perspective.

There is no “safe zone” where models perform perfectly

The study confirms a discordance with human perceptions of difficulty. “Do models fail where people expect them to fail? Our work concludes that models are usually less accurate on tasks that humans consider difficult, but they are not 100% accurate even on simple tasks. This means there is no ‘safe zone’ where models can be trusted to perform perfectly,” says VRAIN UPV and ValgrAI researcher Yael Moros Daval.

In fact, the team from the VRAIN Institute of the UPV, ValgrAI, and the University of Cambridge notes that the most recent models mainly improve their performance on high-difficulty tasks, not on low-difficulty ones, “which aggravates the difficulty discordance between model performance and human expectations,” adds Fernando Martínez Plumed, also a researcher at VRAIN UPV and ValgrAI.
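To make the “safe zone” idea concrete, here is a minimal Python sketch, not the authors’ code and using made-up records, that bins model answers by a human difficulty rating and computes per-bin accuracy. A true safe zone would mean 100% accuracy in the easiest bins.

```python
from collections import defaultdict

# Each made-up record pairs a human difficulty rating (0-100) with whether
# the model's answer was judged correct.
results = [
    {"difficulty": 5, "correct": True},
    {"difficulty": 8, "correct": False},   # a failure on an "easy" item
    {"difficulty": 40, "correct": True},
    {"difficulty": 90, "correct": False},
    {"difficulty": 92, "correct": True},   # a success on a "hard" item
]

def accuracy_by_bin(records, bin_width=25):
    """Group records into difficulty bins and compute per-bin accuracy."""
    bins = defaultdict(list)
    for r in records:
        bins[r["difficulty"] // bin_width].append(r["correct"])
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

for bin_id, acc in accuracy_by_bin(results).items():
    low, high = bin_id * 25, bin_id * 25 + 24
    print(f"difficulty {low:3d}-{high:3d}: accuracy {acc:.0%}")

# A true "safe zone" would require 100% accuracy in the lowest bins;
# the study reports that even those bins contain errors.
```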

More prone to provide incorrect answers

The study also finds that recent language models are much more prone to giving incorrect answers than to declining to answer tasks they are unsure about. “This can mislead users who initially trust the models too much and are later disappointed. Moreover, unlike in people, the tendency to avoid answering does not grow with difficulty. Humans, for example, tend to avoid giving their opinion on problems that exceed their capacity. This leaves users with the responsibility of detecting failures throughout all their interactions with the models,” adds Lexin Zhou, member of the VRAIN UPV-ValgrAI team.
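As an illustration of how this tendency can be quantified, the following sketch is purely illustrative: the avoidance markers and the grading rule are assumptions, not the paper’s protocol. It classifies toy responses as correct, incorrect, or avoidant and reports the share of each.

```python
# Phrases treated as "the model declined to answer" (an assumption for this sketch).
AVOIDANCE_MARKERS = ("i don't know", "i'm not sure", "cannot answer")

def classify(response: str, expected: str) -> str:
    """Label a response as 'avoidant', 'correct', or 'incorrect'."""
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    return "correct" if expected.lower() in text else "incorrect"

# Toy data: (model response, expected answer)
samples = [
    ("The answer is 42.", "42"),
    ("I'm not sure about this one.", "17"),
    ("It is clearly 95.", "23"),        # confident but wrong
]

counts = {"correct": 0, "incorrect": 0, "avoidant": 0}
for response, expected in samples:
    counts[classify(response, expected)] += 1

total = len(samples)
for label, n in counts.items():
    print(f"{label:9s}: {n}/{total} ({n / total:.0%})")

# The study's finding: across model generations, the "incorrect" share grows,
# while the "avoidant" share does not rise with task difficulty.
```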

Sensitivity to how problems are phrased

Does the difficulty of a problem affect how effectively questions can be formulated? This is another of the issues analyzed by the UPV, ValgrAI, and Cambridge study, which concludes that the current trend of progress in language model development, including their ability to understand a wider variety of prompts, may not free users from having to worry about phrasing their requests effectively. “We have found that users can be swayed by prompts that work well on complex tasks but, at the same time, yield incorrect answers on simple tasks,” adds Cèsar Ferri, also co-author of the study and researcher at VRAIN UPV-ValgrAI.
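A rough way to picture this sensitivity is to ask the same trivial question under several phrasings and compare the outcomes. In the sketch below, ask_model is a hypothetical stand-in for whatever model call you would actually make; the templates and the faked behaviour are assumptions for illustration only.

```python
def ask_model(prompt: str) -> str:
    # Placeholder: replace with a real model call. Here we fake phrasing-dependent
    # behaviour so the script runs on its own.
    return "8" if "add" in prompt.lower() else "9"

# Three phrasings of the same question: what is 3 + 5?
templates = [
    "Add 3 and 5 and report the result.",
    "What do you get if you sum 3 with 5?",
    "3 + 5 = ?",
]
expected = "8"

for template in templates:
    answer = ask_model(template).strip()
    verdict = "correct" if answer == expected else "incorrect"
    print(f"{template!r:45s} -> {answer!r} ({verdict})")

# If accuracy swings with phrasing even on a trivial sum, users still have to
# worry about how they word the prompt, which is the study's point.
```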

Human supervision unable to compensate for these problems

In addition to these findings on the unreliability of language models, the researchers have discovered that human supervision is unable to compensate for these problems. For example, people can recognize high-difficulty tasks, yet they still frequently judge incorrect results as correct in this range, even when allowed to say “I’m not sure,” indicating overconfidence.
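One way to express this overconfidence numerically is the rate at which reviewers accept wrong outputs as correct. The sketch below uses invented review records purely to show that calculation; it is not the study’s evaluation code.

```python
# Each made-up record pairs the model's actual correctness with a human
# reviewer's verdict ('correct', 'incorrect', or 'unsure').
reviews = [
    {"model_correct": False, "human_verdict": "correct"},   # overconfident acceptance
    {"model_correct": False, "human_verdict": "unsure"},
    {"model_correct": False, "human_verdict": "incorrect"},
    {"model_correct": True,  "human_verdict": "correct"},
]

wrong_outputs = [r for r in reviews if not r["model_correct"]]
accepted_wrong = [r for r in wrong_outputs if r["human_verdict"] == "correct"]

rate = len(accepted_wrong) / len(wrong_outputs)
print(f"wrong outputs accepted as correct: {rate:.0%}")

# A non-trivial acceptance rate, even with an 'unsure' option available, is what
# the authors mean by human supervision failing to compensate for model errors.
```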

From ChatGPT to LLaMA and BLOOM

The results were similar for multiple families of language models, including OpenAI’s GPT family, Meta’s open-weight LLaMA, and BLOOM, a fully open initiative from the scientific community.

The researchers have also found that the problems of difficulty discordance, lack of proper abstention, and prompt sensitivity remain an issue for new versions of popular families, such as OpenAI’s new o1 models and Anthropic’s Claude-3.5-Sonnet.

“In short, large language models are becoming less reliable from a human point of view, and user supervision to correct errors is not the solution, as we tend to trust models too much and are unable to recognize incorrect results at different levels of difficulty. Therefore, a fundamental change in the design and development of general-purpose AI is necessary, especially for high-risk applications, where predicting the performance of language models and detecting their errors are paramount,” concludes Wout Schellaert, from VRAIN UPV-ValgrAI.