Scaled-up LLMs are more prone to sensible yet wrong answers

They lack accuracy

Researchers at the Universitat Politècnica de València in Spain have found that the current method for making large language models (LLMs) more powerful is to increase their size, data volume, and computational resources. Yet this scaling up and shaping up makes the models more likely to give a wrong answer, and therefore less reliable.

As LLMs grow larger and more sophisticated, they still fail to identify and fix relatively simple errors, the kind that human supervision can easily spot in a furnished answer. While larger and more instructable LLMs may provide sensible answers, they lack accuracy. Moreover, the study found that LLMs rarely admit to a user that they do not know an answer.

With the mainstreaming of artificial intelligence and computing, users rely on LLMs for day-to-day tasks, including writing papers, poems, and texts, as well as for computation. That reliance makes accurate responses all the more important.

To better understand their evolution, reliability, and accuracy, the study analyzed three popular LLM families: OpenAI's GPT, Meta's LLaMA, and the BLOOM suite developed by BigScience. The analysis revealed exponential scaling in the number of parameters. For instance, the GPT-3 ada model of 2020 used 350 million parameters, while GPT-3.5-turbo is reported at 175 billion. Likewise, the LLaMA family grew from the 6.7-billion-parameter LLaMA-7b of 2023 to variants with 70 billion parameters.
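
For a rough sense of scale, the growth implied by those figures can be computed directly. The following minimal Python sketch uses only the parameter counts quoted above; exact sizes of newer proprietary models are not public, so the labels and numbers are illustrative:

    # Parameter counts as quoted in the article; vendors do not always
    # disclose exact model sizes, so treat these as illustrative.
    params = {
        "GPT-3 ada (2020)": 350e6,   # 350 million
        "GPT-3.5-turbo": 175e9,      # 175 billion (reported)
        "LLaMA-7b (2023)": 6.7e9,    # 6.7 billion
        "LLaMA 70b variant": 70e9,   # 70 billion
    }

    print(f"GPT line grew ~{params['GPT-3.5-turbo'] / params['GPT-3 ada (2020)']:.0f}x")
    print(f"LLaMA line grew ~{params['LLaMA 70b variant'] / params['LLaMA-7b (2023)']:.0f}x")

That works out to roughly a 500-fold jump within the GPT line alone.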

Moreover, the analysis shows the prevalence of avoidance, which includes stalling, deflecting, making excuses, or simply not answering.

Accuracy measurement

To test the accuracy of GPT, LLaMA, and BLOOM, the researchers asked thousands of questions and documented how current and earlier versions responded to the same question. The questions spanned themes such as math, science, anagrams, and geography, along with the ability to generate meaningful texts. Each question was assigned a difficulty level based on human assessments.
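
To make that protocol concrete, here is a hypothetical Python sketch; the Question fields, the ask_model stub, and the model names are illustrative assumptions, not the authors' actual benchmark code:

    from dataclasses import dataclass

    @dataclass
    class Question:
        prompt: str
        theme: str         # e.g. "math", "science", "anagrams", "geography"
        difficulty: float  # human-assessed; higher means harder

    def ask_model(model_name: str, prompt: str) -> str:
        # Stand-in for a real LLM API call (hypothetical helper).
        return f"[{model_name}'s answer to: {prompt}]"

    def evaluate(models: list[str], questions: list[Question]) -> list[dict]:
        # Document every model version's response to every question,
        # keeping the difficulty label so accuracy can later be
        # compared across versions at each difficulty level.
        return [
            {"model": m, "theme": q.theme, "difficulty": q.difficulty,
             "prompt": q.prompt, "answer": ask_model(m, q.prompt)}
            for m in models
            for q in questions
        ]

    sample = [Question("What is 7 * 8?", theme="math", difficulty=0.1)]
    print(evaluate(["gpt-3.5-turbo", "llama-7b"], sample))

Asking every model generation the same question bank is what allows accuracy to be compared version against version at each level of difficulty.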

The researchers found that the newer versions of the LLMs attained their highest accuracy on the easier questions. However, as the questions grew harder and more sophisticated LLMs came into play, accuracy decreased. The scientists also noticed hedging, refusal, and evasiveness, collectively called avoidance.

Earlier versions of the LLMs either offered accurate answers or told users that they could not find one. Newer versions, by contrast, attempt more answers, both correct and incorrect.
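
One way to picture those outcomes is as a three-way bucketing of responses into correct, incorrect, and avoidant. This small sketch is an illustrative assumption; the marker phrases and matching rule are invented here, not the study's actual criteria:

    # Phrases that signal avoidance; an illustrative list, not the
    # study's definition.
    AVOIDANCE_MARKERS = (
        "i don't know", "i cannot answer", "i'm not sure", "as an ai",
    )

    def classify(answer: str, gold: str) -> str:
        # Bucket a response: avoidant, correct, or incorrect.
        text = answer.lower()
        if any(marker in text for marker in AVOIDANCE_MARKERS):
            return "avoidant"
        return "correct" if gold.lower() in text else "incorrect"

    print(classify("The answer is 56.", "56"))     # correct
    print(classify("I don't know, sorry.", "56"))  # avoidant
    print(classify("It is 54.", "56"))             # incorrect

In those terms, the article's observation is that newer versions shift responses out of the avoidant bucket into both the correct and the incorrect ones.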

Figure: Performance of a selection of GPT and LLaMA models with increasing difficulty. Credit: Nature

The study urges a rethink in the development of general-purpose artificial intelligence, particularly toward a predictable distribution of errors.

Journal Reference:
Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., & Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature. https://doi.org/10.1038/s41586-024-07930-y
