Galileo Labs Finds LLMs Are Not One Size Fits All as It Evaluates LLMs’ Likelihood to Hallucinate
SAN FRANCISCO, Nov. 15, 2023 (GLOBE NEWSWIRE) — Galileo, a leading machine learning (ML) company for unstructured data, today released a Hallucination Index developed by its research arm, Galileo Labs, to help users of today’s leading LLMs determine which model is least likely to hallucinate for their intended application. The findings can be viewed here: https://www.rungalileo.io/hallucinationindex
“2023 has been the year of LLMs. While everyone from individual developers to Fortune 50 enterprises has been learning how to wrangle this novel technology, two things are clear: first, LLMs are not one size fits all, and second, hallucinations remain one of the greatest hurdles to LLM adoption,” said Atindriyo Sanyal, Galileo’s co-founder and CTO. “To help builders identify which LLMs to use for their applications, Galileo Labs created a ranking of the most popular LLMs based on their propensity to hallucinate, using our proprietary hallucination evaluation metrics, Correctness and Context Adherence. We hope this effort sheds light on LLMs and helps teams pick the perfect LLM for their use case.”
While businesses of all sizes are building LLM-based applications, these efforts are being hindered by hallucinations that pose significant challenges in generating accurate and reliable responses. With hallucinations, AI generates information that appears realistic at first glance yet is ultimately incorrect or disconnected from the context.
To help teams get a handle on hallucinations and identify the LLM that best suits their needs, Galileo Labs developed a Hallucination Index that takes 11 LLMs from OpenAI (GPT-4-0613, GPT-3.5-turbo-1106, GPT-3.5-turbo-0613 and GPT-3.5-turbo-instruct), Meta (Llama-2-70b, Llama-2-13b and Llama-2-7b), TII UAE (Falcon-40b-instruct), Mosaic ML (MPT-7b-instruct), Mistral.ai (Mistral-7b-instruct) and Hugging Face (Zephyr-7b-beta), and evaluates each LLM’s likelihood to hallucinate in common generative AI task types.
Key insights include:
Question & Answer without RAG:
- Among open source models, Meta’s Llama-2-70b led (Correctness Score = 0.65), while Meta’s Llama-2-7b-chat and Mosaic ML’s MPT-7b-instruct showed a higher propensity to hallucinate on similar tasks, with Correctness Scores of 0.52 and 0.40 respectively.
- The Index recommends GPT-4-0613 for reliable and accurate AI performance in this task type.

Question & Answer with RAG:
- Surprisingly, Hugging Face’s Zephyr-7b (Context Adherence Score = 0.71), an open source model, surpassed Meta’s much larger Llama-2-70b (Context Adherence Score = 0.68), challenging the notion that bigger models are inherently superior.
- However, TII UAE’s Falcon-40b (Context Adherence Score = 0.60) and Mosaic ML’s MPT-7b (Context Adherence Score = 0.58) lagged on this task.
- The Index recommends GPT-3.5-turbo-0613 for this task type.

Long-form Text Generation:
- Remarkably, Meta’s open source Llama-2-70b-chat rivaled GPT-4’s capabilities (Correctness Score = 0.82), presenting an efficient alternative for this task. Conversely, TII UAE’s Falcon-40b (Correctness Score = 0.65) and Mosaic ML’s MPT-7b (Correctness Score = 0.53) trailed behind in effectiveness.
- The Index recommends Llama-2-70b-chat for an optimal balance of cost and performance in Long-form Text Generation.
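The per-task scores quoted above can be assembled into a simple ranking. The sketch below is a toy illustration using only the figures reported in this release, not Galileo's actual index implementation; the task labels are inferred from the metrics cited and the release's own mention of Long-form Text Generation.

```python
# Scores quoted in this release, grouped by task type (labels inferred).
scores = {
    "Q&A without RAG (Correctness)": {
        "Llama-2-70b": 0.65,
        "Llama-2-7b-chat": 0.52,
        "MPT-7b-instruct": 0.40,
    },
    "Q&A with RAG (Context Adherence)": {
        "Zephyr-7b-beta": 0.71,
        "Llama-2-70b": 0.68,
        "Falcon-40b-instruct": 0.60,
        "MPT-7b-instruct": 0.58,
    },
    "Long-form Text Generation (Correctness)": {
        "Llama-2-70b-chat": 0.82,
        "Falcon-40b-instruct": 0.65,
        "MPT-7b-instruct": 0.53,
    },
}

def rank(task: str) -> list:
    """Models for a task, least hallucination-prone first."""
    return sorted(scores[task], key=scores[task].get, reverse=True)
```

For example, `rank("Q&A with RAG (Context Adherence)")` puts Zephyr-7b-beta first, matching the finding that it surpassed the much larger Llama-2-70b on that task.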
Supporting these analyses are Galileo’s proprietary evaluation metrics, Correctness and Context Adherence. These metrics are powered by ChainPoll, a hallucination detection methodology developed by Galileo Labs. During the creation of the Index, Galileo’s evaluation metrics were shown to detect hallucinations with 87% accuracy, giving teams a reliable way to automatically detect hallucination risk and saving the time and cost typically spent on manual evaluation.
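At a high level, ChainPoll-style detection polls a judge LLM several times with a chain-of-thought prompt and averages the votes. The sketch below illustrates that idea only; the function names and judge interface are hypothetical assumptions, not Galileo's actual API, and the judge is injected as a callable so no particular LLM provider is assumed.

```python
from typing import Callable

def chainpoll_score(question: str, answer: str,
                    judge: Callable[[str], str],
                    n_polls: int = 5) -> float:
    """Poll a judge LLM n_polls times with a chain-of-thought prompt
    and return the fraction of polls that deem the answer correct.
    A low score suggests a likely hallucination."""
    prompt = (
        "Think step by step, then finish with 'yes' or 'no'.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Is the answer correct and grounded?"
    )
    votes = sum(
        1 for _ in range(n_polls)
        if judge(prompt).strip().lower().endswith("yes")
    )
    return votes / n_polls
```

In practice `judge` would wrap a call to an LLM API with nonzero temperature, so repeated polls can disagree; averaging the yes/no verdicts yields a graded score rather than a single brittle judgment.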
By helping teams catch errors of stale knowledge, wrong knowledge, logical fallacies and mathematical errors, Galileo hopes to help organizations find the perfect LLM for their use case, move from sandbox to production and more quickly deploy reliable and trustworthy AI.
About Galileo
Galileo’s mission is to unlock the value of unstructured data for ML. With more than 80% of the world’s data being unstructured, and recent model advancements massively lowering the barrier to utilizing that data for enterprise ML, there is an urgent need for the right data-focused tools to build high-performing models fast. Galileo is based in San Francisco and backed by Battery Ventures, Walden Catalyst and The Factory. For more information, visit https://www.rungalileo.io or follow @rungalileo.
Legal Disclaimer: The findings and rankings presented in Galileo’s Hallucination Index are based on Galileo’s proprietary evaluation metrics, namely “Correctness” and “Context Adherence.” These metrics have been developed by Galileo to assess the performance of various Large Language Models (LLMs). It’s important to note that these rankings could differ when evaluated against other metrics or methodologies.
This study is not endorsed by, directly affiliated with, maintained, authorized or sponsored by any of the LLM providers mentioned in this index, including but not limited to OpenAI, Meta, Mosaic ML, Hugging Face or their subsidiaries or affiliates. All product and company names are the registered trademarks of their original owners. The use of any trade name or trademark is for identification and reference purposes only and does not imply any association with the trademark holder or their product brand.
Media and Analyst Contact:
Amber Rowland
amber@therowlandagency.com
+1-650-814-4560