Galileo LLM Hallucination Index

LLM Hallucination Index: A Ranking & Evaluation Framework For LLM Hallucinations

Many enterprise teams have already successfully deployed LLMs in production, and many others have committed to deploying Generative AI products in 2024. However, for enterprise AI teams, the biggest hurdle to deploying production-ready Generative AI products remains the fear of model hallucinations – a catch-all term for when a model generates text that is incorrect or fabricated. Hallucinations can have several causes, such as the model’s limited capacity to memorize all of the information it was trained on, errors in the training data, and outdated training data.

Why another benchmark?

There are a few LLM benchmarks today. While these benchmarks do much to advance the adoption of LLMs, they have a few critical blind spots.

  • Not focused on LLM output quality:

    Existing benchmarks provide a generic evaluation of LLM attributes and performance, not a focused evaluation of the quality of the LLM’s output (its likelihood of hallucinating). As a result, these benchmarks do not leverage metrics that measure the actual quality of LLM outputs – one of the top concerns for enterprise GenAI teams today.

  • Not focused on task type:

    A practical benchmark for enterprise GenAI teams needs to account for variability across task types. For instance, a model that works well for chat might not be great at text summarization.

  • Not focused on the power of context:

    Retrieval-augmented generation (RAG) is a popular technique for providing LLMs with useful context. Today’s LLM benchmarks ignore how models perform with context – granted, there is nuance here with regard to the quality of the context, but measuring the variability in LLM performance across RAG versus non-RAG tasks is critical.

The Hallucination Index offers a structured approach to assessing and measuring hallucinations, helping teams build more trustworthy GenAI applications.

About the index


There has yet to be an LLM benchmark report that provides a comprehensive measurement of LLM hallucinations. After all, measuring hallucinations is difficult, as LLM performance varies by task type, dataset, context and more. Further, there isn’t a consistent set of metrics for measuring hallucinations.


The Hallucination Index ranks popular LLMs based on their propensity to hallucinate across three common task types - question & answer without RAG, question & answer with RAG, and long-form text generation.


The Index ranks the performance of 11 leading LLMs across these three task types, evaluated on seven popular datasets. To measure hallucinations, the Hallucination Index employs two metrics, Correctness and Context Adherence, both built with ChainPoll, a state-of-the-art evaluation method.
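At its core, ChainPoll polls an LLM judge multiple times with chain-of-thought prompts, collects a yes/no verdict from each poll, and averages the verdicts into a score. The sketch below shows only that aggregation step under stated assumptions: the `judge` callable stands in for the chain-of-thought LLM call, and the function name and signature are hypothetical, not Galileo's implementation.

```python
from typing import Callable


def chainpoll_score(
    response: str,
    judge: Callable[[str], bool],  # stand-in for a chain-of-thought LLM judgment
    n_polls: int = 5,
) -> float:
    """Average n_polls independent yes/no judgments into a 0-1 score.

    In ChainPoll each vote comes from prompting an LLM to reason step by
    step and then answer whether the response is free of hallucination;
    here `judge` abstracts that call so the aggregation logic is visible.
    """
    votes = [judge(response) for _ in range(n_polls)]
    return sum(votes) / n_polls


# Example with a deterministic stub judge (a real judge would call an LLM):
score = chainpoll_score("Paris is the capital of France.", judge=lambda r: "Paris" in r)
```

Because each poll is an independent sample of the judge's reasoning, the averaged score gives a graded confidence rather than a single brittle verdict.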

To learn more about the Index, click here.