RAG metrics reference

All of the RAG metrics use LLM-assisted evaluation to measure your RAG system’s performance.

LLM-assisted evaluation means that an LLM, called the LLM evaluator, is prompted with information about an aspect of the RAG system response and is asked to respond with a score that grades that aspect.
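
For illustration, the sketch below shows this pattern for the answer similarity metric described later in this reference. The grade_answer_similarity and ask_llm names are hypothetical, and the prompt is not the one that Tonic Validate actually sends; the sketch only shows the general shape of an LLM-assisted evaluation.

from typing import Callable

# Illustrative sketch only: hypothetical helper names, not Tonic Validate's internal prompt.
def grade_answer_similarity(
    question: str,
    rag_answer: str,
    reference_answer: str,
    ask_llm: Callable[[str], str],  # any function that sends a prompt to an LLM and returns its reply
) -> int:
    prompt = (
        "On a scale from 0 to 5, how well does the RAG answer match the reference answer?\n"
        f"Question: {question}\n"
        f"RAG answer: {rag_answer}\n"
        f"Reference answer: {reference_answer}\n"
        "Respond with a single integer."
    )
    # The LLM evaluator's reply is parsed into a numeric score.
    return int(ask_llm(prompt).strip())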

Using the metrics

To use the metrics in your ValidateScorer, pass them in as a list:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AnswerSimilarityMetric

scorer = ValidateScorer([AnswerSimilarityMetric(), AnswerConsistencyMetric()])
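
From there, the scorer is typically run against a benchmark of questions and reference answers. The sketch below follows the pattern from the Tonic Validate README; the exact signature of the response callback and the format of its return value are assumptions that may differ between library versions.

from tonic_validate import Benchmark, ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AnswerSimilarityMetric

benchmark = Benchmark(
    questions=["What is the capital of France?"],
    answers=["Paris"],
)

# Callback that runs your RAG system for each benchmark question and returns
# the answer plus the retrieved context (return format assumed here).
def get_rag_response(question):
    return {
        "llm_answer": "Paris is the capital of France.",
        "llm_context_list": ["Paris is the capital and largest city of France."],
    }

scorer = ValidateScorer([AnswerSimilarityMetric(), AnswerConsistencyMetric()])
run = scorer.score(benchmark, get_rag_response)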

List of metrics

Answer similarity score

from tonic_validate.metrics import AnswerSimilarityMetric

The answer similarity score measures, on a scale from 0 to 5, how well the answer from the RAG system corresponds in meaning to a reference answer.

This score is an end-to-end test of the RAG LLM.

To calculate the score, we ask the LLM evaluator to grade, on a scale from 0 to 5, how well the RAG LLM response matches the reference response.

The answer similarity score is a float between 0 and 5. The value is usually an integer.

Retrieval and augmentation precision

When you ask a question of a RAG LLM, you can view whether the RAG system retrieves or uses the correct context as a binary classification problem on the set of context vectors. The labels indicate whether a context vector provides relevant context for the question.

This is a highly imbalanced binary classification problem with very few positive examples. The vector database contains many vectors, and only a few of those vectors are relevant.

In an imbalanced binary classification problem with few positive examples, the relevant metrics are:

  • precision = (number of correctly predicted positive examples) / (number of predicted positive examples)

  • recall = (number of correctly predicted positive examples) / (number of actual positive examples)

For vector retrieval, you usually do not know in advance all of the context that is relevant to a question, so it is difficult to calculate recall explicitly. However, we can calculate precision. This binary classification setup is how we define the retrieval precision and augmentation precision metrics.
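
As a concrete illustration of this framing (not library code), given a set of retrieved context vectors and the set of context that is actually relevant, precision and recall are computed as follows:

# Illustrative sketch of the binary classification framing; not part of tonic_validate.
retrieved = ["ctx_a", "ctx_b", "ctx_c"]   # context predicted to be relevant (that is, retrieved)
relevant = {"ctx_a", "ctx_d"}             # context that is actually relevant (rarely fully known)

true_positives = sum(1 for ctx in retrieved if ctx in relevant)

precision = true_positives / len(retrieved)  # 1 / 3
recall = true_positives / len(relevant)      # 1 / 2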

Retrieval precision

from tonic_validate.metrics import RetrievalPrecisionMetric

The context retrieval step of a RAG system predicts the context that is relevant to answering the question.

Retrieval precision is the percentage of retrieved context that is relevant to answering the question.

To measure retrieval precision, for each context vector, we ask the LLM evaluator whether the context is relevant for answering the question.

Retrieval precision for a given RAG system response is a float between 0 and 1 that represents the percentage of retrieved context that is relevant to answering the question.
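
A sketch of the calculation, where is_relevant is a hypothetical helper that wraps the LLM evaluator's yes or no judgment for a single piece of context:

# Illustrative sketch only; is_relevant is a hypothetical stand-in for the LLM evaluator.
def retrieval_precision(question, retrieved_context, is_relevant):
    # Assumes at least one piece of context was retrieved.
    judgments = [is_relevant(question, ctx) for ctx in retrieved_context]
    return sum(judgments) / len(judgments)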

Augmentation precision

from tonic_validate.metrics import AugmentationPrecisionMetric

Augmentation precision is the percentage of relevant retrieved context for which some portion of the context appears in the answer from the RAG system.

Given the context from the retrieval step, this score measures how well the LLM used in the RAG system recognizes and uses the relevant context.

To measure augmentation precision, for each relevant piece of context, we ask the LLM evaluator whether information from the relevant context appears in the answer.

Augmentation precision is a float between 0 and 1.
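
A sketch under the same assumptions, adding a hypothetical appears_in_answer judgment from the LLM evaluator:

# Illustrative sketch only; is_relevant and appears_in_answer are hypothetical
# stand-ins for the LLM evaluator.
def augmentation_precision(question, answer, retrieved_context, is_relevant, appears_in_answer):
    # Assumes at least one piece of relevant context was retrieved.
    relevant_context = [ctx for ctx in retrieved_context if is_relevant(question, ctx)]
    used = [ctx for ctx in relevant_context if appears_in_answer(ctx, answer)]
    return len(used) / len(relevant_context)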

Augmentation accuracy

from tonic_validate.metrics import AugmentationAccuracyMetric

Augmentation accuracy is the percentage of retrieved context for which some portion of the context appears in the answer from the RAG system. This metric is unrelated to the binary classification framework.

To calculate both augmentation precision and augmentation accuracy, we ask the LLM evaluator whether retrieved context is used in the RAG system answer.

There is a trade-off between maximizing augmentation accuracy and maximizing augmentation precision:

  • If retrieval precision is very high, then you want to maximize augmentation accuracy.

  • If retrieval precision is not very high, then you want to maximize augmentation precision.

Augmentation accuracy is a float between 0 and 1.
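
The corresponding sketch for augmentation accuracy differs only in the denominator: it divides by all retrieved context rather than by the relevant context alone.

# Illustrative sketch only; appears_in_answer is a hypothetical stand-in for the LLM evaluator.
def augmentation_accuracy(answer, retrieved_context, appears_in_answer):
    used = [ctx for ctx in retrieved_context if appears_in_answer(ctx, answer)]
    return len(used) / len(retrieved_context)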

Answer consistency

from tonic_validate.metrics import AnswerConsistencyMetric

Answer consistency is the percentage of the RAG system answer that can be attributed to retrieved context.

To calculate this metric, we:

  1. Ask the LLM evaluator to create a bulleted list of the main points in the RAG system answer.

  2. Ask the LLM evaluator whether it can attribute each bullet point to the retrieved context.

The final score is the percentage of the bullet points that can be attributed to the context.

Answer consistency is a float between 0 and 1.
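
A sketch of the two-step calculation, where extract_main_points and is_attributable are hypothetical stand-ins for the two LLM evaluator prompts:

# Illustrative sketch only; hypothetical helper names.
def answer_consistency(answer, retrieved_context, extract_main_points, is_attributable):
    # Step 1: the LLM evaluator lists the main points of the RAG system answer.
    main_points = extract_main_points(answer)
    # Step 2: the LLM evaluator checks each point against the retrieved context.
    attributable = [point for point in main_points if is_attributable(point, retrieved_context)]
    return len(attributable) / len(main_points)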

Answer consistency binary

from tonic_validate.metrics import AnswerConsistencyBinaryMetric

Answer consistency binary indicates whether all of the information in the answer is derived from the retrieved context.

  • If all of the information in the answer comes from the retrieved context, then the answer is consistent with the context, and the value of this metric is 1.

  • If the answer contains information that is not derived from the context, then the value of this metric is 0.

To calculate answer consistency binary, we ask the LLM evaluator whether the RAG system answer contains any information that is not derived from the context.

Answer consistency binary is a binary integer: either 0 or 1.
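
A sketch of the binary check, where contains_unsupported_info is a hypothetical stand-in for the LLM evaluator's judgment:

# Illustrative sketch only; hypothetical helper name.
def answer_consistency_binary(answer, retrieved_context, contains_unsupported_info):
    return 0 if contains_unsupported_info(answer, retrieved_context) else 1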

Overall score

The overall score is the mean of the normalized values of the metrics. For answer similarity, the normalized value is the original value divided by 5. For the other metrics, the normalized value is the same as the original value. Normalizing answer similarity ensures that all of the metrics have values between 0 and 1.
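
A sketch of this normalization and averaging over three hypothetical metric values:

# Illustrative sketch of the overall score; the metric values here are made up.
scores = {
    "answer_similarity": 4.0,     # 0 to 5 scale
    "retrieval_precision": 0.75,  # already between 0 and 1
    "answer_consistency": 1.0,    # already between 0 and 1
}

normalized = [
    value / 5 if name == "answer_similarity" else value
    for name, value in scores.items()
]

overall_score = sum(normalized) / len(normalized)  # (0.8 + 0.75 + 1.0) / 3 = 0.85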

In tval and Tonic Validate, the overall score is calculated from the metrics that are used. If only some of the metrics are calculated, then the overall score is the mean of those metrics. The overall score does not include metrics that were not calculated.

We opt for this definition because it is both:

  • Simple, which makes it easy to understand.

  • Flexible, because you can specify only the metrics that you are interested in calculating, and then the overall score depends only on those metrics.

In general, the best way to improve your RAG system is to focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.

If needed, however, the overall score is a simple representation of all the metrics in one number.
