RAG metrics reference

The RAG metrics use LLM-assisted evaluation to measure your RAG system’s performance.

LLM-assisted evaluation means that an LLM, called the LLM evaluator, is prompted with an aspect of the RAG system response and asked to respond with a score that grades that aspect.

Using the metrics

To use the metrics in your ValidateScorer, pass them in as a list:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AnswerSimilarityMetric

scorer = ValidateScorer([AnswerSimilarityMetric(), AnswerConsistencyMetric()])
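
After you construct the scorer, you score a benchmark of questions (and, for metrics such as answer similarity, reference answers) against responses from your RAG system. The following is a minimal sketch: the Benchmark arguments and the shape of the callback return value (a dictionary with llm_answer and llm_context_list keys) follow the Tonic Validate quickstart, and get_rag_response is a hypothetical stand-in for a call into your own RAG system.

from tonic_validate import Benchmark, ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AnswerSimilarityMetric

# A benchmark pairs questions with reference answers
benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])

# Hypothetical callback: replace this with a call into your own RAG system
def get_rag_response(question):
    return {
        "llm_answer": "Paris is the capital of France.",
        "llm_context_list": ["Paris is the capital and largest city of France."],
    }

scorer = ValidateScorer([AnswerSimilarityMetric(), AnswerConsistencyMetric()])
run = scorer.score(benchmark, get_rag_response)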

Metrics

Answer consistency

from tonic_validate.metrics import AnswerConsistencyMetric

Answer consistency is the percentage of the RAG system answer that can be attributed to retrieved context.

Answer consistency is a float between 0 and 1.

To calculate this metric, we:

  1. Ask the LLM evaluator to create a bulleted list of the main points in the RAG system answer.

  2. Ask the LLM evaluator whether it can attribute each bullet point to the retrieved context.

The final score is the percentage of the bullet points that can be attributed to the context.
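
The following sketch only illustrates how the final percentage is derived; it is not the library's internal code. Suppose the LLM evaluator extracts four main points from the answer and can attribute three of them to the retrieved context.

# Hypothetical attribution judgments from the LLM evaluator, one per main point
attributed_to_context = [True, True, True, False]

# Answer consistency is the fraction of main points attributable to the context
answer_consistency = sum(attributed_to_context) / len(attributed_to_context)  # 0.75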

Answer consistency binary

from tonic_validate.metrics import AnswerConsistencyBinaryMetric

Answer consistency binary indicates whether all of the information in the answer is derived from the retrieved context.

Answer consistency binary is a binary integer (1 for true and 0 for false).

  • If all of the information in the answer comes from the retrieved context, then the answer is consistent with the context, and the value of this metric is 1.

  • If the answer contains information that is not derived from the context, then the value of this metric is 0.

To calculate answer consistency binary, we ask an LLM whether the RAG system answer contains any information that is not derived from the context.

Answer contains PII

This metric requires that you have a Tonic Textual account and a Textual API key.

from tonic_validate.metrics import AnswerContainsPiiMetric
metric = AnswerContainsPiiMetric(["<pii_type>"])

Answer contains PII indicates whether the answer from the LLM contains Personally Identifiable Information (PII) of specific types.

For example, you might check whether the response includes credit card numbers and bank numbers.

Answer contains PII is a binary integer (1 for true and 0 for false).

This metric uses the PII detection models in Tonic Textual. To use this metric, you must have a Textual API key. You can either:

  • Provide the Textual API key as the value of the textual_api_key parameter

  • Set the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.

For example, to check whether the answer contains a city or zip code value:

from tonic_validate.metrics import AnswerContainsPiiMetric
metric = AnswerContainsPiiMetric(["LOCATION_CITY", "LOCATION_ZIP"])
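
If you prefer to pass the key directly rather than set the environment variable, the call might look like the following sketch, which assumes that textual_api_key is accepted as a keyword argument and uses a placeholder key value.

from tonic_validate.metrics import AnswerContainsPiiMetric

# Passes the Textual API key directly instead of relying on TONIC_TEXTUAL_API_KEY
metric = AnswerContainsPiiMetric(["LOCATION_CITY", "LOCATION_ZIP"], textual_api_key="<your Textual API key>")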

Answer match

from tonic_validate.metrics import AnswerMatchMetric
metric = AnswerMatchMetric("<metric display name>", "<text>", case_sensitive=<boolean>)

Answer match indicates whether the answer from the LLM matches a text string that you provide.

Answer match is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use on the Validate application

  • The text to look for in the LLM response

  • Whether the search for the text is case-sensitive. For example, if this is True, and you provide "John" as the text, then "john" and "JOHN" would not be a match.

For example, to do a case-sensitive search for the word "Paris" in the LLM response:

from tonic_validate.metrics import AnswerMatchMetric
metric = AnswerMatchMetric("Answer Match", "Paris", case_sensitive=True)

Answer similarity score

from tonic_validate.metrics import AnswerSimilarityMetric

The answer similarity score measures, on a scale from 0 to 5, how well the answer from the RAG system corresponds in meaning to a reference answer. You cannot calculate an Answer similarity score for a production monitoring project.

This score is an end-to-end test of the RAG LLM.

The answer similarity score is a float between 0 and 5. The value is usually an integer.

To calculate the score, we ask an LLM to grade on a scale from 0 to 5 how well the RAG LLM response matches the reference response.

Augmentation accuracy

from tonic_validate.metrics import AugmentationAccuracyMetric

Augmentation accuracy is the percentage of retrieved context for which some portion of the context appears in the answer from the RAG system. Unlike retrieval precision and augmentation precision, this metric is not part of the binary classification framework that is described under Retrieval and augmentation precision.

Augmentation accuracy is a float between 0 and 1.

To calculate both augmentation precision and augmentation accuracy, we ask the LLM evaluator whether retrieved context is used in the RAG system answer.

There is a trade-off between maximizing augmentation accuracy and augmentation precision.

  • If retrieval precision is very high, then you want to maximize augmentation accuracy.

  • If retrieval precision is not very high, then you want to maximize augmentation precision.
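
As a concrete illustration of the calculation (not the library's internal code), suppose that the RAG system retrieves four pieces of context and the LLM evaluator finds that some portion of three of them appears in the answer.

# Hypothetical judgments from the LLM evaluator, one per retrieved piece of context:
# True if some portion of that context appears in the RAG system answer
used_in_answer = [True, True, True, False]

augmentation_accuracy = sum(used_in_answer) / len(used_in_answer)  # 0.75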

Binary

from tonic_validate.metrics import BinaryMetric
metric = BinaryMetric("<metric display name>", <callback>)

Binary is a metric that returns either true or false from a callback function that you provide. You cannot calculate a Binary metric for a production monitoring project.

Binary is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use for the metric on the Validate application.

  • The callback function. This is a function that you define that returns either true or false. The callback uses either the OpenAIService or, for non-OpenAI calls, the LiteLLMService.

  • The LLMResponse that is passed to the callback, which contains:

    • The LLM answer (llm_answer)

    • The context used (llm_context_list)

    • For a development project run, the benchmark item (benchmark_item)

For example:

from tonic_validate.metrics import BinaryMetric
metric = BinaryMetric("Binary Metric", lambda open_ai, llm_response: True)
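
The lambda above always returns true. A slightly more realistic, hypothetical callback might inspect the LLMResponse fields that are listed above, for example to check that the answer is not empty.

from tonic_validate.metrics import BinaryMetric

# Hypothetical callback: the first argument is the LLM service (unused here),
# the second argument is the LLMResponse
def answer_is_not_empty(llm_service, llm_response):
    return len(llm_response.llm_answer.strip()) > 0

metric = BinaryMetric("Answer Is Not Empty", answer_is_not_empty)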

Contains text

from tonic_validate.metrics import ContainsTextMetric
metric = ContainsTextMetric("<metric display name>", "<text>", case_sensitive=<boolean>)

Contains text indicates whether the answer from the LLM contains a text string that you provide.

Contains text is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use for the metric on the Validate application

  • The text to look for in the LLM response

  • Whether the search for the text is case-sensitive. For example, if this is True, and you provide "John" as the text, then "john" and "JOHN" would not be a match.

For example, the following request does a case-sensitive search for the text "Paris":

from tonic_validate.metrics import ContainsTextMetric
metric = ContainsTextMetric("Contains Text", "Paris", case_sensitive=True)

Context contains PII

This metric requires that you have a Tonic Textual account and a Textual API key.

from tonic_validate.metrics import ContextContainsPiiMetric
metric = ContextContainsPiiMetric(["<pii_type>"])

Context contains PII indicates whether the context that the LLM used to answer the question contains Personally Identifiable Information (PII) of specific types.

Context contains PII is a binary integer (1 for true and 0 for false).

For example, you might check whether the context includes credit card numbers and bank numbers.

This metric uses the PII detection models in Tonic Textual. To use this metric, you must have a Textual API key. You can either:

  • Provide the Textual API key as the value of the textual_api_key parameter

  • Set the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.

For example, to check whether the context contains a city or zip code value:

from tonic_validate.metrics import ContextContainsPiiMetric
metric = ContextContainsPiiMetric(["LOCATION_CITY", "LOCATION_ZIP"])

Context length

from tonic_validate.metrics import ContextLengthMetric
metric = ContextLengthMetric("<metric display name>", <minimum length>, <maximum length>)

Context length indicates whether the length of the context that the LLM used to answer the question fell within a range that you provide.

Context length is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use for the metric on the Validate application

  • The minimum length. If you do not provide a minimum length, then the metric checks whether the context is shorter than the maximum length.

  • The maximum length. If you do not provide a maximum length, then the metric checks whether the context is longer than the minimum length.

For example, to check whether the context is between 5 and 10 characters long:

from tonic_validate.metrics import ContextLengthMetric
metric = ContextLengthMetric("Context Length", 5, 10)

Duplication

from tonic_validate.metrics import DuplicationMetric
metric = DuplicationMetric()

Duplication indicates whether the answer from the LLM contains duplicate information.

Duplication is a binary integer (1 for true and 0 for false).

Hate speech content

from tonic_validate.metrics import HateSpeechContentMetric
metric = HateSpeechContentMetric()

Hate speech content indicates whether the answer from the LLM contains hate speech.

Hate speech content is a binary integer (1 for true and 0 for false).

Latency

from tonic_validate.metrics import LatencyMetric
metric = LatencyMetric(target_time=<time in seconds>)

Latency indicates whether the time that it took for the LLM to return an answer is less than a target time, in seconds, that you provide.

Latency is a binary integer (1 for true and 0 for false).

For example, to check whether the LLM responded within 5 seconds:

from tonic_validate.metrics import LatencyMetric
metric = LatencyMetric(target_time=5)

Regex

from tonic_validate.metrics import RegexMetric
metric = RegexMetric("<metric display name>", "<regular expression>", match_count=<expected number of matches>)

Regex indicates whether the answer from the LLM contains the expected number of matches to a provided regular expression.

Regex is a binary integer (1 for true and 0 for false).

For example, to check whether the response contained 2 matches for the regular expression Fid*o:

from tonic_validate.metrics import RegexMetric
metric = RegexMetric("Regex Metric", "Fid*o", match_count=2)

Response length

from tonic_validate.metrics import ResponseLengthMetric
metric = ResponseLengthMetric("<metric display name>", <minimum length>, <maximum length>)

Response length indicates whether the length of the answer from the LLM falls within a provided range.

The request includes:

  • The display name to use for the metric on the Validate application

  • The minimum length. If you do not provide a minimum length, then the metric checks whether the answer is shorter than the maximum length.

  • The maximum length. If you do not provide a maximum length, then the metric checks whether the answer is longer than the minimum length.

Response length is a binary integer (1 for true and 0 for false).

For example, to check whether the answer is between 5 and 10 characters long:

from tonic_validate.metrics import ResponseLengthMetric
metric = ResponseLengthMetric("Response Length", 5, 10)

Retrieval and augmentation precision

When you ask a question of a RAG LLM, you can view whether the RAG system retrieves or uses the correct context as a binary classification problem on the set of context vectors. The labels indicate whether a context vector provides relevant context for the question.

This is a very imbalanced binary classification problem with very few positive examples: the vector database contains many vectors, and only a few of those vectors are relevant.

In an imbalanced binary classification problem that has few positive examples, the relevant metrics are:

  • precision = (number of correctly predicted positive examples)/(number of predicted positive examples)

  • recall = (number of correctly predicted positive examples)/(number of actual positive examples)

For vector retrieval, unless you already know the question and relevant context, you cannot know all of the relevant context for a question. For this reason, it is difficult to explicitly calculate recall. However, we can calculate precision. This binary classification setup is how we define the retrieval precision and augmentation precision metrics.

Retrieval precision

from tonic_validate.metrics import RetrievalPrecisionMetric

The context retrieval step of a RAG system predicts the context that is relevant to answering the question.

Retrieval precision is the percentage of retrieved context that is relevant to answer the question.

Retrieval precision is a float between 0 and 1.

To measure retrieval precision, for each context vector, we ask the LLM evaluator whether the context is relevant to use to answer the question.

Augmentation precision

from tonic_validate.metrics import AugmentationPrecisionMetric

Augmentation precision is the percentage of relevant retrieved context for which some portion of the context appears in the answer from the RAG system.

For the context from the retrieval step, this score measures how well the LLM used in the RAG system recognizes relevant context.

Augmentation precision is a float between 0 and 1.

To measure augmentation precision, for each relevant piece of context, we ask the LLM evaluator whether information from the relevant context appears in the answer.
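
To make the relationship between the two precision scores concrete, here is an illustrative calculation rather than the library's internal code. Suppose that five pieces of context are retrieved, the LLM evaluator judges three of them relevant to the question, and portions of two of those three appear in the answer.

retrieved = 5            # pieces of context returned by the retrieval step
relevant = 3             # of those, judged relevant to the question
relevant_in_answer = 2   # of the relevant pieces, used in the answer

retrieval_precision = relevant / retrieved              # 0.6
augmentation_precision = relevant_in_answer / relevant  # approximately 0.67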

Calculating the overall score

The overall score is the mean of the normalized values of the metrics. For answer similarity, the normalized value is the original value divided by 5. For the other metrics, the normalized value is the same as the original value. Normalizing answer similarity ensures that all of the metrics have values between 0 and 1.

In Validate, the overall score is calculated from the metrics that are used. If only some of the metrics are calculated, then the overall score is the mean of those metrics. The overall score does not include metrics that were not calculated.
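
For example, if a run calculates only answer similarity, answer consistency, and retrieval precision, the overall score is derived as in the following sketch, which uses made-up metric values.

# Example metric values from a single run
answer_similarity = 4.0      # 0 to 5 scale; normalized by dividing by 5
answer_consistency = 0.75    # already between 0 and 1
retrieval_precision = 0.6    # already between 0 and 1

normalized = [answer_similarity / 5, answer_consistency, retrieval_precision]
overall_score = sum(normalized) / len(normalized)  # (0.8 + 0.75 + 0.6) / 3, approximately 0.72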

We opt for this definition because it is both:

  • Simple, which makes it easy to understand.

  • Flexible, because you can specify only the metrics that you are interested in calculating, and then the overall score depends only on those metrics.

In general, the best way to improve your RAG system is to focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.

If needed, however, the overall score is a simple representation of all the metrics in one number.
