Tonic Validate

What is Tonic Validate?

The Tonic Validate application and SDK (tonic_validate) allow you to measure how well your RAG LLM system performs.

What is a RAG system?

Retrieval augmented generation (RAG) allows you to augment a large language model (LLM) with additional data that is not in the LLM's original training set. The additional data usually takes the form of text from documents such as HTML, Markdown, Word, or Notion.

The LLM can then use that data in its responses to user queries. The responses can also include references for the additional content.

Using Validate to evaluate development and production RAG systems

But how do you know how well your RAG system works? How good are its responses? How relevant is the additional context? And how does the quality of the responses change when you change the available context data?

That is where Validate comes in.

As you develop your RAG system, you can use a Validate development project to run tests to determine how the system performs against a benchmark set of questions. You can then see whether the quality of the answers improves for each run.

After you release your RAG system, you can configure it to send the questions, answers, and context to a Validate production monitoring project that tracks the quality of the responses over time in your production systems.

Validate includes:

  • Metrics to measure the performance of each component in your RAG system

  • Visualizations to compare performance across time as the system changes

Validate provides insight into your RAG LLM system performance, so that you can deploy it with confidence.

Tonic Validate guide

The Tonic Validate application and Validate SDK (tonic_validate) allow you to measure how well your RAG LLM system performs.

Validate calculates rigorous LLM-assisted RAG evaluation metrics. You can also use tonic_validate to compute RAG metrics outside the context of a Validate project.

In Validate:

  • Development projects track the performance of a RAG system that is under development.

  • Production monitoring projects track how well a production RAG system answers questions from actual users.

For development projects, Validate also provides an integration with Ragas, tonic_ragas_logger, that allows you to visualize Ragas evaluation results in Validate.

Get started

Configure benchmarks and projects

Create and review development project runs

Connect and monitor a production RAG system

Code examples

Need help with Validate? Contact [email protected].

Start your Validate account

Sign up for a Validate account and create your first project.

Set up the Validate SDK

Install the SDK. Provide Validate and OpenAI API keys.

Quickstart example

Use tonic_validate to log RAG metrics to a development project.

Types of RAG metrics

RAG metrics measure the quality of RAG LLM responses.

Create and manage benchmarks

A benchmark is a set of questions, optionally with expected responses to send to a development project.

Create and manage projects

A development project consists of a set of runs.

A production monitoring project tracks performance over time.

Start a new run

Start a new Validate run to calculate metrics for RAG LLM answers to questions.

View run results

Review average metric scores and how the values are distributed across the questions.

Connect your RAG system to Validate

Configure your RAG system to send user questions and system answers to Validate.

Track RAG system performance

View average metric scores over time for the RAG system.

End-to-end example with LlamaIndex

Demonstrates an end-to-end Validate development project flow.

Configuring your RAG system to send questions to Validate

To log questions and answers to Tonic Validate, add a call to the Validate log function wherever your RAG system answers a question from a user.

The call to log includes:

  • The identifier of the Validate production monitoring project to send the question to

  • The text of the question from the user to the RAG system

  • The answer that the RAG system provided to the user

  • The context that the RAG system used to answer the question

from tonic_validate import ValidateMonitorer
monitorer = ValidateMonitorer()
monitorer.log(
    project_id="<project identifier>",
    question="<question text>",
    answer="<answer>",
    context_list=["<context used to answer the question"]
)

As your RAG system sends questions to the Validate production monitoring project, Validate by default generates the following metric scores for each question:

  • Answer consistency

  • Retrieval precision

  • Augmentation precision

You can also request additional metrics. For information about the available metrics, go to RAG metrics reference.

Viewing the metric scores and logged questions

Viewing a score summary over time

At the top of the project details page are the average metric scores. The scores are based on the most recently received questions.

When you click a score, the timeline chart is updated to display the average scores for that metric across time. The averages are grouped by hour.

Timeline for a selected metric on a production monitoring project

Setting a range for the timeline

By default, the time range for the timeline reflects all time since Validate began to receive questions from the RAG system.

You can use the date pickers above the timeline to select a different time range.

Viewing the questions for a time range or a point

Below the timeline is the list of questions that Validate received from the RAG system, along with the metric scores for each question.

Question list for a production monitoring project

By default, the list of questions includes all questions that were received during the time frame for the timeline.

To set a time range for which to display questions, use the date pickers above the question list.

To filter the questions to only those that Validate received during a specific point on the timeline, click that point.

Viewing and managing runs

You use the Tonic Validate application to view the results of each run.

If you use the Ragas integration, tonic_ragas_logger, then each upload of Ragas results is displayed as a run in Validate.

Displaying the details for a run

From the run list on the project details page, to display the run results, click the run.

The run details page shows the questions and scores for the run. It also provides the option to delete the run.

The run details replace the project overview with the metrics, chart, and questions. To return to the overview, click Show Project Overview.

Overview tab

The Overview tab summarizes the scores for the questions.

Average overall score and metrics scores

The tiles at the top of the Overview tab show the average overall score and metric scores across the entire run.

Score and metrics summary graphs

Below the composite scores are bar graphs for the overall score and the metrics scores.

For each range of score values on the x-axis, the graph displays the number of questions whose scores fall within that range.

Scores tab

The Scores tab provides a spreadsheet list of the run questions and their overall and metrics scores.

You can sort the list by any of the columns. To sort by a selected column, click the column heading. To reverse the sort, click the heading again.

Metadata tab

The Metadata tab provides any metadata that was provided when the run was started.

Questions & Answers tab

The Questions & Answers tab provides a detailed list of questions that were included in the run.

For each question, the list includes:

  • The text of the question.

  • The reference answer - this is the answer that you expected.

  • The answer that your LLM provided.

  • The context that the LLM used to answer the question.

  • The overall and metrics scores for the question.

Here is a full question entry from the Questions & Answers tab:

Deleting a run

To delete a run:

  1. In the run details heading, click Delete.

  2. On the confirmation panel, click Delete.


Validate components and tools

Validate components

Projects

A development project is designed to be used during RAG system development. It is a collection of runs that allow you to see how the run performance for a given set of questions changes over time.

A production monitoring project allows you to monitor the performance over time of a production RAG system. You configure the RAG system to automatically send the questions that your users ask, the answers that the RAG system provides, and the associated context to the production monitoring project.

For more information, go to Managing projects in Validate.

Metrics

Metrics are used to score the RAG system responses to questions.

For a development project, Validate calculates metric scores for the benchmark questions that are provided for the project.

For a production monitoring project, Validate calculates metric scores for the questions that users ask the RAG system. The RAG system sends the questions to Validate.

Validate calculates different metrics that represent different aspects of a RAG system. For more information about metrics, go to the metrics section.

Runs

For a Validate development project, a run represents an assessment of the RAG responses to a set of questions based on the RAG system configuration at a given point in time.

For each response, the run includes:

  • The question and, optionally, the corresponding ideal answer. A benchmark is one option for providing the questions.

  • The LLM's response and the context that the RAG system retrieved

  • Metadata in the form of key-value pairs that you specify. For example, "Model": "GPT-4"

  • Scores for the responses that use your chosen metrics

The run also includes overall scores for the given metrics.

For more information, go to Viewing and managing runs.

Benchmarks

For a Validate development project, a benchmark is a collection of questions with or without responses. The responses represent the ideal answers to the given questions.

A benchmark is one way to provide the questions for Validate to use to evaluate your RAG system.

For more information, go to Managing benchmarks in Validate.

Validate tools

Validate SDK (tonic-validate)

You must use the Validate SDK to:

  • Create Validate runs for a development project

  • Provide questions and RAG system responses to Validate for a development project run

  • Send questions and RAG system responses from your RAG system to a production monitoring project

You can also use the SDK to:

  • Manage projects

  • Manage benchmarks for a development project

  • Calculate RAG metrics outside the context of a Validate project
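
For the last capability, here is a minimal sketch of calculating metrics locally without uploading them to a project. The overall_scores attribute on the run object is an assumption based on the current SDK README; verify it against your version of tonic_validate.

from tonic_validate import Benchmark, ValidateScorer

# Minimal benchmark with one question and its reference answer
benchmark = Benchmark(
    questions=["What is the capital of France?"],
    answers=["Paris"]
)

# Stand-in for your real RAG system call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

# Inspect the scores locally; no ValidateApi call or project is needed
print(run.overall_scores)  # assumption: the run object exposes overall_scores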

Validate application

You can use the Validate application to manage benchmarks and projects.

You must use the Validate application to view:

  • The results of Validate runs for a development project

  • The metric scores over time for a production monitoring project

About using metrics to evaluate a RAG system

Evaluating a RAG system is difficult because:

  • There are many moving pieces. For a list of components, go to RAG components summary.

  • You need to evaluate natural language responses.

  • Building RAG systems is relatively new, and there are no well-established methods to evaluate the performance of a RAG system.

About our RAG metrics

Our RAG metrics use the input and output of your RAG system to test all of the components of the system.

Each metric calculates a score that measures the performance of a different aspect of a typical RAG system.

The metrics use LLM-assisted evaluation to evaluate the natural language aspect of a RAG system question, answer, and retrieved context.

The metrics include both end-to-end RAG system metrics, and metrics that are based on a binary classification framework.

How to use the metrics

Validate development project

For a Validate development project, to use the RAG metrics to evaluate and improve your RAG system, you first build a benchmark dataset of questions and reference answers (or questions only) to test the RAG system.

Next, use your RAG system to get the answers and the retrieved context that was used to generate the answers to the questions in your benchmark.

After you have this information, you can calculate each RAG metric and see how different aspects of your RAG system perform.

To improve your RAG system, focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.

Validate production monitoring project

For a production RAG system, you can configure the system to send the questions, answers, and context to a Validate production monitoring project.

You can then track the RAG system performance over time.

Quickstart example - create a run and log responses

For an existing Validate development project, the following example creates a run and logs the responses to the UI.

When the RAG system response and retrieved context are logged to the Tonic Validate application, RAG metrics are calculated using calls to OpenAI.

from tonic_validate import ValidateScorer, ValidateApi, Benchmark

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])
# Score the responses for each question and answer pair
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

# Upload the run to the UI
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)

Creating and revoking Validate API keys

The Tonic Validate SDK (tonic_validate) allows you to:

  • Manage projects and benchmarks

  • Start runs against Validate development projects

  • Send questions from a RAG system to a Validate production monitoring project

To use tonic_validate, you must have a tonic_validate API key.

When you sign up for your Validate account, it automatically creates an API key for you. You can also create and revoke API keys from the Validate application.

Viewing the list of API keys

On the Validate Home page, the API Keys panel displays the list of API keys.

API Keys panel that displays on the Validate home page

Creating a new API key

To create a new key:

  1. In the API Keys panel on the Validate home page, click Create an API Key.

  2. In the Name field, provide a name to use to identify the API key.

  3. Click Create API Key.

Panel to create a new API key

Validate displays a message that the key was created. To copy the API key to use later, click Copy to Clipboard.

Confirmation panel with option to copy the new API key

Displaying an API key

From the API Keys panel, to display an API key, click the key icon for the key.

Displaying an API key from the API Keys panel

You can then copy the key to use later.

Revoking an API key

To revoke an API key, in the API Keys panel, click the delete icon for the key to revoke.

Setting up the Validate SDK

Installing the Validate SDK

To install tonic_validate, you use pip:

pip install tonic-validate

Setting the Validate API key environment variable

To log metrics using tonic_validate, you must set the API key. One way is to pass it into ValidateApi:

from tonic_validate import ValidateApi
validate_api = ValidateApi("your-api-key")
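
If you prefer not to hard-code the key, you can set it as an environment variable instead. The variable name TONIC_VALIDATE_API_KEY below is an assumption based on the SDK README; verify it against your version of tonic_validate.

import os
from tonic_validate import ValidateApi

# Assumption: the SDK reads TONIC_VALIDATE_API_KEY when no key is passed
os.environ["TONIC_VALIDATE_API_KEY"] = "your-api-key"
validate_api = ValidateApi()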

Obtaining and providing an API key for the LLM evaluator

Validate uses LLM-assisted evaluation. You must provide an API key for the LLM evaluator model that you want to use.

Validate currently supports the following models:

  • OpenAI

  • Azure OpenAI service

  • Gemini

  • Anthropic

Using the OpenAI API

To use the OpenAI models, you must:

  • Have an OpenAI API key

  • Set the API key as the value of the environment variable OPENAI_API_KEY

To get an OpenAI API key, go to the OpenAI API key page.

In your Python script or Jupyter notebook, set your OpenAI API key as the value of OPENAI_API_KEY:

import os
os.environ["OPENAI_API_KEY"] = "put-your-openai-api-key-here"

Using Azure's OpenAI service

Validate also supports Azure's OpenAI API service.

To use Azure, you must set up an Azure OpenAI deployment.

For information on how to set up a deployment, go to this Azure OpenAI service quickstart.

After you set up your deployment, copy your API key and API endpoint to the following environment variables:

import os
os.environ["AZURE_OPENAI_KEY"] = "put-your-azure-openai-api-key-here"
os.environ["AZURE_OPENAI_ENDPOINT"] = "put-your-azure-endpoint-here"

When you start a Validate run, you must provide the deployment name.

Using the Gemini API

To use Gemini models, you must:

  • Have a Gemini API key

  • Set the API key as the value of the environment variable GEMINI_API_KEY

To get a Gemini API key, go to the Gemini home page.

In your Python script or Jupyter notebook, set your Gemini API key as the value of GEMINI_API_KEY:

import os
os.environ["GEMINI_API_KEY"] = "put-your-gemini-api-key-here"

Using the Anthropic API

To use Anthropic models, you must:

  • Have an Anthropic API key

  • Set the API key as the value of the environment variable ANTHROPIC_API_KEY

To get an Anthropic API key, go to the Anthropic API page.

In your Python script or Jupyter notebook, set your Anthropic API key as the value of ANTHROPIC_API_KEY:

import os
os.environ["ANTHROPIC_API_KEY"] = "put-your-anthropic-api-key-here"

Managing benchmarks in Validate

A benchmark is a set of questions that can optionally include the expected answers. For a Tonic Validate development project, a benchmark is one way to provide the questions for a run.

A run assesses how your RAG system answers the benchmark questions. If your benchmark includes answers, then Validate compares the answers from the benchmark with the answers from your RAG system.

To create and update benchmarks, you can use either the Validate application or the Validate SDK.

Managing benchmarks

Displaying the list of benchmarks

To display your list of benchmarks, in the Validate navigation menu, click Benchmarks.

For each benchmark, the Benchmarks page displays:

  • The name of the benchmark

  • The number of questions in the benchmark

Creating a benchmark

You create a benchmark from the Benchmarks page.

To create a benchmark from the Benchmarks page:

  1. Click Create A New Benchmark.

  2. In the Name field, enter a name for the benchmark.

  3. Add questions to the benchmark.

  4. Click Save.

Updating a benchmark

You can update the name and questions for an existing benchmark.

To update a benchmark:

  1. On the Benchmarks page, either:

    • Click the benchmark name.

    • Click the options menu for the benchmark, then click Edit.

  2. On the Edit Benchmark panel, to change the benchmark name, in the Name field, enter the new name.

  3. You can also add, update, or delete questions.

  4. To save the changes, click Save.

Deleting a benchmark

To delete a benchmark, on the Benchmarks page:

  1. Click the options menu for the benchmark.

  2. In the options menu, click Delete.

Configuring benchmark questions

Adding a question to a benchmark

A benchmark consists of a set of questions. For each question, you can optionally provide the expected response.

To add a question to a benchmark:

  1. Click Add Q&A.

  2. In the Question field, type the text of the question.

  3. Optionally, in the Answer field, type the text of the expected answer. If you do not provide an answer, then Validate cannot calculate an answer similarity score for the question.

  4. Click Finish Editing.

Updating a benchmark question

To update an existing question:

  1. Click the edit icon for the question.

  2. Update the Question and Answer fields.

  3. Click Finish Editing.

Deleting questions from a benchmark

To delete a question from a benchmark, click the delete icon for the question.

To delete all of the questions, click Clear All.

Using benchmarks from the UI

You can use a benchmark that you created in the UI from the Validate SDK by calling get_benchmark:

from tonic_validate import ValidateApi
validate_api = ValidateApi("your-api-key")
benchmark = validate_api.get_benchmark("benchmark_id")

Using the Validate SDK to manage benchmarks

You can use the Validate SDK to create a benchmark from a list of questions and answers:

from tonic_validate import Benchmark
benchmark = Benchmark(
    questions=["What is the capital of France?"],
    answers=["Paris"]
)

To upload this benchmark to the UI, use the new_benchmark method in ValidateApi:

from tonic_validate import ValidateApi
validate_api = ValidateApi("your-api-key")
validate_api.new_benchmark(benchmark, "benchmark_name")

Starting your Validate account

Signing up for an account

To sign up for a Validate account:

  1. Go to https://validate.tonic.ai/signup.

  2. In the Email Address field, provide your email address.

  3. In the Password field, create a password for your Validate account.

  4. Click Sign Up. Validate creates your account, and prompts you to log in.

  5. Validate sends an activation email to your email address.

When you confirm your account, Validate displays the Home page and prompts you to create your first project.

Creating your first project

The next step is to create a project.

On the Create a new project panel, in the Name field, provide the name for the new project, then click Next.

On the next panel, select the type of project.

  • To create a production monitoring project, click Monitoring.

  • To create a development project, click Development.

After you select the type, click Save. Tonic creates the project and displays the details page.

The confirmation panel also displays a code snippet that you can use to generate a metric (for a development project) or send a question from a RAG system (for a production monitoring project). To copy the snippet to the clipboard, click the copy icon.


Starting a Validate run

To start and upload a run, use ValidateScorer to score the run.

Next, to upload the run with ValidateApi, use an API token to connect to Tonic Validate, then specify the project identifier.

from tonic_validate import ValidateScorer, ValidateApi, Benchmark

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])
# Score the responses for each question and answer pair
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

# Upload the run to the UI
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)

Configuring the run

When you create a run, you specify the LLM evaluator and the metrics to calculate on the responses that are logged during the run.

For the LLM evaluator, we currently support the following models:

  • OpenAI, including Azure's OpenAI service

  • Gemini

  • Anthropic

To change your model, pass the model string to ValidateScorer with the model_evaluator argument.

For example:

scorer = ValidateScorer(model_evaluator="gpt-4")
scorer = ValidateScorer(model_evaluator="gemini/gemini-1.5-pro-latest")
scorer = ValidateScorer(model_evaluator="claude-3")

If you use Azure, then instead of the model name, you pass in your deployment name:

scorer = ValidateScorer(model_evaluator="my-azure-deployment-name")

For information on how to set up an Azure deployment, go to Using Azure's OpenAI service.

Configuring the run metrics

By default, the following RAG metrics are calculated:

  • Answer similarity - Note that if you do not provide expected answers to your benchmark questions, then Tonic Validate cannot determine answer similarity.

  • Augmentation precision

  • Answer consistency

When you create a run, if you only pass in the LLM evaluator, then the run calculates all of these metrics.

To specify the metrics to calculate during the run, pass a list of the metrics:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AugmentationAccuracyMetric

scorer = ValidateScorer([
    AnswerConsistencyMetric(),
    AugmentationAccuracyMetric()
])

Providing the RAG system responses and retrieved context to Validate

To provide the RAG system response, create a callback function that returns a dictionary that contains:

  • llm_answer - a string that contains the RAG system's response

  • llm_context_list - a list of strings that represent the retrieved context from the RAG system

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }
     
# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

When you log the RAG system answer and retrieved context, the metrics for the answer and retrieved context are calculated on your machine. They are then sent to the Validate application.

This may take some time, because each metric requires at least one call to the LLM's API.

Using parallelism to speed up runs

If you want to speed up your runs, you can pass parallelism arguments to the score function to create additional threads that score your metrics faster:

scorer.score(
    benchmark,
    get_rag_response,
    scoring_parallelism=2,
    callback_parallelism=2
)
  • scoring_parallelism controls the number of threads that are used to score the responses. For example, if scoring_parallelism is set to 2, then for scoring, 2 threads can call the LLM's API simultaneously.

  • callback_parallelism controls the number of threads that are used to call the callback that you provided. For example, if callback_parallelism is 2, then two threads can call your callback function simultaneously.

Managing projects in Validate

Tonic Validate supports the following types of projects:

  • Development projects

  • Production monitoring projects

About development projects

A Validate development project contains the results of Validate runs.

For a set of questions, which can be from a Validate benchmark, the run assesses the quality of the answers from your RAG system. If you also provide expected answers, then the run compares the answers from your RAG system against the provided answers. It also analyzes how the RAG system used additional context to answer the questions.

Each run generates an overall score and metrics.

For information about starting a run, go to Starting a Validate run.

For information about viewing run results, go to Viewing and managing runs.

For development projects, you can also use our Ragas integration, tonic_ragas_logger, which allows you to display Validate visualizations of Ragas results.

About production monitoring projects

A production monitoring project tracks the performance of a production RAG system.

You configure the RAG system to send to the production monitoring project:

  • The questions the RAG system receives

  • The answers it provided

  • The context it used to determine the answer

Validate then generates metrics for each question, and allows you to track the RAG system performance over time. Note that production monitoring does not use Ragas.

For information about configuring your RAG system to send questions to a Validate production monitoring project, go to Configuring your RAG system to send questions to Validate.

For information about viewing the results, go to Viewing the metric scores and logged questions.

Displaying the list of projects

The Validate home page includes the list of projects.

Tonic Validate home page

For each project, the list displays:

  • The project name

  • For development projects, when the most recent run occurred

  • For development projects, a chart that maps the average overall score for each run over time

Displaying details for a project

To display the details for a project, click the project tile.

New development project without runs

For a new development project that does not have any runs, the project details page guides you through the required steps to create a run.

If you use tonic_ragas_logger to visualize Ragas results in Validate, then select Ragas as the logging framework.

If you use Validate runs to generate and visualize metrics, then select Validate as the logging framework.

Project details page for a development project without runs

Development project with runs

For a development project that has completed runs, the project details page displays the list of runs, and provides an overview of the scores across the runs and questions.

Project details page with the overall scores and list of runs and questions

At the left is the list of runs for the project. From there, you can display details for the run results.

The tiles across the top contain the average overall score and average metrics scores for the most recent run.

By default, the graph displays the overall score across all of the runs over time. When you click a metric score tile, the graph updates to show the average metric score across the runs.

Graph that shows a specific metric score over time

Below the graph is the list of questions in the project benchmark. For each question, the list shows the overall score from one month ago and from the most recent run.

To filter the question list, in the filter field, type text from the question.

Filtering the list of questions

When you click a question, the graph is updated to show the average overall or metric score across runs for that specific question. To deselect the question, click it again.

Graph filtered to show a metric score over time for a specific question

New production monitoring project

For a new production monitoring project that does not have any results, the project details page guides you through the required steps to set up a feed of questions from the RAG system to the project.

Project details for a new production monitoring project

Production monitoring project with results

For a production monitoring project that has received questions, the project details page shows a set of overall scores based on the most recent questions that the project received.

The overall scores are followed by a timeline that shows changes in the average metric scores over a selected timeframe.

Below the timeline is the list of questions with metric scores. When you click a point in the timeline, the questions are filtered to display questions that were received during that time.

Project details for a production monitoring project

Creating a project

For a new project, you provide a name and select the project type.

To create a project:

  1. On the Validate home page, click Create a Project.

  2. In the Name field, type the name of the project, then click Next.

Create a new project panel to provide the project name
  3. Click the type of project to create, then click Save.

Project type selection for a new project

Validate displays the project details page.

Changing the project name

For an existing project, you can change the name.

To edit the project name:

  1. Either:

    • From the Validate Home page, click the options menu for the project, then click Edit.

    • From the project details page, click Edit Name.

  2. On the Edit Project panel, in the Project Name field, type the new name for the project.

Edit Project page to update the project name
  3. Click Save.

Deleting a project

To delete a project, from the projects list:

  1. Click the options icon for the project.

  2. In the options menu, click Delete.

Options menu for a project

Validate workflows

Development project workflow

The overall process to use a Tonic Validate development project to evaluate your RAG system consists of the following:

Overview diagram of a Validate development project workflow

Create your benchmark (optional)

A Validate run analyzes a RAG system's performance against a set of questions and optional ideal answers.

One way to provide the questions and answers is to configure a benchmark in Validate.

You can use the Validate application or SDK to add the benchmark to Validate.

Create your project

Next, use the Validate application to create a development project.

Create a run

Use the Validate SDK to create a run for the project.

The run configuration includes:

  • The project

  • The questions to use to analyze the RAG performance. A Validate benchmark is one way to provide the question data.

  • Any metadata about the RAG data, such as the type of LLM, the embedder, or the retrieval algorithm

  • The metrics to calculate

Review the run results

From the Validate application, review the scores and metrics from the run.

Update and iterate

Based on the run results, you update the RAG system to improve the results, then create another run.

You compare the run results to see if your changes improved the quality of the answers.

Production monitoring project workflow

After you release your RAG system, you can use a Validate production monitoring project to track how well it answers user questions.

Overview diagram of a Validate production monitoring project workflow

Create your project

Use the Validate application to create a production monitoring project.

Configure your RAG system to send questions to the project

In your RAG system, you add a call to the Validate SDK to send the following to the production monitoring project:

  • Each question that a user asked

  • The answer that the RAG system provided

  • The context that the RAG system used

View the results

As it receives the questions, Validate generates metric scores.

In the Validate application, you can view a timeline of the average scores for the questions that Validate received from the RAG system.

You can also view and filter the list of questions.

End-to-end example using LlamaIndex

This development project example uses:

  • The data found in the examples/paul_graham_essays folder of the tonic_validate SDK repository

  • The list of questions and reference answers found in examples/question_and_answer_list.json

We use six Paul Graham essays about startup founders taken from his blog. With these we build a RAG system with LlamaIndex that uses the simplest default model.

We load the question and answer list and use it to create a Tonic Validate benchmark.

Next, we connect to Validate using an API token we generated from the Validate application, and create a new development project and benchmark.

Finally, we can create a run and score it.

After you execute this code, you can upload your results to the Validate application and view them.

The metrics are automatically calculated and logged to Validate. The distribution of the scores over the benchmark is also graphed.

from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./paul_graham_essays").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Gets the response from LlamaIndex
def get_llama_response(prompt):
    response = query_engine.query(prompt)
    context = [x.text for x in response.source_nodes]
    return {
        "llm_answer": response.response,
        "llm_context_list": context
    }

import json
from tonic_validate import Benchmark

# Load the first 10 question-and-answer pairs and build a benchmark
qa_pairs = []
with open("question_and_answer_list.json", "r") as qa_file:
    qa_pairs = json.load(qa_file)[:10]

question_list = [qa_pair['question'] for qa_pair in qa_pairs]
answer_list = [qa_pair['answer'] for qa_pair in qa_pairs]

benchmark = Benchmark(questions=question_list, answers=answer_list)

from tonic_validate import ValidateScorer, ValidateApi

# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_llama_response)

# Upload the run
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)

RAG metrics summary

RAG metrics include the following types of scores:

| Score name | Input | Formula | What does it measure? | Evaluated components |
| --- | --- | --- | --- | --- |
| Answer consistency or Answer consistency binary | Retrieved context, LLM answer | (Count of main points in answer that can be attributed to context) / (Count of main points in the answer) | Whether the LLM answer contains information that does not come from the context. | Prompt builder, LLM |
| Answer contains PII | LLM answer, List of PII types | Calculated by Textual | Whether the LLM answer contains personally identifiable information (PII) of the specified types. Requires a Tonic Textual API key. | Prompt builder, LLM |
| Answer match | LLM answer, Text string, Case-sensitivity flag | Compare LLM answer to text string | Whether the answer matches the provided text string. | LLM |
| Answer similarity score | Question, Reference answer, LLM answer | Score between 0 and 5 | How well the reference answer matches the LLM answer. Cannot be used for production monitoring projects. | All components |
| Augmentation accuracy | Retrieved context, LLM answer | (Count of retrieved context in LLM answer) / (Count of retrieved context) | Whether all of the context is in the LLM answer. | Prompt builder, LLM |
| Augmentation precision | Question, Retrieved context, LLM answer | (Count of relevant retrieved context in LLM answer) / (Count of relevant retrieved context) | Whether the relevant context is in the LLM answer. | Prompt builder, LLM |
| Binary | Callback | User-defined | Returns a true or false value based on a callback function that you provide. Cannot be used for production monitoring projects. | User-defined |
| Contains text | LLM answer, Text string | Text in LLM answer | Whether the response contains the provided text string. | LLM |
| Context contains PII | Retrieved context, List of PII types | Calculated by Textual | Whether the context used for the response contains PII of the specified types. Requires a Tonic Textual API key. | Prompt builder |
| Context length | Retrieved context, Minimum length, Maximum length | (Minimum length) <= len(Context) <= (Maximum length) | Whether the length of a context item falls within the specified range. | Prompt builder |
| Duplication | LLM answer | Returns 1 or 0 based on whether there is duplicate information | Whether the response contains duplicate information. | LLM |
| Hate speech content | LLM answer | Returns 1 or 0 based on whether there is hate speech | Whether the response contains hate speech. | LLM |
| Latency | Target length of time | (Run time) <= (Target time) | Whether the response takes longer than the provided target time. | Entire system |
| Regex | LLM answer, Regular expression, Expected number of matches | Runs a regex search and counts the matches; returns true if the number of matches equals the expected match count | Whether the response contains the expected number of matches for the provided regular expression. | LLM |
| Response length | LLM answer, Minimum length, Maximum length | (Minimum length) <= len(LLM response) <= (Maximum length) | Whether the response length falls within the specified range. | LLM |
| Retrieval precision | Question, Retrieved context | (Count of relevant retrieved context) / (Count of retrieved context) | Whether the context retrieved is relevant to answer the given question. | Chunker, Embedder, Retriever |

RAG metrics reference

The RAG metrics use LLM-assisted evaluation to measure your RAG system’s performance.

LLM-assisted evaluation means that an LLM, called the LLM evaluator, is prompted with an aspect of the RAG system response. The LLM evaluator is then asked to respond with a score that grades that aspect of the RAG system response.

Using the metrics

To use the metrics in your ValidateScorer, you can pass the metrics in as a list:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AnswerSimilarityMetric

scorer = ValidateScorer([AnswerSimilarityMetric(), AnswerConsistencyMetric()])

Metrics

Answer consistency

from tonic_validate.metrics import AnswerConsistencyMetric

Answer consistency is the percentage of the RAG system answer that can be attributed to retrieved context.

Answer consistency is a float between 0 and 1.

To calculate this metric, we:

  1. Ask the LLM evaluator to create a bulleted list of the main points in the RAG system answer.

  2. Ask the LLM evaluator whether it can attribute each bullet point to the retrieved context.

The final score is the percentage of the bullet points that can be attributed to the context.

Answer consistency binary

from tonic_validate.metrics import AnswerConsistencyBinaryMetric

Answer consistency binary indicates whether all of the information in the answer is derived from the retrieved context.

Answer consistency binary is a binary integer (1 for true and 0 for false).

  • If all of the information in the answer comes from the retrieved context, then the answer is consistent with the context, and the value of this metric is 1.

  • If the answer contains information that is not derived from the context, then the value of this metric is 0.

To calculate answer consistency binary, we ask an LLM whether the RAG system answer contains any information that is not derived from the context.

Answer contains PII

This metric requires that you have a Tonic Textual account and a Textual API key.

from tonic_validate.metrics import AnswerContainsPiiMetric
metric = AnswerContainsPiiMetric(["<pii_type>"])

Answer contains PII indicates whether the answer from the LLM contains Personally Identifiable Information (PII) of specific types.

For example, you might check whether the response includes credit card numbers and bank numbers.

Answer contains PII is a binary integer (1 for true and 0 for false).

This metric uses the PII detection models in Tonic Textual. To use this metric, you must have a Textual API key. You can either:

  • Provide the Textual API key as the value of the textual_api_key parameter

  • Set the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.

When you calculate the metric, you provide the list of PII types to look for. For information about the available types, go to the entities list in the Textual documentation.

For example, to check whether the answer contains a city or ZIP code value:

from tonic_validate.metrics import AnswerContainsPiiMetric
metric = AnswerContainsPiiMetric(["LOCATION_CITY", "LOCATION_ZIP"])

Answer match

from tonic_validate.metrics import AnswerMatchMetric
metric = AnswerMatchMetric("<metric display name>", "<text>", case_sensitive=boolean)

Answer match indicates whether the answer from the LLM matches a text string that you provide.

Answer match is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use on the Validate application

  • The text to look for in the LLM response

  • Whether the search for the text is case-sensitive. For example, if this is True, and you provide "John" as the text, then "john" and "JOHN" would not be a match.

For example, to do a case-sensitive search for the word "Paris" in the LLM response

from tonic_validate.metrics import AnswerMatchMetric
metric = AnswerMatchMetric("Answer Match", "Paris", case_sensitive=True)

Answer similarity score

from tonic_validate.metrics import AnswerSimilarityMetric

The answer similarity score measures, on a scale from 0 to 5, how well the answer from the RAG system corresponds in meaning to a reference answer. You cannot calculate an Answer similarity score for a production monitoring project.

This score is an end-to-end test of the RAG LLM.

The answer similarity score is a float between 0 and 5. The value is usually an integer.

To calculate the score, we ask an LLM to grade on a scale from 0 to 5 how well the RAG LLM response matches the reference response.
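
The following minimal sketch scores this metric using the pattern from Using the metrics. Because answer similarity compares the LLM answer to a reference answer, the benchmark must include answers:

from tonic_validate import Benchmark, ValidateScorer
from tonic_validate.metrics import AnswerSimilarityMetric

# Answer similarity requires reference answers in the benchmark
benchmark = Benchmark(
    questions=["What is the capital of France?"],
    answers=["Paris"]
)
scorer = ValidateScorer([AnswerSimilarityMetric()])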

Augmentation accuracy

from tonic_validate.metrics import AugmentationAccuracyMetric

Augmentation accuracy is the percentage of retrieved context for which some portion of the context appears in the answer from the RAG system. This metric is unrelated to the binary classification framework.

Augmentation accuracy is a float between 0 and 1.

To calculate both augmentation precision and augmentation accuracy, we ask the LLM evaluator whether retrieved context is used in the RAG system answer.

There is a trade-off between maximizing augmentation accuracy and augmentation precision.

  • If retrieval precision is very high, then you want to maximize augmentation accuracy.

  • If retrieval precision is not very high, then you want to maximize augmentation precision.

Binary

from tonic_validate.metrics import BinaryMetric
metric = BinaryMetric("<metric display name>", <callback>)

Binary is a metric that returns either true or false from a callback function that you provide. You cannot calculate a Binary metric for a production monitoring project.

Binary is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use for the metric on the Validate application.

  • The callback function. This is a function that you define that returns either true or false. The callback receives the LLM service (either the OpenAIService or, for non-OpenAI calls, the LiteLLMService) and an LLMResponse object, which contains:

    • The LLM answer (llm_answer)

    • The context used (llm_context_list)

    • For a development project run, the benchmark item (benchmark_item)

For example:

from tonic_validate.metrics import BinaryMetric
metric = BinaryMetric("Binary Metric", lambda open_ai, llm_response: True)

Contains text

from tonic_validate.metrics import ContainsTextMetric
metric = ContainsTextMetric("<metric display name>", "<text>", case_sensitive=boolean)

Contains text indicates whether the answer from the LLM contains a text string that you provide.

Contains text is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use for the metric on the Validate application

  • The text to look for in the LLM response

  • Whether the search for the text is case-sensitive. For example, if this is True, and you provide "John" as the text, then "john" and "JOHN" would not be a match.

For example, the following request does a case-sensitive search for the text "Paris":

from tonic_validate.metrics import ContainsTextMetric
metric = ContainsTextMetric("Contains Text", "Paris", case_sensitive=True)

Context contains PII

This metric requires that you have a Tonic Textual account and a Textual API key.

from tonic_validate.metrics import ContextContainsPiiMetric
metric = ContextContainsPiiMetric(["<pii_type>"])

Context contains PII indicates whether the context that the LLM used to answer the question contains Personally Identifiable Information (PII) of specific types.

Context contains PII is a binary integer (1 for true and 0 for false).

For example, you might check whether the context includes credit card numbers and bank numbers.

This metric uses the PII detection models in Tonic Textual. To use this metric, you must have a Textual API key. You can either:

  • Provide the Textual API key as the value of the textual_api_key parameter

  • Set the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.

When you calculate the metric, you provide the list of PII types to look for. For information about the available types, go to the entities list in the Textual documentation.

For example, to check whether the context contains a city or ZIP code value:

from tonic_validate.metrics import ContextContainsPiiMetric
metric = ContextContainsPiiMetric(["LOCATION_CITY", "LOCATION_ZIP"])

Context length

from tonic_validate.metrics import ContextLengthMetric
metric = ContextLengthMetric("<metric display name>", <minimum length>, <maximum length>)

Context length indicates whether the length of the context that the LLM used to answer the question fell within a range that you provide.

Context length is a binary integer (1 for true and 0 for false).

The request includes:

  • The display name to use for the metric on the Validate application

  • The minimum length. If you do not provide a minimum length, then the metric checks whether the context is shorter than the maximum length.

  • The maximum length. If you do not provide a maximum length, then the metric checks whether the context is longer than the minimum length.

For example, to check whether the context is between 5 and 10 characters long:

from tonic_validate.metrics import ContextLengthMetric
metric = ContextLengthMetric("Context Length", 5, 10)

Duplication

from tonic_validate.metrics import DuplicationMetric
metric = DuplicationMetric()

Duplication indicates whether the answer from the LLM contains duplicate information.

Duplication is a binary integer (1 for true and 0 for false).

Hate speech content

from tonic_validate.metrics import HateSpeechContentMetric
metric = HateSpeechContentMetric()

Hate speech content indicates whether the answer from the LLM contains hate speech.

Hate speech content is a binary integer (1 for true and 0 for false).

Latency

from tonic_validate.metrics import LatencyMetric
metric = LatencyMetric(target_time=<time in seconds>)

Latency indicates whether the length of time that it took for the LLM to return an answer is less than a length of time in seconds that you provide.

Latency is a binary integer (1 for true and 0 for false).

For example, to check whether the LLM responded within 5 seconds:

from tonic_validate.metrics import LatencyMetric
metric = LatencyMetric(target_time=5)

Regex

from tonic_validate.metrics import RegexMetric
metric = RegexMetric("<metric display name>", "<regular expression>", match_count=<expected number of matches>)

Regex indicates whether the answer from the LLM contains the expected number of matches to a provided regular expression.

Regex is a binary integer (1 for true and 0 for false).

For example, to check whether the response contained 2 matches for the regular expression Fid*o:

from tonic_validate.metrics import RegexMetric
metric = RegexMetric("Regex Metric", "Fid*o", match_count=2)

Response length

from tonic_validate.metrics import ResponseLengthMetric
metric = ResponseLengthMetric("Response Length", <minimum length>, <maximum length>)

Response length indicates whether the length of the answer from the LLM falls within a provided range.

The request includes:

  • The display name to use for the metric on the Validate application

  • The minimum length. If you do not provide a minimum length, then the metric checks whether the answer is shorter than the maximum length.

  • The maximum length. If you do not provide a maximum length, then the metric checks whether the answer is longer than the minimum length.

Response length is a binary integer (1 for true and 0 for false).

For example, to check whether the answer is between 5 and 10 characters long:

from tonic_validate.metrics import ResponseLengthMetric
metric = ResponseLengthMetric("Response Length", 5, 10)

Retrieval and augmentation precision

When you ask a question of a RAG LLM, you can view whether the RAG system retrieves or uses the correct context as a binary classification problem on the set of context vectors. The labels indicate whether a context vector provides relevant context for the question.

This is a very imbalanced binary classification problem that has very few positive classes. The vector database contains many vectors, and only a few of those vectors are relevant.

In an imbalanced binary classification problem with few positive classes, the relevant metrics are:

  • precision = (number of correctly predicted positive classes)/(number of predicted positive classes)

  • recall = (number of correctly predicted positive classes)/(number of positive classes).

For vector retrieval, unless you already know the question and relevant context, you cannot know all of the relevant context for a question. For this reason, it is difficult to explicitly calculate recall. However, we can calculate precision. This binary classification setup is how we define the retrieval precision and augmentation precision metrics.

Retrieval precision

from tonic_validate.metrics import RetrievalPrecisionMetric

The context retrieval step of a RAG system predicts the context that is relevant to answering the question.

Retrieval precision is the percentage of retrieved context that is relevant to answer the question.

Retrieval precision is a float between 0 and 1.

To measure retrieval precision, for each context vector, we ask the LLM evaluator whether the context is relevant to use to answer the question.

Augmentation precision

from tonic_validate.metrics import AugmentationPrecisionMetric

Augmentation precision is the percentage of relevant retrieved context for which some portion of the context appears in the answer from the RAG system.

For the context from the retrieval step, this score measures how well the LLM used in the RAG system recognizes relevant context.

Augmentation precision is a float between 0 and 1.

To measure augmentation precision, for each relevant piece of context, we ask the LLM evaluator whether information from the relevant context appears in the answer.
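
The following sketch, which uses the pattern from Using the metrics, scores both precision metrics together. Comparing the two scores suggests whether weak answers stem from retrieval or from how the LLM uses the retrieved context:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import (
    AugmentationPrecisionMetric,
    RetrievalPrecisionMetric,
)

# Low retrieval precision points at the chunker, embedder, or retriever;
# low augmentation precision points at the prompt builder or LLM
scorer = ValidateScorer([RetrievalPrecisionMetric(), AugmentationPrecisionMetric()])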

Calculating the overall score

The overall score is the mean of the normalized values of the metrics. For answer similarity, the normalized value is the original value divided by 5. For the other metrics, the normalized value is the same as the original value. Normalizing answer similarity ensures that all of the metrics have values between 0 and 1.
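
For example, suppose that a run produces a hypothetical answer similarity of 4.0, an answer consistency of 1.0, and a retrieval precision of 0.5. The normalized answer similarity is 4.0 / 5 = 0.8, so the overall score is (0.8 + 1.0 + 0.5) / 3 ≈ 0.77.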

In Validate, the overall score is calculated from the metrics that are used. If only some of the metrics are calculated, then the overall score is the mean of those metrics. The overall score does not include metrics that were not calculated.

We opt for this definition because it is both:

  • Simple, which makes it easy to understand.

  • Flexible, because you can specify only the metrics that you are interested in calculating, and then the overall score depends only on those metrics.

In general, the best way to improve your RAG system is to focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.

If needed, however, the overall score is a simple representation of all the metrics in one number.

RAG components summary

A RAG system includes the following components:

| Component | Definition | Examples |
| --- | --- | --- |
| Document store | Where the textual data is stored. | Google Docs, Notion, Word documents |
| Chunker | How each document is broken into pieces (or chunks) that are then embedded. | llama hub |
| Embedder | How each document chunk is transformed into a vector that stores its semantic meaning. | ada-002, sentence transformer |
| Retriever | The algorithm that retrieves relevant chunks of text from the user query. Those chunks are used as context to answer the query. | Take the top cosine similarity scores between the embedding of the user query and the embedded document chunks |
| Prompt builder | How the user query, along with conversation history and retrieved document chunks, is put into the context window to prompt the LLM for an answer to the user query. | "Here's a user query {user_query} and here's a list of context that may be helpful to answer the user's query: {context_1}, {context_2}. Answer the user's query using the given context." |
| LLM | The large language model that receives the prompt from the prompt builder and returns an answer to the user's query. | gpt-3.5-turbo, gpt-4, llama 2, claude |
