The Tonic Validate application and Validate SDK (tonic_validate) allow you to measure how well your RAG LLM system performs.
Validate calculates rigorous LLM-assisted RAG evaluation metrics. You can also use tonic_validate to compute RAG metrics outside the context of a Validate project.
In Validate:
Development projects track the performance of a RAG system that is under development.
Production monitoring projects track how well a production RAG system answers questions from actual users.
For development projects, Validate also provides an integration with Ragas, tonic_ragas_logger, that allows you to visualize Ragas evaluation results in Validate.
Need help with Validate? Contact support@tonic.ai.
Start your Validate account
Sign up for a Validate account and create your first project.
Set up the Validate SDK
Install the SDK. Provide Validate and OpenAI API keys.
Quickstart example
Use tonic_validate to log RAG metrics to a development project.
Types of RAG metrics
RAG metrics measure the quality of RAG LLM responses.
Create and manage benchmarks
A benchmark is a set of questions, optionally with expected responses, to send to a development project.
Create and manage projects
A development project consists of a set of runs.
A production monitoring project tracks performance over time.
Start a new run
Start a new Validate run to calculate metrics for RAG LLM answers to questions.
View run results
Review average metric scores, and the grouping of values for the questions.
Connect your RAG system to Validate
Configure your RAG system to send user questions and system answers to Validate.
Track RAG system performance
View average metric scores over time for the RAG system.
End-to-end example with LlamaIndex
Demonstrates an end-to-end Validate development project flow.
A development project is designed to be used during RAG system development. It is a collection of runs that allow you to see how the run performance for a given set of questions changes over time.
A production monitoring project allows you to monitor the performance over time of a production RAG system. You configure the RAG system to automatically send to the production monitoring project the questions your users asked, the answers the RAG system provided, and the associated context.
For more information, go to Managing projects in Validate.
Metrics are used to score the RAG system responses to questions.
For a development project, Validate calculates metric scores for the benchmark questions that are provided for the project.
For a production monitoring project, Validate calculates metric scores for the questions that users ask the RAG system. The RAG system sends the questions to Validate.
Validate calculates different metrics that represent different aspects of a RAG system. For more information about metrics, go to the metrics section.
For a Validate development project, a run represents an assessment of the RAG responses to a set of questions based on the RAG system configuration at a given point in time.
For each response, the run includes:
The question and, optionally, the corresponding ideal answer. A benchmark is one option for providing the questions.
The LLM's response and the context that the RAG system retrieved
Metadata in the form of key-value pairs that you specify. For example, "Model": "GPT-4"
Scores for the responses that use your chosen metrics
The run also includes overall scores for the given metrics.
For more information, go to Viewing and managing runs.
For a Validate development project, a benchmark is a collection of questions with or without responses. The responses represent the ideal answers to the given questions.
A benchmark is one way to provide the questions for Validate to use to evaluate your RAG system.
For more information, go to Managing benchmarks in Validate.
You must use the Validate SDK to:
You can also use the SDK to:
Manage projects
Calculate RAG metrics outside the context of a Validate project
You can use the Validate application to manage benchmarks and projects.
You must use the Validate application to view:
The Tonic Validate application and SDK (tonic_validate) allow you to measure how well your RAG LLM system performs.
Retrieval augmented generation (RAG) allows you to augment a large language model (LLM) with additional data that is not in the LLM's original training set. The additional data usually takes the form of text from documents such as HTML, Markdown, Word, or Notion.
The LLM can then use that data in its responses to user queries. The responses can also include references for the additional content.
But how do you know how well your RAG system works? How good are its responses? How relevant is the additional context? And how does the quality of the responses change when you change the available context data?
That is where Validate comes in.
As you develop your RAG system, you can use a Validate development project to run tests to determine how the system performs against a benchmark set of questions. You can then see whether the quality of the answers improves for each run.
After you release your RAG system, you can configure it to send the questions, answers, and context to a Validate production monitoring project that tracks the quality of the responses over time in your production systems.
Validate includes:
Metrics to measure the performance of each component in your RAG system
Visualizations to compare performance across time as the system changes
Validate provides insight into your RAG LLM system performance, so that you can deploy it with confidence.
To install tonic_validate, you use pip:
To log metrics using tonic_validate, you need to set the API key by passing it into ValidateApi:
Validate uses LLM-assisted evaluation. You must provide an API key for the LLM evaluator model that you want to use.
Validate currently supports the following models:
To use the OpenAI models, you must:
Have an OpenAI API key
Set the API key as the value of the environment variable OPENAI_API_KEY
To get an OpenAI API key, go to the OpenAI API key page.
In your Python script or Jupyter notebook, set your OpenAI API key as the value of OPENAI_API_KEY:
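A minimal sketch of that step, using only the Python standard library. The key value is a placeholder, not a real key:

```python
import os

# Placeholder value; replace it with your own OpenAI API key.
os.environ["OPENAI_API_KEY"] = "put-your-openai-api-key-here"
```

Set the variable before you start a Validate run, so that the LLM evaluator can read it from the environment.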
Validate also supports Azure's OpenAI API service.
To use Azure, you must set up an Azure OpenAI deployment.
For information on how to set up a deployment, go to this Azure OpenAI service quickstart.
After you set up your deployment, copy your API key and API endpoint to the following environment variables:
When you start a Validate run, you must provide the deployment name.
To use Gemini models, you must:
Have a Gemini API key
Set the API key as the value of the environment variable GEMINI_API_KEY
To get a Gemini API key, go to the Gemini home page.
In your Python script or Jupyter notebook, set your Gemini API key as the value of GEMINI_API_KEY:
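As a sketch, the same standard-library approach works for the Gemini key. The value shown is a placeholder:

```python
import os

# Placeholder value; replace it with your own Gemini API key.
os.environ["GEMINI_API_KEY"] = "put-your-gemini-api-key-here"
```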
To use Anthropic models, you must:
Have an Anthropic API key
Set the API key as the value of the environment variable ANTHROPIC_API_KEY
To get an Anthropic API key, go to the Anthropic API page.
In your Python script or Jupyter notebook, set your Anthropic API key as the value of ANTHROPIC_API_KEY:
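A sketch of that assignment with a placeholder value:

```python
import os

# Placeholder value; replace it with your own Anthropic API key.
os.environ["ANTHROPIC_API_KEY"] = "put-your-anthropic-api-key-here"
```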
Evaluating a RAG system is difficult because:
There are many moving pieces. For a list of components, go to RAG components summary.
You need to evaluate natural language responses.
Building RAG systems is relatively new. There are not well-established systems to evaluate the performance of a RAG system.
Our RAG metrics use the input and output of your RAG system to test all of the components of the system.
Each metric calculates a score that measures the performance of a different aspect of a typical RAG system.
The metrics use LLM-assisted evaluation to evaluate the natural language aspect of a RAG system question, answer, and retrieved context.
The metrics include both end-to-end RAG system metrics, and metrics that are based on a binary classification framework.
For a Validate development project, to use the RAG metrics to evaluate and improve your RAG system, you first build a benchmark dataset of questions and reference answers, or only questions, to test the RAG system with.
Next, use your RAG system to get the answers and the retrieved context that was used to generate the answers to the questions in your benchmark.
After you have this information, you can calculate each RAG metric and see how different aspects of your RAG system perform.
To improve your RAG system, focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.
For a production RAG system, you can configure the system to send the questions, answers, and context to a Validate production monitoring project.
You can then track the RAG system performance over time.
A RAG system includes the following components:
Document store
Where the textual data is stored.
Google Docs, Notion, Word documents
Chunker
How each document is broken into pieces (or chunks) that are then embedded.
llama hub
Embedder
How each document chunk is transformed into a vector that stores its semantic meaning.
ada-002, sentence transformer
Retriever
The algorithm that retrieves the chunks of text that are relevant to the user query.
Those chunks of text are used as context to answer a user query.
Take the top cosine similarity scores between the embedding of the user query and the embedded document chunks
Prompt builder
How the user query, along with conversation history and retrieved document chunks, is put into the context window to prompt the LLM for an answer to the user query.
Here's a user query {user_query} and here's a list of context that may be helpful to answer the user's query: {context_1}, {context_2}.
Answer the user's query using the given context.
LLM
The large language model that receives the prompt from the prompt builder and returns an answer to the user's query.
gpt3.5-turbo, gpt4, llama 2, claude
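To make the retriever and prompt builder components concrete, here is a small self-contained sketch. It is an illustration only, not part of Validate or of any particular RAG framework: the function names, the three-dimensional toy embeddings, and the sample chunks are all assumptions made for the example. The prompt wording follows the example prompt in the component list above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, chunks, k=2):
    """Return the text of the k chunks whose embeddings are closest to the query.

    chunks is a list of (text, embedding) pairs.
    """
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(user_query, context_list):
    """Place the user query and the retrieved context into one prompt string."""
    context = ", ".join(context_list)
    return (
        f"Here's a user query {user_query} and here's a list of context that "
        f"may be helpful to answer the user's query: {context}. "
        "Answer the user's query using the given context."
    )


# Toy embeddings; a real embedder produces much longer vectors.
chunks = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The Seine flows through Paris.", [0.7, 0.3, 0.1]),
    ("Bananas are rich in potassium.", [0.0, 0.1, 0.9]),
]
query_embedding = [1.0, 0.0, 0.0]

top_chunks = retrieve_top_k(query_embedding, chunks, k=2)
prompt = build_prompt("What is the capital of France?", top_chunks)
```

The retrieved chunks and the prompt built from them are exactly the inputs that the RAG metrics later score.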
To sign up for a Validate account:
In the Email Address field, provide your email address.
In the Password field, create a password for your Validate account.
Click Sign Up.
Validate creates your account, and prompts you to log in.
Validate sends an activation email to your email address.
When you confirm your account, Validate displays the Home page and prompts you to create your first project.
The next step is to create a project.
On the Create a new project panel, in the Name field, provide the name for the new project, then click Next.
On the next panel, select the type of project.
To create a production monitoring project, click Monitoring.
To create a development project, click Development.
After you select the type, click Save. Tonic creates the project and displays the details page.
The confirmation panel also displays a code snippet that you can use to generate a metric (for a development project) or send a question from a RAG system (for a production monitoring project). To copy the snippet to the clipboard, click the copy icon.
The overall process to use a Tonic Validate development project to evaluate your RAG system consists of the following:
A Validate run analyzes a RAG system performance against a set of questions and optional ideal answers.
One way to provide the questions and answers is to configure a benchmark in Validate.
You can use the Validate application or SDK to add the benchmark to Validate.
Next, use the Validate application to create a development project.
Use the Validate SDK to create a run for the project.
The run configuration includes:
The project
The questions to use to analyze the RAG performance. A Validate benchmark is one way to provide the question data.
Any metadata about the RAG data, such as the type of LLM, the embedder, or the retrieval algorithm
The metrics to calculate
From the Validate application, review the scores and metrics from the run.
Based on the run results, you update the RAG system to improve the results, then create another run.
You compare the run results to see if your changes improved the quality of the answers.
After you release your RAG system, you can use a Validate production monitoring project to track how well it answers user questions.
Use the Validate application to create a production monitoring project.
In your RAG system, you add a call to the Validate SDK to send the following to the production monitoring project:
Each question that a user asked
The answer that the RAG system provided
The context that the RAG system used
As it receives the questions, Validate generates metric scores.
In the Validate application, you can view a timeline of the average scores for the questions that Validate received from the RAG system.
You can also view and filter the list of questions.
The RAG metrics use LLM-assisted evaluation to measure your RAG system’s performance.
LLM-assisted evaluation means that an LLM, called the LLM evaluator, is prompted with an aspect of the RAG system response. The LLM evaluator is then asked to respond with a score that grades that aspect of the RAG system response.
To use the metrics in your ValidateScorer, you can pass the metrics in as a list.
Answer consistency is the percentage of the RAG system answer that can be attributed to retrieved context.
Answer consistency is a float between 0 and 1.
To calculate this metric, we:
Ask the LLM evaluator to create a bulleted list of the main points in the RAG system answer.
Ask the LLM evaluator whether it can attribute each bullet point to the retrieved context.
The final score is the percentage of the bullet points that can be attributed to the context.
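The final step is simple arithmetic. The sketch below assumes that the LLM evaluator has already produced one true-or-false judgment per bullet point. The function name and the sample judgments are illustrative and are not part of the SDK.

```python
def answer_consistency(attributed_flags):
    """Fraction of main points that the evaluator attributed to the context.

    attributed_flags holds one boolean per bullet point in the answer.
    """
    if not attributed_flags:
        raise ValueError("at least one main point is required")
    return sum(attributed_flags) / len(attributed_flags)


# Four main points, three of which are supported by the retrieved context.
score = answer_consistency([True, True, False, True])
```

Here the score is 3 / 4 = 0.75, which falls in the documented range of 0 to 1.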
Answer consistency binary indicates whether all of the information in the answer is derived from the retrieved context.
Answer consistency binary is a binary integer (1 for true and 0 for false).
If all of the information in the answer comes from the retrieved context, then the answer is consistent with the context, and the value of this metric is 1.
If the answer contains information that is not derived from the context, then the value of this metric is 0.
To calculate answer consistency binary, we ask an LLM whether the RAG system answer contains any information that is not derived from the context.
This metric requires that you have a Tonic Textual account and a Textual API key.
Answer contains PII indicates whether the answer from the LLM contains Personally Identifiable Information (PII) of specific types.
For example, you might check whether the response includes credit card numbers and bank numbers.
Answer contains PII is a binary integer (1 for true and 0 for false).
This metric uses the PII detection models in Tonic Textual. To use this metric, you must have a Textual API key. You can either:
Provide the Textual API key as the value of the textual_api_key parameter
Set the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.
For example, to check whether the answer contains a city or zip code value:
Answer match indicates whether the answer from the LLM matches a text string that you provide.
Answer match is a binary integer (1 for true and 0 for false).
The request includes:
The display name to use on the Validate application
The text to look for in the LLM response
Whether the search for the text is case-sensitive. For example, if this is True, and you provide "John" as the text, then "john" and "JOHN" would not be a match.
For example, to do a case-sensitive search for the word "Paris" in the LLM response:
The answer similarity score measures, on a scale from 0 to 5, how well the answer from the RAG system corresponds in meaning to a reference answer. You cannot calculate an Answer similarity score for a production monitoring project.
This score is an end-to-end test of the RAG LLM.
The answer similarity score is a float between 0 and 5. The value is usually an integer.
To calculate the score, we ask an LLM to grade on a scale from 0 to 5 how well the RAG LLM response matches the reference response.
Augmentation accuracy is the percentage of retrieved context for which some portion of the context appears in the answer from the RAG system. This metric is unrelated to the binary classification framework.
Augmentation accuracy is a float between 0 and 1.
To calculate both augmentation precision and augmentation accuracy, we ask the LLM evaluator whether retrieved context is used in the RAG system answer.
There is a trade-off between maximizing augmentation accuracy and augmentation precision.
If retrieval precision is very high, then you want to maximize augmentation accuracy.
If retrieval precision is not very high, then you want to maximize augmentation precision.
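The trade-off is easier to see with numbers. The following sketch assumes that the LLM evaluator has already labeled each retrieved context item as relevant or not, and as used in the answer or not. The function names and the labels are made up for illustration.

```python
def augmentation_accuracy(used):
    """Fraction of all retrieved context items that appear in the answer."""
    if not used:
        raise ValueError("at least one retrieved context item is required")
    return sum(used) / len(used)


def augmentation_precision(relevant, used):
    """Fraction of relevant retrieved context items that appear in the answer."""
    used_relevant = [u for r, u in zip(relevant, used) if r]
    if not used_relevant:
        raise ValueError("at least one relevant context item is required")
    return sum(used_relevant) / len(used_relevant)


# Retrieval precision is only 0.5 here: two of four items are relevant.
relevant = [True, True, False, False]
# The LLM uses exactly the relevant items and ignores the rest.
used = [True, True, False, False]

accuracy = augmentation_accuracy(used)
precision = augmentation_precision(relevant, used)
```

In this example, the LLM behaves ideally, yet augmentation accuracy is only 0.5 because half of the retrieved context is irrelevant. Augmentation precision is 1.0, which is why it is the better target when retrieval precision is low.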
Binary is a metric that returns either true or false from a callback function that you provide. You cannot calculate a Binary metric for a production monitoring project.
Binary is a binary integer (1 for true and 0 for false).
The request includes:
The display name to use for the metric on the Validate application.
The callback function. This is a function that you define that returns either true or false. The callback uses either the OpenAIService or, for non-OpenAI calls, the LiteLLMService.
Whether to include LLMResponse, which contains:
The LLM answer (llm_answer)
The context used (llm_context_list)
For a development project run, the benchmark item (benchmark_item)
For example:
Contains text indicates whether the answer from the LLM contains a text string that you provide.
Contains text is a binary integer (1 for true and 0 for false).
The request includes:
The display name to use for the metric on the Validate application
The text to look for in the LLM response
Whether the search for the text is case-sensitive. For example, if this is True, and you provide "John" as the text, then "john" and "JOHN" would not be a match.
For example, the following request does a case-sensitive search for the text "Paris":
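The SDK request itself is not reproduced here. As a plain-Python sketch of the check that the metric describes, assuming a simple substring search, the logic looks like the following. The function name is hypothetical.

```python
def contains_text(llm_answer, text, case_sensitive=True):
    """Return 1 if text occurs in llm_answer, otherwise 0."""
    if not case_sensitive:
        llm_answer, text = llm_answer.lower(), text.lower()
    return int(text in llm_answer)


case_sensitive_hit = contains_text("The capital of France is Paris.", "Paris")
case_sensitive_miss = contains_text("the capital of france is paris.", "Paris")
case_insensitive_hit = contains_text(
    "the capital of france is paris.", "Paris", case_sensitive=False
)
```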
This metric requires that you have a Tonic Textual account and a Textual API key.
Context contains PII indicates whether the context that the LLM used to answer the question contains Personally Identifiable Information (PII) of specific types.
Context contains PII is a binary integer (1 for true and 0 for false).
For example, you might check whether the context includes credit card numbers and bank numbers.
This metric uses the PII detection models in Tonic Textual. To use this metric, you must have a Textual API key. You can either:
Provide the Textual API key as the value of the textual_api_key parameter
Set the API key as the value of the TONIC_TEXTUAL_API_KEY environment variable.
For example, to check whether the context contains a city or zip code value:
Context length indicates whether the length of the context that the LLM used to answer the question fell within a range that you provide.
Context length is a binary integer (1 for true and 0 for false).
The request includes:
The display name to use for the metric on the Validate application
The minimum length. If you do not provide a minimum length, then the metric checks whether the context is shorter than the maximum length.
The maximum length. If you do not provide a maximum length, then the metric checks whether the context is longer than the minimum length.
For example, to check whether the context is between 5 and 10 characters long:
Duplication indicates whether the answer from the LLM contains duplicate information.
Duplication is a binary integer (1 for true and 0 for false).
Hate speech content indicates whether the answer from the LLM contains hate speech.
Hate speech content is a binary integer (1 for true and 0 for false).
Latency indicates whether the length of time that it took for the LLM to return an answer is less than a length of time in seconds that you provide.
Latency is a binary integer (1 for true and 0 for false).
For example, to check whether the LLM responded within 5 seconds:
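The SDK call is not shown here. The sketch below only illustrates the idea of timing a call and comparing the elapsed time to a target. The helper name and the stand-in answer function are assumptions, and how Validate itself records response time is not specified in this document.

```python
import time


def timed_latency_check(target_seconds, call, *args):
    """Run call(*args) and return (result, 1 or 0).

    The flag is 1 when the call finished in less than target_seconds.
    """
    start = time.perf_counter()
    result = call(*args)
    elapsed = time.perf_counter() - start
    return result, int(elapsed < target_seconds)


def fake_rag_answer(question):
    """Stand-in for a real RAG system call; returns immediately."""
    return "Paris"


answer, within_target = timed_latency_check(
    5.0, fake_rag_answer, "What is the capital of France?"
)
```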
Regex indicates whether the answer from the LLM contains the expected number of matches to a provided regular expression.
Regex is a binary integer (1 for true and 0 for false).
For example, to check whether the response contained 2 matches for the regular expression Fid*o:
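As an illustration of what such a check computes, the following sketch counts non-overlapping matches with the Python standard library re module. The function name and the sample answers are hypothetical, and the SDK may count matches differently.

```python
import re


def regex_match_count_is(llm_answer, pattern, expected_count):
    """Return 1 if pattern matches llm_answer exactly expected_count times."""
    matches = re.findall(pattern, llm_answer)
    return int(len(matches) == expected_count)


# "Fid*o" matches "Fi", then zero or more "d" characters, then "o".
two_matches = regex_match_count_is("Fido met Fiddo at the park.", r"Fid*o", 2)
one_match = regex_match_count_is("Fido stayed home.", r"Fid*o", 2)
```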
Response length indicates whether the length of the answer from the LLM falls within a provided range.
The request includes:
The display name to use for the metric on the Validate application
The minimum length. If you do not provide a minimum length, then the metric checks whether the answer is shorter than the maximum length.
The maximum length. If you do not provide a maximum length, then the metric checks whether the answer is longer than the minimum length.
Response length is a binary integer (1 for true and 0 for false).
For example, to check whether the answer is between 5 and 10 characters long:
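The range check can be sketched in plain Python as follows. The function name is hypothetical, and treating both bounds as inclusive is an assumption, because this document does not state how the boundary values are handled.

```python
def length_in_range(text, min_length=None, max_length=None):
    """Return 1 if len(text) is within the given bounds, otherwise 0.

    A bound that is None is not checked. Bounds are treated as inclusive.
    """
    length = len(text)
    if min_length is not None and length < min_length:
        return 0
    if max_length is not None and length > max_length:
        return 0
    return 1


in_range = length_in_range("Paris", min_length=5, max_length=10)
too_short = length_in_range("Rome", min_length=5, max_length=10)
too_long = length_in_range("Paris, France", min_length=5, max_length=10)
no_maximum = length_in_range("Paris, France", min_length=5)
```

The same logic applies to the context length metric, with the context text in place of the answer.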
When you ask a question of a RAG LLM, you can view whether the RAG system retrieves or uses the correct context as a binary classification problem on the set of context vectors. The labels indicate whether a context vector provides relevant context for the question.
This is a very imbalanced binary classification problem that has very few positive examples. The vector database contains many vectors, and only a few of those vectors are relevant.
In an imbalanced binary classification problem with few positive examples, the relevant metrics are:
precision = (number of correctly predicted positives) / (number of predicted positives)
recall = (number of correctly predicted positives) / (number of actual positives)
For vector retrieval, unless you already know the question and relevant context, you cannot know all of the relevant context for a question. For this reason, it is difficult to explicitly calculate recall. However, we can calculate precision. This binary classification setup is how we define the retrieval precision and augmentation precision metrics.
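The sketch below computes both quantities for a toy case in which the full set of relevant chunks is known. That is exactly the information that is usually missing in practice, which is why recall is hard to calculate. The chunk identifiers and the function name are invented for the example.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the true positive set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# The retriever returned four chunks; three chunks in the store are relevant.
retrieved = {"chunk_1", "chunk_2", "chunk_3", "chunk_4"}
relevant = {"chunk_1", "chunk_2", "chunk_7"}

precision, recall = precision_recall(retrieved, relevant)
```

Precision is 2 / 4 and needs only the retrieved chunks and a relevance judgment for each one. Recall is 2 / 3 and needs the complete relevant set, including chunk_7, which the retriever never surfaced.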
The context retrieval step of a RAG system predicts the context that is relevant to answering the question.
Retrieval precision is the percentage of retrieved context that is relevant to answer the question.
Retrieval precision is a float between 0 and 1.
To measure retrieval precision, for each context vector, we ask the LLM evaluator whether the context is relevant to use to answer the question.
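A sketch of that loop follows. The is_relevant argument stands in for the LLM evaluator call. The keyword-based judge used here is a deliberately crude placeholder so that the example runs without an API key, and both function names are hypothetical.

```python
def retrieval_precision(question, context_list, is_relevant):
    """Fraction of retrieved context items that the judge marks as relevant."""
    if not context_list:
        raise ValueError("at least one retrieved context item is required")
    judgments = [is_relevant(question, context) for context in context_list]
    return sum(judgments) / len(judgments)


def keyword_judge(question, context):
    """Toy stand-in for the LLM evaluator: relevant if it mentions France."""
    return "France" in context


contexts = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "Bananas are rich in potassium.",
    "The Nile is in Africa.",
]
score = retrieval_precision("What is the capital of France?", contexts, keyword_judge)
```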
Augmentation precision is the percentage of relevant retrieved context for which some portion of the context appears in the answer from the RAG system.
For the context from the retrieval step, this score measures how well the LLM used in the RAG system recognizes relevant context.
Augmentation precision is a float between 0 and 1.
To measure augmentation precision, for each relevant piece of context, we ask the LLM evaluator whether information from the relevant context appears in the answer.
The overall score is the mean of the normalized values of the metrics. For answer similarity, the normalized value is the original value divided by 5. For the other metrics, the normalized value is the same as the original value. Normalizing answer similarity ensures that all of the metrics have values between 0 and 1.
In Validate, the overall score is calculated from the metrics that are used. If only some of the metrics are calculated, then the overall score is the mean of those metrics. The overall score does not include metrics that were not calculated.
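The definition above can be written out as a short calculation. In this sketch, the metric names used as dictionary keys are illustrative labels, and None marks a metric that was not calculated.

```python
def overall_score(scores):
    """Mean of the normalized scores for the metrics that were calculated.

    scores maps a metric name to its value, or to None if it was not calculated.
    Answer similarity is on a 0 to 5 scale, so it is divided by 5.
    """
    normalized = []
    for name, value in scores.items():
        if value is None:
            continue  # metrics that were not calculated are excluded
        normalized.append(value / 5 if name == "answer_similarity" else value)
    if not normalized:
        raise ValueError("at least one metric must be calculated")
    return sum(normalized) / len(normalized)


score = overall_score(
    {
        "answer_similarity": 4.0,        # normalized to 0.8
        "retrieval_precision": 0.5,
        "augmentation_precision": None,  # not calculated, so ignored
        "answer_consistency": 1.0,
    }
)
```

Here the overall score is (0.8 + 0.5 + 1.0) / 3, or about 0.77.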
We opt for this definition because it is both:
Simple, which makes it easy to understand.
Flexible, because you can specify only the metrics that you are interested in calculating, and then the overall score depends only on those metrics.
In general, the best way to improve your RAG system is to focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.
If needed, however, the overall score is a simple representation of all the metrics in one number.
For an existing Validate development project, the following example creates a run and logs the responses to the UI.
When the RAG system response and retrieved context are logged to the Tonic Validate application, RAG metrics are calculated using calls to OpenAI.
The Tonic Validate SDK (tonic_validate) allows you to:
Manage projects and benchmarks
Start runs against Validate development projects
Send questions from a RAG system to a Validate production monitoring project
To use tonic_validate, you must have a tonic_validate API key.
When you sign up for your Validate account, it automatically creates an API key for you. You can also create and revoke API keys from the Validate application.
On the Validate Home page, the API Keys panel displays the list of API keys.
To create a new key:
In the API Keys panel on the Validate home page, click Create an API Key.
In the Name field, provide a name to use to identify the API key.
Click Create API Key.
Validate displays a message that the key was created. To copy the API key to use later, click Copy to Clipboard.
From the API Keys panel, to display an API key, click the key icon for the key.
You can then copy the key to use later.
To revoke an API key, in the API Keys panel, click the delete icon for the key to revoke.
RAG metrics include the following types of scores:
A benchmark is a set of questions that can optionally include the expected answers. For a Tonic Validate development project, a benchmark is one way to provide the questions for a run.
A run assesses how your RAG system answers the benchmark questions. If your benchmark includes answers, then Validate compares the answers from the benchmark with the answers from your RAG system.
To create and update benchmarks, you can use either the Validate application or the Validate SDK.
To display your list of benchmarks, in the Validate navigation menu, click Benchmarks.
For each benchmark, the Benchmarks page displays:
The name of the benchmark
The number of questions in the benchmark
You create a benchmark from the Benchmarks page.
To create a benchmark from the Benchmarks page:
Click Create A New Benchmark.
In the Name field, enter a name for the benchmark.
Click Save.
You can update the name and questions for an existing benchmark.
To update a benchmark:
On the Benchmarks page, either:
Click the benchmark name.
Click the options menu for the benchmark, then click Edit.
On the Edit Benchmark panel, to change the benchmark name, in the Name field, enter the new name.
You can also:
To save the changes, click Save.
To delete a benchmark, on the Benchmarks page:
Click the options menu for the benchmark.
In the options menu, click Delete.
A benchmark consists of a set of questions. For each question, you can optionally provide the expected response.
To add a question to a benchmark:
Click Add Q&A.
In the Question field, type the text of the question.
Optionally, in the Answer field, type the text of the expected answer. If you do not provide an answer, then Validate cannot calculate an answer similarity score for the question.
Click Finish Editing.
To update an existing question:
Click the edit icon for the question.
Update the Question and Answer fields.
Click Finish Editing.
To delete a question from a benchmark, click the delete icon for the question.
To delete all of the questions, click Clear All.
You can use the benchmarks from the UI in the Validate SDK by calling get_benchmark.
You can use the Validate SDK to create a benchmark from a list of questions and answers.
To upload this benchmark to the UI, use the new_benchmark method in ValidateApi.
Tonic Validate supports the following types of projects:
Development projects
Production monitoring projects
A Validate development project contains the results of Validate runs.
For a set of questions, which can be from a Validate benchmark, the run assesses the quality of the answers from your RAG system. If you also provide expected answers, then the run compares the answers from your RAG system against the provided answers. It also analyzes how the RAG system used additional context to answer the questions.
Each run generates an overall score and metrics.
For information about starting a run, go to .
For information about viewing run results, go to .
For development projects, you can also use our Ragas integration, tonic_ragas_logger, which allows you to display Validate visualizations of Ragas results.
A production monitoring project tracks the performance of a production RAG system.
You configure the RAG system to send to the production monitoring project:
The questions the RAG system receives
The answers it provided
The context it used to determine the answer
Validate then generates metrics for each question, and allows you to track the RAG system performance over time. Note that production monitoring does not use Ragas.
For information about configuring your RAG system to send questions to a Validate production monitoring project, go to Configuring your RAG system to send questions to Validate.
For information about viewing the results, go to Viewing the metric scores and logged questions.
The Validate home page includes the list of projects.
For each project, the list displays:
The project name
For development projects, when the most recent run occurred
For development projects, a chart that maps the average overall score for each run over time
To display the details for a project, click the project tile.
For a new development project that does not have any runs, the project details page guides you through the required steps to create a run.
If you use Validate runs to generate and visualize metrics, then select Validate as the logging framework.
For a development project that has completed runs, the project details page displays the list of runs, and provides an overview of the scores across the runs and questions.
The tiles across the top contain the average overall score and average metrics scores for the most recent run.
By default, the graph displays the overall score across all of the runs over time. When you click a metric score tile, the graph updates to show the average metric score across the runs.
Below the graph is the list of questions in the project benchmark. For each question, the list shows the overall score from a month ago and from the most recent run.
To filter the question list, in the filter field, type text from the question.
When you click a question, the graph is updated to show the average overall or metric score across runs for that specific question. To deselect the question, click it again.
For a new production monitoring project that does not have any results, the project details page guides you through the required steps to set up a feed of questions from the RAG system to the project.
For a production monitoring project that has received questions, the project details page shows a set of overall scores based on the most recent questions that the project received.
The overall scores are followed by a timeline that shows changes in the average metric scores over a selected timeframe.
Below the timeline is the list of questions with metric scores. When you click a point in the timeline, the questions are filtered to display questions that were received during that time.
For a new project, you provide a name and select the project type.
To create a project:
On the Validate home page, click Create a Project.
In the Name field, type the name of the project, then click Next.
Click the type of project to create, then click Save.
Validate displays the project details page.
For an existing project, you can change the name.
To edit the project name:
Either:
On the Edit Project panel, in the Project Name field, type the new name for the project.
Click Save.
To delete a project, from the projects list:
Click the options icon for the project.
In the options menu, click Delete.
When you calculate the metric, you provide the list of PII types to look for. For information about the available types, go to the .
If you use tonic_ragas_logger to visualize Ragas results in Validate, then select Ragas as the logging framework.
At the left is the list of runs for the project. From there, you can .
From the Validate Home page, click the options menu for the project, then click Edit.
From the project details page, click Edit Name.
Inputs: Retrieved context, LLM answer
Formula: (Count of main points in answer that can be attributed to context) / (Count of main points in the answer)
What it measures: Whether the LLM answer contains information that does not come from the context.
Components evaluated: Prompt builder, LLM

Inputs: LLM answer, List of PII types
Formula: Calculated by Textual
What it measures: Whether the LLM answer contains personally identifiable information (PII) of the specified types. Requires a Tonic Textual API key.
Components evaluated: Prompt builder, LLM

Inputs: LLM answer, Text string, Case-sensitivity flag
Formula: Compare LLM answer to text string
What it measures: Whether the answer matches the provided text string.
Components evaluated: LLM

Inputs: Question, Reference answer, LLM answer
Formula: Score between 0 and 5
What it measures: How well the reference answer matches the LLM answer. Cannot be used for production monitoring projects.
Components evaluated: All components

Inputs: Retrieved context, LLM answer
Formula: (Count of retrieved context in LLM answer) / (Count of retrieved context)
What it measures: Whether all of the context is in the LLM answer.
Components evaluated: Prompt builder, LLM

Inputs: Question, Retrieved context, LLM answer
Formula: (Count of relevant retrieved context in LLM answer) / (Count of relevant retrieved context)
What it measures: Whether the relevant context is in the LLM answer.
Components evaluated: Prompt builder, LLM

Inputs: Callback
Formula: User-defined
What it measures: Returns a true or false value based on a callback function that you provide. Cannot be used for production monitoring projects.
Components evaluated: User-defined

Inputs: LLM answer, Text string
Formula: Text.in(LLM answer)
What it measures: Whether the response contains the provided text string.
Components evaluated: LLM

Inputs: Retrieved context, List of PII types
Formula: Calculated by Textual
What it measures: Whether the context used for the response contains PII of the specified types. Requires a Tonic Textual API key.
Components evaluated: Prompt builder

Inputs: Retrieved context, Minimum length, Maximum length
Formula: (Minimum length) <= len(Context) <= (Maximum length)
What it measures: Whether the length of a context item falls within the specified range.
Components evaluated: Prompt builder

Inputs: LLM answer
Formula: Returns 1 or 0 based on whether there is duplicate information
What it measures: Whether the response contains duplicate information.
Components evaluated: LLM

Inputs: LLM answer
Formula: Returns 1 or 0 based on whether there is hate speech
What it measures: Whether the response contains hate speech.
Components evaluated: LLM

Inputs: Target length of time
Formula: (Run time) <= (Target time)
What it measures: Whether the response takes longer than the provided target time.
Components evaluated: Entire system

Inputs: LLM answer, Regular expression, Expected number of matches
Formula: Runs a regex search and then counts the matches. Returns true if the number of matches is equal to the expected match count.
What it measures: Whether the response contains the expected number of matches for the provided regular expression.
Components evaluated: LLM

Inputs: LLM answer, Minimum length, Maximum length
Formula: (Minimum length) <= len(LLM response) <= (Maximum length)
What it measures: Whether the response length falls within the specified range.
Components evaluated: LLM

Inputs: Question, Retrieved context
Formula: (Count of relevant retrieved context) / (Count of retrieved context)
What it measures: Whether the context retrieved is relevant to answer the given question.
Components evaluated: Chunker, Embedder, Retriever
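Several of the formulas above are simple enough to compute without an LLM. The following is a minimal sketch of those checks in plain Python. It is not the SDK's implementation: the function names are hypothetical, and for retrieval precision the per-item relevance flags are assumed to come from elsewhere (in Validate, relevance is judged with LLM assistance).

```python
import re
from typing import List


def contains_text(llm_answer: str, text: str) -> bool:
    """Text.in(LLM answer): true if the text string appears in the answer."""
    return text in llm_answer


def length_in_range(value: str, minimum: int, maximum: int) -> bool:
    """(Minimum length) <= len(value) <= (Maximum length)."""
    return minimum <= len(value) <= maximum


def regex_match_count_ok(llm_answer: str, pattern: str, expected: int) -> bool:
    """Count the non-overlapping regex matches and compare to the expected count."""
    return len(re.findall(pattern, llm_answer)) == expected


def within_target_time(run_time_seconds: float, target_seconds: float) -> bool:
    """(Run time) <= (Target time)."""
    return run_time_seconds <= target_seconds


def retrieval_precision(relevance_flags: List[bool]) -> float:
    """(Count of relevant retrieved context) / (Count of retrieved context).

    Returning 0.0 for an empty list is an assumption made for this sketch.
    """
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)
```

For example, three relevant items out of four retrieved gives a retrieval precision of 0.75.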
To configure your RAG system to log questions and answers to Tonic Validate, add a call to the Validate log function wherever your RAG system answers a question from a user.
The call to log includes:
The identifier of the Validate production monitoring project to send the question to
The text of the question from the user to the RAG system
The answer that the RAG system provided to the user
The context that the RAG system used to answer the question
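The four items above can be sketched as a thin wrapper around the point where the RAG system answers a user. In this sketch, log_fn stands in for the Validate log function, and its keyword names (project_id, question, answer, context_list) are assumptions; rag_system and answer_and_log are hypothetical names. Check the SDK reference for the exact signature.

```python
from typing import Callable, List, Tuple


def answer_and_log(
    question: str,
    rag_system: Callable[[str], Tuple[str, List[str]]],
    log_fn: Callable[..., None],
    project_id: str,
) -> str:
    """Answer a user question, then log it to a production monitoring project."""
    # The RAG system is assumed to return the answer and the context it used.
    answer, context_list = rag_system(question)

    # Send the project identifier, question, answer, and context.
    log_fn(
        project_id=project_id,
        question=question,
        answer=answer,
        context_list=context_list,
    )
    return answer
```

Passing the log function in as a parameter keeps the wrapper easy to test with a stand-in before it is connected to a real project.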
As your RAG system sends questions to the Validate production monitoring project, Validate by default generates the following metrics scores for each question:
You can also request additional metrics. For information about the available metrics, go to RAG metrics reference.
To start and upload a run, use ValidateScorer to score the run.
Next, to upload the run with ValidateApi, use an API token to connect to Tonic Validate, then specify the project identifier.
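The two steps might look like the following sketch. ValidateScorer and ValidateApi are the classes named above; the score and upload_run method names and their argument order reflect the SDK as we understand it and should be checked against the installed version. The wrapper function score_and_upload and its parameters are hypothetical.

```python
def score_and_upload(benchmark, get_rag_response, api_token, project_id):
    """Score a benchmark with the default metrics, then upload the run."""
    # Imported here so that this sketch can be read without the SDK installed.
    # Requires the tonic_validate package and an LLM API key in the environment.
    from tonic_validate import ValidateApi, ValidateScorer

    # Step 1: score the run. The callback supplies the RAG system's responses.
    scorer = ValidateScorer()
    run = scorer.score(benchmark, get_rag_response)

    # Step 2: connect with an API token and upload to the given project.
    validate_api = ValidateApi(api_token)
    validate_api.upload_run(project_id, run)
    return run
```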
When you create a run, you specify the LLM evaluator and the metrics to calculate on the responses that are logged during the run.
For the LLM evaluator, we currently support the following models:
OpenAI, including Azure's OpenAI service
Gemini
Anthropic
To change your model, use the model_evaluator argument to pass the model string to ValidateScorer.
For example:
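A minimal sketch of that argument follows. The helper name make_scorer is hypothetical, and the model string in the usage comment is only a placeholder, not a recommendation.

```python
def make_scorer(model_name: str):
    """Create a scorer that uses the given model as the LLM evaluator."""
    # Imported here so that this sketch can be read without the SDK installed.
    from tonic_validate import ValidateScorer

    # For Azure, the same argument carries the deployment name instead.
    return ValidateScorer(model_evaluator=model_name)


# Usage (placeholder model string):
# scorer = make_scorer("gpt-4")
```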
If you use Azure, then instead of the model name, you pass in your deployment name:
For information on how to set up an Azure deployment, go to #validate-sdk-azure-openai.
By default, the following RAG metrics are calculated:
Answer similarity - Note that if you do not provide expected answers to your benchmark questions, then Tonic Validate cannot determine answer similarity.
Augmentation precision
Answer consistency
When you create a run, if you only pass in the LLM evaluator, then the run calculates all of these metrics.
To specify the metrics to calculate during the run, pass a list of the metrics:
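As a sketch, a run limited to two of the default metrics could be set up as follows. The metric class names and the metrics keyword reflect the SDK as we understand it; confirm them against the installed version. The helper name make_scorer_with_metrics is hypothetical.

```python
def make_scorer_with_metrics():
    """Create a scorer that calculates only the listed metrics."""
    # Imported here so that this sketch can be read without the SDK installed.
    from tonic_validate import ValidateScorer
    from tonic_validate.metrics import (
        AnswerConsistencyMetric,
        AugmentationPrecisionMetric,
    )

    # Pass metric instances, not classes.
    metrics = [AnswerConsistencyMetric(), AugmentationPrecisionMetric()]
    return ValidateScorer(metrics=metrics)
```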
To provide the RAG system response, create a callback function that returns:
A string that contains the RAG system's response
A list of strings that represent the list of retrieved context from the RAG system
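A callback with that shape can be sketched as below. The toy retriever and generator are hypothetical stand-ins for a real RAG system, and the dictionary keys llm_answer and llm_context_list are our understanding of what the SDK expects; verify them against the SDK reference.

```python
from typing import Dict, List, Union

# Hypothetical in-memory "document store" used only for this sketch.
DOCUMENTS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]


def retrieve(question: str) -> List[str]:
    """Toy retriever: keep documents that share a word with the question."""
    words = {w.strip("?.,").lower() for w in question.split()}
    return [d for d in DOCUMENTS if words & {w.strip(".").lower() for w in d.split()}]


def generate(question: str, context: List[str]) -> str:
    """Toy generator: answer with the first retrieved document."""
    return context[0] if context else "I don't know."


def get_rag_response(question: str) -> Dict[str, Union[str, List[str]]]:
    """Callback: return the response string and the list of context strings."""
    context = retrieve(question)
    answer = generate(question, context)
    return {"llm_answer": answer, "llm_context_list": context}
```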
When you log the RAG system answer and retrieved context, the metrics for the answer and retrieved context are calculated on your machine. They are then sent to the Validate application.
This can take some time, because calculating each metric requires at least one call to the LLM's API.
To speed up your runs, you can pass parallelism arguments to the score function. These create additional threads that score your metrics faster.
scoring_parallelism controls the number of threads that are used to score the responses.
For example, if scoring_parallelism is set to 2, then for scoring, two threads can call the LLM's API simultaneously.
callback_parallelism controls the number of threads that are used to call the callback that you provided.
For example, if callback_parallelism is set to 2, then two threads can call your callback function simultaneously.
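To illustrate what the two settings bound, here is a conceptual sketch that uses standard-library thread pools. It is not how the SDK is implemented; run_with_parallelism and score_fn are hypothetical names, and score_fn stands in for the LLM-assisted scoring step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

R = TypeVar("R")
S = TypeVar("S")


def run_with_parallelism(
    questions: Sequence[str],
    callback: Callable[[str], R],
    score_fn: Callable[[R], S],
    callback_parallelism: int = 1,
    scoring_parallelism: int = 1,
) -> List[S]:
    """Call the callback and the scorer with separately bounded thread pools."""
    # At most callback_parallelism threads call the RAG callback at once.
    with ThreadPoolExecutor(max_workers=callback_parallelism) as pool:
        responses = list(pool.map(callback, questions))

    # At most scoring_parallelism threads call the scoring step at once.
    with ThreadPoolExecutor(max_workers=scoring_parallelism) as pool:
        return list(pool.map(score_fn, responses))
```

Because both steps are dominated by network calls, threads help even in Python. Higher values can run into the rate limits of the LLM provider or of your own RAG system.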
This development project example uses:
The data found in the examples/paul_graham_essays folder of the tonic_validate SDK repository
The list of questions and reference answers found in examples/question_and_answer_list.json
We use six Paul Graham essays about startup founders taken from his blog. With these we build a RAG system that uses the simplest LlamaIndex default model.
We load the question and answer list and use it to create a Tonic Validate benchmark.
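The loading step can be sketched as follows, assuming that the JSON file holds a list of objects with question and answer keys. That file shape and the helper name split_question_answer_pairs are assumptions of this sketch, so adjust the key names to match the actual file.

```python
import json
from typing import List, Tuple


def split_question_answer_pairs(raw_json: str) -> Tuple[List[str], List[str]]:
    """Split a JSON list of question/answer objects into two parallel lists."""
    pairs = json.loads(raw_json)
    questions = [pair["question"] for pair in pairs]
    answers = [pair["answer"] for pair in pairs]
    return questions, answers
```

The two lists can then be used to build the benchmark, for example with Benchmark(questions=questions, answers=answers) from tonic_validate.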
Next, we connect to Validate using an API token we generated from the Validate application, and create a new development project and benchmark.
Finally, we can create a run and score it.
After you execute this code, you can upload your results to the Validate application and view them.
The metrics are automatically calculated and logged to Validate. The distribution of the scores over the benchmark is also graphed.
At the top of the project details page are the average metrics scores. The scores are based on the most recently received questions.
When you click a score, the timeline chart is updated to display the average scores for that metric across time. The averages are grouped by hour.
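The hourly grouping can be pictured with a short sketch. This only illustrates the aggregation that the chart displays and is not Validate's code; hourly_averages is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def hourly_averages(
    scored_questions: Iterable[Tuple[datetime, float]],
) -> Dict[datetime, float]:
    """Average metric scores per hour, keyed by the start of each hour."""
    buckets: Dict[datetime, List[float]] = defaultdict(list)
    for received_at, score in scored_questions:
        # Truncate the timestamp to the start of its hour.
        hour = received_at.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(score)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}
```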
By default, the time range for the timeline reflects all time since Validate began to receive questions from the RAG system.
You can use the date pickers above the timeline to select a different time range.
Below the timeline is the list of questions that Validate received from the RAG system, along with the metric scores for that question.
By default, the list of questions includes all questions that were received during the time frame for the timeline.
To set a time range for which to display questions, use the date pickers above the question list.
To filter the questions to only those that Validate received during a specific point on the timeline, click that point.
You use the Tonic Validate application to view the results of each run.
If you use the Ragas integration, tonic_ragas_logger, then each upload of Ragas results is displayed as a run in Validate.
From the run list on the project details page, to display the run results, click the run.
The run details show the run questions and scores. They also provide the option to delete the run.
The run details replace the project overview with the metrics, chart, and questions. To return to the overview, click Show Project Overview.
The Overview tab summarizes the scores for the questions.
The tiles at the top of the Overview tab show the average overall score and metrics scores from across the entire run.
Below the composite scores are bar graphs for the overall score and the metrics scores.
For each range of score values in the x-axis, the graph displays the number of questions that received scores that fall within that range.
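The counts behind such a bar graph can be sketched as below. The bin edges are chosen by the caller and are an assumption of this sketch, as is the choice to include the top edge in the last bin; score_histogram is a hypothetical name.

```python
from typing import List, Sequence


def score_histogram(scores: Sequence[float], bin_edges: Sequence[float]) -> List[int]:
    """Count the scores that fall within each [low, high) range.

    The last range also includes its upper edge so that the maximum
    score is counted.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for score in scores:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            if low <= score < high or (i == last and score == high):
                counts[i] += 1
                break
    return counts
```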
The Scores tab provides a spreadsheet list of the run questions and their overall and metrics scores.
You can sort the list by any of the columns. To sort by a selected column, click the column heading. To reverse the sort, click the heading again.
The Metadata tab provides any metadata that was provided when the run was started.
The Questions & Answers tab provides a detailed list of questions that were included in the run.
For each question, the list includes:
The text of the question.
The reference answer - this is the answer that you expected.
The answer that your LLM provided.
The context that the LLM used to answer the question.
The overall and metrics scores for the question.
Here is a full question entry from the Questions & Answers tab:
To delete a run:
In the run details heading, click Delete.
On the confirmation panel, click Delete.