To start and upload a run, use ValidateScorer to score the run. Next, to upload the run with ValidateApi, use an API token to connect to Tonic Validate, then specify the project identifier.
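The score-then-upload flow can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes the `tonic_validate` package is installed and that credentials for the evaluator model are already configured. The `Benchmark` class and the `upload_run` method are assumptions based on typical SDK usage, and `score_and_upload` and its parameters are hypothetical names introduced for illustration.

```python
def score_and_upload(api_token, project_id, questions, answers, get_rag_response):
    """Score a run locally, then upload it to Tonic Validate."""
    # Imported here so the sketch loads without the SDK installed;
    # calling the function requires the tonic_validate package.
    from tonic_validate import Benchmark, ValidateApi, ValidateScorer

    # Benchmark questions with their expected answers
    # (assumed constructor arguments; check your SDK version).
    benchmark = Benchmark(questions=questions, answers=answers)

    # Score the run with the default evaluator and metrics.
    scorer = ValidateScorer()
    run = scorer.score(benchmark, get_rag_response)

    # Connect with the API token, then upload the run to the project.
    validate_api = ValidateApi(api_token)
    validate_api.upload_run(project_id, run)
    return run
```

Here `get_rag_response` is the callback that returns your RAG system's response for a question.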
When you create a run, you specify the LLM evaluator and the metrics to calculate on the responses that are logged during the run.
For the LLM evaluator, we currently support the following models:
OpenAI, including Azure's OpenAI service
Gemini
Anthropic
To change your model, use the model_evaluator argument to pass the model string to ValidateScorer.
For example:
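A minimal sketch of passing a model string, assuming the `tonic_validate` package is installed. The helper name `make_scorer` is hypothetical, and `"gpt-4"` is only an illustrative placeholder; substitute the model string that your provider and SDK version accept.

```python
def make_scorer(model_name="gpt-4"):
    """Create a scorer that uses the given model as the LLM evaluator."""
    # Deferred import: calling this requires the tonic_validate package.
    from tonic_validate import ValidateScorer

    # model_evaluator takes the model string.
    return ValidateScorer(model_evaluator=model_name)
```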
If you use Azure, then instead of the model name, you pass in your deployment name:
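A sketch of the Azure case, assuming the `tonic_validate` package is installed and your Azure OpenAI credentials are already configured as described in the Azure setup instructions. The helper name `make_azure_scorer` and the deployment name in the usage note are hypothetical.

```python
def make_azure_scorer(deployment_name):
    """Create a scorer that targets an Azure OpenAI deployment."""
    # Deferred import: calling this requires the tonic_validate package.
    from tonic_validate import ValidateScorer

    # The deployment name goes where the model name would otherwise go.
    return ValidateScorer(model_evaluator=deployment_name)
```

For a deployment that you named `my-gpt4-deployment`, you would call `make_azure_scorer("my-gpt4-deployment")`.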
For information on how to set up an Azure deployment, go to #validate-sdk-azure-openai.
By default, the following RAG metrics are calculated:
Answer similarity - Note that if you do not provide expected answers to your benchmark questions, then Tonic Validate cannot determine answer similarity.
Augmentation precision
Answer consistency
When you create a run, if you only pass in the LLM evaluator, then the run calculates all of these metrics.
To specify the metrics to calculate during the run, pass a list of the metrics:
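A minimal sketch of restricting a run to two of the default metrics. It assumes that the metric classes live in `tonic_validate.metrics` under the names shown and that the metric list is the first argument to `ValidateScorer`; verify both against your SDK version. The helper name and the `"gpt-4"` model string are illustrative.

```python
def make_scorer_with_metrics(model_name="gpt-4"):
    """Create a scorer that calculates only the listed metrics."""
    # Deferred imports: calling this requires the tonic_validate package.
    from tonic_validate import ValidateScorer
    from tonic_validate.metrics import (
        AnswerConsistencyMetric,
        AugmentationPrecisionMetric,
    )

    # Answer similarity is left out, which suits benchmarks
    # that have no expected answers.
    metrics = [AnswerConsistencyMetric(), AugmentationPrecisionMetric()]
    return ValidateScorer(metrics, model_evaluator=model_name)
```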
To provide the RAG system response, create a callback function that returns:
A string that contains the RAG system's response
A list of strings that represent the list of retrieved context from the RAG system
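A callback of this shape might look like the sketch below. The dictionary keys `llm_answer` and `llm_context_list` are an assumption about how the SDK expects the two values to be packaged, and the in-memory document list and keyword lookup are hypothetical stand-ins for a real retriever and LLM call.

```python
# Hypothetical in-memory "knowledge base" standing in for a real retriever.
DOCUMENTS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]


def get_rag_response(question: str) -> dict:
    """Return the RAG answer and the retrieved context for one question."""
    # Naive retrieval: keep documents that share a word with the question.
    words = {w.strip("?.,").lower() for w in question.split()}
    context = [
        doc for doc in DOCUMENTS
        if words & {w.strip("?.,").lower() for w in doc.split()}
    ]

    # Stand-in for the LLM call: answer with the first retrieved document.
    answer = context[0] if context else "I don't know."

    return {
        "llm_answer": answer,          # a string with the response
        "llm_context_list": context,   # a list of strings with the context
    }
```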
When you log the RAG system answer and retrieved context, the metrics for the answer and retrieved context are calculated on your machine. They are then sent to the Validate application.
This can take some time, because calculating each metric requires at least one call to the LLM's API.
If you want to speed up your runs, you can pass a parallelism argument to the score function to create additional threads to score your metrics faster.
scoring_parallelism controls the number of threads that are used to score the responses. For example, if scoring_parallelism is set to 2, then for scoring, two threads can call the LLM's API simultaneously.
callback_parallelism controls the number of threads that are used to call the callback that you provided. For example, if callback_parallelism is set to 2, then two threads can call your callback function simultaneously.