Starting a Validate run
To start a run, use ValidateScorer to score the responses. To upload the run, use ValidateApi with an API token to connect to Tonic Validate, and specify the project identifier.
from tonic_validate import ValidateScorer, ValidateApi, Benchmark

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])

# Score the responses for each question and answer pair
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

# Upload the run to the project in the UI
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)
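If you prefer not to hardcode the API token, you can read it from an environment variable before you construct ValidateApi. This is a minimal sketch; the variable name TONIC_VALIDATE_API_KEY is an assumption and may differ from your setup.

import os

# Read the API token from an environment variable instead of hardcoding it.
# The variable name used here is an example; adjust it to match your setup.
validate_api = ValidateApi(os.environ["TONIC_VALIDATE_API_KEY"])
validate_api.upload_run("your-project-id", run)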
Configuring the run
When you create a run, you specify the LLM evaluator and the metrics to calculate on the responses that are logged during the run.
For the LLM evaluator, we currently support the following models:
OpenAI, including Azure's OpenAI service
Gemini
Anthropic
To change your model, use the model_evaluator argument to pass the model string to ValidateScorer.
For example:
scorer = ValidateScorer(model_evaluator="gpt-4")
scorer = ValidateScorer(model_evaluator="gemini/gemini-1.5-pro-latest")
scorer = ValidateScorer(model_evaluator="claude-3")
If you use Azure, then instead of the model name, you pass in your deployment name:
scorer = ValidateScorer(model_evaluator="my-azure-deployment-name")
For information on how to set up an Azure deployment, go to Using Azure's OpenAI service.
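For example, the Azure credentials are typically supplied through environment variables before you create the scorer. The sketch below is a minimal example; the variable names AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT are assumptions, so confirm the exact names in the Azure setup guide.

import os
from tonic_validate import ValidateScorer

# These environment variable names are assumptions and may differ depending
# on your Azure OpenAI configuration; see the Azure setup guide.
os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-api-key"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource-name.openai.azure.com/"

scorer = ValidateScorer(model_evaluator="my-azure-deployment-name")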
Configuring the run metrics
By default, the following RAG metrics are calculated:
Answer similarity - Note that if you do not provide expected answers to your benchmark questions, then Tonic Validate cannot determine answer similarity.
Augmentation precision
Answer consistency
When you create a run, if you only pass in the LLM evaluator, then the run calculates all of these metrics.
To specify the metrics to calculate during the run, pass a list of the metrics:
from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AugmentationAccuracyMetric

scorer = ValidateScorer([
    AnswerConsistencyMetric(),
    AugmentationAccuracyMetric()
])
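If your benchmark includes expected answers, you can also request answer similarity explicitly. This is a sketch that assumes the metric classes in tonic_validate.metrics are named AnswerSimilarityMetric, AugmentationPrecisionMetric, and AnswerConsistencyMetric; check the package for the exact names.

from tonic_validate import ValidateScorer
from tonic_validate.metrics import (
    AnswerSimilarityMetric,
    AugmentationPrecisionMetric,
    AnswerConsistencyMetric
)

# Answer similarity compares the LLM answer to the expected answer, so the
# benchmark must provide expected answers for this metric to be calculated.
scorer = ValidateScorer([
    AnswerSimilarityMetric(),
    AugmentationPrecisionMetric(),
    AnswerConsistencyMetric()
])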
Providing the RAG system responses and retrieved context to Validate
To provide the RAG system response and retrieved context, create a callback function that returns:
A string that contains the RAG system's response
A list of strings that contains the retrieved context from the RAG system
# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)
When you log the RAG system answer and retrieved context, the metrics for the answer and retrieved context are calculated on your machine and then sent to the Validate application. This step can take some time, because each metric requires at least one call to the LLM evaluator's API.
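Because the scores are calculated locally, you can inspect them before you upload the run. The snippet below assumes that the Run object returned by score exposes an overall_scores attribute; the attribute name may differ between package versions.

# Inspect the locally calculated scores before uploading the run.
# Assumes the Run object exposes an overall_scores attribute.
print(run.overall_scores)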
Using parallelism to speed up runs
To speed up your runs, you can pass parallelism arguments to the score function to create additional threads that score your metrics faster.
scorer.score(
    benchmark,
    get_rag_response,
    scoring_parallelism=2,
    callback_parallelism=2
)
scoring_parallelism controls the number of threads that are used to score the responses. For example, if scoring_parallelism is set to 2, then two threads can call the LLM evaluator's API simultaneously during scoring.
callback_parallelism controls the number of threads that are used to call the callback that you provided. For example, if callback_parallelism is set to 2, then two threads can call your callback function simultaneously.