Starting a Validate run

To start a run, use ValidateScorer to score your RAG system's responses against a benchmark.

Next, to upload the run, use ValidateApi with an API token to connect to Tonic Validate, then specify the project identifier.

from tonic_validate import ValidateScorer, ValidateApi, Benchmark

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])
# Score the responses for each question and answer pair
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

# Upload the run to the Tonic Validate UI
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)

Configuring the run

When you create a run, you specify the LLM evaluator and the metrics to calculate on the responses that are logged during the run.

For the LLM evaluator, we currently support the following models:

  • OpenAI, including Azure's OpenAI service

  • Gemini

  • Anthropic

To change the model, use the model_evaluator argument to pass the model string to ValidateScorer.

For example:

scorer = ValidateScorer(model_evaluator="gpt-4")
scorer = ValidateScorer(model_evaluator="gemini/gemini-1.5-pro-latest")
scorer = ValidateScorer(model_evaluator="claude-3")

If you use Azure, then instead of the model name, you pass in your deployment name:

scorer = ValidateScorer(model_evaluator="my-azure-deployment-name")

Configuring the run metrics

By default, the following RAG metrics are calculated:

  • Answer similarity - Note that if you do not provide expected answers to your benchmark questions, then Tonic Validate cannot determine answer similarity.

  • Augmentation precision

  • Answer consistency

When you create a run, if you only pass in the LLM evaluator, then the run calculates all of these metrics.
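Passing only the LLM evaluator is equivalent to listing the default metrics explicitly. A minimal sketch of that equivalence, assuming the metric classes AnswerSimilarityMetric and AugmentationPrecisionMetric are available in tonic_validate.metrics alongside AnswerConsistencyMetric:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import (
    AnswerSimilarityMetric,       # assumed class name for answer similarity
    AugmentationPrecisionMetric,  # assumed class name for augmentation precision
    AnswerConsistencyMetric,
)

# Passing the default metrics explicitly; equivalent to ValidateScorer()
scorer = ValidateScorer([
    AnswerSimilarityMetric(),
    AugmentationPrecisionMetric(),
    AnswerConsistencyMetric()
])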

To specify the metrics to calculate during the run, pass a list of the metrics:

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AugmentationAccuracyMetric

scorer = ValidateScorer([
    AnswerConsistencyMetric(),
    AugmentationAccuracyMetric()
])

Providing the RAG system responses and retrieved context to Validate

To provide the RAG system response, create a callback function that takes a question and returns:

  • A string that contains the RAG system's response

  • A list of strings that represent the list of retrieved context from the RAG system

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }
     
# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

When you log the RAG system answer and retrieved context, the metrics are calculated on your machine and then sent to the Validate application.

Calculating each metric requires at least one call to the evaluator LLM's API, so this may take some time.
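
Because scoring happens locally, you can inspect the results before uploading them. A minimal sketch, assuming the returned run object exposes overall_scores and run_data attributes:

# Inspect the scored run locally before uploading
# (assumes the run object exposes overall_scores and run_data)
print(run.overall_scores)  # aggregate score per metric

for item in run.run_data:
    # per-question scores alongside the original benchmark question
    print(item.reference_question, item.scores)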

Using parallelism to speed up runs

To speed up your runs, you can pass parallelism arguments to the score function to create additional threads and score your metrics faster:

scorer.score(
    benchmark,
    get_rag_response,
    scoring_parallelism=2,
    callback_parallelism=2
)

  • scoring_parallelism controls the number of threads that are used to score the responses. For example, if scoring_parallelism is set to 2, then for scoring, 2 threads can call the LLM's API simultaneously.

  • callback_parallelism controls the number of threads that are used to call the callback that you provided. For example, if callback_parallelism is 2, then two threads can call your callback function simultaneously.
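
Putting it together, a sketch of scoring with parallelism and then uploading the run; the API key and project identifier are placeholders:

from tonic_validate import ValidateScorer, ValidateApi

scorer = ValidateScorer()

# Score with two threads for scoring and two for the callback
run = scorer.score(
    benchmark,
    get_rag_response,
    scoring_parallelism=2,
    callback_parallelism=2
)

# Upload the scored run to Tonic Validate
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)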
