About using metrics to evaluate a RAG system

Evaluating a RAG system is difficult because:

  • There are many moving pieces. For a list of components, go to RAG components summary.

  • You need to evaluate natural language responses.

  • Building RAG systems is relatively new. There are no well-established methods for evaluating the performance of a RAG system.

About our RAG metrics

Our RAG metrics use the input and output of your RAG system to test all of the components of the system.

Each metric calculates a score that measures the performance of a different aspect of a typical RAG system.

The metrics use LLM-assisted evaluation to assess the natural language parts of a RAG system: the question, the answer, and the retrieved context.
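
For example, a faithfulness check can ask a judge LLM whether the answer is supported by the retrieved context. The sketch below assumes a generic llm callable that takes a prompt string and returns the model's reply; the prompt wording and score parsing are illustrative, not the exact prompts the metrics use.

    from typing import Callable

    def judge_faithfulness(llm: Callable[[str], str], question: str,
                           answer: str, context: str) -> float:
        """Ask a judge LLM whether the answer is grounded in the context.

        Hypothetical sketch: returns 1.0 if the judge replies YES, else 0.0.
        """
        prompt = (
            "You are grading a RAG system.\n"
            f"Question: {question}\n"
            f"Retrieved context: {context}\n"
            f"Answer: {answer}\n"
            "Is every claim in the answer supported by the retrieved "
            "context? Reply with exactly YES or NO."
        )
        verdict = llm(prompt).strip().upper()
        return 1.0 if verdict.startswith("YES") else 0.0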

The metrics include both end-to-end RAG system metrics and metrics that are based on a binary classification framework.
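
One way to read the binary classification framing: each retrieved chunk gets a yes/no relevance judgment, and standard precision and recall follow from the counts. A minimal sketch under that assumption:

    def precision_recall(judgments: list[bool],
                         total_relevant: int) -> tuple[float, float]:
        """Compute retrieval precision and recall from yes/no judgments.

        judgments: one relevance verdict per retrieved chunk.
        total_relevant: how many relevant chunks exist for the question.
        """
        hits = sum(judgments)
        precision = hits / len(judgments) if judgments else 0.0
        recall = hits / total_relevant if total_relevant else 0.0
        return precision, recall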

How to use the metrics

To use the RAG metrics to evaluate and improve your RAG system, you first build a benchmark dataset of questions and reference answers, or questions only, to test the RAG system with.
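
The benchmark can be as simple as a list of records in which the reference answer is optional. A hypothetical layout:

    # A hypothetical benchmark: questions with optional reference answers.
    benchmark = [
        {"question": "What is the capital of France?",
         "reference_answer": "Paris"},
        {"question": "When was the company founded?",
         "reference_answer": None},  # question-only entry
    ]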

Next, run your RAG system on the benchmark questions and record each answer along with the retrieved context that was used to generate it.
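
Continuing the benchmark example, the sketch below assumes a hypothetical rag_query function that stands in for your RAG system and returns the answer together with the retrieved context chunks.

    def rag_query(question: str) -> tuple[str, list[str]]:
        """Hypothetical stand-in for your RAG system; replace with a real
        call that returns (answer, retrieved_context_chunks)."""
        return "stub answer", ["stub context chunk"]

    # Record each answer and the context that was used to generate it.
    results = []
    for item in benchmark:
        answer, retrieved_context = rag_query(item["question"])
        results.append({
            "question": item["question"],
            "reference_answer": item["reference_answer"],
            "answer": answer,
            "retrieved_context": retrieved_context,
        })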

After you have this information, you can calculate each RAG metric and see how different aspects of your RAG system perform.
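
A score for one aspect is typically the average of a per-example metric over the whole benchmark. A sketch, reusing the hypothetical judge_faithfulness function from the earlier example:

    def average_metric(results: list[dict], metric_fn) -> float:
        """Average a per-example metric over all benchmark results."""
        scores = [metric_fn(r) for r in results]
        return sum(scores) / len(scores) if scores else 0.0

    # Example: average faithfulness, joining context chunks into one string.
    # faithfulness = average_metric(results, lambda r: judge_faithfulness(
    #     llm, r["question"], r["answer"], " ".join(r["retrieved_context"])))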

To improve your RAG system, focus on the RAG metrics where the system performs poorly, and then modify the parts of the RAG system that affect those metrics.
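
The mapping from a weak metric to the component to revisit depends on your architecture; the names below are hypothetical but show the idea.

    # Hypothetical mapping from a low-scoring metric to components to revisit.
    component_hints = {
        "context_relevance": "retriever: embedding model, chunk size, top-k",
        "faithfulness": "generator: prompt template, model choice",
        "answer_correctness": "end to end: retriever and generator together",
    }

    def weakest_metric(scores: dict[str, float]) -> str:
        """Return the lowest-scoring metric to prioritize first."""
        return min(scores, key=scores.get)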
