Overview

The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, working on Retrieval-Augmented Generation (RAG), or fine-tuning a text generation task, evaluation is critical to ensuring consistent performance across different inputs, models, and parameters. In this section, we explain how to use agenta to quickly evaluate and compare the performance of your LLM applications.

Set up evaluation

Run evaluations

Available evaluators

| Evaluator Name | Use Case | Type | Description |
|---|---|---|---|
| Exact Match | Classification / Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| Contains JSON | Classification / Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| Regex Test | Classification / Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| JSON Field Match | Classification / Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| JSON Diff Match | Classification / Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| Similarity Match | Text Generation / Chatbot | Similarity Metrics | Compares generated output with the expected result using Jaccard similarity. |
| Semantic Similarity Match | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| Starts With | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| Ends With | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| Contains | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| Contains Any | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| Contains All | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| Levenshtein Distance | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between the output and the expected result. |
| LLM-as-a-judge | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM model for critique and evaluation. |
| RAG Faithfulness | RAG / Text Generation / Chatbot | LLM-based | Evaluates whether the output is faithful to the retrieved documents in RAG workflows. |
| RAG Context Relevancy | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| Custom Code Evaluation | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| Webhook Evaluator | Custom Logic | Custom | Sends output to a webhook for external evaluation. |
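
Most of the pattern-matching and similarity evaluators in the table above reduce to simple string operations. The sketch below is plain Python (standard library only) and is not agenta's internal implementation; it only illustrates roughly how a few of these checks can be computed:

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Exact Match: the output must equal the expected result exactly."""
    return output == expected


def contains_json(output: str) -> bool:
    """Contains JSON: the output must contain a parseable JSON object."""
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        return False
    try:
        json.loads(match.group())
        return True
    except json.JSONDecodeError:
        return False


def regex_test(output: str, pattern: str) -> bool:
    """Regex Test: the output must match the given regex pattern."""
    return re.search(pattern, output) is not None


def jaccard_similarity(output: str, expected: str) -> float:
    """Similarity Match: Jaccard similarity over word sets, from 0.0 to 1.0."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def levenshtein_distance(output: str, expected: str) -> int:
    """Levenshtein Distance: minimum number of single-character edits."""
    if len(output) < len(expected):
        output, expected = expected, output
    previous = list(range(len(expected) + 1))
    for i, c1 in enumerate(output, start=1):
        current = [i]
        for j, c2 in enumerate(expected, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (c1 != c2)))    # substitution
        previous = current
    return previous[-1]
```

Note the difference between the two similarity metrics: Jaccard similarity compares word sets and ignores word order, while Levenshtein distance counts character-level edits and is therefore sensitive to small typos.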
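
The Custom Code Evaluation evaluator lets you implement this kind of logic yourself in Python. The snippet below is only an illustrative sketch: the function name, parameters, and return value shown here are assumptions, so check the evaluator's documentation for the exact interface agenta expects.

```python
# Hypothetical evaluator signature -- the exact interface expected by the
# Custom Code Evaluation evaluator may differ; see the agenta docs.
def evaluate(app_params: dict, inputs: dict, output: str, correct_answer: str) -> float:
    """Return a score between 0.0 and 1.0 for a single test case."""
    # Example scoring logic: full credit for an exact match, partial credit
    # when the expected answer appears somewhere in the output.
    if output.strip() == correct_answer.strip():
        return 1.0
    if correct_answer.strip() and correct_answer.strip() in output:
        return 0.5
    return 0.0
```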