Overview

The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, working on Retrieval-Augmented Generation (RAG), or fine-tuning a text generation task, evaluation is critical to ensuring consistent performance across different inputs, models, and parameters. In this section, we explain how to use agenta to quickly evaluate and compare the performance of your LLM applications.

Set up evaluation

Run evaluations

Available evaluators

| Evaluator Name | Use Case | Type | Description |
|---|---|---|---|
| Exact Match | Classification / Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| Contains JSON | Classification / Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| Regex Test | Classification / Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| JSON Field Match | Classification / Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| JSON Diff Match | Classification / Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| Similarity Match | Text Generation / Chatbot | Similarity Metrics | Compares generated output with the expected result using Jaccard similarity. |
| Semantic Similarity Match | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| Starts With | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| Ends With | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| Contains | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| Contains Any | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| Contains All | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| Levenshtein Distance | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between the output and the expected result. |
| LLM-as-a-judge | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM model for critique and evaluation. |
| RAG Faithfulness | RAG / Text Generation / Chatbot | LLM-based | Evaluates whether the output is faithful to the retrieved documents in RAG workflows. |
| RAG Context Relevancy | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| Custom Code Evaluation | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| Webhook Evaluator | Custom Logic | Custom | Sends output to a webhook for external evaluation. |
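
Most of the pattern-matching and similarity evaluators in the table above reduce to simple string operations. The sketch below is plain Python (standard library only) and is not agenta's internal implementation; it only illustrates roughly how a few of these checks can be computed:

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Exact Match: the output must equal the expected result exactly."""
    return output == expected


def contains_json(output: str) -> bool:
    """Contains JSON: the output must contain a parseable JSON object."""
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        return False
    try:
        json.loads(match.group())
        return True
    except json.JSONDecodeError:
        return False


def regex_test(output: str, pattern: str) -> bool:
    """Regex Test: the output must match the given regex pattern."""
    return re.search(pattern, output) is not None


def jaccard_similarity(output: str, expected: str) -> float:
    """Similarity Match: Jaccard similarity over word sets, from 0.0 to 1.0."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def levenshtein_distance(output: str, expected: str) -> int:
    """Levenshtein Distance: minimum number of single-character edits."""
    if len(output) < len(expected):
        output, expected = expected, output
    previous = list(range(len(expected) + 1))
    for i, c1 in enumerate(output, start=1):
        current = [i]
        for j, c2 in enumerate(expected, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (c1 != c2)))    # substitution
        previous = current
    return previous[-1]
```

Note the difference between the two similarity metrics: Jaccard similarity compares word sets and ignores word order, while Levenshtein distance counts character-level edits and is therefore sensitive to small typos.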
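
The Custom Code Evaluation evaluator lets you implement this kind of logic yourself in Python. The snippet below is only an illustrative sketch: the function name, parameters, and return value shown here are assumptions, so check the evaluator's documentation for the exact interface agenta expects.

```python
# Hypothetical evaluator signature -- the exact interface expected by the
# Custom Code Evaluation evaluator may differ; see the agenta docs.
def evaluate(app_params: dict, inputs: dict, output: str, correct_answer: str) -> float:
    """Return a score between 0.0 and 1.0 for a single test case."""
    # Example scoring logic: full credit for an exact match, partial credit
    # when the expected answer appears somewhere in the output.
    if output.strip() == correct_answer.strip():
        return 1.0
    if correct_answer.strip() and correct_answer.strip() in output:
        return 0.5
    return 0.0
```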