Configuring Evaluators
Evaluators are functions that check if your application's output is correct. You can write your own custom evaluators or use Agenta's built-in evaluators.
Custom Evaluators
Custom evaluators are Python functions decorated with @ag.evaluator. They receive inputs from your test data and the application's output, then return a dictionary with scores.
Basic Structure
import agenta as ag

@ag.evaluator(
    slug="my_evaluator",
    name="My Evaluator",
    description="Checks if the output meets my criteria"
)
async def my_evaluator(expected: str, outputs: str):
    is_correct = outputs == expected
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
The evaluator decorator takes these parameters:
- slug (required): A unique identifier for your evaluator
- name (optional): A human-readable name shown in the UI
- description (optional): Explains what the evaluator checks
Understanding Evaluator Inputs
Evaluators receive two types of inputs:
- Test case fields: Any field from your test data
- Application output: always called outputs
When you run an evaluation, Agenta passes both the test case data and what your application returned to the evaluator.
Example:
# Your test case
test_case = {
    "question": "What is 2+2?",
    "correct_answer": "4",
    "difficulty": "easy"
}

# Your evaluator can access any of these fields
@ag.evaluator(slug="math_checker")
async def math_checker(
    correct_answer: str,  # From test case
    difficulty: str,      # From test case
    outputs: str          # What the application returned
):
    # Check if the application's output matches the correct answer
    is_correct = outputs == correct_answer

    # You can use other fields in your logic
    if difficulty == "easy":
        return {"score": 1.0 if is_correct else 0.0}
    else:
        # More lenient scoring for hard questions: partial credit even when wrong
        return {"score": 1.0 if is_correct else 0.5}
Return Values
Evaluators must return a dictionary. You can include any metrics you want, but these fields have special meaning:
- score: A numeric value (typically 0.0 to 1.0) indicating quality
- success: A boolean flag indicating pass/fail
@ag.evaluator(slug="detailed_checker")
async def detailed_checker(expected: str, outputs: str):
    return {
        "score": 0.85,  # Overall score
        "success": True,  # Did it pass?
        "length_match": len(outputs) == len(expected),
        "exact_match": outputs == expected,
        "custom_metric": 42,
    }
All values in the result dictionary are stored and displayed in the evaluation results.
Practical Examples
Case-Insensitive Match
@ag.evaluator(
    slug="case_insensitive_match",
    name="Case Insensitive Match"
)
async def case_insensitive_match(expected: str, outputs: str):
    match = outputs.lower() == expected.lower()
    return {
        "score": 1.0 if match else 0.0,
        "success": match,
    }
Length Check
@ag.evaluator(
    slug="length_validator",
    name="Length Validator"
)
async def length_validator(outputs: str):
    """Check if output is within acceptable length."""
    length = len(outputs)
    is_valid = 10 <= length <= 500
    return {
        "success": is_valid,
        "length": length,
        "score": 1.0 if is_valid else 0.0,
    }
Contains Keywords
@ag.evaluator(
    slug="keyword_checker",
    name="Keyword Checker"
)
async def keyword_checker(keywords: list[str], outputs: str):
    """Check if output contains required keywords."""
    found = [kw for kw in keywords if kw.lower() in outputs.lower()]
    score = len(found) / len(keywords) if keywords else 0.0
    return {
        "score": score,
        "success": score >= 0.8,
        "found_keywords": found,
        "missing_keywords": [kw for kw in keywords if kw not in found],
    }
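Once defined, custom evaluators are run with aevaluate() alongside your test set and application. The sketch below assumes decorated evaluator functions can be passed in the evaluators list just like the built-in evaluator objects shown in the next section; testset and my_app are the placeholders used throughout this guide.
from agenta.sdk.evaluations import aevaluate

# Sketch: run the custom evaluators defined above in one evaluation
# (assumes decorated functions are accepted directly in the `evaluators` list)
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[case_insensitive_match, length_validator, keyword_checker],
)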
Built-in Evaluators
Agenta provides pre-built evaluators for common evaluation tasks. You import them from agenta.sdk.workflows.builtin and pass them directly to the aevaluate() function.
LLM-as-a-Judge
The LLM-as-a-judge evaluator uses a language model to evaluate your application's output. This is useful when you need nuanced judgments that simple string matching cannot provide.
from agenta.sdk.workflows import builtin
from agenta.sdk.evaluations import aevaluate

llm_evaluator = builtin.auto_ai_critique(
    slug="quality_evaluator",
    name="Quality Evaluator",
    description="Uses an LLM to judge response quality",
    correct_answer_key="expected_answer",
    model="gpt-3.5-turbo",  # or "gpt-4", "claude-3-sonnet", etc.
    prompt_template=[
        {
            "role": "system",
            "content": "You are an expert evaluator of AI responses.",
        },
        {
            "role": "user",
            "content": (
                "Expected answer: {{expected_answer}}\n"
                "Actual answer: {{outputs}}\n\n"
                "Rate the quality of the actual answer from 0.0 to 1.0.\n"
                "Respond with ONLY a number, nothing else."
            ),
        },
    ],
)

# Use it in evaluation
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[llm_evaluator],
)
Parameters:
- slug (required): Unique identifier for the evaluator
- prompt_template (required): List of message dictionaries with role and content
  - Use {{field_name}} placeholders that will be replaced with test case values
  - Use {{outputs}} for the application's output; it is always available
- correct_answer_key (optional): Field name in test case containing the expected answer
- model (optional): Which LLM to use (default: "gpt-3.5-turbo")
- name (optional): Display name
- description (optional): Description of what this evaluator checks
The prompt template uses curly brace syntax {{variable}} for placeholders. All fields from your test case are available, plus {{outputs}}.
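For example, any field from the math test case shown earlier can appear in the template. The snippet below is an illustrative sketch; the math_grader slug and the grading wording are made up for this example.
# Illustrative sketch: test case fields become {{placeholders}} in the template
grader = builtin.auto_ai_critique(
    slug="math_grader",
    correct_answer_key="correct_answer",
    prompt_template=[
        {"role": "system", "content": "You grade answers to math questions."},
        {
            "role": "user",
            "content": (
                "Question: {{question}}\n"
                "Correct answer: {{correct_answer}}\n"
                "Model answer: {{outputs}}\n"
                "Reply with a single number between 0.0 and 1.0."
            ),
        },
    ],
)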
String Matching Evaluators
Exact Match
Checks if the output exactly matches the expected answer.
from agenta.sdk.workflows import builtin
from agenta.sdk.evaluations import aevaluate

exact_match = builtin.auto_exact_match(
    correct_answer_key="expected"
)

# Use in evaluation
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[exact_match],
)
Parameters:
- correct_answer_key (optional): Field name in test case with expected value (default: "correct_answer")
Returns:
success: True if output exactly matches expected value
Starts With
Checks if the output starts with a specific prefix.
prefix_check = builtin.auto_starts_with(
    prefix="Answer:",
    case_sensitive=True
)
Parameters:
- prefix (required): The string the output should start with
- case_sensitive (optional): Whether to match case (default: True)
Returns:
success: True if output starts with the prefix
Ends With
Checks if the output ends with a specific suffix.
suffix_check = builtin.auto_ends_with(
    suffix="Thank you!",
    case_sensitive=False
)
Parameters:
- suffix (required): The string the output should end with
- case_sensitive (optional): Whether to match case (default: True)
Returns:
success: True if output ends with the suffix
Contains
Checks if the output contains a substring.
contains_check = builtin.auto_contains(
    substring="important keyword",
    case_sensitive=False
)
Parameters:
- substring (required): The string to search for
- case_sensitive (optional): Whether to match case (default: True)
Returns:
success: True if output contains the substring
Contains Any
Checks if the output contains at least one of several substrings.
any_check = builtin.auto_contains_any(
    substrings=["hello", "hi", "greetings"],
    case_sensitive=False
)
Parameters:
- substrings (required): List of strings to search for
- case_sensitive (optional): Whether to match case (default: True)
Returns:
success: True if output contains at least one substring
Contains All
Checks if the output contains all of several substrings.
all_check = builtin.auto_contains_all(
    substrings=["name", "age", "email"],
    case_sensitive=False
)
Parameters:
- substrings (required): List of strings that must all be present
- case_sensitive (optional): Whether to match case (default: True)
Returns:
success: True if output contains all substrings
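String-matching evaluators can be combined in a single run, each contributing its own success flag to the results. A minimal sketch reusing the evaluators configured above and the testset and my_app placeholders from the earlier examples:
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[prefix_check, suffix_check, contains_check, any_check, all_check],
)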
Regex Evaluator
Checks if the output matches a regular expression pattern.
regex_check = builtin.auto_regex_test(
    regex_pattern=r"\d{3}-\d{3}-\d{4}",  # Phone number pattern
    regex_should_match=True,
    case_sensitive=False
)
Parameters:
- regex_pattern (required): Regular expression pattern to test
- regex_should_match (optional): Whether pattern should match (default: True)
- case_sensitive (optional): Whether to match case (default: True)
Returns:
success: True if the pattern match result equals regex_should_match
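Setting regex_should_match=False inverts the check, which is handy for asserting that a pattern must not appear in the output. A hypothetical sketch:
# Sketch: fail whenever the output contains something that looks like an email address
no_email_check = builtin.auto_regex_test(
    regex_pattern=r"[\w.+-]+@[\w-]+\.[\w.]+",
    regex_should_match=False,
)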
JSON Evaluators
Contains JSON
Checks if the output contains valid JSON.
json_check = builtin.auto_contains_json()
Returns:
success: True if output contains parseable JSON
JSON Field Match
Checks if a specific field in JSON output matches the expected value.
field_check = builtin.field_match_test(
    json_field="status",
    correct_answer_key="expected_status"
)
Parameters:
- json_field (required): Name of field to extract from JSON output
- correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
Returns:
success: True if extracted field matches expected value
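To make the mapping concrete, here is a hypothetical test case paired with the configuration above: the evaluator parses the application's JSON output, reads its status field, and compares it to the test case's expected_status value.
# Hypothetical test case for the field_check evaluator above
test_case = {
    "instruction": "Create an order for two widgets",
    "expected_status": "confirmed",  # compared against the "status" field in the JSON output
}

# If the application returns '{"status": "confirmed", "items": 2}',
# field_check extracts "confirmed" and the comparison succeeds.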
JSON Diff
Compares JSON structures and calculates similarity score.
json_diff = builtin.auto_json_diff(
    correct_answer_key="expected_json",
    threshold=0.8,
    compare_schema_only=False
)
Parameters:
- correct_answer_key (optional): Test case field with expected JSON (default: "correct_answer")
- threshold (optional): Minimum similarity score to pass (default: 0.5)
- predict_keys (optional): Whether to predict which keys to compare (default: False)
- case_insensitive_keys (optional): Whether to ignore case in key names (default: False)
- compare_schema_only (optional): Only compare structure, not values (default: False)
Returns:
score: Similarity score from 0.0 to 1.0
success: True if score meets threshold
Similarity Evaluators
Levenshtein Distance
Calculates edit distance between output and expected value.
levenshtein = builtin.auto_levenshtein_distance(
    correct_answer_key="expected",
    threshold=0.8,
    case_sensitive=False
)
Parameters:
- correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
- case_sensitive (optional): Whether to match case (default: True)
- threshold (optional): Minimum similarity to pass (default: 0.5)
Returns:
score: Normalized similarity score from 0.0 to 1.0
success: True if score meets threshold
Similarity Match
Uses Python's SequenceMatcher to calculate similarity.
similarity = builtin.auto_similarity_match(
    correct_answer_key="expected",
    threshold=0.75
)
Parameters:
- correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
- threshold (optional): Minimum similarity to pass (default: 0.5)
Returns:
score: Similarity ratio from 0.0 to 1.0
success: True if score meets threshold
Semantic Similarity
Uses embeddings to measure semantic similarity.
semantic = builtin.auto_semantic_similarity(
    correct_answer_key="expected",
    embedding_model="text-embedding-3-small",
    threshold=0.8
)
Parameters:
- correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
- embedding_model (optional): OpenAI embedding model (default: "text-embedding-3-small")
- threshold (optional): Minimum similarity to pass (default: 0.5)
Returns:
score: Cosine similarity from 0.0 to 1.0
success: True if score meets threshold
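The three similarity evaluators can be run side by side to compare strict and semantic measures on the same outputs. A minimal sketch reusing the levenshtein, similarity, and semantic evaluators configured above and the testset and my_app placeholders from the earlier examples:
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[levenshtein, similarity, semantic],
)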