Configuring Evaluators

Evaluators are functions that check if your application's output is correct. You can write your own custom evaluators or use Agenta's built-in evaluators.

Custom Evaluators

Custom evaluators are Python functions decorated with @ag.evaluator. They receive inputs from your test data and the application's output, then return a dictionary with scores.

Basic Structure

import agenta as ag

@ag.evaluator(
    slug="my_evaluator",
    name="My Evaluator",
    description="Checks if the output meets my criteria",
)
async def my_evaluator(expected: str, outputs: str):
    is_correct = outputs == expected
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

The @ag.evaluator decorator takes these parameters:

  • slug (required): A unique identifier for your evaluator
  • name (optional): A human-readable name shown in the UI
  • description (optional): Explains what the evaluator checks

Understanding Evaluator Inputs

Evaluators receive two types of inputs:

  1. Test case fields: Any field from your test data
  2. Application output: Always called outputs

When you run an evaluation, Agenta passes both the test case data and what your application returned to the evaluator.

Example:

# Your test case
test_case = {
    "question": "What is 2+2?",
    "correct_answer": "4",
    "difficulty": "easy",
}

# Your evaluator can access any of these fields
@ag.evaluator(slug="math_checker")
async def math_checker(
    correct_answer: str,  # From the test case
    difficulty: str,      # From the test case
    outputs: str,         # What the application returned
):
    # Check if the application's output matches the correct answer
    is_correct = outputs == correct_answer

    # You can use other fields in your logic
    if difficulty == "easy":
        return {"score": 1.0 if is_correct else 0.0}
    else:
        # More lenient scoring for hard questions: give partial credit
        # even when the answer does not match exactly
        return {"score": 1.0 if is_correct else 0.5}

Return Values

Evaluators must return a dictionary. You can include any metrics you want, but these fields have special meaning:

  • score: A numeric value (typically 0.0 to 1.0) indicating quality
  • success: A boolean flag indicating pass/fail

@ag.evaluator(slug="detailed_checker")
async def detailed_checker(expected: str, outputs: str):
    return {
        "score": 0.85,    # Overall score
        "success": True,  # Did it pass?
        "length_match": len(outputs) == len(expected),
        "exact_match": outputs == expected,
        "custom_metric": 42,
    }

All values in the result dictionary are stored and displayed in the evaluation results.

Practical Examples

Case-Insensitive Match

@ag.evaluator(
    slug="case_insensitive_match",
    name="Case Insensitive Match",
)
async def case_insensitive_match(expected: str, outputs: str):
    match = outputs.lower() == expected.lower()
    return {
        "score": 1.0 if match else 0.0,
        "success": match,
    }

Length Check

@ag.evaluator(
    slug="length_validator",
    name="Length Validator",
)
async def length_validator(outputs: str):
    """Check if the output is within an acceptable length."""
    length = len(outputs)
    is_valid = 10 <= length <= 500

    return {
        "success": is_valid,
        "length": length,
        "score": 1.0 if is_valid else 0.0,
    }

Contains Keywords

@ag.evaluator(
    slug="keyword_checker",
    name="Keyword Checker",
)
async def keyword_checker(keywords: list[str], outputs: str):
    """Check if the output contains the required keywords."""
    found = [kw for kw in keywords if kw.lower() in outputs.lower()]
    score = len(found) / len(keywords) if keywords else 0.0

    return {
        "score": score,
        "success": score >= 0.8,
        "found_keywords": found,
        "missing_keywords": [kw for kw in keywords if kw not in found],
    }

Built-in Evaluators

Agenta provides pre-built evaluators for common evaluation tasks. You import them from agenta.sdk.workflows.builtin and pass them directly to the aevaluate() function.

LLM-as-a-Judge

The LLM-as-a-judge evaluator uses a language model to evaluate your application's output. This is useful when you need nuanced judgments that simple string matching cannot provide.

from agenta.sdk.workflows import builtin
from agenta.sdk.evaluations import aevaluate

llm_evaluator = builtin.auto_ai_critique(
    slug="quality_evaluator",
    name="Quality Evaluator",
    description="Uses an LLM to judge response quality",
    correct_answer_key="expected_answer",
    model="gpt-3.5-turbo",  # or "gpt-4", "claude-3-sonnet", etc.
    prompt_template=[
        {
            "role": "system",
            "content": "You are an expert evaluator of AI responses.",
        },
        {
            "role": "user",
            "content": (
                "Expected answer: {{expected_answer}}\n"
                "Actual answer: {{outputs}}\n\n"
                "Rate the quality of the actual answer from 0.0 to 1.0.\n"
                "Respond with ONLY a number, nothing else."
            ),
        },
    ],
)

# Use it in an evaluation
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[llm_evaluator],
)

Parameters:

  • slug (required): Unique identifier for the evaluator
  • prompt_template (required): List of message dictionaries with role and content
    • Use {{field_name}} placeholders that will be replaced with test case values
    • {{outputs}} is always available for the application's output
  • correct_answer_key (optional): Field name in test case containing the expected answer
  • model (optional): Which LLM to use (default: "gpt-3.5-turbo")
  • name (optional): Display name
  • description (optional): Description of what this evaluator checks

The prompt template uses curly brace syntax {{variable}} for placeholders. All fields from your test case are available, plus {{outputs}}.
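
For example, a template can reference any column in your test set by name, not just the expected answer. A sketch assuming a test case with hypothetical question and reference_answer fields:

# Hypothetical test case fields: "question" and "reference_answer"
relevance_judge = builtin.auto_ai_critique(
    slug="relevance_judge",
    correct_answer_key="reference_answer",
    prompt_template=[
        {
            "role": "user",
            "content": (
                "Question: {{question}}\n"
                "Reference answer: {{reference_answer}}\n"
                "Model answer: {{outputs}}\n\n"
                "Rate how well the model answer addresses the question "
                "from 0.0 to 1.0. Respond with ONLY a number."
            ),
        },
    ],
)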

String Matching Evaluators

Exact Match

Checks if the output exactly matches the expected answer.

from agenta.sdk.workflows import builtin
from agenta.sdk.evaluations import aevaluate

exact_match = builtin.auto_exact_match(
    correct_answer_key="expected",
)

# Use in an evaluation
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[exact_match],
)

Parameters:

  • correct_answer_key (optional): Field name in test case with expected value (default: "correct_answer")

Returns:

  • success: True if output exactly matches expected value

Starts With

Checks if the output starts with a specific prefix.

prefix_check = builtin.auto_starts_with(
    prefix="Answer:",
    case_sensitive=True,
)

Parameters:

  • prefix (required): The string the output should start with
  • case_sensitive (optional): Whether to match case (default: True)

Returns:

  • success: True if output starts with the prefix

Ends With

Checks if the output ends with a specific suffix.

suffix_check = builtin.auto_ends_with(
    suffix="Thank you!",
    case_sensitive=False,
)

Parameters:

  • suffix (required): The string the output should end with
  • case_sensitive (optional): Whether to match case (default: True)

Returns:

  • success: True if output ends with the suffix

Contains

Checks if the output contains a substring.

contains_check = builtin.auto_contains(
    substring="important keyword",
    case_sensitive=False,
)

Parameters:

  • substring (required): The string to search for
  • case_sensitive (optional): Whether to match case (default: True)

Returns:

  • success: True if output contains the substring

Contains Any

Checks if the output contains at least one of several substrings.

any_check = builtin.auto_contains_any(
    substrings=["hello", "hi", "greetings"],
    case_sensitive=False,
)

Parameters:

  • substrings (required): List of strings to search for
  • case_sensitive (optional): Whether to match case (default: True)

Returns:

  • success: True if output contains at least one substring

Contains All

Checks if the output contains all of several substrings.

all_check = builtin.auto_contains_all(
    substrings=["name", "age", "email"],
    case_sensitive=False,
)

Parameters:

  • substrings (required): List of strings that must all be present
  • case_sensitive (optional): Whether to match case (default: True)

Returns:

  • success: True if output contains all substrings

Regex Evaluator

Checks if the output matches a regular expression pattern.

regex_check = builtin.auto_regex_test(
    regex_pattern=r"\d{3}-\d{3}-\d{4}",  # Phone number pattern
    regex_should_match=True,
    case_sensitive=False,
)

Parameters:

  • regex_pattern (required): Regular expression pattern to test
  • regex_should_match (optional): Whether pattern should match (default: True)
  • case_sensitive (optional): Whether to match case (default: True)

Returns:

  • success: True if pattern match result equals regex_should_match

JSON Evaluators

Contains JSON

Checks if the output contains valid JSON.

json_check = builtin.auto_contains_json()

Returns:

  • success: True if output contains parseable JSON

JSON Field Match

Checks if a specific field in JSON output matches the expected value.

field_check = builtin.field_match_test(
    json_field="status",
    correct_answer_key="expected_status",
)

Parameters:

  • json_field (required): Name of field to extract from JSON output
  • correct_answer_key (optional): Test case field with expected value (default: "correct_answer")

Returns:

  • success: True if extracted field matches expected value

JSON Diff

Compares JSON structures and calculates a similarity score.

json_diff = builtin.auto_json_diff(
    correct_answer_key="expected_json",
    threshold=0.8,
    compare_schema_only=False,
)

Parameters:

  • correct_answer_key (optional): Test case field with expected JSON (default: "correct_answer")
  • threshold (optional): Minimum similarity score to pass (default: 0.5)
  • predict_keys (optional): Whether to predict which keys to compare (default: False)
  • case_insensitive_keys (optional): Whether to ignore case in key names (default: False)
  • compare_schema_only (optional): Only compare structure, not values (default: False)

Returns:

  • score: Similarity score from 0.0 to 1.0
  • success: True if score meets threshold

Similarity Evaluators

Levenshtein Distance

Calculates the Levenshtein edit distance between the output and the expected value, normalized to a similarity score.

levenshtein = builtin.auto_levenshtein_distance(
    correct_answer_key="expected",
    threshold=0.8,
    case_sensitive=False,
)

Parameters:

  • correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
  • case_sensitive (optional): Whether to match case (default: True)
  • threshold (optional): Minimum similarity to pass (default: 0.5)

Returns:

  • score: Normalized similarity score from 0.0 to 1.0
  • success: True if score meets threshold

Similarity Match

Uses Python's difflib.SequenceMatcher to calculate similarity.

similarity = builtin.auto_similarity_match(
    correct_answer_key="expected",
    threshold=0.75,
)

Parameters:

  • correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
  • threshold (optional): Minimum similarity to pass (default: 0.5)

Returns:

  • score: Similarity ratio from 0.0 to 1.0
  • success: True if score meets threshold

Semantic Similarity

Uses embeddings to measure semantic similarity.

semantic = builtin.auto_semantic_similarity(
    correct_answer_key="expected",
    embedding_model="text-embedding-3-small",
    threshold=0.8,
)

Parameters:

  • correct_answer_key (optional): Test case field with expected value (default: "correct_answer")
  • embedding_model (optional): OpenAI embedding model (default: "text-embedding-3-small")
  • threshold (optional): Minimum similarity to pass (default: 0.5)

Returns:

  • score: Cosine similarity from 0.0 to 1.0
  • success: True if score meets threshold
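
Multiple evaluators can run in a single aevaluate() call, with each one contributing its own values to the evaluation results. A minimal sketch reusing evaluators defined above (testset and my_app are placeholders for objects from your own setup):

from agenta.sdk.evaluations import aevaluate

# Combine several built-in evaluators in one evaluation run
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[exact_match, levenshtein, semantic, llm_evaluator],
)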