Quick Start
This guide shows you how to create your first evaluation using the Agenta SDK. You'll build a simple application that answers geography questions, then create evaluators to check if the answers are correct.
What You'll Build
By the end of this guide, you'll have:
- An application that returns country capitals
- Two evaluators that check if answers are correct
- A complete evaluation run with results
The entire example takes less than 100 lines of code.
Prerequisites
Install the Agenta SDK:
pip install agenta
Set your environment variables:
export AGENTA_API_KEY="your-api-key"
export AGENTA_HOST="https://cloud.agenta.ai"
export OPENAI_API_KEY="your-openai-api-key" # Required for LLM-as-a-judge evaluator
Step 1: Initialize Agenta
Create a new Python file and initialize the SDK:
import agenta as ag
ag.init()
Step 2: Create Your Application
An application is any function that processes inputs and returns outputs. Use the @ag.application decorator to mark your function:
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
    description="Returns the capital of a given country",
)
async def capital_finder(country: str):
    """
    Your application logic goes here.
    For this example, we'll use a simple dictionary lookup.
    """
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")
The function receives parameters from your test data. In this case, it gets country from the testcase and returns the capital city.
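If you want to sanity-check the function before wiring up a full evaluation, you can call it directly. This is a minimal sketch that assumes the @ag.application decorator leaves the coroutine directly callable, which may vary by SDK version:

import asyncio

# Quick local check, outside of any evaluation run
print(asyncio.run(capital_finder("France")))  # expected: Paris
print(asyncio.run(capital_finder("Japan")))   # expected: Unknown (not in the lookup table)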
Step 3: Create an Evaluator
An evaluator checks if your application's output is correct. Use the @ag.evaluator decorator:
@ag.evaluator(
    slug="exact_match",
    name="Exact Match Evaluator",
    description="Checks if the output exactly matches the expected answer",
)
async def exact_match(capital: str, outputs: str):
    """
    Compare the application's output to the expected answer.

    Args:
        capital: The expected answer from the testcase
        outputs: What your application returned

    Returns:
        A dictionary with a score and a success flag
    """
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
The evaluator receives two types of inputs:
- Fields from your testcase (like capital)
- The application's output (always called outputs)
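The same direct-call trick works for the evaluator (again assuming the decorator keeps the function awaitable as-is), which makes the two input types easy to see:

import asyncio

# capital comes from the testcase; outputs simulates the application's return value
verdict = asyncio.run(exact_match(capital="Berlin", outputs="Berlin"))
print(verdict)  # expected: {'score': 1.0, 'success': True}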
Step 4: Create Test Data
Define your test cases as a list of dictionaries:
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]
Each dictionary represents one test case. The keys become parameters that your application and evaluators can access.
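Any extra key is simply passed to whichever function declares a matching parameter. For instance, a hypothetical testset with a region field would still feed country to the application, while region stays available to evaluators that ask for it (see the region_aware example under Common Patterns):

# Hypothetical extended testset: "region" is not required by Agenta,
# just an extra field that a matching evaluator parameter can receive.
test_data_extended = [
    {"country": "Germany", "capital": "Berlin", "region": "Europe"},
    {"country": "France", "capital": "Paris", "region": "Europe"},
]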
Step 5: Run the Evaluation
Import the evaluation functions and run your test:
import asyncio
from agenta.sdk.evaluations import aevaluate

async def run_evaluation():
    # Create a testset from your data
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    # Run evaluation
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )
    return result

# Run the evaluation
if __name__ == "__main__":
    eval_result = asyncio.run(run_evaluation())
    print("Evaluation complete!")
Complete Example
Here's the full code in one place:
import asyncio
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize SDK
ag.init()

# Define test data
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

# Create application
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
)
async def capital_finder(country: str):
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

# Create evaluator
@ag.evaluator(
    slug="exact_match",
    name="Exact Match",
)
async def exact_match(capital: str, outputs: str):
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

# Run evaluation
async def main():
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )
    print("Evaluation complete!")

if __name__ == "__main__":
    asyncio.run(main())
Understanding the Data Flow
When you run an evaluation, here's what happens:
- Testcase data flows to the application
  - Input: {"country": "Germany", "capital": "Berlin"}
  - Application receives: country="Germany"
  - Application returns: "Berlin"
- Both testcase data and application output flow to the evaluator
  - Evaluator receives: capital="Berlin" (the expected answer from the testcase)
  - Evaluator receives: outputs="Berlin" (what the application returned)
  - Evaluator compares them and returns: {"score": 1.0, "success": True}
- Results are collected and stored in Agenta
  - You can view them in the web interface
  - Or access them programmatically from the result object
Next Steps
Now that you've created your first evaluation, you can:
- Learn how to configure custom evaluators with different scoring logic
- Explore built-in evaluators like LLM-as-a-judge
- Understand how to configure your application for different use cases
- Run multiple evaluators in a single evaluation
Common Patterns
Using Multiple Evaluators
You can run several evaluators on the same application:
result = await aevaluate(
    testsets=[testset.id],
    applications=[capital_finder],
    evaluators=[
        exact_match,
        case_insensitive_match,
        similarity_check,
    ],
)
Each evaluator runs independently and produces its own scores.
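The snippet above assumes the extra evaluators are defined. Here's a minimal sketch of case_insensitive_match, reusing the decorator pattern from Step 3 (similarity_check would follow the same shape with its own comparison logic):

@ag.evaluator(
    slug="case_insensitive_match",
    name="Case-Insensitive Match",
)
async def case_insensitive_match(capital: str, outputs: str):
    # Like exact_match, but ignores letter case and surrounding whitespace
    is_correct = outputs.strip().lower() == capital.strip().lower()
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }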
Accessing Additional Test Data
Your evaluators can access any field from the testcase:
@ag.evaluator(slug="region_aware")
async def region_aware(country: str, region: str, outputs: str):
# You can access multiple fields from the testcase
# and use them in your evaluation logic
pass
Returning Multiple Metrics
Evaluators can return multiple scores:
@ag.evaluator(slug="detailed_eval")
async def detailed_eval(expected: str, outputs: str):
return {
"exact_match": 1.0 if outputs == expected else 0.0,
"length_diff": abs(len(outputs) - len(expected)),
"success": outputs == expected,
}
Getting Help
If you run into issues:
- Join our Discord community
- Open an issue on GitHub
