
Quick Start

This guide shows you how to create your first evaluation using the Agenta SDK. You'll build a simple application that answers geography questions, then create evaluators to check if the answers are correct.

What You'll Build

By the end of this guide, you'll have:

  • An application that returns country capitals
  • Two evaluators that check if answers are correct
  • A complete evaluation run with results

The entire example takes less than 100 lines of code.

Prerequisites

Install the Agenta SDK:

pip install agenta

Set your environment variables:

export AGENTA_API_KEY="your-api-key"
export AGENTA_HOST="https://cloud.agenta.ai"
export OPENAI_API_KEY="your-openai-api-key" # Only needed if your application or evaluators call OpenAI (e.g., an LLM-as-a-judge evaluator)

Step 1: Initialize Agenta

Create a new Python file and initialize the SDK:

import agenta as ag

ag.init()

Step 2: Create Your Application

An application is any function that processes inputs and returns outputs. Use the @ag.application decorator to mark your function:

@ag.application(
    slug="capital_finder",
    name="Capital Finder",
    description="Returns the capital of a given country"
)
async def capital_finder(country: str):
    """
    Your application logic goes here.
    For this example, we'll use a simple dictionary lookup.
    """
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

The function receives parameters from your test data. In this case, it gets country from the testcase and returns the capital city.
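If you want to sanity-check the lookup logic before wiring it into an evaluation, you can reproduce it as a plain coroutine. The name capital_lookup below is illustrative and not part of the SDK:

```python
import asyncio

# Plain copy of the application's lookup logic, without the
# @ag.application decorator, so it can be run locally.
async def capital_lookup(country: str) -> str:
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

print(asyncio.run(capital_lookup("France")))   # Paris
print(asyncio.run(capital_lookup("Brazil")))   # Unknown
```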

Step 3: Create an Evaluator

An evaluator checks if your application's output is correct. Use the @ag.evaluator decorator:

@ag.evaluator(
    slug="exact_match",
    name="Exact Match Evaluator",
    description="Checks if the output exactly matches the expected answer"
)
async def exact_match(capital: str, outputs: str):
    """
    Compare the application's output to the expected answer.

    Args:
        capital: The expected answer from the testcase
        outputs: What your application returned

    Returns:
        A dictionary with score and success flag
    """
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

The evaluator receives two types of inputs:

  • Fields from your testcase (like capital)
  • The application's output (always called outputs)
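You can exercise the comparison logic on its own by copying the evaluator body into an undecorated coroutine. This is a local sketch for testing, not an SDK feature:

```python
import asyncio

# Standalone copy of the evaluator logic, without @ag.evaluator.
async def check_exact_match(capital: str, outputs: str) -> dict:
    is_correct = outputs == capital
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}

print(asyncio.run(check_exact_match(capital="Berlin", outputs="Berlin")))
# {'score': 1.0, 'success': True}
print(asyncio.run(check_exact_match(capital="Berlin", outputs="Bern")))
# {'score': 0.0, 'success': False}
```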

Step 4: Create Test Data

Define your test cases as a list of dictionaries:

test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

Each dictionary represents one test case. The keys become parameters that your application and evaluators can access.
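One way to picture the routing (an illustration of the idea using bare stand-in functions, not the SDK's actual internals): each function receives only the testcase keys that match its parameter names.

```python
import inspect

def select_args(func, testcase: dict) -> dict:
    """Keep only the testcase keys that match func's parameter names."""
    params = inspect.signature(func).parameters
    return {k: v for k, v in testcase.items() if k in params}

# Bare stand-ins with the same signatures as the decorated functions.
def capital_finder(country: str): ...
def exact_match(capital: str, outputs: str): ...

testcase = {"country": "Germany", "capital": "Berlin"}
print(select_args(capital_finder, testcase))  # {'country': 'Germany'}
print(select_args(exact_match, testcase))     # {'capital': 'Berlin'}
```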

Step 5: Run the Evaluation

Import the evaluation functions and run your test:

import asyncio
from agenta.sdk.evaluations import aevaluate

async def run_evaluation():
    # Create a testset from your data
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    # Run evaluation
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )

    return result

# Run the evaluation
if __name__ == "__main__":
    eval_result = asyncio.run(run_evaluation())
    print("Evaluation complete!")

Complete Example

Here's the full code in one place:

import asyncio
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize SDK
ag.init()

# Define test data
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

# Create application
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
)
async def capital_finder(country: str):
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

# Create evaluator
@ag.evaluator(
    slug="exact_match",
    name="Exact Match",
)
async def exact_match(capital: str, outputs: str):
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

# Run evaluation
async def main():
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )

    print("Evaluation complete!")

if __name__ == "__main__":
    asyncio.run(main())

Understanding the Data Flow

When you run an evaluation, here's what happens:

  1. Testcase data flows to the application

    • Input: {"country": "Germany", "capital": "Berlin"}
    • Application receives: country="Germany"
    • Application returns: "Berlin"
  2. Both testcase data and application output flow to the evaluator

    • Evaluator receives: capital="Berlin" (expected answer from testcase)
    • Evaluator receives: outputs="Berlin" (what the application returned)
    • Evaluator compares them and returns: {"score": 1.0, "success": True}
  3. Results are collected and stored in Agenta

    • You can view them in the web interface
    • Or access them programmatically from the result object
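The first two steps can be simulated locally with plain coroutines. This is a sketch of the data flow, not the SDK's implementation:

```python
import asyncio

async def capital_finder(country: str) -> str:
    capitals = {"Germany": "Berlin", "France": "Paris"}
    return capitals.get(country, "Unknown")

async def exact_match(capital: str, outputs: str) -> dict:
    is_correct = outputs == capital
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}

async def simulate(testcase: dict) -> dict:
    # 1. Testcase input flows to the application.
    outputs = await capital_finder(testcase["country"])
    # 2. Testcase data and application output flow to the evaluator.
    return await exact_match(capital=testcase["capital"], outputs=outputs)

result = asyncio.run(simulate({"country": "Germany", "capital": "Berlin"}))
print(result)  # {'score': 1.0, 'success': True}
```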

Next Steps

Now that you've created your first evaluation, you can explore the common patterns below or continue with the rest of the Agenta evaluation documentation.

Common Patterns

Using Multiple Evaluators

You can run several evaluators on the same application:

result = await aevaluate(
    testsets=[testset.id],
    applications=[capital_finder],
    evaluators=[
        exact_match,
        case_insensitive_match,
        similarity_check,
    ],
)

Each evaluator runs independently and produces its own scores.
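For example, a case_insensitive_match evaluator might normalize both strings before comparing. The sketch below omits the decorator so it can run standalone; in a real run you would register it with @ag.evaluator, as shown earlier:

```python
import asyncio

# Tolerant variant of exact match: ignores case and surrounding
# whitespace. Decorate with @ag.evaluator(slug="case_insensitive_match")
# to use it in an evaluation.
async def case_insensitive_match(capital: str, outputs: str) -> dict:
    is_correct = outputs.strip().lower() == capital.strip().lower()
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}

print(asyncio.run(case_insensitive_match(capital="Berlin", outputs=" berlin ")))
# {'score': 1.0, 'success': True}
```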

Accessing Additional Test Data

Your evaluators can access any field from the testcase:

@ag.evaluator(slug="region_aware")
async def region_aware(country: str, region: str, outputs: str):
    # Any testcase field (here, country and region) is available
    # alongside the application's output.
    is_correct = outputs != "Unknown"  # replace with your own logic
    return {"score": 1.0 if is_correct else 0.0, "success": is_correct}

Returning Multiple Metrics

Evaluators can return multiple scores:

@ag.evaluator(slug="detailed_eval")
async def detailed_eval(expected: str, outputs: str):
    return {
        "exact_match": 1.0 if outputs == expected else 0.0,
        "length_diff": abs(len(outputs) - len(expected)),
        "success": outputs == expected,
    }
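Calling that logic on a sample pair shows each metric computed independently. This undecorated copy is just for a quick local check:

```python
import asyncio

# Standalone copy of the multi-metric evaluator, without @ag.evaluator.
async def detailed_eval(expected: str, outputs: str) -> dict:
    return {
        "exact_match": 1.0 if outputs == expected else 0.0,
        "length_diff": abs(len(outputs) - len(expected)),
        "success": outputs == expected,
    }

print(asyncio.run(detailed_eval(expected="Berlin", outputs="Bern")))
# {'exact_match': 0.0, 'length_diff': 2, 'success': False}
```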

Getting Help

If you run into issues, double-check that your environment variables and API keys are set correctly, and consult the rest of the Agenta documentation.