Quick Start
This guide shows you how to create your first evaluation using the Agenta SDK. You'll build a simple application that answers geography questions, then create evaluators to check if the answers are correct.
What You'll Build
By the end of this guide, you'll have:
- An application that returns country capitals
- Two evaluators that check if answers are correct
- A complete evaluation run with results
The entire example takes less than 100 lines of code.
Prerequisites
Install the Agenta SDK:
pip install agenta
Set your environment variables:
export AGENTA_API_KEY="your-api-key"
export AGENTA_HOST="https://cloud.agenta.ai"
export OPENAI_API_KEY="your-openai-api-key" # Required for LLM-as-a-judge evaluator
Step 1: Initialize Agenta
Create a new Python file and initialize the SDK:
import agenta as ag
ag.init()
Step 2: Create Your Application
An application is any function that processes inputs and returns outputs. Use the @ag.application decorator to mark your function:
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
    description="Returns the capital of a given country",
)
async def capital_finder(country: str):
    """
    Your application logic goes here.
    For this example, we'll use a simple dictionary lookup.
    """
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")
The function receives parameters from your test data. In this case, it gets country from the testcase and returns the capital city.
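If you want to sanity-check the function before wiring up a full evaluation, you can call it directly. This is a minimal sketch that assumes the @ag.application decorator leaves the coroutine directly callable, which may vary by SDK version:

import asyncio

# Quick local check, outside of any evaluation run
print(asyncio.run(capital_finder("France")))  # expected: Paris
print(asyncio.run(capital_finder("Japan")))   # expected: Unknown (not in the lookup table)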
Step 3: Create an Evaluator
An evaluator checks if your application's output is correct. Use the @ag.evaluator decorator:
@ag.evaluator(
    slug="exact_match",
    name="Exact Match Evaluator",
    description="Checks if the output exactly matches the expected answer",
)
async def exact_match(capital: str, outputs: str):
    """
    Compare the application's output to the expected answer.

    Args:
        capital: The expected answer from the testcase
        outputs: What your application returned

    Returns:
        A dictionary with a score and a success flag
    """
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }
The evaluator receives two types of inputs:
- Fields from your testcase (like capital)
- The application's output (always called outputs)
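The same direct-call trick works for the evaluator (again assuming the decorator keeps the function awaitable as-is), which makes the two input types easy to see:

import asyncio

# capital comes from the testcase; outputs simulates the application's return value
verdict = asyncio.run(exact_match(capital="Berlin", outputs="Berlin"))
print(verdict)  # expected: {'score': 1.0, 'success': True}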
Step 4: Create Test Data
Define your test cases as a list of dictionaries:
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]
Each dictionary represents one test case. The keys become parameters that your application and evaluators can access.
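Any extra key is simply passed to whichever function declares a matching parameter. For instance, a hypothetical testset with a region field would still feed country to the application, while region stays available to evaluators that ask for it (see the region_aware example under Common Patterns):

# Hypothetical extended testset: "region" is not required by Agenta,
# just an extra field that a matching evaluator parameter can receive.
test_data_extended = [
    {"country": "Germany", "capital": "Berlin", "region": "Europe"},
    {"country": "France", "capital": "Paris", "region": "Europe"},
]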
Step 5: Run the Evaluation
Import the evaluation functions and run your test:
import asyncio
from agenta.sdk.evaluations import aevaluate

async def run_evaluation():
    # Create a testset from your data
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )

    # Run evaluation
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )
    return result

# Run the evaluation
if __name__ == "__main__":
    eval_result = asyncio.run(run_evaluation())
    print("Evaluation complete!")
Complete Example
Here's the full code in one place:
import asyncio
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize SDK
ag.init()

# Define test data
test_data = [
    {"country": "Germany", "capital": "Berlin"},
    {"country": "France", "capital": "Paris"},
    {"country": "Spain", "capital": "Madrid"},
    {"country": "Italy", "capital": "Rome"},
]

# Create application
@ag.application(
    slug="capital_finder",
    name="Capital Finder",
)
async def capital_finder(country: str):
    capitals = {
        "Germany": "Berlin",
        "France": "Paris",
        "Spain": "Madrid",
        "Italy": "Rome",
    }
    return capitals.get(country, "Unknown")

# Create evaluator
@ag.evaluator(
    slug="exact_match",
    name="Exact Match",
)
async def exact_match(capital: str, outputs: str):
    is_correct = outputs == capital
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }

# Run evaluation
async def main():
    testset = await ag.testsets.acreate(
        name="Country Capitals",
        data=test_data,
    )
    result = await aevaluate(
        testsets=[testset.id],
        applications=[capital_finder],
        evaluators=[exact_match],
    )
    print("Evaluation complete!")

if __name__ == "__main__":
    asyncio.run(main())
Understanding the Data Flow
When you run an evaluation, here's what happens:
- Testcase data flows to the application
  - Input: {"country": "Germany", "capital": "Berlin"}
  - Application receives: country="Germany"
  - Application returns: "Berlin"
- Both testcase data and application output flow to the evaluator
  - Evaluator receives: capital="Berlin" (the expected answer from the testcase)
  - Evaluator receives: outputs="Berlin" (what the application returned)
  - Evaluator compares them and returns: {"score": 1.0, "success": True}
- Results are collected and stored in Agenta
  - You can view them in the web interface
  - Or access them programmatically from the result object
Next Steps
Now that you've created your first evaluation, you can:
- Learn how to configure custom evaluators with different scoring logic
- Explore built-in evaluators like LLM-as-a-judge
- Understand how to configure your application for different use cases
- Run multiple evaluators in a single evaluation
Common Patterns
Using Multiple Evaluators
You can run several evaluators on the same application:
result = await aevaluate(
    testsets=[testset.id],
    applications=[capital_finder],
    evaluators=[
        exact_match,
        case_insensitive_match,
        similarity_check,
    ],
)
Each evaluator runs independently and produces its own scores.
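The snippet above assumes the extra evaluators are defined. Here's a minimal sketch of case_insensitive_match, reusing the decorator pattern from Step 3 (similarity_check would follow the same shape with its own comparison logic):

@ag.evaluator(
    slug="case_insensitive_match",
    name="Case-Insensitive Match",
)
async def case_insensitive_match(capital: str, outputs: str):
    # Like exact_match, but ignores letter case and surrounding whitespace
    is_correct = outputs.strip().lower() == capital.strip().lower()
    return {
        "score": 1.0 if is_correct else 0.0,
        "success": is_correct,
    }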
Accessing Additional Test Data
Your evaluators can access any field from the testcase:
@ag.evaluator(slug="region_aware")
async def region_aware(country: str, region: str, outputs: str):
# You can access multiple fields from the testcase
# and use them in your evaluation logic
pass
Returning Multiple Metrics
Evaluators can return multiple scores:
@ag.evaluator(slug="detailed_eval")
async def detailed_eval(expected: str, outputs: str):
return {
"exact_match": 1.0 if outputs == expected else 0.0,
"length_diff": abs(len(outputs) - len(expected)),
"success": outputs == expected,
}
Getting Help
If you run into issues:
- Join our Discord community
- Open an issue on GitHub
