Running Evaluations
Once you have defined your testsets, applications, and evaluators, you can run evaluations using the aevaluate() function. This function executes your application on test data and scores the outputs using your evaluators.
Basic Usage
The aevaluate() function requires three inputs:
from agenta.sdk.evaluations import aevaluate

result = await aevaluate(
    testsets=[testset.id],
    applications=[my_application],
    evaluators=[my_evaluator],
)
Required Parameters:
- testsets: A list of testset IDs or testset data
- applications: A list of application functions or IDs
- evaluators: A list of evaluator functions or IDs
The function runs each test case through your application and evaluates the output with all specified evaluators.
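Conceptually, the evaluation loop works like the sketch below. This is illustrative only, not the SDK's actual implementation; the run_eval_loop function and its arguments are hypothetical stand-ins:

async def run_eval_loop(testcases, applications, evaluators):
    # Hypothetical sketch: every testcase runs through every application,
    # and every output is scored by every evaluator.
    results = []
    for testcase in testcases:
        for application in applications:
            output = await application(**testcase)
            for evaluator in evaluators:
                verdict = await evaluator(outputs=output)
                results.append((testcase, output, verdict))
    return results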
Passing Testsets
You can provide testsets in two ways:
Using testset IDs:
# Create a testset first
testset = await ag.testsets.acreate(
    name="My Test Data",
    data=[
        {"input": "Hello", "expected": "Hi"},
        {"input": "Goodbye", "expected": "Bye"},
    ],
)
# Use the ID in aevaluate
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[my_eval],
)
Using inline data:
# Pass test data directly
result = await aevaluate(
    testsets=[
        [
            {"input": "Hello", "expected": "Hi"},
            {"input": "Goodbye", "expected": "Bye"},
        ]
    ],
    applications=[my_app],
    evaluators=[my_eval],
)
When you pass inline data, Agenta automatically creates a testset for you.
Running Multiple Evaluators
You can run several evaluators on the same application in a single evaluation:
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_application],
    evaluators=[
        exact_match,
        fuzzy_match,
        llm_judge,
    ],
)
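The exact_match, fuzzy_match, and llm_judge evaluators referenced above aren't defined in this snippet. As a rough sketch, the first two could be written with the @ag.evaluator decorator shown in the Complete Example below. Note that the expected parameter here is an assumption about how testcase columns reach evaluators; check the evaluator documentation for the exact signature. An llm_judge evaluator would typically call a model to grade the output.

import difflib

import agenta as ag

# Sketch only: assumes the testcase's "expected" column is passed
# to the evaluator alongside the application output.
@ag.evaluator(slug="exact_match")
async def exact_match(outputs: str, expected: str):
    matched = outputs == expected
    return {"score": 1.0 if matched else 0.0, "success": matched}

@ag.evaluator(slug="fuzzy_match")
async def fuzzy_match(outputs: str, expected: str):
    # difflib similarity ratio in [0, 1]; 1.0 means identical strings
    ratio = difflib.SequenceMatcher(None, outputs, expected).ratio()
    return {"score": ratio, "success": ratio > 0.8}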
Adding Names and Descriptions
You can add metadata to make your evaluations easier to identify:
result = await aevaluate(
    name="Product Launch Evaluation",
    description="Testing the new response format against our quality criteria",
    testsets=[testset.id],
    applications=[my_application],
    evaluators=[my_evaluator],
)
The name and description appear in the Agenta UI and help you track evaluations over time.
Complete Example
Here's the full workflow as a runnable script. Since await is only valid inside a coroutine, the testset creation and evaluation are wrapped in an async main() function:
import asyncio

import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize the SDK
ag.init()

# Define the application under test
@ag.application(slug="greeting_app")
async def greeting_app(message: str):
    return f"Hello, {message}!"

# Define an evaluator that scores the output by its length
@ag.evaluator(slug="length_check")
async def length_check(outputs: str):
    return {
        "score": len(outputs),
        "success": len(outputs) < 50,
    }

async def main():
    # Create the testset
    testset = await ag.testsets.acreate(
        name="Greeting Tests",
        data=[
            {"message": "Alice"},
            {"message": "Bob"},
            {"message": "Charlie"},
        ],
    )

    # Run the evaluation
    result = await aevaluate(
        name="Greeting App Test",
        testsets=[testset.id],
        applications=[greeting_app],
        evaluators=[length_check],
    )

    print(f"Evaluation complete! Run ID: {result['run'].id}")

asyncio.run(main())