Running Evaluations

Once you have defined your testsets, applications, and evaluators, you can run evaluations using the aevaluate() function. This function executes your application on test data and scores the outputs using your evaluators.

Basic Usage

The aevaluate() function requires three inputs:

from agenta.sdk.evaluations import aevaluate

result = await aevaluate(
    testsets=[testset.id],
    applications=[my_application],
    evaluators=[my_evaluator],
)

Required Parameters:

  • testsets: A list of testset IDs or testset data
  • applications: A list of application functions or IDs
  • evaluators: A list of evaluator functions or IDs

The function runs each test case through your application and evaluates the output with all specified evaluators.
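Conceptually, the run pairs every test case with every evaluator. The sketch below uses a hypothetical conceptual_run helper to illustrate that flow; it is not how the SDK is implemented (the real run also handles tracing, concurrency, and result storage), but it shows how inputs, outputs, and scores relate.

# Hypothetical helper illustrating the evaluation flow only; not part of the Agenta SDK.
async def conceptual_run(rows, application, evaluators):
    results = []
    for row in rows:
        output = await application(**row)  # run the app on one test case
        scores = {
            evaluator.__name__: await evaluator(outputs=output)
            for evaluator in evaluators
        }
        results.append({"inputs": row, "outputs": output, "scores": scores})
    return results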

Passing Testsets

You can provide testsets in two ways:

Using testset IDs:

# Create a testset first
testset = await ag.testsets.acreate(
    name="My Test Data",
    data=[
        {"input": "Hello", "expected": "Hi"},
        {"input": "Goodbye", "expected": "Bye"},
    ],
)

# Use the ID in aevaluate
result = await aevaluate(
    testsets=[testset.id],
    applications=[my_app],
    evaluators=[my_eval],
)

Using inline data:

# Pass test data directly
result = await aevaluate(
    testsets=[
        [
            {"input": "Hello", "expected": "Hi"},
            {"input": "Goodbye", "expected": "Bye"},
        ]
    ],
    applications=[my_app],
    evaluators=[my_eval],
)

When you pass inline data, Agenta automatically creates a testset for you.

Running Multiple Evaluators

You can run several evaluators on the same application in a single evaluation:

result = await aevaluate(
    testsets=[testset.id],
    applications=[my_application],
    evaluators=[
        exact_match,
        fuzzy_match,
        llm_judge,
    ],
)
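The evaluators referenced above are ordinary evaluator functions. As a rough illustration, the first two might look like the sketch below; the slugs, parameter names, and threshold are assumptions (in particular, whether an evaluator receives testset columns such as expected alongside outputs depends on how your evaluators are defined), and an llm_judge evaluator would typically call a model to score the output.

from difflib import SequenceMatcher

import agenta as ag

# Illustrative definitions only; names, parameters, and thresholds are assumptions.
@ag.evaluator(slug="exact_match")
async def exact_match(outputs: str, expected: str):
    return {"success": outputs == expected}

@ag.evaluator(slug="fuzzy_match")
async def fuzzy_match(outputs: str, expected: str):
    ratio = SequenceMatcher(None, outputs, expected).ratio()  # similarity in [0, 1]
    return {"score": ratio, "success": ratio >= 0.8}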

Adding Names and Descriptions

You can add metadata to make your evaluations easier to identify:

result = await aevaluate(
    name="Product Launch Evaluation",
    description="Testing the new response format against our quality criteria",
    testsets=[testset.id],
    applications=[my_application],
    evaluators=[my_evaluator],
)

The name and description appear in the Agenta UI and help you track evaluations over time.

Complete Example

Here's a full evaluation workflow:

import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize
ag.init()

# Define application
@ag.application(slug="greeting_app")
async def greeting_app(message: str):
    return f"Hello, {message}!"

# Define evaluator
@ag.evaluator(slug="length_check")
async def length_check(outputs: str):
    return {
        "score": len(outputs),
        "success": len(outputs) < 50,
    }

# Create testset
testset = await ag.testsets.acreate(
    name="Greeting Tests",
    data=[
        {"message": "Alice"},
        {"message": "Bob"},
        {"message": "Charlie"},
    ],
)

# Run evaluation
result = await aevaluate(
    name="Greeting App Test",
    testsets=[testset.id],
    applications=[greeting_app],
    evaluators=[length_check],
)

print(f"Evaluation complete! Run ID: {result['run'].id}")