
Evaluation from SDK

This guide explains how to run evaluations programmatically using the agenta SDK. We will do the following:

  • Create a test set
  • Create and configure an evaluator
  • Run an evaluation
  • Retrieve the results of evaluations

How agenta evaluation works

In agenta, evaluation is a fully managed service. It runs entirely on our backend, which takes care of:

  • Queuing and managing evaluation jobs
  • Batching LLM app calls to optimize performance and avoid exceeding rate limits
  • Handling retries and errors automatically to ensure robust evaluation runs

Our evaluation service takes test sets, evaluators, and app variants, and runs asynchronous evaluation jobs.

info

You can open this guide in a Jupyter notebook.

1. Setup

  1. Create an LLM app and a couple of variants in agenta, and install our SDK using pip install -U agenta.

  2. Retrieve the application ID from the URL of the app page.

2. Set up the SDK client


app_id = "667d8cfad1812781f7e375d9"

# You can create the API key under the settings page. If you are using the OSS version, you should keep this as an empty string
api_key = "EUqJGOUu.xxxx"

# Host.
host = "https://cloud.agenta.ai"

# Initialize the client

client = AgentaApi(base_url=host + "/api", api_key=api_key)
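
To make sure the client is configured correctly, you can make a quick read-only call before going further. This is a minimal sketch that assumes the generated client exposes a list_apps method under client.apps (check your SDK version if the call differs):

# Optional sanity check: list the apps visible to this API key
apps = client.apps.list_apps()
for app in apps:
    print(app)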

3. Create a test set

from agenta.client.backend.types.new_testset import NewTestset

csvdata = [
    {"country": "france", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
]

response = client.testsets.create_testset(
    app_id=app_id,
    request=NewTestset(name="test set", csvdata=csvdata),
)
test_set_id = response.id
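
The csvdata payload is simply a list of dictionaries, one per test case, so you can also build it from an existing CSV file. Here is a minimal sketch using pandas; the file name test_cases.csv is only an example:

import pandas as pd

# Each row becomes one test case; column names become the test set columns
df = pd.read_csv("test_cases.csv")  # e.g. a file with "country" and "capital" columns
csvdata = df.to_dict(orient="records")

response = client.testsets.create_testset(
    app_id=app_id,
    request=NewTestset(name="countries from csv", csvdata=csvdata),
)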

4. Create evaluators

Let's create a custom code evaluator that returns 1.0 if the first letter of the app output is uppercase and 0.0 otherwise.

code_snippet = """
from typing import Dict

def evaluate(
app_params: Dict[str, str],
inputs: Dict[str, str],
output: str, # output of the llm app
datapoint: Dict[str, str] # contains the testset row
) -> float:
if output and output[0].isupper():
return 1.0
else:
return 0.0
"""

response = client.evaluators.create_new_evaluator_config(app_id=app_id, name="capital_letter_evaluator", evaluator_key="auto_custom_code_run", settings_values={"code": code_snippet})
letter_match_eval_id = response.id
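
Since the evaluator is plain Python, it is worth sanity-checking the snippet locally before registering it. The sketch below executes the code string and calls evaluate with dummy arguments:

# Run the evaluator code locally with a sample datapoint
namespace = {}
exec(code_snippet, namespace)

score = namespace["evaluate"](
    app_params={},
    inputs={"country": "france"},
    output="Paris",  # starts with an uppercase letter
    datapoint={"country": "france", "capital": "Paris"},
)
print(score)  # expected: 1.0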

5. Run an evaluation

First, let's grab the first variant of the app.

response = client.apps.list_app_variants(app_id=app_id)
print(response)
myvariant_id = response[0].variant_id
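
If the app has several variants, selecting by index can be brittle. You can pick a variant by name instead; the sketch below assumes each item in the response exposes variant_name and variant_id attributes (as shown by the print above), and the name app.default is only a placeholder:

# Pick a specific variant by name rather than by position
target_name = "app.default"  # hypothetical name; check print(response) above
myvariant_id = next(v.variant_id for v in response if v.variant_name == target_name)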

Then, let's start the evaluation job:

from agenta.client.backend.types.llm_run_rate_limit import LlmRunRateLimit

rate_limit_config = LlmRunRateLimit(
    batch_size=10,            # number of rows to call in parallel
    max_retries=3,            # max number of times to retry a failed llm call
    retry_delay=2,            # delay before retrying a failed llm call
    delay_between_batches=5,  # delay between batches
)

response = client.evaluations.create_evaluation(
    app_id=app_id,
    variant_ids=[myvariant_id],
    testset_id=test_set_id,
    evaluators_configs=[letter_match_eval_id],
    rate_limit=rate_limit_config,
)
print(response)

Now we can check the status of the job, using the evaluation id from the response above:

client.evaluations.fetch_evaluation_status('667d98fbd1812781f7e3761a')
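
Instead of re-running this call by hand, you can poll until the job reaches a terminal state. The sketch below is assumption-laden: it checks the stringified status response for "finished" or "failed", so adjust the check to the actual field and values returned by your agenta version. The id is the evaluation id from the create_evaluation response above:

import time

evaluation_id = "667d98fbd1812781f7e3761a"  # evaluation created above

while True:
    status_response = client.evaluations.fetch_evaluation_status(evaluation_id)
    print(status_response)
    # Assumption: the status value stops changing once the job has finished or failed;
    # adjust this check to the real field name and values you observe
    if "finished" in str(status_response).lower() or "failed" in str(status_response).lower():
        break
    time.sleep(5)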

As soon as it is done, we can fetch the overall results:

response = client.evaluations.fetch_evaluation_results('667d98fbd1812781f7e3761a')

results = [(evaluator["evaluator_config"]["name"], evaluator["result"]) for evaluator in response["results"]]

And the detailed results:

client.evaluations.fetch_evaluation_scenarios(evaluations_ids='667d98fbd1812781f7e3761a')