Custom Code Evaluator

Sometimes the default evaluators in Agenta are not sufficient for your use case. In that case, you can write a custom evaluator in Python.

info

For the moment, there are limitations on the code that can be written in a custom evaluator. Our backend uses RestrictedPython to execute the code, which limits the libraries that can be used.

Evaluation code

Your custom evaluator should include a function called evaluate with the following signature:

from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    ...  # compute and return a score between 0.0 and 1.0

This function should return a float value representing the evaluation score. The score ranges from 0.0 to 1.0, where 0.0 indicates a failed evaluation and 1.0 indicates a perfect score.

The function parameters are:

app_params: A dictionary containing the configuration of the app. It includes the prompt, the model, and all other parameters set in the playground, under the same names.
inputs: A dictionary containing the inputs of the app.
output: The generated output of the app.
correct_answer: The expected (correct) answer for the given inputs.
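
To make the shape of these parameters concrete, here is a minimal sketch of calling the function locally, outside Agenta. The keys and values (`prompt_template`, `model`, `country`) are hypothetical placeholders, not names guaranteed by Agenta, and the function body is only a stand-in that ignores surrounding whitespace.

```python
from typing import Dict


def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str,
) -> float:
    # Stand-in logic: compare after trimming surrounding whitespace.
    return 1.0 if output.strip() == correct_answer.strip() else 0.0


# Hypothetical sample values; real keys mirror the names used in your playground.
sample_app_params = {
    "prompt_template": "What is the capital of {country}?",
    "model": "gpt-3.5-turbo",
}
sample_inputs = {"country": "France"}

score = evaluate(sample_app_params, sample_inputs, " Paris\n", "Paris")
print(score)
```

Running the function this way before pasting it into Agenta can help catch simple mistakes, although it does not reproduce the restrictions of the sandboxed backend.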

Here's an example implementation of an exact match evaluator:

from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    # Return floats so the result matches the declared return type.
    return 1.0 if output == correct_answer else 0.0
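
Because the score can be any value between 0.0 and 1.0, an evaluator can also award partial credit. The following is a sketch, not an official Agenta evaluator: it scores the fraction of words in the correct answer that also appear in the output. It uses only built-in string and set operations. Whether every construct here is permitted by the RestrictedPython sandbox is an assumption that has not been verified.

```python
from typing import Dict


def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str,
) -> float:
    # Lowercase and split on whitespace to get a set of unique words.
    expected_tokens = set(correct_answer.lower().split())
    output_tokens = set(output.lower().split())

    # Edge case: with an empty expected answer, only an empty output is correct.
    if not expected_tokens:
        return 1.0 if not output_tokens else 0.0

    # Fraction of expected words found in the output, always in [0.0, 1.0].
    matched = expected_tokens & output_tokens
    return len(matched) / len(expected_tokens)
```

With this sketch, an output containing one of two expected words would score 0.5. Note that it ignores word order and does not strip punctuation, so it suits short answers better than long free-form text.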
