Creating Test Sets from Traces

Overview

Creating test sets is one of the most critical parts of building reliable LLM-powered applications. Without test sets, it's very hard to evaluate your application, find edge cases, improve the application for these cases, or discover regressions when they appear.

In this guide, we'll show you how to create test sets in Agenta by using your production data and playground experiments.

What is a Test Set?

A test set is a collection of test cases, each containing:

Inputs: The data your LLM application expects (required)
Ground Truth: The expected answer from your application (optional)
Annotations: Additional metadata or rules about the test case (optional)

The inputs are always required - without them, we can't invoke the LLM application. The ground truth and annotations are optional but provide additional capabilities for evaluation.

Creating Test Sets from Production Data

Adding a Single Trace to a Test Set

Navigate to the Observability view in Agenta
Find a trace you want to add to a test set
Click the Add to Test Set button at the top of the trace view
Select "Create New" to make a new test set (or select an existing one)
Name your test set (e.g., "docs-test-set")
Check the mapping between trace data and test set columns:
Optionally, edit the correct answer if you don't agree with the output
Click Save to add the trace to your test set

Adding Multiple Traces at Once

In the Observability view, use the search function to find related traces
- For example, search for "I don't have enough information" to find cases where your application couldn't answer
Select all relevant traces by checking the boxes next to them
Click Add to Test Set
Choose an existing test set or create a new one
Review the mapping for the traces
Click Save to add all selected traces to your test set

Creating Test Sets from the Playground

While working in the playground, you may find interesting cases that would make good test examples:

Work with your application in the playground
When you find an interesting case or edge case, click Add to Test Set
Select an existing test set or create a new one
Review the data mapping
Click Save to add the case to your test set

Using Your Test Sets

Once you have a test set, you can use it in several ways:

Load it in the playground:
- Click the "Load" button in the playground
- Select your test set and the specific test cases you want to use
- Use these test cases to iterate on your prompt or application
Create evaluations:
- Use your test set as the basis for automated or human evaluations
- Compare your application's output against ground truth answers
- Measure performance across different variants

For more information on evaluations, see our Evaluation documentation.

Test Set Best Practices

Types of Test Sets

Happy Path Tests:
- Verify your application works correctly under normal, expected conditions
- Help ensure core functionality remains intact as you make changes
- Useful for regression testing and quality assurance
Grumpy Path Tests:
- Check how your application handles edge cases or problematic scenarios
- Include prompt injection attempts, malformed inputs, or out-of-scope requests
- Help identify vulnerabilities and improve robustness

Evaluation with Test Sets

Even with just inputs (no ground truth), you can evaluate your application using:

Human evaluation: Have people review the outputs for quality
LLM as a judge: Use a prompt that assesses outputs based on criteria like relevance or accuracy

Adding ground truth expands your evaluation options, allowing you to:

Compare outputs against expected answers
Use metrics like exact match or semantic similarity
Measure accuracy and quality objectively

Overview​

What is a Test Set?​

Creating Test Sets from Production Data​

Adding a Single Trace to a Test Set​

Adding Multiple Traces at Once​

Creating Test Sets from the Playground​

Using Your Test Sets​

Test Set Best Practices​

Types of Test Sets​

Evaluation with Test Sets​

Related Resources​