What is evaluation?

The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, building a Retrieval-Augmented Generation (RAG) pipeline, or fine-tuning a model for a text generation task, evaluation is a critical step to ensure consistent performance across different inputs, models, and parameters.

Key concepts

Evaluators

Evaluators are functions that assess the output of an LLM application.

Evaluators typically take as input:

  • The output of the LLM application
  • (Optional) The reference answer (i.e., expected output or ground truth)
  • (Optional) The inputs to the LLM application
  • Any other relevant data, such as context

Evaluators return different types of results depending on the evaluator type. Simple evaluators return a single value, such as a boolean (true/false) or a numeric score. Evaluators with schemas (such as LLM-as-a-Judge or custom evaluators) can return structured results with multiple fields, letting you capture several aspects of the evaluation in a single result.
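
As a concrete illustration, here is a minimal sketch of two evaluators written as plain Python functions: one returning a single boolean, one returning a structured result with multiple fields. The function signatures and field names are illustrative assumptions, not a prescribed interface.

```python
def exact_match_evaluator(output: str, correct_answer: str) -> bool:
    """Simple evaluator: does the app's output match the reference answer?"""
    return output.strip().lower() == correct_answer.strip().lower()


def qa_evaluator(output: str, correct_answer: str) -> dict:
    """Evaluator with a schema: returns multiple fields in a single result."""
    return {
        "exact_match": output.strip().lower() == correct_answer.strip().lower(),
        "output_length": len(output),
    }
```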

Test sets

Test sets are collections of test cases used to evaluate your LLM application. Each test case contains:

  • Inputs: The data your LLM application expects (required)
  • Ground Truth: The expected answer from your application (optional, often stored as "correct_answer")
  • Annotations: Additional metadata or rules about the test case (optional)
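
For illustration, a small test set could be represented as a list of test cases with these three fields; the exact schema below is an assumption made for the sketch, with the ground truth stored under "correct_answer" as mentioned above.

```python
test_set = [
    {
        # Inputs: the data the LLM application expects (required)
        "inputs": {"question": "What is the capital of France?"},
        # Ground truth: the expected answer (optional)
        "correct_answer": "Paris",
        # Annotations: extra metadata about the test case (optional)
        "annotations": {"topic": "geography", "difficulty": "easy"},
    },
    {
        "inputs": {"question": "Summarize the refund policy in one sentence."},
        "correct_answer": "Customers can request a full refund within 30 days.",
        "annotations": {"topic": "support"},
    },
]
```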

Test sets are critical for:

  • Evaluating your application systematically
  • Finding edge cases
  • Preventing regressions
  • Measuring improvements over time

Evaluation workflows

Agenta supports multiple evaluation workflows:

  1. Automated Evaluation (UI): Run evaluations from the web interface with configurable evaluators
  2. Automated Evaluation (SDK): Run evaluations programmatically for integration into CI/CD pipelines
  3. Online Evaluation: Run evaluations on new traces as they are generated by your LLM application
  4. Human Evaluation: Collect expert feedback and annotations for qualitative assessment
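
Conceptually, an automated evaluation run pairs each test case with the application's output and applies the configured evaluators to it. The sketch below shows that loop in plain Python using the illustrative evaluator and test set shapes from the earlier examples; it is not the Agenta SDK API, and all names are assumptions.

```python
def run_evaluation(app, test_set, evaluators):
    """Apply every evaluator to the app's output for each test case.

    `app` is any callable mapping a test case's inputs to an output string,
    and `evaluators` maps evaluator names to functions that take
    (output, correct_answer). All names here are illustrative.
    """
    results = []
    for case in test_set:
        output = app(**case["inputs"])
        scores = {
            name: fn(output, case.get("correct_answer", ""))
            for name, fn in evaluators.items()
        }
        results.append({"inputs": case["inputs"], "output": output, "scores": scores})
    return results
```

A loop like this can be wired into a CI/CD job so that a drop in evaluator scores on the test set surfaces before a change ships, which is the same idea behind the SDK-based workflow above.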

Next steps