
Evaluator Testing Playground and a New Evaluation View


Many users faced challenges configuring evaluators in the web UI. Some evaluators, such as LLM-as-a-judge, custom code, or RAG evaluators, can be tricky to set up correctly on the first try. Until now, you had to set up the evaluator, run an evaluation, check the errors, and then start over.

To address this, we've introduced a new evaluator test/debug playground. It lets you run the evaluator live on real data, so you can validate the configuration before committing to it and using it in evaluations.

Additionally, we have improved and redesigned the evaluation view. Both automatic and human evaluations are now within the same view but in different tabs. We're moving towards unifying all evaluator results and consolidating them in one view, allowing you to quickly get an overview of what's working.


UI Redesign, Configuration Management, and Overview View

We've completely redesigned the platform's UI. Additionally, we've introduced a new overview view for your applications. This is the first in a series of improvements slated for the next few weeks.

The new overview view offers:

  • A dashboard displaying key metrics of your application
  • A table with all the variants of your applications
  • A summary of your application's most recent evaluations

We've also added a new JSON Diff evaluator. This evaluator compares two JSON objects and provides a similarity score.
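To give a sense of what such a score captures (this is only a conceptual sketch, not agenta's actual implementation, which may weigh keys, types, and nesting differently), a JSON similarity can be computed as the fraction of expected leaf values that the actual output reproduces:

def json_similarity(expected, actual) -> float:
    # Recursively compare two JSON-like values and return a score in [0, 1]:
    # the fraction of leaves in `expected` that `actual` matches.
    if isinstance(expected, dict) and isinstance(actual, dict):
        if not expected:
            return 1.0
        return sum(json_similarity(v, actual.get(k)) for k, v in expected.items()) / len(expected)
    if isinstance(expected, list) and isinstance(actual, list):
        if not expected:
            return 1.0
        return sum(json_similarity(e, a) for e, a in zip(expected, actual)) / len(expected)
    return 1.0 if expected == actual else 0.0

# One of the two leaves differs, so the score is 0.5
print(json_similarity({"city": "Paris", "country": "FR"},
                      {"city": "Paris", "country": "France"}))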

Lastly, we've updated the UI of our documentation.


New Alpha Version of the SDK for Creating Custom Applications

We've released a new version of the SDK for creating custom applications. This Pydantic-based SDK significantly simplifies the process of building custom applications. It's fully backward compatible, so your existing code will continue to work seamlessly. We'll soon be rolling out comprehensive documentation and examples for the new SDK.

In the meantime, here's a quick example of how to use it:

import agenta as ag
from agenta import Agenta
from pydantic import BaseModel, Field
from openai import OpenAI

ag.init()

# The OpenAI client was implicit in the original snippet; it is assumed here
# and reads OPENAI_API_KEY from the environment.
client = OpenAI()

# Define the configuration of the application (it will be shown in the playground)
class MyConfig(BaseModel):
    temperature: float = Field(default=0.2)
    prompt_template: str = Field(default="What is the capital of {country}?")

# Creates an endpoint for the entrypoint of the application
@ag.route("/", config_schema=MyConfig)
def generate(country: str) -> str:
    # Fetch the config from the request
    config: MyConfig = ag.ConfigManager.get_from_route(schema=MyConfig)
    prompt = config.prompt_template.format(country=country)
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=config.temperature,
    )
    return chat_completion.choices[0].message.content


RAGAS Evaluators and Traces in the Playground

We're excited to announce two major features this week:

  1. We've integrated RAGAS evaluators into agenta. Two new evaluators have been added: RAG Faithfulness (measuring how consistent the LLM output is with the context) and Context Relevancy (assessing how relevant the retrieved context is to the question). Both evaluators use intermediate outputs within the trace to calculate the final score.

    Check out the tutorial to learn how to use RAG evaluators. A standalone sketch of what these metrics compute appears after the note below.

  2. You can now view traces directly in the playground. This feature enables you to debug your application while configuring it, for example by examining the prompts sent to the LLM or reviewing intermediate outputs.

Note: Both features are available exclusively in the cloud and enterprise versions of agenta.
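For readers curious about what these metrics actually compute, here is a minimal standalone sketch using the open-source RAGAS library directly. This is not the agenta integration (which pulls the question, context, and answer from your trace), and metric names can vary across RAGAS versions:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy

# One evaluation sample: the question, the retrieved context chunks,
# and the answer generated by the application.
# RAGAS uses an LLM under the hood, so an OPENAI_API_KEY must be set.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "contexts": [["Paris is the capital and largest city of France."]],
    "answer": ["The capital of France is Paris."],
})

# Faithfulness: is the answer supported by the context?
# Context relevancy: is the retrieved context relevant to the question?
scores = evaluate(data, metrics=[faithfulness, context_relevancy])
print(scores)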


Migration from MongoDB to Postgres

We have migrated the Agenta database from MongoDB to Postgres. As a result, the platform is much faster (up to 10x in some use cases).

However, if you are self-hosting agenta, note that this is a breaking change that requires you to manually migrate your data from MongoDB to Postgres.

If you are using the cloud version of Agenta, there is nothing you need to do (other than enjoying the new performance improvements).


More Reliable Evaluations

We have worked extensively on improving the reliability of evaluations. Specifically:

  • We improved evaluation statuses and added a new Queued status
  • We improved error handling in evaluations; we now show the exact error message that caused an evaluation to fail
  • We fixed issues that caused evaluations to run indefinitely
  • We fixed issues in the calculation of scores in human evaluations
  • We fixed small UI issues with large outputs in human evaluations
  • We added a new export button in the evaluation view to export the results as a CSV file

Additionally, we have added a new cookbook for running evaluations using the SDK.

In observability:

  • We have added a new integration with LiteLLM to automatically trace all LLM calls made through it (see the sketch after this list)
  • We now automatically propagate cost and token usage from spans to traces
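As a rough illustration of the data such an integration builds on, here is a generic LiteLLM success callback that logs the model and token usage after each call. This is not agenta's built-in handler (check the observability docs for the actual setup); it only shows what LiteLLM reports to callbacks:

import litellm

# LiteLLM calls this after every successful completion with the request
# kwargs, the response object, and the start/end timestamps.
def log_usage(kwargs, completion_response, start_time, end_time):
    usage = completion_response.usage
    print(
        f"model={kwargs.get('model')} "
        f"prompt_tokens={usage.prompt_tokens} "
        f"completion_tokens={usage.completion_tokens} "
        f"latency={(end_time - start_time).total_seconds():.2f}s"
    )

litellm.success_callback = [log_usage]

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)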

Evaluators Can Access All Columns

Evaluators can now access all columns in the test set. Previously, you were limited to using the correct_answer column as the ground truth / reference answer in evaluations. Now you can configure your evaluator to use any column in the test set as the ground truth. To do that, open the collapsible Advanced Settings when configuring the evaluator and set the Expected Answer Column to the name of the column containing the reference answer you want to use.

In addition to this:

  • We've upgraded the SDK to Pydantic v2.
  • We've improved the speed of the get config endpoint by 10x.
  • We've added documentation for observability.

Playground Improvements

v0.14.1-13

  • We've improved the workflow for adding outputs to a test set in the playground. Previously, you had to select the name of the test set each time; now the last used test set is selected by default.
  • We've significantly improved the debugging experience when creating applications from code. If an application fails, you can now view the logs to understand the reason behind the failure.
  • We've moved the copy message button in the playground to the output text area.
  • We now hide cost and usage in the playground when they aren't specified.
  • We've made improvements to error messages in the playground.

Bug Fixes

  • Fixed the order of the arguments when running a custom code evaluator
  • Fixed the timestamps in the Testset view (previously, trailing zeros were dropped)
  • Fixed the creation of applications from code in the self-hosted version when using Windows

Prompt and Configuration Registry

We've introduced a feature that allows you to use Agenta as a prompt registry or management system. In the deployment view, we now provide an endpoint to directly fetch the latest version of your prompt. Here is how it looks:


from agenta import Agenta

agenta = Agenta()

# Fetches the configuration from the production environment (with caching)
config = agenta.get_config(
    base_id="xxxxx",
    environment="production",
    cache_timeout=200,
)
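From there you can plug the fetched values straight into your application code. The snippet below is hypothetical: it assumes the returned config exposes the parameters saved for that environment (for example a prompt_template and a temperature); see the documentation linked below for the authoritative interface.

# Hypothetical usage, assuming a dict-like config with these parameters
prompt = config["prompt_template"].format(country="France")
print(prompt, config["temperature"])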

You can find additional documentation here.

Improvements

  • Previously, publishing a variant from the playground to an environment was a manual process. From now on, variants are published to the production environment by default.