
Prompt Versioning

We've introduced prompt versioning, allowing you to track changes made by the team and revert to previous versions. To view a configuration's change history, click the icon in the playground to access all previous versions.


New JSON Evaluator

We have added a new evaluator that matches JSON fields, and added the possibility to use test set columns other than the correct_answer column as the ground truth.
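For illustration only (this is not the evaluator's actual implementation), matching a JSON field against a value taken from a chosen test set column boils down to something like the following sketch; `json_field_match` is a hypothetical helper name.

```python
import json

def json_field_match(output: str, expected: dict, field: str) -> bool:
    """Parse the app's output as JSON and compare a single field against
    the expected value taken from the chosen test set column."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # invalid JSON counts as a mismatch
    return parsed.get(field) == expected.get(field)

# Example: grade the "city" field using a column other than correct_answer
print(json_field_match('{"city": "Paris"}', {"city": "Paris"}, "city"))  # True
```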


Improved error handling in evaluation

We have improved error handling in evaluation to return more information about the exact source of the error in the evaluation view.

Improvements:

  • Added the option in A/B testing human evaluation to mark both variants as correct
  • Improved loading state in Human Evaluation

Bring your own API key

Up until now, we required users to use our OpenAI API key when using the cloud version. Starting now, you can use your own API key for any new application you create.
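As a minimal sketch, assuming your key is exposed to the application as the standard OPENAI_API_KEY environment variable, your code can pick it up like any other secret:

```python
import os

# Illustrative only: once your own key is injected as an environment
# variable, application code reads it like any other secret.
openai_api_key = os.environ.get("OPENAI_API_KEY")
if openai_api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set for this application")
```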


Improved human evaluation workflow

We have made the human evaluation workflow faster by updating the table view to add annotation and correct answer columns.

Improvements:

  • Simplified the database migration process
  • Fixed environment variable injection to enable cloud users to use their own keys
  • Disabled import from endpoint in cloud for security reasons
  • Improved query lookup speed for evaluation scenarios
  • Improved error handling in playground

Bug fixes:

  • Resolved failing Backend tests
  • Fixed a bug in rate limit configuration validation
  • Fixed issue with all aggregated results
  • Resolved issue with live results in A/B testing evaluation not updating

Revamping evaluation

We've spent the past month re-engineering our evaluation workflow. Here's what's new:

Running Evaluations

  1. Simultaneous Evaluations: You can now run multiple evaluations for different app variants and evaluators concurrently.
  2. Rate Limit Parameters: Specify these during evaluations and reattempts to ensure reliable results without exceeding OpenAI rate limits.
  3. Reusable Evaluators: Configure evaluators such as similarity match, regex match, or AI critique once and use them across multiple evaluations (see the sketch after this list).
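To make the ideas concrete, here is a rough conceptual sketch in plain Python (not the agenta SDK API) of a reusable regex-match evaluator and of retrying a call with backoff when a provider rate limit is hit; `regex_match_evaluator` and `call_with_retries` are illustrative names only.

```python
import re
import time

def regex_match_evaluator(pattern: str):
    """Hypothetical reusable evaluator: returns a scoring function that
    gives 1.0 if the LLM output matches the regex, else 0.0."""
    compiled = re.compile(pattern)

    def evaluate(output: str) -> float:
        return 1.0 if compiled.search(output) else 0.0

    return evaluate

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 2.0):
    """Hypothetical helper: retry a call with exponential backoff when the
    provider rejects it, e.g. because of a rate limit."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# The same configured evaluator can be reused across evaluations.
is_json_like = regex_match_evaluator(r"^\s*\{.*\}\s*$")
print(is_json_like('{"answer": 42}'))  # 1.0
```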

Evaluation Reports

  1. Dashboard Improvements: We've upgraded our dashboard interface to better display evaluation results. You can now filter and sort results by evaluator, test set, and outcomes.
  2. Comparative Analysis: Select multiple evaluation runs and view the results of various LLM applications side-by-side.

Adding Cost and Token Usage to the Playground

Caution: This change requires you to pull the latest version of the agenta platform if you're using the self-serve version.

We've added a feature that allows you to compare the time taken by an LLM app, its cost, and track token usage, all in one place.
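As a rough sketch of how cost can be derived from token usage (the per-1K-token prices below are placeholders, not the actual rates of any model):

```python
# Hypothetical per-1K-token prices; actual rates depend on the model used.
PROMPT_PRICE_PER_1K = 0.0015
COMPLETION_PRICE_PER_1K = 0.0020

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of a single LLM call from its token usage."""
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
        + (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

# Example: 320 prompt tokens and 150 completion tokens
print(round(estimate_cost(320, 150), 6))  # 0.00078
```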


Comprehensive Updates and Bug Fixes

  • Incorporated all chat turns into the chat set
  • Rectified self-hosting documentation
  • Introduced asynchronous support for applications
  • Added 'register_default' alias
  • Fixed a bug in the side-by-side feature