Improvements to the Playground and Custom Workflows

We've made several improvements to the playground, including:

  • Improved scrolling behavior
  • Made variant creation and comparison easier to discover
  • Added the ability to stop a running generation in the playground

Custom workflows now support sub-routes: you can define multiple routes in a single file and create multiple custom workflows from it.


OpenTelemetry Compliance and Custom Workflows from the API

We've introduced major improvements to Agenta, focusing on OpenTelemetry compliance and simplified custom workflow debugging.

OpenTelemetry (OTel) Support:

Agenta is now fully OpenTelemetry-compliant. This means you can seamlessly integrate Agenta with thousands of OTel-compatible services using existing SDKs. To integrate your application with Agenta, simply configure an OTel exporter pointing to your Agenta endpoint—no additional setup required.
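In practice, pointing an existing OTel setup at Agenta comes down to exporter configuration. Here is a sketch using the standard OTLP exporter environment variables; the endpoint path and header value below are illustrative placeholders, not confirmed Agenta values, so substitute the endpoint and API key shown in your Agenta dashboard:

```shell
# Standard OTLP exporter environment variables (defined by the OpenTelemetry spec).
# The endpoint and header below are placeholders -- copy the real values
# from your Agenta dashboard.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://cloud.agenta.ai/api/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="authorization=ApiKey YOUR_AGENTA_API_KEY"
```

Any OTel-instrumented application picks these up automatically, so no code changes are needed beyond your existing instrumentation.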

We've enhanced distributed tracing capabilities to better debug complex distributed agent systems. All HTTP interactions between agents—whether running within Agenta's SDK or externally—are automatically traced, making troubleshooting and monitoring easier.

Detailed instructions and examples are available in our distributed tracing documentation.

Improved Custom Workflows:

Based on your feedback, we've streamlined debugging and running custom workflows:

  • Run workflows from your own environment: You no longer need the Agenta CLI to manage custom workflows. Setting up a custom workflow is now a matter of adding the Agenta SDK to your code, creating an endpoint, and connecting it to Agenta via the web UI. See the quick start guide for details.

  • Custom Workflows in the new playground: Custom workflows are now fully compatible with the new playground. You can now nest configurations, run side-by-side comparisons, and debug your agents and complex workflows very easily.


New Playground


We've rebuilt our playground from scratch to make prompt engineering faster and more intuitive. The old playground took 20 seconds to create a prompt; now it's instant.

Key improvements:

  • Create prompts with multiple messages using our new template system
  • Format variables easily with curly bracket syntax and a built-in validator
  • Switch between chat and completion prompts in one interface
  • Load test sets directly in the playground to iterate faster
  • Save successful outputs as test cases with one click
  • Compare different prompts side-by-side
  • Deploy changes straight to production
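The curly-bracket variable syntax with a built-in validator (second bullet above) can be sketched with the Python standard library. This is an illustrative re-implementation of the idea, not the playground's actual validator:

```python
from string import Formatter

def extract_variables(template: str) -> list[str]:
    """Return the {curly_bracket} variable names found in a prompt template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]

def missing_variables(template: str, provided: dict) -> list[str]:
    """Return the variables the template needs but `provided` does not supply."""
    return [v for v in extract_variables(template) if v not in provided]

template = "Summarize {document} in {language}."
print(extract_variables(template))                    # ['document', 'language']
print(missing_variables(template, {"document": "x"})) # ['language']
print(template.format(document="the report", language="French"))
```

A validator like this lets the UI flag an unfilled variable before the prompt is ever sent to a model.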

Developers can now create prompts programmatically through our API.

You can explore these features in our updated playground documentation.


Quality of life improvements

New collapsible side menu

Small release today with quality-of-life improvements while we prepare the major release coming in the next few days:

  • Added a collapsible side menu for better space management
  • Enhanced frontend performance and responsiveness
  • Implemented a confirmation modal when deleting test sets
  • Improved permission handling across the platform
  • Improved frontend test coverage

Agenta is SOC 2 Type 1 Certified

We've achieved SOC 2 Type 1 certification, validating our security controls for protecting sensitive LLM development data. This certification covers our entire platform, including prompt management, evaluation frameworks, and observability tools.

Key security features and improvements:

  • Data encryption in transit and at rest
  • Enhanced access control and authentication
  • Comprehensive security monitoring
  • Regular third-party security assessments
  • Backup and disaster recovery protocols

This certification represents a significant milestone for teams using Agenta in production environments. Whether you're using our open-source platform or cloud offering, you can now build LLM applications with enterprise-grade security confidence.

We've also updated our trust center with detailed information about our security practices and compliance standards. For teams interested in learning more about our security controls or requesting our SOC 2 report, please contact team@agenta.ai.


New Onboarding Flow

We've redesigned our platform's onboarding to make getting started simpler and more intuitive. Key improvements include:

  • Streamlined tracing setup process
  • Added a demo RAG playground project showcasing custom workflows
  • Enhanced frontend performance
  • Fixed scroll behavior in trace view

You can check out the tutorial for the RAG demo project here.


Add Spans to Test Sets


This release introduces the ability to add spans to test sets, making it easier to bootstrap your evaluation data from production. The new feature lets you:

  • Add individual or batch spans to test sets
  • Create custom mappings between spans and test sets
  • Preview test set changes before committing them
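Conceptually, a custom mapping between spans and test sets picks fields out of each span and renames them to test-set columns. A minimal sketch of that idea follows; the span field names and column names are hypothetical, not Agenta's actual schema:

```python
# Hypothetical span records, roughly shaped like trace data from production.
spans = [
    {"attributes": {"inputs": {"question": "What is RAG?"}, "outputs": "RAG is ..."}},
    {"attributes": {"inputs": {"question": "What is OTel?"}, "outputs": "OTel is ..."}},
]

# A custom mapping from dotted span-attribute paths to test-set column names.
mapping = {
    "attributes.inputs.question": "question",
    "attributes.outputs": "correct_answer",
}

def get_path(record: dict, dotted: str):
    """Walk a nested dict by a dotted path, e.g. 'attributes.outputs'."""
    for key in dotted.split("."):
        record = record[key]
    return record

def spans_to_rows(spans: list[dict], mapping: dict) -> list[dict]:
    """Turn each span into a test-set row according to the mapping."""
    return [{col: get_path(s, path) for path, col in mapping.items()} for s in spans]

rows = spans_to_rows(spans, mapping)
print(rows[0])  # {'question': 'What is RAG?', 'correct_answer': 'RAG is ...'}
```

The "preview before committing" step then amounts to showing `rows` to the user before appending them to the stored test set.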

Additional improvements:

  • Fixed CSV test set upload issues
  • Prevented viewing of incomplete evaluations
  • Added mobile compatibility warning
  • Added support for custom ports in self-hosted installations

Viewing Traces in the Playground and Authentication for Deployed Applications

Viewing traces in the playground:

You can now see traces directly in the playground. For simple applications, this means you can view the prompts sent to LLMs. For custom workflows, you get an overview of intermediate steps and outputs. This makes it easier to understand what’s happening under the hood and debug your applications.

Authentication improvements:

We’ve strengthened authentication for deployed applications. As you know, Agenta lets you either fetch the app’s config or call it with Agenta acting as a proxy. Now, we’ve added authentication to the second method. The APIs we create are now protected and can be called using an API key. You can find code snippets for calling the application in the overview page.
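Calling such a protected endpoint amounts to attaching the API key as a request header. Here is a standard-library sketch; the URL and the `ApiKey` header scheme are placeholders, so copy the exact snippet from your app's overview page:

```python
import json
import urllib.request

# Placeholder values -- use the URL and key shown on your app's overview page.
APP_URL = "https://cloud.agenta.ai/api/apps/my-app/generate"
API_KEY = "YOUR_AGENTA_API_KEY"

payload = json.dumps({"inputs": {"question": "What is Agenta?"}}).encode()
request = urllib.request.Request(
    APP_URL,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"ApiKey {API_KEY}",  # header scheme is an assumption
    },
)

# urllib.request.urlopen(request) would send the call; we only build it here.
print(request.get_header("Authorization"))
```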

Documentation improvements:

We’ve added new cookbooks and updated existing documentation.

Bug fixes:

  • Fixed an issue with the observability SDK not being compatible with LiteLLM.
  • Fixed an issue where cost and token usage were not correctly computed for all calls.

Observability and Prompt Management

This release is one of our biggest yet—one changelog hardly does it justice.

First up: Observability

We’ve had observability in beta for a while, but now it’s been completely rewritten, with a brand-new UI and fully open-source code.

The new Observability SDK is compatible with OpenTelemetry (OTel) and the GenAI semantic conventions. This means you get many integrations out of the box, including LangChain, OpenAI, and more.

We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:

  • A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.

  • The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.

  • The SDK is OTel-compatible, and we’ve already tested integrations for OpenAI, LangChain, LiteLLM, and Instructor, with guides available for each. In most cases, adding a few lines of code is enough to see traces directly in Agenta.

Next: Prompt Management

We’ve completely rewritten the prompt management SDK, giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this here.
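To make the versioning model concrete, here is a minimal in-memory sketch of the semantics described above (an append-only commit history plus deployment to named environments). This is illustrative only, not the actual SDK API:

```python
class PromptRegistry:
    """Toy model of versioned prompt configs with environment deployment."""

    def __init__(self):
        self.history = {}       # prompt name -> list of config versions (append-only)
        self.environments = {}  # environment name -> (prompt name, version index)

    def commit(self, name: str, config: dict) -> int:
        """Create or update a prompt; returns the new version number."""
        self.history.setdefault(name, []).append(config)
        return len(self.history[name]) - 1

    def latest(self, name: str) -> dict:
        return self.history[name][-1]

    def deploy(self, name: str, version: int, environment: str) -> None:
        """Pin an environment to a specific version."""
        self.environments[environment] = (name, version)

    def fetch(self, environment: str) -> dict:
        name, version = self.environments[environment]
        return self.history[name][version]

registry = PromptRegistry()
v0 = registry.commit("summarizer", {"prompt": "Summarize {text}", "model": "gpt-4"})
registry.deploy("summarizer", v0, "production")
v1 = registry.commit("summarizer", {"prompt": "Summarize {text} briefly", "model": "gpt-4"})

print(registry.fetch("production")["prompt"])   # still the deployed v0 prompt
print(registry.latest("summarizer")["prompt"])  # the newer, undeployed v1 prompt
```

The key design point the sketch illustrates: committing a new version never changes what an environment serves until you explicitly deploy it.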

And finally: LLM-as-a-Judge Overhaul

We’ve made significant upgrades to the LLM-as-a-Judge evaluator. It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we’re seeing better results with it.
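Concretely, "multiple messages with access to all test-case variables" means the judge prompt is a list of templated messages whose placeholders are filled from the test case. An illustrative sketch (the message contents and variable names here are hypothetical, not Agenta's defaults):

```python
# A judge prompt as a list of messages; {placeholders} come from the test case.
judge_messages = [
    {"role": "system", "content": "You are an impartial evaluator. Score 1-10."},
    {"role": "user", "content": (
        "Question: {question}\n"
        "Reference answer: {correct_answer}\n"
        "Model answer: {prediction}\n"
        "Score the model answer."
    )},
]

test_case = {
    "question": "What is OTel?",
    "correct_answer": "OpenTelemetry, an open observability standard.",
    "prediction": "An open standard for traces, metrics, and logs.",
}

def render(messages: list[dict], variables: dict) -> list[dict]:
    """Substitute test-case variables into every message."""
    return [{**m, "content": m["content"].format(**variables)} for m in messages]

rendered = render(judge_messages, test_case)
print(rendered[1]["content"].splitlines()[0])  # Question: What is OTel?
```

The rendered message list is what gets sent to the chosen judge model (OpenAI or Anthropic, per the release above).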

Configuring the LLM-as-a-Judge evaluator

New Application Management View and Various Improvements

We updated the Application Management view to improve the UI. Many users struggled to find their applications once they had a large number of them, so we've improved the view and added a search bar for quick filtering. We are also moving toward a new project structure: test sets and evaluators now live outside the application scope, so you can reuse the same test sets and evaluators across multiple applications.

Bug Fixes

  • Added an export button to the evaluation view so you can export results directly from the main view.
  • Eliminated Pydantic warnings in the CLI.
  • Improved error messages when fetch_config is called with wrong arguments.
  • Enhanced the custom code evaluation sandbox and removed the limitation that results must be between 0 and 1.