Assessing the Orchestrator

In my last entry, I noted that the general framework of this project is pretty much complete, and that’s true! That’s especially so for the agent components, where I’ve built a collection of task-specific agents, plus an Orchestrator Agent that delegates each task to the appropriate one. However, there’s a lot further this project could go. I could create dozens of agents for different tasks - ordering things online, scheduling appointments, integrating with home appliances, tracking workouts - the options are limitless! But as Anthropic noted in this recent article, as agents evolve, be it by plugging in a new LLM, tweaking the harness, or expanding their capabilities, they can display unexpected and difficult-to-debug regressions. In this article, I’ll walk through my first pass at building an evaluation framework to validate the current state of my Orchestrator Agent, and to enable performance tracking as it evolves over time.

Tooling

For my evaluation framework I decided to use Promptfoo. It’s open-source, lightweight, and super easy to set up - perfect for a project like this!

At a high level, Promptfoo works by taking an LLM or an agent (the “provider”) and a set of test cases described in a .yaml file, running every test against the provider, outputting the results, and (optionally) evaluating those results against pre-defined criteria that are also described in the .yaml file.

Setup for Orchestrator Evaluation

In this case, we’re interested in seeing how well our Orchestrator Agent routes queries to task-specific agents, of which there are currently three (a Gmail Agent, a Ski Report Agent, and a Notes Agent). The overall flow of the evaluations will look something like this:

I’ll break down each of these components below, but the gist of it is that I have some pre-defined evaluations (queries and the correct agents that they should be routed to), Promptfoo runs those evaluations through an Intent Router (the brains of the Orchestrator Agent), and evaluates the outputs against the correct answers, finally outputting a summary report. Let’s take a closer look.
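To make that flow concrete, here is a toy Python stand-in for the loop: test cases go in, a router produces a JSON decision, each decision is checked against the expected agent, and a summary comes out. This is only a sketch of the idea, not how Promptfoo is implemented. The names `TEST_CASES`, `toy_router`, and `run_suite` are hypothetical, the keyword matching is a placeholder for the real Intent Router, and the `ski` and `notes` agent names are assumptions (only `gmail` appears in my actual tests below).

```python
import json

# Hypothetical test cases: a query plus the agent it should be routed to.
TEST_CASES = [
    {"query": "Check my inbox", "expected": "gmail"},
    {"query": "How much snow fell overnight?", "expected": "ski"},
    {"query": "Add milk to my shopping note", "expected": "notes"},
]


def toy_router(query: str) -> str:
    """Placeholder for the Intent Router: returns a JSON routing decision."""
    text = query.lower()
    if "inbox" in text or "email" in text:
        agent = "gmail"
    elif "snow" in text or "ski" in text:
        agent = "ski"
    else:
        agent = "notes"
    return json.dumps({"agent_name": agent})


def run_suite(cases):
    """Run every case through the router and compare against the expected agent."""
    results = []
    for case in cases:
        output = toy_router(case["query"])
        passed = json.loads(output)["agent_name"] == case["expected"]
        results.append({"query": case["query"], "passed": passed})
    return results


if __name__ == "__main__":
    results = run_suite(TEST_CASES)
    passed = sum(r["passed"] for r in results)
    print(f"{passed}/{len(results)} passed")
```

In the real setup, Promptfoo owns the loop and the reporting, so the only pieces I have to supply are the test cases and the router call.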

Test Input

For this assessment, I actually set up a few different evaluation suites - one for straightforward cases (routing.yaml), one for follow-up queries (follow.yaml), and one for edge cases (edge.yaml). Each test in these suites includes a query and an “assertion” stating what the correct output should be. For example, one of my basic tests is as follows:

- description: "Gmail: check inbox"
  vars:
    query: "Check my inbox"
  assert:
    - type: javascript
      value: "JSON.parse(output).agent_name === 'gmail'"

In this case, the query that’s being passed to my Intent Router is “Check my inbox”. The Intent Router should route that query to the Gmail Agent, so when the result comes back, the assertion runs a quick JavaScript check to confirm that the output names ‘gmail’ as the agent to route to. If the expression evaluates to true, the test passes.

Runner

Promptfoo runs locally via npx. It accepts one of the .yaml test inputs and passes each test to the “provider”, which in this case is our Intent Router, representing the Orchestrator Agent.

Provider

I mentioned earlier that Promptfoo can take an LLM as its provider, and it’s commonly used to assess prompts against out-of-the-box models (like Claude Opus 4.5 or Gemini 3 Pro). However, in our case, we’re not interested in testing an LLM directly - we want to test our own agent. To do that, Promptfoo supports custom Python providers: you point it at a Python file (mine is ‘provider.py’) containing a function called ‘call_api()’. Here, the ‘call_api()’ function calls our Intent Router to assess the query that Promptfoo passed in. I also configured some “dummy agents” in ‘shared.py’ to represent the task-specific agents. This way we can quickly test the orchestration logic without actually having to wait for each agent to do its thing.
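As a rough illustration, a provider file along these lines would satisfy Promptfoo’s Python provider contract, where ‘call_api()’ receives the prompt, an options dict, and a context dict, and returns a dict with an `output` (or `error`) key. This is a minimal sketch rather than my actual ‘provider.py’: `DUMMY_AGENTS` and `route_intent` are hypothetical stand-ins for what lives in ‘shared.py’ and the real Intent Router, and the keyword matching is only there so the example runs on its own.

```python
import json

# Hypothetical dummy agents standing in for the real task-specific agents.
DUMMY_AGENTS = {
    "gmail": "Reads and sends email",
    "ski": "Reports ski conditions",
    "notes": "Creates and searches notes",
}


def route_intent(query: str) -> dict:
    """Hypothetical placeholder for the real Intent Router call."""
    text = query.lower()
    if "inbox" in text or "email" in text:
        name = "gmail"
    elif "snow" in text or "ski" in text:
        name = "ski"
    else:
        name = "notes"
    if name not in DUMMY_AGENTS:
        raise ValueError(f"Unknown agent: {name}")
    return {"agent_name": name}


def call_api(prompt, options, context):
    """Entry point that Promptfoo calls once per test case."""
    # Prefer the `query` test variable; fall back to the rendered prompt.
    query = (context or {}).get("vars", {}).get("query", prompt)
    try:
        decision = route_intent(query)
        # Return the routing decision as a JSON string for the assertion to parse.
        return {"output": json.dumps(decision)}
    except Exception as exc:
        # Surface failures to Promptfoo instead of crashing the run.
        return {"error": str(exc)}
```

Because the dummy agents never do any real work, each call returns as soon as the routing decision is made, which keeps a full suite run fast.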

Router

For this specific assessment, the thing we’re checking is our Orchestrator Agent’s ability to determine which agent to delegate a task to, which is primarily managed by our Intent Router. To understand more about this, check out my last article on Orchestrating Agents.

Output

Once the ‘call_api()’ function has gotten a routing decision from the Intent Router, it bundles that output in a JSON blob and passes it back to Promptfoo.

Assert

As I mentioned in the Test Input section above, every test case includes an assertion, which describes the pass criteria as a JavaScript expression that returns true or false to indicate a pass or a fail. When Promptfoo receives the output from ‘call_api()’, it checks that output against the assertion to determine whether that test passed.
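For anyone more comfortable reading Python than JavaScript, the logic of my routing assertion amounts to something like the function below. It is an illustrative equivalent only, not something Promptfoo runs in my setup: `routed_to` is a hypothetical helper, and treating malformed JSON as a plain failure is my own choice for this sketch.

```python
import json


def routed_to(output: str, expected_agent: str) -> bool:
    """Return True when the router's JSON output names the expected agent."""
    try:
        decision = json.loads(output)
    except json.JSONDecodeError:
        return False  # Malformed output counts as a failed test here.
    # Guard against valid JSON that is not an object (e.g. a list or string).
    return isinstance(decision, dict) and decision.get("agent_name") == expected_agent
```

Calling `routed_to('{"agent_name": "gmail"}', "gmail")` mirrors the “Gmail: check inbox” test shown earlier.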

Report

Once Promptfoo has iterated through all the tests described in the .yaml file, it outputs a nice report that can be viewed in your browser.

End Result

65 tests later - we got a smashing success - a 100% pass rate!

Actually though, this is good news, but it only takes us so far! Generally, evals are helpful not just for tracking regressions in the future (where you want to retain a 100% pass rate), but also to see how close you are to your ideal state. In that latter case, you might want to come up with some evaluations that your agent can’t pass yet, but eventually will with future improvements. In my case, since I’m not entirely sure which agents I’ll add and how I might want to re-work my Orchestrator Agent, I’m not sure what those “future-state” tests should be, so for now these evaluations will mostly be used for regression testing.

It’s also worth noting that all of the assertions here were code-based (i.e., did the JavaScript check return true or false). Agent evals can also use LLMs (or even humans) to handle fuzzier grading, so at some point it might be worth trying to build that out for some of the more complicated task-specific agents.

In any case, this is a solid foundation that I’ll continue to build on for future agents.
