Phase 7 of 12 · Platform Operator

Agent Evals

Phase 7 covers evaluating non-deterministic model and agent behaviour with scenarios, traces, and regression sets. Humans design the judges, golden sets, and regression checks.

Decision rules

Each rule connects a real situation to the skill or playbook that fits it. Linked terms open canonical sources.

Decision rules for Agent Evals

Situation: An agent's behaviour is currently being judged by vibes rather than measured against a rubric.
Missing skill: Scenario and rubric design
Recommended playbook: OpenAI Evals
Alternatives: Promptfoo
Why: OpenAI Evals scales to a regression corpus and judge rubric; Promptfoo is lighter and runs locally, which is enough when you're still defining the scenarios.

Situation: Two prompts or agents are running side by side and no one knows which is actually better.
Missing skill: Experiment analysis
Recommended playbook: pm-data-analytics:ab-test-analysis
Alternatives: Manual eval review
Why: A/B analysis is correct once volume is high enough for significance; rubric-based manual review is the honest answer when it isn't.

Situation: Eval scores look fine in aggregate but specific user segments keep complaining.
Missing skill: Cohort segmentation
Recommended playbook: pm-data-analytics:cohort-analysis
Alternatives: Trace sampling by segment
Why: Cohort analysis segments the eval corpus before you trust the headline number; trace sampling does the same job by hand when you don't yet have the data pipeline.
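The second and third rules can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the playbooks named above: the function names and the `(segment, passed)` input shape are assumptions, and the significance check is a plain two-proportion z-test that is only reasonable once each arm has a decent number of graded runs.

```python
import math
from collections import defaultdict


def pass_rate_by_segment(results):
    """Group (segment, passed) pairs and return {segment: (passes, total, rate)}.

    The input shape is hypothetical; adapt it to however your eval runs are stored.
    """
    counts = defaultdict(lambda: [0, 0])
    for segment, passed in results:
        counts[segment][0] += int(passed)
        counts[segment][1] += 1
    return {seg: (p, n, p / n) for seg, (p, n) in counts.items()}


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference in pass rates; returns (z, p_value)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms passed everything or failed everything: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With only a handful of runs per arm the p-value will stay large, which is the signal to fall back to rubric-based manual review. Running `pass_rate_by_segment` before reading the headline number shows whether one segment is dragging behind the aggregate.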

Reality

Agent behaviour is not fully covered by unit tests. It needs scenario design, trajectory review, judge criteria, trace inspection, and ongoing regression checks.

Required skills

  • Scenario design
  • Golden dataset curation
  • Judge calibration
  • Trace review
  • Adversarial case design
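Judge calibration, for example, usually means checking an automated judge against human labels on the same items. One common agreement measure is Cohen's kappa; the sketch below is a stdlib-only illustration, and the label values and any acceptance threshold are assumptions to set per team.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Recomputing the same statistic on a fixed human-labelled sample after each judge prompt or model change is one simple way to notice judge drift.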

Failure modes

  • Thin eval sets
  • Judge drift
  • Cost blind spots
  • Unmeasured tool failures
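Cost blind spots and unmeasured tool failures both tend to show up once traces are summarised rather than read one at a time. The sketch below assumes a hypothetical trace shape (a dict with `cost_usd` and a `tool_calls` list of `{"name", "ok"}` entries); real tracing tools will use different field names.

```python
from collections import Counter


def summarize_traces(traces):
    """Return mean cost per trace and the failure rate of each tool.

    Assumed (hypothetical) trace shape:
        {"cost_usd": float, "tool_calls": [{"name": str, "ok": bool}, ...]}
    """
    calls, failures = Counter(), Counter()
    total_cost = 0.0
    for trace in traces:
        total_cost += trace["cost_usd"]
        for call in trace["tool_calls"]:
            calls[call["name"]] += 1
            if not call["ok"]:
                failures[call["name"]] += 1
    return {
        "mean_cost_usd": total_cost / len(traces) if traces else 0.0,
        "tool_failure_rate": {name: failures[name] / n for name, n in calls.items()},
    }
```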

Next operating step

Define acceptable agent behaviour with scenario sets, golden examples, judge criteria, trajectory traces, cost gates, and regression checks for drift.
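As a starting point, that step can be expressed as a small regression gate. The following is a sketch under stated assumptions: `GoldenCase`, `AgentResult`, `run_regression_gate`, and the default thresholds are all hypothetical names and values, and `agent` and `judge` stand in for whatever agent runner and judge (exact match, rubric, or model-graded) a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    scenario: str  # input given to the agent
    expected: str  # reference answer the judge compares against


@dataclass(frozen=True)
class AgentResult:
    output: str
    cost_usd: float


def run_regression_gate(
    cases: Sequence[GoldenCase],
    agent: Callable[[str], AgentResult],
    judge: Callable[[GoldenCase, AgentResult], bool],
    baseline_pass_rate: float,
    max_drop: float = 0.05,           # assumed tolerance before flagging a regression
    max_mean_cost_usd: float = 0.10,  # assumed cost gate per scenario
) -> dict:
    """Run every golden case once and compare pass rate and cost to the gates."""
    if not cases:
        raise ValueError("golden set is empty")
    results = [agent(case.scenario) for case in cases]
    passes = sum(bool(judge(case, result)) for case, result in zip(cases, results))
    pass_rate = passes / len(cases)
    mean_cost = sum(result.cost_usd for result in results) / len(cases)
    return {
        "pass_rate": pass_rate,
        "mean_cost_usd": mean_cost,
        "regressed": pass_rate < baseline_pass_rate - max_drop,
        "over_budget": mean_cost > max_mean_cost_usd,
    }
```

Because agent output is non-deterministic, a single run per case is a coarse signal; repeating each scenario several times and gating on the averaged pass rate is a natural extension of the same shape.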
