Phase 7 of 12 · Platform Operator
Agent Evals
Phase 7 covers evaluating non-deterministic agent behaviour with scenarios and traces. Humans design the judges, golden sets, and regression checks.
Evaluate non-deterministic model and agent behaviour with scenarios, traces, and regression sets.
Decision rules
Each rule connects a real situation to the skill or playbook that fits it. Linked terms open canonical sources.
| Situation | Missing skill | Recommended playbook | Alternatives | Why |
|---|---|---|---|---|
| An agent's behaviour is currently being judged by vibes rather than measured against a rubric. | Scenario and rubric design | OpenAI Evals | Promptfoo | OpenAI Evals scales to a regression corpus and judge rubric; Promptfoo is lighter and runs locally, which is enough when you're still defining the scenarios. |
| Two prompts or agents are running side by side and no one knows which is actually better. | Experiment analysis | pm-data-analytics:ab-test-analysis | Manual eval review | A/B analysis is correct once volume is high enough for significance; rubric-based manual review is the honest answer when it isn't. |
| Eval scores look fine in aggregate but specific user segments keep complaining. | Cohort segmentation | pm-data-analytics:cohort-analysis | Trace sampling by segment | Cohort analysis segments the eval corpus before you trust the headline number; trace sampling does the same job by hand when you don't yet have the data pipeline. |
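
The second and third rows lend themselves to a small worked sketch. The block below is illustrative only: it assumes each eval result reduces to a pass/fail verdict, and the function names (`two_proportion_z_test`, `pass_rate_by_segment`) are hypothetical, not part of any playbook named in the table. It uses a standard pooled two-proportion z-test to check whether a difference in pass rates is likely to be real, and a per-segment breakdown to show how a healthy headline number can hide a failing cohort.

```python
import math
from collections import defaultdict


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Return (z, two_sided_p) for the pass-rate difference B minus A.

    Uses the pooled two-proportion z-test under a normal approximation,
    so it is only a rough guide when sample counts are small.
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants passed (or failed) everything: no evidence of a gap.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


def pass_rate_by_segment(results):
    """Map each segment to its pass rate.

    `results` is an iterable of (segment, passed) pairs; the segment labels
    are whatever the team already uses (an assumption of this sketch).
    """
    totals = defaultdict(lambda: [0, 0])  # segment -> [passes, total]
    for segment, passed in results:
        totals[segment][0] += int(passed)
        totals[segment][1] += 1
    return {seg: passes / total for seg, (passes, total) in totals.items()}


# Same 40% vs 60% pass rates, very different evidence at n=10 and n=100.
_, p_small = two_proportion_z_test(4, 10, 6, 10)
_, p_large = two_proportion_z_test(40, 100, 60, 100)

# 90% overall, but every failure lands in one segment.
sample = [("enterprise", True)] * 18 + [("free", False)] * 2
by_segment = pass_rate_by_segment(sample)
```

With ten scenarios per variant the p-value stays well above 0.05, which is the situation where rubric-based manual review is the more honest choice. At a hundred per variant the same gap becomes significant.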
Reality
Agent behaviour is not fully covered by unit tests. It needs scenario design, trajectory review, judge criteria, trace inspection, and ongoing regression checks.
Required skills
- Scenario design
- Golden dataset curation
- Judge calibration
- Trace review
- Adversarial case design
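
Judge calibration, in particular, can be made concrete. One common approach (an assumption here, not something this page prescribes) is to have humans label a small golden set, run the automated judge over the same items, and measure chance-corrected agreement with Cohen's kappa. The sketch below assumes categorical verdicts such as "pass" and "fail". The function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between human and judge verdicts.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    labels = set(human_counts) | set(judge_counts)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
aligned_judge = ["pass", "pass", "fail", "fail"]
noisy_judge = ["pass", "fail", "pass", "fail"]

kappa_aligned = cohens_kappa(human, aligned_judge)
kappa_noisy = cohens_kappa(human, noisy_judge)
```

Re-running the same check on a frozen golden set after each judge prompt or model change is one way to notice judge drift before it contaminates the headline scores.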
Viable tools
Failure modes
- Thin eval sets
- Judge drift
- Cost blind spots
- Unmeasured tool failures
Next operating step
Define acceptable agent behaviour with scenario sets, golden examples, judge criteria, trajectory traces, cost gates, and regression checks for drift.
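
As a rough illustration of how those pieces can combine into a single gate, the sketch below compares a candidate run against a stored baseline. Everything in it is an assumption made for the example: the `RunSummary` fields, the `regression_gate` function, and the default thresholds are hypothetical and would need tuning to a team's own scenario set and budget.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    pass_rate: float        # fraction of scenarios the judge marked acceptable
    mean_cost_usd: float    # average spend per scenario
    tool_error_rate: float  # fraction of tool calls that errored in the traces


def regression_gate(
    baseline,
    current,
    max_pass_drop=0.02,
    max_cost_increase=0.10,
    max_tool_error_rate=0.05,
):
    """Return a list of human-readable gate failures (empty means the run passes)."""
    failures = []
    if baseline.pass_rate - current.pass_rate > max_pass_drop:
        failures.append(
            f"pass rate fell from {baseline.pass_rate:.2f} to {current.pass_rate:.2f}"
        )
    if current.mean_cost_usd > baseline.mean_cost_usd * (1 + max_cost_increase):
        failures.append(
            f"mean cost rose from ${baseline.mean_cost_usd:.3f} to ${current.mean_cost_usd:.3f}"
        )
    if current.tool_error_rate > max_tool_error_rate:
        failures.append(f"tool error rate is {current.tool_error_rate:.2f}")
    return failures


baseline = RunSummary(pass_rate=0.90, mean_cost_usd=0.040, tool_error_rate=0.01)
healthy = RunSummary(pass_rate=0.91, mean_cost_usd=0.041, tool_error_rate=0.02)
regressed = RunSummary(pass_rate=0.85, mean_cost_usd=0.060, tool_error_rate=0.08)

healthy_failures = regression_gate(baseline, healthy)
regressed_failures = regression_gate(baseline, regressed)
```

Returning reasons rather than a bare boolean keeps the gate reviewable: a failed run says which of quality, cost, or tool reliability moved.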
Working through Agent Evals?
I advise teams on this part of the lifecycle. Get in touch if you want a direct, vendor-free conversation about what's worth doing next.