Phase 7 of 12 · Platform Operator

Agent Evals

Phase 7 covers evaluating non-deterministic agent behaviour with scenarios and traces. Humans design the judges, golden sets, and regression checks that the evaluation relies on.

Evaluate non-deterministic model and agent behaviour with scenarios, traces, and regression sets.

Decision rules

Each rule connects a real situation to the skill or playbook that fits it.

Decision rules for Agent Evals
Situation	Missing skill	Recommended playbook	Alternatives	Why
An agent's behaviour is currently being judged by vibes rather than measured against a rubric.	Scenario and rubric design	OpenAI Evals	Promptfoo	OpenAI Evals scales to a regression corpus and judge rubric; Promptfoo is lighter and runs locally, which is enough when you're still defining the scenarios.
Two prompts or agents are running side by side and no one knows which is actually better.	Experiment analysis	pm-data-analytics:ab-test-analysis	Manual eval review	A/B analysis is correct once volume is high enough for significance; rubric-based manual review is the honest answer when it isn't.
Eval scores look fine in aggregate but specific user segments keep complaining.	Cohort segmentation	pm-data-analytics:cohort-analysis	Trace sampling by segment	Cohort analysis segments the eval corpus before you trust the headline number; trace sampling does the same job by hand when you don't yet have the data pipeline.
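
The third rule above, checking segments before trusting the headline number, can be sketched in a few lines. This is a minimal illustration, not the method of any listed playbook: the segment names, the `results` rows, and the `pass_rate_by_segment` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical eval results: (user segment, did the run pass the rubric?)
results = [
    ("enterprise", True), ("enterprise", True), ("enterprise", True),
    ("enterprise", True), ("free_tier", True), ("free_tier", False),
    ("free_tier", False), ("enterprise", True),
]

def pass_rate_by_segment(rows):
    """Return {segment: pass rate}, plus the overall rate under the key 'ALL'."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for segment, passed in rows:
        # Count every row once for its own segment and once for the aggregate.
        for key in (segment, "ALL"):
            totals[key] += 1
            passes[key] += int(passed)
    return {key: passes[key] / totals[key] for key in totals}

rates = pass_rate_by_segment(results)
for key in sorted(rates):
    print(f"{key}: {rates[key]:.2f}")
```

In this made-up data the aggregate pass rate is 0.75 while the `free_tier` segment sits near 0.33, which is the pattern the rule warns about.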

Watch

AI Evaluations Clearly Explained in 50 Minutes (Real Example)

Hamel Husain · 2025-09-28 · 50 min

Reality

Agent behaviour is not fully covered by unit tests. It needs scenario design, trajectory review, judge criteria, trace inspection, and ongoing regression checks.
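
As a rough sketch of what scenario design and trajectory review can look like beyond a unit test, the example below checks a recorded sequence of tool calls against a scenario's rules. The `Scenario` fields, the tool names, and `check_trajectory` are assumptions made for illustration; real traces usually carry far more detail.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    required_tools: list            # must appear in this order (gaps allowed)
    forbidden_tools: list = field(default_factory=list)
    max_steps: int = 10

def check_trajectory(scenario, called_tools):
    """Return a list of rubric violations; an empty list means the run passes."""
    violations = []
    remaining = iter(called_tools)
    for tool in scenario.required_tools:
        # `in` on an iterator consumes it up to the match, which enforces order.
        if tool not in remaining:
            violations.append(f"missing or out-of-order tool: {tool}")
            break
    for tool in called_tools:
        if tool in scenario.forbidden_tools:
            violations.append(f"forbidden tool called: {tool}")
    if len(called_tools) > scenario.max_steps:
        violations.append(
            f"too many steps: {len(called_tools)} > {scenario.max_steps}"
        )
    return violations

# Hypothetical scenario and two recorded trajectories.
refund = Scenario(
    "refund_request", ["lookup_order", "issue_refund"], ["delete_account"], max_steps=4
)
good = ["lookup_order", "check_policy", "issue_refund"]
bad = ["issue_refund", "lookup_order"]

print(check_trajectory(refund, good))  # []
print(check_trajectory(refund, bad))   # ['missing or out-of-order tool: issue_refund']
```

Deterministic checks like these only cover the parts of behaviour that can be stated as rules. Judge criteria and human trace review are still needed for the rest.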

Required skills

Scenario design
Golden dataset curation
Judge calibration
Trace review
Adversarial case design
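
One common way to approach judge calibration is to compare an automated judge's labels with human labels on the golden dataset and report chance-corrected agreement. The sketch below uses Cohen's kappa as one possible agreement measure. The labels are invented for illustration, and the page itself does not prescribe this statistic.

```python
def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labelled at random with their own base rates.
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical golden-set labels from a human reviewer and an LLM judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")  # kappa = 0.467
```

Raw agreement here is 0.75, but kappa is lower because two raters who both say "pass" most of the time will agree often by chance. Recomputing the figure on a fixed golden set after each judge prompt or model change is one way to notice judge drift.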

Failure modes

Thin eval sets
Judge drift
Cost blind spots
Unmeasured tool failures
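
Two of the failure modes above, cost blind spots and unmeasured tool failures, can be surfaced by summarising each trace. The sketch below assumes a hypothetical trace format (a list of event dictionaries) and placeholder token prices. Neither comes from a real tracing tool or price list.

```python
# Placeholder prices in USD per 1,000 tokens (assumed values, not real pricing).
PRICE_PER_1K_INPUT = 0.001
PRICE_PER_1K_OUTPUT = 0.002

# Hypothetical trace: one agent run recorded as a list of events.
trace = [
    {"type": "llm", "input_tokens": 1200, "output_tokens": 300},
    {"type": "tool", "name": "search", "error": None},
    {"type": "tool", "name": "search", "error": "timeout"},
    {"type": "llm", "input_tokens": 1800, "output_tokens": 200},
    {"type": "tool", "name": "fetch_page", "error": None},
]

def summarize_trace(events):
    """Count tool calls and failures, and estimate model cost for one trace."""
    tool_calls = [e for e in events if e["type"] == "tool"]
    failures = [e for e in tool_calls if e["error"] is not None]
    cost = sum(
        e["input_tokens"] / 1000 * PRICE_PER_1K_INPUT
        + e["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT
        for e in events
        if e["type"] == "llm"
    )
    return {
        "tool_calls": len(tool_calls),
        "tool_failures": len(failures),
        "tool_failure_rate": len(failures) / len(tool_calls) if tool_calls else 0.0,
        "cost_usd": cost,
    }

summary = summarize_trace(trace)
print(summary)
```

A run can still produce an acceptable final answer after a tool call times out, so a score based only on the final output would not show the failure counted here.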

Next operating step

Define acceptable agent behaviour with scenario sets, golden examples, judge criteria, trajectory traces, cost gates, and regression checks for drift.
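
A regression check with a cost gate could be as simple as the sketch below, which compares a candidate run against a stored baseline over the same scenario set. The function name, the thresholds (`max_drop`, `max_cost_usd`), and the scenario names are illustrative assumptions to be replaced with values a team has agreed on.

```python
def regression_gate(baseline, candidate, mean_cost_usd, max_drop=0.02, max_cost_usd=0.05):
    """Compare two runs given as {scenario_name: passed}.

    Returns (ok, reasons), where reasons lists every gate that failed.
    """
    reasons = []
    base_rate = sum(baseline.values()) / len(baseline)
    cand_rate = sum(candidate.values()) / len(candidate)
    # Gate 1: the aggregate pass rate must not drop by more than max_drop.
    if base_rate - cand_rate > max_drop:
        reasons.append(f"pass rate dropped from {base_rate:.2f} to {cand_rate:.2f}")
    # Gate 2: no scenario that passed before may fail now.
    regressed = sorted(
        name for name, ok in baseline.items() if ok and not candidate.get(name, False)
    )
    if regressed:
        reasons.append("regressed scenarios: " + ", ".join(regressed))
    # Gate 3: mean cost per run must stay under the agreed ceiling.
    if mean_cost_usd > max_cost_usd:
        reasons.append(f"mean cost {mean_cost_usd:.3f} USD exceeds {max_cost_usd:.3f} USD")
    return (not reasons, reasons)

# Hypothetical runs: same aggregate pass rate, but one scenario flipped to failing.
baseline = {"refund": True, "cancel": True, "upgrade": False, "export": True}
candidate = {"refund": True, "cancel": False, "upgrade": True, "export": True}

ok, reasons = regression_gate(baseline, candidate, mean_cost_usd=0.03)
print(ok, reasons)  # False ['regressed scenarios: cancel']
```

Both runs pass 3 of 4 scenarios, so an aggregate-only gate would let the candidate through. The per-scenario comparison is what catches the `cancel` regression.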
