Phase 10 of 12 · Platform Operator
Agent Operations
Phase 10 covers explaining how agents behave in production and detecting when they degrade. Humans still interpret the traces, attribute the cost, and write the runbooks.
Explain production agent behaviour and detect degradation over time.
Decision rules
Each rule maps a real situation to the skill or playbook that fits it.
| Situation | Missing skill | Recommended playbook | Alternatives | Why |
|---|---|---|---|---|
| Deploys pass CI but production behaviour drifts away from what was tested. | Post-deploy verification | verification-before-completion | Manual smoke tests | Verification-before-completion runs the behavioural checks as part of the deploy pipeline; manual smoke tests work for low-frequency releases but don't catch slow drift. |
| Latency or cost regressions only surface days after the deploy that caused them. | Performance regression detection | Datadog | Grafana | Datadog covers tracing, metrics and alerting in one product; Grafana is the right choice when you already run your own metrics stack and want to keep ownership. |
| Production incidents keep happening and the same class of failure keeps repeating. | Reliability review | systematic-debugging | SRE incident review | Systematic-debugging turns each incident's root cause into a guardrail in code; an SRE-style review is broader and looks at process as well, which matters in larger orgs. |
| Aggregate metrics look stable but specific user cohorts are degrading quietly. | Cohort segmentation | pm-data-analytics:cohort-analysis | Support quality-tail analytics | Cohort analysis catches the degradation in product metrics directly; support quality-tail analytics catches the same problem through complaint patterns, which is later but cheaper to instrument. |
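The cohort segmentation rule in the last row can be illustrated with a small sketch. It assumes task outcomes are available as `(cohort, succeeded)` pairs; the cohort names, the counts, and the 5-point drop threshold are all hypothetical and chosen only so the aggregate stays flat while one cohort degrades.

```python
from collections import defaultdict

def success_rate_by_cohort(records):
    """Map each cohort to its success rate, given (cohort, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for cohort, succeeded in records:
        totals[cohort] += 1
        successes[cohort] += int(succeeded)
    return {cohort: successes[cohort] / totals[cohort] for cohort in totals}

def overall_success_rate(records):
    """Aggregate success rate across all cohorts."""
    return sum(int(succeeded) for _, succeeded in records) / len(records)

def degraded_cohorts(before, after, max_drop=0.05):
    """Cohorts whose success rate fell by more than max_drop between periods."""
    rates_before = success_rate_by_cohort(before)
    rates_after = success_rate_by_cohort(after)
    return sorted(
        cohort
        for cohort, rate in rates_before.items()
        if cohort in rates_after and rate - rates_after[cohort] > max_drop
    )

def make_records(cohort, successes, failures):
    """Build illustrative outcome records for one cohort."""
    return [(cohort, True)] * successes + [(cohort, False)] * failures

# Hypothetical data: the large cohort improves slightly, the small one degrades.
week1 = make_records("enterprise", 810, 90) + make_records("new_locale", 90, 10)
week2 = make_records("enterprise", 837, 63) + make_records("new_locale", 63, 37)

print(overall_success_rate(week1), overall_success_rate(week2))
print(degraded_cohorts(week1, week2))
```

In this made-up data the overall rate is the same in both weeks, so an aggregate dashboard would show nothing, while the per-cohort comparison flags the smaller cohort.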
Watch
Reality
Traditional APM was built for deterministic systems. It was not designed to capture agent trajectories, tool calls, context quality, reasoning paths, or policy checks.
Required skills
- Trace interpretation
- Incident review
- Cost attribution
- Runbook design
- Drift detection
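As one possible starting point for drift detection, the sketch below compares a recent window of a per-task metric against a baseline window. The metric (tool calls per task), the sample values, and the threshold of 2.0 are assumptions for illustration. It is a crude shift check, not a statistical test, and a production setup would likely need something more robust.

```python
from statistics import mean, stdev

def drift_score(baseline, recent):
    """How many baseline standard deviations the recent mean has moved."""
    spread = stdev(baseline)
    shift = abs(mean(recent) - mean(baseline))
    if spread == 0:
        # A constant baseline: any movement at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread

def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the recent mean moved more than `threshold` deviations."""
    return drift_score(baseline, recent) > threshold

# Hypothetical tool-calls-per-task samples.
baseline_calls = [4, 5, 4, 6, 5, 4, 5, 5]
stable_calls = [5, 4, 5, 5]
drifting_calls = [8, 9, 7, 8]

print(has_drifted(baseline_calls, stable_calls))
print(has_drifted(baseline_calls, drifting_calls))
```

The same comparison can be applied to latency, cost per task, or escalation rate, as long as the baseline window is refreshed deliberately rather than allowed to absorb the drift.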
Viable tools
Failure modes
- No explanation for agent decisions
- Unbounded cost
- Silent workflow degradation
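The unbounded cost failure mode can be guarded against with a per-run budget that also records where the money went. The following is a minimal sketch under stated assumptions: `CostBudget`, `BudgetExceeded`, the step names, and the dollar amounts are hypothetical, and the cost of each step is assumed to be known before it is charged.

```python
from collections import defaultdict

class BudgetExceeded(RuntimeError):
    """Raised when a run would spend more than its allowed budget."""

class CostBudget:
    """Tracks spend for one agent run and refuses charges past the limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.by_step = defaultdict(float)  # cost attribution per step name

    def charge(self, step_name, cost_usd):
        projected = self.spent_usd + cost_usd
        if projected > self.limit_usd:
            raise BudgetExceeded(
                f"step {step_name!r} would bring spend to {projected:.2f} USD "
                f"(limit {self.limit_usd:.2f} USD)"
            )
        self.spent_usd = projected
        self.by_step[step_name] += cost_usd

# Hypothetical run: the third step would exceed the limit and is refused.
budget = CostBudget(limit_usd=0.10)
budget.charge("plan", 0.04)
budget.charge("search", 0.04)
try:
    budget.charge("summarise", 0.04)
    stopped = False
except BudgetExceeded:
    stopped = True

print(stopped, dict(budget.by_step))
```

Checking the limit before the charge is recorded means a refused step leaves the totals untouched, so the per-step breakdown stays usable for cost attribution after the run is stopped.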
Next operating step
Instrument production agents so operators can see what the agent saw, decided, called, cost, changed, escalated, and recovered from.
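One way to make that instrumentation concrete is a structured record per agent step, with one field for each thing the operator needs to see. The sketch below is illustrative only: `AgentStepRecord`, its field names, and the sample values are assumptions rather than an established schema, and it writes plain JSON lines instead of targeting any particular tracing product.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class AgentStepRecord:
    """One step of an agent run, shaped around what an operator needs to see."""
    run_id: str
    step: int
    saw: dict                                     # context the agent was given
    decided: str                                  # chosen action and short rationale
    called: list = field(default_factory=list)    # tool calls made in this step
    cost_usd: float = 0.0                         # spend attributed to this step
    changed: list = field(default_factory=list)   # external state the step modified
    escalated: bool = False                       # whether a human was brought in
    recovered_from: Optional[str] = None          # error the step recovered from, if any

def to_log_line(record):
    """Serialise a record as one JSON line for a log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)

# Hypothetical step from a refund-handling agent.
record = AgentStepRecord(
    run_id="run-001",
    step=3,
    saw={"ticket_id": "T-42", "retrieved_docs": 2},
    decided="issue refund: order is within the return window",
    called=[{"tool": "refund_api", "status": "ok"}],
    cost_usd=0.012,
    changed=["order T-42 marked refunded"],
    recovered_from="refund_api timeout on first attempt",
)

print(to_log_line(record))
```

Keeping every field on the same record means a single query by `run_id` can reconstruct what happened in a run without joining across separate logs.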