Phase 10 of 12 · Platform Operator

Agent Operations

Phase 10 covers explaining production agent behaviour and detecting degradation. Humans still interpret the traces, attribute the cost, and write the runbooks.

Explain production agent behaviour and detect degradation over time.

Decision rules

Each rule connects a real situation to the skill or playbook that fits it.

Decision rules for Agent Operations
  • Situation: Deploys pass CI but production behaviour drifts away from what was tested.
  • Missing skill: Post-deploy verification
  • Recommended playbook: verification-before-completion
  • Alternatives: Manual smoke tests
  • Why: Verification-before-completion runs the behavioural checks as part of the deploy pipeline; manual smoke tests work for low-frequency releases but don't catch slow drift.

  • Situation: Latency or cost regressions only surface days after the deploy that caused them.
  • Missing skill: Performance regression detection
  • Recommended playbook: Datadog
  • Alternatives: Grafana
  • Why: Datadog covers tracing, metrics and alerting in one product; Grafana is the right choice when you already run your own metrics stack and want to keep ownership.

  • Situation: Production incidents keep happening and the same class of failure keeps repeating.
  • Missing skill: Reliability review
  • Recommended playbook: systematic-debugging
  • Alternatives: SRE incident review
  • Why: Systematic-debugging turns each incident's root cause into a guardrail in code; an SRE-style review is broader and looks at process as well, which matters in larger orgs.

  • Situation: Aggregate metrics look stable but specific user cohorts are degrading quietly.
  • Missing skill: Cohort segmentation
  • Recommended playbook: pm-data-analytics:cohort-analysis
  • Alternatives: Support quality-tail analytics
  • Why: Cohort analysis catches the degradation in product metrics directly; support quality-tail analytics catches the same problem through complaint patterns, which is later but cheaper to instrument.
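
The last rule, where aggregate metrics look stable while a cohort degrades, can be illustrated with a small sketch. It is not taken from any of the playbooks named above: the cohort names, the run data, and the `max_drop` threshold are all assumptions chosen so that the overall success rate is identical in both periods while one cohort falls sharply.

```python
from collections import defaultdict


def success_rate_by_cohort(runs):
    """Return {cohort: success_rate} for a list of (cohort, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [successes, run count]
    for cohort, succeeded in runs:
        totals[cohort][0] += int(succeeded)
        totals[cohort][1] += 1
    return {cohort: s / n for cohort, (s, n) in totals.items()}


def degraded_cohorts(baseline, current, max_drop=0.05):
    """Cohorts whose success rate fell by more than max_drop versus baseline."""
    base = success_rate_by_cohort(baseline)
    cur = success_rate_by_cohort(current)
    return sorted(c for c in base if c in cur and base[c] - cur[c] > max_drop)


# Hypothetical data: 99 of 110 runs succeed in both periods.
baseline = (
    [("enterprise", True)] * 90 + [("enterprise", False)] * 10
    + [("trial", True)] * 9 + [("trial", False)] * 1
)
current = (
    [("enterprise", True)] * 93 + [("enterprise", False)] * 7
    + [("trial", True)] * 6 + [("trial", False)] * 4
)

print(degraded_cohorts(baseline, current))  # ['trial']
```

The aggregate success rate is unchanged between the two periods, so a dashboard showing only the overall figure would not flag the trial cohort's drop from 90% to 60%.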

Reality

Traditional APM was built for deterministic systems, not for agent trajectories, tool calls, context quality, reasoning paths, or policy checks.

Required skills

  • Trace interpretation
  • Incident review
  • Cost attribution
  • Runbook design
  • Drift detection
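
Of the skills listed, drift detection is the easiest to sketch in code. The example below is one simple approach among many, not a method prescribed by this page: it flags drift when the mean of a recent window sits more than a chosen number of standard errors from a baseline mean. The metric (tool calls per task), the sample values, and the `z_threshold` default are assumptions.

```python
from statistics import mean, stdev


def drifted(baseline, recent, z_threshold=3.0):
    """True when the recent mean is more than z_threshold standard errors
    away from the baseline mean. baseline needs at least two samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: any change in the mean counts as drift.
        return mean(recent) != mu
    standard_error = sigma / len(recent) ** 0.5
    return abs(mean(recent) - mu) / standard_error > z_threshold


# Hypothetical metric: tool calls per completed task.
baseline_calls = [4, 5, 4, 6, 5, 5, 4, 6, 5, 5]

print(drifted(baseline_calls, [5, 4, 5, 5, 6]))  # False
print(drifted(baseline_calls, [7, 8, 7, 9, 8]))  # True
```

A check like this only says that a number moved. Deciding whether the move matters remains a trace interpretation and incident review task.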

Failure modes

  • No explanation for agent decisions
  • Unbounded cost
  • Silent workflow degradation
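
Unbounded cost is the most mechanical of these failure modes to guard against. The sketch below assumes each agent step reports a cost in US dollars after it runs. `CostBudget`, `BudgetExceeded`, and the step names are hypothetical and not part of any specific framework.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run's cumulative cost passes its limit."""


class CostBudget:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.by_step = {}  # step name -> cumulative cost, for attribution

    def charge(self, step, cost_usd):
        # Record the cost first, since the money is already spent,
        # then stop the run if the total is over the limit.
        self.spent_usd += cost_usd
        self.by_step[step] = self.by_step.get(step, 0.0) + cost_usd
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} USD, limit {self.limit_usd:.2f} USD"
            )


budget = CostBudget(limit_usd=1.0)
budget.charge("plan", 0.25)
budget.charge("search", 0.5)
try:
    budget.charge("search", 0.5)  # pushes the run total to 1.25 USD
except BudgetExceeded as exc:
    print(f"run halted: {exc}")
```

Keeping the per-step totals in `by_step` means the same object that enforces the limit also shows which step consumed the budget.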

Next operating step

Instrument production agents so operators can see what the agent saw, decided, called, cost, changed, escalated, and recovered from.
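
One way to make that concrete is a structured record per agent step, with one field for each thing an operator needs to see. This is a minimal sketch under assumptions: the `AgentStepRecord` name, its field types, and the sample values are illustrative, and a real deployment would likely emit these through its existing tracing or logging stack.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentStepRecord:
    run_id: str
    step: int
    saw: list = field(default_factory=list)      # context given to the agent
    decided: str = ""                            # the action it chose
    called: list = field(default_factory=list)   # tool calls it made
    cost_usd: float = 0.0                        # spend attributed to this step
    changed: list = field(default_factory=list)  # side effects it produced
    escalated: bool = False                      # handed off to a human?
    recovered_from: Optional[str] = None         # error it recovered from, if any


def to_json_line(record):
    """Serialise a record as one JSON line, suitable for a log stream."""
    return json.dumps(asdict(record), sort_keys=True)


record = AgentStepRecord(
    run_id="run-001",
    step=3,
    saw=["ticket:4821", "kb:refund-policy"],
    decided="issue_refund",
    called=[{"tool": "payments.refund", "args": {"amount": 20}}],
    cost_usd=0.5,
    changed=["refund:created"],
    recovered_from="payments.refund timeout on first attempt",
)
print(to_json_line(record))
```

Writing one line per step keeps the records easy to filter by `run_id`, which is what an operator needs when reconstructing a single run during an incident review.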
