Phase 10 of 12 · Platform Operator

Agent Operations

Phase 10 covers explaining production agent behaviour and detecting degradation. Humans interpret traces, attribute cost, and write the runbooks.

Explain production agent behaviour and detect degradation over time.

Decision rules

Each rule connects a real situation to the skill or playbook that fits it.

Decision rules for Agent Operations
Situation	Missing skill	Recommended playbook	Alternatives	Why
Deploys pass CI but production behaviour drifts away from what was tested.	Post-deploy verification	verification-before-completion	Manual smoke tests	Verification-before-completion runs the behavioural checks as part of the deploy pipeline; manual smoke tests work for low-frequency releases but don't catch slow drift.
Latency or cost regressions only surface days after the deploy that caused them.	Performance regression detection	Datadog	Grafana	Datadog covers tracing, metrics and alerting in one product; Grafana is the right choice when you already run your own metrics stack and want to keep ownership.
Production incidents keep happening and the same class of failure keeps repeating.	Reliability review	systematic-debugging	SRE incident review	Systematic-debugging turns each incident's root cause into a guardrail in code; an SRE-style review is broader and looks at process as well, which matters in larger orgs.
Aggregate metrics look stable but specific user cohorts are degrading quietly.	Cohort segmentation	pm-data-analytics:cohort-analysis	Support quality-tail analytics	Cohort analysis catches the degradation in product metrics directly; support quality-tail analytics catches the same problem through complaint patterns, which is later but cheaper to instrument.
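
The last rule in the table (stable aggregates hiding a degrading cohort) can be illustrated with a small sketch. This is not how pm-data-analytics:cohort-analysis works internally; the function names, the (cohort, succeeded) record shape, the 10-point drop threshold, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def success_rate_by_cohort(runs):
    """Return {cohort: success_rate} from an iterable of (cohort, succeeded) pairs."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for cohort, succeeded in runs:
        totals[cohort] += 1
        wins[cohort] += int(succeeded)
    return {cohort: wins[cohort] / totals[cohort] for cohort in totals}


def overall_success_rate(runs):
    """Success rate across all runs, ignoring cohort."""
    runs = list(runs)
    return sum(int(ok) for _, ok in runs) / len(runs)


def degraded_cohorts(before, after, max_drop=0.10):
    """Cohorts whose success rate fell by more than max_drop between two windows."""
    base = success_rate_by_cohort(before)
    curr = success_rate_by_cohort(after)
    return sorted(c for c in base if c in curr and base[c] - curr[c] > max_drop)


# Hypothetical agent runs: a large cohort improves slightly while a small one degrades.
before = ([("enterprise", True)] * 90 + [("enterprise", False)] * 10
          + [("free", True)] * 8 + [("free", False)] * 2)
after = ([("enterprise", True)] * 92 + [("enterprise", False)] * 8
         + [("free", True)] * 5 + [("free", False)] * 5)

if __name__ == "__main__":
    print("overall before:", round(overall_success_rate(before), 3))
    print("overall after: ", round(overall_success_rate(after), 3))
    print("degraded:", degraded_cohorts(before, after))
```

In this made-up data the overall success rate moves by about one point, while the smaller cohort drops from 80% to 50%. The per-cohort comparison is what surfaces it.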

Watch

The Agent Development Lifecycle: Build, Test, Deploy, Monitor

Harrison Chase & Ankush Gola · LangChain Interrupt 2026 keynote · 2026-05-14 · 20k views

Observability: the present and future

Charity Majors · Honeycomb · Pragmatic Engineer · 2025-01 · 81k views

Reality

Traditional APM was built for deterministic systems. It was not designed to capture agent trajectories, tool calls, context quality, reasoning paths, or policy checks.

Required skills

Trace interpretation
Incident review
Cost attribution
Runbook design
Drift detection

Viable tools

Failure modes

No explanation for agent decisions
Unbounded cost
Silent workflow degradation
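
One way to guard against the unbounded-cost failure mode is a per-run budget that also records where the money went. The sketch below is a minimal illustration under stated assumptions: the CostBudget class, the BudgetExceeded exception, the step names, and the dollar figures are hypothetical, and in practice the cost values would come from your model provider's usage data.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a run past its cost limit."""


class CostBudget:
    """Tracks spend for a single agent run and attributes it to named steps."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.by_step = {}  # step name -> accumulated cost in USD

    def charge(self, step, cost_usd):
        """Record a cost against a step, refusing it if the limit would be exceeded."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"step {step!r} would exceed the {self.limit_usd:.2f} USD limit"
            )
        self.spent_usd += cost_usd
        self.by_step[step] = self.by_step.get(step, 0.0) + cost_usd


if __name__ == "__main__":
    budget = CostBudget(limit_usd=1.00)
    budget.charge("plan", 0.40)
    budget.charge("search", 0.50)
    try:
        budget.charge("search", 0.20)  # would bring the total to 1.10 USD
    except BudgetExceeded as exc:
        print("stopped:", exc)
    print("attribution:", budget.by_step)
```

Rejecting the charge before it is recorded keeps the attribution table consistent with what was actually spent, which is what an operator needs when reviewing the run afterwards.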

Next operating step

Instrument production agents so operators can see what the agent saw, decided, called, cost, changed, escalated, and recovered from.
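
As a starting point, the seven things named above can be captured as one structured record per agent step. This is a sketch, not a standard schema: the AgentStepRecord class, its field names, and the sample values are assumptions chosen to mirror the sentence above, and a real deployment would likely send these records to whatever tracing backend is already in use.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentStepRecord:
    """One agent step, shaped around what an operator needs to reconstruct later."""
    run_id: str
    step: int
    saw: dict                                     # context and inputs given to the agent
    decided: str                                  # the action or plan it chose
    called: list = field(default_factory=list)    # tools or APIs invoked
    cost_usd: float = 0.0                         # spend attributed to this step
    changed: list = field(default_factory=list)   # external state it modified
    escalated: bool = False                       # whether a human was pulled in
    recovered_from: Optional[str] = None          # error it recovered from, if any


def to_json_line(record):
    """Serialise a record as a single JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = AgentStepRecord(
        run_id="run-001",  # hypothetical identifiers and values throughout
        step=3,
        saw={"ticket_id": "T-42", "retrieved_docs": 4},
        decided="issue_refund",
        called=["billing.lookup", "billing.refund"],
        cost_usd=0.012,
        changed=["refund:T-42"],
        recovered_from="billing.lookup timeout",
    )
    print(to_json_line(record))
```

Keeping each step as a flat, self-describing line makes it possible to answer "what did the agent see and do here?" with a plain log search before any dedicated tooling is in place.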
