Agent observability: instrumenting LLM workflows in production
When an LLM is your workflow, traces, evals, and rollback are non-negotiable.

When an LLM is your workflow, “the system worked” is no longer a binary. The workflow can run, return a response, and still be wrong in ways that only show up in aggregate. Without observability, the failure modes are invisible — until a customer complaint, a regulator question, or a CEO email surfaces them.
Production LLM observability is the boring engineering layer that separates a working agent from a deployable one. This post covers what to instrument and why.
The four observability surfaces
A production agent needs four kinds of visibility:
- Per-call traces. What was the input, what was the prompt, what tools were called, what was the output, what was the latency, what was the cost?
- Per-call evaluation. Did this call meet quality thresholds? What was the confidence? What downstream actions did the output trigger?
- Aggregate quality metrics. Across the population of calls in the last 24 hours / 7 days / 30 days, what fraction met quality thresholds? What’s the trend?
- Cost and rate metrics. What is the per-call cost? Per-tenant cost? Per-day burn rate? Are any tenants hitting rate limits?
Most teams ship the first of these, the per-call trace, and stop there. The other three are where production-grade observability begins.
Per-call traces
Standard span data plus LLM-specific fields:
- Span: workflow name, request ID, tenant ID, user ID (where applicable).
- Inputs: original user input, retrieved context, tool call results.
- Prompt: the full assembled prompt (or a hash of it for sensitive workflows).
- Model invocation: model name, temperature, max tokens, system prompt version.
- Output: the raw model output, the parsed structured output (if applicable), any post-processing.
- Tool calls: each tool invoked, its arguments, its result.
- Latency: per-step latency, total latency.
- Cost: per-call cost (input tokens × input price + output tokens × output price).
We use OpenTelemetry as the trace contract, with custom attributes for the LLM-specific fields. The backend can be Tempo, Jaeger, Honeycomb, or any other OTel-compatible system.
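As a rough illustration, this is how those fields might hang off an OpenTelemetry span in Python. The attribute names (`workflow.*`, `llm.*`), the stubbed `call_model` helper, and the per-token prices are assumptions for the sketch, not a fixed convention:

```python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("agent.workflow")

# Assumed per-token prices in USD; real values come from the model's price sheet.
INPUT_PRICE = 3e-06
OUTPUT_PRICE = 15e-06

def call_model(prompt: str) -> tuple[str, int, int]:
    # Stub standing in for the real model invocation: (output, input_tokens, output_tokens).
    return "category: billing", len(prompt.split()), 4

def run_workflow(tenant_id: str, request_id: str, user_input: str) -> str:
    with tracer.start_as_current_span("ticket-triage") as span:
        span.set_attribute("workflow.request_id", request_id)
        span.set_attribute("workflow.tenant_id", tenant_id)

        prompt = f"Classify this ticket: {user_input}"
        output, in_tokens, out_tokens = call_model(prompt)

        span.set_attribute("llm.model", "example-model")
        # Store a hash rather than the raw prompt for sensitive workflows.
        span.set_attribute("llm.prompt_sha256", hashlib.sha256(prompt.encode()).hexdigest())
        span.set_attribute("llm.input_tokens", in_tokens)
        span.set_attribute("llm.output_tokens", out_tokens)
        span.set_attribute("llm.cost_usd", in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE)
        return output
```

Because the cost calculation lives on the span, per-call cost falls out of the trace backend with no extra plumbing.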
Per-call evaluation
Two evaluation patterns coexist:
- Synthetic evaluation. A fixed eval set (golden inputs with expected outputs) that runs on a schedule and on every prompt or model change. Catches regressions before deployment.
- Production evaluation. A sample of production calls, evaluated against quality criteria (sometimes automatically with an LLM-as-judge, sometimes with a human review queue). Catches drift after deployment.
The synthetic eval set is the regression gate. The production eval is the drift detector. Both are needed; neither substitutes for the other.
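A minimal sketch of the production side, assuming per-call records with `input` and `output` fields and a hypothetical `judge` callable standing in for the LLM-as-judge (a human review queue can feed the same verdicts):

```python
import random

def sample_production_evals(recent_calls: list[dict], judge, sample_rate: float = 0.05) -> float:
    # Drift detector: judge a random sample of production calls against the quality criteria.
    sample = [call for call in recent_calls if random.random() < sample_rate]
    if not sample:
        return 1.0
    verdicts = [judge(call["input"], call["output"]) for call in sample]  # True/False per call
    return sum(verdicts) / len(verdicts)
```

The synthetic side reuses the eval harness described later in this post: the same agent, run against a fixed golden set instead of a sample.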
Aggregate quality metrics
The dashboards engineering leadership actually looks at:
- Quality rate over time — % of calls passing the eval criteria, day over day.
- Latency distribution — P50 / P95 / P99 latency for the full workflow and per-step.
- Cost per outcome — total cost per business outcome (per resolved ticket, per drafted document, etc.), not just per call.
- Tool call distribution — which tools are called, how often, with what success rate.
- Failure mode distribution — when calls fail, why? Categorised: timeout, rate limit, parse error, low-confidence rejection, downstream error.
These dashboards are how leadership knows the agent is healthy. Without them, the only signal that something is wrong is customer complaints, which arrive too late.
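Those numbers can be rolled up straight from the per-call trace records. A sketch, assuming each record carries `passed`, `latency_ms`, `cost_usd`, and an optional `outcome_id`:

```python
from statistics import quantiles

def aggregate(records: list[dict]) -> dict:
    # Roll per-call records up into the dashboard numbers; assumes at least a handful of records.
    latencies = sorted(r["latency_ms"] for r in records)
    cuts = quantiles(latencies, n=100)  # 99 cut points: index 49 = P50, 94 = P95, 98 = P99
    outcomes = {r["outcome_id"] for r in records if r.get("outcome_id")}
    return {
        "quality_rate": sum(r["passed"] for r in records) / len(records),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "cost_per_outcome": sum(r["cost_usd"] for r in records) / max(len(outcomes), 1),
    }
```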
Cost and rate metrics
LLM costs can compound 100× from a bad prompt or a feedback loop. The controls that prevent the surprise bill:
- Per-tenant rate limits. Hard caps on calls/minute per tenant.
- Per-day cost caps. Hard caps on cost per tenant per day, with alerts at 50% / 80%.
- Runaway detection. Anomaly detection on call rate; alert + auto-throttle when a tenant’s call rate exceeds 5× their 30-day average.
- Per-prompt-version cost tracking. When a new prompt is deployed, immediate visibility into whether it’s more or less expensive than the prior version (longer outputs, more retries, etc.).
These controls don’t exist by default. They’re engineering work, scoped into the engagement.
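Two of the controls fit in a few lines; the engineering work is wiring them into the request gateway and the alerting pipeline. A sketch using the thresholds from the list above (50% / 80% alerts, 5× the 30-day average):

```python
def check_cost_cap(spend_today_usd: float, daily_cap_usd: float) -> str | None:
    # Per-day cost cap: alerts at 50% and 80%, hard stop at 100%.
    if spend_today_usd >= daily_cap_usd:
        return "block"
    if spend_today_usd >= 0.8 * daily_cap_usd:
        return "alert-80"
    if spend_today_usd >= 0.5 * daily_cap_usd:
        return "alert-50"
    return None

def is_runaway(calls_last_hour: float, avg_hourly_calls_30d: float) -> bool:
    # Runaway detection: flag a tenant whose call rate exceeds 5x their 30-day average.
    return calls_last_hour > 5 * avg_hourly_calls_30d
```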
The eval harness as a first-class artefact
A production agent has an eval harness with:
- A versioned set of input/output pairs (the eval set).
- A runner that executes the agent against the eval set and produces pass/fail per case.
- A diff view showing which cases changed between prompt versions.
- A history of eval runs over time.
Adding a case to the eval set is a code change, reviewed in PR. The eval set grows with the agent. Customer complaints become eval cases (so the same complaint can never recur).
The eval harness is what makes prompt changes safe. Without it, prompt changes are deployments-by-vibes.
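A sketch of the runner and the diff view, assuming the eval set is a JSON-lines file of `{id, input, expected}` cases and exact match is the pass criterion (real criteria are usually richer):

```python
import json

def run_eval(agent, eval_path: str = "evals/cases.jsonl") -> dict[str, bool]:
    # Execute the agent against the versioned eval set; pass/fail per case, keyed by case id.
    results = {}
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)
            results[case["id"]] = agent(case["input"]) == case["expected"]
    return results

def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    # Diff view: which cases regressed and which were fixed between two prompt versions.
    return {
        "regressed": sorted(cid for cid in before if before[cid] and not after.get(cid, False)),
        "fixed": sorted(cid for cid in before if not before[cid] and after.get(cid, False)),
    }
```

A new customer complaint becomes a new line in the cases file, reviewed in PR like any other code change.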
Rollback
Every prompt and model version is tagged. Deployments are atomic at the per-workflow level. A bad deployment can be rolled back in one command:
agent-deploy rollback workflow-name --to-version 42
The rollback path is exercised weekly in a deliberate drill, not used for the first time in an outage.
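The one-command rollback works because the deploy target is just a version pointer per workflow. A hypothetical sketch (the registry path and shape are invented for illustration; a real implementation would also write the file atomically via temp-file-and-rename):

```python
import json

REGISTRY_PATH = "deployments/registry.json"  # invented path for the sketch

def rollback(workflow: str, to_version: int) -> None:
    # Flip the active version pointer for one workflow; every tagged prompt/model version stays available.
    with open(REGISTRY_PATH) as f:
        registry = json.load(f)
    registry[workflow]["active_version"] = to_version
    with open(REGISTRY_PATH, "w") as f:
        json.dump(registry, f, indent=2)
```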
What this costs
A first-class observability stack for an agent in production adds 20–30% to the build cost of the agent itself. It is not optional. The cost of running an agent without observability — measured in surprised CEOs, regulator letters, and ad-hoc forensics on what went wrong last Tuesday — is dramatically higher.
We scope the observability stack into every Agents engagement. Clients sometimes ask if it can be cut to lower the price. The answer is no.
Read more: /agents/ · /case-studies/agency-bid-automation · /method/