Observability for AI agents in production
How I design observability for AI agents: traces, logs, metrics, costs, errors, privacy and continuous improvement.
An AI agent in production cannot be managed only by checking whether it "answers well". When the system uses tools, queries data, calls APIs, decides whether to escalate, consumes tokens and executes real actions, I need to know what happened at each step.
Observability is what turns an AI automation into an operable system. Without it, a failure becomes a debate: nobody knows whether the problem was the prompt, the model, an API, a credential, the retrieved context, a business rule or a human decision that arrived late.
In projects with n8n, LLMs, Python, FastAPI, Spring, web search and workflows connected to business processes, I would design observability from the beginning. Not as a nice extra, but as a condition for maintaining the system.
Direct answer: what to observe in an AI agent
An AI agent should be observed by recording execution traces, structured inputs and outputs, tool calls, classified errors, costs, latency, escalation decisions, prompt versions and context used, while minimising sensitive data.
Observability should answer concrete questions:
| Question | Required signal |
|---|---|
| What did the user ask? | Normalised input and channel |
| What did the agent understand? | Intent, entities and confidence |
| What context did it use? | Source, version and relevant snippets |
| Which tool did it call? | Name, parameters, result and error |
| What did it cost? | Tokens, model, calls and estimated cost |
| Why did it escalate? | Rule, confidence or missing data |
| What answer did it give? | Final output and execution state |
| Can it be debugged? | Correlation ID and full trace |
An observable agent is not one that stores everything. It is one that stores enough to understand, debug, audit and improve without exposing unnecessary data.
Why AI observability is different
In a traditional API, status codes, latency, errors and throughput often tell much of the story. In an AI agent, that is not enough.
An agent can fail even when the status code is 200:
- it classified the intent incorrectly;
- it retrieved irrelevant context;
- it used an outdated source;
- it selected the wrong tool;
- it generated invalid JSON;
- it replied with the wrong tone;
- it spent too many tokens;
- it failed to escalate when required;
- it escalated too many simple cases;
- it mixed data from different tenants.
AI observability must look at the whole process, not only whether the call completed.
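As a minimal sketch of a check that goes beyond the status code, the function below inspects a model response for some of the semantic problems listed above. The field names in `REQUIRED_FIELDS` and the `0.7` confidence threshold are assumptions chosen for illustration, not a fixed schema.

```python
import json

# Hypothetical output schema: adjust to whatever the agent is expected to return.
REQUIRED_FIELDS = {"intent", "confidence", "answer"}


def check_agent_output(raw: str, min_confidence: float = 0.7) -> list[str]:
    """Return the semantic problems found in a raw model response.

    An empty list means no problem was detected by these checks,
    not that the answer is necessarily good.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["invalid_json"]

    problems = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append("missing_fields:" + ",".join(sorted(missing)))

    confidence = data.get("confidence")
    if isinstance(confidence, (int, float)) and confidence < min_confidence:
        problems.append("low_confidence")
    return problems
```

The HTTP call behind each of these responses could have returned 200; the list of problems is the signal that is worth recording.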
This connects with my guide on how I evaluate AI agents before production: if I cannot review traces, I cannot evaluate well either.
Logs, metrics, traces and events
I would not use "logs" as a word for everything. I would separate four types of signals:
| Signal | What it is for |
|---|---|
| Logs | Understanding details of a specific execution |
| Metrics | Seeing trends, volume, latency, cost and errors |
| Traces | Reconstructing the full path across components |
| Events | Recording important state changes |
An observable flow could look like this:
request_received
-> intent_detected
-> context_retrieved
-> tool_called
-> validation_passed
-> response_sent
Each step should share a common identifier. Without that identifier, following an execution across n8n, backend, LLM provider, CRM and notification system becomes too slow.
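The flow above could be recorded with something as small as the sketch below. `ExecutionTrace` and its fields are hypothetical names; in a real system the events would go to a log pipeline or a tracing backend rather than an in-memory list.

```python
import time
import uuid


class ExecutionTrace:
    """Collects the events of one execution under a single identifier."""

    def __init__(self, workflow: str):
        # The ID format is an assumption; any unique, searchable value works.
        self.correlation_id = f"exec_{uuid.uuid4().hex[:12]}"
        self.workflow = workflow
        self.events = []

    def record(self, step: str, **fields) -> dict:
        """Append one event that carries the shared correlation_id."""
        event = {
            "correlation_id": self.correlation_id,
            "workflow": self.workflow,
            "step": step,
            "ts": time.time(),
            **fields,
        }
        self.events.append(event)
        return event


trace = ExecutionTrace("booking_agent")
for step in [
    "request_received",
    "intent_detected",
    "context_retrieved",
    "tool_called",
    "validation_passed",
    "response_sent",
]:
    trace.record(step)
```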
Correlation ID: the small piece that prevents chaos
Every execution should start with a correlation_id. That identifier should travel through every component:
- webhook;
- n8n workflow;
- FastAPI or Spring service;
- LLM call;
- external tool;
- database;
- log;
- notification;
- human escalation.
When something fails, I do not want to search for "the message at 10:32". I want to search for an identifier and see the whole story.
{
  "correlation_id": "exec_20260618_abc123",
  "tenant": "restaurant_demo",
  "workflow": "booking_agent",
  "step": "tool_called",
  "tool": "check_availability",
  "status": "success",
  "latency_ms": 842
}
This structure makes filtering, measuring and auditing much easier. It also reduces dependence on the developer's memory.
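One way to produce lines like the one above in a Python service is to keep the identifier in a context variable and let a log formatter attach it. This is a sketch using only the standard library; `correlation_id_var`, `JsonFormatter` and the `fields` key passed through `extra` are names I am assuming for the example.

```python
import contextvars
import json
import logging

# Holds the identifier for the current execution (per thread or async task).
correlation_id_var = contextvars.ContextVar("correlation_id", default="unknown")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the correlation_id."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "correlation_id": correlation_id_var.get(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "agent") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
correlation_id_var.set("exec_20260618_abc123")
logger.info(
    "tool_called",
    extra={"fields": {"tool": "check_availability", "status": "success", "latency_ms": 842}},
)
```

In a FastAPI or Spring service the identifier would normally be set once per request, for example from an incoming header, and forwarded to every downstream call.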
What I would record for each execution
I would not store complete prompts by default, or personal data without a reason. But I would keep operational information.
| Field | Reason |
|---|---|
| correlation_id | Follow the full execution |
| tenant or client | Separate data and costs |
| channel | Know whether it came from web, WhatsApp, email or API |
| workflow and version | Reproduce behaviour |
| prompt version | Detect regressions |
| model | Compare cost and quality |
| detected intent | Review classification |
| tools called | Audit actions |
| final state | Distinguish completed, failed and escalated runs |
| error reason | Classify failures |
| estimated cost | Financial control |
| total latency | User experience |
The key is storing structured data, not only free text. If everything is left as loose sentences, it cannot be analysed properly later.
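The fields in the table map naturally onto a typed record. The dataclass below is a sketch: the attribute names, the defaults and the choice of USD and milliseconds as units are my assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

FINAL_STATES = {"completed", "failed", "escalated"}


@dataclass
class ExecutionRecord:
    """One structured row per agent execution."""

    correlation_id: str
    tenant: str
    channel: str
    workflow: str
    workflow_version: str
    prompt_version: str
    model: str
    detected_intent: Optional[str] = None
    tools_called: list = field(default_factory=list)
    final_state: str = "completed"
    error_reason: Optional[str] = None
    estimated_cost_usd: float = 0.0
    total_latency_ms: int = 0

    def __post_init__(self):
        # Reject free-text states so the field stays queryable.
        if self.final_state not in FINAL_STATES:
            raise ValueError(f"unknown final_state: {self.final_state}")


record = ExecutionRecord(
    correlation_id="exec_20260618_abc123",
    tenant="restaurant_demo",
    channel="whatsapp",
    workflow="booking_agent",
    workflow_version="1.4.0",
    prompt_version="v7",
    model="small-model",
    detected_intent="book_table",
    tools_called=["check_availability"],
)
row = asdict(record)  # plain dict, ready to be stored or serialised
```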
Measure costs before the invoice arrives
In agents with LLMs, cost is not a secondary detail. A workflow can be correct and still not be viable if it makes too many calls or retrieves too much context.
I would measure:
- input tokens;
- output tokens;
- cost by model;
- number of LLM calls;
- number of tools used;
- cost per completed execution;
- cost per escalated execution;
- cost per tenant;
- cost per channel;
- cost per use case.
This enables technical decisions:
| Problem | Possible action |
|---|---|
| Too much context cost | Improve retrieval or summarise first |
| Too many model calls | Merge steps or use rules |
| Unnecessarily long outputs | Limit response format |
| Expensive escalated cases | Detect earlier and stop execution |
| Expensive model for simple task | Switch to a smaller model |
The question is not only "how much does it cost?". It is "which part of the system is incurring cost without adding value?".
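A rough sketch of how these measurements could be computed from recorded calls follows. The model names and the prices in `PRICES_PER_1M_TOKENS` are invented placeholders, not real provider pricing; the actual values have to come from the provider's price list.

```python
from collections import defaultdict

# Placeholder prices in USD per one million tokens: (input, output).
PRICES_PER_1M_TOKENS = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of a single LLM call."""
    in_price, out_price = PRICES_PER_1M_TOKENS[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cost_by(calls: list[dict], key: str) -> dict:
    """Sum the estimated cost grouped by any recorded field, e.g. tenant or channel."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


calls = [
    {"tenant": "restaurant_demo", "model": "large-model", "input_tokens": 2000, "output_tokens": 500},
    {"tenant": "restaurant_demo", "model": "small-model", "input_tokens": 1000, "output_tokens": 200},
    {"tenant": "clinic_demo", "model": "small-model", "input_tokens": 4000, "output_tokens": 0},
]
per_tenant = cost_by(calls, "tenant")
```

Grouping by `tenant` here is only one cut; the same function works for channel, workflow or use case as long as the field is recorded with each call.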
Classify errors to improve
A generic "failed" status is not useful. I would classify failures by origin:
| Category | Example |
|---|---|
| Input | Missing data, invalid format, ambiguity |
| Model | Invalid JSON, low confidence, out-of-policy response |
| Context | Source not found, insufficient evidence |
| Tool | API down, timeout, permission denied |
| Business | Action not allowed, closed hours, limit exceeded |
| Security | Sensitive data, prompt injection, wrong tenant |
| Human | Pending approval, rejection, manual edit |
This classification turns failures into actionable information. If 40% of errors come from missing data, the problem may be the form. If they come from tools, retries or circuit breakers may be missing. If they come from the model, the prompt, structured output or evaluation may need work.
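The categories can be encoded as an enum so that every failure is stored with one of a fixed set of values. In this sketch the mapping from Python exceptions to categories is an assumption that follows the examples in the table, and `UNCLASSIFIED` is a fallback I added for anything the mapping does not cover.

```python
import json
from collections import Counter
from enum import Enum


class ErrorCategory(str, Enum):
    INPUT = "input"
    MODEL = "model"
    CONTEXT = "context"
    TOOL = "tool"
    BUSINESS = "business"
    SECURITY = "security"
    HUMAN = "human"
    UNCLASSIFIED = "unclassified"  # fallback, not part of the table above


def classify_exception(exc: Exception) -> ErrorCategory:
    """Map a raised exception to a failure category (illustrative mapping)."""
    # JSONDecodeError subclasses ValueError, so it must be checked first.
    if isinstance(exc, json.JSONDecodeError):
        return ErrorCategory.MODEL
    if isinstance(exc, (TimeoutError, ConnectionError, PermissionError)):
        return ErrorCategory.TOOL
    if isinstance(exc, (KeyError, ValueError)):
        return ErrorCategory.INPUT
    return ErrorCategory.UNCLASSIFIED


def category_shares(categories: list[ErrorCategory]) -> dict:
    """Fraction of failures per category, e.g. to spot a 40% share of input errors."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}
```

Categories such as business, security or human usually come from explicit checks in the workflow rather than from exceptions, so they would be set directly at the point where the rule fires.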
Observability and privacy
Observing does not mean storing everything. In fact, storing too much can create new risks.
I would apply these rules:
- do not log secrets;
- do not store API tokens;
- avoid full prompts by default;
- mask emails, phone numbers and identifiers when possible;
- separate logs by tenant;
- define data retention;
- restrict access to sensitive traces;
- record metadata before full content;
- store complete samples only when there is a clear reason.
Observability must coexist with data minimisation. I cover this in more depth in my guide on security and privacy for enterprise AI agents.
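Two of the rules above, dropping secrets and masking contact details, can be applied just before a record is written. The sketch below is deliberately crude: the regular expressions and the key names in `SECRET_KEYS` are assumptions, and pattern matching of this kind will miss some personal data and occasionally mask harmless text.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Nine or more digits, optionally separated by spaces, dots, dashes or brackets.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical key names whose values should never reach the logs.
SECRET_KEYS = {"api_key", "token", "password", "authorization"}


def mask_text(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)


def sanitise(record: dict) -> dict:
    """Return a copy of the record that is safer to log."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SECRET_KEYS:
            continue  # drop secrets entirely instead of masking them
        if isinstance(value, str):
            clean[key] = mask_text(value)
        elif isinstance(value, dict):
            clean[key] = sanitise(value)
        else:
            clean[key] = value
    return clean
```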
Dashboards that actually help
A dashboard should not be a wall of numbers. It should answer operational questions.
I would review:
- executions per day;
- completion rate;
- escalation rate;
- main escalation reasons;
- cost per resolved case;
- p50, p95 and p99 latency;
- errors by integration;
- tenants with the most failures;
- prompts or versions with regressions;
- most used tools;
- manually corrected actions.
I would also create concrete alerts:
- sudden increase in errors;
- cost per execution above threshold;
- latency too high;
- escalation rate above normal;
- repeated failure of a critical tool;
- many cases blocked by human approval.
A good dashboard helps decide. If it only decorates, it is not doing its job.
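Both the latency percentiles and the threshold alerts can be derived from recorded executions with very little code. This is a sketch with the standard library; the metric names and limits are assumptions, and a production setup would more likely compute them in the metrics backend.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict:
    """p50, p95 and p99 of the recorded latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index i holds percentile i + 1.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def check_alerts(metrics: dict, thresholds: dict) -> list[str]:
    """Names of the metrics whose current value is above its threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    ]


# Hypothetical thresholds; real limits depend on the workflow and its baseline.
thresholds = {"error_rate": 0.05, "cost_per_execution_usd": 0.10, "p95_latency_ms": 5000}
current = {"error_rate": 0.12, "cost_per_execution_usd": 0.02, "p95_latency_ms": 3200}
firing = check_alerts(current, thresholds)
```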
From observability to continuous improvement
Observability is not only useful for incidents. It improves the product.
With good traces I can detect:
- prompts that generate more escalations;
- tools that fail frequently;
- steps that cost too much;
- fields that users do not fill in;
- cases a person always approves;
- overly strict rules;
- sources that provide little evidence;
- tasks that could be automated further.
This connects very well with the human-in-the-loop pattern for AI agents: human decisions do not only resolve cases, they also generate data to improve the system.
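As an example of turning traces into an improvement signal, the sketch below computes the escalation rate per prompt version from stored execution records. The record shape (`prompt_version`, `final_state`) is an assumption; any field recorded per execution can be used as the grouping key.

```python
from collections import defaultdict


def escalation_rate_by(records: list[dict], key: str = "prompt_version") -> dict:
    """Share of executions that ended in escalation, grouped by a recorded field."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["final_state"] == "escalated":
            escalated[group] += 1
    return {group: escalated[group] / totals[group] for group in totals}


records = [
    {"prompt_version": "v1", "final_state": "completed"},
    {"prompt_version": "v1", "final_state": "completed"},
    {"prompt_version": "v1", "final_state": "escalated"},
    {"prompt_version": "v1", "final_state": "completed"},
    {"prompt_version": "v2", "final_state": "escalated"},
    {"prompt_version": "v2", "final_state": "completed"},
]
rates = escalation_rate_by(records)
```

A jump in the rate between two versions is a prompt to look at the traces, not a verdict on its own, especially with as few samples as in this toy data.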
AI agent observability checklist
Before deploying an agent, I would review:
- Every execution has a correlation_id.
- Workflow, prompt and model versions are recorded.
- Tools return structured results.
- Errors are classified by category.
- Cost is measured per execution and tenant.
- Total and per-step latency are measured.
- Traces can reconstruct real actions.
- Logs minimise personal data.
- Alerts exist for cost, errors and latency.
- Human escalations have a recorded reason.
- Metrics can detect regressions.
- There is a clear way to pause a problematic workflow.
Final criterion
An AI agent in production is not finished when it answers. It is finished when it can be operated.
For me, that means being able to see what it did, why it did it, how much it cost, where it failed, what data it used, which tool it called and which person intervened if needed.
Observability does not make an agent more spectacular in a demo. It does something more important: it makes the system trustworthy when there are users, real data and decisions that matter.