Observability for AI agents in production

How I design observability for AI agents: traces, logs, metrics, costs, errors, privacy and continuous improvement.

An AI agent in production cannot be managed only by checking whether it "answers well". When the system uses tools, queries data, calls APIs, decides whether to escalate, consumes tokens and executes real actions, I need to know what happened at each step.

Observability is what turns an AI automation into an operable system. Without it, a failure becomes a debate: nobody knows whether the problem was the prompt, the model, an API, a credential, the retrieved context, a business rule or a human decision that arrived late.

In projects with n8n, LLMs, Python, FastAPI, Spring, web search and workflows connected to business processes, I would design observability from the beginning. Not as a nice extra, but as a condition for maintaining the system.

Direct answer: what to observe in an AI agent

An AI agent should be observed by recording execution traces, structured inputs and outputs, tool calls, classified errors, costs, latency, escalation decisions, prompt versions and context used, while minimising sensitive data.

Observability should answer concrete questions:

Question	Required signal
What did the user ask?	Normalised input and channel
What did the agent understand?	Intent, entities and confidence
What context did it use?	Source, version and relevant snippets
Which tool did it call?	Name, parameters, result and error
What did it cost?	Tokens, model, calls and estimated cost
Why did it escalate?	Rule, confidence or missing data
What answer did it give?	Final output and execution state
Can it be debugged?	Correlation ID and full trace

An observable agent is not one that stores everything. It is one that stores enough to understand, debug, audit and improve without exposing unnecessary data.

Why AI observability is different

In a traditional API, status codes, latency, errors and throughput often tell much of the story. In an AI agent, that is not enough.

An agent can fail even when the status code is 200:

it classified the intent incorrectly;
it retrieved irrelevant context;
it used an outdated source;
it selected the wrong tool;
it generated invalid JSON;
it replied with the wrong tone;
it spent too many tokens;
it failed to escalate when required;
it escalated too many simple cases;
it mixed data from different tenants.

AI observability must look at the whole process, not only whether the call completed.
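
A minimal sketch of this idea in Python, assuming a hypothetical agent whose output contract is a JSON object with `intent` and `confidence` fields. The function name, the required fields and the 0.6 threshold are illustrative assumptions, not part of any specific framework. The point is only that a 200 response still goes through content checks before it counts as a success.

```python
import json

REQUIRED_FIELDS = {"intent", "confidence"}  # hypothetical output contract
MIN_CONFIDENCE = 0.6                        # illustrative threshold


def check_agent_output(http_status: int, raw_body: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    if http_status != 200:
        return [f"http_{http_status}"]
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["not_an_object"]

    problems = []
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        problems.append("missing_fields:" + ",".join(sorted(missing)))
    confidence = data.get("confidence")
    if isinstance(confidence, (int, float)) and confidence < MIN_CONFIDENCE:
        problems.append("low_confidence")
    return problems


# A call that "succeeded" at HTTP level but is not usable downstream.
problems = check_agent_output(200, '{"intent": "book_table", "confidence": 0.2}')
```

Checks like wrong tone, irrelevant context or cross-tenant data need their own evaluators; this sketch only covers the structural ones.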

This connects with my guide on how I evaluate AI agents before production: if I cannot review traces, I cannot evaluate well either.

Logs, metrics, traces and events

I would not use "logs" as a word for everything. I would separate four types of signals:

Signal	What it is for
Logs	Understanding details of a specific execution
Metrics	Seeing trends, volume, latency, cost and errors
Traces	Reconstructing the full path across components
Events	Recording important state changes

An observable flow could look like this:

request_received
    -> intent_detected
        -> context_retrieved
            -> tool_called
                -> validation_passed
                    -> response_sent

Each step should share a common identifier. Without that identifier, following an execution across n8n, backend, LLM provider, CRM and notification system becomes too slow.
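
The flow above can be sketched as a small in-memory trace, assuming Python and a hypothetical `Trace` class of my own (in a real system I would more likely rely on a tracing library). Each step records the same correlation ID, its status and its latency.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects step events that all share one correlation ID."""

    def __init__(self, workflow: str):
        self.correlation_id = f"exec_{uuid.uuid4().hex[:12]}"
        self.workflow = workflow
        self.events: list[dict] = []

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        status = "success"
        try:
            yield
        except Exception:
            status = "error"
            raise  # the caller still sees the failure
        finally:
            # Recorded whether the step succeeded or failed.
            self.events.append({
                "correlation_id": self.correlation_id,
                "workflow": self.workflow,
                "step": name,
                "status": status,
                "latency_ms": round((time.perf_counter() - start) * 1000),
            })


trace = Trace("booking_agent")
with trace.step("request_received"):
    pass  # parse and normalise the input here
with trace.step("intent_detected"):
    pass  # call the model here
```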

Correlation ID: the small piece that prevents chaos

Every execution should start with a correlation_id. That identifier should travel through every component:

webhook;
n8n workflow;
FastAPI or Spring service;
LLM call;
external tool;
database;
log;
notification;
human escalation.

When something fails, I do not want to search for "the message at 10:32". I want to search for an identifier and see the whole story.

{
  "correlation_id": "exec_20260618_abc123",
  "tenant": "restaurant_demo",
  "workflow": "booking_agent",
  "step": "tool_called",
  "tool": "check_availability",
  "status": "success",
  "latency_ms": 842
}

This structure makes filtering, measuring and auditing much easier. It also reduces dependence on the developer's memory.
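
One way to produce entries like this in a Python service is sketched below, using only the standard library. It assumes the correlation ID is kept in a context variable and attached to every log line by a custom formatter; the `JsonFormatter` class and the `fields` key are my own illustrative names.

```python
import contextvars
import io
import json
import logging

# Holds the correlation ID of the execution currently being handled.
correlation_id_var = contextvars.ContextVar("correlation_id", default="unknown")


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "correlation_id": correlation_id_var.get(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields passed as logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Set once at the entry point (for example, when the webhook is received).
correlation_id_var.set("exec_20260618_abc123")
logger.info(
    "tool finished",
    extra={"fields": {"step": "tool_called", "tool": "check_availability", "latency_ms": 842}},
)
```

Across process boundaries (n8n, FastAPI, Spring) the same value would have to travel explicitly, for example in a request header; the context variable only covers one process.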

What I would record for each execution

I would not store complete prompts by default, nor personal data without a clear reason. I would, however, keep operational information:

Field	Reason
correlation_id	Follow the full execution
tenant or client	Separate data and costs
channel	Know whether it came from web, WhatsApp, email or API
workflow and version	Reproduce behaviour
prompt version	Detect regressions
model	Compare cost and quality
detected intent	Review classification
tools called	Audit actions
final state	Distinguish completed, failed and escalated executions
error reason	Classify failures
estimated cost	Financial control
total latency	User experience

The key is storing structured data, not only free text. If everything is left as loose sentences, it cannot be analysed properly later.
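
As a rough sketch, the fields in the table could map to a Python dataclass like the one below. The class name, field names, types and example values (including the model name) are assumptions for illustration; the allowed final states are the three listed in the table.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

FINAL_STATES = {"completed", "failed", "escalated"}


@dataclass
class ExecutionRecord:
    correlation_id: str
    tenant: str
    channel: str
    workflow: str
    workflow_version: str
    prompt_version: str
    model: str
    detected_intent: Optional[str] = None
    tools_called: list[str] = field(default_factory=list)
    final_state: str = "completed"
    error_reason: Optional[str] = None
    estimated_cost_usd: float = 0.0
    total_latency_ms: int = 0

    def __post_init__(self):
        # Reject free-text states so the field stays analysable later.
        if self.final_state not in FINAL_STATES:
            raise ValueError(f"unknown final_state: {self.final_state}")


record = ExecutionRecord(
    correlation_id="exec_20260618_abc123",
    tenant="restaurant_demo",
    channel="web",
    workflow="booking_agent",
    workflow_version="1.4.0",    # hypothetical version label
    prompt_version="booking-v7",  # hypothetical version label
    model="example-model",        # placeholder, not a real model name
    detected_intent="book_table",
    tools_called=["check_availability"],
    total_latency_ms=842,
)
serialised = json.dumps(asdict(record))  # one structured row per execution
```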

For those traces to be useful, prompts and workflows also need clear versions. I develop this in versioning prompts and workflows for AI agents.

Measure costs before the invoice arrives

In agents with LLMs, cost is not a secondary detail. A workflow can be correct and still not viable if it makes too many calls or retrieves too much context.

I would measure:

input tokens;
output tokens;
cost by model;
number of LLM calls;
number of tools used;
cost per completed execution;
cost per escalated execution;
cost per tenant;
cost per channel;
cost per use case.

This enables technical decisions:

Problem	Possible action
Too much context cost	Improve retrieval or summarise first
Too many model calls	Merge steps or use rules
Unnecessarily long outputs	Limit response format
Expensive escalated cases	Detect earlier and stop execution
Expensive model for simple task	Switch to a smaller model

The question is not only "how much does it cost?". It is "which part of the system is consuming cost without adding value?".
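
To make that question answerable, cost has to be computed per call and then grouped. The sketch below assumes Python and a per-1,000-token price table; the model names and prices are invented for illustration and do not correspond to any real provider.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens. Real prices vary by provider and model.
PRICES_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.005, "output": 0.015},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of a single LLM call in USD."""
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]


def cost_by(calls: list[dict], key: str) -> dict[str, float]:
    """Sum estimated cost grouped by any field: tenant, channel, workflow, etc."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call[key]] += estimate_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


calls = [
    {"tenant": "restaurant_demo", "model": "small-model", "input_tokens": 2000, "output_tokens": 1000},
    {"tenant": "restaurant_demo", "model": "large-model", "input_tokens": 1000, "output_tokens": 1000},
    {"tenant": "clinic_demo", "model": "small-model", "input_tokens": 4000, "output_tokens": 0},
]
per_tenant = cost_by(calls, "tenant")
per_model = cost_by(calls, "model")
```

Grouping by the same key used in the execution record (tenant, channel, use case) keeps cost reports and traces comparable.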

This point is becoming even more relevant in enterprise AI. In AI news June 2026: agents, search and enterprise control I analyse how recent updates from OpenAI, Google and Anthropic push toward usage control, traces and system-level cost awareness.

Classify errors to improve

A generic "failed" status is not useful on its own. I would classify failures by origin:

Category	Example
Input	Missing data, invalid format, ambiguity
Model	Invalid JSON, low confidence, out-of-policy response
Context	Source not found, insufficient evidence
Tool	API down, timeout, permission denied
Business	Action not allowed, closed hours, limit exceeded
Security	Sensitive data, prompt injection, wrong tenant
Human	Pending approval, rejection, manual edit

This classification turns failures into actionable information. If 40% of errors come from missing data, the problem may be the form. If they come from tools, retries or circuit breakers may be missing. If they come from the model, the prompt, structured output or evaluation may need work.
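
A small Python sketch of how those shares could be computed, assuming each failed execution has already been tagged with one of the categories from the table. The enum and function names are my own.

```python
from collections import Counter
from enum import Enum


class ErrorCategory(str, Enum):
    INPUT = "input"
    MODEL = "model"
    CONTEXT = "context"
    TOOL = "tool"
    BUSINESS = "business"
    SECURITY = "security"
    HUMAN = "human"


def error_shares(categories: list[ErrorCategory]) -> dict[str, float]:
    """Share of failures per category, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat.value: count / total for cat, count in counts.most_common()}


failures = (
    [ErrorCategory.INPUT] * 4
    + [ErrorCategory.TOOL] * 3
    + [ErrorCategory.MODEL] * 3
)
shares = error_shares(failures)
```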

Observability and privacy

Observing does not mean storing everything. In fact, storing too much can create new risks.

I would apply these rules:

do not log secrets;
do not store API tokens;
avoid full prompts by default;
mask emails, phone numbers and identifiers when possible;
separate logs by tenant;
define data retention;
restrict access to sensitive traces;
record metadata before full content;
store complete samples only when there is a clear reason.

Observability must coexist with data minimisation. I also develop this in security and privacy for enterprise AI agents.
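
Some of these rules can be applied just before a payload is logged. The Python sketch below is a rough illustration under my own assumptions: the regular expressions are deliberately simple and will miss or over-match some formats, and the list of secret key names is hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Rough pattern: at least nine characters of digits and separators, digits at both ends.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SECRET_KEYS = {"api_key", "authorization", "password", "token"}  # hypothetical names


def mask_text(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)


def redact(payload: dict) -> dict:
    """Return a copy that is safer to log: secrets dropped, strings masked."""
    clean = {}
    for key, value in payload.items():
        if key.lower() in SECRET_KEYS:
            clean[key] = "[redacted]"
        elif isinstance(value, str):
            clean[key] = mask_text(value)
        elif isinstance(value, dict):
            clean[key] = redact(value)  # nested objects are handled recursively
        else:
            clean[key] = value
    return clean


safe = redact({
    "api_key": "sk-123",
    "message": "Call +34 600 123 456 or mail ana@example.com",
    "meta": {"token": "x", "attempt": 1},
})
```

Pattern-based masking is a safety net, not a guarantee; the stronger control is not passing sensitive fields to the logger in the first place.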

Dashboards that actually help

A dashboard should not be a wall of numbers. It should answer operational questions.

I would review:

executions per day;
completion rate;
escalation rate;
main escalation reasons;
cost per resolved case;
p50, p95 and p99 latency;
errors by integration;
tenants with the most failures;
prompts or versions with regressions;
most used tools;
manually corrected actions.
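
For the latency percentiles, a simple option is the nearest-rank method, sketched here in Python. This is one of several percentile definitions; monitoring tools often interpolate instead, so their numbers can differ slightly on small samples.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


summary = latency_summary([120, 340, 95, 2100, 410, 180, 150, 800])
```

The gap between p50 and p95 is usually more informative than the average: it shows whether a few slow executions are hiding behind a healthy-looking mean.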

I would also create concrete alerts:

sudden increase in errors;
cost per execution above threshold;
latency too high;
escalation rate above normal;
repeated failure of a critical tool;
many cases blocked by human approval.
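
Alerts like these can start as plain threshold rules. A minimal Python sketch follows; the metric names and threshold values are placeholders I made up, and in practice they would be tuned against the system's normal behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float


# Illustrative thresholds only.
RULES = [
    AlertRule("error_rate_high", "error_rate", 0.05),
    AlertRule("cost_per_execution_high", "cost_per_execution_usd", 0.10),
    AlertRule("latency_p95_high", "latency_p95_ms", 5000),
    AlertRule("escalation_rate_high", "escalation_rate", 0.30),
]


def evaluate(metrics: dict[str, float], rules: list[AlertRule] = RULES) -> list[str]:
    """Return the names of the rules whose metric exceeds its threshold."""
    return [r.name for r in rules if metrics.get(r.metric, 0) > r.threshold]


fired = evaluate({
    "error_rate": 0.08,
    "cost_per_execution_usd": 0.02,
    "latency_p95_ms": 6200,
    "escalation_rate": 0.10,
})
```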

A good dashboard helps decide. If it only decorates, it is not doing its job.

From observability to continuous improvement

Observability is not only useful for incidents. It improves the product.

With good traces I can detect:

prompts that generate more escalations;
tools that fail frequently;
steps that consume too much cost;
fields that users do not fill in;
cases a person always approves;
overly strict rules;
sources that provide little evidence;
tasks that could be automated further.

This connects very well with the human-in-the-loop pattern for AI agents: human decisions do not only resolve cases, they also generate data to improve the system.
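
As an example of the first item in that list, escalation rate can be compared across prompt versions. The Python sketch below assumes each execution record is a dict with `prompt_version` and `final_state` fields; the version labels are invented.

```python
from collections import defaultdict


def escalation_rate_by(records: list[dict], key: str = "prompt_version") -> dict[str, float]:
    """Fraction of executions that ended as 'escalated', grouped by the given field."""
    totals: dict[str, int] = defaultdict(int)
    escalated: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["final_state"] == "escalated":
            escalated[record[key]] += 1
    return {group: escalated[group] / totals[group] for group in totals}


records = [
    {"prompt_version": "booking-v6", "final_state": "completed"},
    {"prompt_version": "booking-v6", "final_state": "completed"},
    {"prompt_version": "booking-v6", "final_state": "escalated"},
    {"prompt_version": "booking-v6", "final_state": "completed"},
    {"prompt_version": "booking-v7", "final_state": "escalated"},
    {"prompt_version": "booking-v7", "final_state": "completed"},
]
rates = escalation_rate_by(records)
```

With so few records the difference means nothing; on real volumes the same grouping works for tools, tenants or channels by changing `key`.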

AI agent observability checklist

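Before deploying an agent, I would review the points covered above:

every execution carries a correlation_id through all components;
logs are structured, not free text;
workflow, prompt and model versions are recorded;
tool calls and escalation decisions are traced;
errors are classified by origin;
tokens, latency and cost are measured per execution and per tenant;
secrets are never logged, sensitive data is masked and retention is defined;
dashboards and alerts answer operational questions.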

Final criterion

An AI agent in production is not finished when it answers. It is finished when it can be operated.

For me, that means being able to see what it did, why it did it, how much it cost, where it failed, what data it used, which tool it called and which person intervened if needed.

Observability does not make an agent more spectacular in a demo. It does something more important: it makes the system trustworthy when there are users, real data and decisions that matter.