Tags: Observability, AI Agents, LLM, n8n, Production

Observability for AI agents in production

How I design observability for AI agents: traces, logs, metrics, costs, errors, privacy and continuous improvement.

June 18, 2026 · 9 min read · by Gorka Hernandez Villalon

An AI agent in production cannot be managed only by checking whether it "answers well". When the system uses tools, queries data, calls APIs, decides whether to escalate, consumes tokens and executes real actions, I need to know what happened at each step.

Observability is what turns an AI automation into an operable system. Without it, a failure becomes a debate: nobody knows whether the problem was the prompt, the model, an API, a credential, the retrieved context, a business rule or a human decision that arrived late.

In projects with n8n, LLMs, Python, FastAPI, Spring, web search and workflows connected to business processes, I would design observability from the beginning. Not as a nice extra, but as a condition for maintaining the system.

Direct answer: what to observe in an AI agent

An AI agent should be observed by recording execution traces, structured inputs and outputs, tool calls, classified errors, costs, latency, escalation decisions, prompt versions and context used, while minimising sensitive data.

Observability should answer concrete questions:

Question | Required signal
What did the user ask? | Normalised input and channel
What did the agent understand? | Intent, entities and confidence
What context did it use? | Source, version and relevant snippets
Which tool did it call? | Name, parameters, result and error
What did it cost? | Tokens, model, calls and estimated cost
Why did it escalate? | Rule, confidence or missing data
What answer did it give? | Final output and execution state
Can it be debugged? | Correlation ID and full trace

An observable agent is not one that stores everything. It is one that stores enough to understand, debug, audit and improve without exposing unnecessary data.

Why AI observability is different

In a traditional API, status codes, latency, errors and throughput often tell much of the story. In an AI agent, that is not enough.

An agent can fail even when the status code is 200:

  • it classified the intent incorrectly;
  • it retrieved irrelevant context;
  • it used an outdated source;
  • it selected the wrong tool;
  • it generated invalid JSON;
  • it replied with the wrong tone;
  • it spent too many tokens;
  • it failed to escalate when required;
  • it escalated too many simple cases;
  • it mixed data from different tenants.

AI observability must look at the whole process, not only whether the call completed.
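
Some of these silent failures can be caught with cheap checks on the output, even when the call itself succeeded. The following is a minimal sketch, not a complete validator: the field names (`raw_output`, `tenant`, `total_tokens`), the `token_budget` default and the problem labels are all assumptions made for illustration.

```python
import json

def semantic_checks(execution, expected_tenant, token_budget=4000):
    """Return the problems found in an execution that returned HTTP 200."""
    problems = []

    # 1. The model was asked for JSON: verify that the output actually parses.
    try:
        payload = json.loads(execution["raw_output"])
    except json.JSONDecodeError:
        problems.append("invalid_json")
        payload = {}

    # 2. Data in the answer must belong to the tenant that made the request.
    if isinstance(payload, dict) and payload.get("tenant", expected_tenant) != expected_tenant:
        problems.append("tenant_mismatch")

    # 3. A correct answer can still spend too many tokens.
    if execution.get("total_tokens", 0) > token_budget:
        problems.append("token_budget_exceeded")

    return problems

# Three hypothetical executions, all with status code 200.
ok = {"status_code": 200, "total_tokens": 900,
      "raw_output": '{"tenant": "restaurant_demo", "slots": ["20:30"]}'}
bad = {"status_code": 200, "total_tokens": 9000,
       "raw_output": '{"tenant": "restaurant_demo"'}  # truncated JSON
wrong = {"status_code": 200, "total_tokens": 100,
         "raw_output": '{"tenant": "other_tenant"}'}

for execution in (ok, bad, wrong):
    print(semantic_checks(execution, "restaurant_demo"))
```

Checks like these do not replace evaluation, but they turn "the call completed" into a more useful signal.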

This connects with my guide on how I evaluate AI agents before production: if I cannot review traces, I cannot evaluate well either.

Logs, metrics, traces and events

I would not use "logs" as a catch-all term. I would separate four types of signals:

Signal | What it is for
Logs | Understanding details of a specific execution
Metrics | Seeing trends, volume, latency, cost and errors
Traces | Reconstructing the full path across components
Events | Recording important state changes

An observable flow could look like this:

request_received
    -> intent_detected
        -> context_retrieved
            -> tool_called
                -> validation_passed
                    -> response_sent

Each step should share a common identifier. Without that identifier, following an execution across n8n, backend, LLM provider, CRM and notification system becomes too slow.
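
To illustrate why the shared identifier matters, here is a small sketch that rebuilds the path of one execution from events emitted by different components. The event shape (`correlation_id`, `ts`, `source`, `step`) and the sample values are assumptions; a real setup would query a log or tracing backend instead of an in-memory list.

```python
from collections import defaultdict

# Hypothetical events, arriving unordered from different components.
events = [
    {"correlation_id": "exec_a", "ts": 3, "source": "backend", "step": "context_retrieved"},
    {"correlation_id": "exec_b", "ts": 1, "source": "n8n", "step": "request_received"},
    {"correlation_id": "exec_a", "ts": 1, "source": "n8n", "step": "request_received"},
    {"correlation_id": "exec_a", "ts": 2, "source": "llm", "step": "intent_detected"},
]

def group_traces(events):
    """Group events by correlation ID and order each group by timestamp."""
    traces = defaultdict(list)
    for event in events:
        traces[event["correlation_id"]].append(event)
    for steps in traces.values():
        steps.sort(key=lambda event: event["ts"])
    return dict(traces)

traces = group_traces(events)
path_a = [event["step"] for event in traces["exec_a"]]
print(" -> ".join(path_a))
```

Without the `correlation_id` key, the only way to join these events would be timestamps and guesswork.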

Correlation ID: the small piece that prevents chaos

Every execution should start with a correlation_id. That identifier should travel through every component:

  • webhook;
  • n8n workflow;
  • FastAPI or Spring service;
  • LLM call;
  • external tool;
  • database;
  • log;
  • notification;
  • human escalation.

When something fails, I do not want to search for "the message at 10:32". I want to search for an identifier and see the whole story.

{
  "correlation_id": "exec_20260618_abc123",
  "tenant": "restaurant_demo",
  "workflow": "booking_agent",
  "step": "tool_called",
  "tool": "check_availability",
  "status": "success",
  "latency_ms": 842
}

This structure makes filtering, measuring and auditing much easier. It also reduces dependence on the developer's memory.
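
One possible way to produce events like the one above in Python is to bind the identifier to the current context, so that every log call picks it up without passing it around by hand. This is a sketch under assumptions: the helper names (`start_execution`, `build_event`, `emit`) are hypothetical, and printing JSON lines stands in for whatever log backend is actually used.

```python
import contextvars
import json
import time
import uuid

# Holds the correlation ID of the current execution (also works across asyncio tasks).
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)

def start_execution(prefix="exec"):
    """Create a new correlation ID and bind it to the current context."""
    cid = f"{prefix}_{uuid.uuid4().hex[:12]}"
    correlation_id_var.set(cid)
    return cid

def build_event(step, status, **fields):
    """Build a structured event that always carries the current correlation ID."""
    event = {
        "correlation_id": correlation_id_var.get(),
        "step": step,
        "status": status,
        "timestamp": time.time(),
    }
    event.update(fields)
    return event

def emit(event):
    """Write one JSON object per line; a real system would ship this to a log backend."""
    print(json.dumps(event, sort_keys=True))

cid = start_execution()
event = build_event(
    "tool_called", "success",
    tenant="restaurant_demo", workflow="booking_agent",
    tool="check_availability", latency_ms=842,
)
emit(event)
```

When the execution crosses a process boundary (for example from n8n to a FastAPI service), the same value would have to travel explicitly, typically in a request header or a field of the payload.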

What I would record for each execution

I would not store complete prompts by default, or personal data without a reason. But I would keep operational information.

Field | Reason
correlation_id | Follow the full execution
tenant or client | Separate data and costs
channel | Know whether it came from web, WhatsApp, email or API
workflow and version | Reproduce behaviour
prompt version | Detect regressions
model | Compare cost and quality
detected intent | Review classification
tools called | Audit actions
final state | completed, failed, escalated
error reason | Classify failures
estimated cost | Financial control
total latency | User experience

The key is storing structured data, not only free text. If everything is left as loose sentences, it cannot be analysed properly later.
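
As a sketch of what "structured" could mean here, the fields from the table can be modelled as a typed record that is validated once and serialised to JSON. The class name, the exact field names and the sample values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

ALLOWED_STATES = {"completed", "failed", "escalated"}

@dataclass
class ExecutionRecord:
    correlation_id: str
    tenant: str
    channel: str
    workflow: str
    workflow_version: str
    prompt_version: str
    model: str
    detected_intent: str
    final_state: str
    tools_called: list = field(default_factory=list)
    error_reason: Optional[str] = None
    estimated_cost: float = 0.0
    total_latency_ms: int = 0

    def __post_init__(self):
        # Reject free-text states so the field stays filterable later.
        if self.final_state not in ALLOWED_STATES:
            raise ValueError(f"unknown final_state: {self.final_state!r}")

record = ExecutionRecord(
    correlation_id="exec_20260618_abc123",
    tenant="restaurant_demo",
    channel="whatsapp",
    workflow="booking_agent",
    workflow_version="1.4.0",   # hypothetical version labels
    prompt_version="p7",
    model="small-model",
    detected_intent="book_table",
    final_state="completed",
    tools_called=["check_availability"],
    estimated_cost=0.0031,
    total_latency_ms=1840,
)
line = json.dumps(asdict(record))
print(line)
```

Note that the record holds metadata only: no prompt text and no user message.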

Measure costs before the invoice arrives

In agents with LLMs, cost is not a secondary detail. A workflow can be correct and still not be viable if it makes too many calls or retrieves too much context.

I would measure:

  • input tokens;
  • output tokens;
  • cost by model;
  • number of LLM calls;
  • number of tools used;
  • cost per completed execution;
  • cost per escalated execution;
  • cost per tenant;
  • cost per channel;
  • cost per use case.

This enables technical decisions:

Problem | Possible action
Context cost too high | Improve retrieval or summarise first
Too many model calls | Merge steps or use rules
Unnecessarily long outputs | Limit response format
Expensive escalated cases | Detect earlier and stop execution
Expensive model for a simple task | Switch to a smaller model

The question is not only "how much does it cost?". It is "which part of the system is consuming cost without adding value?".
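
A rough way to answer that question is to estimate cost per LLM call from token counts and then aggregate by whichever dimension matters. In this sketch the model names and the prices per million tokens are invented placeholders, not real provider pricing, and the call records are hypothetical.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens. Replace with the provider's real price list.
PRICES = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}

def call_cost(model, input_tokens, output_tokens):
    """Estimated cost of a single LLM call."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

def cost_by(calls, key):
    """Sum estimated cost grouped by any field of the call record (tenant, model, channel...)."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(call["model"], call["input_tokens"], call["output_tokens"])
    return dict(totals)

calls = [
    {"tenant": "restaurant_demo", "model": "large-model", "input_tokens": 2000, "output_tokens": 500},
    {"tenant": "restaurant_demo", "model": "small-model", "input_tokens": 1000, "output_tokens": 200},
    {"tenant": "clinic_demo", "model": "small-model", "input_tokens": 4000, "output_tokens": 0},
]

per_tenant = cost_by(calls, "tenant")
per_model = cost_by(calls, "model")
print(per_tenant)
print(per_model)
```

The same `cost_by` call works for channel, use case or final state, as long as the field is recorded on each call.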

Classify errors to improve

A generic "failed" error is not useful. I would classify failures by origin:

Category | Example
Input | Missing data, invalid format, ambiguity
Model | Invalid JSON, low confidence, out-of-policy response
Context | Source not found, insufficient evidence
Tool | API down, timeout, permission denied
Business | Action not allowed, closed hours, limit exceeded
Security | Sensitive data, prompt injection, wrong tenant
Human | Pending approval, rejection, manual edit

This classification turns failures into actionable information. If 40% of errors come from missing data, the problem may be the form. If they come from tools, retries or circuit breakers may be missing. If they come from the model, the prompt, structured output or evaluation may need work.
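
The classification can start very simply: map exceptions to a category at the point where they are caught, then count. The mapping below is an assumption for illustration (real systems would use their own exception types), and `UNKNOWN` is a fallback added here that is not part of the table above.

```python
import json
from collections import Counter
from enum import Enum

class ErrorCategory(str, Enum):
    INPUT = "input"
    MODEL = "model"
    CONTEXT = "context"
    TOOL = "tool"
    BUSINESS = "business"
    SECURITY = "security"
    HUMAN = "human"
    UNKNOWN = "unknown"  # fallback, not in the original table

# Order matters: JSONDecodeError is a subclass of ValueError, so it must come first.
RULES = [
    (json.JSONDecodeError, ErrorCategory.MODEL),  # model returned invalid JSON
    (TimeoutError, ErrorCategory.TOOL),           # external API too slow
    (PermissionError, ErrorCategory.TOOL),        # credential or permission problem
    (ValueError, ErrorCategory.INPUT),            # missing or malformed user data
]

def classify(exc):
    """Return the first category whose exception type matches."""
    for exc_type, category in RULES:
        if isinstance(exc, exc_type):
            return category
    return ErrorCategory.UNKNOWN

def category_shares(errors):
    """Fraction of errors per category."""
    counts = Counter(classify(exc) for exc in errors)
    return {category: n / len(errors) for category, n in counts.items()}

errors = [
    ValueError("missing date"),
    ValueError("invalid phone format"),
    TimeoutError("crm timeout"),
    PermissionError("calendar api denied"),
    json.JSONDecodeError("Expecting value", "{not json", 1),
]
shares = category_shares(errors)
print({category.value: share for category, share in shares.items()})
```

In this made-up sample, input errors account for 40% of failures, which would point at the form rather than at the model.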

Observability and privacy

Observing does not mean storing everything. In fact, storing too much can create new risks.

I would apply these rules:

  • do not log secrets;
  • do not store API tokens;
  • avoid full prompts by default;
  • mask emails, phone numbers and identifiers when possible;
  • separate logs by tenant;
  • define data retention;
  • restrict access to sensitive traces;
  • record metadata before full content;
  • store complete samples only when there is a clear reason.

Observability must coexist with data minimisation. I cover this in more depth in my article on security and privacy for enterprise AI agents.
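
Several of the rules above can be enforced in one place, right before a record is written. The sketch below drops secret-looking keys and masks emails and phone numbers in string values. The key list and both regular expressions are deliberately rough assumptions: the phone pattern in particular can over-match long digit sequences, so it would need tuning against real data.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical list of keys that must never reach the logs.
SECRET_KEYS = {"api_key", "authorization", "token", "password"}

def mask_text(text):
    """Replace emails and phone-like sequences with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)

def redact(record):
    """Return a copy of the record that is safe to log."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SECRET_KEYS:
            continue  # drop secrets entirely instead of masking them
        if isinstance(value, str):
            clean[key] = mask_text(value)
        elif isinstance(value, dict):
            clean[key] = redact(value)  # handle nested payloads
        else:
            clean[key] = value
    return clean

raw = {
    "correlation_id": "exec_20260618_abc123",
    "api_key": "sk-do-not-log",
    "user_message": "Call +34 600 123 456 or mail ana@example.com",
    "tool_params": {"token": "abc", "party_size": 4},
}
safe = redact(raw)
print(safe)
```

Masking at write time is a choice: it means the original value cannot be recovered from the logs, which is usually the point.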

Dashboards that actually help

A dashboard should not be a wall of numbers. It should answer operational questions.

I would review:

  • executions per day;
  • completion rate;
  • escalation rate;
  • main escalation reasons;
  • cost per resolved case;
  • p50, p95 and p99 latency;
  • errors by integration;
  • tenants with the most failures;
  • prompts or versions with regressions;
  • most used tools;
  • manually corrected actions.

I would also create concrete alerts:

  • sudden increase in errors;
  • cost per execution above threshold;
  • latency too high;
  • escalation rate above normal;
  • repeated failure of a critical tool;
  • many cases blocked by human approval.

A good dashboard helps decide. If it only decorates, it is not doing its job.
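
As an example of how two of these items could be computed, the sketch below derives p50, p95 and p99 from raw latencies with the standard library and compares a few metrics against thresholds. The metric names, sample values and threshold values are assumptions; in practice a metrics backend would normally do both jobs.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 using the 99 cut points of statistics.quantiles."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def check_alerts(metrics, thresholds):
    """Return the names of the metrics that exceed their threshold."""
    return sorted(name for name, limit in thresholds.items() if metrics.get(name, 0) > limit)

# Hypothetical sample: 101 executions with latencies from 0 to 1000 ms.
latencies = [i * 10 for i in range(101)]
pct = latency_percentiles(latencies)

metrics = {
    "error_rate": 0.08,
    "cost_per_execution": 0.02,
    "p95_latency_ms": pct["p95"],
}
thresholds = {
    "error_rate": 0.05,
    "cost_per_execution": 0.05,
    "p95_latency_ms": 800,
}
alerts = check_alerts(metrics, thresholds)
print(pct)
print(alerts)
```

Percentiles are used instead of the average because a handful of very slow executions can hide behind a healthy mean.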

From observability to continuous improvement

Observability is not only useful for incidents. It improves the product.

With good traces I can detect:

  • prompts that generate more escalations;
  • tools that fail frequently;
  • steps that consume too much cost;
  • fields that users do not fill in;
  • cases a person always approves;
  • overly strict rules;
  • sources that provide little evidence;
  • tasks that could be automated further.

This connects very well with the human-in-the-loop pattern for AI agents: human decisions do not only resolve cases, they also generate data to improve the system.
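
For instance, the first item in the list (prompts that generate more escalations) can be detected by comparing escalation rates across prompt versions. This is a sketch with invented data; the `max_increase` tolerance of 10 percentage points is an arbitrary assumption, and with small samples a difference like this may just be noise.

```python
from collections import defaultdict

def escalation_rate_by(executions, key="prompt_version"):
    """Share of executions that ended as escalated, grouped by the given field."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for execution in executions:
        totals[execution[key]] += 1
        if execution["final_state"] == "escalated":
            escalated[execution[key]] += 1
    return {group: escalated[group] / totals[group] for group in totals}

def regressions(rates, baseline, max_increase=0.10):
    """Versions whose escalation rate exceeds the baseline by more than max_increase."""
    base = rates[baseline]
    return sorted(version for version, rate in rates.items() if rate - base > max_increase)

# Hypothetical execution records: 4 per prompt version.
executions = [
    {"prompt_version": "v1", "final_state": "completed"},
    {"prompt_version": "v1", "final_state": "completed"},
    {"prompt_version": "v1", "final_state": "escalated"},
    {"prompt_version": "v1", "final_state": "completed"},
    {"prompt_version": "v2", "final_state": "escalated"},
    {"prompt_version": "v2", "final_state": "escalated"},
    {"prompt_version": "v2", "final_state": "completed"},
    {"prompt_version": "v2", "final_state": "completed"},
]

rates = escalation_rate_by(executions)
flagged = regressions(rates, baseline="v1")
print(rates)
print(flagged)
```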

AI agent observability checklist

Before deploying an agent, I would review:

  • Every execution has a correlation_id.
  • Workflow, prompt and model versions are recorded.
  • Tools return structured results.
  • Errors are classified by category.
  • Cost is measured per execution and tenant.
  • Total and per-step latency are measured.
  • Traces can reconstruct real actions.
  • Logs minimise personal data.
  • Alerts exist for cost, errors and latency.
  • Human escalations have a recorded reason.
  • Metrics can detect regressions.
  • There is a clear way to pause a problematic workflow.

Final criterion

An AI agent in production is not finished when it answers. It is finished when it can be operated.

For me, that means being able to see what it did, why it did it, how much it cost, where it failed, what data it used, which tool it called and which person intervened if needed.

Observability does not make an agent more spectacular in a demo. It does something more important: it makes the system trustworthy when there are users, real data and decisions that matter.