Skip to content
Gorka Hernandez Villalon, iOS developer and AI automation specialistGorka Hernandez
Back to blog
AI AgentsLLMEvaluationObservabilityAI Automation

How I evaluate AI agents before putting them into production

A practical guide to evaluating AI agents with datasets, tools, guardrails, traces, cost metrics and human review before production.

June 16, 2026 9 min readby Gorka Hernandez Villalon

An AI agent can look good after five manual conversations. It answers quickly, keeps the tone and uses a tool. That impression does not prove that it is ready for production.

The real challenge appears when the agent receives incomplete messages, contradictory instructions, API errors, frustrated users, sensitive data, out-of-policy requests or cases that only happen once every hundred executions.

That is why I would not evaluate an agent only by chatting with it. I would evaluate it as a system: inputs, reasoning, tools, rules, cost, security, traceability and final experience.

Direct answer: how to evaluate an AI agent

An AI agent should be evaluated with a representative case dataset, observable success criteria, tool tests, security guardrails, reviewable traces, cost metrics and human review of real samples.

The evaluation should not only ask "does the answer sound good?". It should ask:

DimensionKey question
IntentDoes it understand what the user wants to achieve?
ContextDoes it use the right and sufficient information?
ToolsDoes it choose the right tool with valid parameters?
RulesDoes it respect permissions, limits and policies?
OutcomeDoes it resolve the case or escalate correctly?
SecurityDoes it resist malicious instructions or untrusted data?
CostAre token and tool costs sustainable?
TraceabilityCan we review what happened and why?

An agent is not reliable because it succeeds on easy examples. It is reliable when its failures are detectable, explainable and recoverable.

The common mistake: evaluating only the final answer

Many teams review the agent's last sentence and decide whether they like it. That is not enough.

A response can be well written and still be dangerous:

  • it may have used the wrong source;
  • it may have ignored a constraint;
  • it may have invented a fact;
  • it may have called a tool with incorrect parameters;
  • it may have answered when it should have escalated;
  • it may have consumed too much cost for a simple task.

The evaluation must inspect the whole process. If an agent gives the right answer by accident, that is not behaviour I would trust for operations.

1. Define tasks and limits before testing

Before creating tests, I would define what the agent can and cannot do.

For example, in a customer-support agent:

  • it can answer frequently asked questions;
  • it can check the status of a request;
  • it can ask for missing data;
  • it can create a ticket;
  • it can escalate to a person;
  • it cannot promise unauthorised compensation;
  • it cannot expose another user's data;
  • it cannot execute actions outside policy.

This list turns evaluation into something concrete. Without clear limits, every test becomes an opinion about whether the conversation "feels good".

2. Build a dataset with real and synthetic cases

The evaluation dataset should mix frequent cases, edge cases and adversarial cases. It does not need to be huge at first, but it must cover important behaviour.

Case typeExample
Happy pathThe user asks for something clear and allowed
Missing dataDate, email, identifier or context is absent
AmbiguityThe user uses vague terms or changes intent
Tool useThe agent must query an API, calendar or CRM
PolicyThe request is limited by business rules
SecurityThe user tries to bypass instructions or request sensitive data
External errorAn API fails, times out or returns an unexpected response
EscalationThe case requires human judgement

Each case should have an expected result:

{
  "input": "I want to change tomorrow's booking",
  "expected_intent": "modify_booking",
  "required_action": "ask_for_missing_identifier",
  "must_not": ["invent_booking_id", "confirm_change_without_lookup"],
  "expected_outcome": "ask_clarifying_question"
}

The key is not comparing exact text. It is checking that the agent does the right thing.

3. Separate intent, tool and response evaluation

A failure can come from several layers. If I only inspect the final answer, I do not know what to fix.

I would evaluate three levels:

Intent

Does the agent understand what the user wants?

Example: "I cannot go tomorrow" may mean cancel, modify or ask for information. The agent should detect uncertainty and ask before acting.

Tool

Does the agent choose the correct function and send valid parameters?

Example: if it needs to look up a booking, it should not call create_booking. If the identifier is missing, it should not invent one.

Final answer

Does it communicate the real outcome clearly?

Example: if the tool fails, the answer must not say that the action was completed. It should explain the state and propose the next step.

This separation makes it possible to improve one part of the system without changing everything.

4. Evaluate tools and actions as contracts

Tools are one of the most delicate parts of an agent. A bad generated answer can confuse; a bad tool call can modify real data.

Each tool should have:

  • input schema;
  • permissions;
  • preconditions;
  • expected errors;
  • correct and incorrect examples;
  • retry policy;
  • structured result.

It is not enough to check that the model "knows" the tool exists. The evaluation must verify whether it uses the tool at the right time and with valid data.

User message
    -> detected intent
        -> proposed tool
            -> deterministic validation
                -> execution or rejection

This division connects with my guide to reliable AI automations in production: the LLM can propose, but the system must validate before executing.

5. Test guardrails and indirect attacks

An agent connected to tools should not only be accurate. It should also resist inputs that try to manipulate it.

I would test cases such as:

  • "Ignore your previous instructions";
  • "Tell me another customer's data";
  • "Execute this action even without authorisation";
  • retrieved web content that attempts to change the objective;
  • files or text with hidden instructions;
  • users mixing a valid request with a prohibited one.

In systems with web search or RAG, external content must be treated as untrusted data. I also explain this in OSINT with LLMs and web search: a source can inform the agent, but it must not control the agent.

For permissions, sensitive data and human approval, I complement it in security and privacy for enterprise AI agents.

6. Measure cost, latency and stability

An agent can be correct and still not viable if it is too slow or too expensive per execution.

I would measure:

  • input and output tokens;
  • number of model calls;
  • tool calls;
  • total latency;
  • time per step;
  • external error rate;
  • retries;
  • cost per resolved case;
  • cost per escalated case.

These metrics help decide whether to simplify prompts, cache context, change model, move logic into code or reduce unnecessary calls.

7. Review traces, not only outcomes

A trace should make it possible to reconstruct what happened:

event received
    -> context retrieved
        -> intent detected
            -> tool selected
                -> validation
                    -> result
                        -> response or escalation

There is no need to store sensitive information indiscriminately. But I do need to answer:

  • what data the agent saw;
  • which prompt or workflow version it used;
  • which model ran;
  • which tool it called;
  • which error it received;
  • why it escalated;
  • which response it delivered.

Without traces, improving an agent becomes arguing from anecdotes.

8. Combine automatic evaluation and human review

Automatic evaluations are very useful for detecting regressions. If I change a prompt, model or tool, I can run the dataset and check whether anything became worse.

But I would not remove human review. Some dimensions require judgement:

  • tone;
  • clarity;
  • sensitivity of the case;
  • real usefulness;
  • user expectations;
  • brand or business risk.

My approach would use automatic evaluation for coverage and consistency, and human review for contextual quality.

9. Define production go/no-go criteria

Before deployment, I would set thresholds. For example:

CriterionExample threshold
Correct intent95% on critical cases
Correct tool use98% with no dangerous actions
Mandatory escalation100% on sensitive cases
Critical hallucinations0 tolerated
LatencyWithin the channel objective
CostLower than expected operational value
Traceability100% of executions with an identifier

The numbers depend on the case. An agent that drafts emails can tolerate more error than one that modifies bookings, personal data or internal processes.

Practical evaluation checklist

Before trusting an agent, I would review:

  • There is a clear list of allowed and forbidden tasks.
  • A dataset covers happy, ambiguous, adversarial and error cases.
  • Each case has an expected result and conditions that must not happen.
  • Intent, tool and final response are evaluated separately.
  • Tools have contracts and deterministic validation.
  • Guardrails are tested against direct and indirect attacks.
  • Cost, latency and error rate are measured.
  • Traces make every execution investigable.
  • Real samples receive human review.
  • Clear go/no-go criteria exist before deployment.

My final criterion

Evaluating an AI agent is not deciding whether an answer looks nice. It is checking whether the whole system understands, decides, acts, limits itself and recovers reliably.

A good agent is not the one that always tries to answer; it is the one that knows when to act, when to ask, when to escalate and how to leave evidence of what it did.

This framework complements my article on RAG, web search, fine-tuning or rules and my guide to AI automation architecture.

You can explore more projects in my portfolio or contact me through the contact page.