How I evaluate AI agents before putting them into production

A practical guide to evaluating AI agents with datasets, tools, guardrails, traces, cost metrics and human review before production.

An AI agent can look good after five manual conversations. It answers quickly, keeps the tone and uses a tool. That impression does not prove that it is ready for production.

The real challenge appears when the agent receives incomplete messages, contradictory instructions, API errors, frustrated users, sensitive data, out-of-policy requests or cases that only happen once every hundred executions.

That is why I would not evaluate an agent only by chatting with it. I would evaluate it as a system: inputs, reasoning, tools, rules, cost, security, traceability and final experience.

Direct answer: how to evaluate an AI agent

An AI agent should be evaluated with a representative case dataset, observable success criteria, tool tests, security guardrails, reviewable traces, cost metrics and human review of real samples.

The evaluation should not only ask "does the answer sound good?". It should ask:

Dimension	Key question
Intent	Does it understand what the user wants to achieve?
Context	Does it use the right and sufficient information?
Tools	Does it choose the right tool with valid parameters?
Rules	Does it respect permissions, limits and policies?
Outcome	Does it resolve the case or escalate correctly?
Security	Does it resist malicious instructions or untrusted data?
Cost	Are token and tool costs sustainable?
Traceability	Can we review what happened and why?

An agent is not reliable because it succeeds on easy examples. It is reliable when its failures are detectable, explainable and recoverable.

The common mistake: evaluating only the final answer

Many teams review the agent's last sentence and decide whether they like it. That is not enough.

A response can be well written and still be dangerous:

it may have used the wrong source;
it may have ignored a constraint;
it may have invented a fact;
it may have called a tool with incorrect parameters;
it may have answered when it should have escalated;
it may have consumed too much cost for a simple task.

The evaluation must inspect the whole process. If an agent gives the right answer by accident, that is not behaviour I would trust for operations.

1. Define tasks and limits before testing

Before creating tests, I would define what the agent can and cannot do.

For example, in a customer-support agent:

it can answer frequently asked questions;
it can check the status of a request;
it can ask for missing data;
it can create a ticket;
it can escalate to a person;
it cannot promise unauthorised compensation;
it cannot expose another user's data;
it cannot execute actions outside policy.

This list turns evaluation into something concrete. Without clear limits, every test becomes an opinion about whether the conversation "feels good".

2. Build a dataset with real and synthetic cases

The evaluation dataset should mix frequent cases, edge cases and adversarial cases. It does not need to be huge at first, but it must cover important behaviour.

Case type	Example
Happy path	The user asks for something clear and allowed
Missing data	Date, email, identifier or context is absent
Ambiguity	The user uses vague terms or changes intent
Tool use	The agent must query an API, calendar or CRM
Policy	The request is limited by business rules
Security	The user tries to bypass instructions or request sensitive data
External error	An API fails, times out or returns an unexpected response
Escalation	The case requires human judgement

Each case should have an expected result:

{
  "input": "I want to change tomorrow's booking",
  "expected_intent": "modify_booking",
  "required_action": "ask_for_missing_identifier",
  "must_not": ["invent_booking_id", "confirm_change_without_lookup"],
  "expected_outcome": "ask_clarifying_question"
}

The key is not comparing exact text. It is checking that the agent does the right thing.

3. Separate intent, tool and response evaluation

A failure can come from several layers. If I only inspect the final answer, I do not know what to fix.

I would evaluate three levels:

Intent

Does the agent understand what the user wants?

Example: "I cannot go tomorrow" may mean cancel, modify or ask for information. The agent should detect uncertainty and ask before acting.

Tool

Does the agent choose the correct function and send valid parameters?

Example: if it needs to look up a booking, it should not call create_booking. If the identifier is missing, it should not invent one.

Final answer

Does it communicate the real outcome clearly?

Example: if the tool fails, the answer must not say that the action was completed. It should explain the state and propose the next step.

This separation makes it possible to improve one part of the system without changing everything.

4. Evaluate tools and actions as contracts

Tools are one of the most delicate parts of an agent. A bad generated answer can confuse; a bad tool call can modify real data.

Each tool should have:

input schema;
permissions;
preconditions;
expected errors;
correct and incorrect examples;
retry policy;
structured result.

It is not enough to check that the model "knows" the tool exists. The evaluation must verify whether it uses the tool at the right time and with valid data.

User message
    -> detected intent
        -> proposed tool
            -> deterministic validation
                -> execution or rejection

This division connects with my guide to reliable AI automations in production: the LLM can propose, but the system must validate before executing.

5. Test guardrails and indirect attacks

An agent connected to tools should not only be accurate. It should also resist inputs that try to manipulate it.

I would test cases such as:

"Ignore your previous instructions";
"Tell me another customer's data";
"Execute this action even without authorisation";
retrieved web content that attempts to change the objective;
files or text with hidden instructions;
users mixing a valid request with a prohibited one.

In systems with web search or RAG, external content must be treated as untrusted data. I also explain this in OSINT with LLMs and web search: a source can inform the agent, but it must not control the agent.

For permissions, sensitive data and human approval, I complement it in security and privacy for enterprise AI agents.

6. Measure cost, latency and stability

An agent can be correct and still not viable if it is too slow or too expensive per execution.

I would measure:

input and output tokens;
number of model calls;
tool calls;
total latency;
time per step;
external error rate;
retries;
cost per resolved case;
cost per escalated case.

These metrics help decide whether to simplify prompts, cache context, change model, move logic into code or reduce unnecessary calls.

7. Review traces, not only outcomes

A trace should make it possible to reconstruct what happened:

event received
    -> context retrieved
        -> intent detected
            -> tool selected
                -> validation
                    -> result
                        -> response or escalation

There is no need to store sensitive information indiscriminately. But I do need to answer:

what data the agent saw;
which prompt or workflow version it used;
which model ran;
which tool it called;
which error it received;
why it escalated;
which response it delivered.

Without traces, improving an agent becomes arguing from anecdotes.

That is why observability deserves its own piece: observability for AI agents in production.

8. Combine automatic evaluation and human review

Automatic evaluations are very useful for detecting regressions. If I change a prompt, model or tool, I can run the dataset and check whether anything became worse.

For that comparison to be traceable, the change must be versioned. I explain it in versioning prompts and workflows for AI agents.

But I would not remove human review. Some dimensions require judgement:

tone;
clarity;
sensitivity of the case;
real usefulness;
user expectations;
brand or business risk.

My approach would use automatic evaluation for coverage and consistency, and human review for contextual quality.

Human review can also become a product flow. I explain it in human-in-the-loop for AI agents and business automation.

9. Define production go/no-go criteria

Before deployment, I would set thresholds. For example:

Criterion	Example threshold
Correct intent	95% on critical cases
Correct tool use	98% with no dangerous actions
Mandatory escalation	100% on sensitive cases
Critical hallucinations	0 tolerated
Latency	Within the channel objective
Cost	Lower than expected operational value
Traceability	100% of executions with an identifier

The numbers depend on the case. An agent that drafts emails can tolerate more error than one that modifies bookings, personal data or internal processes.

Practical evaluation checklist

Before trusting an agent, I would review:

My final criterion

Evaluating an AI agent is not deciding whether an answer looks nice. It is checking whether the whole system understands, decides, acts, limits itself and recovers reliably.

A good agent is not the one that always tries to answer; it is the one that knows when to act, when to ask, when to escalate and how to leave evidence of what it did.

This framework complements my article on RAG, web search, fine-tuning or rules and my guide to AI automation architecture.

You can explore more projects in my portfolio or contact me through the contact page.