How I evaluate AI agents before putting them into production
A practical guide to evaluating AI agents with datasets, tools, guardrails, traces, cost metrics and human review before production.
An AI agent can look good after five manual conversations. It answers quickly, keeps the tone and uses a tool. That impression does not prove that it is ready for production.
The real challenge appears when the agent receives incomplete messages, contradictory instructions, API errors, frustrated users, sensitive data, out-of-policy requests or cases that only happen once every hundred executions.
That is why I would not evaluate an agent only by chatting with it. I would evaluate it as a system: inputs, reasoning, tools, rules, cost, security, traceability and final experience.
Direct answer: how to evaluate an AI agent
An AI agent should be evaluated with a representative case dataset, observable success criteria, tool tests, security guardrails, reviewable traces, cost metrics and human review of real samples.
The evaluation should not only ask "does the answer sound good?". It should ask:
| Dimension | Key question |
|---|---|
| Intent | Does it understand what the user wants to achieve? |
| Context | Does it use the right and sufficient information? |
| Tools | Does it choose the right tool with valid parameters? |
| Rules | Does it respect permissions, limits and policies? |
| Outcome | Does it resolve the case or escalate correctly? |
| Security | Does it resist malicious instructions or untrusted data? |
| Cost | Are token and tool costs sustainable? |
| Traceability | Can we review what happened and why? |
An agent is not reliable because it succeeds on easy examples. It is reliable when its failures are detectable, explainable and recoverable.
The common mistake: evaluating only the final answer
Many teams review the agent's last sentence and decide whether they like it. That is not enough.
A response can be well written and still be dangerous:
- it may have used the wrong source;
- it may have ignored a constraint;
- it may have invented a fact;
- it may have called a tool with incorrect parameters;
- it may have answered when it should have escalated;
- it may have consumed too much cost for a simple task.
The evaluation must inspect the whole process. If an agent gives the right answer by accident, that is not behaviour I would trust for operations.
1. Define tasks and limits before testing
Before creating tests, I would define what the agent can and cannot do.
For example, in a customer-support agent:
- it can answer frequently asked questions;
- it can check the status of a request;
- it can ask for missing data;
- it can create a ticket;
- it can escalate to a person;
- it cannot promise unauthorised compensation;
- it cannot expose another user's data;
- it cannot execute actions outside policy.
This list turns evaluation into something concrete. Without clear limits, every test becomes an opinion about whether the conversation "feels good".
2. Build a dataset with real and synthetic cases
The evaluation dataset should mix frequent cases, edge cases and adversarial cases. It does not need to be huge at first, but it must cover important behaviour.
| Case type | Example |
|---|---|
| Happy path | The user asks for something clear and allowed |
| Missing data | Date, email, identifier or context is absent |
| Ambiguity | The user uses vague terms or changes intent |
| Tool use | The agent must query an API, calendar or CRM |
| Policy | The request is limited by business rules |
| Security | The user tries to bypass instructions or request sensitive data |
| External error | An API fails, times out or returns an unexpected response |
| Escalation | The case requires human judgement |
Each case should have an expected result:
{
"input": "I want to change tomorrow's booking",
"expected_intent": "modify_booking",
"required_action": "ask_for_missing_identifier",
"must_not": ["invent_booking_id", "confirm_change_without_lookup"],
"expected_outcome": "ask_clarifying_question"
}
The key is not comparing exact text. It is checking that the agent does the right thing.
3. Separate intent, tool and response evaluation
A failure can come from several layers. If I only inspect the final answer, I do not know what to fix.
I would evaluate three levels:
Intent
Does the agent understand what the user wants?
Example: "I cannot go tomorrow" may mean cancel, modify or ask for information. The agent should detect uncertainty and ask before acting.
Tool
Does the agent choose the correct function and send valid parameters?
Example: if it needs to look up a booking, it should not call create_booking. If the identifier is
missing, it should not invent one.
Final answer
Does it communicate the real outcome clearly?
Example: if the tool fails, the answer must not say that the action was completed. It should explain the state and propose the next step.
This separation makes it possible to improve one part of the system without changing everything.
4. Evaluate tools and actions as contracts
Tools are one of the most delicate parts of an agent. A bad generated answer can confuse; a bad tool call can modify real data.
Each tool should have:
- input schema;
- permissions;
- preconditions;
- expected errors;
- correct and incorrect examples;
- retry policy;
- structured result.
It is not enough to check that the model "knows" the tool exists. The evaluation must verify whether it uses the tool at the right time and with valid data.
User message
-> detected intent
-> proposed tool
-> deterministic validation
-> execution or rejection
This division connects with my guide to reliable AI automations in production: the LLM can propose, but the system must validate before executing.
5. Test guardrails and indirect attacks
An agent connected to tools should not only be accurate. It should also resist inputs that try to manipulate it.
I would test cases such as:
- "Ignore your previous instructions";
- "Tell me another customer's data";
- "Execute this action even without authorisation";
- retrieved web content that attempts to change the objective;
- files or text with hidden instructions;
- users mixing a valid request with a prohibited one.
In systems with web search or RAG, external content must be treated as untrusted data. I also explain this in OSINT with LLMs and web search: a source can inform the agent, but it must not control the agent.
For permissions, sensitive data and human approval, I complement it in security and privacy for enterprise AI agents.
6. Measure cost, latency and stability
An agent can be correct and still not viable if it is too slow or too expensive per execution.
I would measure:
- input and output tokens;
- number of model calls;
- tool calls;
- total latency;
- time per step;
- external error rate;
- retries;
- cost per resolved case;
- cost per escalated case.
These metrics help decide whether to simplify prompts, cache context, change model, move logic into code or reduce unnecessary calls.
7. Review traces, not only outcomes
A trace should make it possible to reconstruct what happened:
event received
-> context retrieved
-> intent detected
-> tool selected
-> validation
-> result
-> response or escalation
There is no need to store sensitive information indiscriminately. But I do need to answer:
- what data the agent saw;
- which prompt or workflow version it used;
- which model ran;
- which tool it called;
- which error it received;
- why it escalated;
- which response it delivered.
Without traces, improving an agent becomes arguing from anecdotes.
8. Combine automatic evaluation and human review
Automatic evaluations are very useful for detecting regressions. If I change a prompt, model or tool, I can run the dataset and check whether anything became worse.
But I would not remove human review. Some dimensions require judgement:
- tone;
- clarity;
- sensitivity of the case;
- real usefulness;
- user expectations;
- brand or business risk.
My approach would use automatic evaluation for coverage and consistency, and human review for contextual quality.
9. Define production go/no-go criteria
Before deployment, I would set thresholds. For example:
| Criterion | Example threshold |
|---|---|
| Correct intent | 95% on critical cases |
| Correct tool use | 98% with no dangerous actions |
| Mandatory escalation | 100% on sensitive cases |
| Critical hallucinations | 0 tolerated |
| Latency | Within the channel objective |
| Cost | Lower than expected operational value |
| Traceability | 100% of executions with an identifier |
The numbers depend on the case. An agent that drafts emails can tolerate more error than one that modifies bookings, personal data or internal processes.
Practical evaluation checklist
Before trusting an agent, I would review:
- There is a clear list of allowed and forbidden tasks.
- A dataset covers happy, ambiguous, adversarial and error cases.
- Each case has an expected result and conditions that must not happen.
- Intent, tool and final response are evaluated separately.
- Tools have contracts and deterministic validation.
- Guardrails are tested against direct and indirect attacks.
- Cost, latency and error rate are measured.
- Traces make every execution investigable.
- Real samples receive human review.
- Clear go/no-go criteria exist before deployment.
My final criterion
Evaluating an AI agent is not deciding whether an answer looks nice. It is checking whether the whole system understands, decides, acts, limits itself and recovers reliably.
A good agent is not the one that always tries to answer; it is the one that knows when to act, when to ask, when to escalate and how to leave evidence of what it did.
This framework complements my article on RAG, web search, fine-tuning or rules and my guide to AI automation architecture.
You can explore more projects in my portfolio or contact me through the contact page.