Gorka Hernandez Villalon, iOS developer and AI automation specialist
Tags: Agentic AI, Customer Service, Retail, LLM, NexaVision AI

What working at Mango and Zara taught me about customer service AI agents

How my experience at Mango and Zara shapes the way I design reliable AI agents with context, guardrails, tools, evaluation and human escalation.

June 08, 2026 · 7 min read · by Gorka Hernandez Villalon

Before building conversational assistants with LLMs, WhatsApp, webhooks or phone calls, I worked as a retail assistant at Mango in 2024 and Zara in 2025. At first glance, those roles may seem separate from my current work in AI and automation. To me, they are closely connected.

Customer-facing work teaches something that does not appear in technical documentation: customers do not want to talk to impressive technology. They want their problem resolved quickly, clearly and without having to repeat themselves.

That idea now guides how I build customer service AI agents within NexaVision AI and other automation projects.

Direct answer: what makes a customer service AI agent reliable?

A reliable AI agent is not the one that answers the most questions. It is the one that understands intent, checks real data, applies explicit rules, recognizes its limits and escalates to a person when it cannot guarantee a correct answer.

In practice, it needs five layers:

  1. enough context to understand the conversation,
  2. tools connected to real systems,
  3. guardrails and validation before taking action,
  4. traceability so decisions can be reviewed,
  5. human escalation with the relevant context attached.

The language model is important, but it is not the complete system.

First retail lesson: customers describe problems, not structured intents

In a store, people rarely phrase a request as if they were completing a form. They may say they want "something similar to this", need a gift but do not know the size, or want to return an item without knowing the exact process.

The same happens through digital channels:

I want to change what I ordered yesterday.

That message alone does not identify the product, confirm whether changes are allowed, reveal the correct channel or clarify whether the customer actually wants a return. An agent should detect the likely intent, but it should also request the minimum information needed before taking action.

I therefore separate two responsibilities:

  • the LLM interprets ambiguous language, summarizes and keeps the conversation natural;
  • rules and tools verify data before executing an operation.

This separation reduces hallucinated answers and prevents a convincing sentence from becoming an incorrect action.
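A minimal Python sketch of this separation could look like the following. It assumes the LLM output has already been parsed into a proposal object. The names ProposedAction, REQUIRED_FIELDS, ORDERS and next_step are hypothetical, and the in-memory order table stands in for a real order system.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for a real order system.
ORDERS = {
    "A100": {"status": "processing", "changeable": True},
    "A200": {"status": "shipped", "changeable": False},
}

# Minimum fields each intent needs before anything runs (illustrative).
REQUIRED_FIELDS = {"change_order": ["order_id", "new_item"]}


@dataclass
class ProposedAction:
    """What the LLM suggests. It is never executed directly."""
    intent: str
    fields: dict = field(default_factory=dict)


def next_step(proposal: ProposedAction) -> str:
    """Deterministic check that sits between the proposal and execution."""
    required = REQUIRED_FIELDS.get(proposal.intent)
    if required is None:
        return "escalate: unsupported intent"
    # Ask only for what is still missing, nothing more.
    missing = [name for name in required if not proposal.fields.get(name)]
    if missing:
        return "ask: " + ", ".join(missing)
    # Verify against real data instead of trusting the conversation.
    order = ORDERS.get(proposal.fields["order_id"])
    if order is None:
        return "ask: order_id"
    if not order["changeable"]:
        return "reject: order can no longer be changed"
    return "execute: change_order"
```

For the message "I want to change what I ordered yesterday", the model might propose change_order with no fields, and the check would answer with a request for the order and the new item rather than an action.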

Second lesson: a good response does not always solve the problem

Being polite is not enough in customer service. If someone asks for an order status, a well-written answer that never checks the order is still useless.

A useful agent must connect the conversation to tools. Depending on the use case, it could:

  • check an order status,
  • look up availability or stock,
  • retrieve return conditions,
  • create or modify a booking,
  • register an incident,
  • update a CRM,
  • transfer the conversation to a person.

I have applied this approach in systems such as a Gmail customer service agent and a WhatsApp booking assistant. In both cases, the conversational answer is only the visible layer. Underneath it are APIs, webhooks, history, validation and verifiable actions.

Third lesson: knowing when to ask for help is a capability, not a failure

Retail includes simple situations and situations that need human judgment: exceptions, frustrated customers, conflicting information or requests with financial impact.

An AI agent should work the same way. Human escalation should not be an improvised last resort; it should be part of the architecture.

I would trigger escalation when:

  • intent confidence is insufficient,
  • required information is still missing after several attempts,
  • the user asks to speak with a person,
  • there is financial, legal or privacy risk,
  • tools return conflicting information,
  • the tone indicates frustration or a sensitive situation,
  • the operation falls outside permitted rules.

The transfer should include context. Making customers repeat the entire conversation after an escalation destroys much of the automation's value. The human team should receive a summary, collected data, attempted actions and the escalation reason.
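The conditions and the handoff described above can be sketched in Python as follows. This is one possible shape, not a fixed design: the field names, the thresholds MIN_CONFIDENCE and MAX_ATTEMPTS, and the order in which conditions are checked are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative thresholds; real values would come from evaluation data.
MIN_CONFIDENCE = 0.7
MAX_ATTEMPTS = 2


@dataclass
class ConversationState:
    intent_confidence: float
    missing_field_attempts: int = 0
    user_requested_human: bool = False
    risk_flags: list = field(default_factory=list)  # e.g. ["financial"]
    tools_conflict: bool = False
    frustrated: bool = False
    within_policy: bool = True


def escalation_reason(state: ConversationState) -> Optional[str]:
    """Return the first matching reason, or None to keep automating."""
    if state.user_requested_human:
        return "user_requested_human"
    if state.risk_flags:
        return "risk:" + ",".join(state.risk_flags)
    if not state.within_policy:
        return "outside_policy"
    if state.tools_conflict:
        return "conflicting_tool_data"
    if state.frustrated:
        return "frustration_detected"
    if state.missing_field_attempts > MAX_ATTEMPTS:
        return "missing_information"
    if state.intent_confidence < MIN_CONFIDENCE:
        return "low_intent_confidence"
    return None


def build_handoff(summary: str, collected: dict, attempted: list, reason: str) -> dict:
    """Package what the human team needs so the customer does not repeat it."""
    return {
        "summary": summary,
        "collected_data": collected,
        "attempted_actions": attempted,
        "escalation_reason": reason,
    }
```

Returning a named reason instead of a boolean keeps the escalation explainable and makes it possible to count reasons later.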

Fourth lesson: tone matters, but accuracy matters more

Every brand has a way of communicating. An assistant should respect its language, tone, vocabulary and level of detail. However, aligning responses with a brand requires more than a prompt saying "be friendly".

Guardrails should define:

  • what the agent can and cannot promise,
  • which sources it may use,
  • which operations require confirmation,
  • which personal data is actually required,
  • how it should respond when it does not know,
  • when it must stop automating and escalate.

To maintain consistency, I prefer separating brand instructions, operational rules and conversation-specific context. This makes it possible to change tone without changing system restrictions.
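As a rough Python illustration of that separation, the three kinds of instructions can live in separate fields and be assembled only at the last moment. PromptLayers, build_system_prompt and the section titles are hypothetical names, and the example texts are invented.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromptLayers:
    brand: str    # tone, vocabulary, level of detail
    rules: str    # what the agent may promise or do
    context: str  # facts about this specific conversation


def build_system_prompt(layers: PromptLayers) -> str:
    """Assemble the prompt with operational rules first and kept intact."""
    sections = [
        ("OPERATIONAL RULES", layers.rules),
        ("BRAND VOICE", layers.brand),
        ("CONVERSATION CONTEXT", layers.context),
    ]
    return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)


base = PromptLayers(
    brand="Warm and concise. No slang.",
    rules="Never promise a refund date. Confirm before cancelling.",
    context="Customer is asking about order A100.",
)

# Changing the tone touches only the brand layer; the rules are untouched.
playful = replace(base, brand="Playful and upbeat.")
```

A prompt alone does not enforce the rules; in this sketch it only keeps them from being edited by accident when the tone changes. Enforcement still belongs to deterministic checks outside the model.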

Fifth lesson: speed only creates value when it reduces friction

An obvious advantage of conversational agents is their ability to handle frequent requests immediately. But fast responses do not help when users enter a loop, receive generic information or must constantly correct the system.

The number of automated conversations should not be the only metric. I would also measure:

  • percentage of requests correctly resolved,
  • time to resolution,
  • number of messages required,
  • escalation rate and reasons,
  • failed or reverted operations,
  • satisfaction after the conversation,
  • how often customers must repeat information,
  • cost per resolved conversation.

An agent that correctly escalates a complex case can be better than one that tries to automate everything and creates a larger problem.
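Several of these metrics can be computed from simple per-conversation records. The sketch below is a minimal Python example; the Conversation fields and the summarize function are assumptions, and it treats a conversation as correctly resolved only when a reviewer or an automated check has marked it so.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    resolved_correctly: bool
    messages: int
    seconds_to_resolution: float
    escalation_reason: Optional[str] = None
    cost: float = 0.0


def summarize(conversations: list) -> dict:
    """Aggregate resolution, effort, escalation and cost metrics."""
    total = len(conversations)
    if total == 0:
        raise ValueError("no conversations to summarize")
    resolved = [c for c in conversations if c.resolved_correctly]
    escalated = [c for c in conversations if c.escalation_reason]
    total_cost = sum(c.cost for c in conversations)
    return {
        "resolution_rate": len(resolved) / total,
        "escalation_rate": len(escalated) / total,
        "escalation_reasons": Counter(c.escalation_reason for c in escalated),
        "avg_messages": sum(c.messages for c in conversations) / total,
        "avg_seconds_to_resolution":
            sum(c.seconds_to_resolution for c in conversations) / total,
        # All spend divided by successful outcomes, not by conversations.
        "cost_per_resolved": total_cost / len(resolved) if resolved else None,
    }
```

Dividing total cost by resolved conversations, rather than by all conversations, makes failed automation visible in the cost figure.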

Architecture I would use for a reliable conversational agent

Although every company requires different integrations, I usually think in layers:

1. Multichannel input

The conversation may arrive through a website, WhatsApp, email, an app or phone. The first layer normalizes the message, identifies the channel and preserves the required metadata.
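A small Python sketch of this normalization step follows. The payload field names are hypothetical and do not reflect any provider's real webhook schema; NormalizedMessage and normalize are illustrative names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class NormalizedMessage:
    channel: str
    sender_id: str
    text: str
    received_at: datetime
    metadata: dict = field(default_factory=dict)


def normalize(channel: str, payload: dict) -> NormalizedMessage:
    """Map a channel-specific payload to one internal shape.

    The payload keys below are assumptions, not a real provider schema.
    """
    if channel == "whatsapp":
        sender, text = payload["from"], payload["body"]
        meta = {"message_id": payload.get("id")}
    elif channel == "email":
        sender, text = payload["sender"], payload["plain_text"]
        meta = {
            "subject": payload.get("subject"),
            "thread_id": payload.get("thread_id"),
        }
    else:
        raise ValueError(f"unsupported channel: {channel}")
    # Collapse whitespace so later layers see the same text on any channel.
    clean_text = " ".join(text.split())
    return NormalizedMessage(
        channel, sender, clean_text, datetime.now(timezone.utc), meta
    )
```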

2. Context and memory

The agent retrieves relevant history, language, preferences and authorized data. Memory should neither be infinite nor store information without purpose: it must be useful, limited and privacy-aware.

3. Interpretation and intent

The LLM classifies the request, extracts fields and decides what information is missing. It may propose an action at this stage, but it should not execute it yet.

4. Policies, guardrails and validation

Deterministic rules check permissions, required fields, limits and business conditions. This layer decides whether the action is valid, needs confirmation or must be escalated.
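The three possible outcomes of this layer can be sketched with a refund check in Python. The limits, the return window and the check_refund function are invented for illustration and do not describe any real company's policy.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "needs_confirmation"
    ESCALATE = "escalate"


# Illustrative business limits, not real policy values.
AUTO_REFUND_LIMIT = 50.0
MAX_REFUND_LIMIT = 200.0
RETURN_WINDOW_DAYS = 30


def check_refund(amount: float, days_since_purchase: int, has_receipt: bool) -> Decision:
    """Decide whether a proposed refund may run, needs a yes, or needs a person."""
    if not has_receipt or days_since_purchase > RETURN_WINDOW_DAYS:
        return Decision.ESCALATE  # outside the permitted rules
    if amount > MAX_REFUND_LIMIT:
        return Decision.ESCALATE  # financial impact too high to automate
    if amount > AUTO_REFUND_LIMIT:
        return Decision.CONFIRM   # valid, but ask the customer to confirm
    return Decision.ALLOW
```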

5. Tools and actions

APIs, databases, calendars, CRMs or internal systems execute the operation. Important actions should be idempotent whenever possible to prevent duplicates if a workflow is retried.
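One common way to make an action idempotent is to attach a key derived from the request and return the stored result when the same key is seen again. The Python sketch below assumes a hypothetical in-memory BookingService; a real tool would persist the keys in a database.

```python
import hashlib
import json


def make_key(conversation_id: str, action: str, params: dict) -> str:
    """Derive a stable key: the same request always yields the same key."""
    raw = json.dumps([conversation_id, action, params], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class BookingService:
    """Hypothetical in-memory tool used only to illustrate the idea."""

    def __init__(self) -> None:
        self._results = {}  # idempotency key -> stored result
        self.bookings = []

    def create_booking(self, idempotency_key: str, customer_id: str, slot: str) -> dict:
        # A retried workflow gets the original result, not a second booking.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        booking = {
            "booking_id": len(self.bookings) + 1,
            "customer_id": customer_id,
            "slot": slot,
        }
        self.bookings.append(booking)
        self._results[idempotency_key] = booking
        return booking
```

Calling create_booking twice with the same key returns the same booking and leaves a single entry in bookings, which is the behavior a retried webhook needs.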

6. Observability and evaluation

Every execution should leave useful records: detected intent, tool used, result, latency, errors and escalation reason. This data makes it possible to find failures and build repeatable evaluations.
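A minimal way to produce such records in Python is to wrap each tool call and emit one structured line per execution. The function names and record keys below are assumptions, and the sketch expects tool results to be JSON-serializable.

```python
import json
import time
from typing import Callable, Optional


def run_tool_with_trace(
    intent: str,
    tool_name: str,
    tool: Callable[[], dict],
    escalation_reason: Optional[str] = None,
) -> dict:
    """Run a tool and return a record of what actually happened."""
    started = time.perf_counter()
    record = {
        "intent": intent,
        "tool": tool_name,
        "result": None,
        "error": None,
        "escalation_reason": escalation_reason,
    }
    try:
        record["result"] = tool()
    except Exception as exc:  # record the failure instead of hiding it
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
    return record


def to_log_line(record: dict) -> str:
    """Serialize the record as one JSON line for a log pipeline."""
    return json.dumps(record, sort_keys=True)
```

Because the record captures the error as well as the result, the same lines can feed both debugging and the evaluation datasets.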

7. Response or human escalation

The agent communicates the real outcome, not the outcome it expected. If it cannot resolve the request, it transfers the case with context.

How I would evaluate the agent before exposing it to customers

I would not test an agent only by manually chatting with it. I would create a dataset representing normal, ambiguous and adversarial situations:

  • well-formed frequently asked questions,
  • incomplete messages,
  • language changes,
  • conflicting instructions,
  • out-of-policy requests,
  • attempts to access unauthorized data,
  • API failures,
  • customers changing intent,
  • conversations that must be escalated.

For each case I would define the expected outcome: answer, ask, use a tool, reject an action or escalate. I would then measure intent accuracy, tool selection, policy compliance and final response quality.
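The expected-outcome approach can be sketched as a small Python harness. EvalCase, OUTCOMES and evaluate are hypothetical names, and the agent is assumed to be any callable that maps a message to one of the five outcome labels; a real agent would need an adapter that reduces its behavior to such a label.

```python
from dataclasses import dataclass
from typing import Callable

OUTCOMES = {"answer", "ask", "use_tool", "reject", "escalate"}


@dataclass
class EvalCase:
    message: str
    expected: str  # one of OUTCOMES
    category: str  # e.g. "normal", "ambiguous", "adversarial"


def evaluate(agent: Callable[[str], str], cases: list) -> dict:
    """Compare the agent's outcome with the expected one for every case."""
    per_category = {}  # category -> (passed, total)
    failures = []
    for case in cases:
        if case.expected not in OUTCOMES:
            raise ValueError(f"unknown expected outcome: {case.expected}")
        actual = agent(case.message)
        ok = actual == case.expected
        passed, total = per_category.get(case.category, (0, 0))
        per_category[case.category] = (passed + int(ok), total + 1)
        if not ok:
            failures.append((case.message, case.expected, actual))
    accuracy = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {
        "accuracy": accuracy,
        "per_category": per_category,
        "failures": failures,
    }
```

Keeping the failures as (message, expected, actual) tuples gives human reviewers a concrete list to inspect, and running the same cases after every prompt or tool change turns the dataset into a regression suite.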

I would also keep human review of real samples. Automated evaluations help detect regressions, but language and customer experience still require human judgment.

What I bring when building these systems

My experience at Mango and Zara does not, by itself, make me an AI engineer. Nor does my technical experience replace understanding the person on the other side.

The combination helps me think about the complete system:

  • what the customer actually needs,
  • what information the agent has,
  • which actions it can safely execute,
  • what it must verify before responding,
  • when a person should take control.

After working in retail and building more than 30 AI and automation workflows, my conclusion is simple: the best customer service automation does not try to look human; it tries to be useful, reliable and honest about its limits.

To discuss conversational agents, automation or LLM integrations, reach me through my contact page or connect with me on LinkedIn.