From demo to production: how I design reliable AI automations

A practical guide to production AI automation with idempotency, retries, validation, observability and human review.

Building an AI automation demo can be surprisingly fast. A form triggers a workflow, an LLM interprets the message, an API performs an action and a notification arrives. The first test works, and the system appears finished.

The important work usually starts immediately afterwards.

A production-ready automation must remain reliable when an API is slow, a user submits the same request twice, the model returns an unexpected structure, a field is missing or an execution stops halfway through.

In the systems I build, I try to let AI provide flexibility without making the entire process unpredictable. To achieve that, I separate interpretation, rules, actions and recovery.

Direct answer: what does an AI automation need to be reliable?

A reliable AI automation needs validated inputs, idempotent operations, controlled retries, persistent state, observability, explicit limits and a human path when the system cannot act safely.

Layer	Question it must answer
Input contract	What data does the process actually need?
Validation	Is the request complete, consistent and authorised?
Idempotency	What happens if the same event arrives twice?
State	Where is each execution in the process?
Retries	Which failures are temporary and which require intervention?
Observability	Can we reconstruct what happened?
Human review	When should the system stop automating?

Prompt quality matters, but no sentence inside a prompt replaces these layers.

The example: a reservation created from WhatsApp

Imagine an assistant receiving this message:

I would like a table for four tomorrow at nine.

The happy-path demo looks simple:

the LLM extracts the date, time and party size;
the system checks availability;
it creates the reservation;
it sends a confirmation.

Before calling the system reliable, however, several questions need answers:

Which timezone does "tomorrow" refer to?
Is the restaurant open at that time?
Are the customer's name or contact details missing?
What happens if WhatsApp resends the webhook?
What if the reservation is created but the confirmation message fails?
Should the entire execution be retried?
How can a person know whether the reservation was registered?

These details are the difference between an automation that impresses in a demo and one that actually reduces work.

1. Turn ambiguous inputs into clear contracts

Users and LLMs work well with flexible language. Real actions require precise data. After interpreting a request, I would turn the result into a structured contract:

{
  "customer_name": "Gorka",
  "party_size": 4,
  "reservation_at": "2026-06-15T21:00:00+02:00",
  "contact_channel": "whatsapp",
  "request_id": "msg_abc123"
}

Before creating anything, a deterministic layer must verify that required fields exist, formats are correct, the date is valid, the time is allowed and the action is authorised.

The LLM can propose the structure. The system must decide whether that structure is valid. If information is missing, the right response is to ask only for the missing detail while preserving the context already collected.
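A minimal sketch of that deterministic layer, assuming the contract shown above. The opening hours, the party-size limit and the function name `validate_reservation` are hypothetical values chosen for illustration, not rules taken from a real system.

```python
from datetime import datetime, time

# Hypothetical business rules, assumed for illustration only.
REQUIRED_FIELDS = ("customer_name", "party_size", "reservation_at",
                   "contact_channel", "request_id")
OPENING_TIME = time(13, 0)
CLOSING_TIME = time(23, 0)
MAX_PARTY_SIZE = 12


def validate_reservation(payload: dict, now: datetime) -> list:
    """Return a list of problems; an empty list means the contract is valid.

    `now` must be timezone-aware so it can be compared with the request.
    """
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if payload.get(name) in (None, "")]
    if problems:
        return problems

    size = payload["party_size"]
    if not isinstance(size, int) or isinstance(size, bool) or not 1 <= size <= MAX_PARTY_SIZE:
        problems.append("party_size out of range")

    try:
        reservation_at = datetime.fromisoformat(payload["reservation_at"])
    except (TypeError, ValueError):
        problems.append("reservation_at is not an ISO 8601 timestamp")
        return problems

    if reservation_at.tzinfo is None:
        problems.append("reservation_at must include a UTC offset")
    elif reservation_at <= now:
        problems.append("reservation_at is in the past")
    # Opening hours are checked against the wall-clock time in the request.
    if not OPENING_TIME <= reservation_at.time() <= CLOSING_TIME:
        problems.append("reservation_at is outside opening hours")
    return problems
```

Returning the list of problems, rather than raising on the first one, makes it easier to ask the user only for what is missing.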

2. Design idempotent operations

An operation is idempotent when requesting it several times has the same effect as requesting it once. This property is fundamental when webhooks, queues, networks and retries are involved.

If the same event arrives twice, the system must not create two reservations. To prevent this, I would associate every request with a unique key:

idempotency_key = channel + message_identifier

Before executing an action, the service checks whether that key has already been processed:

Existing state	Behaviour
Does not exist	Start the operation
In progress	Wait or return the current state
Completed	Return the stored result
Recoverable failure	Resume from the safe point
Permanent failure	Escalate or request a correction

Idempotency does more than prevent duplicates. It makes retries safer and ensures consistent responses when different components ask about the same process.
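The lookup in the table can be sketched as follows. This is an illustration under stated assumptions: the in-memory dictionary stands in for a durable store, the names `handle_once` and `Status` are hypothetical, and the mapping of `TimeoutError` and `ValueError` to recoverable and permanent failures is only an example. A real service would also need the check and the insert to be atomic, for instance through a unique constraint on the key.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED_RETRYABLE = "failed_retryable"
    FAILED_PERMANENT = "failed_permanent"


# In-memory stand-in for a durable store keyed by idempotency key.
_records: dict = {}


def idempotency_key(channel: str, message_identifier: str) -> str:
    # The separator avoids collisions such as ("a", "bc") versus ("ab", "c").
    return f"{channel}:{message_identifier}"


def handle_once(key: str, action: Callable[[], dict]) -> dict:
    """Run `action` only if the key is new or its last attempt was recoverable."""
    record = _records.get(key)
    if record is not None and record["status"] is not Status.FAILED_RETRYABLE:
        # In progress, completed or permanently failed: report, do not act again.
        return record

    record = {"status": Status.IN_PROGRESS, "result": None}
    _records[key] = record
    try:
        record["result"] = action()
        record["status"] = Status.COMPLETED
    except TimeoutError:
        record["status"] = Status.FAILED_RETRYABLE  # a later delivery may try again
    except ValueError:
        record["status"] = Status.FAILED_PERMANENT  # needs correction or escalation
    return record
```

In this simplified version a recoverable failure re-runs the whole action; resuming from a safe point requires the per-step state discussed in the following sections.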

3. Separate reversible and irreversible steps

Not every action carries the same risk. Classifying an intent, summarising text or preparing a draft can be repeated safely. Sending an email, creating a reservation, charging, cancelling or changing a record all have external effects.

Before every irreversible effect, it helps to establish a checkpoint:

Interpret
    -> validate
        -> check permissions
            -> record intention
                -> execute action
                    -> store result
                        -> notify

If the final notification fails, the system should not create the reservation again. It should recover the stored result and retry only the notification. This separation makes recovery much safer than restarting the complete workflow.
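A minimal sketch of that recovery, assuming the execution record is a dictionary that would be persisted between runs. The function name `run_reservation`, the record fields and the use of `ConnectionError` for a failed notification are hypothetical.

```python
from typing import Callable


def run_reservation(
    execution: dict,
    create_reservation: Callable[[], str],
    notify: Callable[[str], None],
) -> dict:
    """Advance an execution, skipping any step whose result is already stored."""
    if execution.get("reservation_id") is None:
        # Checkpoint: record the intention before the irreversible effect.
        execution["intention_recorded"] = True
        # Irreversible step: store its result as soon as it returns.
        execution["reservation_id"] = create_reservation()

    if not execution.get("notified", False):
        try:
            notify(execution["reservation_id"])
            execution["notified"] = True
        except ConnectionError:
            # Only the notification stays pending for the next run.
            execution["notified"] = False
    return execution
```

Calling the function again with the same record retries the notification without touching the reservation. If the process stops after the reservation is created but before its identifier is stored, the recorded intention is what tells an operator or a reconciliation job to check the external system before acting again.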

4. Use retries deliberately

Retrying everything immediately can make a problem worse. If an API is overloaded, hundreds of executions retrying at once generate more load. If the error is invalid data, repeating the request will not change the outcome.

Failure type	Example	Response
Temporary	Timeout, rate limit, unavailable service	Retry after waiting
Permanent	Invalid field, denied permission, missing resource	Correct or escalate
Unknown	Unexpected response or inconsistency	Stop, record and review

For temporary failures, I would use exponential backoff: progressively longer delays between attempts, with a small random component so that executions do not all retry at the same moment.

I would also set a limit. Once exceeded, the execution should move to an error queue or be marked for review. "Retry forever" is not a recovery strategy.
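The following sketch combines exponential backoff, a small random component and an attempt limit. The exception class `TemporaryFailure`, the default delays and the 10% jitter are assumptions for illustration; any other exception is treated as permanent or unknown and propagates immediately.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TemporaryFailure(Exception):
    """Hypothetical marker for timeouts, rate limits and unavailable services."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential delay capped at `cap`, plus up to 10% random jitter."""
    delay = min(cap, base * 2 ** attempt)
    return delay + random.uniform(0, delay * 0.1)


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only temporary failures, and stop once the limit is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TemporaryFailure:
            if attempt == max_attempts - 1:
                raise  # limit reached: the caller routes it to an error queue or review
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice that keeps the function testable without real waiting.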

5. Persist state outside the workflow

A visual workflow is excellent at showing the path a process should follow, but for operations that matter it should not be the only source of truth. For those operations, I would preserve an explicit state:

RECEIVED -> VALIDATED -> APPROVED -> EXECUTING -> COMPLETED
                                      |
                                      -> FAILED_RETRYABLE
                                      -> NEEDS_REVIEW

Each transition should record a timestamp, an execution identifier and the minimum necessary context. This makes it possible to know where the process stopped, resume from a safe point, avoid repeating completed actions, investigate incidents and measure time and failure rates.

The workflow orchestrates. Persistence records what actually happened.
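As a sketch, the diagram above can be encoded as a table of allowed transitions with a history entry per change. The record layout and the function name `transition` are hypothetical, and the path from FAILED_RETRYABLE back to EXECUTING (or on to NEEDS_REVIEW) is an assumption added here, since the diagram does not show what follows a retryable failure.

```python
from datetime import datetime, timezone
from typing import Optional

TRANSITIONS = {
    "RECEIVED": {"VALIDATED"},
    "VALIDATED": {"APPROVED"},
    "APPROVED": {"EXECUTING"},
    "EXECUTING": {"COMPLETED", "FAILED_RETRYABLE", "NEEDS_REVIEW"},
    "FAILED_RETRYABLE": {"EXECUTING", "NEEDS_REVIEW"},  # assumption, see above
    "COMPLETED": set(),
    "NEEDS_REVIEW": set(),
}


def transition(execution: dict, new_state: str, context: Optional[dict] = None) -> dict:
    """Move to `new_state` if allowed and append the change to the history."""
    current = execution["state"]
    if new_state not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_state}")
    execution["history"].append({
        "at": datetime.now(timezone.utc).isoformat(),
        "execution_id": execution["execution_id"],
        "from": current,
        "to": new_state,
        "context": context or {},  # keep this to the minimum necessary
    })
    execution["state"] = new_state
    return execution
```

In a real system the record and its history would live in a database rather than in a dictionary, so that they survive a restart of the workflow engine.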

6. Design observability that answers questions

Storing many logs does not guarantee that a failure can be understood. Observability should help reconstruct the story of an execution.

I cover this part in more depth in observability for AI agents in production.

At a minimum, I would record:

a correlation identifier shared across components;
event source and type;
workflow, service and prompt version;
model used and latency;
tools or APIs called;
accepted or rejected validations;
state changes;
classified errors;
final result and escalation reason.

I would not indiscriminately log unnecessary personal data or complete prompts. Traceability must coexist with data minimisation, access controls and retention policies.
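A minimal sketch of a structured log record that carries a correlation identifier and drops sensitive fields before anything is written. The field names, the logger name and the contents of `SENSITIVE_FIELDS` are hypothetical; a real list would come from the applicable privacy requirements.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("automation")

# Hypothetical list of fields that should never reach the logs.
SENSITIVE_FIELDS = {"customer_name", "phone", "message_text", "prompt"}


def build_event(correlation_id: str, step: str, outcome: str, **fields) -> dict:
    """Build one structured log record, leaving out sensitive fields."""
    safe = {key: value for key, value in fields.items() if key not in SENSITIVE_FIELDS}
    return {
        "at": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id,
        "step": step,
        "outcome": outcome,
        **safe,
    }


def log_event(correlation_id: str, step: str, outcome: str, **fields) -> None:
    """Emit the record as one JSON line so it can be filtered by correlation_id."""
    logger.info(json.dumps(build_event(correlation_id, step, outcome, **fields)))
```

For example, `log_event("corr-1", "create_reservation", "completed", prompt_version="v3", latency_ms=120)` would produce a single line that any component sharing `corr-1` can be joined with.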

I expand on that balance between traceability and privacy in security and privacy for enterprise AI agents.

Recording versions is only useful when each prompt and workflow actually has one. I detail this in versioning prompts and workflows for AI agents.

A useful test is to try answering four questions: how many executions failed, at which step, whether the external action completed, and whether the system recovered or needs a person. If answering requires opening several systems and guessing, more work is needed.

7. Treat the LLM as a non-deterministic component

A model can return different answers for similar inputs. It can also produce invalid JSON, omit fields or select the wrong tool.

I would not try to eliminate all that variability. I would contain it within boundaries:

structured and validated outputs;
allowlisted tools;
permissions separated by action;
iteration, time and cost limits;
deterministic checks;
a fallback when the model or provider is unavailable;
human review for sensitive decisions.

In an automation, the LLM should interpret, classify, extract or propose. Business rules must decide what can be executed.
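One way to sketch that boundary: the model output is treated as an untrusted proposal, and a deterministic function decides whether it may proceed. The tool names, the expected JSON shape (`tool` plus `arguments`) and the function name `parse_tool_call` are assumptions for illustration, not the format of any particular provider.

```python
import json

# Hypothetical allowlist: tool name -> argument names it requires.
ALLOWED_TOOLS = {
    "check_availability": {"reservation_at", "party_size"},
    "create_reservation": {"reservation_at", "party_size", "customer_name"},
}


def parse_tool_call(raw_model_output: str) -> dict:
    """Accept a proposal only if it is valid JSON, allowlisted and complete."""
    try:
        proposal = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "invalid_json"}
    if not isinstance(proposal, dict):
        return {"ok": False, "reason": "not_an_object"}

    tool = proposal.get("tool")
    arguments = proposal.get("arguments")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return {"ok": False, "reason": "tool_not_allowed"}
    if not isinstance(arguments, dict):
        return {"ok": False, "reason": "arguments_not_an_object"}

    missing = ALLOWED_TOOLS[tool] - set(arguments)
    if missing:
        return {"ok": False, "reason": "missing_arguments: " + ", ".join(sorted(missing))}
    return {"ok": True, "tool": tool, "arguments": arguments}
```

A rejected proposal carries a reason that can be logged, sent back to the model for one bounded retry, or used as the escalation reason.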

When those rules detect risk or uncertainty, the natural next step is a human-in-the-loop flow with explicit approval.

8. Prepare a useful human path

Escalating to a person should not mean sending a message that says "the agent failed". A good handoff should include what the user wanted, which data was collected, which actions were attempted, what each system returned, why the automation stopped and which decision is missing.

The goal is to continue where the system stopped without repeating the entire investigation or requesting information that is already available. Human escalation is not a defeat for automation: it is a planned route for cases that require judgement, permissions or additional context.
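The handoff contents listed above could be captured in a small structure that is checked before it is sent. This is only a sketch: the class name `Handoff`, its field names and the helper `incomplete_fields` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    user_goal: str            # what the user wanted
    collected_data: dict      # data already gathered, so nobody asks again
    actions_attempted: list   # which actions were tried
    system_responses: list    # what each system returned
    stop_reason: str          # why the automation stopped
    decision_needed: str      # which decision is missing


def incomplete_fields(handoff: Handoff) -> list:
    """Name the text fields left empty, so a vague handoff is caught before sending."""
    return [
        name
        for name in ("user_goal", "stop_reason", "decision_needed")
        if not getattr(handoff, name).strip()
    ]
```

A handoff with an empty `decision_needed` is the structured equivalent of "the agent failed", so it makes sense to reject it before it reaches a person.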

9. Evaluate before and after deployment

Before production, I would test much more than the happy path:

duplicate events;
missing or contradictory fields;
ambiguous dates;
invalid model responses;
timeouts and rate limits;
interruptions after performing an action;
unauthorised requests;
prompt-injection attempts.

After deployment, I would monitor operational and quality metrics:

Metric	What it helps detect
Completion rate	Whether the flow resolves its objective
Prevented duplicates	Whether idempotency is working
Retries by integration	Unstable APIs
Time to resolution	Bottlenecks
Escalations and reasons	Uncovered cases or overly strict rules
Manually corrected actions	Errors hidden by the success rate
Cost per execution	Inefficient models and tools

An automation is not finished when it is deployed. Production generates evidence about how it should improve.
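Several of the metrics in the table can be derived directly from stored execution records. The sketch below assumes each record is a dictionary with a `state` and optional `retries`, `duplicates_prevented` and `escalation_reason` fields; those names and the function `summarise` are hypothetical.

```python
from collections import Counter


def summarise(executions: list) -> dict:
    """Aggregate operational metrics from a list of execution records."""
    total = len(executions)
    completed = sum(1 for e in executions if e["state"] == "COMPLETED")

    retries = Counter()
    for e in executions:
        # Each record may carry a mapping such as {"booking_api": 2}.
        retries.update(e.get("retries", {}))

    reasons = Counter(
        e["escalation_reason"] for e in executions if e.get("escalation_reason")
    )
    return {
        "completion_rate": completed / total if total else 0.0,
        "prevented_duplicates": sum(e.get("duplicates_prevented", 0) for e in executions),
        "retries_by_integration": dict(retries),
        "escalation_reasons": dict(reasons),
    }
```

Grouping escalations by reason is what separates genuinely uncovered cases from rules that are simply too strict.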

Checklist before automating a real action

My final criterion

The difference between a demo and a production system is not only scale. It is the ability to respond correctly when something happens differently from what was expected.

In a reliable automation, the happy path can be short. The recovery paths demonstrate the quality of the architecture: preventing duplicates, preserving state, limiting damage, explaining what happened and allowing a person to intervene.

Automation is not merely making a task run by itself; it is designing what should happen even when it fails.

This approach complements my guide on choosing between n8n, FastAPI and Spring and the restaurant reservation system with AI and WhatsApp, where a conversation becomes a real action.

I also explain how to separate knowledge, behaviour and decisions in RAG, web search, fine-tuning or rules. And if the system already uses agents, the next piece is measuring them with a serious pre-production evaluation.

You can explore more projects in my portfolio or contact me through the contact page.