From demo to production: how I design reliable AI automations
A practical guide to production AI automation with idempotency, retries, validation, observability and human review.
Building an AI automation demo can be surprisingly fast. A form triggers a workflow, an LLM interprets the message, an API performs an action and a notification arrives. The first test works, and the system appears finished.
The important work usually starts immediately afterwards.
A production-ready automation must remain reliable when an API is slow, a user submits the same request twice, the model returns an unexpected structure, a field is missing or an execution stops halfway through.
In the systems I build, I try to let AI provide flexibility without making the entire process unpredictable. To achieve that, I separate interpretation, rules, actions and recovery.
Direct answer: what does an AI automation need to be reliable?
A reliable AI automation needs validated inputs, idempotent operations, controlled retries, persistent state, observability, explicit limits and a human path when the system cannot act safely.
| Layer | Question it must answer |
|---|---|
| Input contract | What data does the process actually need? |
| Validation | Is the request complete, consistent and authorised? |
| Idempotency | What happens if the same event arrives twice? |
| State | Where is each execution in the process? |
| Retries | Which failures are temporary and which require intervention? |
| Observability | Can we reconstruct what happened? |
| Human review | When should the system stop automating? |
Prompt quality matters, but no sentence inside a prompt replaces these layers.
The example: a reservation created from WhatsApp
Imagine an assistant receiving this message:
I would like a table for four tomorrow at nine.
The happy-path demo looks simple:
- the LLM extracts the date, time and party size;
- the system checks availability;
- it creates the reservation;
- it sends a confirmation.
Before calling the system reliable, however, several questions need answers:
- Which timezone does "tomorrow" refer to?
- Is the restaurant open at that time?
- Are the customer's name or contact details missing?
- What happens if WhatsApp resends the webhook?
- What if the reservation is created but the confirmation message fails?
- Should the entire execution be retried?
- How can a person know whether the reservation was registered?
These details are the difference between an automation that impresses in a demo and one that actually reduces work.
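The timezone question is a good example of something that should be decided in code rather than left to the model. The following is a minimal sketch, assuming the restaurant has one known timezone and that the hour has already been extracted from the message. The names `RESTAURANT_TZ` and `resolve_tomorrow` are hypothetical, and the fixed offset is a simplification.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: a fixed +02:00 offset stands in for the restaurant's timezone.
# A real deployment would use an IANA zone (for example via zoneinfo) to handle DST.
RESTAURANT_TZ = timezone(timedelta(hours=2))


def resolve_tomorrow(received_at: datetime, hour: int, minute: int = 0) -> datetime:
    """Resolve "tomorrow at HH:MM" against the restaurant's local calendar date."""
    if received_at.tzinfo is None:
        raise ValueError("received_at must be timezone-aware")
    local_now = received_at.astimezone(RESTAURANT_TZ)
    target_date = local_now.date() + timedelta(days=1)
    return datetime.combine(target_date, time(hour, minute), tzinfo=RESTAURANT_TZ)


# 22:30 UTC on 14 June is already 00:30 on 15 June at +02:00,
# so "tomorrow" resolves to 16 June, not 15 June.
received = datetime(2026, 6, 14, 22, 30, tzinfo=timezone.utc)
resolved = resolve_tomorrow(received, hour=21)
print(resolved.isoformat())
```

The point of the design is that the server timestamp and the restaurant's calendar can disagree about which day "tomorrow" is, so the conversion happens before any date arithmetic.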
1. Turn ambiguous inputs into clear contracts
Users and LLMs work well with flexible language. Real actions require precise data. After interpreting a request, I would turn the result into a structured contract:
{
  "customer_name": "Gorka",
  "party_size": 4,
  "reservation_at": "2026-06-15T21:00:00+02:00",
  "contact_channel": "whatsapp",
  "request_id": "msg_abc123"
}
Before creating anything, a deterministic layer must verify that required fields exist, formats are correct, the date is valid, the time is allowed and the action is authorised.
The LLM can propose the structure. The system must decide whether that structure is valid. If information is missing, the right response is to ask only for the missing detail while preserving the context already collected.
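A deterministic check of that contract could look like the sketch below. It is an illustration, not a complete rule set: the opening hours, the maximum party size and the function name `validate_reservation` are assumptions I have added for the example.

```python
from datetime import datetime, time

REQUIRED_FIELDS = ("customer_name", "party_size", "reservation_at",
                   "contact_channel", "request_id")
# Assumptions for illustration: opening hours and maximum party size.
OPENS, LAST_SEATING = time(13, 0), time(22, 30)
MAX_PARTY_SIZE = 12


def validate_reservation(payload: dict, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the contract is valid.

    `now` must be timezone-aware so it can be compared with the requested time.
    """
    problems = [f"missing:{name}" for name in REQUIRED_FIELDS if not payload.get(name)]
    if problems:
        return problems  # ask the user only for these fields

    size = payload["party_size"]
    if not isinstance(size, int) or isinstance(size, bool) or not 1 <= size <= MAX_PARTY_SIZE:
        problems.append("invalid:party_size")

    try:
        when = datetime.fromisoformat(payload["reservation_at"])
    except (TypeError, ValueError):
        return problems + ["invalid:reservation_at"]
    if when.tzinfo is None:
        problems.append("invalid:reservation_at_without_offset")
    elif when <= now:
        problems.append("invalid:reservation_in_past")
    elif not OPENS <= when.time() <= LAST_SEATING:
        problems.append("invalid:outside_opening_hours")
    return problems
```

Returning the list of problems, rather than a single boolean, is what allows the assistant to ask for one missing detail and keep everything else it has already collected.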
2. Design idempotent operations
An idempotent operation has the same effect whether it is requested once or several times. This property is fundamental when webhooks, queues, networks and retries are involved.
If the same event arrives twice, the system must not create two reservations. To prevent this, I would associate every request with a unique key:
idempotency_key = channel + message_identifier
Before executing an action, the service checks whether that key has already been processed:
| Existing state | Behaviour |
|---|---|
| Does not exist | Start the operation |
| In progress | Wait or return the current state |
| Completed | Return the stored result |
| Recoverable failure | Resume from the safe point |
| Permanent failure | Escalate or request a correction |
Idempotency does more than prevent duplicates. It makes retries safer and ensures consistent responses when different components ask about the same process.
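The first three rows of the table can be sketched as follows. This is an in-memory illustration only: `IdempotencyStore` and `run_once` are hypothetical names, and a real service would need a durable store with an atomic "insert if absent" operation so that two workers cannot both start the same key.

```python
from dataclasses import dataclass, field
from typing import Callable


def idempotency_key(channel: str, message_id: str) -> str:
    return f"{channel}:{message_id}"


@dataclass
class IdempotencyStore:
    """In-memory stand-in for a durable key/value store."""
    records: dict[str, dict] = field(default_factory=dict)

    def run_once(self, key: str, action: Callable[[], dict]) -> dict:
        record = self.records.get(key)
        if record is not None and record["state"] == "completed":
            return record["result"]            # duplicate: return the stored result
        if record is not None and record["state"] == "in_progress":
            return {"status": "in_progress"}   # another execution owns this key
        self.records[key] = {"state": "in_progress", "result": None}
        try:
            result = action()
        except Exception:
            # Failure classification (recoverable vs permanent) is out of scope here.
            self.records[key] = {"state": "failed", "result": None}
            raise
        self.records[key] = {"state": "completed", "result": result}
        return result


calls = []

def create_reservation() -> dict:
    calls.append("create")
    return {"reservation_id": "res_1"}

store = IdempotencyStore()
key = idempotency_key("whatsapp", "msg_abc123")
first = store.run_once(key, create_reservation)
second = store.run_once(key, create_reservation)  # simulated webhook resend
```

After the second call, `calls` still holds a single entry and both calls return the same stored result.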
3. Separate reversible and irreversible steps
Not every action carries the same risk. Classifying an intent, summarising text or preparing a draft can be repeated without consequences. Sending an email, creating a reservation, charging, cancelling or changing a record has external effects.
Before every irreversible effect, it helps to establish a checkpoint:
Interpret
-> validate
-> check permissions
-> record intention
-> execute action
-> store result
-> notify
If the final notification fails, the system should not create the reservation again. It should recover the stored result and retry only the notification. This separation makes recovery much safer than restarting the complete workflow.
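A minimal sketch of that recovery rule, assuming the execution record is persisted between attempts (here it is just a dictionary) and that `create_reservation` and `send_confirmation` are hypothetical callables supplied by the integration layer:

```python
def process(record: dict, create_reservation, send_confirmation) -> dict:
    """Advance one execution; `record` is the persisted state for this request."""
    if record.get("reservation_id") is None:
        # Irreversible step: runs only if no result was stored before.
        record["reservation_id"] = create_reservation(record["request"])
    if not record.get("notified"):
        # Safe to retry on its own: the reservation already exists.
        send_confirmation(record["reservation_id"])
        record["notified"] = True
    return record


calls = {"create": 0, "notify": 0}

def create_reservation(request: dict) -> str:
    calls["create"] += 1
    return "res_1"

def flaky_confirmation(reservation_id: str) -> None:
    calls["notify"] += 1
    if calls["notify"] == 1:
        raise ConnectionError("notification provider unavailable")

record = {"request": {"party_size": 4}}
try:
    process(record, create_reservation, flaky_confirmation)
except ConnectionError:
    pass  # the stored reservation_id survives the failure
process(record, create_reservation, flaky_confirmation)
```

The second run skips the reservation and retries only the confirmation. The sketch does not cover a crash between creating the reservation and storing its identifier; that gap is what the "record intention" step and the idempotency key are for.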
4. Use retries deliberately
Retrying everything immediately can make a problem worse. If an API is overloaded, hundreds of executions retrying at once generate more load. If the error is invalid data, repeating the request will not change the outcome.
| Failure type | Example | Response |
|---|---|---|
| Temporary | Timeout, rate limit, unavailable service | Retry after waiting |
| Permanent | Invalid field, denied permission, missing resource | Correct or escalate |
| Unknown | Unexpected response or inconsistency | Stop, record and review |
For temporary failures, I would use exponential backoff: progressively longer delays between attempts, with a small random component so that executions do not all retry at the same moment.
I would also set a limit. Once exceeded, the execution should move to an error queue or be marked for review. "Retry forever" is not a recovery strategy.
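The policy above can be sketched in a few lines. The exception class `TemporaryError`, the base delay, the cap and the 25% jitter are assumptions chosen for the example; any other exception is treated as permanent or unknown and is not retried.

```python
import random
import time


class TemporaryError(Exception):
    """Hypothetical marker for timeouts, rate limits and unavailable services."""


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0):
    """Yield one delay per retry: exponential growth, capped, plus up to 25% jitter."""
    for attempt in range(max_attempts - 1):
        delay = min(cap, base * 2 ** attempt)
        yield delay + random.uniform(0, delay * 0.25)


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Run `operation`, retrying only TemporaryError, up to `max_attempts` calls."""
    delays = backoff_delays(max_attempts)
    while True:
        try:
            return operation()
        except TemporaryError:
            delay = next(delays, None)
            if delay is None:
                raise  # limit reached: the caller moves the execution to an error queue
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function testable without real waiting. When the limit is reached the last error propagates, so the caller decides whether the execution goes to an error queue or to review.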
5. Persist state outside the workflow
A visual workflow is excellent at showing the path a process should follow, but it should not always be the only source of truth. For meaningful operations, I would preserve an explicit state:
RECEIVED -> VALIDATED -> APPROVED -> EXECUTING -> COMPLETED
                                     |
                                     -> FAILED_RETRYABLE
                                     -> NEEDS_REVIEW
Each transition should record a date, execution identifier and minimum necessary context. This makes it possible to know where the process stopped, resume from a safe point, avoid repeating completed actions, investigate incidents and measure time and failure rates.
The workflow orchestrates. Persistence records what actually happened.
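One way to keep that record honest is to allow only known transitions. The sketch below stores the history in a list; in practice it would be a database table. The transition map is my assumption: I treat the failure branches as leaving `EXECUTING`, and I let a retryable failure either run again or go to review.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumption: failure branches leave EXECUTING; a retryable failure
# may either run again or be handed to a person.
ALLOWED_TRANSITIONS = {
    None: {"RECEIVED"},
    "RECEIVED": {"VALIDATED"},
    "VALIDATED": {"APPROVED"},
    "APPROVED": {"EXECUTING"},
    "EXECUTING": {"COMPLETED", "FAILED_RETRYABLE", "NEEDS_REVIEW"},
    "FAILED_RETRYABLE": {"EXECUTING", "NEEDS_REVIEW"},
    "COMPLETED": set(),
    "NEEDS_REVIEW": set(),
}


def record_transition(history: list, new_state: str, execution_id: str,
                      context: Optional[dict] = None) -> dict:
    """Append a transition to `history` after checking that it is allowed."""
    current = history[-1]["state"] if history else None
    if new_state not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    entry = {
        "state": new_state,
        "at": datetime.now(timezone.utc).isoformat(),
        "execution_id": execution_id,
        "context": context or {},
    }
    history.append(entry)
    return entry


history = []
for state in ("RECEIVED", "VALIDATED", "APPROVED", "EXECUTING",
              "FAILED_RETRYABLE", "EXECUTING", "COMPLETED"):
    record_transition(history, state, execution_id="exec_1")
```

Because `COMPLETED` has no outgoing transitions, an attempt to execute a finished reservation again raises an error instead of repeating the action.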
6. Design observability that answers questions
Storing many logs does not guarantee that a failure can be understood. Observability should help reconstruct the story of an execution.
At a minimum, I would record:
- a correlation identifier shared across components;
- event source and type;
- workflow, service and prompt version;
- model used and latency;
- tools or APIs called;
- validation outcomes, passed or rejected;
- state changes;
- classified errors;
- final result and escalation reason.
I would not indiscriminately log unnecessary personal data or complete prompts. Traceability must coexist with data minimisation, access controls and retention policies.
A useful test is to try answering four questions: how many executions failed, at which step, whether the external action completed, and whether the system recovered or needs a person. If answering requires opening several systems and guessing, more work is needed.
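As a small illustration of logs that answer questions, the sketch below writes each event as one JSON line with a shared correlation identifier and then counts failures per step. The helper names and the example fields (`prompt_version`, `latency_ms`, `error_class`) are hypothetical, and no message content or personal data is included.

```python
import json
from collections import Counter


def make_event(correlation_id: str, step: str, outcome: str, **fields) -> str:
    """Serialise one execution event as a JSON line; callers choose the extra fields."""
    return json.dumps({"correlation_id": correlation_id, "step": step,
                       "outcome": outcome, **fields}, sort_keys=True)


def failures_by_step(lines: list) -> Counter:
    """Answer "at which step did executions fail?" from the stored events."""
    events = (json.loads(line) for line in lines)
    return Counter(e["step"] for e in events if e["outcome"] == "failed")


log = [
    make_event("corr_1", "validate", "ok", prompt_version="v3"),
    make_event("corr_1", "create_reservation", "ok", latency_ms=420),
    make_event("corr_1", "notify", "failed", error_class="temporary"),
    make_event("corr_2", "validate", "failed", error_class="permanent"),
]
print(failures_by_step(log))
```

Filtering the same lines by `correlation_id` reconstructs a single execution: for `corr_1`, the reservation was created and only the notification failed.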
7. Treat the LLM as a non-deterministic component
A model can return different answers for similar inputs. It can also produce invalid JSON, omit fields or select the wrong tool.
I would not try to eliminate all that variability. I would contain it within boundaries:
- structured and validated outputs;
- allowlisted tools;
- permissions separated by action;
- iteration, time and cost limits;
- deterministic checks;
- a fallback when the model or provider is unavailable;
- human review for sensitive decisions.
In an automation, the LLM should interpret, classify, extract or propose. Business rules must decide what can be executed.
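Two of those boundaries, validated output and allowlisted tools, can be sketched like this. The tool names and the `{"tool": ..., "arguments": ...}` shape are assumptions for the example, not the format of any particular model provider.

```python
import json

# Hypothetical tool names; anything else the model proposes is rejected.
ALLOWED_TOOLS = {"check_availability", "create_reservation"}


def parse_model_output(raw: str) -> dict:
    """Return {"ok": True, ...} for a usable proposal, or {"ok": False, "reason": ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "invalid_json"}
    if not isinstance(data, dict):
        return {"ok": False, "reason": "not_an_object"}
    tool = data.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return {"ok": False, "reason": "tool_not_allowed"}
    arguments = data.get("arguments")
    if not isinstance(arguments, dict):
        return {"ok": False, "reason": "missing_arguments"}
    return {"ok": True, "tool": tool, "arguments": arguments}
```

A rejected output carries a reason, so the caller can decide whether to ask the model again within the iteration limit or to escalate. A passing result is still only a proposal: the business rules decide whether it runs.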
8. Prepare a useful human path
Escalating to a person should not mean sending a message that says "the agent failed". A good handoff should include what the user wanted, which data was collected, which actions were attempted, what every system returned, why the automation stopped and which decision is missing.
The goal is to continue where the system stopped without repeating the entire investigation or requesting information that is already available. Human escalation is not a defeat for automation: it is a planned route for cases that require judgement, permissions or additional context.
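The handoff contents listed above map naturally onto a small data structure. This is a sketch with hypothetical field names and example values; the `missing_parts` check is my addition to catch an escalation that would arrive without a reason or a pending decision.

```python
from dataclasses import asdict, dataclass


@dataclass
class Handoff:
    user_goal: str            # what the user wanted
    collected_data: dict      # data already gathered, so nobody asks twice
    attempted_actions: list   # actions tried, in order
    system_responses: dict    # what each system returned
    stop_reason: str          # why the automation stopped
    pending_decision: str     # the decision a person must make

    def missing_parts(self) -> list:
        """Names of empty text fields that would make the handoff unhelpful."""
        required = ("user_goal", "stop_reason", "pending_decision")
        return [name for name in required if not getattr(self, name).strip()]


handoff = Handoff(
    user_goal="Table for four on 2026-06-15 at 21:00",
    collected_data={"customer_name": "Gorka", "party_size": 4},
    attempted_actions=["check_availability"],
    system_responses={"check_availability": "no table free at 21:00"},
    stop_reason="No availability at the requested time",
    pending_decision="Offer 21:30 or add the customer to the waiting list?",
)
payload = asdict(handoff)  # ready to send to a review queue or ticketing tool
```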
9. Evaluate before and after deployment
Before production, I would test much more than the happy path:
- duplicate events;
- missing or contradictory fields;
- ambiguous dates;
- invalid model responses;
- timeouts and rate limits;
- interruptions after performing an action;
- unauthorised requests;
- prompt-injection attempts.
After deployment, I would monitor operational and quality metrics:
| Metric | What it helps detect |
|---|---|
| Completion rate | Whether the flow resolves its objective |
| Prevented duplicates | Whether idempotency is working |
| Retries by integration | Unstable APIs |
| Time to resolution | Bottlenecks |
| Escalations and reasons | Uncovered cases or overly strict rules |
| Manually corrected actions | Errors hidden by the success rate |
| Cost per execution | Inefficient models and tools |
An automation is not finished when it is deployed. Production generates evidence about how it should improve.
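Several of the metrics in the table can be computed directly from persisted execution records. The sketch below assumes each record carries hypothetical fields `state`, `duplicate`, `retries`, `escalation_reason` and `cost_cents`; the sample data is invented for the example.

```python
from collections import Counter


def summarise(executions: list) -> dict:
    """Compute a few operational metrics from per-execution records."""
    total = len(executions)
    if total == 0:
        return {}
    retries = Counter()
    for e in executions:
        retries.update(e["retries"])  # adds the per-integration retry counts
    reasons = Counter(e["escalation_reason"] for e in executions if e["escalation_reason"])
    return {
        "completion_rate": sum(e["state"] == "COMPLETED" for e in executions) / total,
        "prevented_duplicates": sum(1 for e in executions if e["duplicate"]),
        "retries_by_integration": dict(retries),
        "escalation_reasons": dict(reasons),
        "cost_cents_per_execution": sum(e["cost_cents"] for e in executions) / total,
    }


executions = [
    {"state": "COMPLETED", "duplicate": False, "retries": {"whatsapp": 1},
     "escalation_reason": None, "cost_cents": 2},
    {"state": "COMPLETED", "duplicate": True, "retries": {},
     "escalation_reason": None, "cost_cents": 1},
    {"state": "NEEDS_REVIEW", "duplicate": False, "retries": {"booking_api": 3, "whatsapp": 1},
     "escalation_reason": "retry_limit", "cost_cents": 3},
    {"state": "COMPLETED", "duplicate": False, "retries": {},
     "escalation_reason": None, "cost_cents": 2},
]
summary = summarise(executions)
```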
Checklist before automating a real action
- The input becomes a validated contract.
- An idempotency key exists.
- External effects are separated from interpretation.
- Every error has a retry or escalation policy.
- Important state persists beyond a single execution.
- Logs can reconstruct the process without exposing unnecessary data.
- Cost, time, permission and iteration limits exist.
- A person can continue an escalated case with context.
- Tests cover failures and duplicates, not only the happy path.
- Metrics have been defined to review real behaviour.
My final criterion
The difference between a demo and a production system is not only scale. It is the ability to respond correctly when something happens differently from what was expected.
In a reliable automation, the happy path can be short. The recovery paths demonstrate the quality of the architecture: preventing duplicates, preserving state, limiting damage, explaining what happened and allowing a person to intervene.
Automation is not merely making a task run by itself; it is designing what should happen even when it fails.
This approach complements my guide on choosing between n8n, FastAPI and Spring and the restaurant reservation system with AI and WhatsApp, where a conversation becomes a real action.