Voice AI fails in production not because of bad models but because of bad architecture. The five patterns that kill voice AI at scale are prompt stuffing, no state management, ignoring latency budgets, tool calls as an attack surface, and no post-call observability. The fix for all five is the same: code drives the call, not the model. Business logic, routing decisions, and state transitions belong in deterministic software. The model handles language. Everything else is governed by your application.
Why Voice AI Fails Before the First Call
You built a voice AI agent. The demo was flawless: tight latency, clean responses, every question answered perfectly. Then you shipped it. And somewhere around call #3,000 at 2 a.m., things went sideways.
The model started hallucinating. It skipped required steps. It went silent and left callers hanging. No crash, no error log, just wrong behavior, angry users, and zero ability to reproduce the issue.
This is the gap between demo and production, and it’s not a prompt problem. It’s an architecture problem.
In our latest LIVEWire webinar, SignalWire Head of DevEx Brian West broke down the five patterns that kill voice AI in production and showed exactly how to fix each one.
The root cause of failed voice AI: Prompt and pray
Most voice AI teams are stuck in a pattern Brian calls "prompt and pray." You stuff your prompt with business logic, instructions, and rules, then hope it works consistently in production.
It works in demos because you control the input. Real callers don't follow your script. They interrupt, change their minds, say things wrong, and go completely off-topic. Under that kind of pressure, a prompt-heavy architecture breaks in ways that are quiet and hard to catch.
The core issue is that a prompt is a suggestion, not a constraint. The model will follow your rules most of the time, but "most of the time" is not an engineering standard: it isn't testable and it isn't replicable. For contact center operations, consistency is what matters most.
The five patterns that cause voice AI to fail
1. Prompt stuffing
The more logic you pack into a prompt, the less reliable it becomes. When context pressure builds, the model starts skipping steps, inventing policies you never approved, and attempting math it will get wrong.
There’s also the pink elephant problem: if you tell the model "don’t mention the discount," you have now increased the probability that it will. Prohibition is not a guardrail. Mentioning it at all is a risk.
The fix: Keep prompts thin. Use them only for tone, intent extraction, and natural language understanding. All business logic, pricing rules, and calculations belong in code, not in the prompt.
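To make the split concrete, here is a minimal sketch assuming a hypothetical taxi-booking agent: the prompt carries only tone and the job of understanding what the caller wants, while a hypothetical `quote_price` function and tariff table own the pricing math the model would otherwise get wrong.

```python
# Minimal sketch of a thin prompt plus deterministic pricing.
# SYSTEM_PROMPT, TARIFF, and quote_price are hypothetical examples, not a real API.

SYSTEM_PROMPT = (
    "You are a friendly taxi dispatcher. "
    "Collect the caller's pickup and destination, then confirm the quote we give you."
)

TARIFF = {"base_fare": 3.50, "per_km": 1.80}

def quote_price(distance_km: float) -> float:
    """Deterministic pricing: the model never does this math."""
    return round(TARIFF["base_fare"] + TARIFF["per_km"] * distance_km, 2)

if __name__ == "__main__":
    print(quote_price(12.4))  # 25.82
```

The prompt never learns the tariff, so it cannot misquote it; the number the caller hears always comes from `quote_price`.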
2. No state management
Without a state machine, every tool is available at every step. The result is that the model might call complete_order before it’s collected the required information, ask questions out of order, or behave differently on call one versus call ten thousand.
Think of it like setting a toddler down with exactly three toys: the model can only work with what’s in front of it right now. When the step is done, the toys change.
The fix: Build a state machine. At each step, expose only the tools that are valid for that step. The model completes the task because you have removed all other options.
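Here is a minimal sketch of per-step tool gating in plain Python, with hypothetical step and tool names; the exact mechanism depends on your agent framework, but the principle is the same: the model only ever sees the tools for the current step, and code decides when to move on.

```python
# Sketch of a conversation state machine that gates tools per step.
# Step names and tool names are hypothetical.
from enum import Enum, auto

class Step(Enum):
    COLLECT_PICKUP = auto()
    COLLECT_DESTINATION = auto()
    CONFIRM_ORDER = auto()

# Only the tools valid for the current step are ever exposed to the model.
TOOLS_BY_STEP = {
    Step.COLLECT_PICKUP: ["set_pickup"],
    Step.COLLECT_DESTINATION: ["set_destination"],
    Step.CONFIRM_ORDER: ["complete_order"],
}

def allowed_tools(step: Step) -> list[str]:
    return TOOLS_BY_STEP[step]

def advance(step: Step) -> Step:
    """Deterministic transition: code decides when the step changes, not the model."""
    order = list(Step)
    return order[min(order.index(step) + 1, len(order) - 1)]

step = Step.COLLECT_PICKUP
print(allowed_tools(step))  # ['set_pickup']
step = advance(step)
print(allowed_tools(step))  # ['set_destination']
```

Because `complete_order` simply does not exist until the final step, the model cannot call it early.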
3. Ignoring latency budgets
After a year of recording and reviewing every call on their own platform, SignalWire found a clear pattern:
800ms–1,200ms is the sweet spot
Over 2 seconds and callers assume it's broken
Your TTS choice alone can add 250–500ms depending on the provider and voice. That has to be accounted for in your budget, and you can win back headroom by keeping audio processing within the control plane rather than round-tripping it through your application layer.
The fix: Know your full latency budget before you choose your stack. Use fillers and async patterns to cover gaps while tool calls and LLM responses are in flight.
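As a rough illustration of both points, here is a sketch with made-up per-stage numbers: add up the budget before you pick components, and cover a slow tool call with a filler line while it runs asynchronously.

```python
# Sketch of latency-budget accounting and a filler over an async tool call.
# The per-stage numbers and slow_tool_call are illustrative, not measurements.
import asyncio

BUDGET_MS = {"asr": 200, "llm_first_token": 400, "tool_call": 150, "tts": 350}

def total_budget() -> int:
    return sum(BUDGET_MS.values())

async def slow_tool_call() -> str:
    await asyncio.sleep(1.5)  # stand-in for a slow backend lookup
    return "Your driver is six minutes away."

async def handle_turn() -> None:
    lookup = asyncio.create_task(slow_tool_call())
    print("Filler: One moment while I check on that...")  # keeps the caller engaged
    print(await lookup)

print(f"Stacked budget: {total_budget()}ms")  # 1100ms, inside the 800–1,200ms sweet spot
asyncio.run(handle_turn())
```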
4. Tool calls as an attack surface
Every argument you pass to a tool call is a place where something can go wrong, whether through hallucination, context drift, or adversarial input.
Our recommendation is the zero-argument tool call. Here's how it works in practice:
The model extracts two pieces of information from the caller: pickup and destination. That data gets validated (geocoded, confirmed with the caller) and written to global session data, outside the LLM context. From that point on, every subsequent tool call takes zero arguments. It pulls validated data directly from session state.
The LLM cannot pass a wrong address, a manipulated price, or hallucinated data to a downstream system, because it’s not sending any data at all.
The fix: Validate and store data in session state as early as possible. Design downstream tool calls to read from that state rather than accepting arguments from the model.
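A minimal sketch of that flow, with a hypothetical session store, a stubbed `geocode` step, and invented tool names standing in for the real pieces:

```python
# Sketch of the zero-argument tool call pattern.
# SESSION, geocode, set_route, and book_ride are hypothetical stand-ins.
SESSION = {}  # global session data, kept outside the LLM context

def geocode(address: str) -> dict:
    # Stand-in for real geocoding plus confirmation with the caller.
    return {"address": address, "lat": 0.0, "lon": 0.0, "validated": True}

def set_route(pickup: str, destination: str) -> None:
    """Called once, early: validate the caller's data and persist it."""
    SESSION["pickup"] = geocode(pickup)
    SESSION["destination"] = geocode(destination)

def book_ride() -> str:
    """Zero arguments: reads only validated session state, never model output."""
    p, d = SESSION["pickup"], SESSION["destination"]
    return f"Booked: {p['address']} -> {d['address']}"

set_route("123 Main St", "Airport Terminal B")
print(book_ride())  # Booked: 123 Main St -> Airport Terminal B
```

Once `set_route` has run, nothing the model says later can change what `book_ride` sends downstream.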
5. No post-call observability
Most teams have no way to answer the question: what actually happened on that call, and why?
Transcripts show what was said. They don't show why the model made a decision, how long each step took, which tool calls fired and in what order, or where latency was spent.
Without that visibility, you can't reproduce failures. And if you can't reproduce them, you can't fix them.
SignalWire's post-call payload captures latency per turn, tool call execution times, barge-in events, state transitions, TTS and ASR usage, and a full session timeline you can scrub through. The open-source Post-call Ingestion Engine visualizes all of it.
The fix: Instrument everything from day one. Build with observability in mind, not as an afterthought.
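As a rough illustration of why this matters (the field names below are invented, not the actual SignalWire payload schema), even a few lines of analysis become possible once per-turn data exists:

```python
# Sketch of mining a post-call payload for turns that blew the latency budget.
# The payload shape and field names are illustrative only.
example_payload = {
    "turns": [
        {"turn": 1, "latency_ms": 950, "tool_calls": [], "barge_in": False},
        {"turn": 2, "latency_ms": 2400, "tool_calls": ["book_ride"], "barge_in": True},
    ],
    "state_transitions": ["COLLECT_PICKUP", "COLLECT_DESTINATION", "CONFIRM_ORDER"],
}

def flag_slow_turns(payload: dict, budget_ms: int = 1200) -> list[int]:
    """Return the turn numbers that exceeded the latency budget."""
    return [t["turn"] for t in payload["turns"] if t["latency_ms"] > budget_ms]

print(flag_slow_turns(example_payload))  # [2]
```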
Programmatic Governed Inference (PGI): The key to predicting AI agent behavior
All five fixes point to the same underlying principle: your code should drive the call, not the LLM.
The model is excellent at understanding intent, extracting data from natural language, and sounding human. It’s not reliable for business logic, pricing, calculations, or routing decisions. The moment you lean on it for those things, you trade consistency for convenience.
The pattern Brian calls Programmatic Governed Inference flips the relationship. Code makes the decisions. The LLM executes natural language tasks within tightly scoped steps. Forced state transitions happen in your application, not in the model's context. The result is a voice AI agent that behaves the same way on call one as it does on call ten thousand.
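A compact sketch of what that looks like in code, with a hypothetical `ask_model` stub standing in for the LLM call: the application owns routing and transitions, and the model only ever sees one narrowly scoped instruction at a time.

```python
# Sketch of code-governed inference. ask_model, the step names, and the
# instructions are hypothetical; a real agent would call an LLM here.
def ask_model(instruction: str, utterance: str) -> dict:
    # Placeholder for a scoped LLM call that returns structured data.
    return {"pickup": utterance}

STEP_INSTRUCTIONS = {
    "COLLECT_PICKUP": "Extract the pickup address from the caller's reply.",
    "COLLECT_DESTINATION": "Extract the destination address from the caller's reply.",
}

def next_step(current: str, result: dict) -> str:
    """Deterministic routing: the application, not the model, picks the transition."""
    if current == "COLLECT_PICKUP" and result.get("pickup"):
        return "COLLECT_DESTINATION"
    return current

state = "COLLECT_PICKUP"
result = ask_model(STEP_INSTRUCTIONS[state], "42 Elm Street")
state = next_step(state, result)
print(state)  # COLLECT_DESTINATION
```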
Start building more consistent Voice AI agents
The demos referenced in this session, like Cabby (taxi booking), GoAir (flight booking), and Holy Guacamole (drive-through ordering), are all available in the SignalWire demos GitHub repo. Each one is a working example of state machines, zero-argument tool calls, and programmatic control in action.
The Post-call Ingestion Engine is also open source and available in the same repo.
If you want to dig deeper, join us at Friday Hangouts, every week from 9am–2pm Central, where Brian and the DevEx team are available to review your architecture, answer questions, and help you ship voice AI that actually works in production. You can also join our community of developers on Discord.
Frequently asked questions
What is prompt stuffing and why is it dangerous?
Prompt stuffing is loading business logic, pricing rules, and constraints into the model prompt and hoping it follows them consistently. The more logic you pack in, the less reliable the model becomes. Under context pressure it skips steps, invents policies, and gets math wrong. A prompt is a suggestion, not a constraint.
What is a state machine and why does voice AI need one?
A state machine restricts which tools the model can access at each step of a conversation. Without one, every tool is available at every step — meaning the model might complete an order before collecting required information, or behave differently on call one versus call ten thousand. With a state machine, the model can only do what's in front of it right now.
What is a zero-argument tool call?
A zero-argument tool call is a design pattern where data is validated and stored in session state early in the conversation, and all subsequent tool calls read from that state rather than accepting arguments from the model. This prevents the model from passing hallucinated, manipulated, or incorrect data to downstream systems.
What is Programmatic Governed Inference (PGI)?
PGI is an architectural approach where deterministic code makes decisions and the model executes natural language tasks within tightly scoped steps. Business logic, routing, and calculations live in code. The model handles intent extraction and conversation. The result is an agent that behaves the same way on call one as it does on call ten thousand.