What Twenty Years of Voice Infrastructure Taught Me About AI

Why voice AI keeps failing in production

Anthony Minessale

Anthony Minessale is the CEO of SignalWire and the creator of FreeSWITCH. He has been building real-time voice infrastructure since 2003.

Twenty years of operating voice infrastructure at carrier scale reveals that voice AI failures in production are not AI problems — they are infrastructure problems. State has no owner, latency compounds across multi-vendor stacks, and governance through prompts breaks under real conditions. The only architecture that works is one where a single layer owns the media, owns the state, and enforces governance deterministically — the same lessons telecommunications learned long before AI arrived.

I started building voice infrastructure in 2003. FreeSWITCH, the open-source telephony engine my co-founders and I wrote, now carries a significant share of global voice traffic. Vonage, Five9, Zoom Phone, Amazon Connect, and hundreds of other platforms run on it. We have been inside the voice stack, at the layer where packets become audio and audio becomes conversations, for over two decades.

Before FreeSWITCH, I was operating physical telecom circuits: DS3 lines carrying 644 channels of PRI. One of our customers was the TV show "A Current Affair." They bought two toll-free numbers and ran a live poll: is Michael Jackson innocent or not? The calls flooded in and filled half the circuit. We were paying for the full capacity, but the reseller we bought the circuit from, Airspring, was oversubscribing us. Random customer calls started dropping. Angry reports piled up. We called Airspring. They insisted nothing was wrong. We pushed harder. They threatened to bill us for demanding they engage the upstream carrier. Support calls went past 3am, night after night. We were positive something was wrong upstream. Nobody believed us. We kept pushing until they had no choice but to escalate. The upstream carrier found that Airspring had been putting skips on our line to throttle traffic on a circuit we were paying for in full. It finally got fixed, but it was a battle to the end.

That was decades ago, on physical copper and T-carrier circuits. The technology has changed completely. The pattern has not. When your infrastructure depends on a chain of vendors and something goes wrong in production, nobody upstream wants to own the problem. You fight. You escalate. You lose sleep. The only way to stop fighting is to own the layer yourself.

When AI voice agents started making real phone calls on our infrastructure, I recognized the same pattern. The models were impressive. The demos were compelling. And the production failures had nothing to do with the AI.

The demo-to-production gap is not about AI

Every voice AI demo I see works the same way. An LLM generates text. A TTS engine turns it into speech. An STT engine captures the caller's response. The loop repeats. The demo runs on a laptop, on a clean network, with one call at a time. It sounds great.

Then someone tries to run it in production. On the phone network. With real callers. At scale.

The failures that follow are not AI failures. They are infrastructure failures. The AI handled language fine. The infrastructure underneath it could not handle real-time voice. Just like the DS3 circuit could handle the traffic, until the vendor chain decided it could not.

I have watched this pattern repeat for two years now. The same categories of failure, across different companies, different AI models, different architectures. The failures are predictable because they come from the infrastructure layer, and the infrastructure layer has requirements that most AI builders have never encountered.

Three things break, every time

State has no owner

In most voice AI architectures, conversation state lives in the application. Your code sits between the telephony layer and the AI layer. It receives audio, routes it to STT, sends text to the LLM, gets a response, routes it to TTS, streams audio back. At every step, your application tracks what was said, what step the conversation is on, and what the AI is allowed to do next.

This works until something real-time happens. A transfer. A barge-in. A connection drop and reconnect. A concurrent event for the same call.

When a call transfers in a voice AI system, the receiving agent typically has no context. The state was in the application that handled the first leg. When a caller interrupts while the AI is mid-sentence, the system has to detect the interruption, cancel the current TTS output, capture the new input, and update the conversation state, all within milliseconds. When two events arrive for the same call at the same time (and they will), one overwrites the other.

These are not edge cases. These are the normal operating conditions of real-time voice. I have been dealing with them for twenty years. They are hard in telephony. They are harder when you add an AI that generates unpredictable outputs.

The fix is not better application code. It is an infrastructure layer that owns the state. The application should not be reconstructing conversation context from event streams. The platform should own it and expose it.
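What "the platform owns the state" means can be sketched in a few lines. Everything here is invented for illustration (the class, event names, and fields are not a real API): one state object per call, every mutation serialized through a single writer, so two simultaneous events for the same call are applied in order instead of overwriting each other.

```python
import threading

class CallSession:
    """Hypothetical sketch: one platform-owned state object per call,
    with every mutation serialized so concurrent events cannot clobber
    each other. Names and event types are invented for illustration."""

    def __init__(self, call_id):
        self.call_id = call_id
        self.step = "greeting"     # where the conversation is
        self.transcript = []       # what was said: owned here, not in the app
        self.tts_active = False    # is the AI currently speaking?
        self._lock = threading.Lock()

    def apply(self, event):
        """All producers (telephony, STT, TTS) post events here; nobody
        mutates the session directly. Two events arriving at the same
        moment are applied one after the other, never on top of each other."""
        with self._lock:
            kind = event["type"]
            if kind == "caller_said":
                self.transcript.append(("caller", event["text"]))
            elif kind == "tts_started":
                self.tts_active = True
            elif kind == "barge_in":
                # Caller interrupted mid-sentence: cancel playback and
                # record it in the same place the rest of the state lives.
                self.tts_active = False
                self.transcript.append(("system", "<tts cancelled>"))
            elif kind == "transfer":
                # The receiving leg reads this same object: context survives.
                self.step = "transferred"

    def snapshot(self):
        """The platform exposes state; the application reads it instead
        of reconstructing it from event streams."""
        with self._lock:
            return {"step": self.step, "transcript": list(self.transcript)}
```

The point of the sketch is the ownership boundary: a transfer or barge-in touches one object in one place, rather than forcing every application to rebuild context from whatever events it happened to see.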

Latency compounds across boundaries

A five-vendor voice AI stack (telephony provider, STT, LLM, TTS, orchestration layer) introduces a network hop at every boundary. Each hop adds 50 to 200 milliseconds. By the time the full pipeline completes, the total latency is 2 to 4 seconds.

Humans notice. At about 800 milliseconds, a pause in conversation starts to feel unnatural. At 2 seconds, the caller assumes the system is broken and starts talking again. The system hears the new input, cancels its response, processes the interruption, and generates a new response. Which takes another 2 to 4 seconds. The conversation collapses into overlapping fragments.

This is not a model problem. GPT-4 can start streaming a response in a couple hundred milliseconds. The latency comes from the architecture: five separate systems connected by network calls, each adding its own processing and transit time.

The fix is putting the AI inside the media path, not outside it, firing requests over the network. When speech recognition, language processing, and speech synthesis happen inside the same engine that handles the audio stream, the inter-system latency disappears. Barge-in detection becomes near-instant because the system processing audio is the same system running the AI. There is no round-trip.
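The arithmetic behind removing the hops can be made concrete. The per-component processing times below are invented for the sketch; the hop cost uses the 50 to 200 millisecond range per vendor boundary described above.

```python
# Illustrative back-of-envelope for one conversational turn through the
# five-vendor pipeline. Processing times are invented; hop costs use the
# 50-200 ms per-boundary range.

def pipeline_latency_ms(processing_ms, hop_ms_per_boundary, boundaries):
    """Total turn latency: every component processes, and every vendor
    boundary adds a network hop."""
    return sum(processing_ms) + hop_ms_per_boundary * boundaries

# telephony -> STT -> LLM -> TTS -> orchestration: five components.
processing = [150, 400, 800, 300, 100]  # ms, illustrative

best = pipeline_latency_ms(processing, 50, 5)       # 2000 ms
worst = pipeline_latency_ms(processing, 200, 5)     # 2750 ms
in_engine = pipeline_latency_ms(processing, 0, 0)   # 1750 ms: no hops left
```

Even holding every component's speed constant, collapsing the pipeline into one engine removes up to a second of pure transit time, which is the difference between a turn the caller waits out and one they talk over.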

Governance is an afterthought

This is the one that worries me most.

In text-based AI, a bad response is recoverable. The user reads it, notices the error, asks again. In voice, a bad response is spoken aloud to a human in real time. The AI skips a compliance disclosure. Makes a pricing commitment the business cannot honor. Reveals data the caller should not have access to. The output is linguistically fluent, so no monitoring system flags it. The damage is immediate and may be legally binding.

Klarna deployed AI customer service, claimed it replaced 700 agents, then hired humans back. Forrester predicts a third of brands will fail at AI self-service this year. The pattern is consistent: AI handles language well and decisions poorly.

The industry's default approach is to govern AI through prompts. Write instructions telling the model what not to do. Hope the model follows them. This is what I call "prompt and pray." It does not work under production conditions. Context windows fill up and the instructions get pushed out. Model updates change behavior without warning. Adversarial callers probe boundaries the prompt author did not anticipate.

The fix is not a better prompt. It is an architecture where decisions are made by deterministic code and language is handled by AI. The AI proposes what to say. Deterministic software validates whether it is allowed, checks what tools are available at this step, verifies that the state transition is legal, and only then lets the response through. The model handles the conversation. The code handles the rules.
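A minimal sketch of that separation, with step names, tools, and transitions invented for illustration: the model produces a proposal, and deterministic code decides whether it passes.

```python
# Hypothetical sketch of "AI proposes, deterministic code disposes".
# Steps, tools, and transitions are invented; in a real system they
# would come from the platform's call-flow definition.

ALLOWED_TOOLS = {
    "collect_info": {"lookup_account"},
    "quote":        {"lookup_account", "read_price_list"},
}
LEGAL_TRANSITIONS = {
    "collect_info": {"collect_info", "quote"},
    "quote":        {"quote", "close"},
}

def validate(step, proposal):
    """Deterministic gate: a proposal passes only if it uses tools
    permitted at this step and makes a legal state transition. The
    model never decides this; the code does."""
    if not set(proposal.get("tools", [])) <= ALLOWED_TOOLS.get(step, set()):
        return False, "tool not available at this step"
    if proposal["next_step"] not in LEGAL_TRANSITIONS.get(step, set()):
        return False, "illegal state transition"
    return True, "ok"

ok, _ = validate("collect_info",
                 {"tools": ["lookup_account"], "next_step": "quote"})
blocked, reason = validate("collect_info",
                           {"tools": ["read_price_list"], "next_step": "quote"})
```

Note what is absent: no prompt text. The rules hold regardless of what is in the context window, which model version is running, or what the caller says.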

This separation only works if you own the execution environment. You cannot govern what you do not control. If the AI runs as an external process making API calls to a telephony provider, governance is a request you make and hope the other side honors. If the AI runs inside the infrastructure that owns the call, governance is structural.

Why this is an infrastructure problem, not an AI problem

Every one of these three failures (state, latency, governance) is an infrastructure problem wearing an AI costume.

The AI community is focused on models. Which LLM. Which speech-to-speech architecture. How many tokens per second. Those are real questions, but they are not the questions that determine whether voice AI works in production.

The questions that matter in production are: who owns the state? How many network hops separate the audio from the intelligence? Who enforces the rules, and can the AI circumvent the enforcement? These are infrastructure questions. They have been infrastructure questions for the entire history of telecommunications. AI did not create them. AI exposed them, because AI amplifies whatever foundation it sits on.

Bessemer Venture Partners published a finding that 78% of AI failures are invisible. The system produces plausible but wrong output without triggering any alert. In voice, this is especially dangerous because the output is spoken, ephemeral, and difficult to audit at scale. The failures do not show up in dashboards. They show up in lawsuits, compliance violations, and customers who quietly leave.

What I learned from twenty years of real-time voice

When you operate voice infrastructure at carrier scale for two decades, you accumulate a specific kind of knowledge. Not the kind you get from reading papers or building demos. The kind you get from debugging a codec negotiation failure at 3am that only manifests when a specific carrier's SBC sends a malformed SDP, and the only way to reproduce it is to have that carrier's traffic flowing through your system.

Here is what that experience teaches you about what AI needs:

Real-time is a different discipline. Web applications are request-response. Voice is a continuous bidirectional stream with timing constraints measured in milliseconds. You cannot take a web architecture, add WebSockets, and call it real-time. The data structures, the concurrency models, the failure recovery patterns, and the testing methodologies are all different.

The phone network has opinions. PSTN, SIP, WebRTC, and every carrier in between have their own codec preferences, their own overbooking behaviors, their own interpretations of the RFCs. A system that works on one carrier's network may produce garbled audio on another because of a silent transcoding failure between G.711 and PCM. These are not bugs you can find in a test environment. You find them in production, across years, across carriers, across continents.

Scale changes the physics. One AI agent handling one call is straightforward. One AI agent handling two hundred concurrent calls generates thousands of state changes per second. The infrastructure requirements are qualitatively different. This is what a16z's Malika Aubakirova means when she talks about "agent-speed workloads" overwhelming systems designed for "human-speed traffic." A human makes one call at a time. An AI agent makes hundreds. The infrastructure that was built for humans breaks under agent-scale load.
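The scale gap is easy to see in back-of-envelope form. Both rates below are illustrative, not measurements:

```python
# Illustrative agent-speed vs human-speed load. Rates are invented
# for the sketch, not measured figures.

concurrent_calls = 200           # one AI agent's simultaneous calls
state_changes_per_call = 25      # per second: turns, partial transcripts,
                                 # barge-ins, tool invocations

agent_load = concurrent_calls * state_changes_per_call   # changes/sec

# A human makes one call at a time; same arithmetic, one call:
human_load = 1 * state_changes_per_call
```

At these rates a single agent generates 5,000 state changes per second, two hundred times the human-speed traffic the surrounding infrastructure was sized for.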

Governance has to be structural. In twenty years of operating voice infrastructure, I have learned that anything that depends on participants doing the right thing will eventually fail. Governance works when it is built into the architecture, not when it is an instruction. This applies to AI agents the same way it applies to every other component that has ever run on a voice network.

Where this goes

The voice AI market is projected to reach $47 billion by 2034. Most of the funding so far has gone to the application layer: wrappers that assemble third-party components into a product. These companies can move fast because they do not build infrastructure. They also cannot fix the three problems I described, because those problems live in the layer underneath them.

The infrastructure layer is where the problems are solved, and where the durable value accumulates. a16z calls these "toll collectors on the AI economy." Matei Zaharia at Berkeley calls the architecture "compound AI systems": deterministic components governing probabilistic models. Bessemer calls the need "harness infrastructure." They are all describing the same thing, and they are right.

I did not set out to build infrastructure for AI. I set out to build the best voice engine in the world, and I spent twenty years doing it. When AI voice arrived, the infrastructure it needed already existed. The real-time media processing, the state management, the protocol translation, the carrier integrations, the governance framework. All of it was there, built for a problem that had not arrived yet.

The next billion phone calls will be made by AI agents, not humans. The infrastructure for those calls will not be assembled from five vendors connected by network hops. It will be a single layer that owns the media, owns the state, and owns the governance. Not because that is a good architecture for AI. Because that is the architecture that twenty years of operating voice infrastructure proved is the only one that works.
