
Architecture

Voice AI Fails When State Splits

External state management is the single biggest source of production failures in voice AI. The fix is not better code. It is a different architecture.

  • < 1.2s typical AI response latency
  • 1 platform for the full AI pipeline
  • 2,000+ companies in production
  • 2.7B minutes processed

The state problem

Every voice AI platform that treats AI as an external process creates a fundamental problem: state lives in two places. Your application sits between the telephony layer and the AI layer. It receives audio, sends it to an LLM, receives responses, sends them to TTS, and streams audio back. At every step, your application tracks what was said, manages context, and handles barge-in. It also detects endpointing, coordinates transfers, recovers from failures, and prevents race conditions.

This works in demos. It falls apart at scale.

Failure modes that emerge at scale

Zombie calls

Your app receives a hangup event after it already issued a transfer command. The transfer succeeds. The call is now live with no state tracking. Nobody knows the call exists.

Double updates

Two concurrent events for the same call. Both read state, both mutate, one overwrites the other. The caller hears contradictory information in the same conversation.

Phantom transfers

The transfer target picks up during a brief network partition. Your app never receives the answer event. The call is bridged but your state shows it as still ringing.

Stale responses

The caller interrupts while the LLM is generating. Your app receives the full LLM response and sends it to TTS. The caller hears an answer to a question they already corrected.
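The double-update failure is easy to reproduce in miniature. The sketch below is plain Python with a hypothetical in-memory store, not any vendor's API: two event handlers do an unsynchronized read-modify-write on the same call record, and one update is silently lost.

```python
import threading

# Hypothetical in-memory call-state store, as a bolt-on app might keep it.
call_state = {"call-123": {"status": "active", "last_prompt": None}}

def handle_event(prompt: str) -> None:
    # Read-modify-write without a lock: both handlers read the same
    # snapshot, and the slower write silently overwrites the faster one.
    state = dict(call_state["call-123"])
    state["last_prompt"] = prompt
    call_state["call-123"] = state

t1 = threading.Thread(target=handle_event, args=("your balance is $40",))
t2 = threading.Thread(target=handle_event, args=("your balance is $55",))
t1.start(); t2.start()
t1.join(); t2.join()

# Only one of the two updates survives; which one depends on scheduling.
print(call_state["call-123"]["last_prompt"])
```

A lock fixes this toy example, but in production the two "threads" are independent webhook deliveries hitting different app instances, where no in-process lock can help.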

These failures are the inevitable result of managing distributed state across independent systems with different timing guarantees. Every bolt-on architecture hits them at volume. The question is not whether, but when.

AI outside the call vs. AI inside the call

Bolt-On: AI Outside the Call

  • Audio streams to your server, then to STT, LLM, TTS
  • Six network hops minimum per conversational turn
  • State split across your app, telephony provider, and AI services
  • Race conditions multiply with call volume
  • Your app responsible for context, timing, and error recovery
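A single conversational turn in a bolt-on design looks roughly like the sketch below. The function names are stand-ins for real vendor SDK calls (stubbed here so the flow runs end to end); the point is the hop count, marked in comments.

```python
# Hypothetical bolt-on pipeline: every comment marked "hop" is a network
# round-trip. stt_transcribe, llm_complete, and tts_synthesize stand in
# for real vendor SDK calls.

def stt_transcribe(audio: bytes) -> str:
    return "what is my order status"          # hops 2-3: app -> STT -> app

def llm_complete(context: list, text: str) -> str:
    return "Your order shipped yesterday."    # hops 4-5: app -> LLM -> app

def tts_synthesize(text: str) -> bytes:
    return b"\x00" * 320                      # hop 6: app -> TTS -> app

def conversational_turn(inbound_audio: bytes, context: list) -> bytes:
    # hop 1: telephony provider streams caller audio to your server
    transcript = stt_transcribe(inbound_audio)
    context.append({"role": "user", "content": transcript})
    reply = llm_complete(context, transcript)
    context.append({"role": "assistant", "content": reply})
    # final hop: your server streams synthesized audio back to the provider
    return tts_synthesize(reply)
```

Every hop adds latency and, worse, a window in which an event (hangup, interrupt, transfer) can arrive while your app's picture of the call is mid-update.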

SignalWire: AI Inside the Call

  • AI kernel orchestrates from inside the media stack
  • No network hops between the audio stream and the orchestration layer
  • State lives in one place: the platform. The media engine holds runtime state. The control plane governs lifecycle.
  • Events processed in order within a single system
  • No external state to synchronize or reconcile

Build a Voice AI Agent

from signalwire_agents import AgentBase
from signalwire_agents.core.function_result import SwaigFunctionResult

class SupportAgent(AgentBase):
    def __init__(self):
        super().__init__(name="Support Agent", route="/support")
        self.prompt_add_section("Instructions",
            body="You are a customer support agent. "
                 "Greet the caller and resolve their issue.")
        self.add_language("English", "en-US", "rime.spore:mistv2")

    @AgentBase.tool(name="check_order")
    def check_order(self, order_id: str):
        """Check the status of a customer order.

        Args:
            order_id: The order ID to look up
        """
        return SwaigFunctionResult(f"Order {order_id}: shipped, ETA April 2nd")

agent = SupportAgent()
agent.run()

Barge-in: where architecture becomes audible

Barge-in is when the caller interrupts while the AI is still speaking. It is the most demanding test of voice AI architecture because it requires coordinating three things simultaneously: stopping the outbound audio, capturing what the caller is saying, and processing the new input in the context of what the caller actually heard, not what was generated.

On bolt-on architectures, your app detects speech activity on the inbound audio stream and sends a stop command to the TTS service. It estimates how much audio the caller heard based on buffer timing, which is unreliable across network conditions. Then it sends the new transcript to the LLM with an approximation of the state. Every step involves network latency. The caller hears 200-500ms of continued speech after interrupting.
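On a bolt-on pipeline, the best available reconstruction of "what the caller heard" is the buffer-timing estimate described above. A minimal sketch, assuming a fixed average speaking rate (a real system would need per-word timing from the TTS engine, if the vendor exposes it at all):

```python
# Hypothetical estimate of the text a caller heard before interrupting.
# WORDS_PER_SECOND is an assumed average speaking rate; jitter buffers and
# network conditions make the true figure vary, which is the core problem.
WORDS_PER_SECOND = 2.5

def heard_text(full_response: str, played_ms: int) -> str:
    words = full_response.split()
    heard = int(played_ms / 1000 * WORDS_PER_SECOND)
    return " ".join(words[:heard])

# Caller interrupted 2 seconds into the answer:
print(heard_text("Your order shipped yesterday and should arrive on Friday", 2000))
# -> "Your order shipped yesterday and"
```

The estimate is only as good as the timing data behind it; when the playback position is inferred across a network boundary, the LLM ends up reasoning about a sentence the caller may never have heard.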

On SignalWire, the AI kernel detects speech activity from its position inside the media stack. Outbound audio stops within the media processing frame. The kernel knows exactly how many milliseconds of audio played and what text the caller heard. Its direct audio access eliminates the detection delay of bolt-on pipelines, and the result is accurate context for the next response.

State changes at the speed of media

Step transitions

The AI moves from one conversation phase to another without a round-trip to your server. The media engine manages the state machine internally. No webhook to process, no state to reconstruct.

Tool calls

Function calling routes to your backend only for business logic. The platform handles orchestration, context, and lifecycle. Your code focuses on what it should: your business rules.

Transfers with context

Context, authentication state, and conversation summary travel with the call inside the platform. The next agent or human representative knows what happened. No context reconstruction required.

Error recovery

The platform classifies errors into 10 types with fatal/non-fatal designations and applies recovery phrases without consulting your application. Callers hear graceful handling, not silence.
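The pattern can be illustrated with a small sketch. Note that the class names, fatal flags, and recovery phrases below are hypothetical stand-ins, not SignalWire's actual taxonomy:

```python
# Illustrative error-classification sketch; names and phrases are invented.
from dataclasses import dataclass

@dataclass
class ErrorClass:
    name: str
    fatal: bool
    recovery_phrase: str

ERROR_CLASSES = {
    "tts_timeout": ErrorClass("tts_timeout", False, "One moment, please."),
    "llm_unavailable": ErrorClass("llm_unavailable", False, "Let me check that again."),
    "media_lost": ErrorClass("media_lost", True, "I'm sorry, we seem to have lost the line."),
}

def recover(error_name: str) -> str:
    cls = ERROR_CLASSES.get(error_name)
    if cls is None or cls.fatal:
        # Fatal or unknown errors end the call with a spoken close-out
        # instead of dead air.
        return ERROR_CLASSES["media_lost"].recovery_phrase
    # Non-fatal errors get a filler phrase while the platform retries.
    return cls.recovery_phrase
```

The architectural point is where this logic lives: inside the platform, where it can speak to the caller immediately, rather than in your app, several network hops away from the audio.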

Production comparison

| Concern | Bolt-On Architecture | SignalWire |
| --- | --- | --- |
| Response latency | 2-4 seconds (6+ network hops) | 800-1200ms typical (orchestration overhead eliminated) |
| Barge-in accuracy | Approximate (network-delayed) | Exact (media-frame level) |
| State consistency | Eventual (distributed) | Immediate (single system) |
| Failure modes at scale | Per-vendor, compounding | Unified, classified, recoverable |
| Race conditions | Increase with call volume | Do not exist (single event loop) |
| Debug surface | 4-5 vendor dashboards | One trace, one log, one system |

Built by the FreeSWITCH team

The AI kernel and media stack were built by the team that created FreeSWITCH, the open-source telephony engine behind platforms you already know. Trillions of minutes processed. Two decades of media engineering in C. This is not a team that integrated with audio APIs. This is the team that wrote the audio processing layer and then embedded AI inside it.

FAQ

What causes race conditions in voice AI?

When state lives in multiple systems (your app, the telephony provider, the AI services), events can arrive out of order. A hangup event can arrive after a transfer command. Two events for the same call can be processed simultaneously. These are inherent to distributed architectures, not bugs in any specific vendor.

How does SignalWire avoid distributed state?

The AI kernel orchestrates from inside the media stack with direct access to the audio stream. State lives in one place: the platform. The media engine holds runtime state. The control plane governs lifecycle. Events are processed in order within a single system. There is no external state to synchronize.

What does 800-1200ms response latency mean in practice?

The caller finishes speaking and hears a response in about a second. Fast enough that the conversation feels natural. With speech-to-speech voice models, latency can be as low as 600ms. By comparison, bolt-on architectures typically deliver 2-4 seconds, which feels like the call dropped.

Can I still use webhooks for business logic?

Yes. Tool calls route to your backend for business logic. The difference is that the platform handles orchestration, context, and lifecycle. Your webhooks handle your business rules, not call state management.


Build an agent. Call it. Listen to the difference.

AI orchestrated inside the media stack means sub-second latency and state that never splits.