External state management is the single biggest source of production failures in voice AI. The fix is not better code. It is a different architecture.
Every voice AI platform that treats AI as an external process creates a fundamental problem: state lives in two places. Your application sits between the telephony layer and the AI layer. It receives audio, sends it to an LLM, receives responses, sends them to TTS, and streams audio back. At every step, your application tracks what was said, manages context, and handles barge-in. It also detects endpointing, coordinates transfers, recovers from failures, and prevents race conditions.
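The orchestration loop described above can be reduced to a sketch. The vendor classes here are stubs standing in for real STT/LLM/TTS services (every call to one represents a network hop), and all names are illustrative, not any real SDK:

```python
# A minimal sketch of the external-orchestrator loop, with stubbed
# vendors so it runs. The `state` dict is the app's private copy of
# conversation state -- the telephony provider holds its own copy.
class StubSTT:
    def transcribe(self, audio): return "where is my order"

class StubLLM:
    def complete(self, history): return "It shipped on Monday."

class StubTTS:
    def speak(self, text): return b"\x00" * 160  # fake audio frame

stt, llm, tts = StubSTT(), StubLLM(), StubTTS()

def handle_turn(state, audio):
    text = stt.transcribe(audio)                 # hops 1-2: audio out, text back
    state["history"].append(("user", text))
    reply = llm.complete(state["history"])       # hops 3-4: prompt out, reply back
    state["history"].append(("assistant", reply))
    return tts.speak(reply)                      # hops 5-6: text out, audio back

state = {"history": []}
audio_out = handle_turn(state, b"...")
print(len(state["history"]))  # 2 entries the app alone is tracking
```

Every turn adds round trips, and every append widens the gap between the app's view of the call and the provider's.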
This works in demos. It falls apart at scale.
Your app receives a hangup event after it already issued a transfer command. The transfer succeeds. The call is now live with no state tracking. Nobody knows the call exists.
Two concurrent events for the same call. Both read state, both mutate, one overwrites the other. The caller hears contradictory information in the same conversation.
The transfer target picks up during a brief network partition. Your app never receives the answer event. The call is bridged but your state shows it as still ringing.
The caller interrupts while the LLM is generating. Your app receives the full LLM response and sends it to TTS. The caller hears an answer to a question they already corrected.
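The concurrent-event scenario is the classic lost-update race. A minimal sketch, with invented handler and field names: two handlers for the same call each copy the state, mutate their copy, and write it back, and the second write silently discards the first.

```python
# Both handlers read the same snapshot before either wrote, so the
# transfer handler's write-back clobbers the transcript update.
call_state = {"status": "in_progress", "transcript": []}

def handle_transcript(snapshot, text):
    s = dict(snapshot)
    s["transcript"] = s["transcript"] + [text]
    return s

def handle_transfer(snapshot):
    s = dict(snapshot)
    s["status"] = "transferring"
    return s

snap_a = dict(call_state)   # concurrent read #1
snap_b = dict(call_state)   # concurrent read #2

call_state = handle_transcript(snap_a, "I need billing, not sales")
call_state = handle_transfer(snap_b)   # last write wins

print(call_state)  # {'status': 'transferring', 'transcript': []}
```

The caller's correction is gone, which is exactly how the same conversation ends up containing contradictory information.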
```python
from signalwire_agents import AgentBase
from signalwire_agents.core.function_result import SwaigFunctionResult


class SupportAgent(AgentBase):
    def __init__(self):
        super().__init__(name="Support Agent", route="/support")
        self.prompt_add_section(
            "Instructions",
            body="You are a customer support agent. "
                 "Greet the caller and resolve their issue.")
        self.add_language("English", "en-US", "rime.spore:mistv2")

    @AgentBase.tool(name="check_order")
    def check_order(self, order_id: str):
        """Check the status of a customer order.

        Args:
            order_id: The order ID to look up
        """
        return SwaigFunctionResult(f"Order {order_id}: shipped, ETA April 2nd")


agent = SupportAgent()
agent.run()
```
Barge-in is when the caller interrupts while the AI is still speaking. It is the most demanding test of voice AI architecture because it requires coordinating three things simultaneously: stopping the outbound audio, capturing what the caller is saying, and processing the new input in the context of what the caller actually heard, not what was generated.
On bolt-on architectures, your app detects speech activity on the inbound audio stream and sends a stop command to the TTS service. It estimates how much audio the caller heard based on buffer timing, which is unreliable across network conditions. Then it sends the new transcript to the LLM with an approximation of the state. Every step involves network latency. The caller hears 200-500ms of continued speech after interrupting.
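A rough budget shows where that 200-500ms comes from. Every figure below is an assumed typical value for illustration, not a measurement:

```python
# Back-of-envelope latency budget for stopping TTS from outside the
# media path in a bolt-on pipeline.
budget_ms = {
    "vad_detection_window": 100,   # VAD needs ~100ms of speech to trigger
    "media_to_app_hop": 50,        # speech event -> your application
    "app_to_tts_stop_hop": 50,     # stop command -> TTS vendor
    "buffered_audio_drain": 120,   # audio already queued downstream
}
print(sum(budget_ms.values()))  # 320 ms of continued speech after the interrupt
```

Each term is a separate system, so the terms add, and under bad network conditions the hops and the drain both grow.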
On SignalWire, the AI kernel detects speech activity from its position inside the media stack. Outbound audio stops within the media processing frame. The kernel knows exactly how many milliseconds of audio played and can approximate what text the caller heard, eliminating the detection delay of bolt-on pipelines and giving the next response accurate context.
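Mapping "milliseconds played" back to "text heard" is straightforward once per-word timings are available. A sketch, assuming illustrative word timings of the kind a TTS engine can report:

```python
# (word, start_ms, end_ms) for the utterance "Your order shipped on Monday"
words = [("Your", 0, 180), ("order", 180, 420), ("shipped", 420, 800),
         ("on", 800, 900), ("Monday", 900, 1350)]

def heard_text(played_ms):
    # Keep only the words that finished playing before the cutoff.
    return " ".join(w for w, _start, end in words if end <= played_ms)

print(heard_text(650))  # "Your order" -- 'shipped' was cut off mid-word
```

The next LLM turn can then be grounded in "Your order" rather than the full generated sentence the caller never heard.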
The AI moves from one conversation phase to another without a round-trip to your server. The media engine manages the state machine internally. No webhook to process, no state to reconstruct.
Function calling routes to your backend only for business logic. The platform handles orchestration, context, and lifecycle. Your code focuses on what it should: your business rules.
Context, authentication state, and conversation summary travel with the call inside the platform. The next agent or human representative knows what happened. No context reconstruction required.
The platform classifies errors into 10 types with fatal/non-fatal designations and applies recovery phrases without consulting your application. Callers hear graceful handling, not silence.
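The shape of that fatal/non-fatal split can be sketched as a classification table plus a dispatcher. The error names and taxonomy below are invented for illustration; the platform's actual classification is internal to it:

```python
RECOVERY_PHRASE = "Sorry, give me just a moment."

ERROR_CLASSES = {
    "tts_timeout":       False,  # non-fatal: retry behind a filler phrase
    "llm_rate_limited":  False,  # non-fatal: back off and retry
    "media_stream_lost": True,   # fatal: the call cannot continue
}

def handle_error(kind):
    fatal = ERROR_CLASSES.get(kind, True)  # unknown errors treated as fatal
    if fatal:
        return ("end_call", None)
    return ("continue", RECOVERY_PHRASE)   # caller hears speech, not silence

print(handle_error("tts_timeout"))
```

The point is where this logic lives: inside the platform, so recovery happens without a round trip to your application.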
| Concern | Bolt-On Architecture | SignalWire |
|---|---|---|
| Response latency | 2-4 seconds (6+ network hops) | 800-1200ms typical (orchestration overhead eliminated) |
| Barge-in accuracy | Approximate (network-delayed) | Exact (media-frame level) |
| State consistency | Eventual (distributed) | Immediate (single system) |
| Failure modes at scale | Per-vendor, compounding | Unified, classified, recoverable |
| Race conditions | Increase with call volume | Do not exist (single event loop) |
| Debug surface | 4-5 vendor dashboards | One trace, one log, one system |
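The bolt-on latency row can be sanity-checked with a back-of-envelope hop budget. All per-hop figures below are assumed typicals for illustration, not benchmarks:

```python
# One response turn through an externally orchestrated pipeline.
hops_ms = [
    ("telephony -> app (audio in)",   80),
    ("app -> STT, final transcript", 500),
    ("app -> LLM, first tokens",     900),
    ("LLM -> app",                    80),
    ("app -> TTS, first audio",      400),
    ("TTS -> telephony (audio out)",  80),
]
total = sum(ms for _, ms in hops_ms)
print(total)  # 2040 ms -- the low end of the 2-4 second range
```

Slower STT endpointing or a longer LLM queue pushes the same sum toward the top of the range; collapsing the hops into one system is what removes most of it.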
The AI kernel and media stack were built by the team that created FreeSWITCH. It is the open-source telephony engine behind the platforms you already know. Trillions of minutes processed. Two decades of media engineering in C. This is not a team that integrated with audio APIs. This is the team that wrote the audio processing layer and then embedded AI inside it.
When state lives in multiple systems (your app, the telephony provider, the AI services), events can arrive out of order. A hangup event can arrive after a transfer command. Two events for the same call can be processed simultaneously. These are inherent to distributed architectures, not bugs in any specific vendor.
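The hangup-after-transfer case can be made concrete. A sketch with invented event and call names: the app drops its call record on hangup, so when the reordered transfer answer arrives, the live bridged call belongs to nobody.

```python
calls = {"call-123": {"state": "transferring"}}

def on_event(event, call_id):
    if event == "hangup":
        calls.pop(call_id, None)        # app forgets the call entirely
    elif event == "transfer_answered":
        if call_id not in calls:
            return "orphaned"           # live, bridged call nobody tracks
        calls[call_id]["state"] = "bridged"
    return "ok"

# Network reordering: the leg-A hangup arrives before the transfer answer.
on_event("hangup", "call-123")
result = on_event("transfer_answered", "call-123")
print(result)  # "orphaned"
```

In-order delivery makes this code behave; reordered delivery produces a call that exists on the wire but not in any state store.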
The AI kernel orchestrates from inside the media stack with direct access to the audio stream. State lives in one place: the platform. The media engine holds runtime state. The control plane governs lifecycle. Events are processed in order within a single system. There is no external state to synchronize.
The caller finishes speaking and hears a response in about a second. Fast enough that the conversation feels natural. With speech-to-speech voice models, latency can be as low as 600ms. By comparison, bolt-on architectures typically deliver 2-4 seconds, which feels like the call dropped.
You still write webhooks for business logic: tool calls route to your backend. The difference is that the platform handles orchestration, context, and lifecycle. Your webhooks handle your business rules, not call state management.
AI orchestrated inside the media stack means sub-second latency and state that never splits.