Other platforms bolt AI on top of telephony. SignalWire embeds it inside the audio processing layer. The latency difference is audible.
In a bolt-on architecture, audio leaves the telephony platform and travels to a speech-to-text service. The transcript goes to your server, which calls an LLM. The LLM response goes to a text-to-speech service, and the audio streams back to the caller. Six network hops minimum. Each hop adds latency, serialization overhead, and a failure mode.
The result: 2-4 seconds between the caller finishing a sentence and the AI responding. Long enough that callers wonder if the call dropped. Long enough that barge-in (interrupting the AI) feels sluggish and broken. No amount of prompt optimization can fix latency that lives in the architecture.
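For contrast, here is roughly what one conversational turn looks like as application code on a bolt-on pipeline. The `stt`, `llm`, and `tts` objects are placeholders for hypothetical vendor SDK clients, not any specific product; each awaited call is a network round trip the caller waits through.

```python
# A minimal sketch of one caller turn in a bolt-on pipeline (illustrative only).
# The stt, llm, and tts arguments stand in for hypothetical vendor SDK clients;
# every awaited call below is a separate network round trip.
async def handle_turn(caller_audio: bytes, stt, llm, tts) -> bytes:
    transcript = await stt.transcribe(caller_audio)   # hops 1-2: audio out, transcript back
    reply_text = await llm.complete(transcript)       # hops 3-4: prompt out, completion back
    reply_audio = await tts.synthesize(reply_text)    # hops 5-6: text out, audio back
    return reply_audio                                # finally streamed back to the caller
```

A SignalWire agent, by comparison, is a single class; the platform handles the media path: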
```python
from signalwire_agents import AgentBase
from signalwire_agents.core.function_result import SwaigFunctionResult


class SupportAgent(AgentBase):
    def __init__(self):
        super().__init__(name="Support Agent", route="/support")
        # System prompt for the agent
        self.prompt_add_section(
            "Instructions",
            body="You are a customer support agent. "
                 "Greet the caller and resolve their issue.",
        )
        # Language and TTS voice
        self.add_language("English", "en-US", "rime.spore:mistv2")

    # A tool the AI can call during the conversation
    @AgentBase.tool(name="check_order")
    def check_order(self, order_id: str):
        """Check the status of a customer order.

        Args:
            order_id: The order ID to look up
        """
        return SwaigFunctionResult(f"Order {order_id}: shipped, ETA April 2nd")


agent = SupportAgent()
agent.run()
```
Barge-in is when the caller interrupts while the AI is speaking. It is the most demanding real-time coordination challenge in voice AI. The system must detect speech on the inbound audio while outbound audio is playing and stop that audio immediately. It must then capture the new input, determine what the caller heard before interrupting, and process the new input with accurate context.
On bolt-on architectures, your application detects voice activity, delayed by network latency. It sends a stop command to TTS (another network hop) and estimates what the caller heard based on audio buffer timing (unreliable). The caller hears 200-500ms of continued speech after interrupting. The context for the next response is approximate.
On SignalWire, the AI kernel detects voice activity in the same audio processing frame. Outbound audio stops within the media processing cycle. The kernel knows exactly how many milliseconds of audio played and approximates what text the caller actually heard. Near-zero continued speech. Precise context for the next response.
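As a toy illustration of that bookkeeping (not the SignalWire API), the next response's context can be trimmed to roughly what the caller actually heard, given how much of the reply audio played before the interruption. The function and its parameters below are hypothetical.

```python
# Toy sketch, not SignalWire's implementation: approximate the text a caller
# heard before barging in, from the fraction of the reply audio that played.
def heard_before_interrupt(reply_text: str, played_ms: int, total_ms: int) -> str:
    if total_ms <= 0 or played_ms <= 0:
        return ""
    fraction = min(played_ms / total_ms, 1.0)
    words = reply_text.split()
    return " ".join(words[: round(len(words) * fraction)])

# e.g. heard_before_interrupt("Your order shipped and arrives April 2nd", 900, 2700)
# returns roughly the first third of the sentence.
```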
Endpointing determines when the caller has finished speaking. Too aggressive: the AI cuts the caller off mid-sentence. Too conservative: the AI waits awkwardly after the caller has already finished.
On bolt-on architectures, your STT provider makes the endpointing decision. The decision travels across the network to your app. Latency between "caller stopped speaking" and "AI starts responding" includes the full network round-trip.
On SignalWire, the AI kernel detects speech endpoints from its position inside the media stack. The transition from "listening" to "processing" happens without the overhead of bolt-on pipelines. The AI responds immediately after the caller finishes.
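As a rough sketch of the tradeoff (not the kernel's actual algorithm), endpointing can be thought of as declaring the turn over once a window of silence follows detected speech; the size of that window is the aggressive-versus-conservative dial. The frame and silence durations below are illustrative, not platform settings.

```python
# Toy endpointing sketch, not the kernel's algorithm. vad_frames is a list of
# per-frame voice-activity booleans; a smaller silence_ms is more aggressive
# (risks cutting the caller off), a larger one is more conservative (adds delay).
def find_endpoint(vad_frames, frame_ms=20, silence_ms=600):
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += frame_ms
            if silent_run >= silence_ms:
                return i  # frame index where the endpoint is declared
    return None  # caller has not finished speaking yet
```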
| Function | AI Kernel (SignalWire) | Bolt-On Architecture |
|---|---|---|
| Speech detection | In-process | STT provider, via network |
| Transcription | In-process | STT provider, via network |
| Language model inference | In-process, model-agnostic | LLM provider, via network |
| Speech synthesis | In-process | TTS provider, via network |
| Barge-in detection | Audio frame level | Network-delayed, approximate |
| Endpointing | Audio frame level | Provider-dependent, delayed |
| Context management | Platform-native state | Your application's state store |
| Error recovery | Classified (10-type taxonomy) | Per-vendor, uncoordinated |
The AI kernel integrates with multiple LLM providers. The tight integration is between the kernel and the audio pipeline, not the kernel and a specific model. Switch providers without changing agent code.
Created by the team behind FreeSWITCH, the telephony engine behind platforms you already know. Trillions of voice minutes processed across FreeSWITCH deployments worldwide. Two decades of low-level media engineering in C.
This team wrote the media processing layer and embedded AI inside it. Competitors building on top of someone else's media stack cannot replicate this architecture. The gap is structural, not feature-based.
`pip install signalwire-agents`. One package, no vendor chain to assemble.
Set a prompt, add tools for your business logic. Python or YAML, your choice.
The latency difference is audible on the first call: 800-1200 ms vs. 2-4 seconds. No tuning required.
Same performance at 10 calls or 10,000 calls. The architecture does not degrade under load because the orchestration runs inside the media stack.
The AI kernel sits inside the media stack and orchestrates speech recognition, language model inference, and speech synthesis with direct access to the audio stream. The kernel's position in the stack eliminates the orchestration overhead of bolt-on pipelines.
The platform is model-agnostic. The tight integration is between the AI kernel and the audio pipeline. Switch LLM providers without changing your agent code.
AI processing: speech-to-text, language model inference, text-to-speech, and orchestration. Transport (PSTN, SIP) is billed separately at carrier rates. One invoice. No per-component billing for AI.
The AI kernel detects voice activity from its position inside the media stack. It stops outbound audio within the media processing frame, records exactly how many milliseconds of audio played, and estimates what the caller heard. The kernel's direct access to the audio stream makes this possible without the delay of bolt-on pipelines.
Intelligence inside the media stack, not bolted on top of it.