Core Technology

Intelligence Inside the Media Stack

Other platforms bolt AI on top of telephony. SignalWire embeds AI inside the audio processing layer. The latency difference is audible.

< 1.2s
typical AI response latency
1
platform for the full AI pipeline
2.7B
minutes processed
$0.16
per minute, AI processing

Why bolt-on voice AI sounds bad

In a bolt-on architecture, audio leaves the telephony platform and travels to a speech-to-text service. The transcript goes to your server, which calls an LLM. The LLM response goes to a text-to-speech service, and the audio streams back to the caller. Six network hops minimum. Each hop adds latency, serialization overhead, and a failure mode.

The result: 2-4 seconds between the caller finishing a sentence and the AI responding. Long enough that callers wonder if the call dropped. Long enough that barge-in (interrupting the AI) feels sluggish and broken. No amount of prompt optimization can fix latency that lives in the architecture.

Two architectures for voice AI

Bolt-On: Audio Leaves the Platform

  • Audio streams to your server (typical 20-50ms)
  • Your server sends to STT provider (typical 20-50ms)
  • STT processing (typical 200-500ms)
  • Transcript to your server to LLM (typical 40-100ms)
  • LLM inference (typical 300-800ms)
  • Response to your server to TTS (typical 40-100ms)
  • TTS synthesis (typical 200-500ms)
  • Audio back to caller (typical 40-100ms)
  • Typical total: 860-2,200ms per turn

SignalWire: AI Inside the Call

  • Internal STT (~150ms)
  • Internal LLM inference (~400ms)
  • Internal TTS (~150ms)
  • Internal media routing (~10ms)
  • Network, caller to platform (~50ms)
  • Total: ~800-1200ms typical
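The two latency budgets above can be sanity-checked with simple arithmetic. A minimal sketch, using the page's typical per-stage figures (estimates, not measurements):

```python
# Per-stage latency estimates in milliseconds, as (min, max) pairs,
# taken from the typical figures listed above.

BOLT_ON_MS = {
    "audio to your server": (20, 50),
    "server to STT provider": (20, 50),
    "STT processing": (200, 500),
    "transcript to LLM": (40, 100),
    "LLM inference": (300, 800),
    "response to TTS": (40, 100),
    "TTS synthesis": (200, 500),
    "audio back to caller": (40, 100),
}

IN_STACK_MS = {
    "internal STT": (150, 150),
    "internal LLM inference": (400, 400),
    "internal TTS": (150, 150),
    "internal media routing": (10, 10),
    "caller-to-platform network": (50, 50),
}

def total(stages):
    """Sum the per-stage minimums and maximums for one pipeline."""
    lo = sum(a for a, _ in stages.values())
    hi = sum(b for _, b in stages.values())
    return lo, hi

print("bolt-on:", total(BOLT_ON_MS))    # (860, 2200)
print("in-stack:", total(IN_STACK_MS))  # (760, 760)
```

The bolt-on stages sum to 860-2,200ms before any retries or jitter; the in-stack component estimates sum to roughly 760ms, which is why observed totals land in the ~800-1,200ms range once real-world variance is included.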

Build a Voice AI Agent

from signalwire_agents import AgentBase
from signalwire_agents.core.function_result import SwaigFunctionResult

class SupportAgent(AgentBase):
    def __init__(self):
        # Name the agent and mount it at the /support route
        super().__init__(name="Support Agent", route="/support")
        # The prompt section steers the model's behavior on the call
        self.prompt_add_section("Instructions",
            body="You are a customer support agent. "
                 "Greet the caller and resolve their issue.")
        # Language code plus the TTS voice used for synthesized speech
        self.add_language("English", "en-US", "rime.spore:mistv2")

    # A SWAIG tool the model can call mid-conversation
    @AgentBase.tool(name="check_order")
    def check_order(self, order_id: str):
        """Check the status of a customer order.

        Args:
            order_id: The order ID to look up
        """
        return SwaigFunctionResult(f"Order {order_id}: shipped, ETA April 2nd")

agent = SupportAgent()
agent.run()  # start serving the agent

Barge-in: the architecture test

Barge-in is when the caller interrupts while the AI is speaking. It is the most demanding real-time coordination challenge in voice AI. The system must detect speech on the inbound audio while outbound audio is playing and stop that audio immediately. It must then capture the new input, determine what the caller heard before interrupting, and process the new input with accurate context.

On bolt-on architectures, your application detects voice activity, delayed by network latency. It sends a stop command to TTS (another network hop) and estimates what the caller heard based on audio buffer timing (unreliable). The caller hears 200-500ms of continued speech after interrupting. The context for the next response is approximate.

On SignalWire, the AI kernel detects voice activity in the same audio processing frame. Outbound audio stops within the media processing cycle. The kernel knows exactly how many milliseconds of audio played and approximates what text the caller actually heard. Near-zero continued speech. Precise context for the next response.
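The frame-level coordination described above can be sketched in a few lines. This is an illustrative model, not SignalWire's implementation; the `vad` and `player` interfaces are hypothetical stand-ins for the kernel's voice activity detector and outbound playback:

```python
FRAME_MS = 20  # duration of one audio processing frame

class BargeInController:
    """Illustrative frame-level barge-in: because inbound detection and
    outbound playback run in the same processing loop, playback stops in
    the same frame in which speech is detected, with no network hop."""

    def __init__(self, vad, player):
        self.vad = vad          # hypothetical voice activity detector
        self.player = player    # hypothetical outbound TTS playback
        self.played_ms = 0      # exact audio delivered to the caller

    def on_frame(self, inbound_frame):
        if self.player.is_playing():
            self.played_ms += FRAME_MS
        if self.player.is_playing() and self.vad.is_speech(inbound_frame):
            self.player.stop()  # cut playback within this frame
            # played_ms records exactly how much the caller heard,
            # so the next turn's context can be trimmed precisely
            return {"barge_in": True, "heard_ms": self.played_ms}
        return {"barge_in": False}
```

In a bolt-on pipeline the equivalent stop command crosses the network twice (detection in, stop out), which is where the 200-500ms of continued speech comes from.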

Endpointing: knowing when the caller is done

Endpointing determines when the caller has finished speaking. Too aggressive: the AI cuts the caller off mid-sentence. Too conservative: the AI waits awkwardly after the caller has finished.

On bolt-on architectures, your STT provider makes the endpointing decision. The decision travels across the network to your app. Latency between "caller stopped speaking" and "AI starts responding" includes the full network round-trip.

On SignalWire, the AI kernel detects speech endpoints from its position inside the media stack. The transition from "listening" to "processing" happens without the overhead of bolt-on pipelines. The AI responds immediately after the caller finishes.
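A common baseline for endpointing is a silence threshold evaluated per audio frame: the utterance ends after N consecutive non-speech frames following speech. A minimal sketch of that idea (illustrative only, not the kernel's actual algorithm):

```python
FRAME_MS = 20  # duration of one audio frame

class Endpointer:
    """Silence-threshold endpointing: declare the utterance finished
    after `silence_ms` of consecutive non-speech following speech."""

    def __init__(self, silence_ms=500):
        self.needed = silence_ms // FRAME_MS  # silence frames required
        self.silent = 0
        self.heard_speech = False

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one frame's speech/non-speech decision.
        Returns True once the utterance has ended."""
        if is_speech:
            self.heard_speech = True
            self.silent = 0
        elif self.heard_speech:
            self.silent += 1
        return self.heard_speech and self.silent >= self.needed
```

Running in-process, this decision fires on the frame that crosses the threshold. In a bolt-on pipeline the same decision is made by the STT provider and then travels the network before your application can act on it.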

What the AI kernel controls

Function            | AI Kernel (SignalWire)        | Bolt-On Architecture
Speech detection    | In-process                    | STT provider, via network
Transcription       | In-process                    | STT provider, via network
Language inference  | In-process, model-agnostic    | LLM provider, via network
Speech synthesis    | In-process                    | TTS provider, via network
Barge-in detection  | Audio frame level             | Network-delayed, approximate
Endpointing         | Audio frame level             | Provider-dependent, delayed
Context management  | Platform-native state         | Your application's state store
Error recovery      | Classified (10-type taxonomy) | Per-vendor, uncoordinated

Built for production

Model-agnostic design

The AI kernel integrates with multiple LLM providers. The tight integration is between the kernel and the audio pipeline, not the kernel and a specific model. Switch providers without changing agent code.

Built by the FreeSWITCH team

Created by the team behind FreeSWITCH, the telephony engine behind platforms you already know. Trillions of voice minutes processed across FreeSWITCH deployments worldwide. Two decades of low-level media engineering in C.

Not an integration layer

This team wrote the media processing layer and embedded AI inside it. Competitors building on top of someone else's media stack cannot replicate this architecture. The gap is structural, not feature-based.

💡 The kernel's position inside the media stack eliminates the orchestration overhead of bolt-on pipelines. Competitors cannot close this gap with faster models or better prompts. The advantage is architectural.

From install to production call

1. Install the SDK

   pip install signalwire-agents. One package, no vendor chain to assemble.

2. Define your agent

   Set a prompt, add tools for your business logic. Python or YAML, your choice.

3. Call it and listen

   The latency difference is audible on the first call. 800-1200ms vs. 2-4 seconds. No tuning required.

4. Deploy to production

   Same performance at 10 calls or 10,000 calls. The architecture does not degrade under load because the orchestration runs inside the media stack.

FAQ

What does 'AI inside the media stack' mean technically?

The AI kernel sits inside the media stack and orchestrates speech recognition, language model inference, and speech synthesis with direct access to the audio stream. The kernel's position in the stack eliminates the orchestration overhead of bolt-on pipelines.

Can I bring my own LLM provider?

Yes. The platform is model-agnostic. The tight integration is between the AI kernel and the audio pipeline. Switch LLM providers without changing your agent code.

What does $0.16/min include?

AI processing: speech-to-text, language model inference, text-to-speech, and orchestration. Transport (PSTN, SIP) is billed separately at carrier rates. One invoice. No per-component billing for AI.
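The pricing model above reduces to two per-minute rates. A quick arithmetic sketch, using the page's $0.16/min AI rate and a hypothetical placeholder for the separately billed carrier transport rate:

```python
AI_RATE_PER_MIN = 0.16       # AI processing rate stated on this page
TRANSPORT_PER_MIN = 0.01     # hypothetical carrier rate, for illustration only

def call_cost(minutes, transport_rate=TRANSPORT_PER_MIN):
    """Total cost of one call: flat AI rate plus transport, no
    per-component (STT/LLM/TTS) line items."""
    return round(minutes * (AI_RATE_PER_MIN + transport_rate), 2)

print(call_cost(5))   # a 5-minute call at 0.16 + 0.01 per minute
```

The point of the flat rate is that the STT, LLM, TTS, and orchestration costs never appear as separate vendor invoices.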

How does barge-in work at the audio frame level?

The AI kernel detects voice activity from its position inside the media stack. It stops outbound audio within the media processing frame, records exactly how many milliseconds of audio played, and estimates what the caller heard. The kernel's direct access to the audio stream makes this possible without the delay of bolt-on pipelines.


Build an agent. Call it. The difference is audible.

Intelligence inside the media stack, not bolted on top of it.