The Communications Control Plane

Calls arrive over any protocol. Developers connect through any SDK. The control plane governs everything. Inside it, the media engine runs Voice AI, conferencing, recording, and queuing as native capabilities.

[Diagram: channels (phone network, SIP, WebRTC, WhatsApp, Browser SDK) connect through the developer interface (SDK, WebSocket, REST, webhooks, AI agents) to the control plane (routing, state, governance, protocol normalization), which fronts globally distributed media engines (Voice AI kernel, conferencing, recording, queuing, streaming). Nothing reaches the core without passing through the control plane.]

The Layer That Was Missing: The Control Plane

The architecture that communications infrastructure was missing

Anthony Minessale, CEO


SignalWire separates communications infrastructure into two architectural layers: a media engine that handles all real-time processing — voice AI, call control, conferencing, recording, queuing, IVR, messaging, fax, and payments — and a control plane that governs what the media engine does without being part of the media path. Because AI runs inside the media engine rather than outside it, orchestration overhead is eliminated, state is never fragmented across services, and capabilities compose natively instead of requiring separate vendor integrations.

The Layer That Was Missing

Every major infrastructure domain has a control plane. Networks have them. Kubernetes has one. Databases have transaction managers. The concept is well understood: a layer that owns state, enforces policy, and absorbs complexity so that the systems above it can focus on logic instead of plumbing.

Communications has had control planes too, but none with enough depth to make the hard things possible. Existing platforms give you APIs for placing calls and sending messages. They do not own the interaction. State lives in your application. Transfers lose context. AI runs outside the call. Conferencing, recording, queuing, compliance, and payments are separate integrations from separate vendors. Every team that tries to build something real ends up rebuilding the same invisible machinery, because the control plane underneath is too thin to carry it.

Communications infrastructure has two responsibilities: processing media and governing interactions. Most platforms conflate these into one undifferentiated layer, or split them into disconnected services. SignalWire separates them architecturally while keeping them in the same system.

The media engine handles the hard real-time work. The control plane governs what the media engine does and who gets access to it. This separation is the missing layer.

The media engine

SignalWire's media engine is powered by FreeSWITCH and was built by the founders of the FreeSWITCH project. The SignalWire stack adds a significant set of capabilities on top of FreeSWITCH: the control plane, the AI kernel, the developer interfaces, the distributed orchestration layer, and the managed infrastructure that makes it all work at carrier scale.

The media engine is where real-time processing happens. Voice calls, conferencing, recording, queuing, and AI orchestration all run as native capabilities inside it.

What runs inside the media engine

Voice AI Kernel. The AI kernel is embedded in the media processing pipeline. Because it sits directly in the media stack, the kernel has direct access to the audio stream. It does not wait for audio to traverse an orchestration boundary before processing begins. This position in the stack is what eliminates the orchestration overhead that slows down bolt-on architectures.

The kernel orchestrates STT, LLM, and TTS providers, which can be internal or external services. The kernel's advantage is not that every provider runs locally. The advantage is that the orchestration code manages the entire pipeline from a position inside the media path, optimizing timing, buffering, and handoffs between providers with direct access to the audio stream. Whether the STT provider is internal or external, the kernel's position means it starts processing earlier, coordinates faster, and delivers results back to the caller with fewer intermediate steps.
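The in-path position can be sketched as a small pipeline loop. This is an illustrative model only, with stub providers; every class and method name here is invented for the sketch and is not SignalWire's actual kernel code.

```python
# Illustrative model of in-path AI orchestration. All names are invented
# for this sketch; the stubs stand in for STT/LLM/TTS providers.
from dataclasses import dataclass

@dataclass
class Partial:
    text: str
    is_final: bool

class StubSTT:
    def __init__(self):
        self.buf = []
    def feed(self, frame):
        self.buf.append(frame)
        # Pretend the utterance is final once we see a silence marker.
        if frame == "<silence>":
            return Partial(" ".join(self.buf[:-1]), True)
        return Partial("", False)

class StubLLM:
    def complete(self, text):
        return f"You said: {text}"

class StubTTS:
    def synthesize(self, text):
        return f"<audio:{text}>"

class InPathKernel:
    """Coordinates STT -> LLM -> TTS from a position inside the media path."""
    def __init__(self, stt, llm, tts):
        self.stt, self.llm, self.tts = stt, llm, tts

    def on_audio_frame(self, frame):
        # The kernel sees raw frames directly: there is no orchestration
        # boundary to cross before transcription begins.
        partial = self.stt.feed(frame)
        if partial.is_final:
            reply = self.llm.complete(partial.text)
            # Synthesized audio is written straight back into the stream.
            return self.tts.synthesize(reply)
        return None

kernel = InPathKernel(StubSTT(), StubLLM(), StubTTS())
out = None
for frame in ["hello", "world", "<silence>"]:
    out = kernel.on_audio_frame(frame) or out
```

The point of the sketch is structural: the provider calls still happen, but the coordination loop runs where the audio already is, so no frame waits on a boundary crossing before processing starts.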

The kernel supports multiple languages per agent, tool calling with schema validation, step-based state machines with scoped prompts and scoped tools, speech-to-speech models (OpenAI, Amazon Nova Sonic), and a hidden data layer that passes information to tool handlers without entering the model's context window.
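Two of those features, schema-validated tool calling and step-scoped tools, can be sketched together. The structures below are illustrative, not SignalWire's API: the tool registry, step table, and function names are all invented for the example.

```python
# Sketch of schema-validated tool calling with step-scoped tools.
# All names and structures are invented for illustration.

TOOLS = {
    "lookup_order": {
        "schema": {"order_id": str},  # argument name -> required type
        "handler": lambda args: {"order_id": args["order_id"], "status": "shipped"},
    },
}

# Each step exposes only the tools it declares; the model cannot reach
# anything outside that scope.
STEPS = {
    "greet": {"tools": []},
    "order_status": {"tools": ["lookup_order"]},
}

def call_tool(step, name, args):
    if name not in STEPS[step]["tools"]:
        raise PermissionError(f"tool {name!r} not available in step {step!r}")
    tool = TOOLS[name]
    # Validate arguments against the declared schema before running anything.
    for key, typ in tool["schema"].items():
        if not isinstance(args.get(key), typ):
            raise TypeError(f"argument {key!r} must be {typ.__name__}")
    return tool["handler"](args)

result = call_tool("order_status", "lookup_order", {"order_id": "A123"})
```

Calling `lookup_order` from the `greet` step raises `PermissionError`: the scope is enforced by code around the model, not by prompt instructions.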

Call control. The media engine handles the full lifecycle of a call: answer, hangup, hold, unhold, transfer (blind, warm, attended), connect (serial, parallel, serial-parallel dialing), SIP REFER, and mid-call updates. Transfers preserve context. Hold plays media. Connect bridges to any destination: phone numbers, SIP endpoints, other agents, or conference rooms.

IVR and input collection. DTMF digit collection, speech recognition, prompt-and-collect patterns, digit bindings, and machine detection all run inside the media engine. The engine handles endpointing (detecting when a caller stops speaking), speech hints for recognition accuracy, and simultaneous DTMF and speech input.
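The prompt-and-collect pattern with simultaneous DTMF and speech input can be sketched as a race between input sources. This is a conceptual model, not the platform's API; the event tuples and function are invented for the example.

```python
# Sketch of prompt-and-collect with simultaneous DTMF and speech input.
# Event format and function names are invented for illustration.

def collect_input(prompt, events):
    """Play a prompt, then accept whichever input completes first:
    DTMF digits, or an utterance once endpointing detects the caller
    has stopped speaking."""
    played = prompt  # stands in for audio playback
    for kind, value in events:
        if kind == "dtmf":
            return {"type": "digits", "value": value}
        if kind == "speech_final":  # endpoint reached
            return {"type": "speech", "value": value}
        # partial speech results are ignored until the endpoint fires
    return {"type": "timeout", "value": None}

# The caller starts speaking, then keys digits instead; DTMF wins.
result = collect_input("Say or enter your account number.",
                       [("speech_partial", "four two"), ("dtmf", "42")])
```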

Media processing. Audio playback (files, TTS, silence, ringtones) with volume control, pause, resume, and looping. Background audio playback. Noise removal (denoise). Live transcription and live translation between languages. Audio streaming to external WebSocket endpoints. RTP tapping for media interception. Echo testing.

Conferencing. Multi-party audio mixing, floor control, mute/unmute, participant management, in-conference recording, and wait music. Conference rooms are addressable resources.

Recording. Call recording with format conversion (WAV, MP3), dual-channel separation, stereo support, direction selection (speak, hear, both), consent management, pause/resume control, and background recording with control IDs.

Queuing. Hold queues, callback queues, priority routing, wait announcements, queue position tracking, and agent availability management. A caller on hold is not parked in a separate system waiting for an API callback.

Messaging. SMS and MMS sending with delivery tracking, media attachments, and status callbacks. Messaging runs through the same platform, not a separate service.

Fax. Send and receive fax over SIP (T.38). Document conversion, fax tone detection, and async tracking with control IDs.

Payments. PCI-compliant payment collection via DTMF card entry, with configurable payment connectors, charge amounts, currencies, and retry logic.

Video rooms. Video conferencing with room management, access tokens, session recording, participant tracking, and RTMP streaming.

Why this matters

When AI, call control, conferencing, recording, queuing, IVR, messaging, fax, and payments all run inside the same media engine, three things change.

State is managed, not reconstructed. In a bolt-on architecture, your application becomes an event consumer: webhooks fire, your middleware tries to piece together what happened, and by the time it acts the interaction has already moved on. State gets lost between services. Events arrive out of order. Race conditions are structural, not bugs. You end up building brittle middleware whose primary job is reconstructing context that should never have been fragmented in the first place.

Because SignalWire manages state inside the media engine and control plane, the platform always knows the authoritative state of every interaction. Your application receives clean, ordered events about what is happening, not raw fragments it has to reassemble. Transfers carry full context. Recordings know what was said. The complexity of state synchronization across microservices disappears because there is nothing to synchronize.

Latency drops because orchestration starts at the audio. In a bolt-on architecture, audio leaves the telephony platform, crosses a network boundary to reach your application, then crosses more boundaries to reach STT, LLM, and TTS services. Each hop adds 20-100ms. The AI kernel eliminates the first and most expensive hops by sitting directly in the media path and coordinating the provider pipeline from inside the stack. Typical response latency is 800-1200ms, dropping as low as 600ms with speech-to-speech models. Bolt-on architectures typically produce 2-4 seconds.
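The arithmetic is worth making explicit. Using the per-hop figure above (20-100ms per boundary crossing), a back-of-the-envelope budget shows where the time goes; the hop counts and the assumed inference time below are illustrative, not measured values.

```python
# Back-of-the-envelope latency budget. Per-hop cost comes from the
# 20-100ms figure above; hop counts and inference time are assumptions.

def budget(hops, per_hop_ms, core_ms):
    """Total response latency: boundary crossings plus core processing
    (combined STT + LLM + TTS inference)."""
    return hops * per_hop_ms + core_ms

core = 700  # assumed combined STT/LLM/TTS inference time, ms

# Bolt-on path: platform -> app -> STT -> app -> LLM -> app -> TTS ->
# app -> platform, taken at the worst-case per-hop cost.
bolt_on = budget(hops=8, per_hop_ms=100, core_ms=core)  # 1500 ms

# In-path kernel: provider round-trips remain, but the app-layer
# boundary crossings are gone.
in_path = budget(hops=2, per_hop_ms=50, core_ms=core)   # 800 ms

saved = bolt_on - in_path                               # 700 ms
```

Inference time is the same in both columns; the difference is entirely in the boundary crossings, which is why position in the stack, not provider choice, drives the gap.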

Capabilities compose without integration. An AI agent can transfer a caller to a conference, start a recording, collect a payment, send an SMS summary, and return to the conversation. Each of these is a native operation inside the same media engine. On bolt-on systems, each capability is a separate vendor integration with its own API, its own error model, and its own state. Composing them requires glue code. On SignalWire, they compose because they share the same runtime.
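The composition claim can be sketched as operations chained on a single interaction object. The class and method names below are invented to mirror the scenario in the paragraph; they are not an actual SDK surface.

```python
# Sketch of capabilities composing as native operations on one
# interaction. Method names are invented for illustration.

class Interaction:
    """One interaction whose operations share a single runtime and
    state model, so composing them needs no cross-vendor glue code."""
    def __init__(self):
        self.log = []  # single ordered record of everything that happened

    def _do(self, op, **kw):
        self.log.append((op, kw))
        return self  # chaining: each op returns the same interaction

    def start_recording(self):              return self._do("record")
    def collect_payment(self, amount):      return self._do("pay", amount=amount)
    def send_sms(self, text):               return self._do("sms", text=text)
    def transfer_to_conference(self, room): return self._do("conference", room=room)

call = (Interaction()
        .start_recording()
        .collect_payment(19.99)
        .send_sms("Receipt: $19.99 paid")
        .transfer_to_conference("support"))
ops = [op for op, _ in call.log]
```

In a bolt-on system each of those four lines would be a different vendor's API with its own error model; here the sketch's single `log` stands in for the shared runtime and state.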

The control plane

The control plane is the boundary layer that wraps the media engine. It governs what happens inside the media engine without being part of the media processing path.

Every interaction passes through the control plane before reaching the media engine. Every developer interface connects through it. Every policy, routing rule, and governance constraint is enforced at this boundary.

What the control plane does

Protocol normalization. Calls arrive over PSTN, SIP, WebRTC, WhatsApp, and the Browser SDK. Each protocol has different signaling, different media formats, and different session semantics. The control plane normalizes these into a single interaction model before the media engine processes them. Developers write one set of logic regardless of how the call arrived.
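Normalization can be sketched as a mapping from protocol-specific signaling into one interaction shape. The field names below are invented for illustration; real SIP, PSTN, and WebRTC signaling carries far more than this.

```python
# Sketch of protocol normalization: different arrival protocols collapse
# into a single interaction model before the media engine sees them.
# Field names are invented for illustration.

def normalize(event):
    """Map protocol-specific signaling into one interaction shape."""
    if event["protocol"] == "sip":
        return {"from": event["From"], "to": event["To"], "channel": "sip"}
    if event["protocol"] == "pstn":
        return {"from": event["caller"], "to": event["called"], "channel": "pstn"}
    if event["protocol"] == "webrtc":
        return {"from": event["peer_id"], "to": event["room"], "channel": "webrtc"}
    raise ValueError(f"unknown protocol: {event['protocol']}")

a = normalize({"protocol": "sip",
               "From": "sip:alice@example.com", "To": "sip:bot@example.com"})
b = normalize({"protocol": "pstn",
               "caller": "+15550100", "called": "+15550199"})
```

Downstream logic sees only the normalized shape, which is why developers write one set of logic regardless of how the call arrived.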

Routing. The control plane decides which media engine instance handles an interaction, which resource receives it, and what happens when conditions change. Routing decisions are programmable through SWML, the SDK, or REST APIs.

State ownership. The control plane is the system of record for interaction state. Call metadata, transfer history, tool call results, and conversation context live in the control plane. When an interaction transfers between agents or media engine instances, state moves with it. The receiving agent has full context without reconstructing it from event streams.

Governance and System-Directed AI. Access control, compliance enforcement, and billing happen at the control plane boundary. The control plane decides which tools an AI agent can access in a given step, which data enters the model's context window, and which transitions are permitted. SignalWire calls this discipline System-Directed AI: deterministic software maintains authority over the AI model. The model operates inside constraints it cannot see or circumvent because the constraints exist in the control plane, not in the prompt. The model does not know it is being governed.
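The "constraints the model cannot see" idea can be sketched as a policy table consulted outside the prompt. The policy structure and function below are invented for illustration; they are not SignalWire's governance implementation.

```python
# Sketch of System-Directed AI: deterministic code decides what the
# model may do; the constraints live outside the prompt. All names
# here are invented for illustration.

POLICY = {
    "verify_identity": {"tools": {"check_pin"},
                        "next": {"account_menu"}},
    "account_menu":    {"tools": {"get_balance", "transfer_funds"},
                        "next": {"verify_identity"}},
}

def governed_request(step, requested_tool, requested_next):
    """The model requests; the control plane decides. A denied request
    is simply not executed -- nothing in the model's context window
    reveals or relaxes the policy."""
    rules = POLICY[step]
    return {
        "tool_allowed": requested_tool in rules["tools"],
        "transition_allowed": requested_next in rules["next"],
    }

# The model tries to move money before identity is verified.
decision = governed_request("verify_identity", "transfer_funds", "account_menu")
```

The denial never appears in the prompt, so no jailbreak of the prompt can reach it: the authority sits in code, which is the discipline the paragraph describes.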

What the control plane does not do

The control plane does not process audio. It does not mix conference streams. It does not synthesize speech. These are media engine responsibilities. The control plane governs the media engine. It does not replace it.

This distinction matters because it means the control plane can evolve independently from the media processing layer. New governance policies, new routing strategies, and new developer interfaces do not require changes to the media engine internals. New AI models, new codecs, and new media capabilities do not require changes to the governance layer. Two decades of carrier-grade engineering in the media engine, evolving independently from the developer-facing platform built on top of it.

The developer interface

Developers interact with the control plane, not the media engine directly. The control plane exposes three interfaces, all of which can trigger any media engine capability.

The SDK. Available in Python, TypeScript, Go, Java, C#/.NET, Ruby, PHP, Perl, and C++. The SDK provides four modes of interaction: WebSocket (RELAY) for real-time bidirectional call control, REST for synchronous operations, declarative JSON for serving agent definitions, and composable AI agents with tools for building governed voice AI applications. The same agent pattern works in every language: define an agent, add prompts, define tools, run.
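The four-step pattern (define an agent, add prompts, define tools, run) can be sketched generically. This mirrors the shape of the pattern only; the class below is invented for the sketch and is not SignalWire's actual SDK.

```python
# Generic sketch of the agent pattern: define, add prompts, define
# tools, run. This is not the SignalWire SDK; all names are invented.

class Agent:
    def __init__(self, name):
        self.name = name
        self.prompts = []
        self.tools = {}

    def add_prompt(self, text):
        self.prompts.append(text)
        return self

    def tool(self, name, handler):
        self.tools[name] = handler
        return self

    def run(self, tool_name, **args):
        # A real agent would serve live calls; here we just invoke a
        # registered tool to show the wiring.
        return self.tools[tool_name](**args)

agent = (Agent("support")
         .add_prompt("You are a concise support agent.")
         .tool("lookup", lambda order_id: {"order_id": order_id,
                                           "status": "shipped"}))
result = agent.run("lookup", order_id="A1")
```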

RELAY provides live call manipulation: inject prompts mid-call, transfer on external events, hold and unhold, play and stop audio, start and stop recording, bridge calls, and build supervisor dashboards. SignalWire introduced RELAY in 2018, years before AI voice existed as a category.

SWML. A declarative JSON/YAML format with 50+ methods for defining call flows and AI agents. SWML handles call control (answer, hangup, transfer, connect), media (play, record, stream, denoise), input collection (prompt, digit binding, speech recognition, machine detection), AI agents (with tools, steps, languages, and governance), conferencing, queuing, fax, messaging, payments, branching logic (if/switch/cond/goto), subroutines, variable management, and HTTP webhooks. Developers describe what should happen. The control plane and media engine handle how.
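A minimal SWML-style document, built here as a Python dict, shows the declarative shape. The structure (a version, named sections, a `main` section of method objects) follows the description above, but the exact field names are abbreviated from memory; consult the SWML reference for the authoritative syntax.

```python
# A minimal SWML-style document as a Python dict. Structure follows
# the description in the text; field names are illustrative, so check
# the SWML reference for exact syntax before using this shape.
import json

swml = {
    "version": "1.0.0",
    "sections": {
        "main": [
            {"answer": {}},                                          # call control
            {"play": {"url": "say:Connecting you to an assistant."}},# media
            {"ai": {"prompt": {"text": "You are a helpful receptionist."}}},
        ]
    },
}

# The serialized form is what a webhook or script resource would serve.
document = json.dumps(swml)
```

The developer describes what should happen, in order; routing the call to this document and executing each method is the control plane's and media engine's job.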

REST API. Synchronous HTTP endpoints for provisioning, configuration, and call control. Manage phone numbers (purchase, port, search, group), SIP endpoints and gateways, subscribers, AI agents, conference rooms, SWML scripts, RELAY applications, call flows with versioning, and Twilio-compatible cXML resources. The REST API also provides active call control: play, record, collect input, detect machines, stream audio, start AI sessions, and send messages, all on calls already in progress.

All three interfaces target the same control plane. A developer can prototype with SWML, build production agents with the SDK, and manage infrastructure with REST. The media engine is the same regardless of which interface the developer uses.

Call Fabric makes everything in the platform an addressable resource: phone numbers, SIP endpoints, AI agents, SWML scripts, conference rooms, subscribers, queues. The control plane routes to resources; the media engine processes them. A phone number can route to an AI agent, which can transfer to a conference room, which can bridge to a SIP endpoint, all within the same interaction lifecycle.

How this differs from bolt-on architectures

A typical voice AI deployment today involves five or more vendors: a telephony provider for phone numbers and SIP, a speech-to-text service, a language model, a text-to-speech service, and your own application code stitching them together with WebSocket streams and webhook callbacks. Add conferencing? Another vendor. Recording with compliance? Another integration. Payments? Another API. Each vendor has its own state model, its own error handling, its own logs, and its own billing.

Each boundary between services creates problems: latency (network round-trips), state fragmentation (each service has partial context), and failure modes (each integration can break independently). When something goes wrong at 2am, you check five dashboards with five different log formats and correlate timestamps across systems. Each vendor reports green. The caller experienced a 3-second delay.

In SignalWire's architecture, the control plane is the single governance boundary. The media engine is the single processing engine. AI, call control, conferencing, recording, queuing, IVR, messaging, fax, payments, and video all run inside the same media engine. There is one set of logs, one state model, one failure domain, and one team to call. The capabilities that require five vendors elsewhere are native operations here.

Globally distributed media engines

The architecture described here is not a single server. SignalWire deploys media engine instances globally, each running the full stack: AI kernel, call control, conferencing, recording, queuing, IVR, media processing, messaging, fax, and payments. The control plane coordinates across instances, handling routing, failover, and state transfer.

When a call arrives, the control plane routes it to the appropriate media engine instance based on geography, load, and application logic. If an instance fails, the control plane routes subsequent interactions elsewhere. The developer does not manage this. The control plane absorbs the complexity of distributed deployment.
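The routing decision described here can be sketched as a selection over instances by health, geography, and load. The instance table and function below are invented for illustration; real routing also weighs application logic, as the paragraph notes.

```python
# Sketch of control-plane routing across media engine instances:
# prefer the caller's region, pick the least-loaded healthy instance,
# and fall back elsewhere on failure. All data here is invented.

INSTANCES = [
    {"id": "us-east", "region": "us", "load": 0.9, "healthy": True},
    {"id": "us-west", "region": "us", "load": 0.3, "healthy": True},
    {"id": "eu-1",    "region": "eu", "load": 0.2, "healthy": False},
    {"id": "eu-2",    "region": "eu", "load": 0.5, "healthy": True},
]

def route(call_region):
    """Prefer healthy instances in the caller's region, least-loaded
    first; fall back to any healthy instance if the region is down."""
    healthy = [i for i in INSTANCES if i["healthy"]]
    local = [i for i in healthy if i["region"] == call_region]
    pool = local or healthy  # failover: leave the region if necessary
    return min(pool, key=lambda i: i["load"])["id"]
```

With the table above, a US call lands on `us-west` (local, least loaded) and an EU call lands on `eu-2`, because `eu-1` is unhealthy and is never considered; the developer-facing behavior is just "the call connected."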

This is the layer that was missing. Not a control plane as a concept, but a control plane with enough depth to make AI, telephony, conferencing, recording, messaging, and everything else compose as one system. The infrastructure underneath the vendor chain, not another link in it.

Ready to build better voice AI? Join our developer community on Discord and get started with a free SignalWire account.
