Most platforms measure one step and call it latency. Full roundtrip is what callers feel: sentence ends, AI responds. That gap decides whether they stay or hang up.
Trusted by 2,000+ companies
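In a bolt-on stack, audio travels from PSTN to telephony provider, telephony to WebSocket, WebSocket to your server, server to STT, STT to LLM, LLM to TTS, and TTS back through the chain. Each hop adds latency, from about 20ms for a simple network transit to as much as 800ms for the LLM step.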
STT-to-first-token measures one step. TTS time-to-first-byte measures another. Neither measures the time a caller waits between finishing a sentence and hearing the AI respond.
Switching to a faster TTS provider saves 200ms but does not remove the other five network boundaries. Co-locating servers helps, but four to six network transits remain.
Streaming STT and TTS reduces batch delays. Each stream still crosses a network boundary. Streaming over WebSocket to an external service is faster than batch, but slower than processing inside one engine.
from signalwire_agents import AgentBase
from signalwire_agents.core.function_result import SwaigFunctionResult


class SupportAgent(AgentBase):
    def __init__(self):
        # Register the agent under the /support route.
        super().__init__(name="Support Agent", route="/support")
        # System prompt section that tells the model how to behave.
        self.prompt_add_section("Instructions",
            body="You are a customer support agent. "
                 "Greet the caller and resolve their issue.")
        # Language, locale, and TTS voice used for responses.
        self.add_language("English", "en-US", "rime.spore:mistv2")

    # Expose check_order as a tool the model can call during the conversation.
    @AgentBase.tool(name="check_order")
    def check_order(self, order_id: str):
        """Check the status of a customer order.

        Args:
            order_id: The order ID to look up
        """
        return SwaigFunctionResult(f"Order {order_id}: shipped, ETA April 2nd")


agent = SupportAgent()
agent.run()
| Platform | Measured latency | Source |
|---|---|---|
| Twilio | 950ms average | Telnyx: Voice AI Agents Compared |
| Vonage | 800 to 1,200ms | Telnyx: Voice AI Agents Compared |
| Vapi (India region) | 1,450ms | Trustpilot reviews, production reports |
| Bland AI | 800ms average | G2 reviews |
| DIY WebSocket stack | 1,920ms median | DEV Community benchmark |
| DIY WebRTC stack | 2,060ms median | DEV Community benchmark |
| LiveKit + Twilio (EU) | 4,000ms+ per turn | GitHub issues, production reports |
| SignalWire | 800 to 1,200ms typical | Full roundtrip measurement |
| Hop | What happens | Latency added |
|---|---|---|
| PSTN to telephony provider | Call ingress, media stream setup | 50 to 100ms |
| Telephony to WebSocket | Base64 encode mu-law, open stream | 30 to 80ms |
| WebSocket to your server | Network transit, decode, buffer | 20 to 50ms |
| Server to STT | Codec convert, stream audio, wait for transcript | 200 to 400ms |
| STT to LLM | Send transcript, wait for first tokens | 200 to 800ms |
| LLM to TTS | Send text, wait for first audio chunk | 150 to 400ms |
| TTS back through chain | Encode, transmit, decode at each boundary | 120 to 250ms |
| Total | | 770 to 2,080ms |
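
The total row can be checked by summing the per-hop ranges. The following is a minimal sketch of that arithmetic; the `HOPS` list and `total_budget` function are illustrative names, and the numbers are copied from the table above.

```python
# Per-hop latency ranges in milliseconds, copied from the hop table.
# HOPS and total_budget are illustrative names, not part of any SDK.
HOPS = [
    ("PSTN to telephony provider", 50, 100),
    ("Telephony to WebSocket", 30, 80),
    ("WebSocket to your server", 20, 50),
    ("Server to STT", 200, 400),
    ("STT to LLM", 200, 800),
    ("LLM to TTS", 150, 400),
    ("TTS back through chain", 120, 250),
]


def total_budget(hops):
    """Return (best_case_ms, worst_case_ms) by summing each hop's range."""
    best = sum(low for _, low, _ in hops)
    worst = sum(high for _, _, high in hops)
    return best, worst


best_ms, worst_ms = total_budget(HOPS)
print(f"Best case: {best_ms}ms, worst case: {worst_ms}ms")
```

Replacing the ranges with numbers measured on your own stack shows which hop dominates the budget.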
PSTN ingress with no external telephony provider in the path. The audio is already inside the engine.
Audio is processed every 250ms during speech. No waiting for the caller to finish before transcription begins.
The transcript streams to the LLM while the caller is still speaking. Response generation overlaps with transcription.
No network hop to an external synthesis service. Audio goes from TTS to PSTN egress without leaving the platform.
| Latency | Caller experience | Business impact |
|---|---|---|
| Under 500ms | Feels instantaneous | Optimal engagement |
| 500 to 800ms | Slight pause, still conversational | Acceptable for most use cases |
| 800 to 1,200ms | Noticeable delay, like a bad international call | Callers start talking over the agent |
| 1,200 to 2,000ms | Awkward pauses, callers check if the line dropped | 40% increase in call abandonment |
| Above 2,000ms | Caller hangs up or asks for a human | Support escalation, lost revenue |
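
To apply these bands to your own measurements, a small lookup is enough. This is a sketch, not an official tool: `caller_experience` is a hypothetical helper, and because the table's ranges share their boundary values (800ms appears in two rows), it assumes a boundary value belongs to the faster band.

```python
def caller_experience(latency_ms):
    """Map a full-roundtrip latency in ms to the caller experience band.

    Assumption: shared boundary values (800, 1,200, 2,000) are assigned
    to the faster band, since the table does not say which row owns them.
    """
    if latency_ms < 500:
        return "Feels instantaneous"
    if latency_ms <= 800:
        return "Slight pause, still conversational"
    if latency_ms <= 1200:
        return "Noticeable delay, like a bad international call"
    if latency_ms <= 2000:
        return "Awkward pauses, callers check if the line dropped"
    return "Caller hangs up or asks for a human"


for sample_ms in (450, 950, 2500):
    print(sample_ms, "->", caller_experience(sample_ms))
```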
Full roundtrip: the moment the caller stops speaking to the moment the caller hears the AI respond. Not a partial metric like STT-to-first-token or TTS time-to-first-byte. With speech-to-speech voice models, latency can be as low as 600ms.
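
One way to capture this metric in your own test harness is to timestamp the two events directly. The sketch below is an assumption-laden illustration: `RoundtripTimer` and its method names are hypothetical, and how you detect "caller stopped speaking" and "first response audio" (for example, from voice activity detection and the first outbound audio frame) depends on your stack.

```python
import time


class RoundtripTimer:
    """Records the gap between end of caller speech and first response audio.

    Hypothetical helper for illustration. The clock is injectable so the
    timer can be exercised without real audio.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._speech_end = None
        self.samples_ms = []

    def caller_stopped_speaking(self):
        # Call when your stack detects the end of the caller's utterance.
        self._speech_end = self._clock()

    def first_response_audio(self):
        # Call when the first audio of the AI response is sent to the caller.
        if self._speech_end is None:
            return None  # No pending turn to measure.
        elapsed_ms = (self._clock() - self._speech_end) * 1000.0
        self.samples_ms.append(elapsed_ms)
        self._speech_end = None
        return elapsed_ms


# Demonstration with a fake clock: speech ends at t=10.0s, audio starts at t=10.9s.
ticks = iter([10.0, 10.9])
timer = RoundtripTimer(clock=lambda: next(ticks))
timer.caller_stopped_speaking()
print(f"Roundtrip: {timer.first_response_audio():.0f}ms")
```

Using `time.monotonic` as the default clock avoids errors from wall-clock adjustments during a call.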
Twilio and Vonage numbers come from a Telnyx benchmark (a competitor publishing independent measurements). Vapi numbers come from Trustpilot reviews and production reports. Bland AI numbers come from G2 reviews. DIY stack numbers come from DEV Community benchmarks. LiveKit + Twilio numbers come from GitHub issues and production reports.
Switching providers saves time on one hop but does not eliminate the other five to eight network boundaries. Architecture determines the floor. Optimization determines how close you get to it.
No. You can bring your own models. The AI kernel orchestrates them from inside the media engine, eliminating the orchestration overhead of bolt-on pipelines.
Caching helps for common queries but removes the benefit of having an AI agent that handles novel conversations. Every external API call is a network round-trip that no cache eliminates.
Run the same conversation on your current stack and on SignalWire. Compare what your callers actually experience.