Voice is fundamentally different from chat. In text interfaces, delays of several seconds are often tolerated. In live conversation, they are not.
Voice AI performance is fundamentally an infrastructure problem, not a model problem. Architectural design decisions materially affect conversational responsiveness. No model quality advantage can fully compensate for a poorly orchestrated pipeline.
This report benchmarks three leading Voice AI platforms by implementing the same Smart Appointment Assistant across each system, measuring conversational latency, tool call latency, and development complexity under identical conditions. All platforms used the same LLM (gpt-4o-mini), comparable STT/TTS providers, and the same external API.
SignalWire beats LiveKit and Vapi in every configuration tested. SignalWire Default, with zero tuning, averages 1.46s and beats LiveKit's tuned config (1.75s) by 17%. SW Tuned is 37% faster, and the open-source-model configuration is 38% faster. Even stock defaults beat the competition's optimized setup.
Same LLM, same TTS. The performance gap comes from the orchestration layer, not the models.
SignalWire's tightest config has a 0.09s spread (1.05-1.14s). LiveKit Tuned spread is 0.38s (1.61-1.99s). LiveKit baseline spread is 0.89s (1.55-2.44s).
Built-in speech fillers play during tool execution, so the caller hears a response (1.40s) instead of silence (3.01s). Neither LiveKit nor Vapi offers equivalent filler support.
Vapi's dashboard displayed ~840ms. Stereo waveform analysis of actual calls measured 1.85s on average. The dashboard appears to measure something other than what the caller experiences.
Every platform ran the same Smart Appointment Assistant: answers incoming PSTN calls, greets callers, calls an external appointment API (GET /search, POST /book), confirms bookings, and disconnects. Three calls per configuration, three measured turns per call.
Each call produces three turn types: a fetch tool turn (includes API round-trip), a book tool turn (includes API round-trip), and a conversational turn (pure platform speed, no external dependencies).
Measurement: stereo waveform analysis using signalwire/latency_checker (open source). Human audio is recorded on the left channel, AI audio on the right. The per-turn metric is the time from when the human stops speaking to when the AI starts speaking. The tool is public; run it against any platform and verify these numbers yourself.
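The measurement approach can be sketched in a few lines. This is an illustrative re-implementation of the idea, not the latency_checker code itself: frame the two channels, mark frames with speech energy, and measure the gap from the caller's last speech frame to the AI's first. The sample rate, frame size, and energy threshold below are assumptions for the sketch.

```python
# Sketch: per-turn latency from a stereo recording (caller on the left
# channel, AI on the right). Not the actual latency_checker implementation.
import numpy as np

SAMPLE_RATE = 8000   # PSTN-typical rate; assumption for this sketch
FRAME = 160          # 20 ms frames at 8 kHz
THRESHOLD = 0.02     # RMS energy above this counts as speech

def speech_frames(channel: np.ndarray) -> np.ndarray:
    """Boolean array: True for each 20 ms frame that contains speech energy."""
    n = len(channel) // FRAME
    frames = channel[: n * FRAME].reshape(n, FRAME)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > THRESHOLD

def turn_latency_seconds(left: np.ndarray, right: np.ndarray) -> float:
    """Seconds from the end of the caller's last speech frame to the start
    of the first AI speech frame that follows it."""
    human, ai = speech_frames(left), speech_frames(right)
    human_end = max(i for i, s in enumerate(human) if s)
    ai_start = next(i for i, s in enumerate(ai) if s and i > human_end)
    return (ai_start - human_end - 1) * FRAME / SAMPLE_RATE

# Synthetic example: caller speaks for 1 s, AI replies 1.5 s later.
caller = np.zeros(SAMPLE_RATE * 4)
caller[:SAMPLE_RATE] = 0.5
agent = np.zeros(SAMPLE_RATE * 4)
agent[int(2.5 * SAMPLE_RATE):] = 0.5
print(round(turn_latency_seconds(caller, agent), 2))  # → 1.5
```

Because the metric is computed from the audio the caller actually hears, it is platform-neutral: dashboards can define latency however they like, but the waveform cannot.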
Before interpreting the results, it helps to understand what is physically happening during a voice AI turn. Every turn walks through a chain of components, each with a non-zero cost. No platform can repeal these costs. The question is how much overhead the orchestration layer adds on top.
LLM Time to First Token (TTFT): The model must process the entire prompt before producing its first output token. For GPT-4o on OpenAI's API, the median TTFT is 910ms (Artificial Analysis, 10K input tokens). For gpt-4o-mini (used in this benchmark), TTFT is lower but still non-trivial. The reported floor for a self-hosted 70B model is 148ms on 4x H100 GPUs, warm, batch size 1, synthetic workload. Under real server load (MLPerf), Llama 2 70B TTFT is 443ms at p50 and over 2 seconds at p99.9.
STT processing: The caller's speech must be captured, streamed to an STT service, and endpointed (the system decides the caller is done). Endpointing alone adds 200ms to 1000ms depending on configuration. The STT service then processes the audio and returns text.
TTS startup: After the LLM produces enough tokens for a speakable phrase, the TTS service must synthesize audio and begin streaming it back. First-byte latency for TTS services varies from 100ms to 500ms.
Transport and orchestration: Audio must move between the caller, the media layer, and the AI pipeline. Every network hop between services adds latency and variance.
These costs are additive. A conversational turn that measures 1.09s already includes all of them. A platform claiming a 300ms end-to-end response is either measuring something different (first token emitted, not first audio heard) or excluding parts of the chain.
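A back-of-the-envelope sum makes the additive nature concrete. The per-component numbers below are illustrative midpoints of the ranges quoted above, not measurements from the benchmark:

```python
# Rough turn budget: sum the component floors and see what remains of the
# 2-second conversational threshold. All values are illustrative assumptions.
BUDGET_MS = 2000                 # the 2-second conversational threshold
components = {
    "endpointing":      500,     # 200-1000 ms depending on configuration
    "stt_processing":   150,     # assumed
    "llm_ttft":         350,     # gpt-4o-mini, assumed
    "tts_first_byte":   250,     # 100-500 ms range
    "transport":        100,     # network hops, assumed
}
floor = sum(components.values())
print(f"pipeline floor: {floor} ms, headroom: {BUDGET_MS - floor} ms")
# → pipeline floor: 1350 ms, headroom: 650 ms
```

Any orchestration overhead comes straight out of that headroom, which is why two platforms running the same models can land hundreds of milliseconds apart.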
The 2-second conversational threshold is a budget, not a target. Every pipeline component (STT, LLM, tool calls, TTS) consumes part of it. We tested five SignalWire configurations with different STT, TTS, and model choices. All five stayed under the threshold, and all five beat LiveKit Tuned.
SignalWire Default (1.46s) leaves 540ms of headroom. LiveKit Tuned (1.75s) leaves 250ms. LiveKit baseline (1.87s) leaves 130ms.
With 250ms of headroom, there is little room to trade speed for quality. With 540ms or more, you choose based on what sounds better or reasons better, not what finishes faster.
| Configuration | Avg | Min | Max | Range | vs LK Tuned |
|---|---|---|---|---|---|
| SW OSS + ElevenLabs (OSS, EL Rachel) | 1.09s | 0.94s | 1.26s | 0.32s | 38% faster |
| SW Tuned (gpt-4o-mini, EL, eos=250ms) | 1.10s | 1.05s | 1.14s | 0.09s | 37% faster |
| SW Default (gpt-4o-mini, EL, all defaults) | 1.46s | 1.28s | 1.74s | 0.46s | 17% faster |
| LiveKit Tuned (preemptive_gen, eos 0.2-0.8s) | 1.75s | 1.61s | 1.99s | 0.38s | baseline |
| Vapi (gpt-4o-mini, Vapi defaults) | 1.85s | 1.45s | 2.11s | 0.66s | 6% slower |
| LiveKit Baseline (gpt-4o-mini, EL, defaults) | 1.87s | 1.55s | 2.44s | 0.89s | 7% slower |
| Config | LLM | TTS | end_of_speech | Notes |
|---|---|---|---|---|
| SW Default | gpt-4o-mini | ElevenLabs Rachel | 1000ms (default) | All platform defaults. Zero tuning. |
| SW Tuned | gpt-4o-mini | ElevenLabs Rachel | 250ms | temp=0.2, top_p=0.8 |
| SW Inworld | gpt-4o-mini | Inworld Mark (native) | 200ms | Native TTS engine, no ElevenLabs |
| SW OSS + Inworld | OSS model | Inworld Mark (native) | 200ms | Open-source inference |
| SW OSS + ElevenLabs | OSS model | ElevenLabs Rachel | 200ms | Cleanest apples-to-apples OSS test |
A system with a 1.7s average and occasional spikes performs worse perceptually than one with a stable 1.8s response time. Latency spikes create conversation breakdowns.
SignalWire's tuned configuration has a 0.09s spread (1.05 to 1.14s). LiveKit Tuned has a 0.38s spread (1.61 to 1.99s). LiveKit baseline has a 0.89s spread (1.55 to 2.44s). Tuning LiveKit reduced its spread, but SignalWire's tightest config is still 4x more consistent than LiveKit's tuned config.
Variance at small scale predicts variance at large scale. A platform that shows 0.89s spread across 3 test calls has multiple external services, each with its own queue and failure mode. At 300 concurrent calls, those queues get deeper. A platform with 0.09s spread has fewer compounding variables. Tight variance in a benchmark is the signal you have before committing to a production deployment.
The parameter that matters most: end_of_speech_timeout. SignalWire exposes this at millisecond granularity. Reducing it from the default 1000ms to 250ms dropped conversational latency from 1.46s to 1.10s (25% improvement) with no other changes.
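The change is a one-parameter edit. Below is a hedged SWML-style sketch; the key names follow SignalWire's documented AI parameters, but verify them against current docs before relying on this, and the prompt text is a placeholder:

```yaml
# Sketch: lowering end_of_speech_timeout from the 1000 ms default to 250 ms.
version: 1.0.0
sections:
  main:
    - ai:
        params:
          end_of_speech_timeout: 250   # ms; platform default is 1000
        prompt:
          text: You are the Smart Appointment Assistant.
```

The trade-off is standard for endpointing: a lower timeout responds faster but risks cutting off callers who pause mid-sentence, so tune against real call audio rather than assuming lower is always better.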
| Metric | SignalWire | LiveKit (tuned) | Vapi |
|---|---|---|---|
| Verdict | Fastest tested, most configurable | | |
| Conversational Latency | 1.24s avg (5 configs) | 1.75s | 1.85s |
| Tool Turn Latency | 2.01s (OSS+EL config) | 2.90s | 2.82s |
| Latency Spread | 0.09 - 0.46s across configs | 0.38s | 0.66s |
| Filler Support | Built-in, per-language and per-function | Not available | Partial, not fully controllable |
| OSS Model Support | Drop-in via AI kernel | OpenAI-compatible endpoints, Ollama helper | Not tested |
| Endpointing Control | Millisecond-level end_of_speech_timeout | min/max delay (0.2s/0.8s tuned) | Smart endpointing plans |
Voice AI latency is primarily an orchestration and control-plane problem. Platforms purpose-built for real-time AI voice orchestration demonstrate lower latency variance, fewer extreme spikes, and more predictable response timing.
Systems adapted from messaging, workflow, or dashboard-first architectures introduce coordination overhead that no model quality advantage can fully compensate for.
Speech-to-speech models like Amazon Nova Sonic and OpenAI's Realtime API eliminate the STT+LLM+TTS pipeline entirely. Audio goes in, audio comes out, no text intermediate. This removes the serialization overhead that dominates the benchmark results above. SignalWire supports Amazon Nova Sonic in production and has OpenAI's Realtime API in testing, both integrated into the same platform through the same AI kernel.
For pure conversational turns (no tool calls), speech-to-speech achieves sub-600ms latency. That is a meaningful improvement over the 1.09 to 1.87s range in the STT+LLM+TTS benchmarks above.
The trade-offs are real. Current speech-to-speech models have weaker tool calling (function invocation is less reliable than text-based LLMs), less predictable inference (the model sometimes hallucinates or misinterprets in ways that text pipelines do not), and limited language/voice selection. For applications that need reliable tool execution, structured data extraction, or governed AI behavior, the STT+LLM+TTS pipeline remains the production choice.
This is not an either/or decision on SignalWire. The AI kernel supports both pipelines through the same interfaces. Developers choose per agent or per interaction based on what the use case requires. As speech-to-speech models mature, the same agents can switch without code changes. The infrastructure advantage holds either way: the kernel orchestrates the model from inside the media layer regardless of which pipeline is active.
- No multi-layer handoff overhead.
- Define agents in Python, TypeScript, Go, Java, C#, Rust, Ruby, PHP, Perl, or C++. Compose in code, deploy anywhere.
- Objective measurement at the telecom layer, not the application layer.
- Your functions can transfer, hold, conference, record, collect payments, send SMS, or issue commands to other live calls. Actions, not just data retrieval.
- 0.09s spread in the tuned config. Predictable performance, not best-case marketing numbers.
- end_of_speech_timeout at millisecond granularity, plus temperature, top_p, and per-function fillers. Tune what matters.
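The per-function filler behavior described above can be modeled in a few lines. This is an illustrative sketch of the concept, not the SignalWire SDK API; all names here are hypothetical. The agent speaks a function-specific phrase the moment a tool call starts, so the API round-trip happens behind audio instead of silence:

```python
# Illustrative model of per-function fillers: speak first, then run the
# slow tool call. Names and structure are hypothetical, not a real SDK.
import time

FILLERS = {
    "search_appointments": "Let me check the calendar for you.",
    "book_appointment": "One moment while I book that.",
}

def call_tool(name, tool, *args):
    """Speak the function's filler, then execute the tool call."""
    spoken = FILLERS.get(name, "One moment.")
    print(f"[speak] {spoken}")           # caller hears this immediately
    start = time.monotonic()
    result = tool(*args)                 # external API round-trip
    print(f"[tool {name}] {time.monotonic() - start:.2f}s")
    return result

# Stand-in for the appointment API's GET /search round-trip.
slots = call_tool("search_appointments", lambda: ["10:00", "14:30"])
print(slots)
```

In a real pipeline the filler audio streams concurrently with the tool call, which is how a 3.01s silent gap becomes a 1.40s perceived response.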
Benchmark conditions are controlled. Production is not. Three calls per config tells you how a platform performs in isolation. It does not tell you what happens at hundreds of concurrent calls, under real network conditions, with callers who interrupt and go silent.
The benchmark data does contain a signal: latency spread predicts production behavior. Each platform's spread and failure mode under test load points to what will happen when that load multiplies.
| Platform | Avg Latency | Spread | Scaling Risk | Primary Failure Mode at Scale |
|---|---|---|---|---|
| SignalWire | 1.46s (default) | 0.46s | Low | Narrow variance, few compounding failure boundaries. Predictable orchestration under concurrent load. |
| LiveKit (tuned) | 1.75s | 0.38s | Low-Medium | Tuning reduced spread from 0.89s to 0.38s. External STT/TTS dependencies remain. |
| LiveKit (baseline) | 1.87s | 0.89s | Medium | External STT/TTS dependencies multiply failure surface. Wider variance compounds under concurrent load. |
| Vapi | 1.85s | 0.66s | Medium-High | 4.87s outlier during a function call. Async tool execution is the likely cause. |
| Twilio | 2.42s | 1.21s | High | No programmatic control. No ability to tune, optimize, or instrument at scale. 3 of 4 turns exceeded 2.3s. |
| Platform | Conv. Latency | Tool Latency | Spread | Fillers | Verdict |
|---|---|---|---|---|---|
| SignalWire | 1.24s avg (5 configs) | 2.01s (OSS+EL) | 0.09 - 0.46s | Built-in | Fastest tested across all turn types. Tightest variance. Most exposed pipeline controls. |
| LiveKit (tuned) | 1.75s | 2.90s | 0.38s | None | Tuning improved 6% over baseline. Still 17% slower than SW Default. |
| Vapi | 1.85s | 2.82s | 0.66s | Partial | Close to LiveKit baseline. Dashboard latency claims did not match waveform measurement. |
| LiveKit (baseline) | 1.87s | 3.08s | 0.89s | None | Widest spread. External dependencies add variance. |
| Twilio | 2.42s | — | 1.21s | None | Already past the 2-second threshold on default components. No code-level pipeline control. |
Sign up, deploy the same appointment agent, and measure your own latency. The tool is open source. The platform is pay-as-you-go. $0.16/min, no minimum.