Voice AI Platform Performance Analysis | SignalWire
Benchmarking Report 2026

Voice AI Platform Performance Analysis

Voice is fundamentally different from chat. In text interfaces, delays of several seconds are often tolerated. In live conversation, they are not.

  • 1.24s: SignalWire average (5 configs)
  • 38%: faster than LiveKit Tuned (fastest config)
  • 1.09s: SignalWire fastest config
  • 1.75s: LiveKit Tuned

Executive Summary

Voice AI performance is fundamentally an infrastructure problem, not a model problem. Architectural design decisions materially affect conversational responsiveness. No model quality advantage can fully compensate for a poorly orchestrated pipeline.

This report benchmarks three leading Voice AI platforms by implementing the same Smart Appointment Assistant across each system, measuring conversational latency, tool call latency, and development complexity under identical conditions. All platforms used the same LLM (gpt-4o-mini), comparable STT/TTS providers, and the same external API.

Key Findings

Faster in Every Config

SignalWire beats LiveKit and Vapi in every configuration tested. SignalWire Default with zero tuning (1.46s) still beats LiveKit's tuned config (1.75s) by 17%.

17-38% Faster (vs Tuned LiveKit)

Default config: 17% faster than LiveKit Tuned. SW Tuned: 37% faster. OSS models: 38% faster. Even our stock defaults beat their optimized configuration.

Orchestration Is the Variable

Same LLM, same TTS. The performance gap comes from the orchestration layer, not the models.

Consistency Over Averages

SignalWire's tightest config has a 0.09s spread (1.05-1.14s). LiveKit Tuned spread is 0.38s (1.61-1.99s). LiveKit baseline spread is 0.89s (1.55-2.44s).

Fillers Cut Tool Call Silence 53%

Built-in speech fillers play during tool execution so the caller hears a response (1.40s) instead of silence (3.01s). Neither LiveKit nor Vapi offers equivalent filler support.
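The idea behind fillers is simple to reason about: start speaking a short canned phrase the moment the tool call begins, so the API round-trip overlaps with audio the caller hears. A minimal asyncio sketch of that overlap, not SignalWire's actual API; `play_filler` and `call_tool` are hypothetical stand-ins:

```python
import asyncio

async def play_filler(phrase: str) -> None:
    # Hypothetical stand-in: stream a short canned phrase to the caller.
    print(f"(caller hears) {phrase}")

async def call_tool(delay: float) -> str:
    # Hypothetical stand-in for the external appointment API round-trip.
    await asyncio.sleep(delay)
    return "Tuesday at 3pm is available."

async def tool_turn_with_filler() -> str:
    # Speak the filler concurrently with the tool call, so the caller
    # hears speech instead of silence while the API is in flight.
    filler = asyncio.create_task(play_filler("Let me check that for you."))
    result = await call_tool(delay=0.2)
    await filler  # a real engine would queue the response behind the filler
    return result

result = asyncio.run(tool_turn_with_filler())
```

A production engine also has to handle barge-in over the filler and avoid playing it on fast tool returns, which is why doing this inside the media layer matters.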

Vapi Dashboard vs Waveform

Vapi's dashboard displayed ~840ms. Stereo waveform analysis of actual calls measured a 1.85s average. The dashboard appears to measure something other than what the caller experiences.

Human Conversational Thresholds

  • < 2.0s (Responsive): Slight mechanical feel, but conversation flows naturally.
  • 2-3s (Noticeable delay): Callers begin losing confidence in the system.
  • > 3s (Frequent interruption): Users frequently interrupt or assume the system has failed.
  • > 4s (Conversation breakdown): Callers often hang up or request a human.

Two Architectures, One Pipeline

Application-Layer (LiveKit, Vapi)

  • A Python async event loop orchestrates the pipeline from outside the media path
  • Audio goes to an external STT service, text streams to LLM, LLM output streams to an external TTS service, audio routes back to the caller
  • STT and TTS are separate network services. LLM-to-TTS streaming overlaps, but STT-to-LLM and TTS-to-caller are sequential hops
  • Barge-in is detected via VAD in the agent process, not at the raw audio level inside the media layer
  • Endpointing, turn detection, and tool execution all run in the same async loop as the pipeline orchestration
  • Latency variance comes from coordinating multiple external services, each with its own queue and network path
  • Getting it to work takes a weekend. Getting it to feel like a conversation takes months.

AI Kernel (SignalWire)

  • A purpose-built engine orchestrates the pipeline from inside the media processing layer, not from an application-layer script
  • The AI kernel calls the same external STT, LLM, and TTS services, but from a position with direct access to the audio stream, timing, and call state
  • Barge-in and endpointing are detected at the audio level inside the media engine, before any external service is involved
  • Provider differences (latency profiles, streaming behavior, error modes) are normalized by the kernel so they perform consistently regardless of which vendor you choose
  • The result is not fewer external calls. It is lower variance and faster coordination, because the orchestration layer was built for real-time audio from the start

Methodology

Every platform ran the same Smart Appointment Assistant: answers incoming PSTN calls, greets callers, calls an external appointment API (GET /search, POST /book), confirms bookings, and disconnects. Three calls per configuration, three measured turns per call.

Each call produces three turn types: a fetch tool turn (includes API round-trip), a book tool turn (includes API round-trip), and a conversational turn (pure platform speed, no external dependencies).

Measurement: stereo waveform analysis using signalwire/latency_checker (open source). Human audio on left channel, AI audio on right channel. Metric: human stops speaking to AI starts speaking, per turn. The tool is public. Run it against any platform and verify these numbers yourself.

Recording capture point: SignalWire recordings are captured at the telecom stack, reflecting the actual audio path the caller experiences. LiveKit and Vapi do not provide access to telecom-layer recording; their measurements are captured at the application/media-server layer, before audio traverses the phone network to the caller. The LiveKit and Vapi numbers in this report are therefore structurally optimistic, and the real-world gap is likely wider than measured.
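The metric itself is easy to pin down precisely: on a stereo recording with the human on the left channel and the AI on the right, turn latency is the time from the last human audio to the first AI audio. The internals of signalwire/latency_checker may differ; this is a self-contained sketch of the definition using a crude energy threshold and synthetic samples:

```python
def last_active_index(samples, threshold=0.05):
    """Index of the last sample above the energy threshold, or -1 if silent."""
    idx = -1
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            idx = i
    return idx

def first_active_index(samples, threshold=0.05):
    """Index of the first sample above the energy threshold, or -1 if silent."""
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return i
    return -1

def turn_latency(left, right, sample_rate):
    """Seconds from the human's last audio (left channel)
    to the AI's first audio (right channel)."""
    human_end = last_active_index(left)
    ai_start = first_active_index(right)
    return (ai_start - human_end) / sample_rate

# Synthetic 8 kHz example: human speaks for 1s, AI replies at 2.2s.
rate = 8000
left = [0.5] * (1 * rate) + [0.0] * (2 * rate)
right = [0.0] * int(2.2 * rate) + [0.5] * rate
print(f"{turn_latency(left, right, rate):.2f}s")
```

Real recordings need a smarter voice-activity detector than a fixed threshold, but the measurement boundary (human stops, AI starts) is the same one used throughout this report.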

The Physics of Voice AI Latency

Before interpreting the results, it helps to understand what is physically happening during a voice AI turn. Every turn walks through a chain of components, each with a non-zero cost. No platform can repeal these costs. The question is how much overhead the orchestration layer adds on top.

LLM Time to First Token (TTFT): The model must process the entire prompt before producing its first output token. For GPT-4o on OpenAI's API, the median TTFT is 910ms (Artificial Analysis, 10K input tokens). For gpt-4o-mini (used in this benchmark), TTFT is lower but still non-trivial. The reported floor for a self-hosted 70B model is 148ms on 4x H100 GPUs, warm, batch size 1, synthetic workload. Under real server load (MLPerf), Llama 2 70B TTFT is 443ms at p50 and over 2 seconds at p99.9.

STT processing: The caller's speech must be captured, streamed to an STT service, and endpointed (the system decides the caller is done). Endpointing alone adds 200ms to 1000ms depending on configuration. The STT service then processes the audio and returns text.

TTS startup: After the LLM produces enough tokens for a speakable phrase, the TTS service must synthesize audio and begin streaming it back. First-byte latency for TTS services varies from 100ms to 500ms.

Transport and orchestration: Audio must move between the caller, the media layer, and the AI pipeline. Every network hop between services adds latency and variance.

These costs are additive. A conversational turn that measures 1.09s accounts for all of them. A platform claiming a 300ms end-to-end response is either measuring something different (first token emitted, not first audio heard) or excluding parts of the chain.
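As a back-of-envelope check, summing plausible mid-range values for each stage lands in the same neighborhood as the measured turns. The per-stage figures below are illustrative picks from the ranges cited above, not measurements:

```python
# Illustrative mid-range cost per pipeline stage, in milliseconds,
# chosen from the ranges discussed above (assumptions, not measurements).
pipeline_ms = {
    "endpointing": 250,      # 200-1000 ms depending on configuration
    "stt_finalize": 150,     # STT service processes audio, returns text
    "llm_ttft": 350,         # gpt-4o-mini time to first token
    "tts_first_byte": 200,   # TTS synthesis startup, 100-500 ms typical
    "transport": 100,        # network hops between external services
}

total_s = sum(pipeline_ms.values()) / 1000
print(f"{total_s:.2f}s before orchestration overhead")
```

With these assumptions the chain alone costs about 1.05s, which is why sub-second claims for a full STT+LLM+TTS turn deserve scrutiny: the orchestration layer can only add to this floor, not subtract from it.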

Conversational Turn Latency

  • SignalWire (avg of 5 configs): 1.24s (range 1.09-1.46s)
  • Vapi: 1.85s (1% faster than LK baseline)
  • LiveKit (baseline): 1.87s (0.89s spread)
  • LiveKit (tuned): 1.75s (0.38s spread)

Tool Turn Latency

  • SignalWire (OSS + ElevenLabs): 2.01s (31% faster than LK Tuned)
  • Vapi: 2.82s
  • LiveKit (tuned): 2.90s
  • LiveKit (baseline): 3.08s

Speech Fillers on Tool Calls

  • With fillers (caller hears response): 1.40s (53% reduction)
  • Without fillers (caller hears silence): 3.01s

The Latency Budget

The 2-second conversational threshold is a budget, not a target. Every pipeline component (STT, LLM, tool calls, TTS) consumes part of it. SignalWire tested five configurations with different STT, TTS, and model choices. All five stayed under threshold. All five beat LiveKit Tuned.

SignalWire Default (1.46s) leaves 540ms of headroom. LiveKit Tuned (1.75s) leaves 250ms. LiveKit baseline (1.87s) leaves 130ms.

With 250ms of headroom, there is little room to trade speed for quality. With 540ms or more, you choose based on what sounds better or reasons better, not what finishes faster.

Detailed Results: Conversational Turns

| Configuration | Avg | Min | Max | Range | vs LK Tuned |
| --- | --- | --- | --- | --- | --- |
| SW OSS + ElevenLabs (OSS, EL Rachel) | 1.09s | 0.94s | 1.26s | 0.32s | 38% faster |
| SW Tuned (gpt-4o-mini, EL, eos=250ms) | 1.10s | 1.05s | 1.14s | 0.09s | 37% faster |
| SW Default (gpt-4o-mini, EL, all defaults) | 1.46s | 1.28s | 1.74s | 0.46s | 17% faster |
| LiveKit Tuned (preemptive_gen, eos 0.2-0.8s) | 1.75s | 1.61s | 1.99s | 0.38s | baseline |
| Vapi (gpt-4o-mini, Vapi defaults) | 1.85s | 1.45s | 2.11s | 0.66s | 6% slower |
| LiveKit Baseline (gpt-4o-mini, EL, defaults) | 1.87s | 1.55s | 2.44s | 0.89s | 7% slower |

SignalWire Configuration Details

| Config | LLM | TTS | end_of_speech | Notes |
| --- | --- | --- | --- | --- |
| SW Default | gpt-4o-mini | ElevenLabs Rachel | 1000ms (default) | All platform defaults. Zero tuning. |
| SW Tuned | gpt-4o-mini | ElevenLabs Rachel | 250ms | temp=0.2, top_p=0.8 |
| SW Inworld | gpt-4o-mini | Inworld Mark (native) | 200ms | Native TTS engine, no ElevenLabs |
| SW OSS + Inworld | OSS model | Inworld Mark (native) | 200ms | Open-source inference |
| SW OSS + ElevenLabs | OSS model | ElevenLabs Rachel | 200ms | Cleanest apples-to-apples OSS test |

Consistency Over Averages

A system with a 1.7s average and occasional spikes performs worse perceptually than one with a stable 1.8s response time. Latency spikes create conversation breakdowns.

SignalWire's tuned configuration has a 0.09s spread (1.05 to 1.14s). LiveKit Tuned has a 0.38s spread (1.61 to 1.99s). LiveKit baseline has a 0.89s spread (1.55 to 2.44s). Tuning LiveKit reduced its spread, but SignalWire's tightest config is still 4x more consistent than LiveKit's tuned config.

Variance at small scale predicts variance at large scale. A platform that shows 0.89s spread across 3 test calls has multiple external services, each with its own queue and failure mode. At 300 concurrent calls, those queues get deeper. A platform with 0.09s spread has fewer compounding variables. Tight variance in a benchmark is the signal you have before committing to a production deployment.
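The consistency metric used throughout this report is just the range (max minus min) across measured turns. Taking the min, average, and max figures from the results table as stand-ins for the three turns per configuration:

```python
def spread(latencies):
    """Range (max - min) across measured turns, rounded to 10ms:
    the consistency metric used throughout this report."""
    return round(max(latencies) - min(latencies), 2)

# Per-turn conversational latencies (seconds), taken from the
# min/avg/max columns of the detailed results table above.
sw_tuned = [1.05, 1.10, 1.14]
lk_tuned = [1.61, 1.75, 1.99]

print(spread(sw_tuned), spread(lk_tuned))
```

That yields 0.09s versus 0.38s, roughly a 4x consistency gap between the two tuned configurations.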

The parameter that matters most: end_of_speech_timeout. SignalWire exposes this at millisecond granularity. Reducing it from the default 1000ms to 250ms dropped conversational latency from 1.46s to 1.10s (25% improvement) with no other changes.
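As an illustration of where that parameter lives, here is a sketch of the SW Tuned configuration expressed as an SWML-style document built in Python. Only end_of_speech_timeout=250, temperature=0.2, and top_p=0.8 come from this report; the surrounding document structure is an assumption and should be checked against SignalWire's SWML reference:

```python
# Sketch of the "SW Tuned" configuration from the table above.
# The outer SWML structure is an illustrative assumption; only the
# three tuned values are taken from this report.
swml = {
    "version": "1.0.0",
    "sections": {
        "main": [
            {
                "ai": {
                    "params": {
                        # milliseconds; platform default is 1000
                        "end_of_speech_timeout": 250,
                    },
                    "prompt": {
                        "text": "You are a smart appointment assistant.",
                        "temperature": 0.2,
                        "top_p": 0.8,
                    },
                }
            }
        ]
    },
}

print(swml["sections"]["main"][0]["ai"]["params"]["end_of_speech_timeout"])
```

The point is granularity: the endpointing wait is a single millisecond-valued field, not a bundled "endpointing plan".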

Platform Comparison

SignalWire was the fastest tested and the most configurable of the three.

|  | SignalWire | LiveKit (tuned) | Vapi |
| --- | --- | --- | --- |
| Conversational Latency | 1.24s avg (5 configs) | 1.75s | 1.85s |
| Tool Turn Latency | 2.01s (OSS+EL config) | 2.90s | 2.82s |
| Latency Spread | 0.09-0.46s across configs | 0.38s | 0.66s |
| Filler Support | Built-in, per-language and per-function | Not available | Partial, not fully controllable |
| OSS Model Support | Drop-in via AI kernel | OpenAI-compatible endpoints, Ollama helper | Not tested |
| Endpointing Control | Millisecond-level end_of_speech_timeout | min/max delay (0.2s/0.8s tuned) | Smart endpointing plans |
Sample size: Three calls per configuration, three turns per call. Conversational turns (one per call) are the smallest sample. Tool turns are bottlenecked by shared Heroku API response time. Results are directionally strong but not a large-sample statistical study.

Architectural Thesis

Voice AI latency is primarily an orchestration and control-plane problem. Platforms purpose-built for real-time AI voice orchestration demonstrate lower latency variance, fewer extreme spikes, and more predictable response timing.

Systems adapted from messaging, workflow, or dashboard-first architectures introduce coordination overhead that no model quality advantage can fully compensate for.

What About Speech-to-Speech Models?

Speech-to-speech models like Amazon Nova Sonic and OpenAI's Realtime API eliminate the STT+LLM+TTS pipeline entirely. Audio goes in, audio comes out, no text intermediate. This removes the serialization overhead that dominates the benchmark results above. SignalWire supports Amazon Nova Sonic in production and has OpenAI's Realtime API in testing, both integrated into the same platform through the same AI kernel.

For pure conversational turns (no tool calls), speech-to-speech achieves sub-600ms latency. That is a meaningful improvement over the 1.09 to 1.87s range in the STT+LLM+TTS benchmarks above.

The trade-offs are real. Current speech-to-speech models have weaker tool calling (function invocation is less reliable than text-based LLMs), less predictable inference (the model sometimes hallucinates or misinterprets in ways that text pipelines do not), and limited language/voice selection. For applications that need reliable tool execution, structured data extraction, or governed AI behavior, the STT+LLM+TTS pipeline remains the production choice.

This is not an either/or decision on SignalWire. The AI kernel supports both pipelines through the same interfaces. Developers choose per agent or per interaction based on what the use case requires. As speech-to-speech models mature, the same agents can switch without code changes. The infrastructure advantage holds either way: the kernel orchestrates the model from inside the media layer regardless of which pipeline is active.

The SignalWire Advantage

Native Telephony Integration

No multi-layer handoff overhead.

SDKs in 10 Languages

Define agents in Python, TypeScript, Go, Java, C#, Rust, Ruby, PHP, Perl, or C++. Compose in code, deploy anywhere.

Built-in Stereo Recording

Objective measurement at the telecom layer, not the application layer.

Tool Calls That Control the Call

Your function can transfer, hold, conference, record, collect payments, send SMS, or issue commands to other live calls. Actions, not only data retrieval.

Tightest Latency Spread

0.09s spread in tuned config. Predictable performance, not best-case marketing numbers.

Exposed Pipeline Controls

end_of_speech_timeout at millisecond granularity, temperature, top_p, fillers per-function. Tune what matters.

Production Scaling

Benchmark conditions are controlled. Production is not. Three calls per config tells you how a platform performs in isolation. It does not tell you what happens at hundreds of concurrent calls, under real network conditions, with callers who interrupt and go silent.

The benchmark data does contain a signal: latency spread predicts production behavior. Each platform's spread and failure mode under test load points to what will happen when that load multiplies.

Scaling Risk by Platform

| Platform | Avg Latency | Spread | Scaling Risk | Primary Failure Mode at Scale |
| --- | --- | --- | --- | --- |
| SignalWire | 1.46s (default) | 0.46s | Low | Narrow variance, few compounding failure boundaries. Predictable orchestration under concurrent load. |
| LiveKit (tuned) | 1.75s | 0.38s | Low-Medium | Tuning reduced spread from 0.89s to 0.38s. External STT/TTS dependencies remain. |
| LiveKit (baseline) | 1.87s | 0.89s | Medium | External STT/TTS dependencies multiply failure surface. Wider variance compounds under concurrent load. |
| Vapi | 1.85s | 0.66s | Medium-High | 4.87s outlier during a function call. Async tool execution is the likely cause. |
| Twilio | 2.42s | 1.21s | High | No programmatic control. No ability to tune, optimize, or instrument at scale. 3 of 4 turns exceeded 2.3s. |

Summary of Results

| Platform | Conv. Latency | Tool Latency | Spread | Fillers | Verdict |
| --- | --- | --- | --- | --- | --- |
| SignalWire | 1.24s avg (5 configs) | 2.01s (OSS+EL) | 0.09-0.46s | Built-in | Fastest tested across all turn types. Tightest variance. Most exposed pipeline controls. |
| LiveKit (tuned) | 1.75s | 2.90s | 0.38s | None | Tuning improved 6% over baseline. Still 17% slower than SW Default. |
| Vapi | 1.85s | 2.82s | 0.66s | Partial | Close to LiveKit baseline. Dashboard latency claims did not match waveform measurement. |
| LiveKit (baseline) | 1.87s | 3.08s | 0.89s | None | Widest spread. External dependencies add variance. |
| Twilio | 2.42s | n/a | 1.21s | None | Already past the 2-second threshold on default components. No code-level pipeline control. |

Test it yourself.

Sign up, deploy the same appointment agent, and measure your own latency. The tool is open source. The platform is pay-as-you-go. $0.16/min, no minimum.