Voice is fundamentally different from chat. In text interfaces, delays of several seconds are often tolerated. In live conversation, they are not.
Voice AI performance is fundamentally an infrastructure problem, not a model problem. Architectural design decisions materially affect conversational responsiveness. No model quality advantage can fully compensate for a poorly orchestrated pipeline.
This report benchmarks three leading Voice AI platforms by implementing the same Smart Appointment Assistant across each system, measuring conversational latency, tool call latency, and development complexity under identical conditions. All platforms used the same LLM (gpt-4o-mini), comparable STT/TTS providers, and the same external API.
SignalWire beats LiveKit and Vapi in every configuration tested. SignalWire Default, with zero tuning, averages 1.46s and beats LiveKit's tuned config (1.75s) by 17%. SW Tuned is 37% faster, and the open-source-model configuration is 38% faster. Even stock defaults beat the competition's optimized setup.
Same LLM, same TTS. The performance gap comes from the orchestration layer, not the models.
SignalWire's tightest config has a 0.09s spread (1.05-1.14s). LiveKit Tuned spread is 0.38s (1.61-1.99s). LiveKit baseline spread is 0.89s (1.55-2.44s).
Built-in speech fillers play during tool execution, so the caller hears a response (1.40s) instead of silence (3.01s). Neither LiveKit nor Vapi offers equivalent filler support.
Vapi's dashboard displayed ~840ms. Stereo waveform analysis of actual calls measured 1.85s on average. The dashboard appears to measure something other than what the caller experiences.
Every platform ran the same Smart Appointment Assistant: answers incoming PSTN calls, greets callers, calls an external appointment API (GET /search, POST /book), confirms bookings, and disconnects. Three calls per configuration, three measured turns per call.
Each call produces three turn types: a fetch tool turn (includes API round-trip), a book tool turn (includes API round-trip), and a conversational turn (pure platform speed, no external dependencies).
Measurement: stereo waveform analysis using signalwire/latency_checker (open source). Human audio is recorded on the left channel, AI audio on the right. The per-turn metric is the time from when the human stops speaking to when the AI starts speaking. The tool is public; run it against any platform and verify these numbers yourself.
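The measurement approach can be sketched in a few lines. This is an illustrative re-implementation of the idea, not the latency_checker code itself: frame the two channels, mark frames with speech energy, and measure the gap from the caller's last speech frame to the AI's first. The sample rate, frame size, and energy threshold below are assumptions for the sketch.

```python
# Sketch: per-turn latency from a stereo recording (caller on the left
# channel, AI on the right). Not the actual latency_checker implementation.
import numpy as np

SAMPLE_RATE = 8000   # PSTN-typical rate; assumption for this sketch
FRAME = 160          # 20 ms frames at 8 kHz
THRESHOLD = 0.02     # RMS energy above this counts as speech

def speech_frames(channel: np.ndarray) -> np.ndarray:
    """Boolean array: True for each 20 ms frame that contains speech energy."""
    n = len(channel) // FRAME
    frames = channel[: n * FRAME].reshape(n, FRAME)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > THRESHOLD

def turn_latency_seconds(left: np.ndarray, right: np.ndarray) -> float:
    """Seconds from the end of the caller's last speech frame to the start
    of the first AI speech frame that follows it."""
    human, ai = speech_frames(left), speech_frames(right)
    human_end = max(i for i, s in enumerate(human) if s)
    ai_start = next(i for i, s in enumerate(ai) if s and i > human_end)
    return (ai_start - human_end - 1) * FRAME / SAMPLE_RATE

# Synthetic example: caller speaks for 1 s, AI replies 1.5 s later.
caller = np.zeros(SAMPLE_RATE * 4)
caller[:SAMPLE_RATE] = 0.5
agent = np.zeros(SAMPLE_RATE * 4)
agent[int(2.5 * SAMPLE_RATE):] = 0.5
print(round(turn_latency_seconds(caller, agent), 2))  # → 1.5
```

Because the metric is computed from the audio the caller actually hears, it is platform-neutral: dashboards can define latency however they like, but the waveform cannot.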
Before interpreting the results, it helps to understand what is physically happening during a voice AI turn. Every turn walks through a chain of components, each with a non-zero cost. No platform can repeal these costs. The question is how much overhead the orchestration layer adds on top.
LLM Time to First Token (TTFT): The model must process the entire prompt before producing its first output token. For GPT-4o on OpenAI's API, the median TTFT is 910ms (Artificial Analysis, 10K input tokens). For gpt-4o-mini (used in this benchmark), TTFT is lower but still non-trivial. The reported floor for a self-hosted 70B model is 148ms on 4x H100 GPUs, warm, batch size 1, synthetic workload. Under real server load (MLPerf), Llama 2 70B TTFT is 443ms at p50 and over 2 seconds at p99.9.
STT processing: The caller's speech must be captured, streamed to an STT service, and endpointed (the system decides the caller is done). Endpointing alone adds 200ms to 1000ms depending on configuration. The STT service then processes the audio and returns text.
TTS startup: After the LLM produces enough tokens for a speakable phrase, the TTS service must synthesize audio and begin streaming it back. First-byte latency for TTS services varies from 100ms to 500ms.
Transport and orchestration: Audio must move between the caller, the media layer, and the AI pipeline. Every network hop between services adds latency and variance.
These costs are additive. A conversational turn that measures 1.09s already includes all of them. A platform claiming a 300ms end-to-end response is either measuring something different (first token emitted, not first audio heard) or excluding parts of the chain.
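A back-of-the-envelope sum makes the additive nature concrete. The per-component numbers below are illustrative midpoints of the ranges quoted above, not measurements from the benchmark:

```python
# Rough turn budget: sum the component floors and see what remains of the
# 2-second conversational threshold. All values are illustrative assumptions.
BUDGET_MS = 2000                 # the 2-second conversational threshold
components = {
    "endpointing":      500,     # 200-1000 ms depending on configuration
    "stt_processing":   150,     # assumed
    "llm_ttft":         350,     # gpt-4o-mini, assumed
    "tts_first_byte":   250,     # 100-500 ms range
    "transport":        100,     # network hops, assumed
}
floor = sum(components.values())
print(f"pipeline floor: {floor} ms, headroom: {BUDGET_MS - floor} ms")
# → pipeline floor: 1350 ms, headroom: 650 ms
```

Any orchestration overhead comes straight out of that headroom, which is why two platforms running the same models can land hundreds of milliseconds apart.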
The 2-second conversational threshold is a budget, not a target. Every pipeline component (STT, LLM, tool calls, TTS) consumes part of it. We tested five SignalWire configurations with different STT, TTS, and model choices. All five stayed under the threshold, and all five beat LiveKit Tuned.
SignalWire Default (1.46s) leaves 540ms of headroom. LiveKit Tuned (1.75s) leaves 250ms. LiveKit baseline (1.87s) leaves 130ms.
With 250ms of headroom, there is little room to trade speed for quality. With 540ms or more, you choose based on what sounds better or reasons better, not what finishes faster.
| Configuration | Avg | Min | Max | Range | vs LK Tuned |
|---|---|---|---|---|---|
| SW OSS + ElevenLabs (OSS, EL Rachel) | 1.09s | 0.94s | 1.26s | 0.32s | 38% faster |
| SW Tuned (gpt-4o-mini, EL, eos=250ms) | 1.10s | 1.05s | 1.14s | 0.09s | 37% faster |
| SW Default (gpt-4o-mini, EL, all defaults) | 1.46s | 1.28s | 1.74s | 0.46s | 17% faster |
| LiveKit Tuned (preemptive_gen, eos 0.2-0.8s) | 1.75s | 1.61s | 1.99s | 0.38s | baseline |
| Vapi (gpt-4o-mini, Vapi defaults) | 1.85s | 1.45s | 2.11s | 0.66s | 6% slower |
| LiveKit Baseline (gpt-4o-mini, EL, defaults) | 1.87s | 1.55s | 2.44s | 0.89s | 7% slower |
| Config | LLM | TTS | end_of_speech | Notes |
|---|---|---|---|---|
| SW Default | gpt-4o-mini | ElevenLabs Rachel | 1000ms (default) | All platform defaults. Zero tuning. |
| SW Tuned | gpt-4o-mini | ElevenLabs Rachel | 250ms | temp=0.2, top_p=0.8 |
| SW Inworld | gpt-4o-mini | Inworld Mark (native) | 200ms | Native TTS engine, no ElevenLabs |
| SW OSS + Inworld | OSS model | Inworld Mark (native) | 200ms | Open-source inference |
| SW OSS + ElevenLabs | OSS model | ElevenLabs Rachel | 200ms | Cleanest apples-to-apples OSS test |
A system with a 1.7s average and occasional spikes performs worse perceptually than one with a stable 1.8s response time. Latency spikes create conversation breakdowns.
SignalWire's tuned configuration has a 0.09s spread (1.05 to 1.14s). LiveKit Tuned has a 0.38s spread (1.61 to 1.99s). LiveKit baseline has a 0.89s spread (1.55 to 2.44s). Tuning LiveKit reduced its spread, but SignalWire's tightest config is still 4x more consistent than LiveKit's tuned config.
Variance at small scale predicts variance at large scale. A platform that shows 0.89s spread across 3 test calls has multiple external services, each with its own queue and failure mode. At 300 concurrent calls, those queues get deeper. A platform with 0.09s spread has fewer compounding variables. Tight variance in a benchmark is the signal you have before committing to a production deployment.
The parameter that matters most: end_of_speech_timeout. SignalWire exposes this at millisecond granularity. Reducing it from the default 1000ms to 250ms dropped conversational latency from 1.46s to 1.10s (25% improvement) with no other changes.
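The change is a one-parameter edit. Below is a hedged SWML-style sketch; the key names follow SignalWire's documented AI parameters, but verify them against current docs before relying on this, and the prompt text is a placeholder:

```yaml
# Sketch: lowering end_of_speech_timeout from the 1000 ms default to 250 ms.
version: 1.0.0
sections:
  main:
    - ai:
        params:
          end_of_speech_timeout: 250   # ms; platform default is 1000
        prompt:
          text: You are the Smart Appointment Assistant.
```

The trade-off is standard for endpointing: a lower timeout responds faster but risks cutting off callers who pause mid-sentence, so tune against real call audio rather than assuming lower is always better.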
| Metric | SignalWire | LiveKit (tuned) | Vapi |
|---|---|---|---|
| Verdict | Fastest tested, most configurable | | |
| Conversational Latency | 1.24s avg (5 configs) | 1.75s | 1.85s |
| Tool Turn Latency | 2.01s (OSS+EL config) | 2.90s | 2.82s |
| Latency Spread | 0.09 - 0.46s across configs | 0.38s | 0.66s |
| Filler Support | Built-in, per-language and per-function | Not available | Partial, not fully controllable |
| OSS Model Support | Drop-in via AI kernel | OpenAI-compatible endpoints, Ollama helper | Not tested |
| Endpointing Control | Millisecond-level end_of_speech_timeout | min/max delay (0.2s/0.8s tuned) | Smart endpointing plans |
Voice AI latency is primarily an orchestration and control-plane problem. Platforms purpose-built for real-time AI voice orchestration demonstrate lower latency variance, fewer extreme spikes, and more predictable response timing.
Systems adapted from messaging, workflow, or dashboard-first architectures introduce coordination overhead that no model quality advantage can fully compensate for.
Speech-to-speech models like Amazon Nova Sonic and OpenAI's Realtime API eliminate the STT+LLM+TTS pipeline entirely. Audio goes in, audio comes out, no text intermediate. This removes the serialization overhead that dominates the benchmark results above. SignalWire supports Amazon Nova Sonic in production and has OpenAI's Realtime API in testing, both integrated into the same platform through the same AI kernel.
For pure conversational turns (no tool calls), speech-to-speech achieves sub-600ms latency. That is a meaningful improvement over the 1.09 to 1.87s range in the STT+LLM+TTS benchmarks above.
The trade-offs are real. Current speech-to-speech models have weaker tool calling (function invocation is less reliable than text-based LLMs), less predictable inference (the model sometimes hallucinates or misinterprets in ways that text pipelines do not), and limited language/voice selection. For applications that need reliable tool execution, structured data extraction, or governed AI behavior, the STT+LLM+TTS pipeline remains the production choice.
This is not an either/or decision on SignalWire. The AI kernel supports both pipelines through the same interfaces. Developers choose per agent or per interaction based on what the use case requires. As speech-to-speech models mature, the same agents can switch without code changes. The infrastructure advantage holds either way: the kernel orchestrates the model from inside the media layer regardless of which pipeline is active.
- No multi-layer handoff overhead.
- Define agents in Python, TypeScript, Go, Java, C#, Rust, Ruby, PHP, Perl, or C++. Compose in code, deploy anywhere.
- Objective measurement at the telecom layer, not the application layer.
- Your functions can transfer, hold, conference, record, collect payments, send SMS, or issue commands to other live calls. Actions, not just data retrieval.
- 0.09s spread in the tuned config. Predictable performance, not best-case marketing numbers.
- end_of_speech_timeout at millisecond granularity, plus temperature, top_p, and per-function fillers. Tune what matters.
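The per-function filler behavior described above can be modeled in a few lines. This is an illustrative sketch of the concept, not the SignalWire SDK API; all names here are hypothetical. The agent speaks a function-specific phrase the moment a tool call starts, so the API round-trip happens behind audio instead of silence:

```python
# Illustrative model of per-function fillers: speak first, then run the
# slow tool call. Names and structure are hypothetical, not a real SDK.
import time

FILLERS = {
    "search_appointments": "Let me check the calendar for you.",
    "book_appointment": "One moment while I book that.",
}

def call_tool(name, tool, *args):
    """Speak the function's filler, then execute the tool call."""
    spoken = FILLERS.get(name, "One moment.")
    print(f"[speak] {spoken}")           # caller hears this immediately
    start = time.monotonic()
    result = tool(*args)                 # external API round-trip
    print(f"[tool {name}] {time.monotonic() - start:.2f}s")
    return result

# Stand-in for the appointment API's GET /search round-trip.
slots = call_tool("search_appointments", lambda: ["10:00", "14:30"])
print(slots)
```

In a real pipeline the filler audio streams concurrently with the tool call, which is how a 3.01s silent gap becomes a 1.40s perceived response.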
Benchmark conditions are controlled. Production is not. Three calls per config tells you how a platform performs in isolation. It does not tell you what happens at hundreds of concurrent calls, under real network conditions, with callers who interrupt and go silent.
The benchmark data does contain a signal: latency spread predicts production behavior. Each platform's spread and failure mode under test load points to what will happen when that load multiplies.
| Platform | Avg Latency | Spread | Scaling Risk | Primary Failure Mode at Scale |
|---|---|---|---|---|
| SignalWire | 1.46s (default) | 0.46s | Low | Narrow variance, few compounding failure boundaries. Predictable orchestration under concurrent load. |
| LiveKit (tuned) | 1.75s | 0.38s | Low-Medium | Tuning reduced spread from 0.89s to 0.38s. External STT/TTS dependencies remain. |
| LiveKit (baseline) | 1.87s | 0.89s | Medium | External STT/TTS dependencies multiply failure surface. Wider variance compounds under concurrent load. |
| Vapi | 1.85s | 0.66s | Medium-High | 4.87s outlier during a function call. Async tool execution is the likely cause. |
| Twilio | 2.42s | 1.21s | High | No programmatic control. No ability to tune, optimize, or instrument at scale. 3 of 4 turns exceeded 2.3s. |
| Platform | Conv. Latency | Tool Latency | Spread | Fillers | Verdict |
|---|---|---|---|---|---|
| SignalWire | 1.24s avg (5 configs) | 2.01s (OSS+EL) | 0.09 - 0.46s | Built-in | Fastest tested across all turn types. Tightest variance. Most exposed pipeline controls. |
| LiveKit (tuned) | 1.75s | 2.90s | 0.38s | None | Tuning improved 6% over baseline. Still 17% slower than SW Default. |
| Vapi | 1.85s | 2.82s | 0.66s | Partial | Close to LiveKit baseline. Dashboard latency claims did not match waveform measurement. |
| LiveKit (baseline) | 1.87s | 3.08s | 0.89s | None | Widest spread. External dependencies add variance. |
| Twilio | 2.42s | — | 1.21s | None | Already past the 2-second threshold on default components. No code-level pipeline control. |
Sign up, deploy the same appointment agent, and measure your own latency. The tool is open source. The platform is pay-as-you-go. $0.16/min, no minimum.