
The Truth About Voice AI Latency

We built an open-source tool to find out

Anthony Minessale, CEO

Voice AI latency claims are meaningless without knowing what interval was measured. Most published numbers start the timer late, stop it early, or measure on WebRTC rather than real phone calls. SignalWire created an open-source tool to measure the only intervals that actually matter and tested it against LiveKit and Vapi using identical models, voices, and call scripts. SignalWire ranged from 1.09s to 1.46s across five configurations. LiveKit tuned measured 1.75s. Vapi measured 1.85s. The advantage comes not from fewer external service calls but from orchestrating the pipeline inside the media layer, which reduces variance as much as average latency.

This post reveals what actually happens between the moment a caller stops speaking and the moment the AI responds, why most published numbers don't measure that interval, and what we built to find the truth.

Everyone says their AI is fast

Search for voice AI latency and you will find companies claiming 200ms, 300ms, 500ms response times. The numbers are impressive. The problem is that none of them measure the same thing.

Some start the timer after speech recognition finishes, skipping the time the system spent deciding the caller was done talking. Some stop it when the first token leaves the language model, before that token has been converted to audio and delivered through the phone network. Some measure on an idle GPU with a warm cache and a synthetic prompt that hits prefix reuse. Some measure on a WebRTC connection with no phone network in the path.

A caller does not experience any of those intervals. A caller experiences one thing: I stopped talking, and then I waited, and then the AI started talking. That's the only number that matters, and the hardest one to measure honestly.

The physics sets a floor

Before you can evaluate any platform's latency claim, you need to know what the components actually cost. Every voice AI turn walks through a chain, and each link has a non-zero price.

The language model must process the full prompt before producing its first output token. For GPT-4o on OpenAI's API, the median time to first token is 910ms (Artificial Analysis, 10K input token workload). For a self-hosted Llama 3.3 70B on 4x H100 GPUs, the reported floor is 148ms under ideal conditions: warm, batch size 1, synthetic workload. Under real server load with concurrent requests (MLPerf Inference v5.0), Llama 2 70B first-token latency is 443ms at p50 and over 2 seconds at p99.9. The model used in our benchmark, gpt-4o-mini, is faster than GPT-4o, but the floor is not zero.

Before the model even starts, the system must decide the caller has finished speaking. This is endpointing, and it adds 200ms to 1000ms depending on how aggressively you are willing to risk cutting someone off mid-sentence. Then speech-to-text processes the audio and returns a transcript. Then after the model generates enough tokens for a speakable phrase, the text-to-speech service synthesizes audio and begins streaming. TTS first-byte latency varies from 100ms to 500ms depending on the provider and voice.

These costs are additive. A platform claiming 300ms end-to-end with a capable model is either measuring a different interval than what the caller perceives, or excluding parts of the chain. The recipes are documented: start the timer late, stop it early, measure on a tiny model, use prefix caching, report the best run. None of those tricks repeal the compute cost. They hide it.
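Because the costs are additive, the floor is easy to sanity-check with plain addition. A back-of-the-envelope sketch, using optimistic picks from the ranges cited above (the per-component values are illustrative assumptions, not measurements):

```python
# Optimistic per-component costs in milliseconds, picked from the
# ranges cited above. Illustrative assumptions, not measurements.
pipeline_ms = {
    "endpointing": 300,       # 200-1000ms depending on tuning
    "speech_to_text": 150,    # assumed: final transcript after end of speech
    "llm_first_token": 443,   # Llama 2 70B p50 under load (MLPerf v5.0)
    "tts_first_byte": 200,    # 100-500ms depending on provider and voice
    "transport": 75,          # WebRTC best case; PSTN adds more
}

floor_ms = sum(pipeline_ms.values())
print(f"optimistic floor: {floor_ms} ms")
```

Even with optimistic picks, the sum lands well above one second, which is why a sub-500ms end-to-end claim should prompt the question of what the timer actually covered.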

The latency metric nobody mentions: transport

Most benchmarks are measured on WebRTC connections, often on the same network as the server. Real voice AI deployments involve phone calls.

A WebRTC browser-to-server connection adds 50 to 100ms round-trip depending on geography. That is the best case. SIP trunking adds codec negotiation and carrier routing. PSTN (an actual phone call) adds carrier transit, transcoding between G.711 and PCM, and jitter buffering. Some platforms handle phone calls by bridging a WebRTC room to a SIP trunk through a separate provider, which adds both hops plus the bridge translation.

A benchmark measured on a WebRTC room is measuring a fundamentally different audio path than what a caller on a phone experiences. And here is the part that matters for measurement: if the platform records audio at the application layer (before the phone network), the recorded latency does not include the transport to the caller. If the platform records at the telecom layer (where the caller's audio actually lives), the recording captures what the caller heard. You cannot record at the telecom layer if you do not own the telecom layer.

How we built a latency checker

We wanted to know how fast our platform actually is. Not a marketing number. The real number, measured the way the caller would experience it.

We built signalwire/latency_checker, an open-source waveform analysis tool. It takes call recordings and measures, per turn, the interval from the moment the human stops speaking to the moment the AI starts speaking. Stereo recordings (human on the left channel, AI on the right) give the most precise results. Mono recordings also work, using diarization to separate the speakers, though with less precision. It works on any platform's recordings.
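The core measurement is simple to sketch. Here is a minimal illustration of the idea (not the actual latency_checker implementation): an energy-threshold voice activity check per channel, then the gap from the end of each human segment to the next AI onset, assuming a 16-bit stereo WAV with the human on channel 0.

```python
import wave
import numpy as np

def turn_gaps(path, frame_ms=20, threshold=0.02):
    """Human-stop-to-AI-start gaps (seconds) from a stereo recording.

    Assumes a 16-bit WAV with the human on channel 0 and the AI on
    channel 1. Uses naive RMS-energy VAD; a production tool would be
    far more robust.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        raw = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    stereo = raw.reshape(-1, 2).astype(np.float32) / 32768.0

    hop = int(rate * frame_ms / 1000)
    n = len(stereo) // hop
    frames = stereo[: n * hop].reshape(n, hop, 2)
    rms = np.sqrt((frames ** 2).mean(axis=1))       # per-frame RMS, each channel
    human_active = rms[:, 0] > threshold
    ai_active = rms[:, 1] > threshold

    gaps, i = [], 0
    while i < n:
        if human_active[i]:
            while i < n and human_active[i]:        # skip to end of human segment
                i += 1
            stop = i                                # first silent frame after human
            j = stop
            while j < n and not ai_active[j]:       # find next AI speech onset
                j += 1
            if j < n:
                gaps.append((j - stop) * frame_ms / 1000.0)
            i = j
        else:
            i += 1
    return gaps
```

The real tool adds diarization for mono files and much better endpoint detection, but the interval being measured is exactly this one.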

We implemented the same appointment booking agent on SignalWire, LiveKit, and Vapi. Same language model (gpt-4o-mini). Same TTS voice (ElevenLabs Rachel). Same external API. Same call script. Three calls per configuration, three measured turns per call. We tested five SignalWire configurations: stock defaults, tuned endpointing, native TTS, open-source LLM, and a combination. We did not cherry-pick the best one.

We also tested LiveKit with their exposed tuning parameters (preemptive generation enabled, tightened endpointing). If we are going to tune ours, we should tune theirs too.

We published the raw data, the tool, every per-turn measurement, and the methodology in this report.

What we found: the fastest and slowest AI voice agents

SignalWire's conversational latency ranged from 1.09s to 1.46s across five configurations, averaging 1.24s. LiveKit baseline measured 1.87s. LiveKit tuned (with preemptive generation and tightened endpointing) measured 1.75s. Vapi measured 1.85s.

The headline comparison: SignalWire was 17 to 38% faster than LiveKit's tuned configuration depending on which SignalWire config you compare. Our stock defaults, with zero tuning, beat their optimized config by 17%.

But the finding that changed how we think about the problem was not the averages. It was the variance. Our tightest configuration had a 0.09s spread (1.05 to 1.14s). LiveKit tuned had 0.38s. LiveKit baseline had 0.89s, meaning their worst turn was more than double their best. In a three-turn conversation, at least one turn on LiveKit will likely land in the noticeable-delay zone.

Variance matters more than speed for production. A system with a 1.7s average and occasional spikes feels worse than one with a stable 1.8s. And variance at small scale predicts variance at large scale: a 0.89s spread across just three test calls points to multiple external services compounding queue depth. At 300 concurrent calls, that gets worse, not better.
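Spread and mean fall straight out of the per-turn data. A minimal sketch of the computation (the per-turn values below are illustrative, chosen to match the quoted spreads, not the published raw data):

```python
def spread_stats(turn_latencies):
    """Mean and min-to-max spread of per-turn latencies, in seconds."""
    return {
        "mean": sum(turn_latencies) / len(turn_latencies),
        "spread": max(turn_latencies) - min(turn_latencies),
    }

# Illustrative per-turn values consistent with the spreads quoted above
signalwire_tight = [1.05, 1.08, 1.14]   # 0.09s spread
livekit_baseline = [1.42, 1.88, 2.31]   # 0.89s spread

print(spread_stats(signalwire_tight))
print(spread_stats(livekit_baseline))
```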

Where the difference comes from

Every platform in our test called the same external STT, LLM, and TTS services. The model is not the variable. The orchestration layer is.

Most voice AI platforms orchestrate the pipeline from an application-layer script. Audio goes out to STT, text comes back, goes out to the LLM, response comes back, goes out to TTS, audio comes back. The script coordinates these services from outside the media path. LiveKit, for example, runs a Python async event loop that manages the pipeline with LLM-to-TTS streaming overlap. It is a reasonable architecture. Getting it to work takes a weekend. Getting it to feel like a conversation takes months.

SignalWire's AI kernel orchestrates from inside the media processing layer. It calls the same external services, but from a position with direct access to the audio stream, timing, and call state. Barge-in and endpointing are detected at the audio level inside the media engine, before any external service is involved. Provider differences in latency profiles, streaming behavior, and error modes are normalized by the kernel so they perform consistently regardless of which vendor you choose.

The result is not fewer external calls. It is lower variance and faster coordination, because the orchestration layer was built for real-time audio from the start.

The latency budget

After running these benchmarks, we stopped thinking about latency as a race and started thinking about it as a budget.

The 2-second conversational threshold is a spending limit, not a target. Every component in the pipeline consumes part of it. The question is not how fast the platform is. The question is how much headroom the platform gives you to choose the components you want.

SignalWire Default at 1.46s leaves 540ms of headroom. LiveKit tuned at 1.75s leaves 250ms. With 540ms you can choose a richer TTS voice, a more capable LLM, or a slower but more reliable external API without breaking the conversation. With 250ms there is not much room to trade speed for quality.

The five SignalWire configurations in the benchmark are the latency budget in action. Each represents different component choices: open-source STT, premium TTS, native TTS, tuned endpointing, stock defaults. All stayed under the threshold. All beat LiveKit tuned. The budget let us make choices based on quality rather than speed.
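The budget arithmetic is simple enough to write down. A sketch using the 2-second threshold and the averages quoted above:

```python
THRESHOLD_S = 2.0  # the conversational spending limit, not a target

def headroom_s(avg_latency_s, threshold_s=THRESHOLD_S):
    """Budget left over to spend on richer voices, bigger models, slower APIs."""
    return round(threshold_s - avg_latency_s, 3)

print(headroom_s(1.46))  # SignalWire Default
print(headroom_s(1.75))  # LiveKit tuned
```

Every component choice, such as a slower premium TTS voice or a more capable model, draws down the same number; the platform with the larger starting balance is the one that lets you spend.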

Fillers and perceived latency

Tool calls are the latency spikes that break conversations. When the AI needs to check inventory or book an appointment, the external API round-trip adds seconds on top of the pipeline latency. Without fillers, the caller sits in silence for 3 seconds waiting for the tool to return.

SignalWire has built-in speech fillers that play during tool execution. The caller hears "let me check on that" at 1.40s instead of silence until 3.01s. The tool still takes the same time. The caller's experience changes completely. This is not a latency reduction, but a perception shift. However, perception is what determines whether the caller stays on the line.
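The perception shift can be stated as a one-line model: the caller's perceived wait is the time until the first audio they hear, from whichever source produces it first. A sketch using the timings from our filler test:

```python
def perceived_wait_s(tool_return_s, filler_start_s=None):
    """Silence the caller sits through before hearing any audio.

    Without a filler, the first audio is the post-tool response;
    with one, it is the filler phrase itself.
    """
    if filler_start_s is None:
        return tool_return_s
    return min(filler_start_s, tool_return_s)

print(perceived_wait_s(3.01))        # no filler: silence until the tool returns
print(perceived_wait_s(3.01, 1.40))  # filler: "let me check on that" plays first
```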

Neither LiveKit nor Vapi offers equivalent built-in filler support.

A note on Vapi's dashboard

During testing, Vapi's dashboard displayed approximately 840ms latency for the same calls where our waveform analysis measured 1.85s. We do not know exactly what interval Vapi's dashboard measures. We do know it does not match what the stereo waveform shows for human-stop-to-AI-start. The dashboard may measure a different point in the pipeline, or it may exclude parts of the path the caller experiences.

This is the measurement problem in miniature. A number on a dashboard is only useful if you know exactly what it measures and whether that matches what the end user perceives.

Speech-to-speech: the next pipeline

Speech-to-speech models like Amazon Nova Sonic and OpenAI's Realtime API eliminate the STT+LLM+TTS chain entirely. Audio goes in, audio comes out. SignalWire supports Amazon Nova Sonic in production and has OpenAI's Realtime API in testing, both through the same AI kernel.

For pure conversational turns, speech-to-speech achieves sub-600ms latency. That is a real improvement over the 1.09 to 1.87s range in the STT+LLM+TTS benchmarks.

The trade-offs are real. Current speech-to-speech models have weaker tool calling, less predictable inference, and limited voice selection. For applications that need reliable function execution or governed AI behavior, the traditional pipeline remains the production choice. On SignalWire, both pipelines run through the same kernel and the same interfaces. As the models mature, agents can switch without code changes.

Why we published the latency checking tool

The voice AI industry does not have a standard measurement methodology. Every company measures differently, reports differently, and optimizes for different benchmarks. This makes comparison impossible and rewards creative measurement over actual performance.

We published signalwire/latency_checker because the industry needs a shared measurement standard more than any single company needs a proprietary advantage in benchmarking. If our numbers are wrong, the tool will show it. If a competitor is faster, the tool will show that too.

SignalWire's platform also produces the same latency data natively. Every AI interaction generates a structured post-call report with per-component timing. Each AI response includes LLM first-token latency, time to speakable phrase, and time to audio playback. Each user input includes speaking duration, turn detection timing, and confidence. Each tool call includes function execution duration, the function name, and which conversation step triggered it. A flat timeline feed sequences every event with microsecond timestamps so you can reconstruct the exact flow of any call.
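Consuming a report like that is straightforward. A sketch of pulling per-turn answer times out of structured post-call data (the JSON shape and field names here are hypothetical stand-ins for illustration, not SignalWire's actual schema):

```python
import json

# Hypothetical report shape. The real post-call report's field names may
# differ; this only illustrates the kind of per-turn breakdown described.
report = json.loads("""
{
  "turns": [
    {"llm_first_token_ms": 410, "time_to_speakable_ms": 620, "time_to_audio_ms": 1120},
    {"llm_first_token_ms": 395, "time_to_speakable_ms": 600, "time_to_audio_ms": 1090}
  ]
}
""")

# Per-turn answer time, approximated by time to audio playback
answer_times_s = [t["time_to_audio_ms"] / 1000.0 for t in report["turns"]]
print(answer_times_s)
```

Because every call emits this data, the same latency breakdown the waveform tool produces offline is available in production, per call, with no extra instrumentation.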

The native telemetry matches the waveform results. In the calls we benchmarked, the per-turn answer times from the post-call report aligned with what the stereo waveform tool measured independently. You can build the same latency breakdown from production data, across every call, at scale, without an external measurement tool.

We would rather compete on a level playing field with honest measurement than win a marketing contest where everyone measures differently.

What honest latency measurement looks like

Our numbers are 1.09 to 1.46s depending on configuration, 17 to 38% faster than a tuned LiveKit. Those numbers come from real phone calls, stereo waveform analysis, and a tool anyone can run.

They include our limitations: three calls per configuration, one filler test call, different recording capture points across platforms. They include configurations where we do not look great: our stock defaults are 7% slower than LiveKit overall when you include tool turns, because our default end-of-speech timeout is conservative. They include the competitor's tuned config, not only their defaults.

A developer choosing a voice AI platform deserves to know what the timer measured, what transport path the audio took, what model was used, and whether the number represents what the caller actually hears. That is the bar. Anything less is marketing.

Measure it yourself.

The tool is open source. The platform is pay-as-you-go at $0.16/min with no minimum. Deploy the same appointment agent and run your own benchmark.

Frequently asked questions

Why do voice AI latency numbers vary so much between companies?

There is no standard measurement methodology. Companies start and stop the timer at different points in the pipeline — some skip endpointing, some stop before TTS delivers audio, some measure on WebRTC rather than real phone calls. The result is numbers that are technically defensible but don't reflect what a caller on a phone actually experiences.

What is endpointing and why does it matter for latency?

Endpointing is how the system detects that a caller has finished speaking. It adds 200ms to 1000ms to every turn depending on how aggressively it's tuned. Most latency benchmarks either exclude this entirely or bury it. It is one of the most consequential components in the pipeline and one of the least discussed.

What is the right way to measure voice AI latency?

Measure human-stop-to-AI-start on a real phone call using stereo waveform analysis. The human channel and AI channel should be recorded separately so the interval can be measured precisely. SignalWire published an open-source tool — signalwire/latency_checker — that does exactly this and works on any platform's recordings.

Why does latency variance matter more than average latency?

A system with a stable 1.8s average feels better than one averaging 1.7s with occasional 3s spikes. Variance at small scale also predicts variance at large scale: a 0.89s spread across just three test calls points to multiple external services compounding under load. At 300 concurrent calls, high-variance systems get worse, not better.

What are speech fillers and how do they affect perceived latency?

When an AI needs to execute a tool call — checking inventory, booking an appointment — the external API round-trip adds seconds of silence. Speech fillers play a phrase like "Let me check on that" during tool execution so the caller hears a response immediately rather than waiting in silence. The tool still takes the same time. The caller's experience changes completely.
