Dani Plicka

Zombie calls happen when your system treats webhooks and a database as the source of truth for call state. Under real concurrency, events arrive late, out of order, duplicated, or not at all, and your application keeps a dead call marked as active, leaking resources and continuing to bill. This article explains why zombie calls are a predictable failure mode of externalized state, and how orchestrated call control prevents them by keeping authoritative state inside the call execution context.

Voice Horror Stories: The Zombie Call

At small scale, most voice and AI systems feel reliable.

At production scale, they start becoming ghost stories.

Calls that never die. Context that disappears. Transfers that reach humans without callers. AI responses that arrive too late to matter.

These aren’t edge cases, and they aren’t model failures. They’re predictable outcomes of architectures that rely on external state, asynchronous events, and best-effort ordering.

This series breaks down real failure modes that emerge when voice and AI systems scale, explaining why they happen, why they’re so hard to debug, and what kind of orchestration architecture prevents them entirely.

Welcome to part 1: When the call is dead… but your system swears it’s still alive.

At low volume, everything feels calm.

Your voice AI application behaves exactly as expected. Calls connect. State transitions cleanly. Logs look sane. At 100 calls a day, your architecture feels almost elegant.

Then you ship. Or marketing ships. Or a customer ships you straight into reality.

Suddenly you’re handling thousands of concurrent calls, webhooks are flying in from every direction, and your system starts telling you things that are… impossible.

A call that ended five minutes ago is still marked active.

Billing hasn’t stopped.

Cleanup never fired.

Resources are locked.

Welcome to your first real horror story in production: the Zombie Call.

Act I: Everything works (until it doesn’t)

At small scale, most voice applications rely on a familiar pattern:

A call starts
You receive a webhook
You update call state in your database
You take action based on that state

It feels deterministic because most of the time, events arrive in the order you expect. When a call ends, you get a “call completed” event, you mark it as finished, and the system moves on.
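The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not any provider’s API: the event names (`call.started`, `call.completed`), the `CallRecord` type, and the in-memory `call_db` dictionary standing in for a database are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    call_id: str
    status: str  # "active" or "completed"


# Hypothetical stand-in for the application's database table of calls.
call_db: dict[str, CallRecord] = {}


def handle_webhook(event: dict) -> None:
    """Apply one webhook event to the stored call state."""
    call_id = event["call_id"]
    if event["type"] == "call.started":
        call_db[call_id] = CallRecord(call_id, "active")
    elif event["type"] == "call.completed":
        call_db[call_id].status = "completed"


def active_calls() -> list[str]:
    """Calls the application believes are live (this drives billing and cleanup)."""
    return [c.call_id for c in call_db.values() if c.status == "active"]


# Happy path: each event arrives exactly once, in order.
handle_webhook({"type": "call.started", "call_id": "c1"})
handle_webhook({"type": "call.completed", "call_id": "c1"})
print(active_calls())  # []
```

Note that the handler’s correctness depends entirely on every event being delivered once and in order, which is exactly the assumption that stops holding under load.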

At 100 calls per day, network jitter, delayed callbacks, retries, and partial failures barely register.

At 5,000 concurrent calls, they are the system.

Act II: The call that wouldn’t die

Here’s what the Zombie Call looks like in practice:

The caller hangs up
The media stream drops
The PSTN leg is gone
But your application never processes the final state change

Why?

Because the truth of the call lives outside your system.

You’re reconstructing reality from asynchronous signals:

Webhooks that can arrive late
Events that can be duplicated
Notifications that can be dropped entirely
State transitions inferred after the fact

If the “call ended” event is delayed, reordered, or lost, your system has no way to know the call is dead. As far as your database is concerned, it’s still alive. And so it stays alive forever.
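To make the failure concrete, here is a small simulation of a single dropped webhook. The event names and tuple format are hypothetical, and the drop is hard-coded rather than random so the outcome is reproducible.

```python
def apply_events(events):
    """Rebuild call state purely from the webhook events that were delivered."""
    state = {}
    for event_type, call_id in events:
        if event_type == "call.started":
            state[call_id] = "active"
        elif event_type == "call.completed":
            state[call_id] = "completed"
    return state


# What actually happened on the network: both calls started and ended.
emitted = [
    ("call.started", "c1"),
    ("call.started", "c2"),
    ("call.completed", "c1"),
    ("call.completed", "c2"),
]

# What the application received: c2's final event was dropped in transit.
delivered = [e for e in emitted if e != ("call.completed", "c2")]

state = apply_events(delivered)
zombies = [call_id for call_id, status in state.items() if status == "active"]
print(zombies)  # ['c2']
```

Nothing in the application’s code is wrong here. The stored state is simply a faithful record of an incomplete event stream.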

No cleanup. No teardown. No billing stop. Just a ghost consuming resources.

Act III: Months lost to distributed systems debugging

This is the part no demo ever shows…

Your best engineers start chasing symptoms:

Why are calls stuck in “active”?
Why are sessions leaking?
Why does this only happen under load?

They add retries, reconciliation jobs, and watchdogs to kill stuck calls.

Each fix makes the system more complex and more fragile… because the root problem isn’t a bug. It’s externalized state.
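A typical watchdog looks something like the sketch below, which reaps calls that have been silent for too long. The `CallRecord` fields, the `reap_stale_calls` function, and the timeout value are illustrative assumptions, not taken from a specific system.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    call_id: str
    status: str
    last_event_at: float  # seconds, from whatever clock the app uses


def reap_stale_calls(calls, now, max_silence_s):
    """Mark calls completed if no event has been seen for max_silence_s seconds.

    This is a guess: silence does not prove that a call is dead.
    """
    reaped = []
    for call in calls:
        if call.status == "active" and now - call.last_event_at > max_silence_s:
            call.status = "completed"
            reaped.append(call.call_id)
    return reaped


calls = [
    CallRecord("zombie", "active", last_event_at=0.0),     # ended, final event lost
    CallRecord("long-hold", "active", last_event_at=0.0),  # still live, just quiet
    CallRecord("fresh", "active", last_event_at=590.0),
]

print(reap_stale_calls(calls, now=600.0, max_silence_s=300.0))
# ['zombie', 'long-hold']
```

From the database’s point of view, the zombie and the long-running live call are indistinguishable. A short timeout kills real calls, and a long one lets zombies keep billing, so the watchdog trades one failure mode for another.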

Why do zombie calls happen?

Zombie calls happen when:

Call state is stored outside the execution context
Events are processed after the fact
Decisions are made based on eventual consistency

In distributed systems, “eventually” is where ghosts live.

When call control, media, and business logic are decoupled, no component has authoritative knowledge of now. Every part is guessing based on partial information.

Under concurrency, those guesses diverge. That divergence is the zombie.
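Reordering alone is enough to produce that divergence, even when nothing is lost. The sketch below assumes a hypothetical set of event names and a store where each incoming event simply overwrites the current status.

```python
# Hypothetical mapping from webhook event type to the status it writes.
STATUS_FOR_EVENT = {
    "call.started": "active",
    "call.answered": "active",
    "call.completed": "completed",
}


def last_write_wins(events):
    """Final status when each event simply overwrites the stored status."""
    status = None
    for event in events:
        status = STATUS_FOR_EVENT[event]
    return status


in_order = ["call.started", "call.answered", "call.completed"]
# The same three events, but "call.answered" was retried and arrived last.
reordered = ["call.started", "call.completed", "call.answered"]

print(last_write_wins(in_order))   # completed
print(last_write_wins(reordered))  # active
```

Both sequences contain the same three valid events. Only the arrival order differs, and the second one resurrects a call that has already ended.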

How orchestrated call control prevents zombies

The most reliable way to prevent zombie calls is to stop reconstructing reality after it happens.

That requires orchestration, not callbacks. In an orchestrated architecture:

The call lifecycle is executed, not inferred
State is local to the call, not mirrored elsewhere
Actions happen in-line, not in response to delayed notifications

When the call ends, the execution context ends with it. There is no external state to clean up because nothing escaped the call’s control plane.

No dangling sessions, no orphaned state, and no undead resources.

This is why SignalWire treats calls as executable workflows, not event streams.

The system doesn’t learn that a call ended. It knows, because it was the thing executing the call in the first place.
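The general idea of state that lives and dies with the call can be sketched in plain Python. This is not SignalWire’s API or implementation. The names `call_context`, `run_call`, and `CallerHungUp` are hypothetical, and the `released` list stands in for stopping billing and freeing media resources.

```python
from contextlib import contextmanager

released = []  # records teardown, standing in for billing stop and resource release


class CallerHungUp(Exception):
    """Hypothetical signal raised by the media layer when the caller disconnects."""


@contextmanager
def call_context(call_id):
    """Own one call's state for exactly as long as the call executes."""
    state = {"call_id": call_id, "status": "active"}  # local to this call
    try:
        yield state
    finally:
        # Runs when the workflow ends for any reason, including a hangup.
        state["status"] = "completed"
        released.append(call_id)


def run_call(call_id, steps):
    """Execute the call's steps in-line; teardown is part of the same execution."""
    try:
        with call_context(call_id) as state:
            for step in steps:
                step(state)
    except CallerHungUp:
        pass  # the context has already been torn down by this point


def greet(state):
    state["greeted"] = True


def hang_up(state):
    raise CallerHungUp()


run_call("c1", [greet, hang_up, greet])
print(released)  # ['c1']
```

Teardown here does not wait for a notification to arrive. It is the last thing the call’s own execution does, so there is no separate record that can drift out of sync with the call.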

The real lesson of the zombie call

Zombie calls aren’t edge cases. They’re what happens when systems scale past the point where best-effort ordering and retries can save you.

If your architecture depends on:

Webhooks for truth
Databases for call liveness
Cleanup jobs for correctness

Then you’re not building call control, you’re building a séance.

In the next part of the series, we’ll talk about The Double Update and how two perfectly valid events arriving at the same time can erase your application’s memory entirely.

Sleep well.

Start building your own voice AI system today, and join our community of voice AI developers on Discord to share your stories.

Frequently asked questions

What is a zombie call in voice systems?
A zombie call is a call that has ended in the network, but your application still thinks it is active because the final state change was never processed, or was processed late or out of order.

Why do zombie calls happen at scale?
At scale, webhooks and event notifications can be delayed, duplicated, reordered, or dropped. If your application reconstructs call truth from those asynchronous signals, it can miss the “call ended” transition and keep stale state alive indefinitely.

What are the common symptoms of zombie calls?
Calls stuck in “active,” billing that does not stop, cleanup routines that never fire, sessions and media resources that leak, and recurring incidents that only appear under high concurrency.

Why don’t retries and reconciliation jobs fully solve zombie calls?
They can reduce symptoms, but they add complexity and fragility because they still rely on externalized state and eventual consistency. You are patching inference after the fact instead of having authoritative call state in the moment.

How does orchestrated call control prevent zombie calls?
In an orchestrated architecture, the call lifecycle is executed, not inferred. State stays local to the call execution context, actions happen in-line, and when the call ends, the execution context ends with it, leaving no external call state to reconcile or clean up.

