
The Scaffolding Is the Product

What the Claude Code leak revealed about voice AI

Anthony Minessale, CEO of SignalWire

A source code leak from Anthropic’s Claude Code revealed that only about 1.6% of the system actually calls the AI model. The remaining 98.4% consists of orchestration, tooling, state management, and infrastructure that make the model usable in production. This article explores why AI products are mostly scaffolding around models, why voice AI requires even more orchestration than typical AI applications, and why embedding AI directly inside communications infrastructure changes performance, reliability, and capabilities.

What a source code leak taught everyone about AI

In March 2026, Anthropic accidentally shipped a source map file in a Claude Code npm package update. Within hours, developers had the full source: 512,000 lines of TypeScript across 1,900 files.

The model was not in the leak. Not a single weight, not a training example, not an inference endpoint. What leaked was everything around the model: tool definitions, agent loops, memory systems, context management, security validators, parallel execution logic, error recovery, and the system prompts that shape behavior. The application layer. The scaffolding.

The ratio was striking. Of 512,000 lines, roughly 8,000 (1.6%) were responsible for calling the AI model. The other 98.4% orchestrated everything else: reading files, managing permissions, tracking context across sessions, coordinating parallel operations, handling failures gracefully, and knowing when to stop.

The same Claude model, accessed through a basic chat interface, produces dramatically inferior coding results compared to Claude Code. The model is identical. The scaffolding transforms it into a product that generates $2.5 billion in annualized revenue.

The scaffolding ratio applies everywhere

This is not unique to coding tools. Every AI product that works in production has the same structure: a small core of model calls wrapped in a large body of orchestration, integration, state management, and domain-specific logic.

Voice AI is the clearest example.

A voice AI agent needs a speech-to-text engine, a language model, and a text-to-speech engine. Those are three API calls. A developer can wire them together in a weekend. The demo works.
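The weekend version really is just three sequential calls. A minimal sketch, with hypothetical stand-in clients for whichever STT, LLM, and TTS vendors you pick (none of these class or method names belong to any real SDK):

```python
class SpeechToText:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()  # stub: pretend the audio is already text

class ChatModel:
    def complete(self, history: list) -> str:
        return f"You said: {history[-1]['content']}"  # stub echo model

class TextToSpeech:
    def synthesize(self, text: str) -> bytes:
        return text.encode()  # stub: bytes standing in for audio

stt, llm, tts = SpeechToText(), ChatModel(), TextToSpeech()

def handle_turn(audio_chunk: bytes, history: list) -> bytes:
    """One conversational turn: audio in, audio out. Three sequential
    vendor calls. No barge-in handling, no call state, no telephony."""
    text = stt.transcribe(audio_chunk)                 # API call 1
    history.append({"role": "user", "content": text})
    reply = llm.complete(history)                      # API call 2
    history.append({"role": "assistant", "content": reply})
    return tts.synthesize(reply)                       # API call 3
```

Every real requirement that follows has to bolt onto this loop somewhere, and that is where the weekend ends.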

Then production happens.

The demo does not handle callers who interrupt mid-sentence. It does not manage state when a call transfers between an AI agent and a human. It does not know what the caller already said when they get transferred. It does not record the call with dual-channel separation for compliance. It does not route to the right queue based on caller intent. It does not detect when the caller hangs up during hold music. It does not handle codec negotiation between a SIP trunk and a WebRTC browser client. It does not manage phone numbers across 60 countries. It does not collect PCI-compliant payments via DTMF. It does not produce structured logs that tell you which component caused the 3-second delay at 2am.

The model calls are the 1.6%. Everything else is the 98.4%.

Where most voice AI companies built the scaffolding

Most voice AI platforms built their scaffolding on top of infrastructure they rent from someone else. The pattern: orchestrate calls to a telephony provider's API, a speech-to-text provider's API, a language model provider's API, and a text-to-speech provider's API. Stitch the results together. Add state management. Charge a fee for the orchestration.

This is a reasonable product when each of those providers requires different SDKs, different authentication, different error handling, and different billing. The orchestration layer saves developers from building the glue themselves.

But it is a structurally fragile position.

The telephony provider can change pricing or deprecate APIs. The model providers are shipping their own voice capabilities (OpenAI's Realtime API already supports SIP with call control). Speech-to-speech models are eliminating the need for separate STT, LLM, and TTS services entirely. Amazon's documentation for one such model explicitly states it "eliminates the need for a separate middleware orchestration layer."

When the thing you orchestrate becomes a single API call, the orchestration layer has nothing left to do.

What the scaffolding needs to sit on

The Claude Code leak revealed something else: the scaffolding works because it has direct access to the systems it orchestrates. It reads files from the filesystem. It runs commands in the shell. It manages git operations. It does not call a "filesystem orchestration API" that calls another API that touches the disk. It sits where the work happens.

Voice AI has the same architectural requirement. The scaffolding that orchestrates a voice conversation needs to sit where the audio lives.

When the AI orchestration layer is embedded inside the media processing infrastructure, three things change.

The AI starts earlier. It does not wait for audio to cross a network boundary before processing begins. It has direct access to the audio stream. The difference is audible: 800-1200ms response times instead of 2-4 seconds.

State does not fragment. The platform knows the authoritative state of every interaction because it is processing the interaction. Transfers carry full context. Recordings know what was said. There is no middleware trying to reconstruct state from webhook events that arrive out of order.
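The out-of-order problem is easy to see in miniature. A hedged sketch of a naive middleware reducer (event names are illustrative, not any vendor's actual webhook schema):

```python
def reconstruct_state(events: list) -> str:
    """Naive reducer: takes the last webhook event received as the
    current call state, which is what happens when middleware trusts
    arrival order."""
    state = "idle"
    for ev in events:
        state = ev["type"]
    return state

# Delivered in order, the call correctly ends:
in_order = [{"type": "ringing"}, {"type": "answered"}, {"type": "ended"}]

# The same three events delivered out of order leave the call
# "answered" forever: a ghost call the middleware must detect and repair.
out_of_order = [{"type": "ringing"}, {"type": "ended"}, {"type": "answered"}]
```

A platform that processes the media itself never runs this reducer, because it is the source of the events rather than a consumer of them.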

Capabilities compose without integration. An AI agent can transfer a caller to a conference, start a recording, collect a payment, send an SMS summary, and return to the conversation. Each operation is native to the same system. No additional vendor integrations. No additional state synchronization. No additional failure modes.

What SignalWire built

Claude Code's leaked source revealed three categories of scaffolding: tool orchestration (40+ tools for filesystem, git, shell), an agent loop managing context and state across sessions, and a permission system governing what the AI can access. Voice AI needs the same three categories, but for phone calls instead of codebases.

The kernel (tool orchestration). SignalWire's AI kernel is embedded inside the media engine that processes every call. It orchestrates speech-to-text, language models, and text-to-speech providers from a position inside the media path. The kernel starts processing before any external system knows there is a call. This is the latency advantage: 800-1200ms instead of 2-4 seconds. But it is only one part of the scaffolding.

The programmability (agent capabilities). Claude Code has 40 tools for interacting with the filesystem. SignalWire has 50+ declarative methods for interacting with the phone network. An AI agent's tool handler can return actions that the platform executes immediately: hold the call, dial another number, bridge a conference, start a recording, collect a payment via DTMF, send an SMS, transfer with full context, or issue commands to other live calls by UUID. The developer writes the tool handler in any of 10 languages. The platform executes the actions. This is why a SignalWire agent can do everything a contact center does, not because the AI is smarter, but because the scaffolding connects it to every communications primitive through one interface.
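The shape of such a tool handler can be sketched in a few lines. This is illustrative only: the action names mirror the capabilities described above but are invented for this example and do not reproduce SignalWire's actual SDK surface:

```python
def transfer_to_billing(args: dict, call: dict) -> dict:
    """Hypothetical handler for a 'transfer_to_billing' tool the model
    can invoke. It returns declarative actions for the platform to
    execute; the handler never touches SIP, media, or state sync."""
    return {
        "response": "Transferring you to billing now.",
        "actions": [
            {"start_recording": {"channels": "dual"}},      # compliance copy
            {"send_sms": {"to": args["caller"],
                          "body": "Summary: transferred to billing"}},
            {"connect": {"to": "sip:billing@example.com",
                         "context": call["transcript"]}},   # full context rides along
        ],
    }
```

The handler is pure data in, data out; recording, SMS, and transfer happen inside one system rather than across three vendors.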

The governance (permission system). Claude Code's leaked source included 9,700 lines of security validators controlling what the AI can access. SignalWire's control plane does the same for voice: at each step of a conversation, the platform decides which tools the model can see, which data enters its context window, and which state transitions are legal. The model does not know what it cannot access. It does not know other tools exist. It does not know a state machine is governing its behavior. This is how you put an AI on the phone without it promising things your business cannot deliver.
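Step-scoped gating can be sketched as a small state machine. The states, tool names, and transitions below are made up for illustration; they show the mechanism, not SignalWire's implementation:

```python
# Which tools the model is allowed to see at each conversation step.
ALLOWED_TOOLS = {
    "greeting": {"identify_caller"},
    "verified": {"check_balance", "transfer_to_billing"},
    "payment":  {"collect_payment_dtmf"},
}

# Which state transitions the platform considers legal.
LEGAL_TRANSITIONS = {
    "greeting": {"verified"},
    "verified": {"payment", "greeting"},
    "payment":  {"verified"},
}

def visible_tools(state: str) -> set:
    """The model only ever sees the tools legal in the current step;
    it cannot call, or even mention, what it cannot see."""
    return ALLOWED_TOOLS.get(state, set())

def transition(state: str, next_state: str) -> str:
    """Reject any transition the state machine does not permit,
    regardless of what the model asked for."""
    if next_state not in LEGAL_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state
```

An unverified caller cannot reach the payment tool even if the model tries, because the payment step is unreachable from greeting.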

The media engine is powered by FreeSWITCH, created by the SignalWire team two decades ago. The control plane, AI kernel, developer tools, and managed infrastructure are the layers they built on top of it. SDKs in 10 languages. Twilio-compatible migration. $0.16/min AI processing. The scaffolding and the infrastructure are the same system.

The model is a commodity. The position is not.

LLM inference costs are falling roughly 50x per year. Speech-to-speech models from OpenAI, Amazon, and Google are production-ready and competing on price. The model layer is commoditizing faster than any previous technology input.

This is good for infrastructure companies. Every new model, every price reduction, every new modality plugs into the same kernel interfaces. The infrastructure becomes more valuable as the models become cheaper, because the infrastructure is what makes them useful in production.

The Claude Code leak proved that the scaffolding is 98.4% of the engineering effort in AI coding tools. Voice AI has the same ratio. The question is not whether you have scaffolding. Everyone does. The question is whether your scaffolding is embedded inside the infrastructure it orchestrates, or floating on top of infrastructure someone else controls.

SignalWire built the infrastructure. Then built the scaffolding inside it. That is a position you cannot replicate by assembling APIs.
