Part 4

Build

~56 min • 4 chapters

Chapter 15: Architecture Decisions

What you'll learn: How to manage the latency budget that makes voice conversations feel natural, with a component-level breakdown.

Key takeaways:

Voice agents have a total latency budget of roughly 1,000 milliseconds. Speech-to-text takes 200ms, LLM inference takes 400ms, text-to-speech takes 200ms, network takes 200ms. Every architectural choice spends or saves from this budget.

Where latency hides. Long prompts, slow tool APIs, non-streaming TTS, and provider variance during peak hours. Audit each component individually.
RAG can reduce latency by keeping prompts short. Spending 50-100ms to retrieve only the relevant context beats stuffing everything into a long prompt that slows inference.
Build for failover from day one. Provider outages happen. The system should switch to backups without dropping calls.
Architecture should match your scale. Starter deployments need simplicity. Growth deployments need optimization. Enterprise deployments need redundancy and geographic distribution.

The agent was correct but felt slow. Drivers complained that it took too long to receive a response. Marcus listened to recordings and timed the gaps. Sometimes, there was a full two seconds between when the driver stopped talking and when the agent replied. By then, drivers had already started repeating themselves or hung up.

He brought the recordings to Priya, the infrastructure engineer who'd built the platform the agents ran on. She listened to three calls and diagnosed the problem immediately.

"You're blowing your latency budget."

Marcus didn't know he had a latency budget. Priya explained.

The thousand-millisecond window

Voice conversations feel natural when responses arrive within about one second. Faster feels instant. Slower feels broken. Human conversations have natural pauses, but those pauses have rhythm. An AI that pauses for two seconds in the wrong place sounds like it crashed.

Priya drew the stack on a whiteboard. Speech-to-text transcribes what the driver said. The LLM figures out what to say back. Text-to-speech turns that response into audio. Each step takes time. Add network latency and telephony overhead, and you've spent your thousand milliseconds before you know it.

She broke it down. Speech-to-text, maybe 200 milliseconds on a good provider. LLM inference, 300 to 500 milliseconds depending on the model and prompt length. Text-to-speech, another 150 to 250 milliseconds. Network round-trip and telephony, 100 to 200 milliseconds total.

Add those up on a good day, and you're at 750 milliseconds. Add them up when the LLM is slow, or the prompt is long, and you're past 1,200. Past that threshold, the conversation feels sluggish. The driver loses patience.
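Priya's whiteboard arithmetic can be reproduced in a few lines. This is only an illustrative sketch: the ranges come from her breakdown above, while the dictionary and helper names are hypothetical.

```python
# Per-component latency ranges in milliseconds as (low, high), taken from
# Priya's breakdown. Speech-to-text is a single estimate, so low == high.
COMPONENT_RANGES_MS = {
    "speech_to_text": (200, 200),
    "llm_inference": (300, 500),
    "text_to_speech": (150, 250),
    "network_and_telephony": (100, 200),
}


def total_range_ms(ranges):
    """Sum the low ends and the high ends across all components."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


best, worst = total_range_ms(COMPONENT_RANGES_MS)
print(f"good day: {best} ms, top of the nominal ranges: {worst} ms")
```

The low ends sum to 750 milliseconds and the high ends to 1,150, so a single slow LLM turn or an overlong prompt that runs beyond its nominal range is enough to push a response past 1,200.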

Marcus realized he'd been optimizing conversation design without thinking about what happened beneath the surface. Nina's prompts were good, but if the infrastructure couldn't deliver them fast enough, the experience still failed.

Where latency hides

Priya walked through where teams typically lost time.

Speech-to-text varied more than people expected. Some providers returned transcripts in 150 milliseconds. Others took 400. Accuracy mattered too. A fast provider that mishears the driver creates additional latency downstream because the agent responds to the wrong thing, causing the conversation to loop.

LLM inference was the biggest variable. Longer prompts meant more tokens to process. Marcus's prompts had grown to 1,500 words as he'd added guardrails and edge case handling. Every word cost time. Priya suggested trimming the prompt or moving rarely used instructions into conditional sections that loaded only when relevant.

Tool calls added latency inside the LLM step. If the agent needed to check a backend system before responding, that API call happened mid-inference. A slow API led to slow responses. Marcus had one tool that sometimes took 800 milliseconds on its own. Combined with everything else, responses stretched past two seconds.

Text-to-speech seemed fast but had hidden costs. Some engines needed to process the entire response before speaking. Others could stream, starting to speak while still generating audio for the rest of the sentence. Streaming TTS shaved 100-200 milliseconds off perceived latency because the driver heard the audio sooner.

Network and telephony overhead was mostly fixed, but poor architecture could inflate it. If every component talked to every other component across the public internet, round-trip times would accumulate. Co-locating services or using dedicated connections reduced this overhead.

Priya audited Marcus's deployment and found that he could recover 400 milliseconds. Faster STT provider. Shorter system prompt. Streaming TTS. Async tool calls where possible. The agent felt responsive again.

The latency budget template

Priya gave Marcus a template his team could use for any new agent.

Start with 1,000 milliseconds as the target. Allocate across components based on what you can actually achieve.

Speech-to-text. Target 200 milliseconds. Measure your actual provider. If it's slower, either accept it or switch providers.

LLM inference. Target 400 milliseconds. This depends on prompt length, model choice, and whether you're using tools. Measure with your actual prompt, not a test prompt.

Text-to-speech. Target 200 milliseconds with streaming. Non-streaming adds 100 to 150 milliseconds.

Network and telephony. Budget 200 milliseconds. Measure your actual infrastructure. Co-location helps.

Tool calls. Budget separately. If a tool call is required before responding, add its latency to the LLM step. If the tool averages 300 milliseconds, the LLM step really costs 700 milliseconds against a 400-millisecond target.

The numbers would vary by deployment. The discipline was measuring each component and knowing where you stood against the budget. Teams that didn't track this discovered latency problems only when users complained.
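One way to keep the template honest is to record a target and a measurement for each line and let a script flag the overruns. The sketch below assumes hypothetical measured values (they are not from Marcus's deployment) and hypothetical names such as BudgetLine and audit; only the targets come from the template.

```python
from dataclasses import dataclass

TOTAL_BUDGET_MS = 1000  # target from the template


@dataclass
class BudgetLine:
    component: str
    target_ms: int
    measured_ms: int  # hypothetical measurement, for illustration only

    @property
    def overage_ms(self) -> int:
        """How far the measurement exceeds its target (0 if within target)."""
        return max(0, self.measured_ms - self.target_ms)


def audit(lines):
    """Return the measured total and the lines that exceed their target."""
    total = sum(line.measured_ms for line in lines)
    over = [line for line in lines if line.overage_ms > 0]
    return total, over


lines = [
    BudgetLine("speech_to_text", 200, 180),
    # A blocking 300 ms tool call is charged to the LLM step: 400 + 300.
    BudgetLine("llm_inference", 400, 400 + 300),
    BudgetLine("text_to_speech", 200, 210),
    BudgetLine("network_and_telephony", 200, 150),
]

total, over = audit(lines)
print(f"measured total: {total} ms (budget {TOTAL_BUDGET_MS} ms)")
for line in over:
    print(f"{line.component} is over target by {line.overage_ms} ms")
```

With these made-up numbers, the blocking tool call alone accounts for most of the overrun, which is the point of budgeting tools separately.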

Knowledge bases and retrieval

The support triage agent needed access to policy documents. Hundreds of pages covering procedures, exceptions, and edge cases. Marcus's first instinct was to stuff the relevant sections into the prompt.

Priya stopped him. Long context windows were possible but expensive. Every token in the prompt added inference time and cost. A 10,000-token context costs more than a 2,000-token context and responds more slowly.

Retrieval-augmented generation solved this differently. Store the documents in a knowledge base with semantic search. When a caller asks a question, retrieve only the relevant sections and inject them into the prompt. The prompt stayed short. The agent still had access to the full knowledge base.

Priya set up the architecture. Documents chunked into searchable segments. A retrieval step before the LLM call that pulled the three most relevant chunks. Those chunks injected into the system prompt as context. The agent could answer questions about any policy without loading all policies into every call.

The retrieval step added 50 to 100 milliseconds of latency. But it saved more than that at inference time by keeping the prompt short. And it scaled. Adding a thousand more documents didn't slow down inference because only the relevant ones were loaded.
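The shape of that retrieval step can be sketched in plain Python. Note the assumptions: a real deployment would use embedding-based semantic search, whereas this sketch substitutes a simple bag-of-words cosine similarity as a stand-in, and the policy chunks, function names, and prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query (stand-in for semantic search)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(base_prompt, query, chunks, k=3):
    """Inject only the retrieved chunks into the prompt, keeping it short."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"{base_prompt}\n\nRelevant policy excerpts:\n{context}"


# Hypothetical policy chunks, for illustration only.
CHUNKS = [
    "Insurance claims must be filed within 30 days of an accident.",
    "Drivers reset the app password from the login screen.",
    "Vehicle inspections are required every six months.",
    "Fuel receipts are reimbursed at the end of each month.",
]

QUERY = "How do I file an insurance claim after an accident?"
print(build_prompt("You are a support triage agent.", QUERY, CHUNKS))
```

The prompt grows with k, not with the size of the knowledge base, which is why adding documents does not slow inference.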

For the logistics company's B2B support line, this was essential. Drivers asked questions that spanned insurance policies, operating procedures, and app documentation. No prompt could hold it all. Retrieval made the agent knowledgeable without making it slow.

Failover and redundancy

Marcus asked what would happen if a component failed. The STT provider had an outage. The LLM endpoint went down. The TTS service timed out.

Priya explained that enterprise architecture meant graceful degradation. Every critical component needed a fallback.

For speech-to-text, configure a secondary provider. If the primary fails or responds too slowly, route to the backup. Accept that accuracy might differ slightly between providers. Test the backup regularly to ensure it works.

For LLM inference, options were more limited but still existed. A secondary model, perhaps smaller and faster, could handle calls during primary outages. Or queue calls for callback rather than failing silently. The worst outcome was silence on the line with no explanation.

For text-to-speech, secondary providers were straightforward. The voice might sound different, but that was better than no voice at all.
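The fallback pattern is the same for all three components: try the primary with a deadline, then move to the backup. A minimal asyncio sketch follows, assuming hypothetical provider stubs and a hypothetical call_with_failover helper rather than any specific vendor SDK.

```python
import asyncio


async def call_with_failover(providers, payload, timeout_s):
    """Try each (name, coroutine function) in order.

    A provider is skipped if it raises a connection error or takes longer
    than timeout_s. Raises RuntimeError if every provider fails.
    """
    last_error = None
    for name, fn in providers:
        try:
            result = await asyncio.wait_for(fn(payload), timeout=timeout_s)
            return name, result
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


# Hypothetical stand-ins for real speech-to-text providers.
async def slow_primary(audio):
    await asyncio.sleep(0.5)  # simulates a degraded primary
    return "primary transcript"


async def healthy_backup(audio):
    return "backup transcript"


if __name__ == "__main__":
    providers = [("primary", slow_primary), ("backup", healthy_backup)]
    print(asyncio.run(call_with_failover(providers, b"audio", timeout_s=0.05)))
```

The timeout is the important parameter: a backup that only takes over after several seconds of silence does not save the call.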

Priya built a monitoring system that tracked response times for each component. If speech-to-text latency spiked above 300 milliseconds, alerts fired. If LLM inference consistently exceeded 600 milliseconds, someone investigated. Catching degradation before it became a failure was the goal.

Marcus asked how the automotive marketplace handled this at scale. Priya had talked to their team. 450 concurrent sessions across five countries meant any single point of failure could affect thousands of calls. They ran multiple providers in parallel for critical components, using the fastest response and discarding duplicates. Expensive but resilient.
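Racing providers in parallel might look roughly like the following. This is a sketch of the general idea, not the marketplace's actual implementation; the provider stub and the fastest_response helper are hypothetical.

```python
import asyncio


async def fastest_response(calls):
    """Run all provider calls concurrently and return the first success.

    Failed calls are skipped. Whatever is still running once a result
    arrives is cancelled, so the slower duplicates are discarded.
    """
    tasks = [asyncio.create_task(c) for c in calls]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this provider failed; wait for the next one
        raise RuntimeError("every provider failed")
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished


# Hypothetical provider stub: responds after delay_s, or fails.
async def provider(name, delay_s, fail=False):
    await asyncio.sleep(delay_s)
    if fail:
        raise ConnectionError(name)
    return name


if __name__ == "__main__":
    calls = [provider("a", 0.2), provider("b", 0.01), provider("c", 0.1)]
    print(asyncio.run(fastest_response(calls)))
```

Every request is paid for several times over, which is the "expensive" half of the trade-off Priya described.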

Architecture by scale

Priya outlined how architecture should evolve with scale.

Starter deployments with under 10,000 calls per month can run simply. Single provider for each component. Monitoring but minimal redundancy. The priority was proving the use case worked. Over-engineering at this stage wastes time and money.

Growth deployments (10,000 to 100,000 calls per month) needed more robustness. Failover providers configured but not necessarily running hot. Latency monitoring with alerts. Knowledge base infrastructure if the use case required it. Still manageable with a small team.

Enterprise deployments with over 100,000 calls per month required the full stack. Multi-provider redundancy running actively. Dedicated infrastructure or reserved capacity. Sophisticated monitoring and observability. Likely a platform team maintaining the infrastructure separate from the teams building individual agents.

Marcus's installation agent had started as a starter deployment. Four months later, with multiple agents planned and growing volume, they were moving into growth architecture. Priya was building the infrastructure to support it.

Choosing providers

Each component in the stack required a provider decision. Priya walked Marcus through her criteria.

Latency came first for voice. A provider that was 100 milliseconds slower than competitors consumed 10% of the total budget. That cost compounded across thousands of calls. Measure actual latency in production conditions, not benchmarks on a marketing page.

Accuracy mattered especially for speech-to-text. A provider that mishears 5% of utterances creates downstream problems. The agent responds to the wrong thing. The caller repeats themselves. Conversations loop. Slightly higher latency with better accuracy often produced faster overall conversations.

Cost scaled with volume. A provider charging twice as much per minute might be fine at 5,000 calls per month but prohibitive at 200,000. Model the unit economics at your target scale, not your current scale.

Language support mattered for global deployments. The automotive marketplace operating across five countries needed providers that handled Spanish, Portuguese, and regional dialects. Not all providers offered the same quality across languages. Some excelled in English but struggled elsewhere.

Provider flexibility prevented lock-in. Platforms that abstracted provider choices, letting you swap STT or TTS providers without rewriting integration code, gave you leverage. When a provider raised prices or degraded quality, you could switch. When a better option emerged, you could adopt it.

Priya had seen teams locked into underperforming providers because switching would require months of engineering work. She had built the logistics company's infrastructure with provider abstraction from the start. Swapping providers meant changing configuration, not code.
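"Configuration, not code" usually means every provider sits behind one shared interface and a config value picks which one runs. A minimal sketch of that idea, with hypothetical vendor names and a hypothetical registry (this is not any platform's actual API):

```python
from typing import Callable, Dict

# Registry mapping a provider name from config to a function with a
# shared signature: raw audio bytes in, transcript text out.
STT_PROVIDERS: Dict[str, Callable[[bytes], str]] = {}


def register_stt(name):
    """Decorator that adds a transcription function to the registry."""
    def decorator(fn):
        STT_PROVIDERS[name] = fn
        return fn
    return decorator


# Hypothetical vendors; real adapters would call each vendor's SDK here.
@register_stt("vendor_a")
def vendor_a_transcribe(audio: bytes) -> str:
    return f"[vendor_a] {len(audio)} bytes"


@register_stt("vendor_b")
def vendor_b_transcribe(audio: bytes) -> str:
    return f"[vendor_b] {len(audio)} bytes"


def transcribe(audio: bytes, config: dict) -> str:
    """Dispatch to whichever provider the configuration names."""
    return STT_PROVIDERS[config["stt_provider"]](audio)


# Swapping providers is a one-line config change; calling code is untouched.
print(transcribe(b"abc", {"stt_provider": "vendor_a"}))
print(transcribe(b"abc", {"stt_provider": "vendor_b"}))
```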

What Priya taught Marcus

Six months after that first conversation about latency, Marcus thought about voice agents differently. The conversation design mattered. The prompt engineering mattered. But underneath both was an infrastructure that could make everything feel fast or make everything feel broken.

He'd learned to measure before optimizing. Gut feel said the agent was slow. Measurements told him exactly which component was the bottleneck. Sometimes it was the prompt. Sometimes it was a tool. Sometimes it was a provider having a bad day.

He'd learned to budget latency the way he budgeted money. A thousand milliseconds total. Allocate carefully. Measure constantly. When something new consumed time, something else had to give it up.

The best agents weren't just well-designed; they were well-architected. The conversation only felt natural when the infrastructure underneath could deliver it without hesitation.

Chapter 16: Tool Contracts and Backend Integrations

What you'll learn: How to define tool contracts so the agent never hallucinates behavior when integrations fail or return unexpected results.

Key takeaways:

Your conversation is the UI. Your tools are the backend. The LLM will hallucinate behavior for any gap left undefined in the tool contract.
Tool contracts must specify the purpose, required inputs with validation, output schema, timeout, retry policy, idempotency strategy, and a complete error taxonomy.
Error taxonomy is the mapping of every possible error code to a spoken response and next action. NO_AVAILABILITY, PATIENT_NOT_FOUND, CALENDAR_TIMEOUT each need specific handling.
Idempotency is non-negotiable for state-changing tools. Request IDs prevent duplicate bookings when network issues cause retries.
Test tool contracts before conversation testing. Validate every error code, every timeout scenario, every edge case. Only then, test how the agent talks about them.

The complaint came from a driver named Ray. He'd received a call from the installation agent, agreed to set up the app, and heard the agent say, "I've sent you the link." He waited. No text arrived. He called back and spent fifteen minutes with a human operator who manually sent the link and apologized for the confusion.

Marcus pulled the logs. The SMS gateway had timed out. The API never returned a success response. But the agent had already said the link was sent because the prompt told it to send the link and confirm, treating sending and confirming as a single action rather than two sequential steps with a dependency.

This was the bug Nina had warned him about in Chapter 12. Tool-first truth. Never confirm until the tool confirms. But even with that principle in his prompt, the specific tool contract hadn't enforced it strictly enough. The agent knew the rule in general. It didn't know how to apply it when the SMS API took eight seconds and then timed out.

Marcus brought the problem to Priya. She'd seen this pattern before.

"Your conversation is the UI. Your tools are the backend. And right now your backend has no contract."

Tools are the product

In voice AI, the agent talks. The tools act. The conversation creates the experience, but the tools create the outcome. A scheduling agent that can't actually book appointments is just a chatbot that talks about scheduling. A payment agent that can't process transactions is theater.

The tools are where value gets created or destroyed. A well-designed conversation that triggers a broken tool fails completely. A mediocre conversation that triggers reliable tools might still succeed.

Priya explained that tool contracts were the most important engineering artifact after the system prompt. The prompt told the agent how to talk. The tool contracts told it how to act. And because the LLM served as the integration layer, it would hallucinate behavior whenever a gap was left undefined.

If the contract didn't specify what to do when the API timed out, the agent would make something up. If it didn't specify which errors were retryable, the agent would guess. If it didn't define what success looked like, the agent might prematurely claim success.

Ray's missing SMS was a contractual gap. Marcus hadn't specified the timeout behavior. He hadn't defined what the agent should say when the API failed. He'd left a gap, and the agent filled it wrong.

Anatomy of a tool contract

Priya walked Marcus through what a complete tool contract looked like.

The purpose stated what the tool did in plain language. Send an SMS with the app installation link to the driver's phone number. Simple, but written down so there was no ambiguity.

Required inputs listed every parameter the tool needed, with validation rules. Phone number is required and must be in a valid format. Message type is required and must be one of the allowed templates. Driver ID is required and must exist in the system. If the agent calls the tool with invalid inputs, the tool should reject the call with a clear error message, not fail silently or behave unpredictably.

The output schema defined exactly what the tool returned on success. A confirmation ID. A timestamp. A status field. The agent needed to know what it would get back so it could respond appropriately.

The timeout and retry policy specified how long the agent should wait and what to do if the tool didn't respond. For the SMS tool, Priya set a five-second timeout. If the API didn't respond in five seconds, the agent would say, "I'm having trouble sending that. Let me try again." One retry. If the retry also failed, offer to have someone call the driver back.

Idempotency strategy ensured the tool could be called multiple times without causing duplicate actions. Every SMS request carried a unique request ID. If the same request ID came through twice, the gateway would return the original result rather than sending a second text. This protected against retries creating duplicates.

Error taxonomy mapped every possible error to a user-facing response and a next action.

Marcus and Priya built the error taxonomy for the SMS tool together. Gateway timeout meant retry once, then escalate. An invalid phone number meant asking the driver to confirm the number. Rate limit exceeded meant apologizing and trying again in a moment. Carrier rejection meant the message couldn't be delivered to that number, so the agent should offer an alternative.

Every error the backend could throw required a spoken response to be planned before launch. Leaving any error unmapped meant the agent would improvise, and improvised error handling in voice was almost always wrong.
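Written down as data, the SMS contract might look something like this. It is a sketch under stated assumptions: the class names, error codes, action labels, phone-number pattern, and most spoken lines are hypothetical, and the driver ID check only tests for presence because a real existence check would need a backend lookup. The purpose, five-second timeout, single retry, and timeout wording come from the contract described above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorHandling:
    spoken_response: str
    next_action: str


@dataclass(frozen=True)
class ToolContract:
    purpose: str
    timeout_s: float
    max_retries: int
    allowed_message_types: tuple
    errors: dict  # error code -> ErrorHandling

    def validate(self, phone, message_type, driver_id):
        """Return a list of input problems; an empty list means proceed."""
        problems = []
        # Assumed format: "+" followed by 8 to 15 digits.
        if not re.fullmatch(r"\+\d{8,15}", phone or ""):
            problems.append("INVALID_PHONE_NUMBER")
        if message_type not in self.allowed_message_types:
            problems.append("INVALID_MESSAGE_TYPE")
        if not driver_id:  # presence only; existence needs a backend lookup
            problems.append("MISSING_DRIVER_ID")
        return problems

    def handle(self, code):
        """Look up planned handling; unmapped codes escalate (assumption)."""
        fallback = ErrorHandling(
            "Let me have someone call you back about that.", "escalate_to_human"
        )
        return self.errors.get(code, fallback)


SMS_CONTRACT = ToolContract(
    purpose="Send an SMS with the app installation link to the driver's phone number.",
    timeout_s=5.0,
    max_retries=1,
    allowed_message_types=("install_link",),
    errors={
        "GATEWAY_TIMEOUT": ErrorHandling(
            "I'm having trouble sending that. Let me try again.",
            "retry_once_then_escalate",
        ),
        "INVALID_PHONE_NUMBER": ErrorHandling(
            "Could you confirm your phone number for me?", "confirm_number"
        ),
        "RATE_LIMIT_EXCEEDED": ErrorHandling(
            "Sorry about that. Give me a moment to try again.", "retry_after_delay"
        ),
        "CARRIER_REJECTED": ErrorHandling(
            "I can't deliver a text to that number. Is there another way to reach you?",
            "offer_alternative",
        ),
    },
)
```

Because the contract is plain data, it can be versioned, reviewed, and checked for gaps before any conversation testing begins.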

Error taxonomy in practice

Priya showed Marcus a more complex example. A healthcare scheduling agent with tools that queried the EHR, checked availability, and booked appointments.

The error taxonomy had eight entries.

NO_AVAILABILITY indicated that the requested time slot wasn't available. The agent should respond, "I don't have that time available, but I can offer Thursday at 2pm or Friday at 10am. Would either of those work?"

PATIENT_NOT_FOUND meant the patient ID didn't match any record. The agent should ask for verification. "I'm not finding your record with that date of birth. Could you confirm it for me?"

SLOT_ALREADY_BOOKED meant someone else grabbed the slot between when the agent offered it and when the patient confirmed. The agent should apologize and offer the next available. "That slot just got taken. The next opening is thirty minutes later at 2:30. Does that work?"

EHR_TIMEOUT meant the health records system was slow or down. The agent should acknowledge the delay. "I'm having trouble reaching our scheduling system. Let me try that again." One retry, then offer a callback.

INVALID_INSURANCE meant the patient's coverage wasn't accepted. Transfer to a human who could discuss payment options.

PROVIDER_UNAVAILABLE meant the requested doctor wasn't seeing patients that day. Offer alternative providers or alternative dates.

BOOKING_CONFLICT meant the patient already had an appointment at an overlapping time. Mention the conflict and ask if they want to reschedule the existing appointment.

SYSTEM_MAINTENANCE meant the backend was intentionally down. Apologize, explain that scheduling is temporarily unavailable, and offer a callback.

Each error mapped to a specific spoken response and a next action. No improvisation required. The agent knew exactly what to say and what to do.
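A taxonomy like this is easy to check mechanically before launch. The sketch below maps the eight codes to next actions and reports any backend code that has no planned handling; the action labels and the fallback choice are hypothetical, and the spoken responses are the ones given in the entries above.

```python
# Error code -> next action for the eight entries described above.
# Action names are hypothetical labels, not part of a real scheduling API.
NEXT_ACTION = {
    "NO_AVAILABILITY": "offer_alternative_slots",
    "PATIENT_NOT_FOUND": "ask_to_verify_identity",
    "SLOT_ALREADY_BOOKED": "offer_next_available",
    "EHR_TIMEOUT": "retry_once_then_offer_callback",
    "INVALID_INSURANCE": "transfer_to_human",
    "PROVIDER_UNAVAILABLE": "offer_other_provider_or_date",
    "BOOKING_CONFLICT": "offer_to_reschedule_existing",
    "SYSTEM_MAINTENANCE": "offer_callback",
}


def unmapped_codes(backend_codes):
    """Return backend error codes that have no planned handling yet."""
    return sorted(set(backend_codes) - set(NEXT_ACTION))


def next_action(code):
    """Look up the planned action for an error code."""
    # Assumption: unknown codes go to a human rather than letting the
    # agent improvise.
    return NEXT_ACTION.get(code, "transfer_to_human")


# Example pre-launch check against a hypothetical list of backend codes.
print(unmapped_codes(["EHR_TIMEOUT", "PAYMENT_DECLINED"]))
```

Running the check whenever the backend adds an error code catches gaps before a caller hears the agent improvise.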

Idempotency is non-negotiable

Any tool that changed state needed idempotency protection. Booking an appointment. Processing a cancellation. Sending a notification. Updating a record.

Without idempotency, retries created duplicates. The driver got two texts. The patient got booked twice. The payment was processed twice. Each of these became a complaint, a support ticket, or a refund.

Priya explained the standard pattern. Generate a unique request ID for every action. Pass it with the tool call. The backend checks if it's seen that ID before. If yes, return the cached result. If not, process the request and cache the result.

For the SMS tool, Marcus added a request ID generated from the call session ID and the timestamp of the original request. If the agent retried, it sent the same request ID. The gateway recognized the duplicate and returned the original confirmation without sending a second message.

For booking tools, the request ID came from the patient ID, the requested slot, and the conversation ID. Same principle. Retry safety without duplicate bookings.

Nina added a corresponding instruction to the prompt. Once a tool confirms an action, mark it complete. Don't re-execute even if the conversation loops back to the same topic. Idempotency was enforced at both layers: the prompt and the backend.
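The backend half of that pattern fits in a few lines. This sketch uses an in-memory dictionary as the cache and a hypothetical SmsGateway class; a production gateway would persist request IDs and expire them, and the hashing scheme for the ID is an assumption.

```python
import hashlib


def request_id(*parts):
    """Derive a deterministic ID: the same parts always give the same ID,
    so a retry of the same action reuses it."""
    joined = "|".join(str(p) for p in parts)
    return hashlib.sha256(joined.encode()).hexdigest()[:16]


class SmsGateway:
    """Hypothetical gateway that deduplicates by request ID."""

    def __init__(self):
        self._results = {}  # request ID -> cached result
        self.sent_count = 0  # number of texts actually sent

    def send(self, req_id, phone, body):
        if req_id in self._results:
            # Seen before: return the original result, send nothing.
            return self._results[req_id]
        self.sent_count += 1
        result = {"confirmation_id": f"conf-{self.sent_count}", "status": "sent"}
        self._results[req_id] = result
        return result


gateway = SmsGateway()
rid = request_id("session-123", "2024-01-01T09:00:00Z")  # hypothetical values
first = gateway.send(rid, "+15555550100", "Install link")
retry = gateway.send(rid, "+15555550100", "Install link")
print(first == retry, gateway.sent_count)
```

The retry returns the original confirmation and the send counter stays at one, which is exactly the guarantee the driver cares about.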

Timeout budgets

A tool that took eight seconds was a broken tool for voice. Priya had explained the thousand-millisecond latency budget in the previous chapter. Tools ate into that budget.

Every tool needed its own timeout ceiling. For the SMS tool, five seconds. For the scheduling API, three seconds. For the EHR lookup, four seconds. Any tool that regularly exceeded its ceiling needed to be optimized or replaced.

Marcus audited his tools. The SMS gateway averaged 1.2 seconds, but sometimes spiked to six. The installation verification tool averaged 800 milliseconds. The driver lookup averaged 400 milliseconds.

The SMS gateway spikes were the problem. Priya worked with the vendor to identify the cause. A connection pooling issue on their side. Once fixed, the gateway stabilized at 1.5 seconds worst case.

For any tool that couldn't be made faster, Priya recommended async patterns. Start the tool call, give the driver a filler response ("Let me check on that for you"), and continue speaking while waiting for the result. This preserved the conversational flow even when tools were slow.
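In code, the async pattern amounts to starting the tool call, waiting briefly, and speaking the filler only if the result has not arrived. A sketch with asyncio follows; the threshold value, the say callback, and the stub names are hypothetical.

```python
import asyncio

FILLER = "Let me check on that for you."


async def respond_with_filler(tool_call, say, filler_after_s=0.7):
    """Start the tool call; if it is still running after filler_after_s
    seconds, speak a filler line, then wait for the tool result."""
    task = asyncio.create_task(tool_call)
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await say(FILLER)
    return await task


# Hypothetical stubs for demonstration.
spoken = []


async def say(text):
    spoken.append(text)  # a real agent would stream this to TTS


async def slow_lookup():
    await asyncio.sleep(0.05)
    return {"driver": "found"}


if __name__ == "__main__":
    print(asyncio.run(respond_with_filler(slow_lookup(), say, filler_after_s=0.01)))
    print(spoken)
```

Fast tools never trigger the filler, so the agent only sounds like it is "checking" when it actually is.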

Partial failure handling

Some operations involved multiple steps. Book the appointment, send the confirmation SMS, and update the patient portal. What happened when step two failed but steps one and three succeeded?

Priya called this a partial failure, and it required explicit handling in the tool contract.

The first option was atomic transactions. Either all steps succeeded or all steps rolled back. This was clean but often impractical. Different systems didn't always support coordinated rollback.

The second option was compensating actions. If step two failed, undo step one. Cancel the booking that was just made. This worked when the steps were reversible.

The third option was graceful degradation with notification. Accept that step one succeeded and step two failed. Inform the user of the partial state. "I've booked your appointment, but I wasn't able to send a confirmation text. You'll receive an email confirmation instead." Then queue a background job to retry the SMS or alert an operator.

Marcus chose the third option for most cases. Complete rollback was too complex. The driver would rather have the appointment booked with a failed text than no appointment at all. The agent acknowledged the partial success honestly and offered an alternative.
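The third option can be sketched as follows for the first two steps (booking and SMS). The function names, the retry queue, and the success wording are hypothetical; the partial-failure wording is the one quoted above.

```python
PARTIAL_MESSAGE = (
    "I've booked your appointment, but I wasn't able to send a confirmation "
    "text. You'll receive an email confirmation instead."
)
SUCCESS_MESSAGE = "I've booked your appointment and sent a confirmation text."


def book_with_confirmation(book, send_sms, retry_queue):
    """Book first, then try the SMS. An SMS failure keeps the booking,
    queues a background retry, and tells the caller the truth."""
    booking = book()  # step one: if this raises, nothing was booked
    try:
        send_sms(booking)  # step two
    except Exception as exc:
        retry_queue.append({"booking": booking, "error": str(exc)})
        return booking, PARTIAL_MESSAGE
    return booking, SUCCESS_MESSAGE


# Hypothetical stubs for demonstration.
def book():
    return {"appointment_id": "apt-1"}


def failing_sms(booking):
    raise TimeoutError("SMS gateway timed out")


queue = []
booking, message = book_with_confirmation(book, failing_sms, queue)
print(message)
print(queue)
```

The booking is never rolled back here, so the spoken message has to describe the partial state accurately, and the queue is what keeps the failed step from being forgotten.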

Testing tool contracts

Tool contracts needed testing before conversation testing. Priya insisted on this order. If the tools didn't work reliably, testing conversations was pointless.

For each tool, test the happy path. Valid inputs, successful response, correct output schema. Verify the agent could parse the response and use it appropriately.

Test every error in the taxonomy. Simulate gateway timeouts, invalid inputs, rate limits, and system maintenance. Verify the agent said the right thing for each error and took the right next action.

Test idempotency. Call the same tool twice with the same request ID. Verify no duplicate actions occurred.

Test timeout behavior. Simulate slow responses. Verify the agent handled them gracefully rather than hanging or hallucinating.

A staffing marketplace with tools that queried availability databases, updated candidate records, and triggered downstream notifications tested each integration independently first. Different tools had different latency profiles and failure modes. The availability query was fast but occasionally returned stale data. The record update was reliable but slow. The notification trigger was fast but had rate limits. Each needed its own test suite.

Only after tool contracts were verified did they move to conversation testing. At that point, they knew the tools worked. Any conversation failures were conversation problems, not tool problems. The separation made debugging faster.

What Priya built

The SMS incident had been embarrassing. A driver waited for a link that never arrived because Marcus hadn't thought through what to do when an API call failed.

But the incident led to something better. A tool contract template that every new integration followed. Purpose, inputs, outputs, timeouts, retries, idempotency, and error taxonomy. Nothing launched without every section filled in.

Priya built a contract registry. Every tool the agents used had its contract documented and versioned. When a backend team changed an API, they updated the contract. When a new error code was added, it got mapped to a spoken response before the change went live.

Marcus realized that the conversation was the UI and the tools were the product. He'd been treating tools as implementation details, things that happened behind the scenes while the real work happened in the prompt. Priya showed him they were the opposite. The tools were where outcomes happened. The conversation just made them accessible.

A voice agent that talked beautifully but couldn't reliably execute its tools was worthless. A voice agent with solid tool contracts could survive mediocre conversation design because at least it delivered results.

Both mattered. But if he had to choose where to invest engineering rigor, Priya had convinced him. Tool contracts first. Conversation polish second.

Chapter 17: Telephony Setup

What you'll learn: The telephony infrastructure decisions that determine whether your outbound calls get answered or flagged as spam.

Key takeaways:

Telephony is invisible when it works. When it breaks, it's catastrophic. Answer rates can drop from 45% to 10% in a week if numbers get flagged as spam.
Number reputation requires active management. Warm up new numbers gradually. Register with CNAM databases. Monitor for spam flags daily.
STIR/SHAKEN attestation is table stakes for outbound. Calls without proper attestation get flagged more aggressively by carriers and spam filters.
Plan for recording consent by jurisdiction. Some states require one-party consent, while others require all-party consent. International adds more complexity.
Capacity planning uses a formula. Peak concurrent calls equals hourly call volume times average handle time in seconds, divided by 3,600, plus a 20-30% buffer for spikes.
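The capacity formula in the last takeaway can be written out directly. In this sketch the function name and the example traffic figures are hypothetical, handle time is assumed to be in seconds, and the buffer is applied as a multiplier and rounded up.

```python
import math


def peak_concurrent_calls(calls_per_hour, avg_handle_time_s, buffer=0.25):
    """Hourly volume x average handle time (seconds) / 3600, plus a buffer.

    The result is rounded up, since capacity comes in whole call slots.
    """
    base = calls_per_hour * avg_handle_time_s / 3600
    return math.ceil(base * (1 + buffer))


# Hypothetical example: 600 calls per hour at 3 minutes each.
# Base load is 600 * 180 / 3600 = 30 concurrent calls; a 25% buffer
# gives 37.5, which rounds up to 38.
print(peak_concurrent_calls(600, 180))
```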

The outbound campaign started well. Five hundred calls per day to drivers who needed to install the app. Answer rates hovered around 45%. The agent converted well. Marcus watched the metrics climb.

By day three, something changed. Answer rates dropped to 30%. By day five, 18%. By the end of the week, barely one in ten drivers had picked up.

Marcus assumed the agent was the problem. Maybe drivers were sharing warnings about the calls. Maybe the timing was wrong. He adjusted the campaign, changed the calling windows, and tweaked the opening script.

Nothing helped.

Priya found the real problem. She checked the numbers against a carrier reputation database. All four numbers the campaign used had been flagged as likely spam. The major carriers were labeling the calls before they even rang. Drivers saw "Spam Risk" or "Scam Likely" on their screens and ignored the calls.

The agent was fine. The telephony was broken. And fixing it would take two weeks.

Telephony is invisible until it isn't

Most teams treat telephony as a checkbox. Provision some numbers, connect them to the platform, and start making calls. It works in testing. It works in a small pilot. Then it fails at scale for reasons unrelated to the agent itself.

Priya had seen this pattern repeatedly. Telephony is invisible when it works. When it breaks, it's catastrophic. A spam flag can kill a campaign overnight. A carrier outage can cause every call in a region to drop. A compliance violation can trigger fines or legal action. Number porting can take weeks when you expected it to take days.

The logistics company learned this the hard way. They'd focused on conversation design, prompt engineering, and tool contracts. All the visible work. They'd treated telephony as infrastructure to be handled by someone else. No one handled it well, and the campaign paid the price.

Number reputation

For outbound calling, number reputation determines whether the call is answered.

Carriers maintain databases of calling patterns. Numbers that make many short calls get flagged. Numbers that receive high volumes of complaints get flagged. Numbers with no established calling history get treated with suspicion.

The logistics company's numbers were new, had no history, and suddenly started making hundreds of calls per day. The pattern looked exactly like a robocall operation. The carriers responded accordingly.

Priya walked Marcus through the remediation. Register the numbers with carrier reputation services. CNAM registration to display a legitimate caller ID name. Analytics partnerships that presented the call patterns to carriers as legitimate business traffic rather than spam. It took relationships with vendors Marcus didn't know existed.

For new campaigns, she established a warm-up protocol. Start with low volume, maybe fifty calls per day. Gradually increase over two weeks. Build a calling history that looked like a legitimate business, because it was a legitimate business, before scaling to full volume.
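A warm-up protocol like this can be expressed as a daily call cap. The sketch below assumes a linear ramp, which is only one possible shape; the fifty-call start and two-week duration come from the protocol above, the 500-call target matches the original campaign's daily volume, and the function name is hypothetical.

```python
def warmup_schedule(start=50, target=500, days=14):
    """Daily call caps rising linearly from start to target over `days` days."""
    step = (target - start) / (days - 1)
    return [round(start + step * day) for day in range(days)]


schedule = warmup_schedule()
for day, cap in enumerate(schedule, start=1):
    print(f"day {day:2d}: up to {cap} calls per number pool")
```

Enforcing the cap in the dialer, rather than trusting campaign managers to remember it, keeps a launch deadline from overriding the ramp.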

She also set up ongoing monitoring. A reputation dashboard that tracked how each number was performing across different carriers. If a number started getting flagged, they'd catch it early and rotate to a fresh number before the whole campaign suffered.

STIR/SHAKEN and attestation

Priya explained the regulatory layer. STIR/SHAKEN was a framework that carriers used to verify that calls originated from the numbers they claimed. Calls with full attestation, meaning the carrier vouched that the caller had the right to use that number, got better treatment. Calls without attestation got flagged more aggressively.

The logistics company had been making calls without proper attestation. Their carrier wasn't passing the right signals. The calls looked unverified, which, to spam algorithms, meant they might be spoofed.

Fixing this required working with the carrier to ensure proper STIR/SHAKEN signing. It required using numbers that the carrier could fully attest. It required understanding that not all numbers from all carriers were equal, and that the cheapest option often came with the worst attestation.

For outbound campaigns at scale, STIR/SHAKEN attestation wasn't optional. It was table stakes for getting calls answered.
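STIR/SHAKEN defines three attestation levels: A (full), B (partial), and C (gateway). A simple pre-campaign check is to list every outbound number that would not receive full attestation. The sketch below assumes a hypothetical inventory mapping each number to the level its carrier reports; how that level is actually obtained depends on the carrier and is not shown.

```python
from enum import Enum
from typing import Dict, List, Optional


class Attestation(Enum):
    FULL = "A"     # carrier vouches for the caller's right to use the number
    PARTIAL = "B"  # carrier knows the customer but not the number's ownership
    GATEWAY = "C"  # carrier only knows where the call entered its network


def numbers_needing_attention(
    inventory: Dict[str, Optional[Attestation]]
) -> List[str]:
    """Return numbers with anything less than full attestation (None = unsigned)."""
    return sorted(
        number
        for number, level in inventory.items()
        if level is not Attestation.FULL
    )


# Hypothetical inventory using fictional 555-01xx numbers.
inventory = {
    "+12025550101": Attestation.FULL,
    "+12025550102": Attestation.PARTIAL,
    "+12025550103": None,
}
flagged = numbers_needing_attention(inventory)
```

Anything in `flagged` is a candidate for replacement or a conversation with the carrier before the campaign scales.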

Carrier selection

Not all carriers were the same. Priya had learned this across multiple deployments.

Quality varied. Some carriers had excellent voice clarity. Others introduced artifacts, delays, or compression that made the agent sound robotic even when the TTS was clean. A healthcare provider expanding internationally found that the same agent sounded crisp in the US but garbled in rural areas of certain countries because the local carrier's infrastructure couldn't support high-quality audio.

Reliability varied. Some carriers had redundant infrastructure and rarely dropped calls. Others had regional outages that could take down an entire campaign for hours.

Attestation varied. Some carriers provided full STIR/SHAKEN attestation on all numbers. Others provided partial attestation or none at all.

Cost varied, but the cheapest carrier often meant the worst quality and reputation. Saving a fraction of a cent per minute wasn't worth it if calls didn't connect or sounded terrible.

Priya recommended qualifying carriers the same way you'd qualify any other vendor. Test call quality with actual voice AI traffic. Check attestation levels. Review uptime history. Understand their spam mitigation practices. The telephony layer was too important to choose based solely on price.

Number provisioning and porting

Getting numbers was straightforward for greenfield deployments. Provision new numbers from the carrier, configure routing, and start calling or receiving calls.

Porting existing numbers was harder. If the company already had a support line that customers knew, that number needed to move to the new platform. Porting timelines ranged from a few days to several weeks, depending on the carriers involved. Enterprise numbers with complex configurations took longer.

Priya warned Marcus to start the porting process early. At least a month before the planned launch, if possible. Porting delays had pushed back launches more than once.

International provisioning added another layer. Each country had its own regulatory requirements for number ownership. Some required a local business presence. Some required specific documentation. Some had quotas or approval processes. A number that took two days to provision in the US might take three weeks in another country.

For the logistics company's expansion plans, Priya built a provisioning timeline by country. She identified which markets would require extra lead time and started the paperwork before anyone asked.

Contact center integration

Most enterprises already had contact center platforms. Genesys, Five9, NICE, Amazon Connect. The voice agent needed to work alongside these systems, not replace them entirely.

Integration patterns varied. Some platforms could route calls to the voice agent as a first line of defense, with overflow to human agents. Some could use the voice agent for specific intents while humans handled everything else. Some needed the voice agent to transfer calls back into the contact center queue with context attached.

Priya worked with the operations team to map the integration requirements. Inbound calls would first hit the existing IVR, then be routed to the voice agent for specific options. The voice agent could transfer back to the contact center with a warm handoff, passing context so the human agent knew what had already been discussed.

This required SIP trunk configuration, transfer protocols, and context passing mechanisms that worked with the specific contact center platform. It wasn't plug-and-play. Each platform had its own quirks and requirements.
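The context passed on a warm handoff can be thought of as a small structured record. The sketch below is one possible shape, with every field name an assumption. How the payload is attached to the transfer (SIP headers, a screen-pop API, a CRM note) depends on the contact center platform and is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffContext:
    """What the human agent should see when the transferred call arrives."""
    call_id: str
    intent: str                  # why the caller phoned
    summary: str                 # what has already been discussed
    transfer_reason: str         # why the voice agent is handing off
    collected: dict = field(default_factory=dict)  # details already gathered


def to_payload(ctx: HandoffContext) -> str:
    """Serialize the context as JSON for whatever transport the platform uses."""
    return json.dumps(asdict(ctx), sort_keys=True)


# Hypothetical handoff for a driver calling about a delayed installation.
ctx = HandoffContext(
    call_id="call-0001",
    intent="reschedule_installation",
    summary="Driver confirmed identity and asked to move the appointment.",
    transfer_reason="requested_human",
    collected={"preferred_day": "Thursday"},
)
payload = to_payload(ctx)
```

Keeping the record small and explicit makes it easier to map onto each platform's own context-passing mechanism.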

Marcus had assumed the voice agent would be standalone. Priya showed him that enterprise deployments almost always required integration with existing infrastructure. Planning for that integration from the start saved painful rework later.

Call recording and compliance

Recording calls seemed simple until compliance got involved.

In the US, some states required only one party to consent to recording. Others required all parties to consent. A company operating nationwide needed a compliance matrix that tracked which states required notification and what language was required.

Internationally, the rules varied even more. Some countries required explicit opt-in. Some prohibited recording entirely without specific consent mechanisms. Some required that recordings be stored within the country.

Priya built a compliance matrix for the logistics company. Every state they operated in, every country on the expansion roadmap. What disclosure language was required? Where could recordings be stored? How long could they be retained?

The agent's opening script included recording disclosure where required. "This call may be recorded for quality purposes." In two-party consent states, the agent explicitly asked for permission. "Do I have your permission to continue with the recording?" If the caller declined, the agent could continue without recording or transfer to a human.

Getting this wrong meant legal exposure. Getting it right meant building compliance into the conversation design from the start rather than retrofitting it after launch.

Capacity planning

The first time the logistics company ran a promotion that drove high call volume, calls started failing. Drivers heard busy signals. Some calls connected but dropped mid-conversation.

The platform wasn't configured for peak load. They'd provisioned enough concurrent call capacity for average, not peak, volume.

Priya taught Marcus how to calculate concurrency requirements.

Start with expected call volume. How many calls per day? Identify peak distribution. Most businesses see 60% of daily volume in a 4-hour window. Calculate average handle time. For the installation agent, calls averaged 3.5 minutes.

The math: peak calls per hour, multiplied by average handle time in hours, equals the number of concurrent call slots needed. Add a 30% buffer for spikes. That was the minimum concurrency to provision.

For the logistics company, 500 calls per day concentrated in a 4-hour morning window with a 3.5-minute average handle time meant they needed capacity for roughly 30 concurrent calls. They'd provisioned 15. During peak hours, half the calls couldn't connect.
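The formula is short enough to write as a function. The sketch below is a minimal illustration with a hypothetical name, `required_concurrency`. It assumes calls arrive evenly across the peak window, which real traffic rarely does.

```python
import math


def required_concurrency(
    calls_in_peak_window: float,
    window_hours: float,
    avg_handle_minutes: float,
    buffer: float = 0.30,
) -> int:
    """Concurrent slots = peak calls/hour x handle time in hours, plus a buffer."""
    peak_calls_per_hour = calls_in_peak_window / window_hours
    base_slots = peak_calls_per_hour * (avg_handle_minutes / 60)
    return math.ceil(base_slots * (1 + buffer))


# Inputs quoted in the text: 500 calls, 4-hour window, 3.5-minute handle time.
even_spread = required_concurrency(500, 4, 3.5)
print(even_spread)
```

With those inputs and an even spread, the formula gives about ten slots. A requirement near thirty implies arrivals bunched much more tightly inside the window than an even spread assumes, so the formula's result is better read as a floor. Measuring the busiest fifteen minutes and feeding that rate into the same calculation gives a more realistic number.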

Priya increased the provisioning and added monitoring. If concurrent usage exceeded 70% of capacity, alerts fired. If it exceeded 85%, they'd scale up before calls started failing.
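The two thresholds translate directly into a check that a monitoring job could run against live call counts. This is a sketch only: the function name and return values are hypothetical, and the source of `active_calls` depends on the platform.

```python
def capacity_status(
    active_calls: int,
    provisioned: int,
    alert_above: float = 0.70,
    scale_above: float = 0.85,
) -> str:
    """Classify current utilization against the alert and scale-up thresholds."""
    if provisioned <= 0:
        raise ValueError("provisioned capacity must be positive")
    utilization = active_calls / provisioned
    if utilization > scale_above:
        return "scale_up"  # add capacity before calls start failing
    if utilization > alert_above:
        return "alert"     # notify the on-call engineer
    return "ok"


# Hypothetical readings against 30 provisioned slots.
readings = [capacity_status(n, 30) for n in (20, 22, 26)]
```

Keeping the thresholds as parameters makes it easy to tighten them for campaigns where a dropped call is especially costly.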

Concurrency was how infrastructure demonstrated its ability to handle enterprise loads. Underprovisioning meant dropped calls during the moments that mattered most.

International considerations

The healthcare provider expanding internationally faced challenges that the logistics company hadn't faced.

Number provisioning required local compliance in each country. Some markets needed a local subsidiary to own numbers. Others required specific business registration documents. Lead times varied from days to months.

Carrier quality varied dramatically. The same agent that sounded clear and responsive in the US had noticeable latency in some regions, where the telephony infrastructure added round-trip delay. In rural areas, voice quality degraded to the point that speech-to-text accuracy suffered.

Priya's solution was to qualify carriers by region. Test actual call quality in each market before launch. Identify local carriers with better infrastructure. Accept that some markets might need different providers than others, even if it complicates operations.

Time zones affected campaign timing. Calling windows that worked in one country violated quiet hours regulations in another. The campaign logic needed per-country calling schedules.
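Per-country schedules come down to converting the current time into each market's local time before dialing. The sketch below uses placeholder markets, offsets, and hours, none of which are real regulatory values. It also uses fixed UTC offsets to stay self-contained; a production version would more likely use `zoneinfo.ZoneInfo` so daylight saving time is handled correctly.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical per-market windows: (UTC offset, earliest call, latest call).
CALLING_WINDOWS = {
    "market_a": (timezone(timedelta(hours=-5)), time(9, 0), time(20, 0)),
    "market_b": (timezone(timedelta(hours=9)), time(9, 0), time(18, 0)),
}


def may_call(market: str, now_utc: datetime) -> bool:
    """True if the market's local time falls inside its permitted window."""
    tz, earliest, latest = CALLING_WINDOWS[market]
    local_time = now_utc.astimezone(tz).time()
    return earliest <= local_time < latest


# 14:00 UTC is 09:00 in market_a and 23:00 in market_b.
now = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)
allowed = {market: may_call(market, now) for market in CALLING_WINDOWS}
```

An unconfigured market raises `KeyError` rather than defaulting to "allowed", so a new country cannot be dialed until someone has entered its schedule.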

International telephony wasn't just US telephony with different numbers. Each market had a different set of challenges.

What the spam incident taught them

Two weeks after the spam flags appeared, the logistics company's numbers were clean again. Reputation restored. Attestation configured. Answer rates climbed back to 40%, then 45%, then higher as the warm-up period established history.

Marcus never underestimated telephony again.

The agent could be perfect. The prompts could be flawless. The tools could be reliable. But if calls didn't connect, or if drivers saw "Spam Risk" and ignored the call, none of that mattered.

Telephony was the foundation. Invisible when it worked. Catastrophic when it didn't. The logistics company had learned to check the foundation before building the house.

Priya added telephony to the launch checklist. Number reputation verified. STIR/SHAKEN attestation confirmed. Carrier quality tested. Compliance matrix completed. Concurrency provisioned for peak plus buffer. Recording disclosures scripted.

Only after all of that passed did they start worrying about conversation design.

Chapter 18: Security and Compliance

What you'll learn: How to map the data flows that create compliance exposure and build security controls for voice-specific risks.

Key takeaways:

Voice agents touch more data categories and formats than any previous automation. Audio, transcripts, LLM context, tool calls, and logs each have different classification and compliance requirements.
Map the full data flow before launch. Audio capture through transcription through LLM context through tool calls through storage. Every node has different compliance requirements.
Recording consent varies by jurisdiction. Build a consent matrix that covers every state and country where you operate.
PII redaction strategy must cover both real-time (what the LLM sees) and at-rest (what logs store). Different data has different redaction requirements.
Compliance frameworks like SOC 2, HIPAA, and PCI-DSS each have specific requirements for voice data. Know which apply before you build.

Sandra found the problem two weeks before launch.

The insurance brokerage had built a Medicare qualification agent. It asked callers about their health coverage, eligibility, and care needs. Standard qualification questions that human agents ask every day. The voice agent asked the same questions, collected the same information, and stored transcripts in the same logging system the engineering team used for everything else.

Sandra was the compliance lead. She'd been brought in late to review the deployment. She looked at the data flow diagram and stopped at the transcript storage node.

"These transcripts contain health information. Once you store them, they're protected health information under HIPAA. Where are your access controls? Where's your audit logging? What's your retention policy?"

The engineering team didn't have answers. They'd built a voice agent. They hadn't built a compliance architecture.

The launch, planned for two weeks out, slipped by six weeks. They re-architected the storage layer, implemented role-based access controls, built an audit trail for every human who accessed a recording, and established retention and deletion policies that satisfied both HIPAA and their state's insurance regulations.

The agent worked fine. The compliance posture didn't exist. Sandra made sure it would exist before anything went live.

The data surface

Voice agents create a data surface most enterprise security teams haven't mapped.

Call recordings are audio files containing spoken PII. Names, dates of birth, account numbers, health conditions, and financial details. All captured in a format that's harder to search but just as sensitive as text.

Transcripts are searchable text versions of those recordings. Every piece of PII the caller spoke is now indexed and queryable.

LLM context windows contain the conversation history, the system prompt, and any retrieved documents. For the duration of the call, sensitive data sits in the model's working memory.

Tool call logs record the actions taken. Account lookups, appointment bookings, and payment processing. Each log entry contains identifiers that connect to customer records.

Storage systems hold all of this after the call ends. Audio files, transcripts, logs, and metadata. Some temporary, some permanent, depending on policies that may not exist yet.

Sandra mapped this flow for the insurance brokerage. Audio captured by the telephony provider, streamed to the speech-to-text service, transcribed, and passed to the LLM; tool calls logged by the platform; transcripts stored in the company's data warehouse. Five different systems, three different vendors, multiple compliance requirements at each step.

Most teams focused on the conversation design. Sandra focused on where the data went afterward.

Data classification

Not all data requires the same level of protection. Sandra classified the voice agent's data into three categories.

PII included names, phone numbers, addresses, dates of birth, and account identifiers. Standard personal information that appeared in almost every call. Required access controls and retention policies, but didn't trigger the strictest regulatory requirements.

PHI included health conditions, medications, treatment history, and insurance coverage details. The Medicare qualification agent collected all of this. HIPAA applied, which meant business associate agreements with every vendor who touched the data, audit logging for every access, and specific retention and destruction requirements.

PCI included payment card numbers, CVVs, and billing information. If the agent processed payments, PCI-DSS applied. Card data needed to be encrypted in transit and at rest, couldn't be logged in plaintext, and required a hardened environment for any system that touched it.

The qualification agent touched PII and PHI but not PCI. Different use cases would have different classifications. A payment processing agent would need to be PCI-compliant. A general support agent might only need PII protections.

Sandra's first deliverable was a classification matrix. Every data element the agent might collect, mapped to its classification level and the applicable compliance requirements.
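A classification matrix can live in code as two lookups: element to class, and class to requirements. The sketch below is illustrative. The element names are hypothetical, and the requirement lists simply restate the ones described above rather than any complete regulatory checklist.

```python
from enum import Enum


class DataClass(Enum):
    PII = "PII"
    PHI = "PHI"
    PCI = "PCI"


REQUIREMENTS = {
    DataClass.PII: ["access controls", "retention policy"],
    DataClass.PHI: ["business associate agreements", "audit logging",
                    "retention and destruction rules"],
    DataClass.PCI: ["encryption in transit and at rest",
                    "no plaintext logging", "hardened environment"],
}

# Hypothetical data elements a voice agent might collect.
CLASSIFICATION = {
    "name": DataClass.PII,
    "date_of_birth": DataClass.PII,
    "account_id": DataClass.PII,
    "health_condition": DataClass.PHI,
    "medication": DataClass.PHI,
    "card_number": DataClass.PCI,
    "cvv": DataClass.PCI,
}


def requirements_for(elements):
    """Map the elements an agent collects to the requirements they trigger."""
    needed = {}
    for element in elements:
        data_class = CLASSIFICATION[element]  # KeyError if never classified
        needed[data_class.value] = REQUIREMENTS[data_class]
    return needed


# A qualification agent that touches PII and PHI but not PCI.
qualification = requirements_for(["name", "date_of_birth", "medication"])
```

The lookup fails on any element that has not been classified, which forces the classification decision to happen before the agent starts collecting the data.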

Recording consent

The rules for recording calls varied by jurisdiction, and getting them wrong could expose the company to legal liability.

In the US, some states required only one party to consent. The company could record without telling the caller. Other states required all parties to consent. California, Florida, and several other states required the caller to be informed and agree before recording began.

For a company operating nationally, Sandra built a consent matrix. Every state, with its consent requirement and the disclosure language the agent would use. In one-party states, the agent could simply record. In two-party states, the agent's opening included disclosure and asked for permission.

Internationally, the requirements varied more. Some countries required explicit opt-in. Others prohibited recording entirely without specific consent mechanisms. The automotive marketplace operating across five countries needed country-specific consent flows, with some markets requiring written consent before any voice recording could occur.

Sandra's matrix was incorporated into the agent's configuration. Based on the caller's location, the agent knew whether to disclose, ask for consent, or proceed without mention. The logic was built into the conversation flow, not left to chance.
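In configuration terms, the matrix is a lookup from the caller's location to one of three behaviors. The sketch below is illustrative and not legal guidance. The California and Florida entries follow the text; the New York and Texas entries are assumptions added for contrast; and treating unknown locations under the strictest rule is a design choice of this sketch, not something the source specifies.

```python
from enum import Enum


class RecordingStep(Enum):
    PROCEED = "proceed_without_mention"
    DISCLOSE = "disclose"
    ASK = "disclose_and_ask"


# Illustrative entries only; a real matrix needs legal review per jurisdiction.
CONSENT_MATRIX = {
    "CA": RecordingStep.ASK,
    "FL": RecordingStep.ASK,
    "NY": RecordingStep.PROCEED,  # assumption for illustration
    "TX": RecordingStep.PROCEED,  # assumption for illustration
}

DISCLOSURE = "This call may be recorded for quality purposes."
PERMISSION = "Do I have your permission to continue with the recording?"


def opening_lines(region: str) -> list:
    """Return the recording-related lines the agent speaks for this region."""
    # Unknown regions fall back to the strictest behavior.
    step = CONSENT_MATRIX.get(region.upper(), RecordingStep.ASK)
    if step is RecordingStep.ASK:
        return [DISCLOSURE, PERMISSION]
    if step is RecordingStep.DISCLOSE:
        return [DISCLOSURE]
    return []


california = opening_lines("CA")
unknown = opening_lines("ZZ")
```

Defaulting to the strictest rule means a missing entry produces an extra question for the caller rather than an unconsented recording.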

PII redaction

Once data existed, it could be accessed inappropriately. Redaction reduced the risk by removing sensitive elements from stored records.

Sandra evaluated two approaches.

Real-time redaction processed the transcript as it was created, identifying and removing PII before storage. The caller said, "My social security number is 123-45-6789," and the stored transcript read "My social security number is [REDACTED]." The sensitive data never reached permanent storage.

This was cleaner from a compliance perspective but harder to implement. Redaction had to be accurate, which meant maintaining and tuning detection models. False negatives left PII in storage. False positives removed legitimate information that might be needed for dispute resolution or quality review.
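As a rough illustration of real-time redaction, the sketch below uses regular expressions for Social Security numbers and a Luhn checksum to cut down on false positives for card numbers. It assumes the speech-to-text output renders numbers as digits. Production systems typically rely on trained detection models, as noted above, so treat this as a toy version of the idea.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSNs and Luhn-valid card numbers before the text is stored."""
    text = SSN_PATTERN.sub("[REDACTED]", text)

    def replace_card(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED]" if luhn_valid(digits) else match.group()

    return CARD_PATTERN.sub(replace_card, text)


stored = redact("My social security number is 123-45-6789")
```

The Luhn check is one way to trade a few false negatives for fewer false positives: a sixteen-digit order number that fails the checksum is left in place.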

Post-processing redaction stored the full transcript first, then redacted it after a defined period. For thirty days, the complete transcript was available for quality review and dispute resolution. After thirty days, a batch process redacted sensitive fields and deleted the originals.

This was easier to implement, but meant sensitive data existed in storage for the retention window. Access controls during that window became critical.

Sandra chose a hybrid approach for the insurance brokerage. Credit card numbers were redacted in real time because they were never needed after the call. Health information was retained for thirty days for dispute resolution, then redacted. Names and account numbers were retained longer for business purposes but were access-controlled to specific roles.
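A hybrid policy like this can be expressed as a per-field retention table that a batch job consults. The sketch below is a minimal illustration with hypothetical field names. The longer retention for names and account numbers is not specified in the text, so those fields are marked as governed elsewhere rather than given an invented period.

```python
from datetime import datetime, timedelta, timezone

# How long each field stays unredacted. None = governed by a separate
# business retention schedule that is not specified here.
RETENTION = {
    "card_number": timedelta(0),       # redacted in real time
    "health_info": timedelta(days=30),  # kept for dispute resolution
    "name": None,
    "account_number": None,
}


def fields_due_for_redaction(stored_at: datetime, now: datetime) -> list:
    """Return the fields whose retention period has elapsed."""
    age = now - stored_at
    return sorted(
        name
        for name, keep_for in RETENTION.items()
        if keep_for is not None and age >= keep_for
    )


stored_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
after_ten_days = fields_due_for_redaction(stored_at, stored_at + timedelta(days=10))
after_a_month = fields_due_for_redaction(stored_at, stored_at + timedelta(days=31))
```

Writing the policy as data makes the decision explicit: adding a new field to the transcript schema without adding it to the table is easy to catch in review.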

The right strategy depended on the use case. The decision needed to be explicit, not defaulted.

Access controls

Not everyone who can access transcripts should access transcripts.

Sandra implemented role-based access controls for the insurance brokerage's voice data.

Agents who handled escalations could access transcripts for calls they personally received. They couldn't browse other agents' calls.

Quality reviewers could access a random sample of transcripts for coaching purposes. They could see call content but not export it in bulk.

Compliance auditors could access any transcript, but every access was logged with their identity, the timestamp, and the business justification they provided.

Engineering teams could access anonymized transcripts for debugging and improvement. They couldn't see caller identities or account numbers.

Nobody had unrestricted access. Every role had a purpose, and access was scoped to that purpose.

For audio recordings, the controls were tighter. Fewer people needed to hear the audio than read a transcript. Audio access was limited to compliance and quality teams, with additional audit logging.
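The role rules above can be sketched as a single deny-by-default check. The role and resource names are hypothetical. Applying the review-sample restriction to audio as well as transcripts is an assumption of this sketch, since the text does not say how quality reviewers' audio access was scoped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    role: str
    user_id: str
    resource: str                  # "transcript", "anonymized_transcript", "audio"
    handled_by: str = ""           # agent who received the call
    in_review_sample: bool = False
    justification: str = ""


def is_allowed(req: AccessRequest) -> bool:
    """Deny by default; each role is scoped to its stated purpose."""
    if req.role == "escalation_agent":
        # Only transcripts for calls the agent personally received.
        return req.resource == "transcript" and req.user_id == req.handled_by
    if req.role == "quality_reviewer":
        return req.resource in {"transcript", "audio"} and req.in_review_sample
    if req.role == "compliance_auditor":
        # Any record, but only with a stated business justification.
        return (req.resource in {"transcript", "audio"}
                and bool(req.justification.strip()))
    if req.role == "engineer":
        return req.resource == "anonymized_transcript"
    return False


own_call = is_allowed(AccessRequest("escalation_agent", "a1", "transcript", handled_by="a1"))
other_call = is_allowed(AccessRequest("escalation_agent", "a1", "transcript", handled_by="a2"))
```

Bulk-export limits and audit logging sit outside this check and would need their own enforcement.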

Audit logging

For regulated data, knowing who accessed what and when wasn't optional.

Sandra required audit logs that captured every access to voice data. The user's identity. The timestamp. The specific record accessed. The action taken. The business justification required by policy.
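Those five fields map naturally onto a structured log entry. The sketch below is one possible shape, with hypothetical names throughout. Where the entry is written (an append-only store, a SIEM) is out of scope.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AccessEvent:
    user_id: str
    timestamp: str
    record_id: str
    action: str
    justification: str


def record_access(
    user_id: str,
    record_id: str,
    action: str,
    justification: str,
    now: Optional[datetime] = None,
) -> str:
    """Build one JSON audit entry; refuse to log access without a justification."""
    if not justification.strip():
        raise ValueError("a business justification is required by policy")
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    event = AccessEvent(user_id, stamp, record_id, action, justification)
    return json.dumps(asdict(event), sort_keys=True)


entry = record_access(
    user_id="auditor-7",
    record_id="transcript-0001",
    action="read",
    justification="quarterly access review",
    now=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
)
```

Rejecting an empty justification at write time keeps the policy from degrading into an optional free-text field.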

For tool actions, the logs captured more. What action did the agent take? What caller was it acting on behalf of? What was the outcome? If the agent booked an appointment, the log showed which caller, which appointment, whether it succeeded, and what confirmation was returned.

These logs served multiple purposes. Security teams could investigate suspicious access patterns. Compliance teams could demonstrate controls during audits. Legal teams could reconstruct what happened during a disputed call.

The insurance brokerage's original architecture had logging, but it was engineering logging. Errors, performance metrics, and debugging information. Sandra required compliance logging layered on top. Different retention periods. Different access controls. Different query capabilities.

Compliance frameworks

Different industries required different certifications.

SOC 2 applied to most enterprise software deployments. It required demonstrating security controls, access management, availability, and confidentiality. Voice agents needed SOC 2 compliance from both the deploying company and the platform vendor.

HIPAA applied when protected health information was involved. Healthcare providers, insurers, and anyone handling health data needed business associate agreements with every vendor in the data flow. The insurance brokerage needed BAAs with its telephony provider, speech-to-text vendor, LLM provider, and storage systems.

PCI-DSS applied when payment card data was involved. If the voice agent processed payments, the entire call path needed to meet PCI requirements. This often meant isolating payment handling into a separate, hardened flow rather than processing cards in the main agent.

Sandra's compliance checklist for the insurance brokerage had forty-three items. BAAs with six vendors. Access control implementations for four data stores. Audit logging configurations for three systems. Retention policies documented and enforced. Encryption verified in transit and at rest.

The agent itself was a small part of the compliance surface. The infrastructure around it was the majority of the work.

Vendor assessment

The platform vendor mattered as much as the internal architecture.

Sandra evaluated what data the voice platform stored, where it was stored, and for how long. Did the platform retain transcripts? Did it retain audio? Who at the vendor could access customer data? What certifications did the vendor hold?

Some platforms stored everything by default. Transcripts persisted indefinitely in the vendor's systems. Audio recordings lived in vendor storage. This created compliance exposure that many customers didn't realize until an audit.

Other platforms offered configurable retention, data residency options, and bring-your-own-storage models. Sandra required these for the insurance brokerage. Transcripts stored in the company's own infrastructure, not the vendor's. Audio retained only as long as the policy required, then deleted automatically.

The vendor security questionnaire became part of Sandra's standard process. Before any new voice deployment, the platform vendor answered questions about data handling, access controls, certifications, and incident response. Vendors who couldn't answer satisfactorily didn't get selected.

International complexity

The automotive marketplace operating across five countries faced compliance complexity that domestic deployments didn't.

GDPR applied in European markets, requiring consent mechanisms, data subject access requests, and right-to-erasure capabilities. A caller could request a copy of all data the company held about them, including voice recordings and transcripts.

Data residency requirements meant some countries required data to stay within their borders. Transcripts from calls in Brazil couldn't be stored on servers in the US. The company needed regional storage infrastructure.

Recording consent varied by country. Some required explicit opt-in. Some required specific disclosure language. Some had restrictions on how long recordings could be retained.

The marketplace built a compliance matrix by country. Consent requirements, storage locations, retention limits, and access request procedures. Each market had its own configuration. The agent's behavior adapted based on where the caller was located.
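A per-market configuration can be as simple as one record per country that storage and retention code must look up before writing anything. The sketch below uses placeholder values throughout. The region names, retention periods, and consent labels are hypothetical and are not statements about any country's actual law.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketPolicy:
    consent: str          # placeholder label for the consent flow to run
    storage_region: str   # where transcripts and audio must be stored
    retention_days: int   # placeholder retention limit


# Hypothetical entries; real values come from legal review per market.
MARKETS = {
    "BR": MarketPolicy("explicit_opt_in", "in-country-br", 30),
    "DE": MarketPolicy("explicit_opt_in", "eu-region", 30),
}


def policy_for(country_code: str) -> MarketPolicy:
    """Look up a market's policy; refuse to guess for unconfigured markets."""
    try:
        return MARKETS[country_code.upper()]
    except KeyError:
        raise LookupError(
            f"no compliance policy configured for {country_code!r}"
        ) from None


brazil = policy_for("br")
```

Failing on an unconfigured market, rather than falling back to a default region, keeps data from a new country from landing somewhere it is not allowed to be stored.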

This was expensive and complicated. It was also non-negotiable for operating legally across jurisdictions.

What Sandra built

The six-week delay had been painful. The launch pushed back, the engineering team frustrated, the business team anxious.

But when the Medicare qualification agent finally launched, it launched correctly. Access controls in place. Audit logging active. Retention policies enforced. BAAs signed with every vendor. Consent flows configured by jurisdiction.

Sandra built a compliance framework that applied to every subsequent voice agent the brokerage deployed. A checklist that engineering teams followed from the start, not two weeks before launch. Data flow mapping became a required artifact. Classification decisions were documented before development began.

The voice agent touched more data categories and formats than any previous automation the company had deployed. Sandra made sure they understood what that meant before they learned it the hard way.

Her final deliverable was a launch gate. No voice agent was deployed to production without compliance sign-off. The gate added a week to every project timeline. It prevented the kind of six-week emergency that had nearly derailed the first deployment.

The agent could be brilliant. The compliance posture had to be solid. Sandra made sure they never forgot which one actually mattered for staying in business.