Vapi raises $50M Series B to power the next generation of enterprise voice AI

Part 6

Launch

~28 min • 3 chapters

Chapter 22: Launch Planning

What you'll learn: How to move from successful pilot to production deployment without catastrophic failures.

Key takeaways:

Launch is not a single event. It's a graduated ramp with gates at every stage. Internal dogfooding, single location, limited expansion, then gradual percentage increase.
Each ramp stage needs explicit gate criteria. Completion rate, error rate, satisfaction scores, and operational stability must meet thresholds before advancing.
Kill switches must be granular. Per-location, per-queue, or per-use-case controls let you disable problem areas without shutting down everything.
Practice the rollback before launch. An untested rollback takes hours under pressure. A rehearsed rollback takes minutes.
Communicate the deployment plan to affected staff before launch. Surprised employees become resistant employees.

The pilot had succeeded at one location. The scheduling agent booked appointments accurately, patients reported positive experiences, and the no-show rate matched human-booked appointments. Leadership wanted to move fast.

The healthcare provider launched to all twelve locations simultaneously.

Within 48 hours, the operations team discovered that three locations had different EHR configurations. A required field that existed in the pilot location didn't exist in these three. The booking tool received the data, returned a success message, but the appointment never actually appeared in the EHR. The agent confirmed appointments that were never created.

They didn't have a kill switch per location. The only way to stop the agent was to shut down the entire deployment. The choice was to let phantom appointments continue accumulating or take down a working system at nine locations to fix three.

They shut it down. Six hundred patients had appointments that didn't exist. The rollback took four hours. The patient outreach took a week. The internal credibility damage took months to repair.

Grace, the engineering director who inherited the cleanup, rebuilt the launch process from scratch. Never again would a deployment go from zero to one hundred without stops along the way.

One location at a time

Grace designed a launch sequence with gates at every stage.

Internal dogfooding came first. The engineering and product teams used the agent to book their own appointments. They caught issues that testing missed because they used it like real patients, not like testers following scripts. A week of internal use before any external exposure.

A single location came next. One clinic, chosen for its standard EHR configuration and cooperative staff. Full production traffic, full monitoring, real patients. Two weeks at a single location with daily review of every metric.

Limited expansion added two more locations. Still small enough to monitor closely, diverse enough to surface configuration differences. Another two weeks with the expanded set.

Gradual ramp increased coverage in stages. 25% of locations, then 50%, then 75%, then 100%. Each stage lasted at least a week. Each stage required gate criteria to be met before advancing.

The full rollout that had taken two days now took eight weeks. But eight weeks without incident was better than two days followed by six months of credibility repair.
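As a rough illustration of the ramp arithmetic, the sketch below turns the percentage stages into location counts for a twelve-location deployment. The function name `ramp_schedule` and the round-up rule for partial locations are assumptions for illustration, not part of Grace's documented process.

```python
def ramp_schedule(total_locations: int, percentages: list[int]) -> list[int]:
    """Locations live at each ramp stage, rounding partial locations up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return [-(-total_locations * pct // 100) for pct in percentages]


print(ramp_schedule(12, [25, 50, 75, 100]))  # prints [3, 6, 9, 12]
```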

What has to be true before you advance

Each stage had explicit criteria for advancement.

Completion rate had to exceed 85% of the pilot baseline. A drop suggested something was wrong at the new locations.

The error rate had to stay below 0.5% for critical errors. More than five critical errors per thousand interactions triggered an investigation before advancing.

Patient feedback had to maintain satisfaction scores within two points of pilot levels. Patients at new locations should have comparable experiences.

Operational stability meant no unplanned incidents requiring rollback. Any incident paused the ramp until the root cause was identified and fixed.

Grace wrote the criteria before launch and posted them for the team to see. Advancing to the next stage wasn't a judgment call. Either the numbers met the criteria, or they didn't.

When the second expansion stage showed a drop in completion rate at one location, they didn't advance. They investigated and found a timezone configuration issue causing appointment times to display incorrectly. Fixed it, monitored for a week, confirmed the completion rate recovered, then advanced.
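The four criteria above are mechanical enough to write down as code. The following is a minimal sketch, assuming metrics are already aggregated per stage; the `StageMetrics` fields and the `gate_failures` name are hypothetical, and the thresholds are the ones stated in this chapter.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    completion_rate: float      # fraction of calls completed, 0..1
    critical_error_rate: float  # critical errors per interaction, 0..1
    satisfaction: float         # satisfaction score, same scale as the pilot
    unplanned_rollbacks: int    # incidents that required a rollback


def gate_failures(stage: StageMetrics, pilot: StageMetrics) -> list[str]:
    """Return the failed criteria; an empty list means the stage may advance."""
    failures = []
    # Completion rate must exceed 85% of the pilot baseline.
    if stage.completion_rate <= 0.85 * pilot.completion_rate:
        failures.append("completion_rate")
    # Critical error rate must stay below 0.5%.
    if stage.critical_error_rate >= 0.005:
        failures.append("critical_error_rate")
    # Satisfaction must stay within two points of the pilot level.
    if pilot.satisfaction - stage.satisfaction > 2:
        failures.append("satisfaction")
    # Any unplanned rollback pauses the ramp.
    if stage.unplanned_rollbacks > 0:
        failures.append("operational_stability")
    return failures
```

Returning the list of failed criteria, rather than a single boolean, tells the team where to start investigating when a stage does not advance.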

Kill switches

The original deployment had one switch: on or off for everything. Grace built granular controls.

Per-location switches let them disable the agent at specific clinics without affecting others. If three locations had EHR issues, they could disable those three while nine continued operating.

Per-queue switches let them disable specific call types. If the agent handled scheduling and prescription refills, they could disable refills without affecting scheduling.

Per-feature switches let them disable specific capabilities. If a new confirmation flow caused problems, they could revert to the previous flow without rolling back the entire agent.

Each switch had a clear owner and a documented activation procedure. Anyone on the operations team could flip a location switch. Queue and feature switches required engineering approval. The escalation path was defined before it was needed.

Grace tested the switches monthly. A kill switch that had never been tested might not work when needed. They practiced disabling and re-enabling locations during low-traffic windows. The muscle memory mattered.
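One way to picture the three switch scopes and their owners is a small in-memory model. This is a sketch under stated assumptions: the `KillSwitches` class, the scope names, and the role names are hypothetical, and "engineering approval" is simplified here to a role check.

```python
# Hypothetical permission model: which roles may flip each switch scope.
ALLOWED_ROLES = {
    "location": {"operations", "engineering"},
    "queue": {"engineering"},
    "feature": {"engineering"},
}


class KillSwitches:
    """In-memory switch state; a real deployment would persist and audit this."""

    def __init__(self) -> None:
        self._disabled = {scope: set() for scope in ALLOWED_ROLES}

    def set_disabled(self, scope: str, name: str, disabled: bool, role: str) -> None:
        if role not in ALLOWED_ROLES[scope]:
            raise PermissionError(f"{role} may not change {scope} switches")
        if disabled:
            self._disabled[scope].add(name)
        else:
            self._disabled[scope].discard(name)

    def agent_handles(self, location: str, queue: str) -> bool:
        """True if the agent should take this call; otherwise route to humans."""
        return (location not in self._disabled["location"]
                and queue not in self._disabled["queue"])

    def feature_enabled(self, feature: str) -> bool:
        return feature not in self._disabled["feature"]
```

With this shape, disabling one clinic leaves every other location and queue untouched, which is the property the original deployment lacked.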

Rollback planning

The original rollback had taken four hours because no one had practiced it.

Grace built a rollback plan and rehearsed it.

The plan specified exactly what steps to take. Disable the agent through the kill switch. Route calls to the backup queue. Notify the contact center team. Update the status page. The steps were documented in a runbook that anyone on call could follow.

The rehearsal happened before launch. During a maintenance window, the team executed the full rollback procedure as if an incident had occurred. They timed it. Identified three steps that took longer than expected. Optimized those steps. Rehearsed again.

The rehearsed rollback took eighteen minutes. The unrehearsed rollback had taken four hours. The difference was preparation.

Grace also defined rollback triggers. What conditions would warrant an immediate rollback rather than investigation first? A critical error rate exceeding 2% triggered an immediate rollback. A completion rate below 70% triggered an investigation led by someone with the authority to roll back. Patient complaints about phantom appointments triggered immediate rollback at affected locations.

The triggers were written down before launch. No one had to decide at the moment whether things were bad enough to roll back. The criteria made the decision.
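Written as code, the triggers leave no room for debate in the moment. The sketch below assumes the metrics are evaluated per location; the `Action` enum and `rollback_decision` function are hypothetical names, and the thresholds come from the triggers described above.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    INVESTIGATE = "investigate_with_rollback_authority"
    ROLLBACK = "rollback"


def rollback_decision(critical_error_rate: float,
                      completion_rate: float,
                      phantom_complaints: int) -> Action:
    """Decide the response for one location from pre-agreed triggers."""
    # Immediate rollback: critical errors above 2% or any phantom appointment.
    if critical_error_rate > 0.02 or phantom_complaints > 0:
        return Action.ROLLBACK
    # Completion below 70%: investigate, with authority to roll back.
    if completion_rate < 0.70:
        return Action.INVESTIGATE
    return Action.CONTINUE
```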

Who gets told what

The original launch caught the contact center staff by surprise. Calls suddenly routed differently, and no one had told them why.

Grace built a communication sequence.

Two weeks before launch, the contact center managers received a briefing. What the agent would do, how it would affect call routing, and what to expect from patient feedback. Managers needed time to prepare their teams.

One week before launch, frontline staff received training. How to handle patients who mentioned the automated system. How to escalate if the agent made errors. What the rollback procedure looked like from their end.

On launch day, everyone received a status update: the rollout was beginning, what to watch for, and who to contact with issues.

Daily during the ramp, the team sent brief status updates. Metrics summary, any incidents, next stage timeline. No surprises.

Grace learned that communication failures caused as much damage as technical failures. Staff who felt blindsided became resistant. Staff who felt informed became allies who surfaced issues early.

Day-one monitoring

The first 24 hours required dedicated attention.

Grace assigned specific people to watch specific metrics. One engineer watched system health: latency, error rates, and uptime. One analyst watched business metrics: completion rates, transfer rates, and patient feedback. One operations lead watched the support queue for emerging issues.

They defined escalation triggers. Latency above 1.5 seconds for more than five minutes paged the engineer. A 10-point drop in completion rate triggered a war room. Patient complaints about incorrect appointments triggered an immediate check of the booking tool.
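The latency trigger is a sustained-breach condition: a single slow sample should not page anyone, but five minutes of slow samples should. A minimal sketch follows, assuming timestamps in seconds and one latency sample per observation; `SustainedThreshold` and `needs_war_room` are hypothetical names.

```python
from typing import Optional


class SustainedThreshold:
    """Fires when a value stays above `limit` for longer than `hold_seconds`."""

    def __init__(self, limit: float, hold_seconds: float) -> None:
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._breach_start: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True if the page should fire."""
        if value <= self.limit:
            self._breach_start = None  # recovery resets the clock
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start > self.hold_seconds


def needs_war_room(baseline_pct: float, current_pct: float,
                   drop_points: float = 10.0) -> bool:
    """True when completion rate has dropped by `drop_points` or more."""
    return baseline_pct - current_pct >= drop_points


latency_page = SustainedThreshold(limit=1.5, hold_seconds=300)
```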

The team stayed in a shared channel for the first 24 hours. Real-time communication, quick decisions. After 24 hours without incident, they shifted to regular monitoring cadence.

Grace also scheduled a 24-hour retrospective. What had they seen? What had surprised them? What needed adjustment before the next stage of the ramp? The retrospective captured learning while it was fresh.

The second deployment took eight weeks instead of two days. It reached all twelve locations without a single phantom appointment.

Grace kept a photo on her desk from the first deployment. Six hundred sticky notes on a whiteboard, each representing a patient who needed to be called about an appointment that didn't exist. The photo reminded her why the process mattered.

Eight weeks felt slow when leadership wanted to move fast. But eight weeks without the sticky notes was faster than two days followed by six months of repair.

Chapter 23: Enterprise Readiness

What you'll learn: The seven-domain checklist that separates working prototypes from production-ready enterprise deployments.

Key takeaways:

Enterprise readiness spans seven domains. Scope and safety, tooling and integrations, voice UX reliability, security and compliance, observability and operability, quality gates, and economics.
The most commonly missed items are error taxonomy completeness, rollback plan testing, tool call failure rate monitoring, and transcription confidence alerting.
Every checklist item exists because a real deployment failed without it. The checklist is compiled from production incidents.
Enterprise readiness is a recurring gate, not a one-time event. Every new agent configuration and major prompt change should go through the review.
The review should involve engineering, ops, compliance, and business stakeholders together.

The staffing marketplace was ready to scale. The screening agent had proven itself in a pilot, handling hundreds of calls per day with strong completion rates. Leadership approved expanding to tens of thousands of calls per day.

Tomás, the QA lead who'd built the test pyramid, suggested they run through an enterprise readiness checklist before scaling. The product team pushed back. The agent worked. The pilot had succeeded. Why slow down for a checklist?

Tomás insisted. They walked through each domain together.

The review surfaced 23 items they hadn't addressed. Their error taxonomy covered 6 of the 14 error codes the backend could return. Their monitoring dashboard tracked completion rates but not tool call failure rates. They had no alerting for drops in transcription confidence. Their rollback plan existed on paper but had never been tested.

Fixing these took three weeks. Those three weeks prevented the class of production incidents that kill enterprise programs at scale. The checklist had earned its place.

Seven domains

Enterprise readiness spans seven domains. Each contains items that seem obvious until you realize you haven't done them.

Scope and safety cover what the agent does and doesn't do. Are all supported intents tested? Are scope boundaries implemented with explicit fallbacks? Does the agent use bounded retries rather than infinite loops? Does it enforce tool-first truth, never claiming success before tools confirm?

Tooling and integrations cover how the agent interacts with backend systems. Are tool contracts documented with all required fields? Are timeouts defined and tested? Is idempotency implemented for state-changing actions? Is the error taxonomy complete, with every possible error code mapped to a spoken response?
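The tool-first truth and idempotency questions can be made concrete with a read-back check. The sketch below is illustrative only: `create` and `fetch` stand in for whatever booking and lookup calls a backend exposes, and the idempotency key only helps if that backend honors it.

```python
import uuid
from typing import Callable, Optional


def book_and_verify(
    create: Callable[[dict, str], Optional[str]],
    fetch: Callable[[str], Optional[dict]],
    request: dict,
    idempotency_key: Optional[str] = None,
) -> tuple[bool, str]:
    """Return (confirmed, key); the agent may claim success only if confirmed."""
    # Reuse the caller's key on retries so a repeated call cannot double-book.
    key = idempotency_key or str(uuid.uuid4())
    appointment_id = create(request, key)
    # Tool-first truth: read the record back before claiming success.
    if appointment_id is None or fetch(appointment_id) is None:
        return False, key
    return True, key
```

A success message from the create call alone is not treated as confirmation; the record has to be readable afterwards. That is the check that would have caught a booking that returned success but never appeared in the system of record.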

Voice UX reliability covers how the agent handles the messy reality of phone conversations. Does it handle interruptions gracefully? Is there a confirmation policy for high-stakes information? Are disambiguation rules defined for ambiguous inputs?

Security and compliance cover data handling, access controls, and regulatory requirements. This is detailed enough to deserve its own chapter, which Sandra provided in Chapter 18.

Observability and operability cover whether you can see what's happening and fix it when things go wrong. Is logging structured and searchable? Are dashboards built for each stakeholder? Are alerts defined with clear thresholds and owners? Can you replay a call from the logs to understand what happened? Is the rollback plan documented and tested?

Quality gates cover the testing that must be passed before deployment. Does the regression suite run on every change? Are noisy audio tests included? Have load tests validated the expected call volume? Is human review built into the ongoing process, not just launch preparation? Did pilot metrics meet defined thresholds?

Economics and controls cover whether the business case holds at scale. Is the cost per call calculated accurately? Are guardrails in place to prevent runaway costs? Is the business case tied to measurable KPIs that will be tracked post-launch?

Tomás created a checklist with 5 to 8 items per domain. Every item had a pass/fail criterion. No judgment calls.

The commonly missed items

Some items appeared on almost every team's gap list.

Error taxonomy completeness was missed because teams built the taxonomy during development when they were focused on happy paths. They added error handling for the errors they encountered. They didn't audit the backend to find errors they hadn't encountered yet.

Tomás added a requirement: before launch, get the complete list of error codes from every backend system. Map each one. The staffing marketplace discovered eight error codes that had never fired during development. When they eventually fired in production, the agent had responses ready.
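The audit itself is a set difference between what the backend can return and what the taxonomy maps. A minimal sketch, with made-up error codes and spoken responses purely for illustration:

```python
def unmapped_error_codes(backend_codes: set[str], taxonomy: dict[str, str]) -> list[str]:
    """Error codes the backend can return that have no spoken response yet."""
    return sorted(backend_codes - set(taxonomy))


# Hypothetical codes and responses; a real list comes from the backend owners.
backend_codes = {"SLOT_TAKEN", "INVALID_PHONE", "TIMEOUT", "RATE_LIMITED"}
taxonomy = {
    "SLOT_TAKEN": "That time was just taken. Can I offer another?",
    "TIMEOUT": "I'm having trouble reaching the system. One moment.",
}

print(unmapped_error_codes(backend_codes, taxonomy))
# prints ['INVALID_PHONE', 'RATE_LIMITED']
```

Run as a pre-launch check, a non-empty result is a failed checklist item rather than a surprise in production.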

Rollback plan testing was missed because teams wrote plans but didn't practice them. A plan that hadn't been tested might not work under pressure.

Tomás required a rollback drill before every major launch. Execute the full procedure during a maintenance window. Time it. Identify bottlenecks. Fix them. A tested rollback took fifteen minutes. An untested rollback took hours and sometimes failed entirely.

Tool-call failure-rate monitoring was missed because teams focused on conversation metrics: completion rate, transfer rate, and handle time. Tool failures got logged but not dashboarded or alerted on.

Tomás added the tool call success rate to the standard monitoring package. When the SMS gateway started failing at 3am, the alert fired immediately instead of waiting for morning QA review.
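A tool-call failure alert can be sketched as a sliding window over recent calls. The class below is an assumption-laden illustration: the window size and minimum sample count are arbitrary defaults, and the 1% threshold mirrors the alert example used later in this chapter.

```python
from collections import deque


class ToolFailureMonitor:
    """Tracks the failure rate over the most recent tool calls."""

    def __init__(self, window: int = 1000, threshold: float = 0.01,
                 min_calls: int = 100) -> None:
        self._results = deque(maxlen=window)  # oldest results fall off the left
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, ok: bool) -> bool:
        """Record one tool call; return True if an alert should fire."""
        self._results.append(ok)
        # Avoid alerting on a handful of calls right after a quiet period.
        if len(self._results) < self.min_calls:
            return False
        failures = sum(1 for result in self._results if not result)
        return failures / len(self._results) > self.threshold
```

Because the check runs on every recorded call, a gateway that starts failing overnight trips the alert as soon as the window crosses the threshold, without waiting for a human review.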

Transcription confidence alerting was missed because transcription happened invisibly. The agent received text from the speech-to-text engine and worked with whatever it got. Low confidence led to higher error rates, but teams didn't track it.

Tomás added confidence score monitoring with alerts when the average confidence dropped below a threshold. When a provider update degraded accuracy for certain accents, the alert surfaced within hours.
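A single global average can hide a drop that affects only some callers, so one possible refinement is to average confidence per segment. The sketch below assumes each sample carries a segment label such as a locale; the labels, the 0.80 threshold, and the minimum sample count are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def low_confidence_segments(samples, threshold=0.80, min_samples=50):
    """Map each segment whose mean confidence is below threshold to that mean.

    `samples` is an iterable of (segment, confidence) pairs.
    """
    by_segment = defaultdict(list)
    for segment, confidence in samples:
        by_segment[segment].append(confidence)
    return {
        segment: fmean(values)
        for segment, values in by_segment.items()
        # Skip thin segments so a few bad calls do not trigger an alert.
        if len(values) >= min_samples and fmean(values) < threshold
    }
```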

The checklist in practice

Tomás organized the checklist as a structured review meeting. Engineering, operations, compliance, and business stakeholders attended together. They walked through each domain, each item, discussing the evidence for pass or fail.

Some items required documentation. The tool contracts needed to be written down, not just implied by the prompt. The rollback plan needed to be documented, not kept as tribal knowledge.

Some items required demonstration. Show me the alert that fires when the tool call error rate exceeds 1%. Show me the dashboard that tracks transcription confidence. Show me the rollback procedure executing in the test environment.

Some items required sign-off from specific roles. Compliance signed off on security and regulatory items. Operations signed off on alerting and runbook items. Business signed off on the economics items.

The meeting took half a day. Teams resisted the first time. After seeing what the review caught, they requested it for every launch.

Every time something changes

The automotive marketplace, operating at enterprise scale, treated the readiness review as a recurring practice rather than a one-time checklist.

Every new agent configuration was reviewed. A new language variant of an existing agent still needed its error taxonomy verified, its compliance requirements mapped, and its rollback procedure documented.

Every major prompt change went through a lighter version. Did the change affect any scope boundaries? Did it introduce new tool calls? Could it create new error paths? Changes that touched critical areas triggered a focused review of affected domains.

Version updates from vendors triggered reviews. A new speech-to-text model might affect baseline confidence in transcription. A new LLM version might change how the agent interprets prompts. Each required validation against the relevant checklist items.

The automotive marketplace ran these reviews weekly. With over a thousand agent configurations across five countries, something was always changing. The review became part of the operating rhythm, not a special event.

The 23 gaps found in the first review became 23 permanent checklist items. As the staffing marketplace launched more agents and encountered more failures, the checklist grew. Every production incident that could have been prevented became a new item for the next team.

The product team that had pushed back on the three-week delay began requesting reviews for their own launches. They'd seen what happened to teams that skipped it.

The checklist existed because real deployments had failed without it. Every item traced back to an incident somewhere. "It works on the happy path" was the beginning of the question, not the answer.

Chapter 24: Change Management

What you'll learn: How to handle the organizational resistance that technical excellence alone cannot overcome.

Key takeaways:

Voice AI changes roles, workflows, and power dynamics. Teams that treat deployment as a technology project rather than an organizational change will face resistance.
Middle manager resistance is rational. Their authority is often tied to headcount and queue management, exactly the levers voice AI disrupts.
Redefine roles before they disappear. "Call Center Supervisor" becomes "AI Experience Lead" with new responsibilities in prompt iteration, quality scoring, and performance optimization.
Change the metrics. Supervisors evaluated on AI agent performance adapt faster than those still measured on human call volume.
Communicate early and repeatedly. Announce the vision six months before the role change. Proactive communication prevents reactive panic.

The voice agent worked. The technology wasn't the problem.

The problem was Luis, a supervisor with fifteen years at the automotive marketplace. He managed forty call center agents across one floor. His authority came from his deep knowledge of the call queues, his relationships with his team, and his ability to hit headcount targets during hiring surges. Voice AI threatened all three.

When the deployment began, Luis became the center of resistance. He found reasons why the agents couldn't handle his team's calls. He escalated every error as evidence of fundamental flaws. He told his team the technology was unreliable, and they believed him because they trusted him more than they trusted the project leads.

The technology team tried to address Luis's concerns with better data. They showed him the metrics. Completion rates exceeded human performance. Cost per call was down. Customer satisfaction was stable. Luis nodded, said he understood, and continued undermining the rollout.

Elena, the program director, realized they were treating a people problem like a technology problem. The metrics were right. The change management was missing.

Understanding resistance

Elena spent a week talking to supervisors like Luis. She wanted to understand what they feared, not convince them they were wrong.

The pattern was consistent. Middle managers had built their careers on operational expertise. They knew which agents handled which call types best. They knew how to manage queues during peak hours. They knew how to hire, train, and retain good people. Voice AI made much of that expertise irrelevant.

Their authority was tied to headcount. A supervisor managing forty people had more organizational weight than one managing twenty. If voice agents handled half the calls, teams would shrink. Supervisors would lose direct reports, budget, and standing.

Their metrics were designed for the old model. Queue time, calls per hour, first-call resolution, and agent utilization. Voice AI changed what mattered, but no one had updated the metrics. Supervisors were still evaluated on metrics that the new model would make worse.

The resistance wasn't irrational. It was a rational response to incentives that hadn't changed, even though the technology had.

New roles, new metrics

Elena redesigned the rollout to address the underlying concerns.

She redefined what supervisors would manage. Not agents taking calls, but agents configuring AI. The team's job was shifting from call handling to AI training, quality review, and conversation optimization. Luis would manage people who made the voice agents better, not people who competed with them.

She worked with HR to update job descriptions, titles, and career paths. "Call Center Supervisor" became "AI Experience Lead." The role description emphasized prompt iteration, quality scoring, and performance optimization. New responsibilities came with training and time to develop new skills.

She changed the metrics. Supervisors would be evaluated on AI agent performance, not human agent call volume. Quality scores, completion rates, and continuous improvement velocity. The metrics that mattered for the new model became the metrics that mattered for performance reviews.

Luis was skeptical. But when Elena offered him a choice, lead the transformation or watch someone else lead it, he chose to lead. His deep knowledge of customer conversations became valuable in a new way. He knew which calls were hard, which edge cases mattered, and which phrasings worked. That knowledge made him an effective AI trainer once he stopped seeing AI as a replacement for him.

Training for transformation

The four floors of call center agents across multiple cities needed new skills.

Elena built a training program that started before anyone's role changed. Agents learned conversation design principles. They learned how prompts worked and how small changes affected agent behavior. They learned quality review methodologies and how to score calls against rubrics.

The training wasn't optional or an afterthought. It ran for three months alongside normal operations. Agents spent 20% of their time in training while still handling calls. By the time their roles shifted, they had the skills to succeed.

Some agents struggled with the transition. They'd been hired for phone skills, not analytical skills. Elena created two paths. Agents who took to the new work became AI trainers and quality reviewers. Agents who preferred customer interaction became the human escalation layer, handling the complex calls that AI couldn't resolve.

Both paths were positioned as valuable. The escalation role wasn't a demotion for people who couldn't learn the new skills. It was a specialized role for people whose human judgment was essential for difficult situations.

Timing matters

Elena learned that communication timing mattered as much as communication content.

The announcement came early. Six months before roles would change, everyone learned what was coming. The vision for AI-augmented customer service. The new roles that would exist. The training that would be provided. The timeline for transition.

Early communication prevented panic. When rumors about layoffs spread, leadership could point to the announcement. No one was being eliminated. Roles were transforming. The message had to be repeated many times before it sank in, but having announced it early gave time for repetition.

Updates came regularly. Monthly town halls showcased progress on the rollout, introduced early adopters thriving in new roles, and addressed concerns directly. The updates weren't spin. They acknowledged problems, explained how problems were being addressed, and celebrated genuine successes.

Individual conversations happened before role changes. Every supervisor met with their manager to discuss their specific transition. What would their new role look like? What training would they receive? What were their concerns? These conversations surfaced individual issues that town halls missed.

Measuring adoption

Elena tracked adoption as role transformation, not technology deployment.

The technology metrics were necessary but not sufficient. The agent was handling calls. Completion rates were good. But were the organization's roles, workflows, and metrics actually changing?

She tracked training completion. What percentage of affected staff had finished the new skills curriculum? Low completion meant the transformation wasn't taking hold.

She tracked quality review volume. How many call reviews were supervisors completing per week? If supervisors were supposed to be doing quality review but weren't, the role change existed on paper but not in practice.

She tracked prompt iteration velocity. How many prompt improvements originated from the transformed teams versus from the central engineering team? High velocity from transformed teams meant they'd internalized their new role. Low velocity meant they were still thinking of themselves as call handlers who'd been given strange new tasks.

She tracked voluntary attrition. Some people would leave rather than adapt. That was expected and acceptable. But if the best performers were leaving, something was wrong with the transition. Elena watched attrition by performance tier, not just overall.
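Breaking attrition out by performance tier is a simple grouped ratio. The sketch below assumes each staff record is a (tier, left) pair; the tier labels and the function name are hypothetical.

```python
from collections import Counter


def attrition_by_tier(staff):
    """Fraction of staff who left, per performance tier.

    `staff` is an iterable of (tier, has_left) pairs.
    """
    total = Counter()
    left = Counter()
    for tier, has_left in staff:
        total[tier] += 1
        if has_left:
            left[tier] += 1
    return {tier: left[tier] / total[tier] for tier in total}
```

An overall rate can look healthy while the top tier is leaving at several times the rate of everyone else, which is the signal this breakdown is meant to expose.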

Two years later

Two years after the transformation began, the automotive marketplace operated fundamentally differently.

The four floors of call center agents hadn't shrunk. They'd transformed. Former call handlers now managed AI configurations, reviewed call quality, and optimized conversation flows. They understood customer needs from years of direct interaction. They understood AI capabilities from months of training and practice. The combination made them more valuable than either skill set alone.

Luis managed a team of twelve AI experience builders. His floor still existed, but his team's work had changed completely. They spent their days reviewing calls, identifying improvement opportunities, testing prompt changes, and monitoring performance dashboards.

He still walked the floor every morning, the way he had for fifteen years. But now, when he stopped at someone's desk, the conversation was about transcription confidence scores and prompt adjustments, not queue times and call volumes.

His authority hadn't come from headcount after all. It came from knowing his people and knowing his customers. Both still mattered. The technology was just a different way to apply them.