Part 9

Scale

~30 min • 3 chapters

Chapter 30: Expanding Use Cases

What you'll learn: How to grow from one agent to a portfolio without fragmentation or duplicated effort.

Key takeaways:

Expansion follows three pathways. Deepening (adding intents to existing agents), broadening (building new agents), and extending (adding languages or channels).
Sequence expansion by leverage and risk. Deepening reuses the most infrastructure. Extending carries the highest brand risk.
The most common expansion failure is starving existing agent maintenance to fund new builds. Allocate capacity deliberately. 70% new build, 20% maintenance, 10% exploration.
Shared components create compounding returns. Identity verification, context lookup, and monitoring tooling should be built once and reused everywhere.
Portfolio governance becomes critical once the portfolio grows past three agents. Treat agents as a system with shared standards and operational practices.

The Install Reminder Agent had been running for eight months. Completion rates were stable. Operations were smooth. Leadership wanted more.

Marcus had three expansion candidates on the roadmap. The COI Verification Agent would automate the collection of certificates of insurance from drivers. An install verifier would confirm successful app installation days after the initial reminder. And Spanish-language support for the existing agent would serve the 30% of drivers who preferred Spanish.

Each represented a different kind of expansion. Marcus needed to understand which to pursue first and how to allocate resources across them.

Nina helped him see the pattern. Expansion followed three distinct pathways, each with different investment requirements and risk profiles.

Three pathways

Deepening meant adding intents to an existing agent. The install verifier would become a new intent within the Install Reminder Agent. Same infrastructure, same telephony, same operational practices. The agent would handle "remind and install" calls and also "verify installation" calls.

Deepening was low risk because it reused existing investment. The telephony was already provisioned. The monitoring was already running. The team already knew how to operate the agent. The new intent added capability without adding operational complexity.

Broadening meant building new agents for different use cases. The COI Verification Agent was a completely new workflow. Different tools, different conversation design, different success metrics. It shared the platform but nothing else.

Broadening required more investment. New tool contracts, new prompts, new test suites, new operational runbooks. But it also unlocked new value. COI collection was a manual process that consumed hundreds of operator hours monthly.

Extending meant adapting existing agents to new channels, languages, or markets. Spanish support for the Install Reminder Agent required new STT/TTS configurations, translated prompts, and a culturally adapted persona. The conversation logic stayed the same. The delivery changed.

Extending required specialized investment. Translation was more than word substitution. The agent needed to sound natural in Spanish, not like an English speaker translating into Spanish. Pacing, phrasing, and cultural references all needed adaptation.

What to build first

Marcus evaluated each pathway against three criteria.

Shared infrastructure leverage measured how much existing investment could be reused. Deepening had the highest leverage. The install verifier reused everything except the new intent's prompt and tools. Broadening had moderate leverage. The COI Agent reused the platform and operational practices but needed everything else built anew. Extending had variable leverage. Spanish support reused the conversation logic but needed new voice infrastructure.

Risk profile measured what could go wrong and how severe the consequences would be. Deepening was the lowest risk because failures would be isolated to the new intent while the existing agent continued working. Broadening was moderate risk: an unproven agent could fail, but its failure wouldn't affect existing agents. Extending was the highest risk because a poorly localized agent could damage the brand with an entire language segment.

Value unlocked measured the business impact of success. Deepening unlocked incremental value. The install verifier improved completion rates for an existing process. Broadening unlocked new value. The COI Agent automated a manual process. Extending unlocked market value. Spanish support reached the 30% of drivers who preferred Spanish and were getting a degraded experience in English.

Marcus plotted each option and chose the sequence: deepen first, broaden second, extend third.

The install verifier went first. High leverage, low risk, incremental value. A good way to build confidence and demonstrate consistent execution.

The COI Agent went second. Moderate leverage, moderate risk, new value. More investment required, but with a clearer business impact.

Spanish support went third. Lower leverage, higher risk, significant market value. Worth doing, but worth doing carefully.
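The sequencing logic can be sketched in a few lines. This is only an illustration: the 1-to-3 scores are an assumed encoding of the qualitative ratings above, and the `Option` class and `sequence` function are hypothetical names, not part of any platform.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    pathway: str
    leverage: int  # assumed scale: 1 = low reuse, 3 = high reuse
    risk: int      # assumed scale: 1 = low risk, 3 = high risk
    value: str     # kind of value unlocked, kept as a label only


# Scores are an assumed translation of the ratings described in the text.
options = [
    Option("Spanish support", "extending", leverage=1, risk=3, value="market"),
    Option("COI Verification Agent", "broadening", leverage=2, risk=2, value="new"),
    Option("Install verifier", "deepening", leverage=3, risk=1, value="incremental"),
]


def sequence(candidates: list[Option]) -> list[Option]:
    """Order candidates by lowest risk first, then by highest leverage."""
    return sorted(candidates, key=lambda o: (o.risk, -o.leverage))


order = [o.pathway for o in sequence(options)]
print(order)
```

Under these assumed scores the ordering matches the sequence Marcus chose. The value label is deliberately left out of the sort key, since the sequencing rule here is leverage and risk.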

Where the time goes

With multiple agents in production and more in development, Marcus faced a resource allocation problem. How much capacity went to maintaining existing agents versus building new ones?

He'd seen the common failure. Teams starved existing agent maintenance to fund new builds. The existing agents degraded slowly. Small issues accumulated. By the time anyone noticed, the maintenance backlog was overwhelming.

Marcus established an allocation rule. 70% of capacity went to the current priority initiative. 20% went to maintenance and optimization of existing agents. 10% went to exploration and experimentation.

The 70% built the install verifier, then the COI Agent, then Spanish support. One major initiative at a time, fully staffed.

The 20% kept the Install Reminder Agent healthy. Weekly optimization reviews continued. Bugs got fixed. Monitoring stayed current. The existing agent didn't degrade while attention went elsewhere.

The 10% explored future possibilities. Could the platform handle inbound calls? Could agents transfer between each other? What would multi-language support require? Exploration seeded the roadmap for future prioritization.

The allocation wasn't rigid. Some weeks, a production incident demanded more maintenance attention. Some weeks, a major deadline demanded more building attention. But the baseline allocation prevented any category from being starved.
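As a rough sketch, the baseline split and a "starvation" check could look like the following. The category names, the 400-hour example, and the `floor_ratio` cutoff are assumptions made for illustration; only the 70/20/10 split comes from the text.

```python
# Baseline split from the chapter; the category names are illustrative labels.
BASELINE = {"priority_initiative": 0.70, "maintenance": 0.20, "exploration": 0.10}


def allocate(total_hours: float, split: dict[str, float] = BASELINE) -> dict[str, float]:
    """Divide a period's capacity across categories according to the split."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("split must sum to 1.0")
    return {name: round(total_hours * share, 1) for name, share in split.items()}


def starved(actual_hours: dict[str, float], total_hours: float,
            floor_ratio: float = 0.5) -> list[str]:
    """Return categories that received less than floor_ratio of their planned hours.

    floor_ratio is a hypothetical tolerance: weeks can deviate from the
    baseline, but a category far below plan is worth a second look.
    """
    plan = allocate(total_hours)
    return [name for name, planned in plan.items()
            if actual_hours.get(name, 0.0) < planned * floor_ratio]


plan = allocate(400)  # e.g. 400 team-hours in a period (assumed figure)
flagged = starved(
    {"priority_initiative": 330, "maintenance": 65, "exploration": 5}, 400
)
print(plan, flagged)
```

In this example maintenance dipped below plan but stayed above the floor, while exploration was squeezed to almost nothing and gets flagged.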

What gets reused

As the agent portfolio grew, Marcus noticed duplication. Each agent implemented identity verification in a slightly different way. Each had its own error-handling patterns. Each built monitoring dashboards from scratch.

Nina identified candidates for shared components.

Identity verification became a shared module. The same verification flow, prompts, and error handling were used by every agent that needed to verify callers. Changes to verification updated all agents simultaneously.

Context lookup became a shared service. When an agent needed to retrieve caller history, it used the same lookup service regardless of which agent was requesting it. The service was optimized once and used everywhere.

Knowledge base infrastructure became shared. Multiple agents could access the same policy documents, FAQ content, and operational procedures. New documents were indexed once and available to all agents.

Operational tooling became standardized. Monitoring dashboards followed a template. Alerting configurations followed a pattern. Runbooks used consistent formats. New agents launched with operational infrastructure already in place.

Shared components created compounding returns. The second agent was faster to build than the first because it reused components. The third was faster than the second. The investment in getting components right early paid dividends as the portfolio grew.

Managing multiple agents

At three agents, informal coordination worked. With five agents, Marcus needed governance.

He established a portfolio review cadence. Monthly meetings with engineering, operations, and business stakeholders. Each agent's performance was reviewed. Resource allocation decisions made. Expansion priorities set.

The review covered standardized metrics across all agents. Completion rates, error rates, cost per interaction, business outcomes. Comparing agents on consistent metrics revealed which were healthy and which needed attention.

The review also covered dependencies. Did Agent A's performance depend on a backend service that Agent B also used? Could a change to that service affect both agents? Dependencies needed visibility across the portfolio.

Versioning became critical. Which prompt version was each agent running? Which tool contract versions? Which platform version? When an issue appeared, the governance process enabled rapid identification of which agents might be affected.
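One way to picture this is a small version registry that can answer "which agents run this version?" The sketch below is hypothetical: the record fields, version strings, and the `affected_agents` helper are invented for illustration and do not describe any particular platform's API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    name: str
    prompt_version: str
    platform_version: str
    # Maps tool contract name -> version (names here are illustrative).
    tool_contracts: dict[str, str] = field(default_factory=dict)


def affected_agents(registry: list[AgentRecord], component: str, version: str) -> list[str]:
    """List agents currently running the given version of a component.

    component is "prompt", "platform", or the name of a tool contract.
    """
    hits = []
    for record in registry:
        if component == "prompt":
            current = record.prompt_version
        elif component == "platform":
            current = record.platform_version
        else:
            current = record.tool_contracts.get(component)
        if current == version:
            hits.append(record.name)
    return hits


# Hypothetical registry contents.
registry = [
    AgentRecord("Install Reminder", "v14", "2024.06",
                {"identity_verification": "1.2.0", "context_lookup": "2.0.1"}),
    AgentRecord("COI Agent", "v3", "2024.06",
                {"identity_verification": "1.2.0"}),
    AgentRecord("Spanish Install Reminder", "v2", "2024.05",
                {"identity_verification": "1.1.0", "context_lookup": "2.0.1"}),
]

print(affected_agents(registry, "identity_verification", "1.2.0"))
```

If an issue is traced to one version of a shared tool contract, a lookup like this narrows the investigation to the agents that depend on it.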

The automotive marketplace, managing over a thousand agent configurations across five countries, demonstrated the endpoint of portfolio governance. Their center of excellence treated agents as a system. No agent stood alone. Each was part of a portfolio with shared standards, shared infrastructure, and shared operational practices.

A year after the Install Reminder Agent launched, the logistics company operated three agents. The Install Reminder (now with verification), the COI Agent, and the Spanish Install Reminder.

The roadmap showed three more for the following year. Inbound support triage, callback scheduling, and Portuguese language support.

Marcus pulled up the timeline for the first agent. Four months from concept to production. The COI Agent, the third one they'd built, had taken six weeks. The Spanish Install Reminder, which reused almost everything from the English version, had taken two weeks.

The pattern was clear. Each agent launched faster than the one before. Shared components accumulated. Operational practices solidified. The team improved at a skill they'd use for years.

The Install Reminder Agent had been a project. The portfolio was becoming something else. A capability that would keep compounding as long as they kept building on what came before.

Chapter 31: Handling Drift

What you'll learn: How to detect and correct the silent performance degradation that accumulates without anyone noticing.

Key takeaways:

Voice agents experience four types of drift. Model drift (provider updates change behavior), data drift (caller patterns shift), behavior drift (prompt edits accumulate inconsistencies), and business drift (the business changes, but the agent doesn't).
Model drift is the most insidious. Your agent changes without you changing anything.
Detect drift by comparing behavior distributions week over week. Automated checks flag statistically significant shifts before metrics visibly degrade.
Set degradation thresholds in advance. Know which metric movements trigger investigation versus immediate action.
Prevention beats detection. Monitor provider changelogs, track caller characteristics, maintain prompt hygiene, and integrate voice agents into business change processes.

The scheduling agent hadn't changed in six weeks. No prompt updates, no tool changes, no configuration modifications. Grace reviewed the deployment logs to confirm. Nothing had been touched.

But the agent's behavior had changed. It used to interpret "sometime next week" as a request for options. Now it was booking Monday appointments without offering alternatives. Patients were getting appointments they hadn't explicitly chosen.

Grace investigated the platform changelog. The LLM provider had shipped a model update three weeks earlier. The update was labeled as a "minor improvement to instruction following." Grace hadn't been notified because it wasn't a breaking change.

The model update had subtly shifted how the model interpreted scheduling language. What the old model treated as a preference request, the new model treated as a direct instruction. The agent's prompt hadn't changed. The model underneath it had.

This was model drift. The agent changed without anyone at the healthcare provider making any changes.

Four types of drift

Grace catalogued the drift categories her team needed to monitor.

Model drift happened when LLM or speech providers updated their systems. A new model version could change how prompts were interpreted. A new speech-to-text update could change transcription accuracy for certain accents. A new text-to-speech update could change voice characteristics. The agent's behavior shifted without any change on the customer's side.

Model drift was the most insidious because it was invisible. No deployment happened. No configuration changed. The metrics simply started moving for reasons that weren't apparent in internal logs.

Data drift happened when caller populations or behaviors changed over time. Seasonal patterns affected what people called about. Marketing campaigns brought different demographics. Product changes altered the questions customers asked. The agent that worked well for last month's callers might not work well for this month's callers.

Data drift was gradual and often masked by volume. A slow shift in caller demographics took months to affect metrics noticeably. By the time the degradation became visible, the underlying cause was long past.

Behavior drift happened when accumulated prompt changes introduced inconsistencies. A prompt that started clean accumulated fixes, edge case handling, and special cases. Over time, instructions conflicted with each other. The model received contradictory guidance and produced unpredictable behavior.

Behavior drift was self-inflicted. It came from the team's own incremental changes, made without reviewing the entire prompt. Each change seemed reasonable. The accumulation became incoherent.

Business drift happened when the business changed but the agent didn't. Policies updated. Products launched. Prices changed. The agent continued operating on stale information. Callers received outdated answers or were routed to offerings that no longer existed.

Business drift was the most common because organizations changed constantly, and their update processes often forgot the voice agent.

How to spot it

Each drift type required a different detection approach.

Model drift detection compared behavior distributions week over week. What percentage of scheduling calls resulted in offering options versus direct booking? What percentage of verification calls required multiple attempts? Shifts in these distributions, unconnected to any internal change, suggested model drift.

Grace built automated checks that compared this week's behavior distribution to last week's. Statistical tests flagged significant deviations. When the "option offering rate" dropped from 70% to 40% with no internal changes, the alert fired.
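A check like this can be sketched with a two-proportion z-test, one common choice for comparing a rate between two weeks. The text does not say which statistical test Grace used, so the test itself, the call counts, and the `alpha` cutoff below are assumptions.

```python
import math


def two_proportion_z_test(hits_prev: int, n_prev: int,
                          hits_curr: int, n_curr: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for a change in a rate between two weeks."""
    p_prev = hits_prev / n_prev
    p_curr = hits_curr / n_curr
    pooled = (hits_prev + hits_curr) / (n_prev + n_curr)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_prev + 1 / n_curr))
    if se == 0:
        # Both weeks are all-hits or all-misses: no measurable change.
        return 0.0, 1.0
    z = (p_curr - p_prev) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def drift_alert(hits_prev: int, n_prev: int, hits_curr: int, n_curr: int,
                alpha: float = 0.01) -> bool:
    """Flag the shift when it is unlikely to be week-to-week noise."""
    _, p_value = two_proportion_z_test(hits_prev, n_prev, hits_curr, n_curr)
    return p_value < alpha


# Assumed volumes: 200 scheduling calls per week.
# Option-offering rate drops from 70% (140/200) to 40% (80/200).
print(drift_alert(140, 200, 80, 200))
# A small wobble from 70% to 68% should not fire.
print(drift_alert(140, 200, 136, 200))
```

The same function works for any rate-style behavior metric, such as the share of verification calls needing multiple attempts. With low weekly volumes the test loses power, so a longer comparison window may be needed.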

Data drift detection tracked caller characteristics and intent distributions. Were callers asking different questions? Were demographics shifting? Were call volumes changing by time of day or day of week? Changes in input patterns, even when agent behavior was consistent, suggested data drift.

Grace tracked intent distribution weekly. When a previously rare intent suddenly became common, or a common intent became rare, investigation was warranted. The drift might be benign (seasonal change) or might require agent adjustment.
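A minimal sketch of weekly intent tracking follows. The intent labels, the sample counts, and the 10-point `min_change` cutoff are hypothetical; in practice the cutoff would be tuned to how much the distribution normally moves.

```python
from collections import Counter


def intent_shares(intents: list[str]) -> dict[str, float]:
    """Convert a list of per-call intent labels into share-of-calls per intent."""
    counts = Counter(intents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {intent: count / total for intent, count in counts.items()}


def shifted_intents(prev_week: list[str], curr_week: list[str],
                    min_change: float = 0.10) -> dict[str, float]:
    """Return intents whose share of calls moved by at least min_change.

    Values are the change in share (current minus previous), so a rare
    intent becoming common shows up as a positive number.
    """
    prev, curr = intent_shares(prev_week), intent_shares(curr_week)
    changes = {}
    for intent in sorted(set(prev) | set(curr)):
        delta = curr.get(intent, 0.0) - prev.get(intent, 0.0)
        if abs(delta) >= min_change:
            changes[intent] = round(delta, 3)
    return changes


# Hypothetical weeks of labeled calls for a scheduling agent.
last_week = ["book"] * 60 + ["reschedule"] * 30 + ["cancel"] * 10
this_week = ["book"] * 40 + ["reschedule"] * 30 + ["cancel"] * 30

print(shifted_intents(last_week, this_week))
```

The output only says that something moved. Deciding whether the shift is benign or calls for an agent change is still a human judgment.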

Behavior drift detection required periodic prompt audits. Grace scheduled monthly reviews where the team read the full prompt and checked for contradictions, redundancies, and unclear instructions. The review asked: if this prompt were being written fresh today, would it look like this?

Automated tools could help. Prompt linting checked for obvious issues. But human review caught the subtle inconsistencies introduced by incremental editing.

Business drift detection required integration with business change processes. When policies were updated, the voice agent needed to be on the change checklist. When products launched, prompts needed to be reviewed. When pricing changed, any agent that discussed pricing needed to be updated.

Grace worked with the business teams to add voice agent review to their change management processes. The agent became a stakeholder in product and policy changes, notified alongside other systems that needed updates.

How much is too much

Not every drift required immediate action. Grace defined thresholds that distinguished acceptable variation from concerning degradation.

For completion rate, a 2-point drop within normal variance didn't trigger an investigation. A 5-point drop did. The threshold was calibrated to historical variance. More volatile metrics had wider thresholds.

For satisfaction scores, any drop beyond historical variance triggered an investigation. Satisfaction was noisier than completion rate, so the threshold was set relative to observed variance, not an absolute number.

For tool error rates, any increase beyond baseline triggered an investigation. Errors were expected to be rare and stable. An increase was never treated as normal variation.

Thresholds differed by goal. A cost-focused agent could tolerate more satisfaction variance than a CX-focused agent. A revenue-focused agent had tight thresholds on conversion rate but looser thresholds on handle time.

Grace documented the thresholds and the reasoning behind them. When an alert fired, the responder knew whether it was barely over threshold or significantly over threshold. Severity drove response urgency.
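Thresholds calibrated to historical variance can be sketched as follows. The sample history and the multipliers `investigate_k` and `act_k` are assumptions chosen so that a 2-point drop passes and a 5-point drop triggers investigation, as in the completion-rate example; the text does not give the actual values.

```python
import statistics


def classify_drop(history: list[float], current: float,
                  investigate_k: float = 2.0, act_k: float = 4.0) -> str:
    """Classify a metric drop relative to its own historical variability.

    Returns "ok", "investigate", or "act". The multipliers are hypothetical:
    a drop of investigate_k standard deviations warrants a look, and a drop
    of act_k standard deviations warrants an immediate response.
    """
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)  # needs at least two data points
    drop = baseline - current
    if spread == 0:
        # A perfectly stable metric: any drop at all is unusual.
        return "investigate" if drop > 0 else "ok"
    score = drop / spread
    if score >= act_k:
        return "act"
    if score >= investigate_k:
        return "investigate"
    return "ok"


# Assumed weekly completion rates (percent) with a mean of 83.
history = [82, 84, 83, 81, 85, 83, 82, 84]

print(classify_drop(history, 81))  # 2-point drop
print(classify_drop(history, 78))  # 5-point drop
print(classify_drop(history, 75))  # 8-point drop
```

Because the score is expressed in standard deviations, a noisier metric automatically gets a wider band. For metrics where higher is worse, such as tool error rate, the sign of the drop would be flipped.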

Fixing drift

When drift was detected, correction depended on the type.

Model drift correction often meant prompt adjustment. If the new model interpreted language differently, the prompt needed to be more explicit. "When the caller says 'sometime next week,' offer three available options before booking" was clearer than assuming the model would infer the right behavior.

Sometimes model drift couldn't be corrected by prompt changes. If the new model was significantly worse for the use case, Grace escalated to the platform team to explore rollback or alternative providers.

Data drift correction meant agent adaptation. If new caller segments had different needs, the agent needed to address them. If seasonal patterns changed call topics, prompts might need seasonal variants. Data drift was a signal to update, not a problem to fix.

Behavior drift correction meant prompt cleanup. Grace periodically refactored prompts, consolidating redundant instructions, resolving contradictions, and simplifying complex sections. The refactored prompt produced the same behavior but with cleaner instructions that reduced future inconsistency risk.

Business drift correction meant content updates. Update the prompt with new policies. Update the knowledge base with new products. Update the tool contracts with new backend behaviors. Business drift was a synchronization issue, solved by keeping the agent up to date with the business.

Stopping it before it starts

Detection was necessary, but prevention was better.

Model update monitoring meant tracking provider changelogs and testing against new versions before they affected production. Grace established a staging environment that ran on preview model versions. When providers announced updates, the staging environment validated behavior before the update reached production.

Data monitoring meant continuous tracking of caller characteristics. Grace built dashboards that showed intent distribution, caller demographics, and call patterns over time. Gradual shifts became visible before they caused problems.

Prompt hygiene meant disciplined editing practices. Every change was small and targeted. Every change was documented with a rationale. Monthly audits checked for accumulating inconsistencies. The prompt stayed clean because the team actively maintained it.

Business integration meant embedding the voice agent in change management. When policies changed, the agent was updated. When products were launched, the agent learned about them. The agent was treated as part of the system, not a separate tool that might be forgotten.

The model drift incident took three weeks to identify through QA sampling. With automated drift detection, it would have been flagged in days.

Grace built detection for all four types. Weekly behavior distribution comparisons. Weekly intent tracking. Monthly prompt audits. Continuous integration with business change processes.

Three months later, the LLM provider shipped another update. The behavior distribution dashboard flagged a shift within 48 hours. The agent had started interpreting "sometime this week" as a Monday preference, the same pattern as before.

Grace adjusted the prompt to be more explicit. Patients never noticed.

The drift was inevitable. The degradation wasn't. The difference was whether anyone was watching the space between this week and last week, looking for patterns that seemed normal in isolation but weren't normal when you tracked the trend.

Chapter 32: Building Internal Capability

What you'll learn: How to evolve from a pilot project to a strategic organizational capability.

Key takeaways:

Organizations move through three maturity stages. Pilot (one use case, external support, learning), Scale (multiple use cases, internal team, optimization), and Strategic (portfolio management, center of excellence, self-service).
Key roles to build include conversation designers, prompt engineers, QA specialists, platform engineers, and analytics engineers.
The most successful programs retrain existing contact center staff rather than replacing them. They understand customers from years of interaction and can improve agents faster than external teams.
Governance must scale with the portfolio. Lightweight at the Pilot stage, formal at the Strategic stage.
The endgame is voice AI as a strategic organizational capability, not a one-off automation project. Organizations that build this capability gain a durable competitive advantage.

The automotive marketplace started with a pilot. One use case, one engineering team, and a consulting firm helping with conversation design. The agent routed customer inquiries, freeing human agents for complex issues.

Three years later, they operated over 1,000 agent configurations across 5 countries. The four floors of call center agents who used to take calls now maintained and improved AI systems. A center of excellence set standards, trained teams, and reported program outcomes to executives.

That evolution didn't happen by accident. It happened because Lucia, the VP who championed the program, understood that voice AI was a capability, not a project. Projects end. Capabilities mature.

The maturity model

Lucia navigated three distinct stages, each requiring a different level of organizational investment.

The pilot stage was about learning. One use case. Small team. External support for specialized skills. Success meant proving the concept worked. Failure was acceptable because learning happened either way.

At the pilot stage, the automotive marketplace had two engineers, borrowed time from the product team, and a consulting firm providing conversation design expertise. The scope was narrow. The goal was validation.

The scale stage was about multiplication. Multiple use cases. Internal team. Optimization focus. Success meant demonstrating repeatable value. Failure was more costly because the investment had increased.

At the scale stage, Lucia built an internal team. Conversation designers who understood voice. Prompt engineers who could iterate without external help. QA specialists who owned quality. Platform engineers who maintained the infrastructure. The consulting firm transitioned to advisory, available for difficult problems but not embedded in daily work.

The strategic stage was about organizational embedding. Portfolio of agents. Center of excellence. Self-service tooling. Success meant voice AI becoming a standard capability for the organization. Failure was unlikely because the capability had become institutionalized.

At the strategic stage, the center of excellence set standards that all agents followed. Business teams could request new agents through a defined process. Some teams could iterate on prompts themselves using approved tooling. The capability extended beyond the central team into the broader organization.

Key roles

Each maturity stage required different roles.

Conversation designers translated business requirements into conversation flows. They understood how people talked on the phone and how agents should respond. They wrote the initial prompts and defined the conversation structure.

Prompt engineers refined and optimized the prompts. They ran the weekly optimization reviews. They made targeted changes based on data. They maintained prompt quality over time.

QA specialists ensured conversation quality through sampling and scoring. They built rubrics. They trained reviewers. They connected findings to improvements.

Platform engineers maintained the infrastructure. They managed integrations with STT, LLM, and TTS providers. They built monitoring and tooling. They ensured the platform was reliable and performant.

Analytics engineers built the data pipelines that enabled conversation analytics. They structured logs. They created dashboards. They enabled the analysis that drove optimization.

At the pilot stage, one or two people covered multiple roles, with external support. At the scale stage, roles became dedicated. At the strategic stage, roles expanded into teams with specialized functions within each.

Build, buy, or partner

Lucia made deliberate choices about which capabilities to build internally and which to source externally.

She built conversation design internally. This was core to competitive advantage. Understanding their customers' conversations was strategic knowledge that shouldn't live outside the company.

She bought platform infrastructure. Building a voice AI platform from scratch wasn't worth the investment when mature platforms existed. The technology was commoditizing. The differentiation was in how it was applied.

She partnered for specialized expertise. When entering new markets, she engaged consultants who understood local conversation patterns. When facing novel technical challenges, she engaged specialists. The partners extended capability without requiring permanent headcount.

The build/buy/partner decision changed over time. At the pilot stage, more was partnered because internal capability didn't exist. At the strategic stage, more was built because the capability had become central to operations.

Training the transformed workforce

The four floors of call center agents didn't disappear. They transformed.

Lucia invested heavily in training. Agents who had spent years talking to customers learned how to design conversations. They understood customer needs from experience. Training gave them the vocabulary and frameworks to apply that understanding to AI systems.

The training program ran for three months. Fundamentals of voice AI. Prompt writing and iteration. Quality review methodology. Analytics interpretation. Hands-on practice with real agents.

Not everyone completed the transition successfully. Some agents struggled with the analytical aspects of the new work. Lucia created two paths. Agents who thrived with AI work became experience builders, maintaining and improving agents. Agents who preferred human interaction became the escalation layer, handling complex calls that AI couldn't handle.

Both paths were positioned as valuable. The escalation role wasn't a consolation prize. It was a specialized function that required human judgment in difficult situations. The experience builder role wasn't a replacement for the old job. It was a higher-leverage application of the same customer knowledge.

Retention during the transformation was higher than Lucia expected. Agents appreciated the investment in their development. They preferred learning new skills to watching their jobs slowly become obsolete.

When you have a thousand agents

With over a thousand agent configurations, governance became essential.

Lucia established approval processes. New agents required a business case review. Prompt changes to production agents required peer review. Changes affecting multiple agents required an architecture review.

She established standards. Prompt structure followed a template. Error handling followed patterns. Monitoring included required metrics. New agents launched with consistent infrastructure.

She established ownership. Each agent had a designated owner responsible for its performance. Owners attended monthly portfolio reviews. Ownership rotated when people moved to new roles.

She established measurement. The program reported quarterly to the executive team. Outcomes were tied to business metrics. Investment was justified by demonstrated value.

Governance added process overhead. But governance also prevented the chaos that would have resulted from a thousand agents managed independently with no coordination.

Knowing where you are

Lucia developed a self-assessment that helped teams understand where they were and what they needed to build.

Pilot maturity meant one or two agents, external support for key skills, reactive operations, and limited measurement. Teams at pilot maturity should focus on proving value with minimal investment.

Scale maturity meant multiple agents, an internal team for core skills, proactive operations, and systematic measurement. Teams at scale maturity should focus on building internal capability and demonstrating repeatable value.

Strategic maturity meant a portfolio of agents, a center of excellence, self-service for routine work, and program-level reporting. Teams at strategic maturity should focus on organizational embedding and continuous capability development.

The assessment asked specific questions. Do you have dedicated conversation designers? Do you run weekly optimization reviews? Do you have automated drift detection? Do business teams know how to request new agents?

Teams assessed themselves honestly and built roadmaps to reach the next stage. The roadmaps focused on capability gaps rather than just agent counts.

Three years later

Three years into the transformation, voice AI was part of how the automotive marketplace operated. Not a special project with dedicated attention. A standard capability, like the website or the mobile app.

The four floors of call center agents had become four floors of AI experience builders. They understood customers from years of direct interaction. They understood AI after months of training and practice. They improved agents faster than any external team could because they lived in the customer conversations every day.

Lucia walked one of those floors during a quarterly review. A former call handler, now an experience builder, was explaining a prompt change she'd made to handle a specific customer complaint pattern. She'd found the issue in conversation analytics, designed the fix, tested it, and deployed it. The whole cycle had taken three days.

Three years earlier, that same employee had answered phones. Now she was teaching an AI system to do it better than she ever could.

The technology had been the beginning. This was what it had been for.