Vapi raises $50M Series B to power the next generation of enterprise voice AI

Part 1

Strategy

~33 min • 4 chapters

Chapter 1: From IVR to Voice Agents

What you'll learn:

What changed to make voice agents possible and where the technology succeeds and fails at enterprise scale.

Key takeaways:

IVR hit a ceiling because it only understands the options you programmed. Voice agents understand meaning and can ask clarifying questions, handle multi-step workflows, and take action.
Three technologies matured together. Large language models that work with meaning rather than keywords, speech recognition that handles real-world audio, and synthesized speech that doesn't exhaust callers.
Voice agents can be specialists in everything at once. One agent handles scheduling, billing, and support in a single conversation without transfers or holds.
Most pilots succeed. Most production deployments fail. The gap is operational, not technical.
Voice agents fail when deployed in high-emotion situations, when every case is an exception, when rules aren't documented, or when success requires persuasion.

Audrey's furnace dies at 11pm on a Tuesday in January. She calls the HVAC company that installed it three years ago. Press 1 for sales, press 2 for service, press 3 for billing. She presses 2. Another menu. She presses 3 for emergency service. She's on hold. After four minutes, she hangs up and calls a competitor.

That company lost a $400 service call and possibly a customer for life. Not because they lacked technicians. Because their phone system was designed for the company's convenience, not hers. Multiply this by millions of calls a day, and you start to see the scale of the problem.

Why IVR hit a ceiling

IVR emerged in the 1970s as an elegant solution to a real problem. Companies were drowning in call volume. Touch-tone menus let callers route themselves to specialized agents. Route billing questions to billing experts. Route technical issues to technical specialists. An assembly line for phone calls, and for simple needs, it worked.

The limitation was baked into the design. IVR systems are deterministic. They understand the options you programmed and nothing else. A caller says, "I need to change my delivery address but also check on a refund," and the system forces a choice. Pick one intent. Get routed. Explain yourself again. Get transferred. Start over.

Adding speech recognition didn't fix it. Early systems matched callers' utterances to predefined intents. Say "billing," and you get routed to billing. Say "I'm calling about the charge on my statement from last week," and the system either extracts the word "billing" or fails. Better training data and smarter models couldn't break through because the ceiling was architectural. The system could only ever be as flexible as the buckets you defined in advance.

What changed

Three technologies matured at roughly the same time. Each was necessary. None was sufficient alone.

Large language models learned to work with meaning rather than keywords. Traditional NLU classifies inputs into predefined categories. Language models build representations of meaning that generalize across contexts. A caller can say, "I ordered something last week, and it still hasn't shown up," and the model understands they're asking about order status without using those words. More importantly, it can figure out what information it needs and ask for it naturally.

Speech recognition crossed a threshold. Modern ASR handles accents, background noise, crosstalk, and the false starts of natural speech. Someone can call from a car with the radio on, interrupt themselves twice, and the system keeps up.

Synthesized speech stopped being a barrier. Earlier text-to-speech had a mechanical quality that made extended conversations exhausting. Current TTS matches tone and pacing well enough that the voice itself is no longer a distraction.

Chain these together, and the loop runs in under a second. Caller speaks; speech becomes text; the model reasons and responds; text becomes speech. Fast enough that the conversation feels natural.
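The sub-second loop can be read as a latency budget: each stage runs after the previous one, so their times add up. The sketch below shows that arithmetic. The stage timings and the 1000 ms budget are illustrative assumptions, not measurements from any real provider.

```python
from dataclasses import dataclass


@dataclass
class StageTiming:
    """One stage of the speech -> text -> reasoning -> speech loop."""
    name: str
    millis: float


def turn_latency_ms(stages):
    """Total time for one caller turn, assuming the stages run in sequence."""
    return sum(stage.millis for stage in stages)


def within_budget(stages, budget_ms=1000.0):
    """True when the whole loop finishes in under the budget."""
    return turn_latency_ms(stages) < budget_ms


# Hypothetical per-stage timings, for illustration only.
turn = [
    StageTiming("speech_to_text", 250.0),
    StageTiming("language_model", 400.0),
    StageTiming("text_to_speech", 200.0),
]

total = turn_latency_ms(turn)   # 850.0
ok = within_budget(turn)        # True
```

Because the stages add, one slow component is enough to push the whole turn past the point where the conversation feels natural.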

What this makes possible

Those three capabilities combine into something IVR could never deliver.

Go back to Audrey and the broken furnace. In a voice agent world, she calls at 11pm, and an agent answers immediately. It asks her to describe the problem. She says the furnace stopped working and the house is getting cold. The agent asks if she smells gas. She doesn't. It asks for her address, confirms she's a customer from a previous installation, and offers three appointment windows for the next morning. She picks one. It confirms the details and asks if there's anything else. The call takes two minutes. The company keeps the customer.

That interaction breaks the old assembly-line model. A voice agent can be a specialist in everything at once. It doesn't route Audrey to a scheduling specialist, then to a billing specialist, then back to scheduling. It handles everything in one conversation. No transfers, no hold, no repeating yourself.

Where this breaks down

The technology is ready, but deploying it well requires understanding where it fits and where it doesn't.

Some situations need humans. When Audrey calls back furious because the technician showed up late, tracked mud through her house, and charged more than the estimate, she doesn't want efficient problem resolution. She wants to be heard. A voice agent might say all the right words and still make it worse.

Some situations are too ambiguous. Voice agents work best when the rules are clear. When your internal documentation conflicts, when your best human agents would handle the same situation differently, when tribal knowledge fills gaps in formal policy, agents reflect that confusion back to callers. They're exactly as good as your documentation, which is often not good enough.

The question isn't whether voice agents can handle calls. They can. The question is which calls, with what support structure, and with what fallback when they hit their limits.

Getting to production

Most pilots succeed. Most production deployments struggle. The gap between them is operational, not technical.

Production systems need to work reliably across the full range of real-world variance that doesn't show up in controlled tests. They need to fail safely, with clean handoffs to humans instead of fabricated confirmations or dropped calls. They need monitoring, debugging, cost controls, and the ability to roll back changes that cause problems.

At enterprise scale, you'll run more than one agent. Your scheduling agent needs different capabilities than your billing dispute agent. When a caller's needs span multiple domains, something has to route between agents, maintain context across handoffs, and recover when a component fails. This is orchestration, and it's where enterprise voice AI either scales or stalls.

Organizations are handling millions of calls through AI agents today. The technology works. But most voice agent projects still fail because teams pick the wrong use cases, scope too broadly, or launch without the operational infrastructure to run at scale. The rest of this playbook is about avoiding those mistakes.

Chapter 2: Where Voice Agents Work

What you'll learn: How to evaluate whether a use case will succeed before you build, using patterns from hundreds of deployments.

Key takeaways:

Five characteristics predict success. High volume, repetitive and predictable patterns, clear success criteria, strong backend systems, and time-sensitive value.
Miss one characteristic and you can compensate. Miss two and the project struggles. Miss three and you're better off not starting.
Voice agents can exceed human performance in specific ways. Expertise across every domain simultaneously, perfect consistency across thousands of calls, and availability at any hour without degradation.
Voice agents fail when emotional stakes dominate, when every case is an exception, when rules aren't documented, or when success requires persuasion.
The fit assessment should happen before any technical work. A well-fit use case on a mediocre platform outperforms a poor-fit use case on excellent technology.

Maria runs screening operations for a staffing company that places hourly workers. Bartenders, line cooks, warehouse packers, and forklift operators. Her team screens thousands of candidates a week, and each role requires different questions. A bartender needs knowledge of drinks and customer service. A forklift driver needs certification. A line cook needs speed and experience with specific cuisines.

No single human interviewer can screen for all of these roles. So Maria hires specialists. Except the specialists burn out, turn over constantly, and drift off-script when they get tired.

Then there's Diego, who applied for a warehouse job on a Sunday night. Qualified, available, ready to start Monday morning. But Maria's screening team doesn't work Sundays. By Monday afternoon, Diego had already taken a job with a competitor.

She moved to voice agents. One agent screens for every role, every shift, every day. It never gets tired. It never forgets to ask about certifications. Diego would have been screened Sunday night, cleared within minutes, and working Monday morning. Maria's team now filters more than half of the unqualified candidates before they reach a human, handling over a million minutes of screening calls per month.

Another company saw these results and tried the same approach for billing complaints. High volume. Repetitive. Clear outcome needed. It failed within weeks. Customers called frustrated about charges they didn't recognize and reached an agent that couldn't investigate account history, couldn't process a refund, and couldn't tell when a customer was about to escalate. Every call became a fifteen-minute frustration followed by a transfer.

The difference wasn't the technology. The same platform powered both deployments. The difference was fit.

What makes a use case work

After watching hundreds of deployments succeed and fail, patterns emerge. The use cases that work share five characteristics. Miss one and you can compensate. Miss three and you're better off not starting.

High volume. Voice agents have fixed costs. Building conversation flows, integrating systems, testing edge cases, and monitoring performance. That investment pays off only if you spread it across enough calls. Fifty calls a week will never justify the effort. Fifty thousand calls a month changes the math entirely.

Repetitive and predictable. Voice agents excel when conversations follow recognizable patterns. Not identical scripts, but variations on known themes. Maria's screening calls varied by role, but the structure was consistent: verify identity, confirm availability, ask role-specific questions, assess, and schedule. Billing complaints looked repetitive from the outside but weren't from the inside. Each one required a different judgment about a different account history.

Clear success criteria. The best use cases have unambiguous outcomes. The candidate is qualified or not. The appointment is booked or not. The payment is collected or not. When success requires judgment about whether this particular customer got what they needed, you're in human territory.

Strong backend systems. Voice agents are only as capable as the systems they connect to. If your order management system has a clean API, the agent can check the status and modify deliveries. If resolving a call requires a human to navigate three screens and copy and paste between windows, the agent can only collect information. That's not automation. That's a complicated answering machine.

Time-sensitive value. Voice agents answer immediately. No hold times. No callbacks. If that immediacy matters, voice agents have a structural advantage. That's what the Diego problem was about. The company that could screen him on Sunday night got him. The company that made him wait lost him.

Maria had all five. The billing complaint team had one: volume. That's why one worked, and the other didn't.

Where voice agents exceed humans

Most discussions of voice agents frame the technology as matching human performance at a lower cost. In specific situations, it goes further.

Maria's agent holds the screening criteria for every role simultaneously. A human who screens bartenders all day can't also screen forklift drivers at the same expert level. One agent with breadth no single human can match. It screens candidate five thousand the same way it screened candidate one, which matters for fairness, compliance, and data quality. And it doesn't care when the phone rings. Sunday night costs the same as Tuesday afternoon.

These advantages compound. An agent that's expert in everything, perfectly consistent, and always available isn't just a cost savings. It changes what's operationally possible.

Where voice agents fail

Failure also follows patterns.

Emotional stakes dominate the interaction. When someone calls about a fraudulent charge that drained their checking account, they're scared. They want reassurance from someone who understands the severity. Voice agents can say appropriate words. They cannot provide genuine reassurance. Mild frustration about a delayed package is manageable. Distress is not.

Every case is an exception. Some processes exist precisely because the situations don't fit standard patterns. If the human handling these calls spends most of their time exercising judgment about one-off situations, a voice agent will spend most of its time transferring to humans.

The rules aren't written down. If your best agents succeed because of tribal knowledge and years of pattern recognition that nobody has documented, the voice agent has nothing to work with. Ask yourself: if you hired a smart new employee and gave them only your written documentation, could they handle this use case on day one? If not, your voice agent will have the same problem.

Know what you're solving for

Your primary goal shapes which characteristics matter most.

If your goal is to cut costs, volume matters most. You need enough calls to justify the investment. If your goal is to fix customer experience, friction matters most. Long hold times, repeated transfers, and callbacks. If your goal is driving revenue, conversion matters most. Lead qualification, appointment booking, payment collection, anywhere speed and consistency translate to dollars.

The same use case looks different through each lens. Order status checks are high volume and repetitive, making them attractive for cost reduction. But customers don't hate checking order status the way they hate disputing a bill, and there's no revenue upside. Know your goal before you evaluate fit.

Before you build

Use the five characteristics as a filter. High volume, repetitive patterns, clear success criteria, strong backend systems, and time-sensitive value. Five yeses and you have a strong candidate. Three or four and proceed carefully. Fewer than three, and look elsewhere.
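The filter above reduces to counting yeses. Here is a minimal sketch of that rule. The function name and dictionary keys are made up for illustration, and the thresholds are the ones stated in the text.

```python
CHARACTERISTICS = (
    "high_volume",
    "repetitive_patterns",
    "clear_success_criteria",
    "strong_backend_systems",
    "time_sensitive_value",
)


def assess_fit(answers):
    """Map yes/no answers for the five characteristics to a recommendation."""
    missing = [name for name in CHARACTERISTICS if name not in answers]
    if missing:
        raise ValueError(f"unanswered characteristics: {missing}")
    yeses = sum(1 for name in CHARACTERISTICS if answers[name])
    if yeses == 5:
        return "strong candidate"
    if yeses >= 3:
        return "proceed carefully"
    return "look elsewhere"


# Maria's screening calls had all five characteristics.
maria = {name: True for name in CHARACTERISTICS}

# The billing complaint deployment had only volume.
billing = {name: False for name in CHARACTERISTICS}
billing["high_volume"] = True

maria_result = assess_fit(maria)      # "strong candidate"
billing_result = assess_fit(billing)  # "look elsewhere"
```

The code is trivial on purpose. The value is in answering each question honestly before any technical work starts.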

Chapter 3: Choosing Your Primary Goal

What you'll learn: Why voice agent programs must optimize for one goal and how to identify which of the three goals is yours.

Key takeaways:

Voice agent programs fail when they try to serve multiple masters. Every design choice involves tradeoffs, and without a declared winner, every decision becomes a debate.
Three goals cover virtually all deployments. Cut costs (containment and efficiency), fix customer experience (satisfaction and effort), or drive revenue (conversion and retention).
Follow the budget to find your goal. If money comes from Operations or Finance, the goal is cost. From Customer Experience, the goal is CX. From Sales or Revenue Ops, the goal is revenue.
The Capability Ladder helps scope appropriately. Level 1 is informational only. Level 5 is proactive optimization. Most first agents should target Level 2 or 3.
Secondary goals will follow from success on your primary goal. Optimize for customer experience and cost savings often follow when satisfied customers call less often.

The VP of Operations wanted to cut costs. The Chief Customer Officer wanted to fix satisfaction scores. The CFO wanted both, plus faster collections on overdue accounts.

So the team built a voice agent that tried to do everything. Six months later, nobody was happy. The agent contained calls, but satisfaction dropped because it prioritized efficiency over experience. The collection prompts annoyed people who called about unrelated issues. Every metric moved a little, but none moved enough to matter.

Rachel led the second attempt. She spent two weeks on the call data before proposing anything. The company's biggest problem wasn't cost or collections. It was churn. Customers were leaving because support experiences were frustrating.

Rachel pitched one goal. Fix the customer experience. Every design decision would optimize for customer effort and satisfaction. Containment would matter only if it didn't hurt experience. Collections would stay with humans.

Satisfaction scores jumped. Churn dropped. And costs dropped too. Not because the agent was designed to cut costs, but because satisfied customers called less often. Fewer repeat contacts. Fewer escalations. Fewer angry calls that took twice as long to resolve.

The executive team tried to optimize for everything and achieved nothing. Rachel optimized for one thing and got the others as side effects.

Why one goal

Voice agent programs fail when they try to serve multiple masters. Not because the technology can't handle complexity, but because every design choice involves tradeoffs that need a tiebreaker.

Should the agent spend an extra thirty seconds confirming the customer is satisfied, or move efficiently to the next call? Should it push toward self-service resolution, or offer a human when the customer seems to want one? If your goal is to cut costs, you optimize for containment and speed. If your goal is customer experience, you optimize for satisfaction and effort. If your goal is revenue, you optimize for conversion. These priorities conflict constantly, and without a declared winner, every decision becomes a debate.

You will track secondary metrics. But when two priorities conflict, you need to know which one wins.

The three goals

Cut costs. Handle more interactions with fewer people. Measure containment rate and cost per interaction. This goal fits when your support operation is a cost center under pressure, when volume is growing faster than budget, or when you're spending heavily on after-hours and surge staffing.

Fix customer experience. Reduce the friction that drives complaints, erodes loyalty, and shows up in satisfaction scores. Measure customer effort score, satisfaction score, repeat contact rate, and escalation rate. This goal fits when experience is a strategic priority, when competitors are winning on service, or when your support operation generates too many complaints and too much churn.

Drive revenue. Convert more leads, save more customers who try to cancel, collect more payments, and book more appointments. Measure conversion rate, retention rate, collection rate, or booking rate depending on the use case. This goal fits when you have high-volume revenue conversations that humans can't scale, or when you're leaving money on the table because you can't staff enough people to capture demand.

Finding your goal

Three questions usually clarify which goal is yours.

Who is sponsoring this initiative? If the money comes from Operations or Finance, the goal is to reduce costs. If it's coming from a Chief Customer Officer, the goal is to improve the customer experience. If it's coming from Sales or Revenue Operations, the goal is revenue. The sponsor's success metrics will become your success metrics, whether you like it or not.

What triggered the project? "Why are we spending so much on support?" means cost. "Why are customers so frustrated?" means experience. "Why aren't we capturing more revenue?" means revenue.

How will success be measured in twelve months? If you can't answer this clearly, have that conversation before you build anything. Ambiguity about success criteria is how teams end up building the agent that tries to do everything.

The capability ladder

Voice agents can do more than answer questions. They can look up information, make changes, execute workflows, and initiate conversations proactively. More capability means more value and more risk.

Level 1 is informational only. FAQs, store hours, policies. No customer-specific data. Low risk, low value.

Level 2 is read-only account access. Order status, account balance, appointment details. Requires authentication and system integration. Value jumps because these are the use cases customers actually call about.

Level 3 is simple updates. Change an address, cancel a subscription, reschedule an appointment. Requires write access and business logic. Containment rates improve because the agent can resolve issues rather than just report them.

Level 4 is multi-step workflows. Process a refund by checking eligibility, calculating the amount, issuing the credit, and sending a confirmation. It requires orchestration and careful error handling. This is where agents begin to replace a significant portion of the human workload.

Level 5 is proactive and optimizing. The agent suggests alternatives, offers retention deals, and identifies cross-sell opportunities. Requires business rules and guardrails to prevent overreach.

Start at Level 2 or 3. Prove the technology works. Build operational confidence. Then climb.
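One way to keep scope honest is to write the ladder down as data. The sketch below lists each level with the requirements named above. Treating requirements as cumulative (a higher level also needs everything below it) is an assumption made here, not something the ladder states explicitly, and all identifiers are illustrative.

```python
from enum import IntEnum


class CapabilityLevel(IntEnum):
    INFORMATIONAL = 1
    READ_ONLY_ACCOUNT = 2
    SIMPLE_UPDATES = 3
    MULTI_STEP_WORKFLOWS = 4
    PROACTIVE = 5


# What each level adds, as described in the text.
REQUIREMENTS = {
    CapabilityLevel.INFORMATIONAL: (),
    CapabilityLevel.READ_ONLY_ACCOUNT: ("authentication", "system integration"),
    CapabilityLevel.SIMPLE_UPDATES: ("write access", "business logic"),
    CapabilityLevel.MULTI_STEP_WORKFLOWS: ("orchestration", "error handling"),
    CapabilityLevel.PROACTIVE: ("business rules", "guardrails"),
}


def cumulative_requirements(level):
    """Everything needed at `level`, assuming lower levels are prerequisites."""
    needed = []
    for lower in CapabilityLevel:
        if lower <= level:
            needed.extend(REQUIREMENTS[lower])
    return needed


def is_first_agent_target(level):
    """The text recommends starting at Level 2 or 3."""
    return level in (CapabilityLevel.READ_ONLY_ACCOUNT, CapabilityLevel.SIMPLE_UPDATES)


level_3_needs = cumulative_requirements(CapabilityLevel.SIMPLE_UPDATES)
# ["authentication", "system integration", "write access", "business logic"]
```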

Goal drift

Here's the failure mode to watch for after launch.

You start with a clear goal. Cut costs. The agent launches. It contains calls. The numbers look good. Then someone notices satisfaction scores for agent-handled calls are slightly below human baselines. Someone suggests adding more empathetic language. Someone else suggests longer conversations. Someone else suggests offering callbacks instead of pushing for resolution.

Each suggestion sounds reasonable. Each one degrades containment slightly. After six months of reasonable suggestions, your cost-cutting agent no longer cuts costs. It has become a mediocre experience agent by accident.

Guard against it by returning to your primary metric in every decision. Does this change improve cost per interaction? If not, why are we doing it? Maybe the answer justifies the tradeoff. But the question needs to be asked.

Rachel's agent succeeded because she never wavered. When someone suggested optimizing for containment, she asked whether it would hurt experience. When it would, she said no. When the CFO asked about collections, she said collections were not yet a goal. Every decision was filtered through the one goal she had chosen.

Chapter 4: Platform vs. Build It Yourself

What you'll learn: What the build vs. buy decision actually involves and why most teams underestimate the infrastructure complexity.

Key takeaways:

Teams think they are deciding whether to build a voice agent. They are actually deciding whether to build an orchestration layer.
The voice agent stack includes real-time audio infrastructure, speech recognition and synthesis, language model integration, multi-agent coordination, telephony and carrier integration, monitoring and observability, and versioning and rollback. Conversation design is the tip of the iceberg.
Hidden costs of DIY include latency optimization across the full stack, reliability engineering for real-time audio, carrier relationships and telephony complexity, and ongoing maintenance as models and providers change.
Full DIY appeals to teams that underestimate complexity. Managed services appeal to organizations that want outcomes without building the capabilities themselves. A platform with customization fits most enterprise needs.
The migration that follows a failed DIY attempt takes weeks. The DIY attempt takes months. Most teams end up on a platform eventually.

Daniel's team had built impressive systems before. Real-time fraud detection. A recommendation engine handling millions of requests per day. When the company decided to deploy voice agents for customer support, Daniel figured his team could handle it.

Six weeks later, they had a working demo. The agent answered questions about order status, understood natural language, and connected to backend systems. The CEO tried it. The CTO approved moving forward.

That's when the trouble started.

The demo ran on a single server with one call at a time. Production needed two hundred concurrent calls during peak hours. The demo had acceptable latency when everything worked. Production needed sub-second response times while juggling audio streams, transcription, language models, and text-to-speech simultaneously. The demo failed by crashing. Production needed to fail by routing callers to humans without anyone noticing.

Daniel's team spent three months on infrastructure they hadn't anticipated. Carrier relationships and SIP trunking. Audio codec optimization. Barge-in handling. State management across transferred calls. Monitoring for a system where "call dropped" could mean twenty different things.

By month six, they had burned through their entire year's budget and were running a small telephony company inside their organization. The voice agent itself represented maybe 20% of what they built. The other 80% was plumbing.

A year later, the company migrated to a platform. The migration took six weeks. They launched three new use cases in the following quarter.

The real question is orchestration

Most teams frame this as "should we build a voice agent?" The actual decision is whether to build an orchestration layer.

A single agent handling a single use case is a prototype. Enterprise scale looks different. Your scheduling agent needs capabilities different from those of your billing agent. Your outbound collections agent needs different guardrails than your inbound support agent. When a caller's needs span multiple domains, something has to route between specialists, maintain context across handoffs, and recover when components fail.

That something is orchestration. Routing calls to the right agent. Transferring between agents without losing context. Managing failover when providers go down. Coordinating the real-time pipeline of audio, transcription, reasoning, and synthesis within the latency budget that makes conversation feel natural.
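To make the routing and handoff idea concrete, here is a deliberately small sketch. It is not a production design and not any platform's API. The class names, the intent strings, and the "human" fallback label are all assumptions for illustration. The point it shows is that one context object follows the call across agents, and that a missing or failing agent results in a handoff rather than a dropped call.

```python
from dataclasses import dataclass, field


@dataclass
class CallContext:
    """State that travels with the call across handoffs."""
    caller_id: str
    facts: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


class Orchestrator:
    def __init__(self, agents, fallback="human"):
        self.agents = agents      # intent name -> callable taking a CallContext
        self.fallback = fallback  # where calls go when an agent cannot help

    def route(self, intent, ctx):
        """Run the agent for `intent`; hand off to the fallback on any failure."""
        handler = self.agents.get(intent)
        if handler is None:
            ctx.history.append((self.fallback, f"no agent for {intent}"))
            return self.fallback
        try:
            handler(ctx)
        except Exception as exc:
            ctx.history.append((self.fallback, f"{intent} agent failed: {exc}"))
            return self.fallback
        ctx.history.append((intent, "handled"))
        return intent


def scheduling_agent(ctx):
    ctx.facts["appointment"] = "tomorrow morning"


def billing_agent(ctx):
    # Reads what the scheduling agent learned, so the caller does not repeat it.
    ctx.facts["invoice_for"] = ctx.facts["appointment"]


def refund_agent(ctx):
    raise RuntimeError("backend timeout")


orchestrator = Orchestrator(
    {"scheduling": scheduling_agent, "billing": billing_agent, "refund": refund_agent}
)
ctx = CallContext(caller_id="caller-1")
first = orchestrator.route("scheduling", ctx)   # "scheduling"
second = orchestrator.route("billing", ctx)     # "billing"
third = orchestrator.route("refund", ctx)       # "human"
```

Real orchestration also has to do this inside the real-time audio pipeline and the latency budget, which is where most of the engineering effort goes.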

When Daniel's team built their agent, they thought they were building conversation logic. They ended up building orchestration infrastructure. Most of those eight months were spent on problems that had nothing to do with how the agent talked to customers.

What the stack actually involves

Teams consistently underestimate the scope.

Real-time audio infrastructure handles voice as a continuous stream, rather than discrete requests. Audio buffers, network jitter, stream synchronization, all processed in memory. A 500-millisecond hiccup that's invisible in a web application creates an awkward pause in conversation.

Speech recognition and synthesis involve multiple providers with different accuracy, latency, voice quality, and pricing. You'll want different providers for different scenarios. Your system abstracts across them and handles failover when one degrades.
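Abstracting across providers with failover can be sketched as trying each one in priority order. The provider names, the `ProviderError` type, and the function signature below are hypothetical. Real speech APIs differ, and streaming audio makes this considerably harder than a single function call.

```python
class ProviderError(Exception):
    """Raised when a speech provider fails or degrades (hypothetical type)."""


def transcribe_with_failover(audio, providers):
    """Try each (name, transcribe_fn) pair in order; return the first success."""
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in providers for illustration; real ones would call an external API.
def flaky_provider(audio):
    raise ProviderError("timeout")


def steady_provider(audio):
    return "i need to reschedule my appointment"


used, text = transcribe_with_failover(
    b"\x00\x01",
    [("primary", flaky_provider), ("backup", steady_provider)],
)
# used == "backup"
```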

Telephony and carrier integration is an entire domain in which most software teams lack experience. Phone numbers, SIP trunking, audio codecs, DTMF tones, call recording, carrier-specific behaviors. This is specialized knowledge that takes months to acquire.

Multi-agent state management gets complicated fast. Context needs to follow calls across transfers. History from earlier in the conversation needs to be available to the current agent. Versioning and rollback need to work without dropping active calls.

Monitoring and observability are harder for voice systems, which have more failure modes than typical software. Was it a transcription error? A model hallucination? A dropped audio stream? An integration timeout? You need to trace across every component.

Then there are the non-obvious problems that consume months. Barge-in handling when callers interrupt mid-sentence. Turn-taking detection to distinguish a pause from the end of a turn. Latency optimization, where shaving 50ms off each step separates natural from awkward. Concurrent call scaling, where handling two hundred calls with consistent latency requires infrastructure most teams have never built.

These problems are solved once by platform teams and reused by every customer. Solving them yourself means your team becomes an expert in telephony infrastructure rather than in voice agent design.

Where you land on the spectrum

The decision falls along a spectrum with three positions.

Full DIY. You assemble the entire stack. Speech-to-text, language model, text-to-speech, orchestration, telephony, monitoring, all of it. You own everything. Total cost of ownership consistently exceeds initial estimates because the problem is genuinely larger than it appears from the outside.

A platform with customization. The platform handles infrastructure while you focus on what agents do. You design conversations, build integrations, and define business logic. You customize heavily but don't rebuild the foundation. This is where most enterprises land.

Managed service. Someone else builds and operates the agents. You provide requirements and system access. You get working agents without building or configuring anything in depth.

When DIY makes sense

DIY is not always wrong. It makes sense in specific circumstances.

Your requirements fall outside the scope of any platform. Deep modifications to audio processing, transcription behavior, or real-time pipeline architecture that platforms cannot accommodate after genuine investigation, not assumption.

You have a significant existing voice infrastructure. Extending it for AI agents may cost less than adopting a platform. The marginal cost changes when you're not starting from scratch.

Regulatory constraints prevent using third-party platforms. Some industries limit which vendors can process customer data. If platforms cannot meet compliance requirements, DIY may be the only option.

Voice AI is core to your competitive advantage. If you want full control over the technology stack for strategic reasons, building in-house may make sense even when it costs more.

Be honest with yourself. Most teams that choose DIY do so because they underestimate complexity, not because their requirements demand it.

When the platform makes sense

For most enterprises entering voice AI, platforms make more sense for predictable reasons.

Time to value compresses. Eight months becomes eight weeks. If getting to production matters, platforms win. Operational burden shifts. Your engineers focus on agents and business logic instead of plumbing. Pre-built integrations accelerate common scenarios, including CRM, ticketing, scheduling, and payment connectors. Multi-agent orchestration comes built in. Provider flexibility lets you swap speech recognition, synthesis, and language models without rewriting your stack.

Some teams land on a hybrid. Platform for orchestration and telephony, custom components where they need control. This works when the boundary is clear. It doesn't work when teams build custom infrastructure because they think it will be easy, and discover it isn't. That's not a hybrid approach. That's Daniel's eight months.

Evaluating platforms

If you're evaluating platforms, focus on what matters.

Multi-agent orchestration is the core value. Can you deploy multiple specialized agents? Can calls transfer while preserving context? Can you define routing rules?

Provider flexibility protects against lock-in. Can you swap providers without code changes? Can you use different providers for different agents?

Latency under load determines conversation quality. Ask for percentiles under production conditions with concurrent calls, not demo performance.
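Percentiles matter because an average hides the slow calls. The sketch below computes nearest-rank percentiles over per-turn response times. The sample values are invented for illustration, and nearest-rank is one of several common percentile definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, measured under concurrent load.
latencies_ms = [620, 640, 650, 660, 680, 700, 720, 760, 900, 1800]

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 813.0
p50_ms = percentile(latencies_ms, 50)            # 680
p95_ms = percentile(latencies_ms, 95)            # 1800
```

In this made-up sample, the median looks comfortable while the 95th percentile is nearly two seconds. That tail is what callers notice, which is why the text says to ask for percentiles rather than demo performance.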

Observability determines operational sanity. When a call goes wrong, can you trace through transcription, reasoning, and synthesis? Can you replay calls?

Enterprise security and compliance determine viability. Where does data flow? What certifications exist? Get specific answers.

These criteria matter more than feature lists. A platform that is fast, flexible, and observable will serve you better than one with more features but worse fundamentals.

Daniel's team was fully capable of building what they built. But eight months and triple the budget later, they were building infrastructure when their goal was deploying agents. They were solving orchestration problems when their customers just wanted to check order status.

Don't spend eight months on someone else's problem.