Vapi raises $50M Series B to power the next generation of enterprise voice AI

Part 7

Operate

~30 min • 3 chapters

Chapter 25: Day-to-Day Operations

What you'll learn: The operational discipline required to run voice agents in production without silent degradation.

Key takeaways:

Voice agents need the same operational rigor as any production service. Daily checks, weekly reviews, monthly business reviews, runbooks, and versioning.
Daily automated checks catch immediate problems. Weekly human reviews catch quality drift. Monthly business reviews connect performance to outcomes.
Prompt changes should be treated like code deployments. Versioned, tested, gradually rolled out, and rollback-ready.
Every operational responsibility needs explicit ownership. Unowned metrics don't get watched.
The weekly review is the most important operational practice. It's the heartbeat of a program that improves instead of slowly failing.

Three months after launch, Marcus stopped watching the dashboard.

The Install Reminder Agent had been running smoothly. Completion rates were stable. Error rates were low. The team had moved on to building the next agent. The first agent was "done."

Nina noticed the problem during a routine call review. Install rates had dropped from 22% to 16% over two weeks. No alerts had fired because no one had configured alerts for business metrics. The dashboard clearly showed the decline, but no one had looked at it in three weeks.

Marcus investigated. The SMS gateway had added a new required field in an API update. Calls to the old endpoint returned 200 status codes but didn't actually send messages. The agent heard "success" from the API and confirmed delivery. Drivers waited for links that never arrived.

Thirty percent of install reminders failed silently over two weeks. The agent said, "I've sent you the link," with complete confidence, but the link wasn't being sent.

A weekly operational review that included tool-call success rates would have caught this in a matter of days. Marcus built the review process they should have had from the start.
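The silent failure came from trusting the HTTP status alone. A minimal sketch of a stricter success check follows, assuming a hypothetical gateway response shape. The `message_id` and `status` fields and the accepted status values are illustrative, not any real provider's API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolResult:
    ok: bool
    reason: str


def verify_sms_send(status_code: int, body: dict[str, Any]) -> ToolResult:
    """Treat a send as successful only if the response body confirms it.

    The field names are hypothetical; substitute whatever the gateway
    documents as proof that the message was accepted.
    """
    if status_code != 200:
        return ToolResult(False, f"http status {status_code}")
    if not body.get("message_id"):
        return ToolResult(False, "no message_id in response body")
    if body.get("status") not in {"queued", "sent", "delivered"}:
        return ToolResult(False, f"unexpected status {body.get('status')!r}")
    return ToolResult(True, "confirmed by gateway")


# A 200 response with an empty body is the silent failure from the story.
silent_failure = verify_sms_send(200, {})
confirmed = verify_sms_send(200, {"message_id": "abc123", "status": "queued"})
```

With a check like this, the agent only says "I've sent you the link" when `ok` is true, and every false result counts against the tool-call success rate.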

Daily, weekly, monthly

Marcus established three rhythms.

Daily automated checks ran every morning. Tool call success rates across all agents. Transcription confidence averages. Latency percentiles. Error rate trends. These checks ran automatically and posted results to a shared channel. If any metric crossed a threshold, the check tagged the relevant owner.

The daily checks caught immediate problems. An API outage at 3am showed up in the morning check as a spike in tool errors. A transcription provider degradation appeared as a confidence drop. The checks didn't require human attention on normal days, but surfaced issues quickly when something went wrong.
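A daily check of this kind can be sketched as a small table of thresholds with owners. The metric names, threshold values, and owner handles below are assumptions for illustration. Real values depend on the deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    metric: str
    threshold: float
    higher_is_worse: bool
    owner: str


# Illustrative thresholds and owners, not values from a real system.
CHECKS = [
    Check("tool_error_rate", 0.02, True, "@priya"),
    Check("p95_latency_ms", 1500.0, True, "@oncall"),
    Check("transcription_confidence", 0.85, False, "@oncall"),
]


def run_daily_checks(metrics: dict[str, float]) -> list[str]:
    """Return one message per breached or missing metric, tagging the owner."""
    messages = []
    for check in CHECKS:
        value = metrics.get(check.metric)
        if value is None:
            # A missing metric is itself a finding: silence is not health.
            messages.append(f"{check.owner} {check.metric}: no data")
            continue
        if check.higher_is_worse:
            breached = value > check.threshold
        else:
            breached = value < check.threshold
        if breached:
            messages.append(
                f"{check.owner} {check.metric}={value} crossed {check.threshold}"
            )
    return messages


# A tool error spike plus a metric that stopped reporting.
alerts = run_daily_checks({"tool_error_rate": 0.30, "p95_latency_ms": 900.0})
```

An empty list means a normal day, so the channel stays quiet unless something needs an owner's attention.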

Weekly human reviews happened every Thursday. Marcus and Nina sampled calls together. Ten random calls from the week. Five calls with unusually long handle times. Five calls that escalated to humans. They listened, scored against the quality rubric, and identified patterns.

The weekly review caught quality issues that automated checks missed. An agent that technically completed calls but rushed confirmations. A tool that succeeded but returned data that the agent misinterpreted. A new edge case that occurred three times this week and would happen more often at scale.

The weekly review also groomed the improvement backlog. Every issue identified became a ticket. Issues were prioritized by frequency and impact. The highest-priority items got fixed before the next review.

Monthly business reviews connected agent performance to business outcomes. Install completion rates. Cost per successful install. Comparison to human-handled benchmarks. Trend lines over the quarter.

The monthly review involved business stakeholders, not just the technical team. It answered the question leadership cared about: Is this investment working? The review surfaced strategic decisions. Should we expand to new use cases? Should we further optimize the current agent? Should we adjust targets?

Runbooks

When the SMS gateway incident happened, Marcus scrambled to figure out what to do. There was no playbook. He made decisions under pressure without guidance.

He built runbooks afterward.

The tool failure runbook specified what to do when a tool stopped working. Check the provider status page. Run a manual test against the API. Check recent changes to tool contracts or backend systems. If the failure is on the provider side, switch to the backup provider. If it's a contract mismatch, identify the change and deploy a fix.

The quality degradation runbook specified what to do when metrics dropped. Sample recent calls for the affected metric. Identify the common failure pattern. Check for recent prompt changes, provider updates, or backend changes. Isolate the cause before attempting fixes.

The escalation runbook specified who owned which decisions. Operations could disable a single agent autonomously. Disabling multiple agents required engineering approval. Rolling back to a previous version required a fifteen-minute assessment window. A complete shutdown required director approval.

Each runbook existed before the next incident, not in response to it. Marcus added to the runbooks whenever a new failure mode appeared. The runbooks captured organizational learning in actionable form.

Versioning

The SMS gateway incident taught Marcus another lesson. He didn't know what had changed because he wasn't tracking versions carefully.

He established versioning for everything that could change.

Prompt versions tracked every change to system prompts. Each version had a timestamp, an author, a summary of what changed, and a link to the test results that validated the change. Rolling back to a previous prompt version was a one-command operation.

Tool contract versions tracked changes to backend integrations. When the SMS gateway added a required field, the old contract version should have been marked deprecated, and the new contract version should have been validated before the change went live. The version history would have shown exactly when the mismatch began.

Knowledge base versions tracked changes to retrieved content for RAG-enabled agents. New documents, updated documents, deleted documents. If an agent suddenly started giving different answers, the version history showed what content had changed.

Agent configuration versions bundled all components together. This prompt version, with this tool contract version, with this knowledge base version. Rolling back an agent meant reverting to a known-good configuration bundle rather than trying to figure out which individual component caused the problem.
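One way to picture a configuration bundle is as an immutable record that pins each component version, with rollback defined as "return to the most recent bundle marked known-good." This is a sketch under assumptions: the class names, fields, and the `known_good` flag are hypothetical, not a description of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentConfig:
    version: int
    prompt_version: str
    tool_contract_version: str
    knowledge_base_version: str
    author: str
    what_changed: str
    created_at: datetime
    known_good: bool = False


class ConfigHistory:
    """Append-only history of configuration bundles (in memory for the sketch)."""

    def __init__(self) -> None:
        self._versions: list[AgentConfig] = []

    def publish(self, config: AgentConfig) -> None:
        self._versions.append(config)

    def current(self) -> AgentConfig:
        return self._versions[-1]

    def rollback_target(self) -> AgentConfig:
        """Most recent bundle marked known-good, excluding the current one."""
        for config in reversed(self._versions[:-1]):
            if config.known_good:
                return config
        raise LookupError("no known-good bundle to roll back to")


now = datetime.now(timezone.utc)
history = ConfigHistory()
history.publish(AgentConfig(1, "prompt-v12", "sms-v3", "kb-v7", "marcus",
                            "baseline after launch", now, known_good=True))
history.publish(AgentConfig(2, "prompt-v13", "sms-v3", "kb-v7", "marcus",
                            "added handling for drivers who report storage "
                            "issues, adjusted timeout from 5s to 8s", now))
target = history.rollback_target()
```

Because the bundle pins all three component versions together, rolling back never requires guessing which component caused the problem.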

Each version included "what changed" notes. Not just "updated prompt" but "added handling for drivers who report storage issues, adjusted timeout from 5s to 8s for slow network conditions." The notes made debugging faster.

Ownership

The SMS gateway incident happened partly because no one owned the monitoring.

Marcus established explicit ownership for every operational responsibility.

Daily check review was the responsibility of the on-call engineer. They reviewed the automated check results every morning and acknowledged or escalated issues.

The weekly call review was led by Nina. She scheduled the sessions, prepared the sample, facilitated the review, and documented the findings.

Tool health monitoring was owned by Priya. She maintained relationships with backend teams, tracked their change schedules, and ensured tool contracts remained synchronized.

Business metric reporting was owned by Marcus. He prepared the monthly review, maintained the dashboards, and escalated when metrics drifted from targets.

Ownership was written down and visible. When something needed attention, there was no ambiguity about who was responsible. Unowned metrics didn't get watched. The explicit assignments prevented gaps.

Prompt changes as deployments

Early on, Marcus treated prompt changes casually. A quick edit, a manual test, push to production. The feedback cycle was fast and informal.

After a prompt change caused unexpected escalation spikes, he formalized the process.

Prompt changes went through the same rigor as code deployments. Write the change. Run the unit test suite. Run the boundary test suite. Get a review from at least one other person. Deploy to a staging environment. Run end-to-end tests against staging. Deploy to production at 10% traffic. Monitor for an hour. Expand to 100% if metrics are stable.
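The 10% step can be sketched as a deterministic traffic split. The approach below, hashing a call identifier into one of 100 buckets, is one common technique and an assumption here. The text does not say how traffic was actually divided, and the function names are hypothetical.

```python
import hashlib


def in_rollout(call_id: str, percent: int) -> bool:
    """Deterministically assign a call to the new prompt version.

    Hashing the call ID keeps the assignment stable across retries, and
    raising `percent` only adds calls to the rollout, never removes them.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def pick_prompt_version(call_id: str, stable: str, candidate: str, percent: int) -> str:
    """Choose which prompt version should handle this call."""
    return candidate if in_rollout(call_id, percent) else stable


# Rough check that about a tenth of calls land in the rollout.
sample_ids = [f"call-{i}" for i in range(10_000)]
share = sum(in_rollout(cid, 10) for cid in sample_ids) / len(sample_ids)
```

Expanding to 100% is then a change to one number, and rolling back is setting it to 0.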

The process added overhead. A change that used to take ten minutes now took two hours. But the process caught issues before production. The escalation spike would have appeared in staging, not in front of real callers.

Marcus also established change windows. Prompt changes were deployed Tuesday through Thursday during business hours. Never on Friday before a weekend. Never during known high-volume periods. The windows ensured that someone was available to respond if a change caused problems.
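A change window is easy to enforce as a guard in the deployment script. In this sketch the 9:00 to 17:00 business hours and the `high_volume` flag are assumptions. The text specifies only the days and the general rule.

```python
from datetime import datetime, time

# Monday is 0 in datetime.weekday(), so Tuesday through Thursday is 1..3.
ALLOWED_WEEKDAYS = {1, 2, 3}
# Assumed business hours; the source does not specify them.
WINDOW_START = time(9, 0)
WINDOW_END = time(17, 0)


def in_change_window(now: datetime, high_volume: bool = False) -> bool:
    """True if a prompt change may be deployed at local time `now`."""
    if high_volume:
        return False
    if now.weekday() not in ALLOWED_WEEKDAYS:
        return False
    return WINDOW_START <= now.time() < WINDOW_END


tuesday_morning = in_change_window(datetime(2024, 1, 2, 10, 0))
friday_morning = in_change_window(datetime(2024, 1, 5, 10, 0))
```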

The agent that had degraded silently for two weeks never degraded silently again.

Marcus checked the install rate dashboard every Thursday before the weekly review. Three months after establishing the operational cadence, the rate held steady at 24%. Higher than before the SMS gateway incident. Higher than when no one was watching.

The weekly reviews had surfaced optimizations that the team missed during development. A confirmation phrasing that reduced confusion. A retry pattern that recovered from transient failures. Small improvements, found because someone was looking.

Nina once asked him why he still checked the dashboard manually when the alerts would catch any real problems. He pulled up the historical chart and pointed to the two-week gap where the line had dropped from 22% to 16%. No alert had fired. The dashboard clearly showed the decline. No one had looked.

He looked now.

Chapter 26: Monitoring and Alerting

What you'll learn: How to monitor voice agents at three layers so you catch problems before customers notice.

Key takeaways:

Monitoring operates on three layers. System health (latency, uptime, errors), leading indicators (transfer trends, confidence drops), and business outcomes (containment, conversion, satisfaction).
Most teams only build system health monitoring. That tells you the machine is running, not that it's achieving anything.
Leading indicators are the early warning system. Monitor transfer rate trends, tool error changes, conversation length shifts, and transcription confidence drops.
Different goals need different dashboards. Cost dashboards emphasize containment. Revenue dashboards emphasize conversion. CX dashboards emphasize satisfaction.
Structured event logging across the full call lifecycle enables replay and root cause analysis when things go wrong.

The dashboard showed green across the board. Uptime at 99.9%. Latency averaging 800 milliseconds. Tool error rate under 1%. The insurance brokerage's qualification agent was running perfectly by every measure the engineering team tracked.

But the transfer-to-close rate had dropped by 8 points over the past month. Qualified prospects were reaching the sales team but not converting. Something was wrong that the green dashboard couldn't see.

Diane, who'd designed the pilot scorecard, investigated. She found the root cause in the prompt history. A compliance update three weeks earlier had modified the language the agent used during qualification. The new phrasing was legally correct but clinically cold. Prospects who'd felt engaged during qualification now felt processed. They disengaged before the warm transfer could connect them with a sales agent.

The system was healthy. The business outcome was degrading. The monitoring couldn't see the difference.

Diane rebuilt the monitoring to watch what actually mattered.

Three layers

Monitoring operates on three distinct layers. Most teams only build the bottom one.

System health tracks whether the infrastructure is working. Latency percentiles, uptime, error rates, and concurrency utilization. These metrics tell you the machinery is running. They don't tell you it's achieving anything.

Leading indicators track metrics that predict changes in outcomes before they appear in business results. Transfer rate trends, tool error rate changes, conversation length shifts, transcription confidence drops, escalation rate changes. These metrics serve as an early warning system. A leading indicator moves before the business outcome moves.

Business outcomes track whether the agent is achieving its goal. For cost-focused agents, the containment rate. For CX-focused agents, satisfaction scores. For revenue-focused agents, the conversion rate. These metrics answer the question leadership actually cares about.

Diane mapped the qualification agent to all three layers.

System health included API latency, transcription service uptime, LLM inference time, and tool call success rates. These metrics caught outages and performance degradation.

Leading indicators included average conversation length, question rephrasing rate, escalation rate by stage, and drop-off points in the qualification flow. These metrics caught conversation quality problems before they affected business results.

Business outcomes included transfer-to-close rate, the primary metric that justified the agent's existence. This metric moved slowly but definitively showed whether the program was working.

The compliance update that damaged conversations never affected system health. Latency was unchanged. Error rates were unchanged. But conversation length dropped by 30 seconds, and the escalation rate at the warm transfer stage increased. These leading indicators would have surfaced the problem weeks before the transfer-to-close rate showed it.
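A leading-indicator check of this kind can be sketched as a comparison between a baseline period and the most recent days. The z-score screen below is one simple approach, chosen for illustration. The threshold and the sample numbers are made up.

```python
from statistics import mean, stdev


def drifted(baseline: list[float], recent: list[float], z_threshold: float = 2.0) -> bool:
    """Flag when the recent mean sits far from the baseline mean.

    Distance is measured in standard errors of the recent-sample mean,
    using the baseline's standard deviation. This is a rough screen,
    not a formal hypothesis test.
    """
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline points and one recent point")
    spread = stdev(baseline)
    if spread == 0:
        return mean(recent) != mean(baseline)
    standard_error = spread / len(recent) ** 0.5
    z = (mean(recent) - mean(baseline)) / standard_error
    return abs(z) >= z_threshold


# Daily average conversation length in seconds (made-up numbers).
baseline_days = [242.0, 238.0, 245.0, 240.0, 236.0, 244.0, 239.0]
after_update = [212.0, 208.0, 215.0]
flagged = drifted(baseline_days, after_update)
```

A drop of roughly 30 seconds stands far outside the baseline's day-to-day variation, so the check flags it within days instead of waiting for the monthly outcome metric to move.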

Different dashboards for different goals

Different goals require different dashboards.

Cost-focused dashboards emphasize containment. What percentage of calls resolve without human involvement? What's the cost per contained call versus the cost per escalated call? Where do escalations cluster, and are those escalation triggers appropriate?

CX-focused dashboards emphasize experience quality. What are the satisfaction scores for agent-handled calls versus human-handled calls? What's the distribution of conversation lengths, and are there outliers indicating confusion or frustration? What's the callback rate, and what does it suggest about unresolved issues?

Revenue-focused dashboards emphasize conversion. What's the conversion rate at each stage of the funnel? Where do prospects drop off? What conversation patterns predict successful conversions?

Diane built three dashboard views for the qualification agent. The engineering team used the system health view for daily operations. The QA team used the leading indicator view for quality monitoring. The business team used the outcome view for monthly reviews.

Each view showed relevant metrics without cluttering with irrelevant ones. The engineering team didn't need to see the transfer-to-close rate every day. The business team didn't need to see latency percentiles. Focused dashboards made patterns visible instead of drowning them in data.

Recording what happened

When something went wrong, Diane needed to understand exactly what happened on a specific call. This required structured logging across the full call lifecycle.

Call start logged the timestamp, caller ID, entry point, and any context passed from previous interactions.

Identity verification logged the verification method, outcome, and confidence level.

Intent discovery logged the discovered intent, confidence, and any disambiguation that occurred.

Tool calls logged which tool was called, with what parameters, what response was received, and how long it took.

Agent responses logged what the agent said, whether it was interrupted, and how the caller responded.

Escalations logged the trigger, the destination, and the context passed to the receiving handler.

Call end logged the outcome, duration, and any scheduled post-call actions.

Each event had a timestamp and a correlation ID that linked all events for a single call. Diane could query for a specific call and reconstruct its complete timeline. When a prospect complained about a bad experience, she could trace exactly what happened turn by turn.
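The correlation-ID pattern can be sketched with an in-memory log. The event type names and payload fields here are illustrative, and a real system would write to a log store rather than a Python list.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any


@dataclass
class CallEvent:
    correlation_id: str
    event_type: str  # e.g. "call_start", "tool_call", "escalation", "call_end"
    timestamp: str   # ISO 8601 string
    payload: dict[str, Any] = field(default_factory=dict)


class EventLog:
    """In-memory stand-in for a real log store."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def emit(self, correlation_id: str, event_type: str,
             timestamp: datetime, **payload: Any) -> None:
        event = CallEvent(correlation_id, event_type, timestamp.isoformat(), payload)
        self._lines.append(json.dumps(asdict(event)))

    def timeline(self, correlation_id: str) -> list[dict[str, Any]]:
        """All events for one call, ordered by timestamp."""
        events = [json.loads(line) for line in self._lines]
        matching = [e for e in events if e["correlation_id"] == correlation_id]
        return sorted(matching, key=lambda e: datetime.fromisoformat(e["timestamp"]))


log = EventLog()
t0 = datetime(2024, 1, 2, 10, 0, 0, tzinfo=timezone.utc)
# Events may arrive out of order and interleaved with other calls.
log.emit("call-42", "tool_call", t0 + timedelta(seconds=30), tool="send_sms", duration_ms=420)
log.emit("call-42", "call_start", t0, entry_point="outbound")
log.emit("call-99", "call_start", t0, entry_point="inbound")
log.emit("call-42", "call_end", t0 + timedelta(seconds=95), outcome="completed")
timeline = [e["event_type"] for e in log.timeline("call-42")]
```

Writing each event as one JSON line keeps the log queryable by any field, which is what makes "find every call that escalated at the warm transfer stage" a query rather than a guess.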

The logging paid for itself during root cause analysis. Instead of guessing what might have caused a problem, Diane queried for calls with specific characteristics and identified patterns. Calls that escalated at the warm transfer stage showed a common sequence in the logs. The compliance language appeared, the caller's response time increased, and they asked to speak with a person. The pattern was visible in the data.

What's worth a page at 3am

Not every metric deviation needed a page at 3am. Diane designed alerting thresholds that matched urgency to response requirements.

Pages went to on-call engineers for issues requiring immediate response. System outages, tool failure rates exceeding 5%, and latency sustained above 2 seconds. These alerts indicated that something was broken and that calls were failing.

Slack notifications went to the team channel for issues requiring same-day attention. Error rates trending upward over an hour, transcription confidence dropping below its threshold, and escalation rate spikes for a specific trigger. These alerts indicated that something was degrading and should be investigated before it caused an outage.

Weekly review items captured issues that didn't require immediate action but needed tracking. Slight metric drifts, new error codes appearing at low frequency, and conversation patterns that differed from historical baselines. These items fed the weekly operational review.
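The three tiers can be sketched as a routing function that returns the most urgent destination that applies. The 5% tool failure and 2 second latency thresholds come from the text. The transcription confidence floor, the field names, and the `Snapshot` shape are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    SLACK = "slack"
    WEEKLY_REVIEW = "weekly_review"
    NONE = "none"


@dataclass
class Snapshot:
    outage: bool
    tool_failure_rate: float        # fraction of tool calls failing
    sustained_latency_s: float      # latency held over the evaluation window
    error_rate_trending_up: bool    # upward trend over the past hour
    transcription_confidence: float
    minor_drift: bool               # slight deviation from baseline


# Paging thresholds come from the text; the confidence floor is an assumption.
TOOL_FAILURE_PAGE = 0.05
LATENCY_PAGE_S = 2.0
CONFIDENCE_FLOOR = 0.85


def route_alert(s: Snapshot) -> Route:
    """Return the most urgent route that applies to this snapshot."""
    if (s.outage
            or s.tool_failure_rate > TOOL_FAILURE_PAGE
            or s.sustained_latency_s > LATENCY_PAGE_S):
        return Route.PAGE
    if s.error_rate_trending_up or s.transcription_confidence < CONFIDENCE_FLOOR:
        return Route.SLACK
    if s.minor_drift:
        return Route.WEEKLY_REVIEW
    return Route.NONE


healthy = Snapshot(False, 0.01, 0.8, False, 0.93, False)
tool_failing = Snapshot(False, 0.08, 0.8, False, 0.93, False)
```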

Each alert had a named owner and a runbook entry. The page said who to contact, what to check first, and what actions to take. Alerts without owners became noise. Alerts with clear ownership got resolved.

Diane also implemented alert fatigue prevention. An alert that fired constantly was worse than no alert. She tuned thresholds to minimize false positives and grouped related alerts to prevent notification storms during real incidents.

How bad is it

When something went wrong, Diane needed to know how wrong it was.

She defined severity levels with clear criteria.

Severity 1 meant the agent was completely unavailable or causing harm. All calls failing, incorrect information being provided consistently, or compliance violations occurring. Response time was immediate. All hands until resolved.

Severity 2 meant significant degradation affecting a material percentage of calls. Tool failures affecting 20% of calls, latency causing caller drop-offs, or accuracy problems in a specific conversation path. Response time was within an hour. A dedicated engineer until resolved.

Severity 3 meant minor issues affecting a small percentage of calls or causing inconvenience but not failure. Occasional tool timeouts, slight quality degradation, or a non-critical feature being unavailable. Response time was within one business day.

Severity 4 meant issues identified but not affecting current operations. Bugs in non-production code, documentation gaps, and improvement opportunities. These fed the backlog for future sprints.

The classification drove resource allocation. Severity 1 incidents got everyone's attention. Severity 4 issues got scheduled for convenient times. Without classification, every issue competed for attention equally, so critical issues sometimes waited while minor ones got fixed.
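The severity levels translate naturally into a response-target table and a triage order. In this sketch, "one business day" is approximated as 24 hours, and the `Incident` shape is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response targets from the text. "One business day" is approximated as
# 24 hours here; None means the item goes to the backlog.
RESPONSE_TARGET: dict[Severity, Optional[timedelta]] = {
    Severity.SEV1: timedelta(0),
    Severity.SEV2: timedelta(hours=1),
    Severity.SEV3: timedelta(days=1),
    Severity.SEV4: None,
}


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Severity
    opened_at: datetime


def respond_by(incident: Incident) -> Optional[datetime]:
    """Deadline for first response, or None for backlog items."""
    target = RESPONSE_TARGET[incident.severity]
    return None if target is None else incident.opened_at + target


def triage(incidents: list[Incident]) -> list[Incident]:
    """Most severe first; within a severity, oldest first."""
    return sorted(incidents, key=lambda i: (i.severity, i.opened_at))


opened = datetime(2024, 1, 2, 9, 0)
queue = triage([
    Incident("documentation gap", Severity.SEV4, opened),
    Incident("tool failures on 20% of calls", Severity.SEV2, opened + timedelta(minutes=5)),
    Incident("occasional tool timeouts", Severity.SEV3, opened),
])
```

Sorting by severity before age is what keeps a critical issue from waiting behind an older minor one.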

The compliance update that had triggered the eight-point drop happened on a Tuesday. Under the new monitoring, a drop in conversation length was observed on Wednesday morning. The increase in the escalation rate showed up on Thursday. By Friday, Diane had identified the root cause and scheduled the fix.

One week from introduction to correction. Not one month.

The compliance language got revised. Still legally correct. No longer clinically cold. Transfer-to-close rate recovered within two weeks.

The green dashboard still showed green. Uptime still read 99.9%. But now there were two other dashboards open beside it. One showed leading indicators. One showed business outcomes. The infrastructure running wasn't the same as the business working. Diane watched both.

Chapter 27: Quality Assurance

What you'll learn: How to run ongoing QA as a continuous practice that catches issues metrics miss.

Key takeaways:

QA is a continuous loop, not a pre-launch gate. Voice agent quality degrades as models update, caller behavior shifts, and prompts accumulate edits.
Sample strategically. Random calls for representative quality, plus targeted sampling of edge cases, long calls, escalations, and low-confidence transcripts.
Score on five dimensions. Task completion, policy compliance, conversation naturalness, escalation appropriateness, and persona consistency.
Human reviewers catch what automation misses. Pacing mismatches, tone problems, comprehension gaps, and awkward recoveries.
Every QA finding should map to an action. Prompt change, flow change, new test case, or explicitly accepted risk.

The metrics looked fine. The healthcare provider's scheduling agent successfully booked appointments. Satisfaction scores on post-call surveys were acceptable. Task completion rate exceeded the target.

But no-show rates for agent-booked appointments were 15% higher than human-booked ones. Patients were confirming appointments and then not showing up. The numbers said the agent was working. The operational reality said something was wrong.

Dr. Reyes, the clinic administrator who'd championed the deployment, started listening to recordings herself. She heard the pattern within the first hour.

The agent was rushing elderly patients through the confirmation step. It spoke quickly, confirmed the date and time in one breath, and moved on before patients had processed the information. Patients said "yes" because they didn't want to slow down the call, not because they'd understood the details.

Human schedulers paced themselves. They sensed when a caller needed more time. They offered to repeat the information. They confirmed comprehension, not just verbal assent. The agent confirmed the words without confirming the understanding.

This was the quality gap that lived between "task completed" and "task completed well."

Continuous QA

Dr. Reyes realized that QA couldn't be a pre-launch gate. It had to be an ongoing practice.

She established a QA sampling routine. Every week, reviewers listened to a sample of calls and scored them against a rubric. The sample wasn't purely random. It was stratified to catch different kinds of issues.

Random sampling provided a representative view of overall quality. Twenty random calls per week gave a rough but unbiased picture of typical performance.

Targeted sampling focused on high-risk call types. Calls with elderly patients, based on caller history. Calls exceeding average length, suggesting confusion or difficulty. Calls that triggered escalation, to verify the handoff was appropriate. Calls with low transcription confidence, where audio quality might have caused errors.

Incident-triggered sampling investigated specific problems. When no-show rates spiked for a particular appointment type, reviewers pulled those specific calls. When a patient complained, they reviewed that call and similar ones.

The combined sampling caught issues that pure random sampling would miss. The elderly pacing problem was observed in targeted samples of longer calls involving older patients. It wouldn't have surfaced quickly in a random sample because it affected only a subset of interactions.
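The combined sample can be sketched as a random draw plus a few targeted strata. The `Call` fields, the stratum sizes, and the confidence floor below are assumptions. The text specifies only twenty random calls and the kinds of targeted strata.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    call_id: str
    duration_s: float
    escalated: bool
    transcription_confidence: float


def build_weekly_sample(calls: list[Call], rng: random.Random,
                        n_random: int = 20, n_per_stratum: int = 5,
                        confidence_floor: float = 0.8) -> list[tuple[Call, str]]:
    """Combine a random sample with targeted strata, without duplicates."""
    if not calls:
        return []
    chosen: dict[str, tuple[Call, str]] = {}

    def take(candidates: list[Call], k: int, label: str) -> None:
        pool = [c for c in candidates if c.call_id not in chosen]
        for call in rng.sample(pool, min(k, len(pool))):
            chosen[call.call_id] = (call, label)

    mean_duration = sum(c.duration_s for c in calls) / len(calls)
    # Draw the random sample first, from all calls, so it stays unbiased.
    take(calls, n_random, "random")
    take([c for c in calls if c.escalated], n_per_stratum, "escalated")
    take([c for c in calls if c.transcription_confidence < confidence_floor],
         n_per_stratum, "low_confidence")
    take([c for c in calls if c.duration_s > mean_duration], n_per_stratum, "long")
    return list(chosen.values())


# Synthetic week of calls for illustration.
week = [Call(f"c{i}", 60.0 + i, i % 10 == 0, 0.7 if i % 7 == 0 else 0.95)
        for i in range(200)]
sample = build_weekly_sample(week, random.Random(7))
```

Keeping the stratum label with each call lets reviewers report random-sample scores separately, so targeted calls do not drag down the estimate of typical quality.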

The scoring rubric

Dr. Reyes built a rubric with five dimensions. Each dimension had specific criteria that reviewers could apply consistently.

Task completion asked whether the primary goal was achieved. Was the appointment booked correctly? Was all the required information collected? Was the confirmation sent? This was the baseline requirement, but not sufficient alone.

Policy compliance asked whether the required procedures were followed. Was identity verified appropriately? Were disclosures made where required? Were the scope boundaries respected? Compliance failures could be worse than task failures.

Conversation naturalness asked whether the interaction felt human. Did the agent's pacing match the caller's needs? Did responses flow naturally? Were transitions smooth? This was where the elderly pacing issue showed up.

Escalation appropriateness asked whether transfers were handled correctly. Did the agent escalate when it should have? Did it avoid unnecessary escalation? Was context preserved during the handoff?

Persona consistency asked whether the agent maintained its intended voice. Did warmth stay consistent throughout? Did stress responses stay in character? Did the agent sound like the same entity from start to finish?

Each dimension had a 1-5 scale with specific anchors. A 3 on naturalness meant "acceptable but room for improvement." A 5 meant "indistinguishable from a skilled human." A 1 meant "notably artificial or uncomfortable."

Reviewers scored independently before discussing. Calibration sessions aligned scoring standards. When reviewers disagreed significantly, they discussed the specific call to understand the different interpretations.
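A score record for the rubric might look like the sketch below. The five field names mirror the dimensions above. The validation rule and the `weakest` helper are illustrative additions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    task_completion: int
    policy_compliance: int
    conversation_naturalness: int
    escalation_appropriateness: int
    persona_consistency: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-5 scale at creation time.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be an integer from 1 to 5, got {value!r}")

    def weakest(self) -> list[str]:
        """Dimensions with the lowest score on this call."""
        lowest = min(getattr(self, f.name) for f in fields(self))
        return [f.name for f in fields(self) if getattr(self, f.name) == lowest]


# A call that completed the booking but rushed the confirmation.
rushed_call = RubricScore(5, 5, 2, 4, 4)
weak_spots = rushed_call.weakest()
```

Scoring each dimension separately is what lets a call with perfect task completion still surface as a naturalness problem.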

Finding what metrics miss

The elderly pacing issue exemplified what automated metrics couldn't catch.

Task completion rate was 100% for these calls. The appointments got booked. Satisfaction scores were acceptable because elderly patients gave polite ratings regardless of experience. No automated metric flagged a problem.

Human reviewers caught it because they could hear the mismatch between the agent's pace and the caller's comprehension. They noticed patients saying "okay" with hesitation. They noticed confirmations happening before patients had time to write down the appointment details.

Dr. Reyes catalogued the categories of issues that required human review.

Pacing mismatches where the agent moved faster or slower than the caller needed.

Tone inappropriate to the context, where the agent's emotional register didn't match the situation.

Comprehension gaps where the caller clearly didn't understand, but the agent proceeded anyway.

Missed cues where the caller signaled something the agent should have addressed, but didn't.

Awkward recoveries where the agent handled an interruption or error in a way that technically worked but felt unnatural.

Automated tests could validate logic. Human reviewers validated experience.

The feedback loop

QA findings needed to become improvements. Dr. Reyes built a feedback loop that connected review findings to concrete changes.

Every QA finding mapped to one of four actions.

A prompt change addressed issues with how the agent was instructed. The elderly pacing issue led to a prompt change that added a deliberate pause after confirmations and an offer to repeat details on calls exceeding a certain duration.

A flow change addressed issues in the conversation structure. A finding that patients got confused about the appointment location led to restructuring the confirmation to state the location before the time.

A new test case captured an issue that should be caught automatically in the future. The pacing issue became a test case that validated appropriate pausing behavior with slower simulated callers.

Accepted risk acknowledged issues that weren't worth fixing. A finding that some callers found the agent's voice slightly artificial wasn't actionable without changing the TTS provider, which would have other trade-offs.

Each week's QA review produced a prioritized list of changes. The highest-impact changes were implemented before the next review. The feedback loop ran continuously, not just during launch preparation.

Getting reviewers to agree

Different reviewers applied the rubric differently. One reviewer's 3 was another's 4. Without calibration, the scores weren't comparable.

Dr. Reyes ran monthly calibration sessions. Reviewers independently scored the same set of five calls. They compared scores and discussed discrepancies. Where interpretations differed, they refined the rubric criteria until agreement improved.

Calibration revealed which rubric dimensions were clear and which were ambiguous. Task completion was easy to agree on. Naturalness was harder. The calibration discussions produced more specific guidance for the subjective dimensions.

Dr. Reyes tracked inter-rater reliability scores. When agreement dropped below a set threshold, she scheduled additional calibration. Consistent rubric application produced consistent quality signals.
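The text does not say which reliability statistic was used. One common choice for two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch with made-up scores:

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same calls."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of calls")
    n = len(rater_a)
    # Fraction of calls where the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave one identical score to every call
    return (observed - expected) / (1.0 - expected)


# Naturalness scores from two reviewers on the same ten calls (made up).
reviewer_1 = [3, 4, 4, 5, 2, 3, 4, 5, 3, 4]
reviewer_2 = [3, 4, 3, 5, 2, 4, 4, 5, 3, 4]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Unweighted kappa treats a 3-versus-4 disagreement the same as a 1-versus-5 disagreement. For an ordinal 1-5 scale, a weighted variant that penalizes larger gaps more heavily may fit better.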

Quality over time

Individual call scores mattered less than trends over time.

Dr. Reyes built dashboards that tracked average scores by dimension, week over week. A dropping naturalness score signaled accumulating prompt changes that had degraded conversation flow. A rising escalation appropriateness score validated recent improvements to handoff logic.

She tracked score distributions, not just averages. An average of 4 could mean all calls scored 4, or half scored 5 and half scored 3. The distribution showed whether quality was consistent or variable.
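The difference between an average and a distribution is easy to show with two made-up weeks of scores that share the same mean:

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(scores: list[int]) -> dict:
    """Mean, spread, and histogram for one dimension's weekly scores."""
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),
        "histogram": dict(sorted(Counter(scores).items())),
    }


consistent_week = summarize([4, 4, 4, 4, 4, 4])
split_week = summarize([5, 3, 5, 3, 5, 3])
```

Both weeks average 4, but the first has zero spread and the second has half its calls at 3. Only the histogram or the spread reveals that quality in the second week was variable.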

She tracked common failure categories. What percentage of QA findings were pacing issues, compliance issues, or escalation issues? Shifts in the category mix revealed where to focus improvement efforts.

The trends turned QA from a spot-check into an operational signal. Quality wasn't a gate to pass. It was a metric to continuously improve.

The pacing issue that caused elevated no-show rates was fixed within two weeks of identification. The prompt adjustment added deliberate pauses after confirmations. For calls exceeding a certain duration, it offered to repeat details. For elderly-flagged callers, it asked directly: "Would you like me to go through that again?"

No-show rates for agent-booked appointments dropped to match those for human-booked appointments within a month.

Dr. Reyes still listened to recordings herself, most weeks. Not because the QA team couldn't handle it. Because the gap between tasks completed and tasks completed well was audible in ways that dashboards couldn't show. The hesitation before a patient said "okay." The slight confusion in a voice that politeness kept from complaining.

The metrics had said the agent was working. Her ears had told her it wasn't. She trusted her ears now, and she'd built a team that trusted theirs.