Vapi raises $50M Series B to power the next generation of enterprise voice AI

Part 8

Improve

~19 min • 2 chapters

Chapter 28: Conversation Analytics

What you'll learn: How to mine conversation data to find improvement opportunities that intuition would miss.

Key takeaways:

Conversation data is the highest-signal source for voice agent improvement, richer than surveys, business metrics, or system logs.
Different goals require different analytics. Cost teams analyze handle time drivers. CX teams analyze frustration signals. Revenue teams analyze conversion drop-off points.
Find divergence points, the conversation turns where similar calls start producing different outcomes. That turn is your optimization target.
Common issue signatures include escalation spikes, tool error clusters, false successes, and drop-off points. Each has a standard investigation path.
Prioritize improvements by frequency, impact, and fixability. Fix what matters most and can be fixed fastest.

The staffing marketplace processed tens of thousands of screening calls daily. Each call generated a transcript, tool call logs, outcome labels, and timing data. The data accumulated faster than anyone could read it.

Tomás, who'd built the test pyramid, faced a new challenge. The test suite caught regressions. It didn't identify improvement opportunities. For that, he needed to find patterns in production conversations that revealed what was working and what wasn't.

He started by looking at completion rates by question order. The screening flow asked candidates about role preference, location, and availability. Different prompt versions asked these questions in different sequences.

The data showed something unexpected. Candidates who were asked about availability third, after role and location, had a 23% higher completion rate than candidates asked about availability first. The sequence mattered more than the questions themselves.

Tomás restructured the conversation flow based on that insight. Screening throughput improved 15%. One pattern, found in conversation data, changed the business outcome.

Data sources

Conversation analytics drew from multiple sources. Each source revealed different aspects of agent performance.

Transcripts showed what was said. The exact words the caller used, the exact words the agent used, and the sequence of exchanges. Transcripts were searchable and could be analyzed for patterns in language, phrasing, and topic progression.

Tool call logs documented the actions taken. Which tools were called, with what parameters, and what responses were received. Tool logs revealed where the agent succeeded and failed at executing tasks.

Event timelines showed when things happened. Timestamps for each turn, each tool call, each state transition. Timelines revealed pacing issues, delays, and instances where callers waited too long.

Outcome labels showed how calls ended. Successful completion, escalation, caller abandonment, or error. Outcome labels enabled segmentation: what distinguished calls that succeeded from calls that failed?

Caller metadata added context. Caller history, demographic indicators where available, time of day, and entry point. Metadata enabled analysis of whether certain caller segments had different experiences.

Tomás built pipelines that combined these sources. A single call could be viewed as a transcript with embedded timestamps, tool calls shown at the points they occurred, and an outcome labeled at the end. Analysis could slice across any combination of features.
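As a rough illustration of that combined view, the sketch below merges transcript turns and tool calls into one time-ordered list with the outcome attached. The `Event` and `CallView` types, the field names, and the sample call are all hypothetical; the source does not describe the actual pipeline schema.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    t: float      # seconds from call start
    kind: str     # "turn" or "tool_call"
    detail: str

@dataclass
class CallView:
    call_id: str
    outcome: str  # outcome label, e.g. "completed" or "abandoned"
    events: list[Event] = field(default_factory=list)

def build_call_view(call_id, turns, tool_calls, outcome):
    """Merge transcript turns and tool calls into one time-ordered view."""
    events = [Event(t, "turn", f"{speaker}: {text}") for t, speaker, text in turns]
    events += [Event(t, "tool_call", f"{name} -> {status}") for t, name, status in tool_calls]
    events.sort(key=lambda e: e.t)  # tool calls land where they occurred
    return CallView(call_id, outcome, events)

# Hypothetical sample call
view = build_call_view(
    "call-001",
    turns=[
        (0.0, "agent", "What role are you looking for?"),
        (4.2, "caller", "Warehouse associate."),
        (9.8, "agent", "Got it. Which location works best?"),
    ],
    tool_calls=[(6.1, "lookup_roles", "ok")],
    outcome="completed",
)

for e in view.events:
    print(f"{e.t:5.1f}s  {e.kind:9}  {e.detail}")
print("outcome:", view.outcome)
```

Keeping every source on a shared clock is what lets later analysis slice by turn, by tool call, or by outcome without re-joining the data each time.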

Different goals, different questions

Different goals required different analytical approaches.

Cost-focused analytics asked: where are calls taking too long, and why? Tomás segmented calls by handle time and compared long calls to short calls. Long calls showed more question repetition, more disambiguation attempts, and more tool retries. Each finding pointed to a moment in the conversation that could be optimized.

CX-focused analytics asked: where are callers getting frustrated, and what causes drop-off? Tomás identified calls where callers abandoned before completion. He mapped the conversation turn where abandonment happened. Clusters appeared: callers who gave up during identity verification, callers who left when asked for availability, callers who hung up after a tool error. Each cluster was a different problem with a different fix.

Revenue-focused analytics asked: where are callers disengaging before conversion? For outbound qualification calls, Tomás mapped the conversation path for calls that converted versus calls that didn't. Converted calls showed specific question patterns, confirmation sequences, and phrasing. Non-converted calls showed different patterns. The differences became optimization targets.

The analytical framework matched the goal. A cost team would prioritize handle time drivers. A CX team would prioritize signals of frustration. A revenue team would prioritize conversion predictors.

Where calls start to differ

The most valuable insights came from identifying moments in conversations where similar calls produced different outcomes.

Tomás developed a divergence analysis method. Take two populations of calls: successful and unsuccessful. Find the point in the conversation where they start to differ. That turn is the divergence point.

For the availability question ordering, the divergence appeared at the first question. Calls that opened with availability had higher early abandonment rates. Calls that opened with role showed higher engagement. The divergence point was turn one.

For the escalation rate, the divergence appeared at error recovery. When a tool failed, and the agent handled it gracefully, callers stayed. When the agent fumbled the recovery, callers requested a transfer. The divergence point was the error response.

For the conversion rate, the divergence appeared in the summary. Calls in which the agent summarized qualifications before transfer had higher downstream conversion rates. Calls that transferred without a summary had lower conversion. The divergence point was the pre-transfer moment.

Each divergence point was an optimization opportunity. Fix the early question, and abandonment drops. Improve error recovery, and escalation drops. Add a summary, and conversion rises.
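One way to make the divergence analysis concrete is sketched below. It assumes each call has been reduced to a list of per-turn labels (for example, which question the agent asked), and it uses total variation distance with an arbitrary threshold to decide when the two populations "start to differ". The source does not say which statistic Tomás used, so the metric, the threshold, and the sample data are assumptions.

```python
from collections import Counter

def label_shares(calls, turn):
    """Share of each label at a turn index, over the calls that reached that turn."""
    labels = [c[turn] for c in calls if len(c) > turn]
    total = len(labels)
    return {k: v / total for k, v in Counter(labels).items()} if total else {}

def divergence_point(successes, failures, threshold=0.3):
    """First turn where the label distributions of the two populations differ
    by more than `threshold` (total variation distance), or None."""
    max_turns = max(len(c) for c in successes + failures)
    for turn in range(max_turns):
        p = label_shares(successes, turn)
        q = label_shares(failures, turn)
        if not p or not q:
            continue  # one population has no calls this long
        tvd = 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))
        if tvd > threshold:
            return turn, tvd
    return None

# Hypothetical populations: each call is a list of turn labels
successes = [["greeting", "ask_role", "ask_location", "ask_availability"]] * 3
failures = [
    ["greeting", "ask_availability"],
    ["greeting", "ask_availability"],
    ["greeting", "ask_role", "ask_location"],
]

turn, distance = divergence_point(successes, failures)
print(f"divergence at turn index {turn} (distance {distance:.2f})")
```

In this toy data both populations share the greeting, then split on which question comes next, so the function reports the turn after the greeting.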

Patterns that signal problems

Tomás catalogued the data patterns that indicated common problems.

Escalation spikes appeared as sudden increases in transfer rate. The signature was a cluster of calls escalating at the same conversation turn. Investigation of those turns revealed the trigger. A new edge case the agent couldn't handle. A prompt change that made the agent overly cautious. A backend change that caused tool failures.

Tool error clusters appeared as repeated failures with similar inputs. The signature was tool calls failing for specific parameter patterns. Investigation revealed the API contract mismatch. A field that had become required. A validation that had changed. A rate limit being hit.

False successes appeared as mismatches between agent-reported outcomes and backend confirmation. The signature was the agent claiming success while the backend showed no record of it. Investigation revealed the tool-first truth violation. The agent had confirmed the outcome to the caller before the tool confirmed it.

Drop-off points appeared as clusters of caller abandonment at specific conversation turns. The signature was a turn where the abandonment rate spiked compared to adjacent turns. The investigation revealed the problematic moment. A confusing question. An uncomfortable disclosure. A delay that felt like a hang.

Each signature pattern had a standard investigation path. Tomás trained the team to recognize signatures and follow the corresponding analysis playbook.
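The drop-off signature, a turn whose abandonment rate spikes relative to adjacent turns, lends itself to a simple check. The sketch below is one possible version: the 2x ratio, the function names, and the counts are illustrative assumptions rather than anything taken from the team's playbook.

```python
def abandonment_rates(reached, abandoned):
    """Per-turn rate: callers who hung up at a turn / callers who reached it."""
    return [a / r if r else 0.0 for r, a in zip(reached, abandoned)]

def drop_off_turns(rates, ratio=2.0):
    """Flag turns whose rate is at least `ratio` times the mean of adjacent turns."""
    flagged = []
    for i, rate in enumerate(rates):
        neighbors = rates[max(0, i - 1):i] + rates[i + 1:i + 2]
        if not neighbors:
            continue  # a single-turn flow has nothing to compare against
        baseline = sum(neighbors) / len(neighbors)
        if baseline > 0 and rate >= ratio * baseline:
            flagged.append(i)
    return flagged

# Hypothetical counts per turn: calls that reached it, calls abandoned at it
reached = [1000, 980, 960, 850, 835]
abandoned = [20, 20, 110, 15, 10]

rates = abandonment_rates(reached, abandoned)
spikes = drop_off_turns(rates)
print("drop-off turns:", spikes)
```

Comparing against neighbors rather than a global average keeps a flow with uniformly high abandonment from flagging every turn at once.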

Prioritizing improvements

Not every insight was worth acting on.

Tomás prioritized improvements using three factors.

Frequency measured how often the issue occurred. An issue affecting 30% of calls mattered more than one affecting 3%.

Impact measured the extent to which the issue affected the goal metric. An issue that reduced the completion rate by 10 points mattered more than one that reduced it by 1 point.

Fixability measured how easy the solution was to implement. An issue that could be solved with a prompt change was faster to address than one requiring backend work, so it scored higher.

The product of frequency, impact, and fixability produced a prioritization score. Tomás maintained a ranked list of improvement opportunities. Each week, the team tackled the highest-scoring items.

Some high-frequency issues had low impact. Callers often asked a specific question that was easy to answer. Frequent but not problematic.

Some high-impact issues had low frequency. A rare edge case caused total call failure when it occurred. Impactful but not urgent.

The prioritization framework balanced these tradeoffs. It prevented the team from optimizing based on what was most interesting rather than what was most valuable.
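A minimal sketch of the scoring, assuming frequency is expressed as a share of calls, impact as points of the goal metric, and fixability as a 0 to 1 ease score (higher means easier). The source gives only the product rule; the scales, issue names, and numbers here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    name: str
    frequency: float   # share of calls affected, 0..1
    impact: float      # points of the goal metric lost when it occurs
    fixability: float  # 0..1, higher = easier to fix (assumed scale)

    @property
    def score(self) -> float:
        return self.frequency * self.impact * self.fixability

# Hypothetical backlog
issues = [
    Issue("frequent but easy FAQ", frequency=0.40, impact=0.5, fixability=0.9),
    Issue("rare edge case, total call failure", frequency=0.01, impact=100, fixability=0.3),
    Issue("confusing availability question", frequency=0.30, impact=10, fixability=0.9),
]

ranked = sorted(issues, key=lambda issue: issue.score, reverse=True)
for issue in ranked:
    print(f"{issue.score:5.2f}  {issue.name}")
```

Because the score is a product, an issue has to rate reasonably on all three factors to reach the top of the list, which is how the frequent-but-harmless and rare-but-severe cases both end up below the common, costly, easy fix.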

The availability question ordering was the first insight. Over the course of six months, conversation analytics revealed a dozen more.

Candidates who heard their name mentioned during confirmation had 8% higher completion rates. Calls in which the agent acknowledged potential scheduling conflicts had 12% lower callback rates. Candidates who were offered a specific callback time rather than "we'll call you back" answered 25% more often.

Each came from the same process. Segment by outcome. Find the divergence point. Understand what distinguishes success from failure. Change the agent. Measure.

Tomás kept a list on his wall. Twelve insights, twelve changes, twelve measured improvements. The list had started with one line: "Availability third, not first. +23% completion."

That line had restructured the entire screening flow. The eleven lines that followed had each made the agent slightly better. None would have been visible without looking at the data. All had been hiding in conversations the team already had.

Chapter 29: Optimization Strategies

What you'll learn: How to run a weekly optimization loop that compounds small improvements into large gains.

Key takeaways:

Optimization should run weekly, not quarterly. The agents that improve fastest make targeted fixes every week based on data.
The weekly review covers top transfer reasons, top tool errors, false success detection, and low-confidence transcripts.
Make targeted changes, not wholesale rewrites. Small changes produce clean signals. You know what caused the improvement or regression.
A/B test conversation variants when you're not sure which option is better. Run tests to completion before declaring winners.
Small, fast iterations beat large redesigns. Fourteen small changes in fourteen weeks outperform one big redesign that takes fourteen weeks.

The insurance brokerage's qualification agent launched at a 12% transfer-to-close rate. Fourteen weeks later, it operated at 27%. The improvement didn't come from a single redesign. It came from small changes, made weekly, each addressing a specific problem.

Diane ran the optimization loop. Every week, she reviewed the same data: top transfer reasons, top tool errors, false success detection, and low-confidence transcripts. Every week, she identified the highest-impact issue and made a targeted fix.

Week one: 30% of transfers happened because prospects asked plan-specific questions that triggered immediate compliance escalation. But many of those questions could be answered within the agent's licensed scope. Diane adjusted the escalation trigger to distinguish between "plan recommendation," which required transfer, and "plan information," which the agent could provide.

Week three: prospects who heard a summary of their qualification results before transfer closed at higher rates. Diane added a pre-transfer summary. "Based on what you've shared, you're likely eligible for several plans. I'm going to connect you with someone who can walk you through the specific options."

Week seven: A/B testing revealed that changing "let me connect you with a specialist" to "let me connect you with Sarah, who can walk you through your options" increased prospect willingness to stay on the line by 18%. Diane updated the transfer phrasing.

Each change was small. The cumulative effect was transformative.

The weekly review

Diane's optimization loop ran weekly.

Top transfer reasons showed why calls didn't complete within the agent. Diane ranked transfer reasons by frequency. The most common reason was investigated first. Sometimes the transfer was appropriate, and nothing needed to change. Sometimes the agent unnecessarily escalated the call, and a prompt adjustment could have contained it.

Top tool errors showed where backend integrations failed. Diane ranked errors by frequency and impact. A rare error that caused total call failure got priority over a common error that the agent recovered from gracefully.

False success detection compared agent-reported outcomes to backend confirmation. The agent said it completed the qualification, but did the CRM actually record the qualified lead? Mismatches revealed tool-first truth violations or backend synchronization issues.

Low-confidence transcripts showed calls where speech-to-text struggled. Low transcription confidence correlated with higher error rates and worse outcomes. Diane investigated whether the issues were audio quality, accent handling, or specific vocabulary that confused the engine.

The review took two hours. It produced a prioritized list of issues. The highest-priority issue became that week's optimization target.
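The false success check in that review amounts to a set comparison. The sketch below assumes agent-reported outcomes keyed by call ID and a set of call IDs the CRM actually recorded; both structures and the sample IDs are hypothetical stand-ins for whatever the real systems expose.

```python
def find_false_successes(agent_outcomes, backend_records):
    """Call IDs where the agent reported success but the backend has no record."""
    return sorted(
        call_id
        for call_id, outcome in agent_outcomes.items()
        if outcome == "success" and call_id not in backend_records
    )

# Hypothetical data for one day of calls
agent_outcomes = {
    "c1": "success",
    "c2": "success",    # agent said the lead was qualified...
    "c3": "escalated",
    "c4": "success",
}
backend_records = {"c1", "c4"}  # ...but the CRM only recorded these

mismatches = find_false_successes(agent_outcomes, backend_records)
print("false successes:", mismatches)
```

Each mismatched call ID is a transcript worth reading, since the cause could be either an early confirmation by the agent or a backend synchronization issue.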

Targeted changes

Diane made small changes, not wholesale rewrites.

When the compliance escalation was too aggressive, she adjusted the specific trigger phrase in the prompt. She didn't rewrite the entire compliance section. The targeted change was testable, measurable, and reversible.

When the pre-transfer summary improved conversion, she added four sentences to the prompt. She didn't restructure the conversation flow. The addition was minimal and focused.

When the personalized transfer phrasing worked better, she changed three words. "A specialist" became "[Agent Name], who can walk you through your options." The change was tiny. The impact was measurable.

Targeted changes produced clean signals. When something improved, Diane knew what caused it. When something regressed, she knew what to revert. Broad rewrites muddied the signal. They changed multiple things at once, making it impossible to attribute outcomes.

Diane maintained version control on every change. Each version had a description of what changed, why, and what metric she expected to move. The version history was an optimization journal.

A/B testing

Not every change had a predictable outcome. For changes where Diane wasn't sure which option was better, she ran A/B tests.

The traffic split was 50/50 for consequential changes and 90/10 for risky ones. In a 50/50 test, half of the calls received version A, and half received version B. Outcomes were tracked by version.
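One common way to implement such a split is to hash a stable call identifier into a bucket, so the same call always maps to the same version. The sketch below assumes that approach; the source does not say how traffic was actually routed, and `assign_variant`, the test name, and the call IDs are hypothetical.

```python
import hashlib

def assign_variant(call_id: str, test_name: str, share_b: float = 0.5) -> str:
    """Deterministically assign a call to version A or B.

    share_b is the fraction of traffic sent to B: 0.5 for an even split,
    0.1 for a 90/10 split where B is the riskier variant.
    """
    digest = hashlib.sha256(f"{test_name}:{call_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "B" if bucket < share_b else "A"

# Simulate 10,000 hypothetical calls under each split
call_ids = [f"call-{n}" for n in range(10_000)]
even = [assign_variant(c, "transfer-phrasing", 0.5) for c in call_ids]
risky = [assign_variant(c, "transfer-phrasing", 0.1) for c in call_ids]

print("50/50 share of B:", even.count("B") / len(even))
print("90/10 share of B:", risky.count("B") / len(risky))
```

Including the test name in the hash keeps assignments independent across tests, so a call that landed in B for one experiment is not automatically in B for the next.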

Test duration depended on traffic volume and expected effect size. A change expected to move conversion by 10 points needed fewer calls than a change expected to move it by 2 points. Diane calculated minimum sample sizes before launching tests.
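The relationship between effect size and sample size can be sketched with the standard two-proportion formula. The 5% significance level and 80% power below are conventional defaults, not values from the source, and the 12% baseline is borrowed from the agent's starting transfer-to-close rate purely for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def min_sample_size(p_base, p_variant, alpha=0.05, power=0.8):
    """Calls needed per version to detect p_base vs p_variant (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_variant - p_base) ** 2)

n_large_effect = min_sample_size(0.12, 0.22)  # expecting a 10-point move
n_small_effect = min_sample_size(0.12, 0.14)  # expecting a 2-point move

print("calls per version, 10-point effect:", n_large_effect)
print("calls per version,  2-point effect:", n_small_effect)
```

Required sample size grows roughly with the inverse square of the effect, so the 2-point test needs on the order of twenty times as many calls as the 10-point test.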

Diane watched for early signals but didn't stop tests prematurely. A version that looked better on day one might not hold on day seven. She ran tests to completion unless safety concerns required early termination.

When the results were clear, the winning version rolled to 100%. When the results were ambiguous, Diane investigated the cause. Sometimes, both versions performed equally because the change didn't matter. Sometimes effects differed by caller segment, suggesting a need for conditional logic.
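What counts as "clear" depends on the statistical test chosen. As one possibility, the sketch below applies a pooled two-proportion z-test; the source does not name the test Diane used, and the call counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical outcomes after both tests ran to completion
z_clear, p_clear = two_proportion_z_test(120, 1000, 165, 1000)
z_ambig, p_ambig = two_proportion_z_test(120, 1000, 128, 1000)

print(f"clear result:     z={z_clear:.2f}, p={p_clear:.3f}")
print(f"ambiguous result: z={z_ambig:.2f}, p={p_ambig:.3f}")
```

A high p-value does not prove the versions are equivalent. It is the cue to look at segments or conclude the change did not matter, as described above.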

A/B testing added rigor to optimization. Instead of guessing which option was better, Diane measured. The measurement took longer but produced certainty.

Changing how the conversation flows

Some improvements required changes to conversation structure, not just prompt wording.

Reordering steps changed the sequence of information collection. The staffing marketplace found that asking for availability third improved completion. Similar reordering opportunities existed in other flows. Diane tested different sequences for the qualification questions and found that asking about current coverage before asking about health needs produced better engagement.

Confirmation patterns changed when and how the agent verified information. More frequent confirmation at high-stakes moments reduced downstream errors. Less confirmation at low-stakes moments reduced handle time. Diane tuned the confirmation frequency by stake level.

Disambiguation approaches changed how the agent handled ambiguous inputs. Instead of asking open-ended clarifying questions, the agent offered constrained choices. "Did you mean your current plan or a new plan?" instead of "Could you tell me more about that?" Constrained choices produced faster resolution.

Flow changes were harder to test because they affected multiple conversation turns. Diane ran longer A/B tests for flow changes and monitored more metrics. A flow change that improved completion rate but increased handle time might not be worthwhile.

Fixing the backend

Some improvements came from the backend, not the conversation.

Reducing tool latency improved conversation flow. A tool that returned in 500ms instead of 1500ms eliminated the awkward pause. Diane worked with backend teams to optimize slow endpoints.

Improving error handling made failures graceful. Instead of generic error messages, specific error codes enabled specific recovery paths. The agent could say, "That date isn't available, but the following week has openings," instead of "Something went wrong, let me try again."

Adding fallback paths handled outages without breaking conversations. When the primary eligibility check was unavailable, a cached result could serve. When the scheduling API timed out, the agent could offer to call back rather than failing silently.
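A fallback path of this kind might look like the sketch below, which tries the primary eligibility check, falls back to a cached result, and otherwise signals the agent to offer a callback. Combining the cache and the callback offer into one function is an assumption made for brevity, and `ServiceUnavailable`, `check_eligibility`, and the stub services are hypothetical names.

```python
class ServiceUnavailable(Exception):
    """Raised by the (hypothetical) primary eligibility service during an outage."""

def check_eligibility(prospect_id, primary, cache):
    """Primary check first, cached result second, callback offer last."""
    try:
        eligible = primary(prospect_id)
    except ServiceUnavailable:
        if prospect_id in cache:
            return {"source": "cache", "eligible": cache[prospect_id]}
        # Nothing to fall back on: tell the agent to offer a callback
        return {"source": "none", "action": "offer_callback"}
    cache[prospect_id] = eligible  # refresh the cache on every success
    return {"source": "primary", "eligible": eligible}

# Stub services standing in for the real backend
def healthy(prospect_id):
    return True

def down(prospect_id):
    raise ServiceUnavailable("eligibility service down")

cache = {}
first = check_eligibility("p-1", healthy, cache)  # primary answers, cache filled
second = check_eligibility("p-1", down, cache)    # outage, served from cache
third = check_eligibility("p-2", down, cache)     # outage, no cache entry

for result in (first, second, third):
    print(result)
```

Returning the source alongside the answer lets the agent phrase its response differently when it is working from a cached result.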

Integration optimization required cross-team coordination. Diane maintained relationships with backend teams and shared data on how their systems affected conversation quality. When she could show that a 500ms latency improvement would increase completion rate by 3 points, backend priorities shifted.

Measuring impact

Every optimization needed measurement.

Diane established a before/after comparison methodology. She compared metrics for the two weeks before and after a change. She controlled for day-of-week and time-of-day effects by comparing equivalent periods.
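One way to control for day-of-week effects is to compare each weekday only against the same weekday in the other period, then average the differences. The sketch below assumes daily totals of calls and successes are available; the dates, volumes, and rates are synthetic, and time-of-day matching is left out for brevity.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekday_matched_lift(daily, change_date):
    """Average per-weekday change in success rate (after minus before).

    daily: iterable of (day, calls, successes) tuples.
    """
    before = defaultdict(lambda: [0, 0])
    after = defaultdict(lambda: [0, 0])
    for day, calls, successes in daily:
        bucket = after if day >= change_date else before
        bucket[day.weekday()][0] += calls
        bucket[day.weekday()][1] += successes
    diffs = [
        after[w][1] / after[w][0] - before[w][1] / before[w][0]
        for w in before
        if w in after and before[w][0] and after[w][0]
    ]
    return sum(diffs) / len(diffs) if diffs else None

# Synthetic data: two weeks either side of a hypothetical change date.
# Weekends convert worse than weekdays; the change adds 2 points everywhere.
change = date(2024, 1, 15)
daily = []
for offset in range(-14, 14):
    day = change + timedelta(days=offset)
    base = 8 if day.weekday() >= 5 else 12
    daily.append((day, 100, base + (2 if offset >= 0 else 0)))

lift = weekday_matched_lift(daily, change)
print(f"weekday-matched lift: {lift * 100:.1f} points")
```

Matching by weekday matters most when the before and after windows contain different mixes of days, for example when a holiday shortens one of them.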

She watched for external variables that could confuse results. A marketing campaign that drove different caller demographics. A seasonal shift in caller needs. A backend change that coincided with her prompt change. When external variables were present, she adjusted the analysis or extended the observation period.

She tracked cumulative impact over time. The 15-point improvement in transfer-to-close rate was the net result of 14 individual changes. Some changes contributed 3 points. Some contributed 1 point. Some contributed negative points and were reverted. The cumulative graph showed progress even when individual weeks felt slow.

The qualification agent started at 12% transfer-to-close rate. After fourteen weeks, it operated at 27%.

Diane kept a changelog. Week one: compliance escalation adjustment, +2 points. Week three: pre-transfer summary, +3 points. Week five: A/B test on greeting, no significant difference. Week seven: personalized transfer phrasing, +2 points. Week nine: callback timing optimization, +1 point.
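The cumulative view can be reproduced from the changelog with a running sum. The sketch below includes only the five entries listed above, so it ends at 20% rather than the 27% reached after all fourteen entries; the tuple layout is an assumption about how such a log might be stored.

```python
# (week, change, points) for the entries named in the changelog above
changelog = [
    (1, "compliance escalation adjustment", 2),
    (3, "pre-transfer summary", 3),
    (5, "A/B test on greeting", 0),
    (7, "personalized transfer phrasing", 2),
    (9, "callback timing optimization", 1),
]

rate = 12  # starting transfer-to-close rate, in percent
history = []
for week, change, points in changelog:
    rate += points
    history.append((week, rate))
    print(f"week {week:>2}: {points:+d} -> {rate}%  ({change})")
```

Plotting `history` gives the cumulative graph: flat where a test showed no difference, stepping up where a change landed.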

Fourteen entries. Some added points. Some added nothing. Two were reverted after metrics dropped. The cumulative effect was 15 points of improvement that no single redesign could have achieved.

Week fifteen, Diane started the review the same way she had every week since week one. Top transfer reasons. Top tool errors. False success detection. Low-confidence transcripts. The agent was better than it had been. The review would find something to make it better still.

The loop didn't end at 27%. That was just where the agent was this week.