Vapi raises $50M Series B
Read More →
Vapi raises $50M Series B to power the next generation of enterprise voice AI
The most important decision in implementing voice AI comes before conversation design: selecting the right use case. Not every conversation belongs on a phone call. Password resets, order lookups, anything that ends with a link or a code: chat handles these well because the user can re-read, copy, and click their way to a swift resolution without leaving their interface. If the resolution is a URL, a code, or a number, chat is the right channel, and forcing those interactions into voice can be why a use case fails in the first place.
Voice earns its place when the problem is urgent, hard to type, complex, or emotionally loaded: a patient describing symptoms, a customer disputing a charge, a driver who can't use their hands. These callers want to explain the problem in their own words and get it resolved in one conversation, especially when the resolution is not straightforward. Picking which use cases move to voice is half the work, and it should happen before anyone writes a prompt.
Once you've picked the right use case, your chat scripts are a real asset. They encode your business logic, your edge cases, and years of tuning. But they're written for eyes and a cursor, and calls go to ears. Chat and voice are different media, each with its own rules and criteria for success. If a team pastes its best-performing chat flows into a system prompt without accounting for what makes a voice agent perform well, it can hurt conversions, transfer rates, and other critical bottom-line metrics. However well those flows worked in chat, they were written to be read, not heard. Designing for the ear is its own discipline. Here are some tips for getting the most out of your voice flows.
The spoken word is present only for a moment before it dissipates, and your flows must reflect this. Chat tolerates a paragraph of context before the question. On a call, the caller has forgotten that context by the time the question arrives. Break every long passage into turns that carry one idea or one question each, and put the important part first so you ask the question while you still have their attention. Say "Would Thursday at 2 pm work?" Not "I found availability on Thursday, and it looks like 2 pm is open, so would you like me to go ahead and book that for you?" If you share too much information without being intentional about how you share it, you risk more than wasted context: you might lose the person on the other end of the line.
Chat has a scroll bar; voice does not. The date your script collected three steps ago is still on screen in chat and can be retrieved in a moment, so the script does not need to repeat it. A caller cannot scroll back, so voice needs both repetition and context. That means your appointment booking flow must restate the details and ask "Should I book Thursday, January 15th at 2 pm?" instead of "Should I go ahead and book that?" Find every place in your chat flow that refers back to earlier input, and make the agent say the value out loud, including dates, times, and any other critical detail the caller may have forgotten since earlier in the conversation.
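As a rough illustration of saying the value out loud, the sketch below builds the confirmation question from a stored appointment time instead of referring back to "that." The function names and the exact phrasing are assumptions made for this example; they are not part of any particular voice platform.

```python
from datetime import datetime

# English names are spelled out so the spoken text does not depend on locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def ordinal(day: int) -> str:
    """Return the day with its spoken suffix, e.g. 15 -> '15th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def spoken_time(when: datetime) -> str:
    """Return a 12-hour time the way an agent would say it."""
    hour = when.hour % 12 or 12
    meridiem = "am" if when.hour < 12 else "pm"
    if when.minute == 0:
        return f"{hour} {meridiem}"
    return f"{hour}:{when.minute:02d} {meridiem}"


def confirmation_prompt(when: datetime) -> str:
    """Restate the full date and time rather than saying 'book that'."""
    weekday = WEEKDAYS[when.weekday()]
    month = MONTHS[when.month - 1]
    return (f"Should I book {weekday}, {month} {ordinal(when.day)} "
            f"at {spoken_time(when)}?")


print(confirmation_prompt(datetime(2026, 1, 15, 14, 0)))
```

The year 2026 is chosen only because January 15th falls on a Thursday that year, which matches the example question above.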
A chat script can present five locations or four appointment slots because the user can scan them and take their time weighing each one. When people hear information out loud, they can hold and consider two or three options and not much more. Read four time slots aloud, and callers ask you to repeat them. Offer two, and they pick one. Turn lists into narrowing questions: "I have openings Thursday afternoon or Friday morning. Which works better?" Keep the rest in reserve. If you need to cover more choices, split them across multiple questions rather than lengthening the list.
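One way to picture the narrowing question is the sketch below: it speaks at most two options and holds the rest back for a follow-up turn. The function name, the default of two, and the wording for a single remaining option are assumptions for illustration.

```python
def narrowing_question(options: list[str], max_spoken: int = 2) -> tuple[str, list[str]]:
    """Offer only the first few options aloud and keep the rest in reserve."""
    if not options:
        raise ValueError("need at least one option to offer")
    spoken, reserve = options[:max_spoken], options[max_spoken:]
    if len(spoken) == 1:
        prompt = f"I have an opening {spoken[0]}. Does that work?"
    else:
        # Join as "A or B" (or "A, B or C" if max_spoken is raised to 3).
        listed = ", ".join(spoken[:-1]) + " or " + spoken[-1]
        prompt = f"I have openings {listed}. Which works better?"
    return prompt, reserve


slots = ["Thursday afternoon", "Friday morning",
         "Monday morning", "Tuesday afternoon"]
prompt, reserve = narrowing_question(slots)
print(prompt)
print(reserve)
```

If the caller declines both, the reserve list can be passed back into the same function for the next turn.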
Chat flows often gather everything and show a summary card at the end, which works because the user can review it, with visual aids and time to evaluate each detail. The spoken version of that summary is a string of five details, and the caller won't catch an error in it. Heard all at once, the information is difficult to absorb and check. Instead, in voice, echo each critical slot the moment you collect it, while it is still fresh in the caller's mind. For example, when confirming a date: "Got it, January 15th." One staffing company applied this to availability and certification questions and cut downstream placement errors by 30%.
"What would you like to change?" works in chat, where the user can type a precise answer. A user can think and type out a thoughtful response that can then be analyzed. Spoken, open questions produce open answers, which in turn produce more clarifying questions. They can lead to more edge cases and often push resolution further away than it was at the beginning of the call. Turn them into choices that make deterministic paths based on their response: "Would you like to reschedule to a different date, or cancel?” One team made this single swap and took 40 seconds off the average call duration. This makes calls more straightforward and increases overall resolution rates.
Chat rarely mishears: the text stays on screen and can be re-read, and it never has to deal with pauses, accents, or mispronunciations. Voice does. Transcription can mishear and turn "fifteen" into "fifty." Your chat scripts have no retry logic because they never needed it. Your voice agent needs a ceiling: ask normally, then rephrase with a format hint ("Could you say the date as month and day? For example, January fifteenth"), then escalate gracefully. Three attempts, then a human. This gives the caller more than one chance to share the information without jumping straight to human escalation, and it still ends gracefully when the agent can't get it right.
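The retry ceiling can be sketched as a short loop. This is one possible reading of "three attempts, then a human": a normal ask, a rephrase with a format hint, one last try, then a hand-off. The `listen`, `parse`, and `say` callables are placeholders for whatever your platform provides, and the first prompt, the third prompt, and the hand-off line are assumed wording.

```python
from typing import Callable, Optional

# Attempt 1 asks normally, attempt 2 adds a format hint, attempt 3 is a last try.
PROMPTS = (
    "What date works for you?",
    "Could you say the date as month and day? For example, January fifteenth.",
    "Sorry, one more time: what is the month and day?",
)
HANDOFF = "Let me connect you with someone who can help."


def collect_with_ceiling(
    listen: Callable[[], str],
    parse: Callable[[str], Optional[str]],
    say: Callable[[str], None],
) -> Optional[str]:
    """Try each prompt once; return the parsed value, or None after escalating."""
    for prompt in PROMPTS:
        say(prompt)
        value = parse(listen())
        if value is not None:
            return value
    say(HANDOFF)  # ceiling reached: the caller should now transfer to a human
    return None


# Scripted demo: the first answer is misheard, the second one parses.
answers = iter(["uh, the fifty", "January fifteenth"])
result = collect_with_ceiling(
    listen=lambda: next(answers),
    parse=lambda u: "01-15" if u.lower() == "january fifteenth" else None,
    say=print,
)
print(result)
```

A return value of `None` is the signal to transfer the call, so the escalation decision lives in one place instead of being scattered across prompts.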
Any place your chat script ends with a URL or a reference code is a place your voice script should say, "I'm going to send you a link by text so you don't have to write it down." Never read a URL aloud. A caller is unlikely to remember a link from a live conversation, and clicking through from a text or email is far easier than transcribing it or typing it in. If a flow is mostly links, it probably should've stayed in chat. If only one link is needed, connect a tool to send it via email or text.
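When converting a chat flow, a small check like the sketch below can catch replies that would otherwise read a URL aloud. It swaps the spoken turn for the hand-off line and returns the links so a separate messaging tool can deliver them. The function name and the regular expression are illustrative assumptions, and the actual text or email tool is left out because it depends on your platform.

```python
import re

# Deliberately simple pattern: anything starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")
SPOKEN_HANDOFF = (
    "I'm going to send you a link by text so you don't have to write it down."
)


def plan_voice_turn(chat_reply: str) -> tuple[str, list[str]]:
    """Return (what to say, links to send by text or email)."""
    # Drop sentence punctuation that got captured at the end of a link.
    links = [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(chat_reply)]
    if not links:
        return chat_reply, []
    return SPOKEN_HANDOFF, links


say, links = plan_voice_turn(
    "You can reset your password here: https://example.com/reset."
)
print(say)
print(links)
```

If this check fires on most turns of a flow, that is a hint the flow belongs in chat rather than voice.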
The cheapest test in voice design costs nothing. Before you ship a converted flow, read it aloud. If you run out of breath on a turn, the caller will too. This post draws on chapter 9 of the Voice Agent Playbook, which covers conversation design in full, including the pre-launch checklist.
