Vapi raises $50M Series B to power the next generation of enterprise voice AI

Part 3

Design

~72 min • 6 chapters

Chapter 9: Conversation Design Fundamentals

What you'll learn: Why designing for voice is fundamentally different from designing for text, and the principles that make voice agents effective.

Key takeaways:

Voice is not text. Every word exists for one moment. If the caller misses it, it's gone. This constraint reshapes everything about conversation design.
Keep turns short. One idea, one question, or one confirmation per turn. Front-load the important part. Never assume the caller remembers information from earlier turns.
Every agent turn follows a pattern. Acknowledge (show you heard), act (do something useful), advance (move the conversation forward). Skipping acknowledgment makes the agent feel cold.
Slot filling requires progressive confirmation. Confirm each piece of critical information as you collect it. Don't wait until the end to read it all back at once.
Before finalizing any flow, read it aloud. If you run out of breath or lose track of your own point, the caller will too.

Elena had designed chatbots for three years before her company asked her to build their first voice agent. She figured it would be straightforward. Same logic, different interface. She took her best-performing chat scripts, adapted them into a system prompt, and launched a pilot.

The results were brutal. Callers complained the agent talked too much. They forgot what it asked. They interrupted constantly, then got frustrated when the agent kept talking over them. Transfer rates hit 60% in the first week.

Elena spent the next month learning what she should have known from the start. Voice is not text. The principles that make a chatbot effective will make a voice agent fail.

Voice is not text

In a chat interface, a user can re-read a message, scroll up, copy a confirmation number, and take 30 seconds to respond. In a voice conversation, every word exists for exactly one moment. If the caller misses it, it's gone.

This constraint reshapes everything.

Keep turns short. One idea, one question, or one confirmation per turn. Elena's first agent delivered paragraphs of context before asking questions. By the time it reached the question, callers had lost the thread.

Front-load the important part. Put the action or question first. Say "Would Thursday at 2pm work?" not "I found availability on Thursday, and it looks like 2pm is open, so would you like me to go ahead and book that for you?"

Never assume the caller remembers. If you collected a date three turns ago, repeat it when you confirm. Don't say "Should I go ahead and book that?" Say "Should I book Thursday, January 15th at 2pm?"

The difference between voice and chat becomes clear when you see the same task written for both channels.

Chat delivers troubleshooting steps as a numbered list. Voice delivers one step at a time with a check-in before moving on. "First, close the app completely and reopen it. Tell me when you've done that."

Chat puts a URL on screen. Voice can't. "I'm going to send you a link by text so you don't have to write it down."

Elena built a rule for her team. Before finalizing any conversation flow, read it aloud. If you run out of breath or lose track of your own point, the caller will too.

Turn structure

Every agent's turn follows a simple pattern. Acknowledge, then act, then advance.

Acknowledge shows you heard the caller. "Got it." "Okay, January 15th."

Act does something useful. Look up availability, confirm a detail, process a request.

Advance moves the conversation forward. Ask the next question, present options, or confirm the outcome.

Elena's first agent skipped the acknowledgment. It jumped straight from the caller's answer to the next question without any signal that it had heard them, and callers felt ignored. Adding a one-word acknowledgment made the agent feel human.

Don't treat this as a rigid template. If the agent hits all three beats in the same mechanical cadence every turn, it sounds scripted. Sometimes the acknowledgment and act merge. "Got it, I'm pulling up your account now." Sometimes the advance is implicit. After confirming a booking, silence is the advance because the caller knows the conversation is wrapping up.

Information density

People retain two to three pieces of information from a single spoken passage. Elena learned this by listening to call recordings. When her agent listed four appointment times, callers asked it to repeat the list. When it listed two, they picked one and moved on.

Bad: "I found three openings: Thursday, January 15th at 2pm, Friday, January 16th at 10am, and Monday, January 19th at 3:30pm. Which would you prefer?"

Better: "I have openings Thursday afternoon or Friday morning. Which works better?" Then narrow from there.

When you must present multiple options, limit them to two or three at a time. Present the best fit first and offer to show more.
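The option limit can be enforced in code rather than left to the prompt. The following is a minimal sketch, assuming the scheduling backend returns openings as short spoken-style labels. The function name offer_options, the max_spoken parameter, and the closing phrases are hypothetical, not part of any particular platform.

```python
from typing import List, Tuple


def offer_options(options: List[str], max_spoken: int = 2) -> Tuple[str, List[str]]:
    """Speak at most `max_spoken` options and hold the rest back for later."""
    if not options:
        raise ValueError("need at least one option to offer")
    if not 1 <= max_spoken <= 3:
        raise ValueError("speak between one and three options per turn")

    # Options are assumed to arrive best-fit first.
    spoken, held_back = options[:max_spoken], options[max_spoken:]

    if len(spoken) == 1:
        prompt = f"I have an opening {spoken[0]}. Does that work?"
    else:
        listing = ", ".join(spoken[:-1]) + " or " + spoken[-1]
        prompt = f"I have openings {listing}. Which works better?"

    # Offer more only when there really is more to offer.
    if held_back:
        prompt += " I can check other times too."
    return prompt, held_back


prompt, held_back = offer_options(
    ["Thursday afternoon", "Friday morning", "Monday afternoon"]
)
```

Here the caller hears only the first two openings, and the third stays in held_back until they ask for more.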

Progressive confirmation

Most voice agent tasks involve collecting information. A name, a date, an account number. The temptation is to collect everything first and confirm at the end.

Don't.

Each time you collect a critical slot, echo it back immediately. "Got it, January 15th." This catches errors early and gives the caller confidence that the agent is tracking.

If you wait until the end to confirm everything, you create risk. "So that's John Smith, January 15th at 2pm, for a cleaning appointment at the downtown location." If the date is wrong, you've wasted the entire conversation. Worse, the caller may not catch the error buried in a string of five details.

Confirm every slot that would be expensive to get wrong. Dates, times, spelling of names, dollar amounts, cancellations, anything irreversible.
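One way to make the echo-back automatic is to mark which slots count as critical and build the acknowledgment from that list. This is a small sketch under that assumption. The slot names in CRITICAL_SLOTS and the function echo_back are illustrative, and each team would define its own list of expensive-to-get-wrong slots.

```python
# Hypothetical set of slots that are expensive to get wrong.
CRITICAL_SLOTS = {"date", "time", "name", "amount"}


def echo_back(slot: str, value: str) -> str:
    """Acknowledge a collected slot, repeating the value only when it is critical."""
    if slot in CRITICAL_SLOTS:
        # Repeating the value lets the caller catch a mishearing right away.
        return f"Got it, {value}."
    # Low-stakes slots get a short acknowledgment so the turn stays brief.
    return "Got it."


print(echo_back("date", "January 15th"))
print(echo_back("parking_preference", "street"))
```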

Maria's team at the staffing company learned this at scale. Progressive confirmation of availability, certifications, and work location reduced downstream placement errors by 30%. Catching a wrong answer on turn 3 beats discovering it after a placement fails.

Disambiguation

When the caller's input is ambiguous, the agent needs to narrow. But narrowing can create confusion if done wrong.

Offer two to three choices, never more. "Did you mean the Main Street location or the Airport location?" Not "We have locations on Main Street, Airport Road, Downtown, Westside Plaza, and the new one on Fifth Avenue."

Ask narrowing questions, not open-ended ones. If a caller says, "I need to change my appointment," don't ask, "What would you like to change?" Ask "Would you like to reschedule to a different date, or cancel?"

Open-ended disambiguation produces open-ended answers, which produce more disambiguation. Elena tracked the length of the conversation before and after switching from open questions to constrained choices. Average call duration dropped by 40 seconds.

Error recovery

Errors will happen. The transcription engine will mishear "fifteen" as "fifty." The caller will say something outside the agent's scope of work. A backend system will time out.

The agent's error behavior defines the caller's experience more than its happy-path behavior. People forgive a mistake if the recovery is smooth. They don't forgive a loop.

Elena's original agent had no retry limit. She found call recordings in which callers repeated the same information 6, 7, or 8 times before hanging up. Adding a three-attempt ceiling with graceful escalation cut abandonment rates in half.

Attempt 1 is the normal ask. "What date works for you?"

Attempt 2 rephrases and constrains the format. "Could you say the date as month and day? For example, January fifteenth."

Attempt 3 offers an alternative channel or transfer. "I'm having trouble catching that. Let me connect you with someone who can help."

Three attempts maximum. After three failures on the same slot, escalate.
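The three-attempt ladder above can be sketched as a small per-slot counter. This is an illustration under assumptions, not a definitive implementation. The class name SlotRetry and its next_prompt method are hypothetical, and the prompts are the example phrasings from this section.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_ATTEMPTS = 3

# Example wording for a date slot, one prompt per attempt.
DATE_PROMPTS = {
    1: "What date works for you?",
    2: "Could you say the date as month and day? For example, January fifteenth.",
    3: "I'm having trouble catching that. Let me connect you with someone who can help.",
}


@dataclass
class SlotRetry:
    """Tracks how many times the agent has asked for one slot."""

    attempts: int = 0

    def next_prompt(self) -> Tuple[str, bool]:
        """Return (what to say next, whether to escalate after saying it)."""
        # Clamp at the ceiling so a bug upstream can never produce a fourth ask.
        self.attempts = min(self.attempts + 1, MAX_ATTEMPTS)
        escalate = self.attempts >= MAX_ATTEMPTS
        return DATE_PROMPTS[self.attempts], escalate


retry = SlotRetry()
for _ in range(3):
    line, escalate = retry.next_prompt()
    print(escalate, line)
```

Keeping the counter per slot, rather than per call, means a caller who struggles with the date still gets a fresh set of attempts on the next slot.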

Multilingual

If your customer base speaks multiple languages, you need a strategy beyond translation.

Detect the caller's language within the first turn and route to the appropriate configuration. Don't assume language from the phone number. Adapt the persona for cultural norms, not just the words. A direct style that works in American English may feel rude in Japanese. A casual approach that works in Mexican Spanish may feel too informal in Colombian Spanish.

One automotive marketplace deploying agents across five Latin American countries learned this the hard way. Translation got the words right but missed the cultural rhythm. Callers noticed.

What Elena learned

Six months after her failed pilot, Elena launched her fourth voice agent. This one handled appointment scheduling for a medical clinic. The transfer rate was 18%. Caller satisfaction exceeded the human baseline.

She kept a list of what changed. She read every conversation aloud before shipping. She limited turns to one idea, one question. She confirmed dates and times immediately after collecting them. She gave the agent three attempts at any slot, then a graceful exit.

Voice design isn't about making the agent sound human. It's about respecting how humans actually listen. Short turns. Progressive confirmation. Constrained choices. Bounded retries.

Elena's first agent failed because she treated voice like text with audio. Her fourth agent succeeded because she designed for ears, not eyes.

Pre-launch checklist

The mistakes that catch experienced designers:

☐ Read it aloud. Did you run out of breath on any turn? Did you lose your own point?

☐ Count your options. Any turn presenting more than 3 choices?

☐ Find high-stakes slots. Dates, times, names, amounts. Is each confirmed immediately after collection?

☐ Spot open-ended questions. "What would you like to do?" should be "Would you like to reschedule or cancel?"

☐ Check retry limits. Does any slot allow more than 3 attempts before escalation?

☐ Find the URLs. Any link or reference number being read aloud instead of pushed to SMS/email?

☐ Test the escalation phrase. Is it graceful ("Let me connect you with someone who can help") or apologetic ("I'm sorry, I'm having trouble")?

☐ Match your goal. If cutting costs, are you minimizing turns? If fixing CX, are you allowing patience? If driving revenue, are you handling objections?

Chapter 10: Inbound vs. Outbound Design

What you'll learn: Why agents designed for inbound calls fail on outbound calls, and how to design for each direction.

Key takeaways:

Inbound discovers intent. Outbound earns permission. The fundamental asymmetry changes everything about conversation design.
The same greeting that works inbound fails outbound. "How can I help you today?" confuses someone who didn't initiate the call.
Outbound calls must establish context, state purpose, and request permission to continue within the first ten seconds. Callers who don't understand why you're calling will hang up.
Voicemail strategy matters for outbound. Keep messages under 20 seconds. State who you are, why you called, and one clear next action.
Success metrics differ by direction. Inbound measures containment and resolution. Outbound measures contact rate, right-party contact rate, and conversion.

Marcus listened to fourteen recordings before he understood the pattern.

Every call opened the same way. The agent greeted the driver like a customer who'd called in. The drivers responded like people who hadn't called anyone. Confused pauses. "Who is this?" Hang-ups. One driver stayed on long enough to ask if it was a scam before disconnecting.

The agent Marcus built worked beautifully for inbound. Customers called the support line to check delivery status, and it handled 70% of those calls without a transfer. Then leadership wanted drivers to install the company's mobile app, so he built an agent to call and remind them. He reused 80% of the inbound design for the outbound campaign. Same tone, same structure, same logic.

Every assumption was wrong.

Inbound callers wanted to be there. They had a question. The agent's job was to discover what they needed. Outbound callers didn't want to be anywhere. They were interrupted mid-day by a number they didn't recognize. The agent's job was to earn permission to continue. Marcus had built an agent that waited politely for intent from people who had no intent. They weren't confused about what they wanted. They were confused about why their phone rang.

He rebuilt from scratch. New opening, new pacing, new structure. The second pilot led with identification, purpose, and a permission question. Answer rates doubled.

The fundamental difference

Inbound means the customer has intent, and the agent discovers it. The caller picked up the phone for a reason. They may be frustrated, confused, or in a hurry. The agent's first job is to figure out what they want.

Outbound means the agent has intent, and the customer may resist. The agent is interrupting someone's day. The caller didn't ask for this conversation. The agent's first job is to earn permission to continue.

This asymmetry changes everything. The opening, the pacing, the tone, the error recovery, the success metrics.

Inbound design

Inbound agents must solve two problems fast. Identify who is calling and discover what they want.

Start short. Callers who've navigated a phone tree or waited on hold don't want a speech. "Thanks for calling. How can I help?" Then listen.

Handle intent shifts. A caller who starts with "I want to check my balance" may follow up with "Actually, can I also update my address?" Inbound agents need to handle these pivots gracefully, either by staying flexible within scope or cleanly handing off to a specialist.

Acknowledge the wait. If your agent sits behind a queue, callers may arrive frustrated. "I've been on hold for 20 minutes" is the context the agent needs to acknowledge. "I apologize for the wait. Let's get this taken care of quickly" resets the emotional temperature before you do anything else.

Design routing explicitly. Inbound agents often serve as the first point of contact. Which intents does this agent handle? Which get transferred? How does the transfer happen? A warm transfer, where the agent briefs the next handler, is dramatically better than a cold transfer, where the caller starts over.

Marcus's inbound agent worked well because it did these things. It greeted briefly, listened, confirmed intent, and either handled the request or transferred with context. The design assumed the caller wanted to be there. That assumption held for inbound. It failed completely for outbound.

Outbound design

Outbound agents must earn the right to continue within the first ten seconds.

Identify, state purpose, ask permission. "Hi, this is the scheduling team at Acme Health calling about your upcoming appointment. Is now a good time?" Three elements, in that order. Skip any of them, and you sound like a robocall. Skip all three and ask "How can I help you?" as Marcus did, and you sound like a confused robocall.
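The three-element opening can be treated as a template with required parts, so a campaign cannot launch with one of them missing. A minimal sketch follows, assuming the campaign supplies each element as plain text. The function outbound_opening and its parameter names are hypothetical.

```python
def outbound_opening(identity: str, purpose: str, permission_question: str) -> str:
    """Build an outbound opening: identify, state purpose, ask permission, in that order."""
    parts = {
        "identity": identity,
        "purpose": purpose,
        "permission_question": permission_question,
    }
    # Refuse to build an opening that skips any of the three elements.
    missing = [name for name, text in parts.items() if not text.strip()]
    if missing:
        raise ValueError(f"opening is missing: {', '.join(missing)}")
    return f"Hi, this is {identity} calling about {purpose}. {permission_question}"


opening = outbound_opening(
    identity="the scheduling team at Acme Health",
    purpose="your upcoming appointment",
    permission_question="Is now a good time?",
)
print(opening)
```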

The permission question shifts based on what you're asking for. An appointment reminder needs minimal permission because the caller benefits. "Is now a good time?" works. A collections call needs more. "I'm calling about your account. Do you have a few minutes to discuss payment options?" A sales qualification call needs the most. "We help companies like yours reduce shipping costs. Would you be open to a brief conversation about whether that might be relevant for you?"

The higher the ask, the more explicit the permission request should be.

Use what you already know. Unlike inbound, where you start from zero, outbound calls should be loaded with context. Who you're calling, why, what data you already have. A staffing company pre-loaded every screening call with the candidate's application data. The agent never asked questions it already had answers to. That signaled competence from the first turn and significantly reduced average call time.

Prepare for resistance. Outbound callers may be skeptical, busy, or hostile. Design specific responses for each.

Skepticism ("Is this a scam?") needs a verification option. "You can call us back at the number on your statement."

Busy ("I'm in a meeting") needs a callback offer. "No problem. When would be a good time to call back?"

Hostile ("Stop calling me") needs a graceful exit. Respect the request and end the call.

A transportation platform calling independent truck drivers encountered a sharper version of this. Drivers were deeply skeptical of automated calls and expected to talk to a real dispatcher. The agent had to handle interruptions, overlapping speech, and blunt pushback without losing its footing. Conversation design that felt too polished triggered immediate hang-ups. The team rewrote the opening three times before finding language that sounded like a colleague, not a script.

Voicemail

A significant percentage of outbound calls reach voicemail. You need a strategy.

Detection isn't perfect. Sometimes the agent starts talking to a voicemail greeting, thinking it's a person. Design for graceful recovery when detection fails.

Leave or retry? Decide in advance. Appointment reminders benefit from a message. Sales calls may benefit from a callback with no message. The answer depends on whether a voicemail advances your goal or just alerts the recipient to screen future calls.

Keep it under 20 seconds. State who you are, why you're calling, and the callback number. Nothing else. Long voicemails get deleted.

Marcus's team tested both approaches. For app installation reminders, leaving a voicemail with a callback number produced more completions than silent retries. For payment reminders, silent retries outperformed voicemails. The right answer depends on the use case.

Blended scenarios

The cleanest designs assume pure inbound or pure outbound. Reality is messier.

Outbound becomes inbound. You call a customer, they miss it, they call back. Now your inbound agent needs to handle "I got a call from this number." That means recognizing the caller by their phone number, pulling the context of the outbound campaign, and resuming the conversation where it would have started.

Marcus discovered this gap when drivers started calling back the number that had called them. The inbound agent had no idea why they were calling. It asked them to explain from scratch. Drivers, already skeptical of the original call, hung up frustrated. The fix required connecting inbound and outbound systems so the inbound agent could say, "Hi, we called earlier about setting up your driver app. Do you have a couple of minutes now?"

Inbound overflow becomes outbound. During peak hours, callers who can't get through request a callback. Now you're making an outbound call to someone who was inbound. The opening shifts. "Hi, this is Acme returning your call from earlier today. Is now a good time?"

This sounds simple, but the details matter. How long is the time between their call and your callback? If it's been two hours, they may have forgotten or solved the problem themselves. The agent needs to re-establish context. "You called earlier about a delivery question. Is that still something you need help with?" If it's been five minutes, you can be more direct. "Thanks for holding. I can help you now."
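The choice between the two callback openings can be driven by the elapsed time since the original call. The sketch below assumes a ten-minute cutoff, which is an illustrative threshold and not a figure from this chapter. The function callback_opening and the RECENT_CUTOFF constant are hypothetical names.

```python
from datetime import timedelta

# Assumed threshold: shorter gaps count as "still waiting", longer ones need context.
RECENT_CUTOFF = timedelta(minutes=10)


def callback_opening(elapsed: timedelta, topic: str) -> str:
    """Pick a callback opening based on how long ago the customer called."""
    if elapsed <= RECENT_CUTOFF:
        # The caller still has the problem in mind, so be direct.
        return "Thanks for holding. I can help you now."
    # After a long gap, restate the topic and check that help is still needed.
    return f"You called earlier about {topic}. Is that still something you need help with?"


print(callback_opening(timedelta(minutes=5), "a delivery question"))
print(callback_opening(timedelta(hours=2), "a delivery question"))
```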

Transfers between directions. Sometimes an inbound call needs to spawn an outbound action. A customer calls to schedule a technician visit, and the system needs to call the technician to confirm availability. Or a customer reports a problem that requires a callback from a specialist. These handoffs need explicit design. What context transfers? Who initiates the outbound leg? What happens if the outbound call fails?

Design for crossovers explicitly. They're more common than most teams expect, and the seams between inbound and outbound are where customer experience breaks down.

Metrics

Inbound and outbound agents are measured differently.

Inbound success means resolving issues without transferring. The metrics that matter are containment rate and first-call resolution. Secondary measures include satisfaction scores, handle time, and transfer rate.

Outbound success means reaching the right person and achieving the campaign objective. The metrics that matter are contact rate and conversion rate. Secondary measures include right-party contact rate, callback rate, and compliance adherence.

Don't apply inbound metrics to outbound agents. A high transfer rate is a failure for inbound but irrelevant for outbound. A low handle time is efficient for inbound but may signal incomplete conversations for outbound.

One insurance brokerage found the most revealing outbound metric wasn't contact rate but what happened after the transfer. They tracked the transfer-to-close rate as their north star. Better conversation design during qualification drove a 15-point lift in downstream conversions, even when contact rate stayed flat. The insight was that reaching more people mattered less than reaching them well.

Compliance

Inbound and outbound have different compliance concerns.

Inbound focuses on identity verification before disclosing account details, PCI compliance for payment handling, and call recording consent.

Outbound focuses on calling-hours regulations, consent management, do-not-call list compliance, and disclosure requirements. In the US, TCPA violations carry significant fines. Build compliance into the agent design from the start, not as an afterthought.

Marcus's team learned this when a driver complained about receiving a call at 6am. The system had used the driver's registered timezone, but the driver had moved. One complaint became an audit. Build timezone verification and calling-window logic into the campaign, not just the agent.
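A calling-window check at the campaign level might look like the sketch below. The 9am to 8pm window is a placeholder for whatever rules apply to your campaign, not a statement of any regulation, and the function can_call_now is hypothetical. The example uses fixed UTC offsets to stay dependency-free. A production system would more likely pass a verified IANA zone such as ZoneInfo("America/Los_Angeles") from the standard zoneinfo module.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Placeholder window; the real limits depend on jurisdiction and campaign type.
WINDOW_START = time(9, 0)
WINDOW_END = time(20, 0)


def can_call_now(now_utc: datetime, recipient_tz: tzinfo) -> bool:
    """Return True if the recipient's local time falls inside the calling window."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    # Convert to the recipient's verified timezone before comparing.
    local = now_utc.astimezone(recipient_tz)
    return WINDOW_START <= local.time() < WINDOW_END


now = datetime(2025, 1, 15, 14, 0, tzinfo=timezone.utc)
eastern = timezone(timedelta(hours=-5))  # 09:00 local
pacific = timezone(timedelta(hours=-8))  # 06:00 local, the complaint scenario

print(can_call_now(now, eastern))
print(can_call_now(now, pacific))
```

The check is only as good as the timezone it receives, which is why the stale registered timezone in Marcus's case slipped through. Verifying the timezone is a separate step that this sketch does not cover.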

What Marcus learned

Six months after his failed outbound pilot, Marcus ran both agents in production. The inbound agent handled delivery status. The outbound agent handled app installation reminders. They shared some underlying infrastructure, but almost none of the conversation design.

He kept the recordings from that first pilot. Fourteen calls, fourteen variations on the same confusion. The agent greeted drivers like they'd called in. The drivers wondered why their phone rang.

The inbound agent opened with a question. The outbound agent opened with an explanation. The inbound agent discovered intent. The outbound agent stated intent. The inbound agent measured containment. The outbound agent measured conversions.

Same company. Same platform. Different directions. Different designs.

Chapter 11: Voice and Persona

What you'll learn: How persona design affects conversion rates and the four dimensions you can adjust to match your brand.

Key takeaways:

Persona isn't cosmetic. It's a conversion lever. The same conversation flow with different persona settings can produce dramatically different results.
Four dimensions define persona. Warmth (friendly to neutral), formality (casual to professional), pace (relaxed to efficient), and assertiveness (suggestive to directive).
Match persona intensity to stakes. Low-stakes interactions can be warmer and more casual. High-stakes interactions need more formality and precision.
Define forbidden behaviors explicitly. Never claim success without tool confirmation. Never guess at identity. Never read back sensitive information beyond policy.
A persona must stay consistent under stress. When callers get frustrated or confused, the agent's character should hold steady, not shift into a different mode.

Marcus had fixed the opening. The agent now identified itself, stated its purpose, and asked permission. Drivers stopped hanging up in the first five seconds. But they were still hanging up in the first thirty.

He listened to more recordings. The agent explained the app's value clearly. It sent the install link. It confirmed when drivers completed the installation. Technically, everything worked. But something was wrong with how it sounded.

The agent spoke like a terms-of-service document read aloud. Accurate, professional, completely lifeless. Drivers couldn't tell if they were talking to a person or a machine, and in that uncertainty, they defaulted to ending the call. The ones who stayed on seemed to tolerate the agent rather than engage with it.

Marcus brought in Nina, a conversation designer who'd worked on chat products. She listened to ten recordings and diagnosed the problem in fifteen minutes. The agent had no personality. It was correct but forgettable. Drivers had no reason to trust it, like it, or stay on the line.

Nina rewrote the persona. Warmer, more direct, slightly informal. She added brief empathy cues and shortened the transitions. Same flow, same logic, different voice.

Conversion went from 14% to 22%. The agent said the same things. It just said them like a person.

Persona is a conversion lever

Persona determines whether a caller stays on the line or hangs up. It determines whether they trust the agent enough to share their date of birth or credit card number. It determines whether they feel helped or feel they have been processed.

The wrong persona doesn't just reduce satisfaction. It reduces containment. Callers who don't trust the agent demand a human. Callers who feel talked down to get defensive. Callers who find the tone confusing disengage.

Marcus had thought of persona as cosmetic. Something the marketing team cared about. Nina showed him it was operational. The 8-point lift in conversion came from changing the wording, not the workflows.

The four dimensions

Nina explained how she thought about voice persona. Four dimensions, each a dial you can turn.

Warmth is how approachable and empathetic the agent sounds. A collections agent needs lower warmth than a patient scheduling agent. But even collections agents shouldn't sound hostile. Marcus's original agent had warmth set to zero. It was polite but cold. Nina turned it up just enough that drivers felt acknowledged.

Formality is the register of the language. "I'd be happy to help with that" versus "Sure, let me look into it." Match the formality to your caller demographic. Enterprise B2B skews formal. Consumer retail skews casual. Truck drivers expecting a dispatcher call don't want corporate customer-service language. Nina made the agent sound like a colleague, not a help desk.

Pace is the rate at which the agent speaks and responds. Some callers want efficiency. Others need patience. The default should match your most common caller profile. Nina slowed the agent down slightly. Drivers were often on the road, distracted. Rushing them didn't help.

Assertiveness is the degree to which the agent guides the conversation. A scheduling agent should be more assertive. "Let's get you booked. What day works?" A support triage agent should be less assertive. "Take your time. Tell me what's going on." For installation reminders, Nina made the agent assertive but not pushy. "I'll text you the link right now. It takes about two minutes to install."

Every use case has its own mix. Nina didn't follow a formula. She listened to what was failing and adjusted the dials.

Matching intensity to stakes

Not every use case requires the same level of persona intensity.

Low intensity fits transactional, utilitarian tasks. Password resets, balance inquiries, and order status checks. The caller wants speed, not personality. Keep the persona minimal. Polite, efficient, forgettable.

Medium intensity fits service interactions where trust matters. Appointment scheduling, account changes, and return processing. The caller needs confidence that the agent is handling things correctly. The persona should be warm but professional.

High intensity fits persuasive or emotionally charged interactions. Sales calls, retention calls, and complaint handling. The caller needs to feel heard, understood, or convinced. The persona needs to be distinctive and adaptive.

Marcus's installation reminder sat at medium intensity. Drivers needed enough personality to trust the agent but not so much that it felt like a sales pitch. Nina found the balance by testing. She recorded twenty calls, listened for where drivers disengaged, and adjusted until the drop-off points disappeared.

Persona under stress

The real test of persona isn't how the agent behaves when everything goes smoothly. It's how it behaves when things go wrong.

When the caller is angry, the agent should acknowledge the emotion without matching it. "I understand this is frustrating" works. "I'm sorry you feel that way" sounds dismissive. Never argue. Never get defensive. Never tell the caller to calm down.

When the caller is confused, the agent should slow down, simplify, and rephrase. Don't repeat the same explanation louder. Use analogies. Offer to walk through it step by step.

When policy blocks the request, the agent should be honest and offer alternatives. "I can't process a refund after 90 days, but I can offer store credit or connect you with a manager who might have more options." Never say "that's our policy" and stop talking.

Nina stress-tested Marcus's agent with adversarial scenarios. What happens when a driver swears at it? When a driver asks the same question four times? When a driver says, "This is bullshit," and waits for a reaction? The agent needed responses for all of these that stayed in character. Warm but unflappable. Direct but not confrontational.

Those edge cases defined the agent's personality more than the happy path did.

Forbidden behaviors

Some behaviors break trust regardless of persona.

Never claim success without confirmation. The agent should never say "I've sent you the link" before the API returns successfully. If the system hasn't confirmed it, the agent doesn't know it happened. Marcus caught this one early. A driver complained that no link arrived. The agent had announced that it was sending it before the message actually went through.
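In code, this rule means the spoken confirmation is chosen only after the tool call returns. The sketch below assumes an SMS tool that returns True on success. The function send_link_and_report, the send_sms callable, and the failure wording are all hypothetical stand-ins for whatever integration you use.

```python
from typing import Callable


def send_link_and_report(send_sms: Callable[[str], bool], phone: str) -> str:
    """Call the SMS tool first, then choose what the agent is allowed to say."""
    try:
        delivered = send_sms(phone)
    except Exception:
        # A timeout or API error counts as "not confirmed", never as success.
        delivered = False

    if delivered:
        return "I've sent you the link."
    return "I wasn't able to send the link just now. Let me try another way."


# Stand-in tools for illustration; a real one would call the SMS provider.
def working_sms(phone: str) -> bool:
    return True


def failing_sms(phone: str) -> bool:
    raise TimeoutError("SMS provider did not respond")


print(send_link_and_report(working_sms, "+15550100"))
print(send_link_and_report(failing_sms, "+15550100"))
```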

Never guess at identity. The agent should never say "Hi Mike" based on the caller ID alone if the verification policy requires confirmation. Caller ID can be spoofed. Misidentifying someone creates compliance risk and erodes trust.

Never read back more than policy allows. If policy says "confirm the last four digits," the agent should never read the full card number, even if it has access. Design the prompt to enforce data minimization.
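Alongside the prompt, data minimization can be enforced in the tool layer so the full number never reaches the text the agent speaks. This is a minimal sketch assuming a last-four-digits policy. The helper names last_four and spoken_card_confirmation are hypothetical.

```python
def last_four(card_number: str) -> str:
    """Return only the last four digits, ignoring spaces and dashes."""
    digits = [ch for ch in card_number if ch.isdigit()]
    if len(digits) < 4:
        raise ValueError("card number has fewer than four digits")
    return "".join(digits[-4:])


def spoken_card_confirmation(card_number: str) -> str:
    """Build the only card reference the agent is allowed to say aloud."""
    # Space the digits so a speech engine reads them one at a time.
    return f"the card ending in {' '.join(last_four(card_number))}"


print(spoken_card_confirmation("4111 1111 1111 1234"))
```

If the tool returns only this phrase instead of the raw number, the model has nothing more to read back even when a caller asks for it.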

Never invent information. If the agent doesn't know something, it should say so. "I don't have that information, but I can connect you with someone who does" is always better than a guess.

Nina built these constraints into the prompt as hard rules. The persona could flex. The forbidden behaviors couldn't.

Disclosure

Should callers know they're speaking with an AI?

Some jurisdictions require disclosure. Some industries have emerging regulatory guidance. Even where not legally required, disclosure builds trust with some audiences and erodes it with others.

Decide your policy before launch. If you disclose, do it naturally in the opening. "Hi, I'm an automated assistant with Acme Logistics, calling to help you set up the driver app." If you don't disclose, ensure the agent never claims to be human if asked directly.

Marcus's team chose to disclose. Drivers appreciated the honesty. When they asked, "Is this a real person?" the agent confirmed it was automated and offered to connect them with a human if preferred. Most didn't take the offer. The disclosure built trust rather than undermining it.

Consistency across channels

If your voice agent shares a brand with chat, email, or web experiences, the persona should feel recognizably similar. Not identical. Voice requires shorter responses and more conversational language than chat. But the personality, values, and tone should match.

A driver who texts with your support bot and then gets a phone call should feel like they're dealing with the same company. Not two different organizations with two different personalities.

Nina audited the existing chat scripts before finalizing the voice persona. She borrowed phrases that worked. She flagged inconsistencies. The goal was to express one personality differently in each channel, not multiple personalities.

What Nina taught Marcus

Three months after Nina joined, Marcus's installation reminder agent was the highest-converting outbound campaign in the company. Same flow he'd built originally. Same API calls, same logic, same compliance checks. Different voice.

He asked Nina what she'd actually changed. She pulled up the original prompt and the revised one. The differences looked minor. A few words here and there. Some phrasing adjustments. Nothing structural.

But the recordings told a different story. In the original, drivers responded with short, guarded answers. In the revised version, they talked. They asked questions. They stayed on the line.

The persona wasn't cosmetic. It was the difference between a driver tolerating the call and a driver completing the install. Marcus had built the machine. Nina gave it a voice.

Chapter 12: Prompt Engineering for Voice

What you'll learn: How to write system prompts that produce effective voice agents, with specific techniques for spoken delivery.

Key takeaways:

Specificity is the whole game. Vague prompts produce vague agents. "Be helpful" means nothing. "Confirm the appointment date before ending the call" means something.
Keep responses to two sentences maximum. Read every response aloud. If it sounds like a paragraph, it's too long for voice.
Structure prompts in five sections. Identity (who the agent is), style (how it speaks), task (what it does), guardrails (what it must not do), and tool instructions (how to use integrations).
Tool-first truth is non-negotiable. The agent should never confirm success before the tool confirms success. "I've sent your link" requires the SMS tool to return success first.
Manage conversation state explicitly. Track identity status, discovered intent, collected slots, last tool call, confirmed details, and transfer reason. Don't rely on the model to remember.

Nina had fixed the persona. The agent now sounded human, warm, and direct, and drivers stayed on the line. But Marcus kept finding edge cases where it broke. A driver asked about pay rates, and the agent tried to answer. Another driver gave a date in a format the agent misunderstood. A third became frustrated when the agent said, "I've sent you the link," but no link arrived.

Each failure traced back to the same place. The system prompt.

Nina had written prompts for chatbots before, but voice was different. A chatbot could produce three paragraphs with bullet points and let the user read carefully. A voice agent had to produce short responses that worked when heard once at conversational speed. The constraints were tighter. The failure modes were less forgiving.

She rewrote the prompt from scratch. Not just the persona this time. The entire instruction set.

Brevity is non-negotiable

The first thing Nina learned was that voice prompts must explicitly enforce brevity. She added a constraint to the prompt that responses should be two sentences or fewer for routine turns. Anything longer and drivers interrupted or tuned out.

In text, you can be comprehensive. In voice, you must be precise. "I found two options. Thursday afternoon or Friday morning." Not "I checked our availability system and found that we have openings on Thursday in the afternoon around 2pm or alternatively on Friday morning if that works better for your schedule."

Nina tested the prompt by reading every response aloud. If she ran out of breath, the response was too long.

Spoken structure, not written structure

The second lesson was that voice responses can't use formatting that only works visually. No bullet points. No numbered lists. No markdown. The model had to produce responses that sounded natural when spoken.

"I found two options. Thursday afternoon or Friday morning" works. "1. Thursday 2pm 2. Friday 10am" does not. The synthesis engine would read it literally. One period Thursday, 2pm, 2 period Friday, 10am. Drivers would have no idea what the agent meant.

Nina added explicit instructions to the prompt. Never use lists. Never use numbered steps in speech. Present options conversationally.

Tool-first truth

The third lesson came from the driver who never received the link.

Marcus's agent had said, "I've sent you the link" before the SMS API returned a success response. The tool call failed silently. The agent didn't know. The driver waited for a text that never came.

Nina added a hard rule to the prompt. Never narrate success before a tool confirms it. The agent should say "Let me send you that link now" while calling the tool, then confirm only after the API returns successfully. "Done. You should have it in a few seconds."

This applied to every action. Never say "I've booked your appointment" until the booking API confirms the booking. Never say "I've updated your address" until the system acknowledges. If the tool hasn't confirmed it, the agent doesn't know it happened.
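A minimal sketch of this rule in code, assuming the platform lets application code sit between the tool call and the agent's next line. The names ToolResult and send_link_turns are hypothetical, and send_sms stands in for whatever SMS integration is actually used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolResult:
    ok: bool
    detail: str = ""


def send_link_turns(send_sms: Callable[[str], ToolResult], phone: str) -> list[str]:
    """Return the lines the agent speaks, in order, for one send-link action."""
    # Announce intent only. Nothing here claims the link was sent.
    spoken = ["Let me send you that link now."]
    try:
        result = send_sms(phone)
    except Exception as exc:
        # A raised error counts as a failure, never as silent success.
        result = ToolResult(ok=False, detail=str(exc))
    if result.ok:
        # Success is narrated only after the tool has confirmed it.
        spoken.append("Done. You should have it in a few seconds.")
    else:
        spoken.append("That didn't go through. I can try again or connect you with someone.")
    return spoken
```

The failure wording is illustrative. The point is the ordering: the confirmation line is unreachable unless the tool reports success.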

The five sections

Nina organized her prompts into five sections. She'd tried unstructured prompts before. They worked until they didn't. Clear sections made the prompt easier to debug when something failed.

Identity defined who the agent was. Name, role, organization, disclosure policy. "You are an automated assistant for Acme Logistics, helping drivers set up the mobile app."

Style defined how the agent spoke. Tone, formality, response length, conversational patterns. "Speak in a direct, friendly tone. Keep responses under two sentences. Sound like a colleague, not a help desk."

Task defined what the agent did. Supported intents, the flow for each, and decision logic for branching. "Your job is to help drivers install the app. First, confirm you're speaking with the right person. Then explain what you're calling about. Then send the install link. Then confirm they received it."

Guardrails defined what the agent must never do. Hard boundaries, forbidden topics, escalation triggers. "Never discuss pay rates or contract terms. If the driver asks, say you're not able to help with that but can connect them with dispatch."

Tool instructions defined how the agent used its tools. When to call each tool, what inputs to collect first, and how to interpret results. "Before calling send_install_link, confirm the driver's phone number. If the tool returns an error, apologize and offer to try again or transfer to a human."

Marcus had asked why she kept these sections separate rather than writing a single continuous prompt. Nina showed him what happened when instructions blended together. The model followed some and ignored others. With clear sections, she could isolate which instructional category was failing.
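One way to keep the five sections separate in practice is to store them as separate strings and assemble the prompt at deploy time. The sketch below assumes that approach. The build_prompt helper and the bracketed section labels are illustrative choices, not a format any platform requires.

```python
SECTION_ORDER = ("Identity", "Style", "Task", "Guardrails", "Tool instructions")


def build_prompt(sections: dict[str, str]) -> str:
    """Join the five sections in a fixed order, failing loudly if one is missing."""
    missing = [name for name in SECTION_ORDER if not sections.get(name, "").strip()]
    if missing:
        raise ValueError("missing prompt sections: " + ", ".join(missing))
    # The "[Name]" label format is an assumption made for this example.
    return "\n\n".join(f"[{name}]\n{sections[name].strip()}" for name in SECTION_ORDER)


prompt = build_prompt({
    "Identity": "You are an automated assistant for Acme Logistics, "
                "helping drivers set up the mobile app.",
    "Style": "Speak in a direct, friendly tone. Keep responses under two sentences.",
    "Task": "Confirm you're speaking with the right person. Explain the call. "
            "Send the install link. Confirm they received it.",
    "Guardrails": "Never discuss pay rates or contract terms. "
                  "Offer to connect the driver with dispatch instead.",
    "Tool instructions": "Before calling send_install_link, "
                         "confirm the driver's phone number.",
})
```

Because each section is its own value, a failing behavior can be traced to one string and changed without touching the others.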

Managing state

Voice conversations carry state that accumulates across turns. The driver's identity status. The intent they expressed. The phone number they confirmed. The result of the last tool call. What's already been said.

Nina made state tracking explicit in the prompt. "Remember what the driver has already told you. Never ask for information you've already collected unless they want to change it."

Without this instruction, the model sometimes forgot that it had already confirmed the phone number and asked again. Or it skipped confirmation of a detail from three turns ago. Explicit state instructions fixed these loops.

Some platforms offered structured state management with variables that persisted across turns. Nina used these when available. They were more reliable than depending on the model's memory of conversation history alone.
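As an illustration of structured state, here is a small sketch that tracks collected and confirmed details outside the model. The CallState class and its method names are hypothetical, and real platforms expose their own variable mechanisms.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallState:
    identity_verified: bool = False
    intent: Optional[str] = None
    slots: dict[str, str] = field(default_factory=dict)  # values collected so far
    confirmed: set[str] = field(default_factory=set)     # slots the caller confirmed
    last_tool_result: Optional[str] = None
    transfer_reason: Optional[str] = None

    def collect(self, slot: str, value: str) -> None:
        """Store a value. A changed value must be confirmed again."""
        if self.slots.get(slot) != value:
            self.confirmed.discard(slot)
        self.slots[slot] = value

    def confirm(self, slot: str) -> None:
        if slot not in self.slots:
            raise KeyError(f"cannot confirm uncollected slot: {slot}")
        self.confirmed.add(slot)

    def next_step(self, required: list[str]) -> Optional[str]:
        """Return what to do next, or None when every required slot is confirmed."""
        for slot in required:
            if slot not in self.slots:
                return f"ask:{slot}"
            if slot not in self.confirmed:
                return f"confirm:{slot}"
        return None
```

With state held this way, the agent never re-asks for a confirmed phone number, and a corrected number automatically triggers a fresh confirmation.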

Output for spoken delivery

Synthesis engines had quirks that the prompt needed to handle.

Numbers had to be written in a speakable format. "January 15th," not "1/15." "Two hundred dollars," not "$200." Nina added instructions to spell out dates, times, and currency amounts.

Abbreviations and acronyms could trip up the synthesis engine. If the business used "ETA," Nina specified whether the agent should say "E-T-A" or "estimated time of arrival." Different audiences expected different things.

Punctuation controlled pacing. Commas created pauses. Periods created stops. Without punctuation, "Let me check that for you one moment" ran together and felt rushed. Nina used punctuation deliberately to shape how the agent's speech would land.
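Prompt instructions can be backed by a post-processing step that rewrites text before it reaches the synthesis engine. The sketch below is one possible version under stated assumptions: dates are US-style month/day, only whole-dollar amounts are handled, and the function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999."""
    if not 0 <= n <= 999_999:
        raise ValueError("out of supported range")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {number_to_words(rest)}" if rest else "")
    thousands, rest = divmod(n, 1000)
    return f"{number_to_words(thousands)} thousand" + (f" {number_to_words(rest)}" if rest else "")


def speak_currency(text: str) -> str:
    """Rewrite whole-dollar amounts like $200. Amounts with cents are left as-is."""
    def repl(match: re.Match) -> str:
        amount = int(match.group(1).replace(",", ""))
        return f"{number_to_words(amount)} {'dollar' if amount == 1 else 'dollars'}"
    return re.sub(r"\$(\d+(?:,\d{3})*)(?![.,]?\d)", repl, text)


def ordinal(n: int) -> str:
    suffix = "th" if 10 <= n % 100 <= 20 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def speak_dates(text: str) -> str:
    """Rewrite month/day pairs like 1/15 as 'January 15th' (assumes US ordering)."""
    def repl(match: re.Match) -> str:
        month, day = int(match.group(1)), int(match.group(2))
        if 1 <= month <= 12 and 1 <= day <= 31:
            return f"{MONTHS[month - 1]} {ordinal(day)}"
        return match.group(0)  # not a plausible date, leave untouched
    return re.sub(r"(?<![\d/])(\d{1,2})/(\d{1,2})(?![\d/])", repl, text)
```

A step like this is a safety net, not a replacement for the prompt instruction. Ambiguous inputs such as fractions or day/month dates still need a policy decision.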

Latency masking

Tool calls and model inference took time. The driver heard silence. Silence longer than about 800 milliseconds felt broken.

Nina's prompt instructed the agent to fill processing delays with short acknowledgments. "Let me pull that up." "One moment while I check." "Looking into that now."

The key was keeping these fillers brief. A long filler that rambled while waiting for a tool response sounded worse than a short filler followed by silence. "So let me just take a quick look here and see what I can find for you in the system" was filler pretending to be speech. "Let me check" followed by a pause was honest and professional.
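Where the platform allows it, the same idea can be enforced around the tool call itself: speak one short filler only if the tool is still running after a brief delay. This is a sketch using asyncio. The call_with_filler name, the speak callback, and the half-second delay are assumptions chosen so the filler lands before the roughly 800 millisecond mark.

```python
import asyncio

FILLER_DELAY_SECONDS = 0.5  # assumed value, below the ~800 ms silence threshold


async def call_with_filler(tool_call, speak, filler="Let me check.",
                           delay=FILLER_DELAY_SECONDS):
    """Await tool_call; if it is still running after `delay`, speak one short filler."""
    task = asyncio.ensure_future(tool_call)
    done, _pending = await asyncio.wait({task}, timeout=delay)
    if not done:
        speak(filler)  # one brief line, then silence until the tool returns
    return await task
```

Fast tool calls produce no filler at all, and slow ones get exactly one short line rather than a ramble.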

Common patterns

Certain prompt patterns appeared in almost every voice agent Nina built.

Gating prevented the agent from skipping steps. "Do not call the send_link tool until you have confirmed the phone number with the driver."

Fallback handled confusion gracefully. "If you cannot determine what the driver needs after two attempts, say 'Let me connect you with someone who can help' and transfer."

Scope boundaries kept the agent out of trouble. "If the driver asks about pay or contracts, say 'I'm not able to help with that, but I can transfer you to someone who can.' Do not attempt to answer."

Confirmation prevented errors on important details. "After collecting the phone number, repeat it back and ask, 'Is that right?' Only proceed after the driver confirms."

Nina kept a library of these patterns. Each new agent started with the relevant patterns copied in, then customized for the specific use case.
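Gating and fallback can also be backed by code, so a prompt that slips does not skip a step. The sketch below assumes tool calls pass through application code. GateError, send_link_gated, and after_failed_understanding are hypothetical names.

```python
class GateError(Exception):
    """Raised when a tool is called before its precondition is met."""


def send_link_gated(state: dict, send_link):
    # Gating: refuse the tool call until the phone number has been confirmed.
    if not state.get("phone_confirmed"):
        raise GateError("phone number not confirmed")
    return send_link(state["phone"])


def after_failed_understanding(failed_attempts: int, max_attempts: int = 2):
    # Fallback: after two failed attempts, stop re-asking and transfer.
    if failed_attempts >= max_attempts:
        return ("transfer", "Let me connect you with someone who can help.")
    return ("reprompt", "Sorry, I didn't catch that. What can I help you with?")
```

The reprompt wording is illustrative. The transfer line matches the fallback pattern quoted above.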

Iteration

Prompts were never done on the first draft. Nina's process was consistent.

Write the initial prompt based on the conversation design. Test with fifteen to twenty simulated conversations covering happy paths, edge cases, and adversarial inputs. Listen to the responses. Does the agent sound natural? Does it follow the flow? Does it respect the guardrails?

When something failed, Nina fixed the specific instruction that caused the failure. She didn't rewrite the whole prompt. A targeted change was easier to test than a broad rewrite.

After each change, she retested to confirm the fix didn't break something else. Prompt changes were high-leverage but high-risk. A single word change could shift behavior across thousands of calls.

What Nina built

Three months after joining Marcus's team, Nina had built prompts for four different voice agents. The installation reminder. A delivery status line. A callback scheduler. A driver onboarding flow.

Each prompt followed the same five-section structure. Each used the same patterns for gating, fallback, scope boundaries, and confirmation. The differences were in the specifics. The identity, the task flow, and the guardrails for each use case.

Marcus asked if he could edit the prompts himself. Nina showed him how. The Task and Style sections were safe for business owners to modify. The Guardrails and Tool Instructions required more care. She separated the sections so Marcus could adjust conversation flow without accidentally breaking compliance rules.

At an insurance brokerage using similar agents, this separation enabled something unexpected. Sales training leaders began iterating on conversation flows themselves. They edited Task and Style while engineers owned Guardrails and Tool Instructions. Prompt maintenance stopped being a bottleneck for engineering. It became a business capability.

The difference between a vague prompt and a specific one was the difference between an agent that sometimes worked and an agent that worked predictably. "You are a helpful assistant. Help the customer with their appointment." That was a vague prompt. It produced a vague agent.

"You are Maya, a scheduling assistant for Acme Health. Your job is to book, reschedule, or cancel appointments. Speak in a warm, professional tone. Keep responses under two sentences. Verify the patient's date of birth before accessing their account. Never confirm a booking until the schedule_appointment tool returns success."

That was a specific prompt. It produced an agent that behaved the same way on call one thousand as it did on call one. Nina had learned that specificity was the whole game.

Chapter 13: Edge Cases and Escalation

What you'll learn: How to design for the calls that don't follow the happy path, which is where production actually lives.

Key takeaways:

What separates a production agent from a demo is how it handles everything else. The happy path is a trail through the middle. The edges are where production lives.
Find edge cases in three places. Historical call data shows what humans handled. Frontline interviews reveal weird situations. Transcript analysis catches what testing missed.
Every edge case gets one of two treatments. Handle it (add logic to the agent) or escalate it (define the transfer trigger). No edge case should produce undefined behavior.
Hard boundaries require immediate escalation. Threats of harm, legal demands, fraud indicators, requests for actions outside scope, and repeated authentication failures.
Prevent duplicate actions with request IDs and confirmation locks. Every state-changing tool call needs a unique identifier to prevent duplicate bookings, payments, or messages from retries.

The agent had been in production for two weeks when Marcus started getting complaints.

A driver called back angry because the agent told him the app was installed when it wasn't. Another driver spent six minutes trying to explain that his phone didn't have enough storage, while the agent kept offering to resend the link. A third asked about his insurance paperwork, and the agent tried to help, only to wander into territory it knew nothing about.

Happy paths are easy. Any reasonably configured agent can handle a caller who clearly states their intent, provides the right information, and follows the expected flow. What separates a production agent from a demo is how it handles everything else.

Marcus had tested the happy path exhaustively. The driver answers, the agent explains, the link gets sent, the app gets installed, everyone's happy. Five steps, five minutes. But production surfaced scenarios he hadn't imagined. Drivers who'd already installed the app yesterday. Drivers whose phones were too old. Drivers who wanted to talk about entirely different topics. Each unhandled edge case became a failed call, and each failed call became a complaint.

Finding edge cases

Don't try to imagine edge cases in a conference room. Find them in the data.

Marcus started with the calls that went wrong. He pulled every conversation that exceeded average handle time, every call that ended without a successful install, and every instance where the driver hung up mid-conversation. Patterns emerged. Drivers saying "I already did this" when they hadn't. Drivers asking questions outside the agent's scope. Drivers with technical problems that the agent couldn't solve.

He talked to the dispatch team, who'd handled driver calls before the agent existed. They knew things he didn't. Which drivers always had problems. Which questions came up every week. Which situations required a supervisor. The dispatch team had built up institutional knowledge about edge cases because they handled them every day. Marcus had been designing from the happy path. They'd been living in the exceptions.

A staffing marketplace running thousands of screening calls found the same thing. At high volume, transcript analysis revealed edge-case clusters invisible at lower scales. Candidates giving contradictory answers across questions. Candidates who'd already been screened getting called back. Candidates applying for roles in locations they couldn't reach. These patterns only became visible when there was enough data to see them.

Marcus built an edge case catalog. For each one, he documented what triggered it, how often it occurred, the current resolution, and whether the agent should handle it or escalate it.

Handle or escalate

Not every edge case needs an automated solution.

The decision was simple in principle. Handle it if the resolution is structured, low-risk, and within the agent's scope. Escalate it if the resolution requires judgment, involves high stakes, or falls outside the scope.

The driver who said, "I already installed it," could be handled. Check the backend. Confirm whether the installation actually completed. Respond based on what the system showed. Structured, low-risk, within scope.

The driver who asked about insurance paperwork needed to be escalated. Not because the agent couldn't read the policy, but because a wrong answer created problems. Contract questions, pay disputes, legal territory. Transfer to someone with authority.

Marcus made a rule for himself. When in doubt, escalate. He could always move edge cases from "escalate" to "handle" in future versions as he understood them better. Moving them the other direction, after something went wrong, was much harder.

Hard boundaries

Some situations required immediate transfer. No retry loops. No attempts to resolve. The agent stopped and handed it off.

Marcus defined five hard boundaries for his agent.

If the driver became abusive or threatening, transfer immediately. The agent would acknowledge calmly. "I understand you're frustrated. Let me connect you with someone who can help." Then transfer. No delay.

If the driver mentioned lawyers, lawsuits, or complaints to regulators, transfer immediately. Legal threats went to a specific queue where someone with authority could respond.

If identity couldn't be verified after the maximum attempts, transfer. The agent couldn't proceed without knowing who it was talking to.

If the driver asked about money, contract terms, or pay rates, transfer. These were high-stakes topics where a wrong answer could expose the company to liability.

If the issue required expertise the agent didn't have, say so and transfer the case. Don't pretend.

At an insurance brokerage, some hard boundaries came from regulation, not design preference. Any conversation that crossed from eligibility questions into specific plan recommendations required immediate transfer to a licensed agent. The AI could have answered. But an unlicensed entity providing that guidance was a compliance violation. The boundary existed because the law required it.

Softer triggers

Beyond hard boundaries, Marcus built escalation triggers that detected when a conversation was going off the rails.

If the driver's tone shifted from neutral to frustrated, lower the threshold for transfer. An angry driver with a routine issue should reach a human sooner than a calm driver with the same issue.

If the agent asked the same question three times, escalate. Nina had built the three-attempt ceiling into the prompt. Three failures on the same slot meant the agent and the driver had reached an impasse.

If transcription confidence dropped below a threshold, offer a transfer. Poor audio quality, heavy background noise, and strong accents were what the system struggled with most. The agent should acknowledge the difficulty rather than continuing to mishear.

If the call exceeded ten minutes for a task that normally took three, something was wrong. Offer to escalate.

These triggers weren't about the agent being unable to help. They were about recognizing when continuing would make things worse.
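The four triggers can be sketched as a single check that runs each turn. The three-attempt ceiling and ten-minute limit come from the text. The confidence threshold of 0.6 and the lower repeat limit for frustrated callers are placeholder assumptions, as are the CallSignals fields.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    frustrated: bool                 # tone shifted from neutral to frustrated
    repeats_on_slot: int             # times the agent asked the same question
    transcription_confidence: float  # 0.0 to 1.0, as reported by the transcriber
    elapsed_minutes: float


def escalation_reasons(signals: CallSignals, *, max_repeats: int = 3,
                       frustrated_max_repeats: int = 2,
                       min_confidence: float = 0.6,
                       max_minutes: float = 10.0) -> list[str]:
    """Return every soft trigger that fired. An empty list means keep going."""
    reasons = []
    # A frustrated caller gets a lower threshold for the same issue.
    repeat_limit = frustrated_max_repeats if signals.frustrated else max_repeats
    if signals.repeats_on_slot >= repeat_limit:
        reasons.append("repeated_question")
    if signals.transcription_confidence < min_confidence:
        reasons.append("low_transcription_confidence")
    if signals.elapsed_minutes > max_minutes:
        reasons.append("call_too_long")
    return reasons
```

Returning reasons rather than a bare yes or no gives the transfer step something to pass along to the human who picks up.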

How to transfer

How Marcus transferred mattered as much as when.

The best experience was a warm transfer. The agent briefed the next handler before connecting the driver. "I'm transferring you to dispatch. I've let them know you're calling about the app installation and that you've had trouble with storage space, so you won't have to repeat yourself."

This required integration with the contact center's transfer system and the ability to pass context. Marcus worked with engineering to build it. The driver's identity, what they'd asked for, what the agent had tried, and why it was transferring. All packaged and sent to whoever picked up next.
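The context package can be sketched as a small structured record. The field names below mirror the four items listed above but are otherwise hypothetical, as is the example caller. The actual payload shape depends on the contact center's transfer system.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TransferContext:
    caller_name: str
    identity_verified: bool
    request: str          # what the driver asked for
    attempted: list[str]  # what the agent already tried
    transfer_reason: str  # why the agent is handing off

    def briefing(self) -> str:
        """One-line summary for the human who picks up."""
        tried = ", ".join(self.attempted) or "nothing yet"
        status = "verified" if self.identity_verified else "not verified"
        return (f"Caller: {self.caller_name} ({status}). Request: {self.request}. "
                f"Tried: {tried}. Reason for transfer: {self.transfer_reason}.")


ctx = TransferContext(
    caller_name="Sam Rivera",  # hypothetical driver
    identity_verified=True,
    request="app installation",
    attempted=["sent install link"],
    transfer_reason="phone is out of storage space",
)
payload = json.dumps(asdict(ctx))  # what would be sent to the transfer system
```

If the human can read the briefing before saying hello, the driver does not have to repeat anything.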

The worst experience was a cold transfer. The driver was dumped into a queue and had to start over from scratch. Marcus used this only as a fallback when warm transfer wasn't possible.

A third option was the scheduled callback. The agent couldn't resolve it, and no one was available for a warm transfer. "I can't help with that directly, but I can have someone call you back within the hour. Would that work?" This preserved the driver's time and avoided another hold queue.

The most common complaint about transfers was "I already explained this." Marcus made eliminating that complaint a priority. If the driver had to repeat themselves, the transfer had failed even if the call eventually succeeded.

Thresholds by goal

The right escalation threshold depended on what Marcus was optimizing for.

If the goal was cutting costs, the threshold should be higher. Transferring defeated the purpose. Invest in handling more edge cases within the agent. Only escalate when resolution truly isn't possible. But never sacrifice safety for containment.

If the goal was fixing customer experience, the threshold should be lower. A bad automated experience was worse than a human-handled one. Escalate early if there is any risk of frustrating the caller. Better to transfer a call the agent could have handled than to botch one it couldn't.

If the goal was to drive revenue, the threshold depended on the context. A driver close to completing the install should stay with the agent. A driver raising objections that the agent couldn't handle should reach a human who could close.

Marcus's installation agent was primarily about driving installs, which meant driving revenue. He set thresholds contextually. If the driver was progressing through the flow, keep them with the agent. If they were stuck or frustrated, transfer before the call became a total loss.

Preventing duplicates

Some edge cases weren't about conversation flow. They were about the agent doing something twice.

A driver called, the agent sent the install link, the call dropped, the driver called back, and the agent sent another link. Now the driver had two texts and was confused. Worse still, an agent could book the same appointment twice, process the same cancellation twice, or send the same confirmation twice.

Marcus worked with engineering to add request IDs. Every action the agent initiated got a unique identifier. The backend rejected duplicates. If the agent tried to send a link that had already been sent, the system stopped it.

Nina added confirmation locks to the prompt. Once the driver confirmed an action and the tool executed it, the agent marked that action complete. Even if the conversation looped back to the same topic, the agent wouldn't re-execute.

These were engineering guardrails, not conversation design. But they prevented a class of edge cases that conversation design alone couldn't solve.
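A minimal sketch of the request ID idea, assuming the ID is derived from the driver, the action, and the campaign so that a callback after a dropped call maps to the same ID. That derivation is one design choice among several. LinkSender is an in-memory stand-in for the real backend.

```python
import hashlib


def request_id_for(driver_id: str, action: str, campaign: str) -> str:
    """Derive a stable ID so a retry of the same action produces the same key."""
    raw = f"{driver_id}:{action}:{campaign}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class LinkSender:
    """Stand-in backend that ignores duplicate requests by ID."""

    def __init__(self):
        self._results = {}
        self.messages_sent = 0

    def send(self, request_id: str, phone: str) -> dict:
        if request_id in self._results:
            # Duplicate: replay the earlier result and send nothing new.
            return self._results[request_id]
        self.messages_sent += 1
        result = {"status": "sent", "phone": phone}
        self._results[request_id] = result
        return result


sender = LinkSender()
rid = request_id_for("driver-42", "send_install_link", "spring-rollout")
first = sender.send(rid, "555-0100")
second = sender.send(rid, "555-0100")  # driver called back after a dropped call
```

Here the second call returns the first result and no extra text goes out. A deliberate resend would need a new ID, which keeps "send it again" an explicit decision.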

What Marcus learned

Three months into production, Marcus had handled hundreds of edge cases. Some he'd automated. Some he'd built escalation paths for. Some he'd decided weren't worth solving because they happened once a month.

The edge case catalog kept growing. Every week, something new surfaced. A driver calling from a different phone number than the one on file. A driver who spoke Spanish when the agent only handled English. A driver who wanted to install the app for a colleague.

He stopped thinking of edge cases as problems to eliminate and started thinking of them as territory to map. The happy path was a trail through the middle. The edges were where production actually lived.

His best agents weren't the ones who handled every scenario. They were the ones who knew what they could handle, knew what they couldn't, and transferred gracefully when they hit their limits.

Chapter 14: Multi-Agent Architectures

What you'll learn: When to split capabilities across multiple agents and the four patterns for coordinating them.

Key takeaways:

Packing everything into a single prompt creates agents that hallucinate more, cost more, respond more slowly, and are harder to improve. Specialization wins at scale.
Four coordination patterns exist. Specialist agents by intent, supervisor with routing, sequential handoff, and parallel consultation. Each fits different workflow shapes.
Draw boundaries by intent (billing vs. support), by workflow phase (verification vs. resolution), or by compliance domain (licensed vs. unlicensed advice).
Handoffs should be invisible to callers. The caller experiences one continuous conversation even when multiple agents are involved.
Start with a single agent. Prove value first. Then specialize when you understand where the boundaries should fall. Multi-agent becomes standard at 200k+ calls per month.

Marcus's installation agent had been running for four months when leadership approved the roadmap for three more. A COI verification agent. A support triage agent. An install verifier that would run after drivers completed the app setup.

His first instinct was to expand the existing agent. Add the new intents to the same prompt. Add the new tools to the same configuration. One agent, four use cases. It would be simpler to manage.

Nina talked him out of it.

She'd seen this pattern before. A single agent that starts focused and capable, then grows as the business adds requirements. More intents, more tools, more edge cases, more guardrails. The prompt balloons to three thousand words. The agent starts hallucinating on edge cases it used to handle cleanly. Response latency creeps up. Testing becomes a nightmare because any change might break something unrelated.

Packing everything into a single prompt creates agents that hallucinate more, cost more per call, respond more slowly, and are harder to improve. At some point, the cost of coordinating multiple agents becomes lower than maintaining a single sprawling agent.

Marcus needed to learn when that point arrived and how to design for it.

When single-agent works

A single agent works well when the scope is one workflow with a clear, linear flow. When the intent set is narrow, fewer than seven intents. When the tool count is small, fewer than five tools. When compliance requirements are uniform across the entire conversation.

Marcus's installation agent fit all four criteria. One workflow. Four intents. Two tools. Consistent compliance requirements throughout. A single agent handled it cleanly.

The support triage agent would not fit those criteria. It needed to handle scheduling, billing inquiries, account changes, and technical issues. Different intents required different domain knowledge. Some topics had strict compliance requirements while others were flexible. The tool count would exceed ten.

Nina explained the tradeoffs. A focused prompt with three tools outperforms a sprawling prompt with fifteen. When scheduling accuracy drops, you want to know exactly which prompt and which tools to investigate. With a monolithic agent, everything is tangled together.

The tipping point varies. High-volume deployments, 200,000 calls per month or more, feel the cost of hallucinations more acutely. But even smaller deployments benefit from multi-agent when the intent space is diverse. Marcus's support line would handle about 50,000 calls per month. Still worth splitting.

The four patterns

Nina walked Marcus through the architectures she'd seen work.

Specialist agents by intent were the most common. Each agent owns a domain. Scheduler. Billing. Technical support. FAQ. A front-door agent or IVR menu routes the caller to the right specialist.

This worked when intents were clearly separable and callers typically needed one specialist per call. It failed when callers frequently needed multiple specialists in a single conversation. Each transfer added friction.

An automotive marketplace had deployed this pattern across the full car-buying lifecycle. Specialist agents for financing questions, maintenance scheduling, and trade-in valuations. Each with its own domain knowledge and tool set, running across five countries with over a thousand localized configurations. The specialist pattern was the only way to keep each agent focused enough to perform.

Supervisor with routing used a lightweight router agent to handle the opening: greetings, identity verification, and intent discovery. Once the router identified the caller's need, it handed off to the appropriate specialist. The router didn't resolve anything itself. It just classified the intent and delegated the conversation.

This worked for inbound contact centers where intent was unknown at the start of the call. The router could be simple (rules-based intent classification) or sophisticated (an LLM with nuanced routing logic).

Nina recommended keeping the router's prompt minimal. Its only job was getting the caller to the right place.
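At the simple end of that range, a rules-based router might look like the sketch below. The keyword lists and specialist names are made up for illustration, and substring matching is deliberately naive. When more than one intent matches, the first becomes the target and the rest are queued so they are not lost.

```python
# Hypothetical keyword rules. Order decides which intent is served first.
ROUTES = {
    "scheduling": ("appointment", "reschedule", "book", "cancel"),
    "billing": ("bill", "invoice", "charge", "payment"),
    "technical": ("error", "crash", "install", "login"),
}


def classify(utterance: str) -> list[str]:
    """Return every intent whose keywords appear in the utterance."""
    text = utterance.lower()
    return [intent for intent, keywords in ROUTES.items()
            if any(keyword in text for keyword in keywords)]


def route(utterance: str) -> dict:
    intents = classify(utterance)
    if not intents:
        # Nothing matched: ask a clarifying question instead of guessing.
        return {"target": "clarify", "queued": []}
    return {"target": intents[0], "queued": intents[1:]}
```

The router resolves nothing itself. It only names a destination, which keeps its logic small enough to test exhaustively.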

Sequential handoff passed callers through agents in sequence. Agent A verified identity. Agent B handled the issue. Agent C closed the conversation. Each completed its piece and advanced the call.

This worked for workflows with distinct phases. Verification, then service, then wrap-up. Clean separation of concerns. Each agent is testable independently.

Parallel consultation kept the primary agent on the call while querying a secondary agent behind the scenes. Checking with a specialized FAQ agent. Running a compliance verification. The caller only heard one voice.

Less common and more complex to implement. But powerful when the primary agent needed expert input without transferring the caller.

Marcus chose supervisor with routing for the support line. Unknown intent at call start. Multiple possible destinations. A simple router that verified identity, discovered what the caller needed, and handed off to the right specialist.

Drawing boundaries

The hardest part was deciding where one agent's responsibility ended and another's began.

Nina described three approaches.

Boundary by intent gave each agent a set of intents. The Scheduler handled book, reschedule, and cancel. The Billing agent handled payments, balances, and disputes. Cleanest separation. Easiest to test.

Boundary by workflow phase divided the conversation into segments. The Verifier owned the first two minutes. The Resolver owned the middle. The Closer owned the last minute. This worked for consistent workflows but broke down when phases overlapped.

Boundary by compliance domain isolated sensitive operations. The PCI agent handled payment card data in a hardened, audited environment. Everything else happened in a standard agent. Only one agent ever touched sensitive data, and that agent had minimal scope.

Without clear boundaries, agents drifted. A scheduler tried to handle complaints because no one said it couldn't. Clear scope documents prevented the drift.

Invisible handoffs

Agent-to-agent handoffs should feel invisible to the caller. No "transferring you now" between internal agents. The experience should feel like one continuous conversation with a single assistant.

Marcus worked through the design decisions with Nina.

What context should transfer? They had three options. Full conversation history carried everything but consumed tokens and increased latency. Compressed context summarized key information but risked losing nuance. Clean slate passed nothing and made the downstream agent start fresh.

For the support line, they chose compressed context. The router passed the caller's verified identity, discovered intent, and any collected details. It didn't pass the full greeting transcript. The specialist had what it needed without the overhead.

Should the voice change? Usually no. Same voice across all agents, so the caller perceived one assistant. But some designs intentionally used voice changes. An explicit escalation to a senior advisor might signal the transition on purpose. Marcus kept voices consistent.

What if the handoff failed? If the target agent was unavailable or the mechanism errored, the originating agent needed a fallback. Retry, transfer to a human, or gracefully close the interaction. Never silence. Marcus built retries with a human fallback after two failures.
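The fallback rule can be sketched as a small wrapper. The two callables are hypothetical stand-ins for whatever transfer mechanism is in use, and the broad exception handling is a simplification for the example.

```python
from typing import Callable


def handoff_with_fallback(
    attempt_handoff: Callable[[], bool],
    transfer_to_human: Callable[[], None],
    max_failures: int = 2,
) -> str:
    """Try the agent handoff; after max_failures failed attempts, go to a human."""
    failures = 0
    while failures < max_failures:
        try:
            if attempt_handoff():
                return "handed_off"
        except Exception:
            # A mechanism error counts the same as an unavailable agent.
            pass
        failures += 1
    # Never leave the caller in silence: always end on a concrete action.
    transfer_to_human()
    return "human"
```

Every path through the function ends in either a completed handoff or a human transfer, which is the property the design asks for.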

Shared state

In a multi-agent system, agents need to share information without having to re-collect it from the caller.

When the router handed off to the scheduling specialist, it passed structured data. Caller name, verified identity, discovered intent, and any slots already collected. The specialist started with context instead of asking "Can I get your name?" for the second time.

For complex workflows where multiple agents might interact with the same caller, Marcus set up a shared state store. A key-value mechanism that all agents could read and write. The billing agent could check whether the scheduler had already collected the account number. The technical agent could see what the billing agent had already tried.
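A minimal in-memory sketch of such a store follows. The class and method names are invented for illustration, and a production system would presumably back this with an external store rather than a Python dictionary.

```python
from typing import Any, Dict, Optional


class SharedState:
    """Per-call key-value store that every agent can read and write."""

    def __init__(self) -> None:
        self._calls: Dict[str, Dict[str, Dict[str, Any]]] = {}

    def write(self, call_id: str, key: str, value: Any, agent: str) -> None:
        """Record a value along with the agent that wrote it."""
        self._calls.setdefault(call_id, {})[key] = {
            "value": value,
            "written_by": agent,
        }

    def read(self, call_id: str, key: str) -> Optional[Any]:
        """Return the stored value, or None if nothing has been collected yet."""
        entry = self._calls.get(call_id, {}).get(key)
        return None if entry is None else entry["value"]

    def written_by(self, call_id: str, key: str) -> Optional[str]:
        """Return which agent collected the value, if any."""
        entry = self._calls.get(call_id, {}).get(key)
        return None if entry is None else entry["written_by"]


store = SharedState()
store.write("call-1", "account_number", "ACC-42", agent="scheduler")

# The billing agent checks before asking the caller again.
account = store.read("call-1", "account_number")
needs_to_ask = account is None
```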

Nina warned him about what not to share. Don't pass raw audio or full transcript history between agents unless necessary. It wasted tokens and confused downstream agents with irrelevant context. Pass summaries and structured data. Keep the context lean.

Testing integration points

Multi-agent systems introduced bugs that didn't exist in single-agent systems.

Routing failures meant the router sent callers to the wrong specialist. Marcus tested with ambiguous intents. "I want to change my appointment but also have a billing question." Which agent got the call? The router needed logic for multi-intent scenarios.
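One possible policy for multi-intent calls is to route to the specialist for the first intent the caller mentioned and carry the rest forward as pending work. That policy, the intent labels, and the agent names below are all assumptions for the sake of the sketch. The source does not say which rule Marcus chose.

```python
from typing import List, Tuple

# Hypothetical mapping from detected intent labels to specialist agents.
SPECIALISTS = {
    "reschedule": "scheduler",
    "billing_question": "billing",
}


def plan_route(detected_intents: List[str]) -> Tuple[str, List[str]]:
    """Return (agent to route to first, agents still pending).

    Intents are assumed to arrive in the order the caller mentioned them.
    """
    agents: List[str] = []
    for intent in detected_intents:
        agent = SPECIALISTS.get(intent)
        if agent is not None and agent not in agents:
            agents.append(agent)
    if not agents:
        # Nothing recognizable: a human is the safe default.
        return "human", []
    return agents[0], agents[1:]


first, pending = plan_route(["reschedule", "billing_question"])
```

The pending list would travel with the handoff context so the first specialist knows where to send the caller next.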

Context loss meant Agent B didn't have the information Agent A collected. Marcus tested by verifying that post-handoff agents never re-asked for information that had already been provided. If the billing agent asked for an account number that the router had already verified, something was broken.

Infinite loops meant Agent A thought the issue belonged to Agent B, and Agent B thought it belonged to Agent A. Neither resolved it. Marcus built loop detection. If a caller had been handed off more than twice without resolution, the system escalated to a human.
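Loop detection of this kind can be sketched as a counter on the call. The names are hypothetical, and the sketch reads "more than twice" as: a third handoff is never performed and is redirected to a human instead.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HANDOFFS = 2  # a third handoff would mean "more than twice"


@dataclass
class CallTrace:
    """Tracks agent-to-agent handoffs since the last resolution."""
    handoffs: List[str] = field(default_factory=list)

    def mark_resolved(self) -> None:
        """Resolution resets the counter; only unresolved bouncing counts."""
        self.handoffs.clear()


def next_target(trace: CallTrace, proposed_agent: str) -> str:
    """Return the agent to hand off to, or 'human' if the call is looping."""
    if len(trace.handoffs) >= MAX_HANDOFFS:
        return "human"
    trace.handoffs.append(proposed_agent)
    return proposed_agent


trace = CallTrace()
hops = [
    next_target(trace, "billing"),
    next_target(trace, "scheduler"),
    next_target(trace, "billing"),  # third unresolved handoff
]
```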

Boundary disputes meant a caller's issue spanned two agents' domains. Neither fully handled it. Marcus tested cross-domain scenarios and ensured clean handoffs even when the issue was ambiguous.

The integration points were where failures hid. Testing agents individually wasn't enough. Marcus ran end-to-end tests across the full multi-agent flow.

Start simple, then split

Nina's advice was to start with a single agent when building a first deployment. Prove the concept works. Adding multi-agent complexity to a pilot was an unnecessary risk.

Move to multi-agent when the single agent's prompt exceeds a manageable size, typically around 2,000 words. When accuracy drops as you add more intents. When different parts of the conversation need different compliance treatments. When you need independent testing on separate domains.

A staffing marketplace had followed this trajectory. A single screening agent went from concept to production in thirty days. It proved it could handle tens of thousands of decisions per day at 85% lower cost than projected human spend. Only then did they explore specialized agents for different job categories and worker tiers. The single agent earned the right to become multiple agents by succeeding first.

Marcus's installation agent had earned that right. Four months of production. Hundreds of thousands of calls. Proven value. Now it could stay focused on installations while new agents handled new domains.

What the architecture becomes

At enterprise scale, multi-agent becomes the default. A router or supervisor handles the front door. Specialist agents own each major intent domain. Shared services like identity verification and context lookup serve multiple agents. A governance layer manages prompt versions, tool contracts, and performance metrics across all agents.

The architecture is starting to look less like a single smart agent and more like a microservices system. Each agent is a service with a defined interface, clear inputs and outputs, and independent deployment. Each can be tested, improved, and replaced without touching the others.

An automotive marketplace operated at this scale. Ten to fifteen thousand calls per day. 450 concurrent sessions across five countries. Integrated with internal orchestration services and custom observability pipelines. Their call center employees, four floors of agents across multiple cities, had been retrained as AI experience builders who maintained and improved agent configurations rather than taking calls.

That was the end state. Not one agent that did everything. A system of focused agents, each good at one thing, coordinated to handle anything.

Marcus wasn't there yet. But with the router and four specialists in development, he was building in that direction. One agent had been the right starting point. Multiple agents would be how the system matured.