What you'll learn: How to manage the latency budget that makes voice conversations feel natural, with a component-level breakdown.
Key takeaways:
- The budget. Voice agents have a total latency budget of roughly 1,000 milliseconds: about 200ms for speech-to-text, 400ms for LLM inference, 200ms for text-to-speech, and 200ms for network and telephony. Every architectural choice spends or saves from this budget.
- Where latency hides. Long prompts, slow tool APIs, non-streaming TTS, and provider variance during peak hours. Audit each component individually.
- RAG can reduce latency by keeping prompts short. A retrieval step that adds 50-100ms beats stuffing everything into a long prompt that slows inference.
- Build for failover from day one. Provider outages happen. The system should switch to backups without dropping calls.
- Architecture should match your scale. Starter deployments need simplicity. Growth deployments need optimization. Enterprise deployments need redundancy and geographic distribution.
The agent was correct but felt slow. Drivers complained that responses took too long. Marcus listened to recordings and timed the gaps. Sometimes a full two seconds passed between the moment the driver stopped talking and the agent's reply. By then, drivers had already started repeating themselves or had hung up.
He brought the recordings to Priya, the infrastructure engineer who'd built the platform the agents ran on. She listened to three calls and diagnosed the problem immediately.
"You're blowing your latency budget."
Marcus didn't know he had a latency budget. Priya explained.
The thousand-millisecond window
Voice conversations feel natural when responses arrive within about one second. Faster feels instant. Slower feels broken. Human conversations have natural pauses, but those pauses have rhythm. An AI that pauses for two seconds in the wrong place sounds like it crashed.
Priya drew the stack on a whiteboard. Speech-to-text transcribes what the driver said. The LLM figures out what to say back. Text-to-speech turns that response into audio. Each step takes time. Add network latency and telephony overhead, and you've spent your thousand milliseconds before you know it.
She broke it down. Speech-to-text, maybe 200 milliseconds on a good provider. LLM inference, 300 to 500 milliseconds depending on the model and prompt length. Text-to-speech, another 150 to 250 milliseconds. Network round-trip and telephony, 100 to 200 milliseconds total.
Add those up on a good day, and you're at 750 milliseconds. Add them up when the LLM is slow, or the prompt is long, and you're past 1,200. Past that threshold, the conversation feels sluggish. The driver loses patience.
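The arithmetic behind those totals can be sketched in a few lines. The ranges below are the ones Priya quoted; the dictionary keys and function name are illustrative only. Note that the quoted ranges top out at 1,150 milliseconds, so passing 1,200 means a component has drifted outside its normal range.

```python
# Component latency ranges in milliseconds, as quoted in the breakdown above.
COMPONENT_RANGES_MS = {
    "speech_to_text": (200, 200),
    "llm_inference": (300, 500),
    "text_to_speech": (150, 250),
    "network_and_telephony": (100, 200),
}

BUDGET_MS = 1000


def total_range(ranges):
    """Return (best_case, worst_case) totals across all components."""
    best = sum(low for low, _ in ranges.values())
    worst = sum(high for _, high in ranges.values())
    return best, worst


best_case, worst_case = total_range(COMPONENT_RANGES_MS)
print(f"best case:  {best_case} ms ({BUDGET_MS - best_case} ms of headroom)")
print(f"worst case: {worst_case} ms ({worst_case - BUDGET_MS} ms over budget)")
```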
Marcus realized he'd been optimizing conversation design without thinking about what happened beneath the surface. Nina's prompts were good, but if the infrastructure couldn't deliver them fast enough, the experience still failed.
Where latency hides
Priya walked through where teams typically lost time.
Speech-to-text varied more than people expected. Some providers returned transcripts in 150 milliseconds. Others took 400. Accuracy mattered too. A fast provider that mishears the driver creates additional latency downstream because the agent responds to the wrong thing, causing the conversation to loop.
LLM inference was the biggest variable. Longer prompts meant more tokens to process. Marcus's prompts had grown to 1,500 words as he'd added guardrails and edge case handling. Every word costs time. Priya suggested trimming the prompt or moving rarely used instructions into conditional sections that load only when relevant.
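One way to picture conditional sections is a small prompt builder that always includes the core instructions and appends the rarely used ones only when the caller's words suggest they apply. This is a rough sketch, not the platform's actual mechanism: the prompt text, the trigger keywords, and the keyword-matching approach are all assumptions made for illustration.

```python
import re

# Always-loaded core instructions (hypothetical wording).
BASE_PROMPT = "You are a phone support agent for delivery drivers. Keep answers short."

# Rarely used instructions, keyed by trigger keywords (all hypothetical).
CONDITIONAL_SECTIONS = {
    ("accident", "crash", "injury"):
        "If the driver reports an accident, confirm they are safe before anything else.",
    ("refund", "payment", "invoice"):
        "For payment disputes, collect the invoice number and escalate to billing.",
}


def build_system_prompt(utterance: str) -> str:
    """Return the base prompt plus only the sections the utterance triggers."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    parts = [BASE_PROMPT]
    for keywords, section in CONDITIONAL_SECTIONS.items():
        if words.intersection(keywords):
            parts.append(section)
    return "\n\n".join(parts)


short_prompt = build_system_prompt("Where is my next pickup?")
long_prompt = build_system_prompt("I was in an accident on the highway.")
print(len(short_prompt), len(long_prompt))
```

Keyword matching is crude and will miss paraphrases. The point is only that most calls pay for the short prompt, and the longer one is loaded when it is needed.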
Tool calls added latency inside the LLM step. If the agent needed to check a backend system before responding, that API call happened mid-inference. A slow API led to slow responses. Marcus had one tool that sometimes took 800 milliseconds on its own. Combined with everything else, responses stretched past two seconds.
Text-to-speech seemed fast but had hidden costs. Some engines needed to process the entire response before speaking. Others could stream, starting to speak while still generating audio for the rest of the sentence. Streaming TTS shaved 100-200 milliseconds off perceived latency because the driver heard the audio sooner.
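The difference between the two styles can be sketched with a plain list versus a generator. The `synthesize` function is a stand-in that returns fake audio bytes, not a real TTS call, and the chunking of the response is assumed.

```python
from typing import Iterator, List


def synthesize(text_chunk: str) -> bytes:
    """Stand-in for a TTS engine call; returns fake audio bytes."""
    return text_chunk.encode("utf-8")


def tts_batch(chunks: List[str]) -> List[bytes]:
    """Non-streaming: every chunk is synthesized before anything is returned."""
    return [synthesize(chunk) for chunk in chunks]


def tts_stream(chunks: List[str]) -> Iterator[bytes]:
    """Streaming: each chunk is handed over as soon as it is synthesized."""
    for chunk in chunks:
        yield synthesize(chunk)


response = ["Your delivery", "is scheduled", "for tomorrow morning."]

first_audio = next(tts_stream(response))  # playback can begin after one chunk
all_audio = tts_batch(response)           # playback waits for all three chunks
print(first_audio, len(all_audio))
```

With streaming, the driver waits for one chunk's worth of synthesis instead of the whole response, which is where the perceived-latency saving comes from.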
Network and telephony overhead was mostly fixed, but poor architecture could inflate it. If every component talked to every other component across the public internet, round-trip times would accumulate. Co-locating services or using dedicated connections reduced this overhead.
Priya audited Marcus's deployment and found that he could recover 400 milliseconds. Faster STT provider. Shorter system prompt. Streaming TTS. Async tool calls where possible. The agent felt responsive again.
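One reading of "async tool calls" is running independent backend lookups concurrently, so the wait is the slowest call rather than the sum of all of them. The sketch below assumes that reading. The two tool functions are hypothetical and simulate backend latency with `asyncio.sleep`.

```python
import asyncio
import time


async def check_delivery_status(order_id: str) -> str:
    """Hypothetical backend lookup; the sleep simulates a 200 ms API call."""
    await asyncio.sleep(0.2)
    return "out for delivery"


async def check_driver_account(driver_id: str) -> str:
    """Hypothetical backend lookup; the sleep simulates a 200 ms API call."""
    await asyncio.sleep(0.2)
    return "active"


async def gather_context():
    """Run both lookups concurrently and report the elapsed wall-clock time."""
    start = time.perf_counter()
    status, account = await asyncio.gather(
        check_delivery_status("order-1"),
        check_driver_account("driver-1"),
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    return status, account, elapsed_ms


status, account, elapsed_ms = asyncio.run(gather_context())
print(status, account, f"{elapsed_ms:.0f} ms")  # close to 200 ms, not 400 ms
```

This only helps when the calls do not depend on each other's results. A lookup that needs the output of a previous one still has to run in sequence.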
The latency budget template
Priya gave Marcus a template his team could use for any new agent.
Start with 1,000 milliseconds as the target. Allocate across components based on what you can actually achieve.
Speech-to-text. Target 200 milliseconds. Measure your actual provider. If it's slower, either accept it or switch providers.
LLM inference. Target 400 milliseconds. This depends on prompt length, model choice, and whether you're using tools. Measure with your actual prompt, not a test prompt.
Text-to-speech. Target 200 milliseconds with streaming. Non-streaming adds 100 to 150 milliseconds.
Network and telephony. Budget 200 milliseconds. Measure your actual infrastructure. Co-location helps.
Tool calls. Budget separately. If a tool call is required before responding, add its latency to the LLM step. If it averages 300 milliseconds, the LLM step really costs 700 milliseconds, not 400, and the rest of the budget has to absorb the difference.
The numbers would vary by deployment. The discipline was measuring each component and knowing where you stood against the budget. Teams that didn't track this discovered latency problems only when users complained.
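The template translates naturally into a small data structure. The targets below are the ones from the template. The measured values are invented for illustration, including a 300-millisecond tool call folded into the LLM step, and the class and function names are assumptions.

```python
from dataclasses import dataclass
from typing import List

TOTAL_BUDGET_MS = 1000


@dataclass
class ComponentBudget:
    name: str
    target_ms: int
    measured_ms: int

    @property
    def overrun_ms(self) -> int:
        """Positive when the component is slower than its target."""
        return self.measured_ms - self.target_ms


def report(components: List[ComponentBudget]) -> dict:
    """Summarize where a deployment stands against the total budget."""
    total = sum(c.measured_ms for c in components)
    return {
        "total_measured_ms": total,
        "remaining_ms": TOTAL_BUDGET_MS - total,
        "over_target": [c.name for c in components if c.overrun_ms > 0],
    }


# Measured values are hypothetical; the LLM figure includes a 300 ms tool call.
components = [
    ComponentBudget("speech_to_text", target_ms=200, measured_ms=180),
    ComponentBudget("llm_inference", target_ms=400, measured_ms=400 + 300),
    ComponentBudget("text_to_speech", target_ms=200, measured_ms=200),
    ComponentBudget("network_and_telephony", target_ms=200, measured_ms=150),
]

summary = report(components)
print(summary)
```

In this made-up deployment the tool call alone pushes the agent 230 milliseconds over budget, even though every other component is at or under its target.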
Knowledge bases and retrieval
The support triage agent needed access to policy documents. Hundreds of pages covering procedures, exceptions, and edge cases. Marcus's first instinct was to stuff the relevant sections into the prompt.
Priya stopped him. Long context windows were possible but expensive. Every token in the prompt added inference time and cost. A 10,000-token context costs more than a 2,000-token context and responds more slowly.
Retrieval-augmented generation solved this differently. Store the documents in a knowledge base with semantic search. When a caller asks a question, retrieve only the relevant sections and inject them into the prompt. The prompt stayed short. The agent still had access to the full knowledge base.
Priya set up the architecture. Documents were chunked into searchable segments. A retrieval step before the LLM call pulled the three most relevant chunks, and those chunks were injected into the system prompt as context. The agent could answer questions about any policy without loading all policies into every call.
The retrieval step added 50 to 100 milliseconds of latency. But it saved more than that by keeping the prompt short. And it scaled. Adding a thousand more documents didn't slow down inference because only the relevant ones were loaded.
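The retrieve-then-inject flow can be sketched as follows. A real deployment would use embedding-based semantic search. This sketch substitutes a bag-of-words cosine similarity so that it runs with the standard library alone, and the policy chunks, prompt wording, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the query (stand-in for semantic search)."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt_with_context(base: str, query: str, chunks: List[str], k: int = 3) -> str:
    """Inject only the retrieved chunks into the system prompt."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"{base}\n\nRelevant policy excerpts:\n{context}"


# Hypothetical policy chunks.
CHUNKS = [
    "Insurance claims must be filed within 30 days of the incident.",
    "Drivers can reset their app password from the login screen.",
    "Vehicle inspections are required every six months.",
    "Insurance coverage applies only while a delivery is in progress.",
    "Payments are issued weekly on Fridays.",
]

top = retrieve("When must insurance claims be filed?", CHUNKS)
prompt = build_prompt_with_context("You are a support agent.", "When must insurance claims be filed?", CHUNKS)
print(top[0])
```

However many chunks the knowledge base holds, the prompt carries only three of them, which is why adding documents does not lengthen inference.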
For the logistics company's B2B support line, this was essential. Drivers asked questions that spanned insurance policies, operating procedures, and app documentation. No prompt could hold it all. Retrieval made the agent knowledgeable without making it slow.
Failover and redundancy
Marcus asked what would happen if a component failed. The STT provider had an outage. The LLM endpoint went down. The TTS service timed out.
Priya explained that enterprise architecture meant graceful degradation. Every critical component needed a fallback.
For speech-to-text, configure a secondary provider. If the primary fails or responds too slowly, route to the backup. Accept that accuracy might differ slightly between providers. Test the backup regularly to ensure it works.
For LLM inference, options were more limited but still existed. A secondary model, perhaps smaller and faster, could handle calls during primary outages. Or queue calls for callback rather than failing silently. The worst outcome was silence on the line with no explanation.
For text-to-speech, secondary providers were straightforward. The voice might sound different, but that was better than no voice at all.
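The route-to-backup pattern is the same for every component: give the primary a deadline, and fall back if it misses the deadline or errors out. A minimal sketch, assuming async provider calls, with both providers simulated rather than real:

```python
import asyncio


async def with_failover(primary, secondary, payload, timeout_s: float):
    """Try the primary provider; fall back if it is too slow or fails."""
    try:
        return await asyncio.wait_for(primary(payload), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return await secondary(payload)


# Simulated providers (hypothetical names and behavior).
async def slow_primary_stt(audio: bytes) -> str:
    await asyncio.sleep(0.5)  # slower than the deadline below
    return "primary transcript"


async def failing_primary_stt(audio: bytes) -> str:
    raise ConnectionError("provider outage")


async def backup_stt(audio: bytes) -> str:
    return "backup transcript"


after_timeout = asyncio.run(with_failover(slow_primary_stt, backup_stt, b"audio", timeout_s=0.05))
after_outage = asyncio.run(with_failover(failing_primary_stt, backup_stt, b"audio", timeout_s=0.05))
print(after_timeout, "|", after_outage)
```

The deadline itself spends budget. Time spent waiting on a primary that never answers is added to the backup's own latency, so the timeout should sit close to the component's target.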
Priya built a monitoring system that tracked response times for each component. If speech-to-text latency spiked above 300 milliseconds, alerts fired. If LLM inference consistently exceeded 600 milliseconds, someone investigated. Catching degradation before it became a failure was the goal.
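The two alert rules differ in shape: the speech-to-text rule fires on a single spike, while the LLM rule fires only on sustained slowness. A sliding window can express both. The thresholds come from the text. The window size of five for "consistently" is an assumption, as are the class name and sample values.

```python
from collections import deque


class LatencyMonitor:
    """Fires when every sample in a sliding window exceeds a threshold."""

    def __init__(self, name: str, threshold_ms: float, window: int = 1):
        self.name = name
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> bool:
        """Record one sample; return True if an alert should fire."""
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold_ms for s in self.samples)


# A window of 1 alerts on any spike; a window of 5 requires sustained slowness.
stt_monitor = LatencyMonitor("speech_to_text", threshold_ms=300, window=1)
llm_monitor = LatencyMonitor("llm_inference", threshold_ms=600, window=5)

stt_alerts = [stt_monitor.record(ms) for ms in (180, 210, 340)]
llm_alerts = [llm_monitor.record(ms) for ms in (650, 700, 620, 640, 610, 400)]
print(stt_alerts, llm_alerts)
```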
Marcus asked how the automotive marketplace handled this at scale. Priya had talked to their team. With 450 concurrent sessions across five countries, any single point of failure could affect thousands of calls. They ran multiple providers in parallel for critical components, using the fastest response and discarding the duplicates. Expensive but resilient.
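Racing providers against each other might look like the sketch below. It is not the marketplace's actual implementation, only an illustration of the idea using `asyncio.wait`. The two providers are simulated, and a production version would also need to handle the case where the first provider to finish returns an error.

```python
import asyncio


async def race(providers, payload):
    """Send the same request to every provider and keep the first result."""
    tasks = [asyncio.create_task(provider(payload)) for provider in providers]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # discard the slower duplicates
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return next(iter(done)).result()  # re-raises if the fastest provider failed


# Simulated providers (hypothetical).
async def fast_stt(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return "fast transcript"


async def slow_stt(audio: bytes) -> str:
    await asyncio.sleep(0.5)
    return "slow transcript"


winner = asyncio.run(race([slow_stt, fast_stt], b"audio"))
print(winner)
```

Every request is paid for once per provider, which is the "expensive" half of the trade-off.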
Architecture by scale
Priya outlined how architecture should evolve with scale.
Starter deployments with under 10,000 calls per month could run simply. Single provider for each component. Monitoring but minimal redundancy. The priority was proving the use case worked. Over-engineering at this stage wasted time and money.
Growth deployments with 10,000 to 100,000 calls per month needed more robustness. Failover providers configured but not necessarily running hot. Latency monitoring with alerts. Knowledge base infrastructure if the use case required it. Still manageable with a small team.
Enterprise deployments with over 100,000 calls per month required the full stack. Multi-provider redundancy running actively. Dedicated infrastructure or reserved capacity. Sophisticated monitoring and observability. Likely a platform team maintaining the infrastructure, separate from the teams building individual agents.
Marcus's installation agent had started as a starter deployment. Four months later, with multiple agents planned and growing volume, they were moving into growth architecture. Priya was building the infrastructure to support it.
Choosing providers
Each component in the stack required a provider decision. Priya walked Marcus through her criteria.
Latency came first for voice. A provider that was 100 milliseconds slower than competitors consumed 10% of the total budget. That cost compounded across thousands of calls. Measure actual latency in production conditions, not benchmarks on a marketing page.
Accuracy mattered especially for speech-to-text. A provider that mishears 5% of utterances creates downstream problems. The agent responds to the wrong thing. The caller repeats themselves. Conversations loop. Slightly higher latency with better accuracy often produced faster overall conversations.
Cost scaled with volume. A provider charging twice as much per minute might be fine at 5,000 calls per month but prohibitive at 200,000. Model the unit economics at your target scale, not your current scale.
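A quick model makes the scale effect concrete. The call volumes are the ones mentioned above. The average call length and per-minute prices are invented for illustration, and prices are kept in whole cents to avoid floating-point rounding.

```python
def monthly_cost_cents(calls_per_month: int, avg_minutes_per_call: int, cents_per_minute: int) -> int:
    """Monthly provider spend in cents."""
    return calls_per_month * avg_minutes_per_call * cents_per_minute


# Hypothetical figures: 3-minute calls, one provider at twice the other's price.
AVG_MINUTES_PER_CALL = 3
CHEAP_CENTS_PER_MINUTE = 1
PRICEY_CENTS_PER_MINUTE = 2


def premium_dollars(calls_per_month: int) -> float:
    """Extra monthly spend from choosing the pricier provider."""
    cheap = monthly_cost_cents(calls_per_month, AVG_MINUTES_PER_CALL, CHEAP_CENTS_PER_MINUTE)
    pricey = monthly_cost_cents(calls_per_month, AVG_MINUTES_PER_CALL, PRICEY_CENTS_PER_MINUTE)
    return (pricey - cheap) / 100


current_premium = premium_dollars(5_000)
target_premium = premium_dollars(200_000)
print(f"premium at 5,000 calls: ${current_premium:,.0f}/month")
print(f"premium at 200,000 calls: ${target_premium:,.0f}/month")
```

Under these assumptions the same price gap costs 40 times more at the target volume than at the current one.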
Language support mattered for global deployments. The automotive marketplace operating across five countries needed providers that handled Spanish, Portuguese, and regional dialects. Not all providers offered the same quality across languages. Some excelled in English but struggled elsewhere.
Provider flexibility prevented lock-in. Platforms that abstracted provider choices, letting you swap STT or TTS providers without rewriting integration code, gave you leverage. When a provider raised prices or degraded quality, you could switch. When a better option emerged, you could adopt it.
Priya had seen teams locked into underperforming providers because switching would require months of engineering work. She had built the logistics company's infrastructure with provider abstraction from the start. Swapping providers meant changing configuration, not code.
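Provider abstraction of this kind usually comes down to a shared interface plus a registry keyed by a configuration value. The sketch below shows the shape of the idea. It is not Priya's actual platform, and the provider classes, registry, and config key are hypothetical.

```python
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    """The interface every STT provider adapter must satisfy."""

    def transcribe(self, audio: bytes) -> str: ...


# Hypothetical adapters; real ones would wrap each vendor's SDK.
class ProviderA:
    def transcribe(self, audio: bytes) -> str:
        return "transcript from provider A"


class ProviderB:
    def transcribe(self, audio: bytes) -> str:
        return "transcript from provider B"


STT_REGISTRY: Dict[str, Callable[[], SpeechToText]] = {
    "provider_a": ProviderA,
    "provider_b": ProviderB,
}


def load_stt(config: dict) -> SpeechToText:
    """Pick the provider named in configuration; calling code never changes."""
    return STT_REGISTRY[config["stt_provider"]]()


# Swapping providers is a one-line configuration change.
before = load_stt({"stt_provider": "provider_a"}).transcribe(b"audio")
after = load_stt({"stt_provider": "provider_b"}).transcribe(b"audio")
print(before, "|", after)
```

The agent code depends only on `transcribe`, so a new vendor means one new adapter and one new registry entry.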
What Priya taught Marcus
Six months after that first conversation about latency, Marcus thought about voice agents differently. The conversation design mattered. The prompt engineering mattered. But underneath both was an infrastructure that could make everything feel fast or make everything feel broken.
He'd learned to measure before optimizing. Gut feel said the agent was slow. Measurements told him exactly which component was the bottleneck. Sometimes it was the prompt. Sometimes it was a tool. Sometimes it was a provider having a bad day.
He'd learned to budget latency the way he budgeted money. A thousand milliseconds total. Allocate carefully. Measure constantly. When something new consumed time, something else had to give it up.
The best agents weren't just well-designed; they were well-architected. The conversation only felt natural when the infrastructure underneath could deliver it without hesitation.

