
How we solved latency at Vapi
Abhishek Sharma • Jul 14, 2025

Latency is the enemy of conversational flow.

In real-time voice applications, the most important metric is latency to response, measured as the duration between a user’s end of statement and the agent’s start of statement. This cycle is called turn-taking.


Conversational flow breaks when latency exceeds 1200ms. That’s roughly the time it takes for the user to have a tangential thought.

This gives us a strict 1200ms latency budget for every turn in a conversation.
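As a minimal sketch of the metric, turn latency can be computed from two timestamps and checked against the budget. The `Turn` type and its field names are hypothetical, introduced only for illustration; the 1200ms figure is the budget stated above.

```python
from dataclasses import dataclass

TURN_BUDGET_MS = 1200.0  # latency budget per turn, as stated in the article


@dataclass
class Turn:
    """Hypothetical record of one conversational turn (timestamps in ms)."""
    user_speech_end_ms: float     # when the user finished their statement
    agent_speech_start_ms: float  # when the agent's audio started playing


def turn_latency_ms(turn: Turn) -> float:
    """Latency to response: user's end of statement to agent's start of statement."""
    return turn.agent_speech_start_ms - turn.user_speech_end_ms


def within_budget(turn: Turn, budget_ms: float = TURN_BUDGET_MS) -> bool:
    """True if the turn did not exceed the latency budget."""
    return turn_latency_ms(turn) <= budget_ms
```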

We treat this budget like a scarce resource. If we can save milliseconds in the LLM reasoning step, we can spend them on a higher-fidelity TTS model for a more human-like voice.

Here’s how that 1200ms budget is typically spent in a speech-to-speech pipeline:

The ASR (speech-to-text) and TTS (text-to-speech) models are fairly optimized by their underlying providers.

The bottleneck is almost always the LLM, specifically the time to first meaningful sentence.

LLM providers advertise impressive speeds, but their benchmarks rarely hold up in production. 

We tracked OpenAI’s GPT-4o mini over a 7-day period. The latency was anything but stable:

A model that performs well on a Friday night can be unusable on Monday morning. This volatility is the real enemy of conversational AI. 

We observed the same pattern across all available regions in Azure OpenAI; they all vary independently.

We needed a system that dynamically routes every request to the absolute fastest deployment available at that exact moment.

Attempt 1: The Brute-Force Race

The obvious solution is to send every request to all 40+ Azure OpenAI deployments and use the first one to respond. This brute-force approach is optimal for latency, but it costs 40x the tokens. Unacceptable.
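The race itself is easy to sketch with `asyncio`. This is an illustration, not our production code: `call_deployment` is a hypothetical stand-in that sleeps instead of calling an LLM, and the deployment names are made up.

```python
import asyncio
from typing import Dict


async def call_deployment(name: str, delay_s: float) -> str:
    """Hypothetical stand-in for an LLM request; sleeps to simulate latency."""
    await asyncio.sleep(delay_s)
    return name


async def race(delays_s: Dict[str, float]) -> str:
    """Send the same request to every deployment and return the first responder."""
    tasks = [
        asyncio.create_task(call_deployment(name, delay))
        for name, delay in delays_s.items()
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancelling the losers does not undo the requests that were already sent,
    # which is why this approach multiplies token cost by the deployment count.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()
```

Every call fans out to every deployment, so the cost grows linearly with the number of deployments even though only one response is used.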

Attempt 2: Polling for the Fastest Path

We realized we could gauge relative latency by polling each deployment with a cheap, single-token request. The cost is O(1) with respect to traffic, great!

We built a system to poll every 10 minutes, costing about $400/day. 

The results are stored in Redis. When a call comes in, we check Redis, pick the fastest deployment, and route the request.
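A minimal sketch of the poll-then-route loop, under stated assumptions: a plain dictionary stands in for Redis, and `probe` is a hypothetical callable that sends the single-token request. Neither name comes from our actual system.

```python
import time
from typing import Callable, Dict, Iterable


def poll_deployments(
    deployments: Iterable[str],
    probe: Callable[[str], None],
    store: Dict[str, float],
) -> None:
    """Time one cheap probe per deployment and record its latency in ms."""
    for name in deployments:
        start = time.perf_counter()
        probe(name)  # hypothetical single-token request to this deployment
        store[name] = (time.perf_counter() - start) * 1000.0


def pick_fastest(store: Dict[str, float]) -> str:
    """Return the deployment with the lowest recorded latency."""
    return min(store, key=store.__getitem__)
```

A scheduler would call `poll_deployments` on the polling interval, and the request path would call `pick_fastest` on every incoming call.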

This improved average latency, but we still saw spikes lasting 5+ minutes. 

When a deployment degraded between polls, we were stuck routing traffic to a slow endpoint for the next 10 minutes. The accuracy of our proxy list would look like this:

Attempt 3: Using Live Data + Exploration

We needed fresher data. 

By using the latency from our live production requests, we could update our proxy list in real-time. 

If our fastest deployment spiked, we’d detect it on the next request and immediately rotate it out. 

This solved the stale data problem, but created a new one: we were only exploiting our known winners.

We were no longer exploring the other 39 deployments between polls. What if one of them had become faster? We would never know.

The solution was to segment our traffic. We route the vast majority of requests to the current fastest endpoint (exploitation), but send a small, statistically significant subset to test the others (exploration).
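One common way to express this split is an epsilon-greedy router, sketched below. It is an illustration under assumptions rather than a description of our implementation: the class name, the 5% `explore_rate`, and the choice to overwrite each deployment's latency with its latest observation are all hypothetical.

```python
import random
from typing import Dict, Optional


class LatencyRouter:
    """Epsilon-greedy routing over per-deployment latency estimates (ms)."""

    def __init__(
        self,
        latency_ms: Dict[str, float],
        explore_rate: float = 0.05,  # assumed value, not from the article
        rng: Optional[random.Random] = None,
    ) -> None:
        self.latency_ms = dict(latency_ms)
        self.explore_rate = explore_rate
        self.rng = rng or random.Random()

    def fastest(self) -> str:
        return min(self.latency_ms, key=self.latency_ms.__getitem__)

    def choose(self) -> str:
        """Mostly exploit the fastest deployment; sometimes explore another."""
        best = self.fastest()
        others = [name for name in self.latency_ms if name != best]
        if others and self.rng.random() < self.explore_rate:
            return self.rng.choice(others)
        return best

    def record(self, name: str, observed_ms: float) -> None:
        """Feed back live latency so a spike rotates the deployment out."""
        self.latency_ms[name] = observed_ms
```

Overwriting with the latest observation reacts to a spike on the very next request; a smoothed average would be less jumpy but slower to rotate a degraded deployment out.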

We had a system that could intelligently route requests to the fastest deployment. 

We thought we'd solved it. But alas, we were wrong.

The Real Problem

We were still seeing about 5% of conversation turns hang for up to 5000ms, more than four times the 1200ms budget and an absolute death spiral for conversational flow.

After digging in, we found the cause: sometimes, a request to a provider like Azure OpenAI just hangs. No error, no timeout, nothing. 

The first request to hit the hang gets shot. Our system would detect it and route subsequent traffic away, but that first user's experience was already ruined:

The final piece of the puzzle was building a recovery mechanism. If a request to the fastest deployment takes too long, we don't wait. We cancel it and immediately fire off a new request to the second-fastest deployment.

Setting this threshold is tricky. Too aggressive, and you incur extra costs from unnecessary fallbacks. Too slow, and the user is left waiting.

But each deployment has its own performance profile, so a single global threshold wouldn’t do the trick.

We calculated the historical standard deviation for each individual deployment and set a dynamic threshold based on what constitutes abnormal latency for that specific deployment.
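A minimal sketch of a per-deployment threshold, assuming the common "mean plus k standard deviations" rule. The exact formula and the multiplier `k = 3.0` are assumptions for illustration; the article only states that the threshold is derived from each deployment's historical standard deviation.

```python
import statistics
from typing import Sequence


def outlier_threshold_ms(history_ms: Sequence[float], k: float = 3.0) -> float:
    """Latency above which a request counts as abnormal for this deployment.

    history_ms: past latencies for ONE deployment, in milliseconds.
    k: how many standard deviations above the mean to tolerate (assumed).
    """
    mean = statistics.fmean(history_ms)
    std = statistics.pstdev(history_ms)
    return mean + k * std
```

A steady deployment gets a tight threshold and a noisy one gets more slack, so the same absolute delay can be an outlier on one deployment and normal on another.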

If the first request is an outlier, we fall back to the second. 

If the second is an outlier, we fall back to the third. And so on.
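The cascade can be sketched with `asyncio.wait_for`, which cancels the in-flight request when its deadline passes. This is a sketch under assumptions: `send` is a hypothetical coroutine that issues the LLM request, `thresholds_ms` holds the per-deployment outlier thresholds, and letting the last candidate run without a deadline is a design choice made for this example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence, Tuple


async def call_with_fallback(
    ranked: Sequence[str],
    thresholds_ms: Dict[str, float],
    send: Callable[[str], Awaitable[str]],
) -> Tuple[str, str]:
    """Try deployments fastest-first, abandoning any that exceed its threshold.

    Returns (deployment_name, response).
    """
    last = len(ranked) - 1
    for i, name in enumerate(ranked):
        # Assumption: the final candidate gets no deadline, since there is
        # nothing left to fall back to.
        timeout_s = None if i == last else thresholds_ms[name] / 1000.0
        try:
            # wait_for cancels the request if it outlives the timeout.
            return name, await asyncio.wait_for(send(name), timeout_s)
        except asyncio.TimeoutError:
            continue  # outlier: move on to the next-fastest deployment
    raise RuntimeError("no deployments configured")
```

The extra cost is bounded: a fallback request is only sent when the current one has already crossed its own deployment's outlier threshold.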

This is what it takes to make an off-the-shelf model like GPT-4o reliably fast for real-time voice.

This system alone shaved over 1000ms off our P95 latency. 

It’s one of hundreds of infrastructure problems we've had to solve to wrangle these models into something developers can actually use.
