How We Achieved 99.99% Reliability At Vapi

Abhishek Sharma • Aug 12, 2025 • 4 min read

At Vapi, we're obsessed with the quality of conversation. But the most human-like AI is useless if the line drops.

A while ago, we were at a respectable ~99.9% uptime. That's about 8.8 hours of downtime a year (0.1% of 8,760 hours). That was still a lot, so we set an ambitious goal: 99.99% uptime, which is less than an hour of downtime per year.

But getting from 99.9% to 99.99% is a different game. We didn't have to do one thing 100x better; we had to do a hundred things a little better.

It required a top-to-bottom re-architecture.

That effort was led by our core team: Tejas, Sagar, and Bryant.

This is how they did it.

The Core

Our journey started with the database. Our initial setup with Supabase was great for getting started, but it became a single point of failure at scale.

We migrated our primary database to Neon for better stability. But a better database wasn't enough. We needed redundancy. We engineered a fallback system to AWS Aurora. If Neon has an issue, we fail over with minimal disruption. 
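
Vapi hasn't published its failover code, but the shape of the pattern is simple. Here's a minimal TypeScript sketch, assuming two node-postgres pools (Neon as primary, Aurora as fallback); the environment variable names and the retry-once policy are illustrative only.

```typescript
// Sketch only: a query helper that prefers the Neon primary and fails over
// to an Aurora fallback on connection-level errors.
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.NEON_DATABASE_URL });
const fallback = new Pool({ connectionString: process.env.AURORA_DATABASE_URL });

export async function query(text: string, params: unknown[] = []) {
  try {
    // Happy path: the Neon primary answers.
    return await primary.query(text, params);
  } catch (err) {
    // Primary unavailable: retry the same statement against Aurora so the
    // caller sees a slower query instead of an outage.
    console.warn("primary query failed, failing over to Aurora", err);
    return await fallback.query(text, params);
  }
}
```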

We also added guardrails like aggressive statement timeouts to kill rogue queries, partitioning massive tables to keep them fast, and a caching layer in front of the DB. 

That cache now serves 80% of eligible requests, meaning 4 out of every 5 reads that pass through it never touch the database, making the entire platform faster and more resilient.
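
A minimal sketch of that read-through pattern, assuming Redis via ioredis and the hypothetical query() helper from the previous snippet; the table, key format, and TTL are illustrative, not Vapi's actual schema.

```typescript
// Sketch only: read-through caching in front of Postgres using ioredis.
import Redis from "ioredis";
import { query } from "./db"; // the hypothetical failover helper sketched earlier

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

export async function getAssistant(id: string) {
  const cacheKey = `assistant:${id}`;

  // Most eligible reads are answered here without touching Postgres.
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Cache miss: fall through to the database...
  const { rows } = await query("SELECT * FROM assistants WHERE id = $1", [id]);

  // ...and repopulate with a short TTL so staleness stays bounded.
  await redis.set(cacheKey, JSON.stringify(rows[0]), "EX", 60);
  return rows[0];
}
```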

On the telephony side, our SIP stack was running on a single EC2 instance that capped out at around 200 concurrent calls.

We moved it to auto-scaling groups (ASGs) for each component. 

This eliminated the bottleneck and let us handle sudden traffic spikes without dropping calls.
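
The post doesn't say how these ASGs are defined, but as an illustration, a per-component auto-scaling group might look something like this AWS CDK sketch; the instance class, AMI, capacities, and scaling target are all assumptions.

```typescript
// Illustrative AWS CDK sketch of one per-component SIP auto-scaling group.
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as autoscaling from "aws-cdk-lib/aws-autoscaling";

export class SipStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "SipVpc");

    // One ASG per SIP component instead of a single shared EC2 instance.
    const sipProxyAsg = new autoscaling.AutoScalingGroup(this, "SipProxyAsg", {
      vpc,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.C5, ec2.InstanceSize.LARGE),
      machineImage: ec2.MachineImage.latestAmazonLinux2(),
      minCapacity: 2,  // redundancy even at low traffic
      maxCapacity: 20, // headroom for sudden call spikes
    });

    // Add capacity on CPU pressure before calls start dropping.
    sipProxyAsg.scaleOnCpuUtilization("ScaleOnCpu", { targetUtilizationPercent: 50 });
  }
}
```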

We thought we were on our way. We were wrong.

The "Fallbacks Everywhere" Philosophy

Our own infrastructure was becoming more stable, but our agents still felt unreliable. 

The problem was our dependencies. Now we treat every external provider as a potential point of failure.

For AI models, we fall back between a chain of models. If OpenAI fails, we instantly retry on Azure OpenAI. If that is down, we jump to AWS Bedrock. The switch happens inside the request lifecycle, so a provider outage never becomes a Vapi outage.
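
Here's a sketch of what an in-request fallback chain can look like; the provider wrappers (the actual OpenAI, Azure OpenAI, and Bedrock SDK calls) are assumed and not shown.

```typescript
// Sketch only: an ordered provider chain walked inside a single request.
type Message = { role: "system" | "user" | "assistant"; content: string };
type ChatFn = (messages: Message[]) => Promise<string>;

interface Provider {
  name: string;
  chat: ChatFn;
}

export async function completeWithFallback(
  providers: Provider[], // e.g. [openai, azureOpenai, bedrock] in preference order
  messages: Message[],
): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      // The switch happens inside the request lifecycle: the caller only
      // sees an error if every provider in the chain fails.
      return await provider.chat(messages);
    } catch (err) {
      lastError = err;
      console.warn(`${provider.name} failed, falling back to the next provider`);
    }
  }
  throw lastError;
}
```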

We carry this "fallbacks everywhere" philosophy to other critical systems as well. Our DTMF (touch-tone) transmission uses a dual-strategy approach: we try the provider's native API first, but if that fails, we automatically fall back to sending the tones as audio signals. 

This gives us a 98%+ overall success rate for navigating automated phone systems, even when individual methods fail.
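
A minimal sketch of that dual strategy, with a hypothetical transport interface standing in for the provider API and the in-band audio path:

```typescript
// Sketch only: try the provider's native DTMF API first, then fall back to
// playing the tones in-band on the call audio.
interface DtmfTransport {
  sendNative(callId: string, digits: string): Promise<void>;      // provider DTMF API
  sendInBandAudio(callId: string, digits: string): Promise<void>; // synthesized tones on the stream
}

export async function sendDtmf(transport: DtmfTransport, callId: string, digits: string) {
  try {
    // Strategy 1: the provider's native API.
    await transport.sendNative(callId, digits);
  } catch {
    // Strategy 2: generate the tones ourselves and play them as audio.
    await transport.sendInBandAudio(callId, digits);
  }
}
```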

Architecting for Resilience and Scale

With external dependencies handled, we had to face the next biggest threat to uptime: our own deployments. 

A single bad deploy can wipe out hard-earned reliability. Deploying directly to production is playing with fire.

To solve this, we moved from a single production cluster to a multi-cluster architecture with a dedicated Canary Manager service.

When we release a new version:

  1. It's deployed to a "canary" cluster with no traffic.
  2. Our Canary Manager slowly "drips" traffic to it, starting with as little as 5% of requests.
  3. It monitors error rates and API health. If anything looks wrong, it automatically rolls back.
  4. If the new code is stable, the Canary Manager gradually increases the traffic until the canary becomes the new production cluster.

This automated, safety-first approach to deployments was a massive step forward.
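
A simplified sketch of the kind of traffic-drip loop described above; the control interface, step sizes, bake time, and error threshold are illustrative, not the actual Canary Manager.

```typescript
// Simplified sketch of a canary traffic-drip loop.
const TRAFFIC_STEPS = [5, 10, 25, 50, 100]; // percent of requests on the canary
const ERROR_RATE_THRESHOLD = 0.01;          // roll back above 1% errors
const BAKE_TIME_MS = 5 * 60_000;            // let each step settle before judging it

interface CanaryControls {
  setTrafficPercent(pct: number): Promise<void>;
  errorRate(): Promise<number>;
  rollback(): Promise<void>;
  promote(): Promise<void>;
}

export async function promoteCanary(canary: CanaryControls): Promise<void> {
  for (const pct of TRAFFIC_STEPS) {
    await canary.setTrafficPercent(pct);
    await new Promise((resolve) => setTimeout(resolve, BAKE_TIME_MS));

    if ((await canary.errorRate()) > ERROR_RATE_THRESHOLD) {
      // Anything looks wrong: shift traffic back automatically.
      await canary.rollback();
      return;
    }
  }
  // Stable at 100%: the canary becomes the new production cluster.
  await canary.promote();
}
```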

Handling Spikes with Lambda "Burst Workers"

Voice traffic is incredibly spiky. A customer can launch a call campaign and go from 10 to 10,000 concurrent calls in minutes. Our Kubernetes cluster can autoscale, but VMs take time to spin up.

To bridge that gap, we built Lambda burst workers. When our worker scaler detects a backlog, AWS Lambda functions fire instantly to handle overflow.

The challenge was that these Lambda functions exist outside our secure private network.

We solved it with a custom QUIC-based proxy that creates a secure, low-latency tunnel from the Lambda environment into our Kubernetes cluster. 

This lets burst workers talk to Redis and our databases as if they were local, so even massive spikes go unnoticed by the end user.
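
The QUIC proxy itself is beyond a short sketch, but the burst trigger is straightforward. Here's an illustrative version using the AWS SDK, assuming a backlog metric and a hypothetical "call-burst-worker" Lambda function; the threshold and batch sizing are made up for the example.

```typescript
// Sketch only: fire Lambda "burst workers" when the job backlog grows faster
// than Kubernetes can add nodes.
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});
const BACKLOG_THRESHOLD = 100; // queued jobs before we start bursting
const JOBS_PER_WORKER = 10;

export async function maybeBurst(backlogDepth: number): Promise<void> {
  if (backlogDepth <= BACKLOG_THRESHOLD) return;

  // Async ("Event") invocations start in milliseconds, bridging the gap
  // while new Kubernetes nodes are still spinning up.
  const burstCount = Math.ceil((backlogDepth - BACKLOG_THRESHOLD) / JOBS_PER_WORKER);
  await Promise.all(
    Array.from({ length: burstCount }, () =>
      lambda.send(
        new InvokeCommand({
          FunctionName: "call-burst-worker", // hypothetical overflow worker
          InvocationType: "Event",
          Payload: Buffer.from(JSON.stringify({ reason: "backlog-overflow" })),
        }),
      ),
    ),
  );
}
```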

Guaranteeing Completion With Temporal

Our infrastructure was resilient. Our deployments were safe. But what happens if a server crashes halfway through provisioning a phone number?

These business-logic failures are just as damaging as infrastructure outages.

To solve this, we integrated Temporal, a durable execution platform. 

It allows us to write "workflows" that are guaranteed to run to completion, with state automatically saved and retried across failures.

We wrapped all of our most critical business logic in Temporal workflows:

  • Stripe payment handling
  • Phone number purchasing
  • User creation

This ensures that our core business operations are just as fault-tolerant as our infrastructure.
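
For example, phone number provisioning as a Temporal workflow might look roughly like this. Temporal's TypeScript SDK and proxyActivities API are real; the activity names, timeouts, and retry policy here are assumptions, not Vapi's actual workflow.

```typescript
// Illustrative Temporal workflow for phone number provisioning.
import { proxyActivities } from "@temporalio/workflow";

// Activities run on workers outside the workflow sandbox and are retried
// automatically according to the policy below. (Hypothetical interface.)
interface Activities {
  purchaseNumberFromCarrier(areaCode: string): Promise<string>;
  configureInboundRouting(phoneNumber: string): Promise<void>;
  attachNumberToAccount(accountId: string, phoneNumber: string): Promise<void>;
}

const acts = proxyActivities<Activities>({
  startToCloseTimeout: "1 minute",
  retry: { maximumAttempts: 5 },
});

// If a worker crashes between steps, Temporal replays the workflow and
// resumes from the last completed activity instead of leaving a
// half-provisioned number behind.
export async function provisionPhoneNumber(accountId: string, areaCode: string): Promise<string> {
  const phoneNumber = await acts.purchaseNumberFromCarrier(areaCode);
  await acts.configureInboundRouting(phoneNumber);
  await acts.attachNumberToAccount(accountId, phoneNumber);
  return phoneNumber;
}
```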

Additional Reliability Patterns

Beyond these major pillars, we've implemented several other critical reliability patterns:

  • Process Isolation: Every call runs in its own isolated virtual machine with dedicated compute resources. This prevents "noisy neighbor" issues where one problematic call could impact others.
  • Circuit Breakers: We use circuit breaker patterns throughout our voice pipeline to prevent cascading failures. When a component starts failing, the circuit breaker stops sending requests to it, allowing it to recover (a minimal sketch follows this list).
  • Comprehensive Monitoring: Our monitoring evolved from simple uptime checks to sophisticated error rate analysis, performance tracking across multiple AI providers, and real-time alerting that can trigger automatic remediation.
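
A minimal circuit breaker along the lines of the bullet above; the failure threshold and cool-down window are illustrative.

```typescript
// Minimal circuit breaker sketch: after enough consecutive failures the
// breaker "opens" and short-circuits calls so the failing component can recover.
export class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxFailures = 5,
    private readonly resetAfterMs = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetAfterMs) {
        throw new Error("circuit open: skipping call to failing component");
      }
      // Half-open: allow one trial request through after the cool-down.
      this.openedAt = null;
    }
    try {
      const result = await fn();
      this.failures = 0; // a success closes the breaker
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapping a dependency call is then just breaker.call(() => provider.chat(messages)).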

The Results

The transformation was measurable:

Before (99.9% era):

  • Frequent dropped calls, especially during traffic spikes
  • Single points of failure in database and SIP infrastructure
  • Manual rollback procedures taking 15+ minutes
  • A provider outage meant a Vapi outage

After (99.99% era):

  • A 97% reduction in dropped calls
  • Multi-region database fallbacks with <5 second failover
  • Automated canary rollbacks in under 60 seconds
  • Provider outages are handled by automatic failover chains

Reaching 99.99% was a mix of dozens of architectural changes, new systems, and safety mechanisms. It’s the foundation developers need to build applications they can actually trust.

The most sophisticated AI means nothing if the call drops.


If you're an engineer who reads a post like this and gets excited, you're the type of person we're looking for.

We're hiring. See our open roles and the problems you could be solving here.
