
At Vapi, we're obsessed with the quality of conversation. But the most human-like AI is useless if the line drops.
A while ago, we were at a respectable ~99.9% uptime. That's about 8 hours of downtime a year. That was still a lot, so we set an ambitious goal: 99.99% uptime, which is less than an hour of downtime per year.
But getting from 99.9% to 99.99% is a different game. We didn't have to do one thing 100x better; we had to do a hundred things 1% better.

It required a top-to-bottom re-architecture.
That effort was led by our core team: Tejas, Sagar, and Bryant.
This is how they did it.

Our journey started with the database. Our initial setup with Supabase was great for getting started, but it became a single point of failure at scale.
We migrated our primary database to Neon for better stability. But a better database wasn't enough. We needed redundancy. We engineered a fallback system to AWS Aurora. If Neon has an issue, we fail over with minimal disruption.
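In spirit, the failover is the same pattern we use everywhere: try the primary, fall back to the standby. Here's a minimal sketch (not our actual code) assuming two node-postgres pools, one pointed at Neon and one at Aurora:

```typescript
import { Pool } from "pg";

// Primary (Neon) and standby (Aurora) pools -- the env var names are placeholders.
const neon = new Pool({ connectionString: process.env.NEON_DATABASE_URL });
const aurora = new Pool({ connectionString: process.env.AURORA_DATABASE_URL });

// Run a query against Neon first; if it errors, retry once against Aurora.
export async function queryWithFailover<T>(
  sql: string,
  params: unknown[] = [],
): Promise<T[]> {
  try {
    const { rows } = await neon.query(sql, params);
    return rows as T[];
  } catch (err) {
    console.warn("Neon query failed, failing over to Aurora", err);
    const { rows } = await aurora.query(sql, params);
    return rows as T[];
  }
}
```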
We also added guardrails: aggressive statement timeouts to kill rogue queries, partitioning on our largest tables to keep them fast, and a caching layer in front of the database.
That cache now serves 80% of eligible requests, so four out of every five reads it fronts never touch the database at all, making the entire platform faster and more resilient.
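A minimal sketch of those two guardrails, assuming node-postgres and ioredis (the table, key names, timeout, and TTL are illustrative, not our production values):

```typescript
import { Pool } from "pg";
import Redis from "ioredis";

// statement_timeout kills any query that runs longer than 5 seconds.
const db = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 5_000,
});
const cache = new Redis(process.env.REDIS_URL!);

// Cache-aside read: serve from Redis when possible, otherwise hit Postgres and populate the cache.
export async function getAssistant(id: string): Promise<unknown> {
  const key = `assistant:${id}`;
  const cached = await cache.get(key);
  if (cached) return JSON.parse(cached);

  const { rows } = await db.query("SELECT * FROM assistants WHERE id = $1", [id]);
  await cache.set(key, JSON.stringify(rows[0]), "EX", 60); // 60s TTL, illustrative
  return rows[0];
}
```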
On the telephony side, our SIP stack was running on a single EC2 instance that capped out at around 200 concurrent calls.
We moved it to auto-scaling groups (ASGs) for each component.
This eliminated the bottleneck and let us handle sudden traffic spikes without dropping calls.
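To give a flavor of the change (this is a generic AWS CDK sketch, not our actual infrastructure code, and the instance sizes and limits are illustrative), each SIP component gets its own auto-scaling group with a target-tracking policy instead of a single fixed instance:

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as autoscaling from "aws-cdk-lib/aws-autoscaling";
import { Construct } from "constructs";

export class SipStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "SipVpc");

    // One ASG per SIP component instead of one EC2 instance for everything.
    const sipAsg = new autoscaling.AutoScalingGroup(this, "SipProxyAsg", {
      vpc,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.C5, ec2.InstanceSize.LARGE),
      machineImage: ec2.MachineImage.latestAmazonLinux2(),
      minCapacity: 2,  // always keep headroom
      maxCapacity: 50, // illustrative ceiling
    });

    // Scale out before instances saturate, so spikes don't drop calls.
    sipAsg.scaleOnCpuUtilization("CpuScaling", { targetUtilizationPercent: 60 });
  }
}
```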
We thought we were on our way. We were wrong.

Our own infrastructure was becoming more stable, but our agents still felt unreliable.
The problem was our dependencies. Now we treat every external provider as a potential point of failure.
For AI models, we fall back through a chain of providers. If OpenAI fails, we instantly retry on Azure OpenAI. If that is down, we jump to AWS Bedrock. The switch happens inside the request lifecycle, so a provider outage never becomes a Vapi outage.
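Conceptually, the chain looks something like this sketch (the provider clients are stand-ins, not our production code):

```typescript
type ChatRequest = { messages: { role: string; content: string }[] };
type ChatResponse = { text: string; provider: string };

// Each provider is just an async function; the real OpenAI, Azure OpenAI, and
// Bedrock clients would sit behind this interface.
type Provider = { name: string; complete: (req: ChatRequest) => Promise<ChatResponse> };

// Try providers in order inside the same request lifecycle; only throw if all fail.
export async function completeWithFallback(
  providers: Provider[],
  req: ChatRequest,
): Promise<ChatResponse> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider.complete(req);
    } catch (err) {
      lastError = err;
      console.warn(`${provider.name} failed, falling back to next provider`, err);
    }
  }
  throw new Error(`All model providers failed: ${String(lastError)}`);
}

// The array ordering encodes the chain: OpenAI -> Azure OpenAI -> AWS Bedrock.
```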
We carry this "fallbacks everywhere" philosophy to other critical systems as well. Our DTMF (touch-tone) transmission uses a dual-strategy approach: we try the provider's native API first, but if that fails, we automatically fall back to sending the tones as audio signals.
This gives us a 98%+ overall success rate for navigating automated phone systems, even when individual methods fail.
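The DTMF logic has the same shape. A simplified sketch, where the two senders are hypothetical helpers rather than a real telephony SDK:

```typescript
// Hypothetical senders: one calls the telephony provider's native DTMF API,
// the other plays the tones as in-band audio on the call's media stream.
type DtmfSender = (callId: string, digits: string) => Promise<void>;

export async function sendDtmf(
  callId: string,
  digits: string,
  sendViaProviderApi: DtmfSender,
  sendAsAudioTones: DtmfSender,
): Promise<"native" | "audio"> {
  try {
    await sendViaProviderApi(callId, digits); // strategy 1: provider's native API
    return "native";
  } catch {
    await sendAsAudioTones(callId, digits);   // strategy 2: in-band audio fallback
    return "audio";
  }
}
```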

With external dependencies handled, we had to face the next biggest threat to uptime: our own deployments.
A single bad deploy can wipe out hard-earned reliability. Deploying directly to production is playing with fire.
To solve this, we moved from a single production cluster to a multi-cluster architecture with a dedicated Canary Manager service.
When we release a new version, the Canary Manager rolls it out to a canary cluster with a small share of traffic first, and only promotes it to the rest of the fleet once it proves healthy.
This automated, safety-first approach to deployments was a massive step forward.
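In spirit, the promotion decision is a small loop like this sketch (the hooks, thresholds, and bake times are hypothetical, not the real Canary Manager):

```typescript
// Hypothetical hooks into our metrics and traffic routing -- names are illustrative.
interface CanaryHooks {
  errorRate: (cluster: string) => Promise<number>;          // e.g. fraction of failed calls
  shiftTraffic: (cluster: string, percent: number) => Promise<void>;
  rollback: (cluster: string) => Promise<void>;
}

// Gradually shift traffic to the canary cluster, rolling back at the first sign of trouble.
export async function runCanary(hooks: CanaryHooks, canaryCluster: string): Promise<boolean> {
  const steps = [1, 5, 25, 50, 100]; // percent of traffic, illustrative
  const maxErrorRate = 0.01;         // 1% threshold, illustrative

  for (const percent of steps) {
    await hooks.shiftTraffic(canaryCluster, percent);
    await new Promise((r) => setTimeout(r, 5 * 60_000)); // bake time per step

    if ((await hooks.errorRate(canaryCluster)) > maxErrorRate) {
      await hooks.rollback(canaryCluster);
      return false; // a bad deploy never reaches full traffic
    }
  }
  return true; // promoted to 100%
}
```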
Voice traffic is incredibly spiky. A customer can launch a call campaign and go from 10 to 10,000 concurrent calls in minutes. Our Kubernetes cluster can autoscale, but VMs take time to spin up.
To bridge that gap, we built Lambda burst workers. When our worker scaler detects a backlog, AWS Lambda functions fire instantly to handle overflow.
The challenge was that these Lambda functions exist outside our secure private network.
We solved it with a custom QUIC-based proxy that creates a secure, low-latency tunnel from the Lambda environment into our Kubernetes cluster.
This lets burst workers talk to Redis and our databases as if they were local, so even massive spikes go unnoticed by the end user.
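Here's a rough sketch of the burst trigger, assuming the backlog sits in a Redis queue and using the AWS SDK v3 Lambda client (the function name, queue key, and thresholds are illustrative):

```typescript
import Redis from "ioredis";
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const redis = new Redis(process.env.REDIS_URL!);
const lambda = new LambdaClient({});

const BACKLOG_THRESHOLD = 100; // jobs waiting before we burst, illustrative

// Called periodically by the worker scaler: if the queue is backing up faster than
// Kubernetes can add VMs, fire async Lambda invocations to absorb the overflow.
export async function maybeBurst(queueKey: string): Promise<void> {
  const backlog = await redis.llen(queueKey);
  if (backlog < BACKLOG_THRESHOLD) return;

  const burstCount = Math.ceil(backlog / BACKLOG_THRESHOLD);
  await Promise.all(
    Array.from({ length: burstCount }, () =>
      lambda.send(
        new InvokeCommand({
          FunctionName: "burst-worker", // hypothetical function name
          InvocationType: "Event",      // fire-and-forget
          Payload: Buffer.from(JSON.stringify({ queueKey })),
        }),
      ),
    ),
  );
}
```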

Our infrastructure was resilient. Our deployments were safe. But what happens if a server crashes halfway through provisioning a phone number?
These business-logic failures are just as damaging as infrastructure outages.
To solve this, we integrated Temporal, a durable execution platform.
It allows us to write "workflows" that are guaranteed to run to completion, with state automatically saved and retried across failures.
We wrapped our most critical business logic, like phone number provisioning, in Temporal workflows.
This ensures that our core business operations are just as fault-tolerant as our infrastructure.
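To give a flavor of what that looks like with Temporal's TypeScript SDK (the activity names below are illustrative, modeled on the phone number example above):

```typescript
import { proxyActivities } from "@temporalio/workflow";
// Hypothetical activities module; each activity is an ordinary async function
// that talks to the telephony provider, the database, etc.
import type * as activities from "./activities";

const { purchaseNumber, configureSipRouting, saveNumberToDb } = proxyActivities<typeof activities>({
  startToCloseTimeout: "1 minute",
  retry: { maximumAttempts: 5 },
});

// If the worker running this crashes at any step, Temporal replays the workflow
// on another worker and resumes exactly where it left off.
export async function provisionPhoneNumber(orgId: string, areaCode: string): Promise<string> {
  const number = await purchaseNumber(areaCode);
  await configureSipRouting(number);
  await saveNumberToDb(orgId, number);
  return number;
}
```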
Beyond these major pillars, we've implemented several other critical reliability patterns across the stack.

The transformation was measurable: we went from the roughly eight hours of potential downtime a year that 99.9% allows to under an hour at 99.99%.
Reaching 99.99% took dozens of architectural changes, new systems, and safety mechanisms. It’s the foundation developers need to build applications they can actually trust.
The most sophisticated AI means nothing if the call drops.
If you're an engineer who reads a post like this and gets excited, you're the type of person we're looking for.
We're hiring. See our open roles and the problems you could be solving here.