
The decision between an integrated voice AI solution and an open, modular stack is one of the more consequential architectural choices teams are making as they adopt voice AI. The right answer often depends on the use case you’re building. To pressure-test our thinking, we recently sat down with Iz Shalom from Cartesia, whose team works with companies at every stage of voice AI implementation, from PoC to production at scale. The questions below are the ones we keep coming back to, with perspective from both sides of the conversation.
"How should I think about building my voice stack?"
Integrated stacks can be the right choice when your use case is narrow, or when you need to ship a proof of concept fast. Modular open stacks are a better fit when building multiple use cases, supporting many languages, or anticipating that you'll want full control to swap components as the technology evolves or you expand your use cases. As Iz Shalom from Cartesia frames it, neither approach is universally better; the cost of getting the choice wrong shows up much later, once you've built infrastructure assuming the wrong shape.
The complication is that most teams won't know on day one how far they will go with voice agents, which makes this decision hard to get right upfront. It still matters, though: PoC ceilings arrive faster than you expect, and once you hit them, an open stack gives you room to adapt and expand as your use of voice AI evolves.
"What should I actually be evaluating before I put a voice agent in production?"
Most teams fixate on the wrong layer. They compare TTS samples by hitting play buttons, they chase vendor latency numbers, and they swap models based on vibes. The teams that ship fast and stay in production do something different: they decide what they're optimizing for before they touch the stack and test it accordingly.
Start with latency, but measure it honestly. Every vendor publishes a P50 number: median latency, performance under normal circumstances. Your customers don't experience P50; they experience P99, latency under the most complex circumstances. In natural human dialogue, the pause between speakers averages around 180ms. Most production deployments are well above that, and once round-trip latency drifts past 3–4 seconds, users disengage even though the call is technically still live. Vendor benchmarks are accurate in controlled conditions, less so in the wild. Test latency in your environment, on your infrastructure, with your tool calls in the loop.
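As a minimal sketch of what "measure it honestly" can look like, assuming a hypothetical `run_turn` callable that drives one complete agent turn through your own stack:

```python
import statistics
import time

def measure_round_trips(run_turn, n_calls: int = 200) -> dict:
    """Time full round trips through your own stack (STT -> LLM -> TTS,
    tool calls included) and report the tail, not just the median."""
    samples_ms = []
    for _ in range(n_calls):
        start = time.perf_counter()
        run_turn()  # one complete agent turn against your real infrastructure
        samples_ms.append((time.perf_counter() - start) * 1000)

    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],  # the number vendors quote
        "p99_ms": cuts[98],  # the number your callers actually feel
        "max_ms": max(samples_ms),
    }
```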
Then get the evaluation of your models right. The most sophisticated teams can evaluate a new model end-to-end in about 24 hours because they've already locked in a primary metric (containment rate, task success, customer satisfaction) and wired automated testing around it with clear success criteria. Without that, every component swap becomes a multi-week judgment call, and the speed advantage of running a modular stack disappears.
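Here's a sketch of what that discipline can look like, assuming a hypothetical `run_scenario` hook that replays one scripted conversation and reports its outcome; the metric names and threshold are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    containment_rate: float  # the locked-in primary metric
    task_success: float
    passed: bool

def evaluate_candidate(run_scenario: Callable[[dict], dict],
                       scenarios: list[dict],
                       min_containment: float = 0.85) -> EvalResult:
    """Replay a fixed scenario suite against a candidate model and grade it
    against pre-agreed criteria, so a swap is a pass/fail check rather than
    a multi-week judgment call."""
    outcomes = [run_scenario(s) for s in scenarios]
    contained = sum(o["contained"] for o in outcomes) / len(outcomes)
    done = sum(o["task_done"] for o in outcomes) / len(outcomes)
    return EvalResult(contained, done, passed=contained >= min_containment)
```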
Apply the same rigor to voice selection. The model needs enough input to capture not just timbre but the way someone speaks in a specific context; a call center agent sounds like a different person when they pick up a personal call mid-shift. Clone for the context you'll actually deploy in, and evaluate voices conversationally: stand up an agent and talk to it for 10 minutes. The real differences surface around the sixth or seventh turn, not the first.
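One way to make that repeatable, sketched with a hypothetical `agent_turn` hook that returns the reply text and synthesized audio for each turn:

```python
from pathlib import Path

def conversational_voice_check(agent_turn, script: list[str],
                               out_dir: str = "voice_eval") -> list[tuple[str, str]]:
    """Drive a scripted multi-turn conversation and keep every reply's audio
    for side-by-side review; play-button samples hide the drift that shows
    up deep into a conversation."""
    Path(out_dir).mkdir(exist_ok=True)
    history = []
    for turn, utterance in enumerate(script, start=1):
        reply_text, reply_audio = agent_turn(history, utterance)
        history.append((utterance, reply_text))
        # Save per-turn audio so turns six and seven can be compared directly.
        Path(out_dir, f"turn_{turn:02d}.wav").write_bytes(reply_audio)
    return history
```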
"How do you keep up as new models keep shipping?"
If a better STT, LLM, or TTS model drops every few weeks, what does an open stack actually let you do that an integrated platform doesn't?
The mechanical answer is easy: with an open stack, swapping in a new provider at any layer is a seamless configuration change, not a re-architecture. The harder answer is that the swap only matters if you can tell whether the new model is actually better for your use case. Public benchmarks won't tell you that; evaluations will. The teams that get the most value from an open stack are the ones that build a tight evaluation loop, as covered above.
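As a rough illustration of what "configuration change" means in practice (the provider names and registry shape are hypothetical, not any specific platform's API):

```python
STACK_CONFIG = {
    "stt": {"provider": "stt_vendor_a", "language": "en"},
    "llm": {"provider": "llm_vendor_b", "model": "frontier-large"},
    "tts": {"provider": "tts_vendor_c", "voice": "support_agent"},
}

def build_stack(config: dict, registry: dict) -> dict:
    """Resolve each layer from a registry of client factories; swapping a
    provider is a one-line config edit, and the eval harness above decides
    whether the swap ships."""
    stack = {}
    for layer, spec in config.items():
        kwargs = {k: v for k, v in spec.items() if k != "provider"}
        stack[layer] = registry[layer][spec["provider"]](**kwargs)
    return stack
```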
A practical pattern worth borrowing from teams running at scale: keep two providers warm at every layer, with the second shadowing production traffic, so that when a new model ships you already have an A/B harness wired up.
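A minimal sketch of that shadowing pattern at the STT layer, assuming hypothetical provider clients with a `transcribe` method and some comparison log:

```python
def transcribe_with_shadow(audio_chunk, primary_stt, shadow_stt, log) -> str:
    """Serve the caller from the primary provider while mirroring the same
    input to a warm shadow provider; only the primary's output reaches the
    user, but both results land in the comparison log."""
    primary_text = primary_stt.transcribe(audio_chunk)
    try:
        # In production you'd fire this off the hot path (e.g. a task queue)
        # so the shadow call never adds latency to the live turn.
        shadow_text = shadow_stt.transcribe(audio_chunk)
        log.record(primary=primary_text, shadow=shadow_text)
    except Exception:
        pass  # a shadow failure must never affect the live call
    return primary_text
```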
"Is multilingual a reason to go open stack?"
Yes, and arguably the strongest reason if you're serving customers in any language other than English. Performance in one language guarantees almost nothing about performance in another: every language is a separate test, and a model that is great in English might struggle in French. Integrated platforms tend to push you toward whichever provider their default stack best supports, which usually means English plus a handful of major European languages, with everything else an afterthought.
The open stack lets you mix providers by language: use one STT provider for English calls and swap to a different one for Spanish or Hindi, where its accuracy and authenticity are stronger. This routing lives in the orchestration layer, exactly the place an integrated platform doesn't expose. If your roadmap includes languages outside the typical English/Spanish/French/German/Portuguese/Mandarin set, layer-by-layer choice becomes imperative.
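In code, that routing can be as small as a per-language lookup in the orchestration layer (the vendor names here are placeholders):

```python
STT_BY_LANGUAGE = {
    "en": "stt_vendor_a",   # strongest English accuracy in your own evals
    "es": "stt_vendor_b",   # better Spanish accuracy and authenticity
    "hi": "stt_vendor_b",
}

def pick_stt(detected_language: str, default: str = "stt_vendor_a") -> str:
    """Route each call's transcription to whichever provider your
    per-language evals showed is strongest."""
    return STT_BY_LANGUAGE.get(detected_language, default)
```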
"What happens when one layer of the stack goes down or underperforms?"
Every additional dependency is another thing that can fail. Integrated platforms hide this from you, but they don't eliminate it; they just don't tell you when it's happening.
For an open stack to be production-ready, two things need to be in place:
Model fallbacks. Two STT providers, two TTS providers, and ideally two LLM providers behind a router that fails over on latency spikes or errors, so that when one provider fails you already have a backup in the call path (see the sketch after this list).
Conversational fallbacks. When all else fails, the agent should degrade gracefully ("I'm having trouble hearing you — could you say that again?") rather than going silent or hanging up and leaving your customers confused.
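A minimal sketch covering both fallback layers, assuming provider clients are plain callables; a production router would enforce the latency budget with a real timeout and track health over a window rather than per call:

```python
import time

FALLBACK_LINE = "I'm having trouble hearing you — could you say that again?"

class FailoverRouter:
    """Try the primary provider first; fail over to the backup on errors or
    latency spikes, and degrade to a graceful line if both are down."""

    def __init__(self, primary, backup, max_latency_s: float = 2.0):
        self.primary, self.backup = primary, backup
        self.max_latency_s = max_latency_s

    def call(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            result = self.primary(*args, **kwargs)
            if time.perf_counter() - start <= self.max_latency_s:
                return result  # healthy primary: use it
        except Exception:
            pass  # fall through to the backup
        try:
            return self.backup(*args, **kwargs)
        except Exception:
            return FALLBACK_LINE  # conversational fallback, never dead air
```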
"What changes when you go from one use case to many?"
A lot of teams start by building a single-purpose voice agent (a booking flow, an FAQ bot, a lead qualifier) and then realize the same call needs to handle three or four different intents. This is where single-prompt assistants typically break down: cram every instruction into one system prompt and the model loses focus, with reliability dropping exactly at the point where you wanted to scale.
The architectural answer is multi-assistant orchestration, which Vapi calls Squads. Instead of one bloated agent, you build a small team of specialists (greeter, qualifier, booker, support agent) and define handoff rules between them. Each specialist has their own prompt, voice, model, and tools. High-complexity tasks route to a frontier model; basic FAQs route to a faster, cheaper one. You get both better reliability and better unit economics at scale.
This is where the open stack's flexibility starts to compound. Different agents in the same squad can use different LLMs, voices, and transcribers, each evaluated for its specific job, rather than forcing a single stack to handle everything.
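A hypothetical squad definition in that spirit; the shape is illustrative, not Vapi's actual configuration schema:

```python
SQUAD = {
    "greeter": {
        "llm": "fast-small",          # cheap model for simple intents
        "voice": "voice_friendly_en",
        "handoffs": {"wants_booking": "booker", "needs_support": "support"},
    },
    "booker": {
        "llm": "frontier-large",      # high-complexity task -> frontier model
        "voice": "voice_friendly_en",
        "tools": ["calendar_lookup", "create_booking"],
        "handoffs": {"done": "greeter"},
    },
    "support": {
        "llm": "fast-small",          # basic FAQs stay on the cheap model
        "voice": "voice_friendly_en",
        "tools": ["kb_search"],
        "handoffs": {"needs_booking": "booker"},
    },
}

def route(current_agent: str, detected_intent: str) -> str:
    """Follow the handoff rule for the detected intent, else stay put."""
    return SQUAD[current_agent]["handoffs"].get(detected_intent, current_agent)
```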
In conclusion: The stack you choose is the ceiling you build under
The technology will keep changing: new models, new providers, new benchmarks every few weeks. The teams that stay in production and keep improving are the ones that choose an architecture flexible enough to absorb those changes, and disciplined enough to measure whether each change actually serves the use cases and goals they are trying to achieve.

*Thanks to Iz Shalom and Nick Robin for the conversation that shaped much of this post. If you want to watch the full webinar recording, [click here].*