
For any real-time voice agent, the most critical moment is the silence after a user stops speaking. The time it takes for Speech-to-Text (STT) to finalize a transcript determines whether a conversation feels natural or stilted. Slow STT breaks the illusion of a real conversation.
Today, Cartesia is releasing Ink-Whisper, a new STT model built specifically to solve this problem, and it's available on Vapi now.
Ink-Whisper is the first streaming-optimized version of Whisper, engineered from the ground up for conversational AI. While Whisper set the standard for transcription quality, it wasn't designed for the demands of real-time turn-taking. Ink-Whisper addresses this directly.
Our initial tests and Cartesia's benchmarks show a significant reduction in the time it takes to produce a completed transcript.
It's engineered to handle real-world audio conditions such as background noise and varied accents, and it delivers a complete transcript faster than most streaming APIs.
Better, faster STT is a foundational improvement for any voice application. With Ink-Whisper on Vapi, you can immediately test how this new level of performance impacts your agents.
Because Vapi handles the entire audio pipeline, you can test Ink-Whisper's impact on your agent just by changing a single line in your config. There's no need to build a new integration or change your infrastructure.
This is why we built Vapi as a model-agnostic platform: to give developers the ability to experiment with and deploy the best new models the moment they are released.

If you're already building on Vapi, you can start using Ink-Whisper immediately by switching your assistant's transcriber to Cartesia's Ink-Whisper model.
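As a rough illustration of what that one-line change looks like, the sketch below takes an existing assistant config and returns a copy with only the transcriber swapped. The config shape and the values `provider: "cartesia"` and `model: "ink-whisper"` are assumptions for illustration, and the helper `useInkWhisper` is hypothetical; check Vapi's assistant and transcriber documentation for the exact field names.

```typescript
// Hypothetical shape of the transcriber block in an assistant config.
// Field names and the "cartesia" / "ink-whisper" values are assumptions.
interface TranscriberConfig {
  provider: string;
  model: string;
}

interface AssistantConfig {
  name: string;
  transcriber: TranscriberConfig;
  // Other assistant settings (LLM, voice, prompts, ...) pass through untouched.
  [key: string]: unknown;
}

// Hypothetical helper: return a copy of the config with only the transcriber replaced.
function useInkWhisper(config: AssistantConfig): AssistantConfig {
  return {
    ...config,
    transcriber: { provider: "cartesia", model: "ink-whisper" },
  };
}

// Placeholder values standing in for whatever your agent currently uses.
const current: AssistantConfig = {
  name: "support-agent",
  transcriber: { provider: "current-provider", model: "current-model" },
  voice: "unchanged-voice-settings",
};

const updated = useInkWhisper(current);
console.log(updated.transcriber);
```

Keeping the change as a pure function that copies the config makes it easy to run the old and new transcriber side by side in an A/B test without mutating your existing assistant definition.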
That's it. Your agent will now use Ink-Whisper for transcription. We encourage you to run your own tests against your current default STT provider to measure the impact on latency and user experience for your specific use case.
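One way to run that comparison is to log, for each conversational turn, when the user stopped speaking and when the final transcript arrived, then compare the distribution of that gap across providers. The sketch below assumes you can extract those two timestamps (in milliseconds) from your own call logs; the `TurnTiming` field names are hypothetical. It reports the median and the 95th percentile using the nearest-rank method, since occasional slow turns are often what users notice most.

```typescript
// One logged turn. Field names are hypothetical; map them from your own logs.
interface TurnTiming {
  speechEndMs: number;       // when the user stopped speaking, in ms
  finalTranscriptMs: number; // when the final transcript arrived, in ms
}

// Per-turn finalization latency: end of speech to final transcript.
function finalizationLatencies(turns: TurnTiming[]): number[] {
  return turns.map((t) => t.finalTranscriptMs - t.speechEndMs);
}

// Nearest-rank percentile for p in [0, 100] over a non-empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize one provider's turns as median and tail latency.
function summarize(turns: TurnTiming[]): { p50: number; p95: number } {
  const latencies = finalizationLatencies(turns);
  return { p50: percentile(latencies, 50), p95: percentile(latencies, 95) };
}
```

Run `summarize` separately on turns handled by your current transcriber and by Ink-Whisper, using the same test scripts and audio conditions, so the comparison reflects the transcriber rather than differences in the calls.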
If you have questions or need support, our team is available on Discord.