stormy-gold•6h ago

Custom STT, TTS, VAD problem

Hey guys, I'm having an issue integrating my custom transcriber, Yandex, with Vapi. The problem is that Yandex apparently doesn't support two channels; it is mono only. So I am splitting the two channels into two individual streaming STT sessions to Yandex. I am not sure if this is the right approach. On top of that, Yandex doesn't even provide an SDK for streaming; it only offers v2 gRPC single-channel streaming, and v3. This means I have to manage the channels myself.

I couldn't figure out how to stop the audio spillover. What I mean is: while the assistant is speaking, its audio is continuously being transcribed, and if the customer starts speaking at that moment, my server can't tell the two apart, so Yandex sends back one giant mixed transcript.

The behavior I expect is that the agent stops when the user interrupts it. Since my TTS is also going to be Yandex, I can cut the TTS and silence the agent channel whenever I detect an energy increase in the customer channel. But that doesn't stop the Vapi agent's LLM, which can lead to an inconsistency between what the user and the agent each understand. If any of you have a solution to this, please help!
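A minimal sketch of the "energy increase in the customer channel" check mentioned above. It assumes the customer channel has already been isolated as 16-bit little-endian mono PCM; `rms_pcm16`, `BargeInDetector`, the threshold of 500, and the three-chunk debounce are all hypothetical choices for illustration, not anything taken from Vapi or Yandex. It only detects the interruption; it does not address how to tell the Vapi agent's LLM to stop.

```python
import array
import math
import sys


def rms_pcm16(chunk: bytes) -> float:
    """Root-mean-square level of a 16-bit mono PCM chunk (assumed little-endian)."""
    samples = array.array("h")
    samples.frombytes(chunk)  # raises ValueError if len(chunk) is odd
    if sys.byteorder == "big":
        samples.byteswap()  # convert little-endian input to native order
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class BargeInDetector:
    """Flags a barge-in after several consecutive loud customer chunks.

    threshold and min_chunks are illustrative values and need tuning
    against real call audio.
    """

    def __init__(self, threshold: float = 500.0, min_chunks: int = 3) -> None:
        self.threshold = threshold
        self.min_chunks = min_chunks
        self._loud_run = 0  # consecutive chunks at or above the threshold

    def update(self, customer_chunk: bytes) -> bool:
        """Feed one customer-channel chunk; True means 'customer is speaking'."""
        if rms_pcm16(customer_chunk) >= self.threshold:
            self._loud_run += 1
        else:
            self._loud_run = 0  # any quiet chunk resets the debounce
        return self._loud_run >= self.min_chunks
```

Requiring a few consecutive loud chunks, rather than reacting to a single one, is a simple way to avoid cutting the TTS on a click or a cough. A plain energy threshold is still crude compared with a real VAD model.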

3 Replies

Vapi Support Bot•6h ago

Thanks for sharing the details! To better understand your setup: Are you able to programmatically mute or pause the audio sent to the Yandex STT stream for the assistant channel when you detect user speech (e.g., via VAD or energy threshold), or is the audio always mixed before reaching Yandex? This will help clarify if channel-level control is possible in your current architecture.

Sources:
- Voice pipeline configuration
- Custom transcriber


stormy-goldOP•5h ago

I did think about adding VAD / energy detection to pause the assistant channel, but the issue is deeper than that. When I tested with two concurrent STT streams, one for the customer channel (left) and one for the assistant channel (right), Yandex STT still produced mixed transcripts.

What’s happening is this: the assistant-channel STT session is always receiving audio. When the customer suddenly starts talking, their voice is also present in the assistant channel’s audio, so Yandex transcribes both voices together. As a result, the customer STT stream never gets clean audio, because by the time the customer speaks, the assistant-channel stream is already “contaminated”. Well, to be precise, the customer STT stream gets no audio at all.

So even though I have two separate sessions, it doesn’t matter, because Yandex only supports mono STT and I have to downmix each channel to mono before sending it. That means both streams see the same blended audio unless I fully mute one channel. That’s the core problem: I cannot reliably isolate customer and assistant audio before the STT step, because Yandex STT requires mono input and I can’t prevent spillover unless I disable one side entirely.

I think I have found my mistake. I had tried the separate-session-per-channel method before, but I forgot to mute the opposite channel for each session, so I was basically sending the full audio to both sessions and expecting it to work! I’ll post my progress here. Please feel free to share your opinion. Thanks!
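For anyone hitting the same thing, here is a minimal sketch of the fix described above: give each STT session only its own channel, with no downmix. It assumes the incoming audio is interleaved 16-bit stereo PCM with the customer on the left and the assistant on the right, as in my test setup; `split_stereo_pcm16` is a hypothetical helper name, and the actual audio format should be checked against the Vapi custom transcriber docs.

```python
import array


def split_stereo_pcm16(frame: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM into (left, right) mono byte strings.

    Samples are only rearranged, never interpreted, so byte order does not
    matter here.
    """
    if len(frame) % 4 != 0:
        raise ValueError("expected whole stereo frames (4 bytes per frame)")
    samples = array.array("h")
    samples.frombytes(frame)
    left = samples[0::2]   # even-indexed samples: left channel (customer, assumed)
    right = samples[1::2]  # odd-indexed samples: right channel (assistant, assumed)
    return left.tobytes(), right.tobytes()
```

Each returned byte string is already mono, so the left one can go to the customer session and the right one to the assistant session. Averaging the two channels into one mono signal is exactly what recreates the mixed transcript.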

Duckie•5h ago

Message marked as helpful by @Nathan! 🎉
