Custom Transcriber Issues
I am building a custom transcriber for Vapi using ElevenLabs Scribe v2 Realtime.
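For reference, the assistant points at my transcriber server roughly like this (the field names follow the custom transcriber config shape I'm using; the URL is a placeholder):

```typescript
// Assumed transcriber block from my assistant config — the
// "custom-transcriber" provider and server.url shape match my setup,
// but treat the exact field names as approximate.
const transcriber = {
  provider: "custom-transcriber",
  server: {
    url: "wss://my-transcriber.example.com/ws", // placeholder endpoint
  },
};
```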
The issue is that Vapi sends audio to the custom transcriber that includes both:
(1) the customer’s audio, and
(2) the assistant’s TTS audio.
Since ElevenLabs Scribe v2 Realtime does not support multichannel metadata, diarization, or channel_index in WebSocket mode, all audio is treated as a single mixed PCM stream.
As a result, the assistant's own speech gets transcribed and returned to Vapi, which interprets it as user input, producing continuous self-interruption loops.
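If the stream turns out to be interleaved stereo, my plan is to strip the assistant channel before forwarding anything to Scribe. A minimal sketch, assuming 16-bit little-endian PCM with the customer on channel 0 (both assumptions I'd like confirmed):

```typescript
// demux.ts — pull one channel out of interleaved 16-bit LE stereo PCM.
// Assumes byte layout per sample pair: [ch0 lo, ch0 hi, ch1 lo, ch1 hi].
export function extractChannel(frame: Buffer, channel: 0 | 1): Buffer {
  const pairs = Math.floor(frame.length / 4); // 4 bytes per stereo sample pair
  const mono = Buffer.alloc(pairs * 2);
  for (let i = 0; i < pairs; i++) {
    const src = i * 4 + channel * 2;
    mono[i * 2] = frame[src];         // low byte
    mono[i * 2 + 1] = frame[src + 1]; // high byte
  }
  return mono;
}
```

If Vapi instead sends mono with the TTS already mixed in, no amount of demuxing helps, which is why the questions below matter.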
I need clarification on the following:
- Does Vapi send a mixed mono/stereo PCM stream to custom transcribers, or does it send separate channels?
- If stereo: what is the exact channel mapping (which channel is the user and which is the assistant)? Given that, I would demux accordingly (see the bridge sketch below).
- If mono: is there any way to configure Vapi so that the custom transcriber receives only user audio?
- Is there any documented or undocumented setting to prevent assistant TTS audio from being forwarded to the transcriber?
- Is channel metadata supported or planned for custom transcriber mode?
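For completeness, here is the bridge I'm testing while I wait for answers. The handling of the initial control message, the ElevenLabs realtime URL and audio framing, and the transcriber-response shape Vapi expects are all assumptions on my part; corrections welcome:

```typescript
// bridge.ts — rough Vapi <-> Scribe bridge (a sketch, not production code).
import WebSocket, { WebSocketServer } from "ws";
import { extractChannel } from "./demux"; // helper from the sketch above

const server = new WebSocketServer({ port: 8080 });

server.on("connection", (vapi) => {
  // Placeholder endpoint — substitute the real Scribe v2 Realtime URL.
  const scribe = new WebSocket(
    "wss://api.elevenlabs.io/v1/speech-to-text/realtime",
    { headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "" } },
  );

  let channels = 1; // updated from the initial JSON message, if present

  vapi.on("message", (data, isBinary) => {
    if (!isBinary) {
      // Vapi reportedly opens with a JSON "start" message; log it to
      // confirm sampleRate/channels rather than guessing.
      const msg = JSON.parse(data.toString());
      console.log("control message from Vapi:", msg);
      if (typeof msg.channels === "number") channels = msg.channels;
      return;
    }
    if (scribe.readyState !== WebSocket.OPEN) return;
    // If stereo, keep only the (assumed) customer channel so assistant
    // TTS never reaches Scribe; if mono, there is no way to separate them.
    const pcm = channels === 2 ? extractChannel(data as Buffer, 0) : (data as Buffer);
    // NOTE: the exact audio framing (raw binary vs. base64-in-JSON) must
    // follow the ElevenLabs realtime docs — a raw send is a guess here.
    scribe.send(pcm);
  });

  scribe.on("message", (raw) => {
    const result = JSON.parse(raw.toString());
    if (!result.text) return;
    // Assumed response shape for Vapi's custom transcriber protocol.
    vapi.send(JSON.stringify({
      type: "transcriber-response",
      transcription: result.text,
      channel: "customer",
    }));
  });

  vapi.on("close", () => scribe.close());
});
```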