dry-scarletD

Custom Transcriber Issues

I am building a custom transcriber for Vapi using ElevenLabs Scribe v2 Realtime.
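For context, the bridge is just a WebSocket server that relays Vapi's audio frames to a Scribe v2 Realtime session and returns the text. A simplified sketch of that setup (the Scribe endpoint, its response parsing, and the exact response shape Vapi expects are placeholders here, not authoritative):

```ts
import { WebSocketServer, WebSocket, RawData } from "ws";

// Placeholder endpoint and response parsing -- the real Scribe v2 Realtime URL
// and message schema come from the ElevenLabs docs, not from this sketch.
const SCRIBE_URL = "wss://example.invalid/scribe-v2-realtime";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (vapiSocket) => {
  const scribeSocket = new WebSocket(SCRIBE_URL);

  vapiSocket.on("message", (data: RawData, isBinary: boolean) => {
    if (isBinary) {
      // Raw PCM from Vapi -- currently a single mix of customer + assistant TTS.
      if (scribeSocket.readyState === WebSocket.OPEN) {
        scribeSocket.send(data);
      }
    } else {
      // JSON control messages from Vapi (e.g. the initial start/format message).
      console.log("Vapi control message:", data.toString());
    }
  });

  scribeSocket.on("message", (data: RawData) => {
    // Scribe transcribes everything it hears, including the assistant's TTS,
    // and that text goes straight back to Vapi as if the user had said it.
    const text = JSON.parse(data.toString()).text ?? "";
    vapiSocket.send(
      JSON.stringify({ type: "transcriber-response", transcription: text })
    );
  });

  vapiSocket.on("close", () => scribeSocket.close());
});
```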

The issue is that Vapi sends audio to the custom transcriber that includes both:
(1) the customer’s audio, and
(2) the assistant’s TTS audio.

Since ElevenLabs Scribe v2 Realtime does not support multichannel metadata, diarization, or channel_index in WebSocket mode, all audio is treated as a single mixed PCM stream.

This causes the assistant’s own spoken audio to be transcribed and returned to Vapi, which Vapi then interprets as user input, resulting in continuous self-interruption loops.
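If the answer to questions 1–2 below is that the stream is 16-bit interleaved stereo, the obvious workaround would be to deinterleave and forward only the customer channel to Scribe. A sketch under that assumption (the channel mapping constant is a guess, which is exactly what I need confirmed):

```ts
// Assumed workaround if the stream is 16-bit interleaved stereo PCM.
// CUSTOMER_CHANNEL is a guess (0 = first sample of each frame); the real
// mapping is what questions 1-2 are asking about.
const CUSTOMER_CHANNEL = 0;

function extractChannel(interleaved: Buffer, channel: number, numChannels = 2): Buffer {
  const bytesPerSample = 2; // 16-bit signed PCM
  const frameSize = bytesPerSample * numChannels;
  const frameCount = Math.floor(interleaved.length / frameSize);
  const mono = Buffer.alloc(frameCount * bytesPerSample);
  for (let i = 0; i < frameCount; i++) {
    const src = i * frameSize + channel * bytesPerSample;
    interleaved.copy(mono, i * bytesPerSample, src, src + bytesPerSample);
  }
  return mono;
}

// In the bridge above, the binary branch would then become:
//   scribeSocket.send(extractChannel(pcmChunk, CUSTOMER_CHANNEL));
```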

I need clarification on the following:

  1. Does Vapi send custom transcribers a single mixed mono PCM stream, or a stereo stream with the parties on separate channels?
  2. If stereo: what is the exact channel mapping (which channel is user, which is assistant)?
  3. If mono: is there any way to configure Vapi so that the custom transcriber receives only user audio?
  4. Is there any documented or undocumented setting to prevent assistant TTS audio from being forwarded to the transcriber?
  5. Is channel metadata supported or planned for custom transcriber mode?

This information is necessary because ElevenLabs Scribe v2 Realtime cannot perform channel separation or diarization on a mixed stream, so the current architecture forces incorrect transcriptions.
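Concretely, per-channel metadata (question 5) would let the bridge drop or tag transcripts by source instead of echoing the assistant back to Vapi. A sketch of the intended behaviour, where the `channel` field in the response is my assumed shape rather than a confirmed part of Vapi's protocol:

```ts
import { WebSocket } from "ws";

type Channel = "customer" | "assistant";

// With per-channel audio, each channel could get its own Scribe session and
// the resulting text would be tagged, so the assistant's own TTS never comes
// back to Vapi as user speech.
function forwardTranscript(vapiSocket: WebSocket, channel: Channel, text: string): void {
  if (channel === "assistant") {
    return; // drop the assistant's own speech instead of echoing it back as user input
  }
  vapiSocket.send(
    JSON.stringify({ type: "transcriber-response", transcription: text, channel })
  );
}
```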