quickest-silver•9mo ago
Issue with custom transcriber
Hello, I’m trying to build a custom transcriber. I have a server that receives audio data from Vapi, sends that data to the Google Speech-to-Text API, and returns the transcription in the following format:
{
  "type": "transcriber-response",
  "transcription": "transcription here",
  "channel": "customer"
}
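For context, the server is roughly shaped like this. It's a simplified sketch, not the actual code: it assumes Vapi streams raw audio frames over the WebSocket, the Google Speech-to-Text call is stubbed out as transcribe_final(), and Vapi's JSON control messages are ignored.
```python
# Simplified sketch of the custom transcriber server (assumes the `websockets`
# package; recent versions accept a single-argument connection handler).
import asyncio
import json
import websockets

async def transcribe_final(audio: bytes) -> str:
    # Placeholder: the real server forwards the audio to the Google
    # Speech-to-Text API and returns a final transcript.
    return ""

async def handle(ws):
    buffered = bytearray()
    async for message in ws:
        if not isinstance(message, (bytes, bytearray)):
            continue  # JSON control messages from Vapi are ignored in this sketch
        buffered.extend(message)  # raw audio frames streamed by Vapi
        # The real server uses end-of-utterance detection to decide when to
        # flush; this sketch only shows the response format.
        text = await transcribe_final(bytes(buffered))
        if text:
            await ws.send(json.dumps({
                "type": "transcriber-response",
                "transcription": text,
                "channel": "customer",
            }))
            buffered.clear()

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8080):
        await asyncio.Future()  # run forever

asyncio.run(main())
```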
I tested having a conversation with multiple turns. I spoke a few times and the assistant replied back. It works, but it has one issue. I checked the Call Log Details and on the Transcripts tab there’s a single entry that contains all my transcriptions merged in one sentence.
Example:
User: hello how are you what’s the weather in San Francisco
If I use a transcriber available off-the-shelf (like Deepgram), Call Log Details show up correctly.
Example:
User: hello
Assistant: hello
User: how are you
Assistant: I’m just a robot
User: what’s the weather in San Francisco
I believe that whenever I speak, the assistant receives the merged transcription ("hello how are you what’s the weather in San Francisco"), so the issue isn’t only affecting the Call Log Details; I think it’s also affecting the conversation itself.
18 Replies
quickest-silverOP•9mo ago
Can anyone help here? @Vapi @Vapi Team
@Miguel C
Hello! It sounds like you're experiencing an issue where your custom transcriber is concatenating all transcriptions into a single entry, rather than separating them by turn. This could be due to how the transcriptions are being processed and sent back to VAPI.
To resolve this, you'll need to ensure that each transcription event sent to VAPI is distinct and corresponds to a single turn of the conversation. This might involve modifying your server logic to:
1. Detect pauses or breaks in the audio stream to determine when one turn ends and another begins.
2. Send a separate transcription event after each turn, rather than accumulating them and sending them all at once.
If you're already doing this and still facing the issue, it could be related to how the events are being handled or formatted when received by VAPI. You may need to review the integration code to ensure it's compatible with VAPI's expected event format.
If you need further assistance with the specifics of your implementation or if you believe there might be an issue with VAPI itself, please provide more details or reach out to the support team for more in-depth troubleshooting.
quickest-silverOP•9mo ago
I'm sending a separate transcription event after each turn. Here are the three transcriptions that I sent:
{"type": "transcriber-response", "transcription": "hello", "channel": "customer"}
{"type": "transcriber-response", "transcription": "how are you", "channel": "customer"}
{"type": "transcriber-response", "transcription": "what's the weather in San Francisco", "channel": "customer"}
@Vapi
rare-sapphire•9mo ago
@Miguel C you need to send alternative transcriptions; right now you're sending all transcriptions with the customer role only.
FYI: You might need to enable interim transcription.
Do let me know your thoughts on this, and if you want a breakdown, share the call ID.
quickest-silverOP•9mo ago
- What do you mean by sending alternative transcriptions?
- Do I also need to send the “assistant transcriptions”?
- If I send interim transcriptions, how does Vapi know when a transcription is final?
Thanks in advance @Shubham Bajaj
rare-sapphire•9mo ago
@Miguel C you have to send final transcripts to Vapi, differentiate between the user and assistant transcripts, and send each one with the right role assigned.
If you can share the call ID, I can pinpoint exactly what you're doing wrong.
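For example, each turn should then produce two separate events, one per role (assuming the assistant side is sent with "channel": "assistant"):
{"type": "transcriber-response", "transcription": "hello", "channel": "customer"}
{"type": "transcriber-response", "transcription": "hello, how can I help?", "channel": "assistant"}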
quickest-silverOP•9mo ago
understood! I was able to make the custom transcriber work by doing that.
Now, Nikhil told me about the modelOutputInMessagesEnabled flag, which I would like to try out, but I believe it's not working.
I shared the issue here: https://discord.com/channels/1211482211119796234/1228667357849849856/1314677498235195435
call_id: e50d7ff7-4058-4e36-b5e1-d1695e8dc03f
rare-sapphire•9mo ago
@Miguel C Looking into it, allow me some time.
quickest-silverOP•9mo ago
thanks 🙂
rare-sapphire•9mo ago
@Miguel C what's currently happening is that you're passing only the customer transcript; you need to send both transcripts for each turn, following the conversation flow.
modelOutputInMessagesEnabled
is used to set what goes into the conversation messages: the LLM output or the TTS transcription.
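If you're setting it through the API, it's a boolean on the assistant config, roughly like this (treat this as a sketch and check the API reference for the exact field placement):
{
  "name": "my-assistant",
  "modelOutputInMessagesEnabled": true
}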
Check the screenshot for the reference.
Let me know your comments on this!!
quickest-silverOP•9mo ago
I thought that by enabling
modelOutputInMessagesEnabled
the LLM output would be appended directly to the conversation as the assistant messages, and I would not need to transcribe the assistant audio.
Even if I enable modelOutputInMessagesEnabled, do I still need to transcribe the assistant audio and send it to Vapi?
rare-sapphire•9mo ago
modelOutputInMessagesEnabled
adds model output to the message history, but we only support it for 11labs right now since it needs the TTS to provide exact timing.
quickest-silverOP•9mo ago
understood, thanks for helping out!
another question 😅 I'm using the Google Speech-to-Text API. This API sends me speech activity events, such as an event that indicates when the user has stopped talking. Can I send this event to Vapi?
As far as I know, Vapi's documentation only mentions that I can send an event with the transcription. But I would like to know if I can send an event to indicate that the user has stopped talking. The goal is to reduce latency.
rare-sapphire•9mo ago
When you send transcriptions back to Vapi using custom STT, you control how fast you send them back. Do let me know if anything else is required.
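As a rough sketch with Google's streaming responses, you can forward a result as soon as it becomes final instead of batching turns (send_to_vapi() here is a hypothetical helper that writes a JSON message back on the Vapi socket):
```python
# Sketch: forward Google streaming results to Vapi as soon as they are final.
# `responses` is the iterator returned by SpeechClient.streaming_recognize();
# send_to_vapi() is a stand-in for whatever writes JSON back to Vapi.
def handle_responses(responses, send_to_vapi):
    for response in responses:
        for result in response.results:
            if result.is_final:
                # Forward the final transcript immediately to keep latency low.
                send_to_vapi({
                    "type": "transcriber-response",
                    "transcription": result.alternatives[0].transcript,
                    "channel": "customer",
                })
```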
evident-indigo•4mo ago
@Shubham Bajaj is it on the roadmap to add support for
modelOutputInMessagesEnabled
for other TTS providers like Deepgram?
It is already out. You can now enable it.
evident-indigo•4mo ago
I see - I'm using Vapi TTS and it doesn't seem to work. Transcribed messages are still being added to the transcript / messages. Here is the call ID: 3c73fe72-754d-4792-828e-aff641ba3ad9
Would be great if you could help.
Hey Raj,
modelOutputInMessagesEnabled: true
only works for 11labs and for assistants. So if the user is using a squad or any other voice, this is ignored.