eastern-cyan · 2w ago

Custom LLM getting partial messages, with significantly delayed full messages

Hey all. We're having an issue with our chat/completions API (custom LLM). We've been relying on the messages sent as part of the chat/completions payload instead of managing state ourselves, because it seems that the messages get rewritten.

We're hitting an issue when the user's response is very short, like a one-word "yes". When the response is very short, the assistant response in the message payload is somehow cut off. For example, we'll get a call with:

> assistant: "Let's begin our interview. Are you qualified to work?"
> user: "Yes"

The VAPI messages payload would cut off the second part, and just send us "Let's begin our interview." Then, weirdly, some 5 seconds later, we'll get another chat/completions payload with the full "Let's begin our interview. Are you qualified to work?" from the assistant. This consistently repros when the user response is short. When the user response is longer, we don't have this issue.

We're not keeping state on our end because it seems we get opportunistic completion endpoint calls, and VAPI discards opportunistic calls that don't actually get sent to the user. So we have to rely on the VAPI messages as the canonical messages - but in this case, if VAPI is sending truncated messages, we have a problem. It's kind of a weird scenario and I'm hoping someone can help. Thanks!
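One stopgap for the scenario described above (an earlier payload containing a truncated assistant message, followed by a later payload with the full text) is to reconcile the two lists on the receiving end and keep whichever version of each assistant message is more complete. This is a hypothetical sketch, not part of the VAPI API; `reconcile` and the prefix-matching heuristic are assumptions for illustration:

```python
def reconcile(previous: list[dict], incoming: list[dict]) -> list[dict]:
    """Merge two chat/completions message lists. Where an assistant
    message in one list is a prefix of the same-position message in
    the other (i.e. it was cut off mid-utterance), keep the longer one."""
    merged = []
    for i, msg in enumerate(incoming):
        if (i < len(previous)
                and previous[i]["role"] == msg["role"] == "assistant"):
            a, b = previous[i]["content"], msg["content"]
            # Only merge when one version is clearly a truncation of the other.
            if a.startswith(b) or b.startswith(a):
                msg = {**msg, "content": max(a, b, key=len)}
        merged.append(msg)
    return merged

truncated = [
    {"role": "assistant", "content": "Let's begin our interview."},
    {"role": "user", "content": "Yes"},
]
full = [
    {"role": "assistant",
     "content": "Let's begin our interview. Are you qualified to work?"},
    {"role": "user", "content": "Yes"},
]
print(reconcile(truncated, full)[0]["content"])
# → Let's begin our interview. Are you qualified to work?
```

This doesn't fix the upstream truncation, but it keeps a locally cached transcript from regressing when the delayed full payload arrives.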
1 Reply
Shubham Bajaj · 2w ago
Hi Phil,

It sounds like you're encountering issues with your Custom LLM and the chat/completions API where messages are truncated or delayed, especially following short user responses. This can arise from how short responses are handled within the LLM's logic or external integrations. Here are a few steps to help address this:

1. **Review message handling:** Ensure that your endpoint properly processes short responses. Add logic to handle messages with minimal content, which might be triggering incomplete responses or delays.
2. **Use SSE for response streaming:** If not already implemented, consider using Server-Sent Events (SSE) to stream responses. This lets you receive message parts as they are generated, which can mitigate delays (reference).
3. **Test with cURL:** You can exercise the short-response scenario directly with cURL to confirm the system handles it correctly (testing examples).

Debugging with these steps should help identify whether the problem stems from message processing or whether other API configuration adjustments are needed.