# How to Enable Real-time TTS Chunk Streaming to the Caller

## Issue Summary

I'm experiencing significant delays in TTS output delivery to callers. My AI agent takes approximately 7 seconds to start responding, of which 3-4 seconds are spent on speech generation, even though my custom TTS server is configured for real-time streaming.

## Current Behavior

  • Total response delay: ~7 seconds
  • TTS generation time: 3-4 seconds of the total delay
  • Issue: Vapi appears to wait for complete TTS generation before streaming audio to the caller

## Expected Behavior

Real-time streaming of TTS chunks as they're generated, without waiting for complete speech synthesis.

## Technical Details

### My TTS Server Configuration

  • Chunk duration: 0.2 seconds per chunk
  • Chunk interval: Generated every 0.061 seconds
  • First chunk delay: 1 second
  • Output format: Streaming chunks
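For reference, the chunking behavior described above can be expressed as a small generator. The 16 kHz sample rate, 16-bit mono PCM format, and function name here are illustrative assumptions, not values taken from Vapi's requirements:

```python
def chunk_pcm(audio: bytes, sample_rate: int = 16000,
              chunk_seconds: float = 0.2) -> list[bytes]:
    """Split raw 16-bit mono PCM audio into fixed-duration chunks.

    Mirrors the configuration above: 0.2 s of audio per chunk.
    The 16 kHz / 16-bit mono format is an assumption for illustration.
    """
    bytes_per_chunk = int(sample_rate * chunk_seconds) * 2  # 2 bytes per sample
    return [audio[i:i + bytes_per_chunk]
            for i in range(0, len(audio), bytes_per_chunk)]

# One second of silence at 16 kHz, 16-bit mono -> 32000 bytes
chunks = chunk_pcm(b"\x00" * 32000)
print(len(chunks))     # 5 chunks of 0.2 s each
print(len(chunks[0]))  # 6400 bytes per chunk
```

With these numbers, each 0.2 s chunk is 6400 bytes, so the server generating a chunk every 0.061 s produces audio roughly 3x faster than real time.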

### Problem Description

Although my TTS server outputs audio chunks in real time, Vapi appears to collect all of the chunks before streaming them to the caller, adding unnecessary latency.
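One way to confirm where the buffering happens is to time chunk arrival on the consuming side. This is a generic diagnostic sketch (not a Vapi API); the function names and the simulated producer are hypothetical:

```python
import time
from typing import Iterable, Iterator

def time_chunks(stream: Iterable[bytes]) -> list[tuple[float, int]]:
    """Return (seconds_since_start, chunk_size) for each chunk consumed.

    If the producer truly streams, the first entry appears almost
    immediately; if something upstream buffers, every entry clusters
    near the total generation time.
    """
    start = time.monotonic()
    return [(time.monotonic() - start, len(chunk)) for chunk in stream]

def fake_tts(n_chunks: int = 5, interval: float = 0.05) -> Iterator[bytes]:
    """Simulated TTS server emitting one chunk every `interval` seconds."""
    for _ in range(n_chunks):
        time.sleep(interval)
        yield b"\x00" * 6400

timings = time_chunks(fake_tts())
# With real streaming, the first chunk arrives after ~one interval,
# not after the full generation time.
```

Pointing a timer like this at the TTS server directly, and then at the audio Vapi delivers, would show on which hop the chunks are being held back.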

## Reference Data

Conversation ID: 04e2c190-981b-4da1-a5bf-c7e8e63300a8

Please review this conversation to see the timing issues in action.

## Questions

  1. How can I configure Vapi to stream TTS chunks immediately as they're received from my custom TTS server?
  2. Is there a configuration setting or parameter that controls this buffering behavior?
  3. Are there any specific requirements for the TTS streaming protocol that I might be missing?
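On question 3: one common cause of this symptom, independent of any specific Vapi requirement, is the HTTP response itself defeating incremental delivery — a fixed `Content-Length` means the whole body was assembled before sending, and some reverse proxies buffer streamed responses unless told not to. A small checklist helper, with an assumed function name and the usual suspect headers:

```python
def streaming_red_flags(headers: dict[str, str]) -> list[str]:
    """Flag response headers that commonly prevent incremental delivery.

    This is a generic HTTP check, not a documented Vapi requirement:
    a fixed Content-Length implies the full body was known up front,
    and nginx-style proxies buffer unless X-Accel-Buffering is "no".
    """
    h = {k.lower(): v.lower() for k, v in headers.items()}
    flags = []
    if "content-length" in h:
        flags.append("Content-Length set: body was assembled before sending")
    if h.get("transfer-encoding") != "chunked":
        flags.append("Transfer-Encoding is not 'chunked'")
    if h.get("x-accel-buffering", "no") != "no":
        flags.append("X-Accel-Buffering enabled: proxy may buffer the stream")
    return flags

print(streaming_red_flags({"Content-Type": "audio/pcm",
                           "Transfer-Encoding": "chunked"}))  # []
```

Checking the TTS server's actual response headers against a list like this is a cheap first step before digging into Vapi-side configuration.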
Any guidance on optimizing this streaming configuration would be greatly appreciated!