How to Enable Real-time TTS Chunk Streaming to Caller
Issue Summary
I'm experiencing significant delays in TTS output delivery to callers. My AI agent takes approximately 7 seconds to start responding, with 3-4 of those seconds spent on speech generation, despite my custom TTS server being configured for real-time streaming.
Current Behavior
- Total response delay: ~7 seconds
- TTS generation time: 3-4 seconds of the total delay
- Issue: Vapi appears to wait for complete TTS generation before streaming audio to the caller
Expected Behavior
Real-time streaming of TTS chunks to the caller as they're generated, without waiting for complete speech synthesis.
Technical Details
My TTS Server Configuration
- Chunk duration: 0.2 seconds per chunk
- Chunk interval: Generated every 0.061 seconds
- First chunk delay: 1 second
- Output format: Streaming chunks
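For reference, the chunking my server does can be sketched as below. This is a simplified illustration, not my actual server code: the 16 kHz, 16-bit mono PCM format is an assumption (the real sample rate comes from Vapi's voice request), and `pcm_chunks` is a hypothetical helper name.

```python
import math

SAMPLE_RATE = 16000      # assumed; the actual rate is specified in Vapi's request
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM
CHUNK_SECONDS = 0.2      # chunk duration from the configuration above
CHUNK_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)

def pcm_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield fixed-size PCM chunks as soon as each one is ready.

    In the real server, each yielded chunk is written to the HTTP
    response and flushed immediately (chunked transfer encoding),
    not accumulated into a complete buffer first.
    """
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]

# Example: 1 second of audio (silence here) -> five 0.2 s chunks
audio = b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE)
chunks = list(pcm_chunks(audio))
print(len(chunks), len(chunks[0]))
```

The point is that each chunk leaves the server roughly every 0.061 seconds; nothing on my side waits for the full utterance.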
Despite my TTS server outputting audio chunks in real-time, Vapi seems to collect all chunks before streaming them to the caller, resulting in unnecessary latency.
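To rule out server-side buffering, I timestamped chunk arrivals with a plain HTTP client. The self-contained sketch below (stdlib only; the handler stands in for my TTS server and the timings are illustrative) shows that chunks flushed per-write arrive spread out over time rather than all at once, which is what I observe when I test my server directly:

```python
import http.client
import http.server
import threading
import time

CHUNK = b"\x00" * 6400   # one 0.2 s chunk of 16 kHz 16-bit mono PCM (assumed format)
N_CHUNKS = 5
PAUSE = 0.05             # simulated generation interval between chunks

class StreamingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # No Content-Length: write each chunk and flush it immediately.
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        for _ in range(N_CHUNKS):
            self.wfile.write(CHUNK)
            self.wfile.flush()   # push the chunk onto the socket now
            time.sleep(PAUSE)    # simulate real-time TTS generation

    def log_message(self, *args):  # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), StreamingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/")
resp = conn.getresponse()

arrivals = []
while resp.read(len(CHUNK)):
    arrivals.append(time.monotonic())  # timestamp each chunk as it lands

server.shutdown()
print(f"{len(arrivals)} reads spread over {arrivals[-1] - arrivals[0]:.2f}s")
```

Against my actual server, the arrival timestamps are spaced out as expected, so the chunks are available incrementally; the buffering appears to happen after Vapi receives them.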
Reference Data
Conversation ID: 04e2c190-981b-4da1-a5bf-c7e8e63300a8
Please review this conversation to see the timing issues in action.
Questions
- How can I configure Vapi to stream TTS chunks immediately as they're received from my custom TTS server?
- Is there a configuration setting or parameter that controls this buffering behavior?
- Are there any specific requirements for the TTS streaming protocol that I might be missing?
- For context: I'm using a custom TTS server, configured as described in https://docs.vapi.ai/customization/custom-voices/custom-tts