hurt-tomato
VAPI · 5mo ago

Help optimizing latency: what affects "endpointing" latency in particular?

We are trying to optimize the latency (and perceived latency) of our Vapi assistants. We use a GPT-4o mini cluster + Deepgram Nova-3 + ElevenLabs.
Our system prompts are pretty long (~4,000 tokens).

My questions:
What affects endpointing latency? It ranges from 200ms to 2000ms.
Are server webhooks affecting it?
I'm in Europe right now and keep getting westus servers. Is that because the GPUs there are fast, but the network round-trip then adds to "endpointing"?
Does system prompt length noticeably affect it? Are you using prompt caching?

Would love any recommendations or hints! Consistently sub-1000ms turns would be great. (Most of our users are in the US, FYI.)
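
In case it's useful context, here's the kind of endpointing tweak we're considering. This is only a sketch based on my reading of the startSpeakingPlan section of the Vapi API docs; treat the exact field names and values as assumptions on my part:

```ts
// Sketch: tightening endpointing via startSpeakingPlan on an existing assistant.
// Field names are my reading of the Vapi docs -- assumptions, not verified.
// Run as an ES module (top-level await) with VAPI_API_KEY set.
const assistantId = "9b948733-1307-4f3f-a49c-92ec47af9cc2";

const res = await fetch(`https://api.vapi.ai/assistant/${assistantId}`, {
  method: "PATCH",
  headers: {
    Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    startSpeakingPlan: {
      waitSeconds: 0.4, // base pause before the assistant starts speaking
      transcriptionEndpointingPlan: {
        onPunctuationSeconds: 0.1,   // end the turn quickly after terminal punctuation
        onNoPunctuationSeconds: 1.0, // wait longer when the transcript trails off
        onNumberSeconds: 0.5,        // numbers often continue ("my number is...")
      },
    },
  }),
});
console.log(res.status, await res.json());
```

Happy to hear whether those knobs are even the right place to look.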


Attaching some turn-taking logs:
Turn latency: 3428ms (transcriber: 750ms, endpointing: 1742ms, kb: n/a, model: 390ms, voice: 526ms)
Turn latency: 3953ms (transcriber: 709ms, endpointing: 2456ms, kb: n/a, model: 401ms, voice: 375ms)
Turn latency: 2462ms (transcriber: 73ms, endpointing: 1571ms, kb: n/a, model: 245ms, voice: 538ms)
Turn latency: 1833ms (transcriber: 298ms, endpointing: 510ms, kb: n/a, model: 392ms, voice: 613ms)
Turn latency: 1589ms (transcriber: 564ms, endpointing: 258ms, kb: n/a, model: 424ms, voice: 297ms)
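
For anyone skimming, a quick way to aggregate those lines (a throwaway sketch; it just regexes the log format as pasted):

```ts
// Quick-and-dirty aggregation of the turn logs above, to see which stage dominates.
const logs: string[] = [
  "Turn latency: 3428ms (transcriber: 750ms, endpointing: 1742ms, kb: n/a, model: 390ms, voice: 526ms)",
  "Turn latency: 3953ms (transcriber: 709ms, endpointing: 2456ms, kb: n/a, model: 401ms, voice: 375ms)",
  "Turn latency: 2462ms (transcriber: 73ms, endpointing: 1571ms, kb: n/a, model: 245ms, voice: 538ms)",
  "Turn latency: 1833ms (transcriber: 298ms, endpointing: 510ms, kb: n/a, model: 392ms, voice: 613ms)",
  "Turn latency: 1589ms (transcriber: 564ms, endpointing: 258ms, kb: n/a, model: 424ms, voice: 297ms)",
];

for (const stage of ["transcriber", "endpointing", "model", "voice"]) {
  // Pull "<stage>: <n>ms" out of each line; "kb: n/a" never matches and is skipped.
  const values = logs
    .map((line) => line.match(new RegExp(`${stage}: (\\d+)ms`)))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => Number(m[1]));
  const avg = values.reduce((a, b) => a + b, 0) / values.length;
  console.log(`${stage}: avg ${avg.toFixed(0)}ms, max ${Math.max(...values)}ms`);
}
```

Endpointing averages ~1307ms here (max 2456ms), roughly as much as the other three stages combined, and on the worst turns it alone exceeds them combined. That's why I'm focused on it.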

Example assistant ID: 9b948733-1307-4f3f-a49c-92ec47af9cc2