rival-black•2mo ago
Why do fast GPT models show high latency in Vapi assistant AI model listing?
For example, "GPT o4 Mini Cluster" shows 1640ms latency, but it's quite fast according to OpenAI.
I wanted to understand how you are using these models such that their latency is this high.
And, if there's anything we can do on our end to lower the latencies (by providing our own clusters, API keys, etc.).
12 Replies
To reduce latency, you can bring your own API key and set up a custom LLM endpoint for Vapi. Here’s a quick example using Python and Flask:
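(A minimal sketch, assuming the openai v1 Python SDK with OPENAI_API_KEY set in the environment; the route path, port, and fallback model name below are placeholders to adapt.)

```python
# Minimal sketch of an OpenAI-compatible chat completions endpoint for Vapi.
import os
from flask import Flask, Response, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.route("/chat/completions", methods=["POST"])
def chat_completions():
    body = request.get_json()
    # Vapi sends an OpenAI-style chat completion request; forward it to
    # OpenAI and stream the chunks back as server-sent events.
    stream = client.chat.completions.create(
        model=body.get("model", "gpt-4o-mini"),  # fallback model is a placeholder
        messages=body["messages"],
        stream=True,
    )

    def generate():
        for chunk in stream:
            yield f"data: {chunk.model_dump_json()}\n\n"
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(port=5000)
```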
Expose this server with ngrok and add the public URL in Vapi’s dashboard under Custom LLM. This setup lets you control the connection and may lower latency compared to shared clusters.
See the full guide. Source: Connecting Your Custom LLM to Vapi: A Comprehensive Guide
⤵️ Mention @Vapi Support Bot for follow-ups 🐥
rival-blackOP•2mo ago
@Vapi Support Bot If I provide my own OpenAI API keys, then on selecting "GPT o4 Mini Cluster", will it send requests via my OpenAI key? If so, why does it still show the same latency in the listing?
Thank you for your question! To clarify, are you adding your OpenAI API key under "Provider Keys" and selecting the default "GPT o4 Mini Cluster" from the model list, or are you configuring a "Custom LLM" endpoint as described in the documentation? This will help determine how requests are routed and why the latency display may not update.
Sources:
- Connecting Your Custom LLM to Vapi: A Comprehensive Guide
- Provider Keys
rival-blackOP•2mo ago
@Vapi Support Bot I'm just adding the OpenAI API key under "Provider Keys"
When you add your OpenAI API key under Provider Keys and select "GPT o4 Mini Cluster," Vapi will use your key for requests, and you will be billed directly by OpenAI. However, the latency shown in the model listing is based on Vapi's shared cluster metrics and does not update dynamically to reflect your individual key or connection—it remains the same for all users. If you want more control over latency, consider setting up a Custom LLM endpoint as described in the documentation.
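If you do switch to a Custom LLM endpoint, you can point an assistant at it via the Vapi API as well as the dashboard. A rough sketch, with field names following the Custom LLM guide (verify against the current API reference; the assistant ID, URL, and model below are placeholders):

```python
# Rough sketch: pointing an existing Vapi assistant at your own endpoint
# via the Vapi API. Field names follow the Custom LLM guide; verify them
# against the current API reference before relying on this.
import os
import requests

VAPI_API_KEY = os.environ["VAPI_API_KEY"]
ASSISTANT_ID = "your-assistant-id"  # placeholder

resp = requests.patch(
    f"https://api.vapi.ai/assistant/{ASSISTANT_ID}",
    headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
    json={
        "model": {
            "provider": "custom-llm",
            "url": "https://your-subdomain.ngrok.app",  # your exposed server
            "model": "gpt-4o-mini",  # placeholder
        }
    },
)
resp.raise_for_status()
```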
See the Provider Keys documentation. Source: Provider Keys
Message marked as helpful by @LordGrim! 🎉
rival-blackOP•4w ago
@Vapi Support Bot is there a way to send the "metadata" fields of the assistant to OpenAI (via the API keys provided)?
Could you provide some more detail about passing the metadata over?
rival-blackOP•4w ago
I want to pass the metadata (e.g., assistantId, callId) from Vapi to OpenAI (via my own API key), so that I can filter on these fields and log the pricing per call in the OpenAI dashboard.
If you are using your own API key, the request logs on the OpenAI side should already show the cost of each call. Beyond that, unless Vapi exposes a webhook or setting that forwards those fields, there may not be a way to send that metadata to OpenAI through Provider Keys alone.
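One possible workaround, if you are open to routing through a Custom LLM endpoint instead of Provider Keys: OpenAI's Chat Completions API accepts store and metadata parameters, and stored completions can be filtered by metadata in the OpenAI dashboard logs. A sketch under those assumptions (the shape of Vapi's request payload, in particular the call object and its field names, is a guess; inspect what your server actually receives):

```python
# Sketch: a Custom LLM server attaching Vapi identifiers to each OpenAI
# request via stored-completions metadata, so calls can be filtered in the
# OpenAI dashboard logs. The "call" object and its field names are
# assumptions; inspect the request body your server actually receives.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def forward_with_metadata(vapi_body: dict):
    call = vapi_body.get("call", {})  # assumed payload shape
    return client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=vapi_body["messages"],
        store=True,  # metadata only shows in dashboard logs for stored completions
        metadata={   # string keys/values, up to 16 pairs
            "callId": str(call.get("id", "")),
            "assistantId": str(call.get("assistantId", "")),
        },
    )
```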