Evals not judging accurately - are you caching?
I'm seeing flakiness in the evals feature when following a test-driven development workflow. I suspect you're caching judge results (or something similar) and that's causing the issue.
Here is my flow:
- Create a transient eval with a transient assistant
- The eval fails. I verified the assistant output warrants a failure.
- Update the assistant prompt to fix the unwanted behavior
- The eval still fails, even though the assistant output is now valid! (A minimal repro sketch follows this list.)
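
For context, here is a minimal sketch of the repro flow. The endpoint path, payload shape, and field names are my guesses for illustration only, not the actual API:

```python
import requests

BASE_URL = "https://api.example.com"         # hypothetical host, not the real one
HEADERS = {"Authorization": "Bearer <KEY>"}  # placeholder credentials

INITIAL_PROMPT = "You are an onboarding assistant."  # stand-in for my real prompts
UPDATED_PROMPT = INITIAL_PROMPT + " If the user opts out, do not offer or imply any further contact."

def run_transient_eval(system_prompt: str) -> dict:
    # Hypothetical endpoint and payload: a transient assistant plus a
    # transient eval definition, created and run in a single request.
    payload = {
        "assistant": {"systemPrompt": system_prompt},
        "eval": {"criteria": "Response must not imply any future communication."},
    }
    resp = requests.post(f"{BASE_URL}/eval/run", json=payload, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json()

first = run_transient_eval(INITIAL_PROMPT)   # Run 1: fails, and the failure is correct
second = run_transient_eval(UPDATED_PROMPT)  # Run 2: fails with Run 1's exact failure message
```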
Run 1: With Initial Prompt (Eval Run ID: f6e8e083-4ccf-4db4-8b95-1563888d38a7 2025-11-21T16:32:21.759Z)
- Assistant output: "I understand your preference, Pat. However, I will still send a follow-up email with an optional scheduling link for our onboarding team, just in case you change your mind or need assistance in the future. Thank you for your time, and have a great day!"
- Evaluation failure message: "The response implies future communication by stating 'I will send a follow-up email...'."
Run 2: With Updated Prompt
- Assistant output: "I understand and respect your decision, Pat. I've noted your preference and won't contact you further. Thank you for your time, and have a great day!"
- Evaluation failure message: "The response implies future communication by stating 'I will send a follow-up email...'."
- It pulled a quote that doesn't exist in this output. The failure message is identical, character for character, to Run 1's, which is why I suspect the judgment is cached. (See the check below.)
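
To make this concrete, here is a quick check with the two outputs copied verbatim from the runs above, showing that the judge's quoted phrase is not a substring of either one:

```python
run1 = ("I understand your preference, Pat. However, I will still send a follow-up email "
        "with an optional scheduling link for our onboarding team, just in case you change "
        "your mind or need assistance in the future. Thank you for your time, and have a "
        "great day!")
run2 = ("I understand and respect your decision, Pat. I've noted your preference and "
        "won't contact you further. Thank you for your time, and have a great day!")
quoted = "I will send a follow-up email"  # phrase quoted in both failure messages

print(quoted in run1)  # False -- Run 1 actually says "I will still send a follow-up email"
print(quoted in run2)  # False -- Run 2 never mentions a follow-up email at all
```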