technical-ivory
VAPI • 5w ago

Evals not judging accurately - are you caching?

I'm seeing flakiness in the evals feature when following a test-driven development workflow. I suspect caching on your side is causing the issue.

Here is my flow:

  1. Create transient eval with transient assistant
  2. Eval fails. I verified the assistant output warrants a failure. ✅
  3. Update assistant prompt to fix unwanted behavior
  4. Eval still fails, even though the assistant output is now valid! 🔴
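To rule out anything on my side between steps 3 and 4, each run can be fingerprinted before submission: hash the transient assistant prompt together with the eval inputs and compare fingerprints across runs. A minimal sketch (the helper name, hashing scheme, and placeholder inputs are mine, not anything Vapi provides):

```python
import hashlib
import json

def run_fingerprint(assistant_prompt: str, eval_inputs: dict) -> str:
    # Hash exactly what gets submitted so two runs can be compared.
    # (Hypothetical debugging helper, not part of the Vapi API.)
    payload = json.dumps(
        {"prompt": assistant_prompt, "inputs": eval_inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Same eval inputs in both runs; only the assistant prompt changes.
inputs = {"scenario": "caller declines any follow-up contact"}  # placeholder
fp_run1 = run_fingerprint("initial prompt", inputs)
fp_run2 = run_fingerprint("updated prompt: never promise follow-up emails", inputs)

# Different fingerprints prove the second submission really changed,
# so a byte-identical failure message points at a stale evaluation.
print(fp_run1 != fp_run2)  # → True
```

If the fingerprints differ but the failure message comes back byte-identical, the judge did not evaluate the new submission.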
You may think my evaluation criteria are bad or my prompt isn't working as expected, but the evidence below should confirm that's not the case. I don't change the eval inputs at all between runs, only the transient assistant prompt, and in any case that should be irrelevant given the evidence below.

Run 1: With Initial Prompt (Eval Run ID: f6e8e083-4ccf-4db4-8b95-1563888d38a7, 2025-11-21T16:32:21.759Z)
  • Assistant output: "I understand your preference, Pat. However, I will still send a follow-up email with an optional scheduling link for our onboarding team, just in case you change your mind or need assistance in the future. Thank you for your time, and have a great day!"
  • Evaluation failure message: "The response implies future communication by stating 'I will send a follow-up email...'."
Run 2: Updated Prompt, same evaluation prompt (Eval Run ID: 5c1e1f2d-f496-457a-b90b-d8bb6b636ac0, 2025-11-21T16:33:05.759Z)
  • Assistant output: "I understand and respect your decision, Pat. I've noted your preference and won't contact you further. Thank you for your time, and have a great day!"
  • Evaluation failure message: "The response implies future communication by stating 'I will send a follow-up email...'." 🤔
    • It pulled a quote that doesn't exist anywhere in this output.
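The staleness can be demonstrated mechanically: check whether the excerpt quoted in a failure message actually appears in the assistant output it supposedly judged. A minimal, self-contained sketch (the helper name is mine, not part of any Vapi SDK), using run 2's data from above:

```python
def quote_is_grounded(assistant_output: str, failure_message: str) -> bool:
    """Return True if the single-quoted excerpt in the failure
    message is a substring of the assistant output it judges."""
    start = failure_message.find("'")
    end = failure_message.rfind("'")
    if start == -1 or end <= start:
        return True  # no quoted excerpt to verify
    quoted = failure_message[start + 1 : end]
    return quoted in assistant_output

run2_output = ("I understand and respect your decision, Pat. I've noted "
               "your preference and won't contact you further. Thank you "
               "for your time, and have a great day!")
run2_failure = ("The response implies future communication by stating "
                "'I will send a follow-up email...'.")

print(quote_is_grounded(run2_output, run2_failure))  # → False: stale quote
```

A `False` here means the judge's quoted evidence cannot have come from this run's output.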
I've tried different models, different temperatures, etc. This indicates to me that run 2 is not evaluating the proper message. There's literally no trace of the text it quotes in run 2's output, so it must be referencing some other message or serving the cached run 1 failure for some reason.