Evals not judging accurately - are you caching?
I'm seeing flakiness in the evals feature when following a test-driven development workflow. I suspect you're caching judge results (or something similar) and that's causing the issue.
Here is my flow:
- Create a transient eval with a transient assistant
- The eval fails. I verified the assistant output warrants a failure.
- Update the assistant prompt to fix the unwanted behavior
- The eval still fails, even though the assistant output is now valid! (A minimal repro sketch follows this list.)
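
For context, here is a minimal sketch of the repro flow. The endpoint path, payload shape, and field names are my guesses for illustration only, not the actual API:

```python
import requests

BASE_URL = "https://api.example.com"         # hypothetical host, not the real one
HEADERS = {"Authorization": "Bearer <KEY>"}  # placeholder credentials

INITIAL_PROMPT = "You are an onboarding assistant."  # stand-in for my real prompts
UPDATED_PROMPT = INITIAL_PROMPT + " If the user opts out, do not offer or imply any further contact."

def run_transient_eval(system_prompt: str) -> dict:
    # Hypothetical endpoint and payload: a transient assistant plus a
    # transient eval definition, created and run in a single request.
    payload = {
        "assistant": {"systemPrompt": system_prompt},
        "eval": {"criteria": "Response must not imply any future communication."},
    }
    resp = requests.post(f"{BASE_URL}/eval/run", json=payload, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json()

first = run_transient_eval(INITIAL_PROMPT)   # Run 1: fails, and the failure is correct
second = run_transient_eval(UPDATED_PROMPT)  # Run 2: fails with Run 1's exact failure message
```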
Run 1: With Initial Prompt (Eval Run ID: f6e8e083-4ccf-4db4-8b95-1563888d38a7 2025-11-21T16:32:21.759Z)
- Assistant output: "I understand your preference, Pat. However, I will still send a follow-up email with an optional scheduling link for our onboarding team, just in case you change your mind or need assistance in the future. Thank you for your time, and have a great day!"
- Evaluation failure message: "The response implies future communication by stating 'I will send a follow-up email...'."
Run 2: With Updated Prompt
- Assistant output: "I understand and respect your decision, Pat. I've noted your preference and won't contact you further. Thank you for your time, and have a great day!"
- Evaluation failure message: "The response implies future communication by stating 'I will send a follow-up email...'."
- It pulled a quote that doesn't exist in this output. The failure message is identical, character for character, to Run 1's, which is why I suspect the judgment is cached. (See the check below.)
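
To make this concrete, here is a quick check with the two outputs copied verbatim from the runs above, showing that the judge's quoted phrase is not a substring of either one:

```python
run1 = ("I understand your preference, Pat. However, I will still send a follow-up email "
        "with an optional scheduling link for our onboarding team, just in case you change "
        "your mind or need assistance in the future. Thank you for your time, and have a "
        "great day!")
run2 = ("I understand and respect your decision, Pat. I've noted your preference and "
        "won't contact you further. Thank you for your time, and have a great day!")
quoted = "I will send a follow-up email"  # phrase quoted in both failure messages

print(quoted in run1)  # False -- Run 1 actually says "I will still send a follow-up email"
print(quoted in run2)  # False -- Run 2 never mentions a follow-up email at all
```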