What you'll learn: A layered framework for testing voice agents before production deployment, organized as a test pyramid.
Key takeaways:
- The test pyramid has three layers. Unit tests validate next-turn behavior. Boundary tests validate tool contracts and handoffs. End-to-end tests validate complete conversations.
- Tool contract tests are fast and deterministic. Validate schemas, timeouts, retries, idempotency, and error handling. Run these on every change.
- Conversation regression tests use scripted transcripts. 20-50 scenarios per workflow with expected tool calls and outcomes. Run before every deployment.
- Most teams only test the top layer. When something breaks, they know it's broken but not where or why. The pyramid localizes failures to specific components.
- Test environment must support mock backends, deterministic tool responses, and reproducible state. Flaky tests become ignored tests.
The prompt change seemed minor. The screening agent had been confirming candidate availability with a direct question at the end of the call. The product team wanted it to feel more conversational, so they adjusted the prompt to acknowledge availability if the candidate had already mentioned it.
The change passed all twenty end-to-end test scenarios. Happy paths, edge cases, and a few adversarial inputs. The test suite ran green. The change was deployed.
Three days later, a QA reviewer named Tomás noticed a pattern in transcripts. Candidates who volunteered their availability early in the call weren't getting the confirmation step. The agent skipped it entirely. It had heard them say "I'm available weekday mornings" and decided that counted as confirmation. But the tool that recorded availability wasn't being called. Thousands of screening decisions were missing availability data.
The issue wasn't in the conversation flow. It was in how the prompt interacted with a specific tool contract. A unit test that validated "given this conversation state, does the agent call the correct tool?" would have caught it in seconds. But the staffing marketplace didn't have unit tests for its voice agents. They only had end-to-end tests.
Tomás spent the next month building what they should have had from the start.
The test pyramid
Software teams learned decades ago that testing has layers. Unit tests at the bottom, fast and numerous, catching small failures quickly. Integration tests in the middle, validating how components work together. End-to-end tests at the top, slow and expensive, catching interaction effects that the other layers miss.
Voice agent testing needs the same structure, but most teams only build the top layer. They run full conversation tests against realistic scenarios and call it done. When something breaks, they know it's broken but not where or why.
Tomás built a proper pyramid.
Unit tests validated next-turn behavior. Given a specific conversation history and user input, does the agent produce the correct response? Does it call the correct tool with the correct parameters? These tests were fast, deterministic, and ran on every prompt change.
Boundary tests validated tool contracts, handoffs, and integrations. Does the tool receive the right inputs? Does it handle timeouts and errors correctly? Do multi-agent handoffs preserve context? These tests run on every tool or integration change.
End-to-end tests validated full conversations from start to finish. Twenty to fifty scripted scenarios per workflow, covering happy paths, edge cases, and adversarial inputs. These caught interaction effects the other layers missed.
The pyramid shape mattered. Many unit tests, fewer boundary tests, even fewer end-to-end tests. Fast tests at the bottom caught most regressions quickly. Slow tests at the top caught the subtle issues that only emerged across full conversations.
Unit tests
Tomás started with the failure that had triggered everything. The availability confirmation step.
He wrote a unit test. Given a conversation in which the candidate said, "I'm available weekday mornings" in response to a previous question, what happens when the agent reaches the availability confirmation phase? Does it call the record_availability tool? With what parameters?
The test took 200 milliseconds to run. It failed immediately on the problematic prompt. The agent wasn't calling the tool because the prompt told it to skip confirmation if availability had been mentioned. The prompt didn't distinguish between "mentioned" and "confirmed and recorded."
Tomás fixed the prompt to require explicit tool confirmation regardless of what the candidate had said. The unit test passed. He added it to the suite.
Over the next two weeks, he wrote unit tests for every critical decision point in the screening flow. Tool calls, slot confirmations, intent transitions, escalation triggers. Each test defined a specific conversation state and a specific expected behavior.
When a developer changed a prompt, they ran the unit tests first. Most changes that would break something broke a unit test within seconds. The feedback loop tightened from three days to three minutes.
Boundary tests
Unit tests verified that the agent decided to do what it did. Boundary tests validated what happened when it did it.
Tomás built boundary tests for every tool the screening agent used.
The availability_check tool. Does it receive valid parameters? What does it return for candidates with no prior data versus candidates with existing records? What happens when the backend times out?
The schedule_callback tool. Does it handle timezone conversion correctly? What happens when the requested time is already booked? What error code does it return for invalid dates?
The transfer_to_recruiter tool. Does it pass context correctly? What happens when no recruiters are available? Does the handoff preserve the candidate's identity and qualification status?
Each boundary test ran independently of the conversation. Tomás mocked the conversation state and called the tool directly. This lets him test error paths that rarely occurred in real conversations but would cause failures if unhandled.
The timeout test caught a bug they'd never seen in production. The availability_check tool had no timeout handling. In the test environment with mocked slow responses, the agent waited indefinitely. Tomás added timeout handling and a retry policy. The boundary test suite now includes latency simulation for every tool.
End-to-end tests
With unit and boundary tests in place, end-to-end tests could focus on what they did best. Validating complete workflows.
Tomás maintained fifty scripted scenarios for the screening agent.
Ten happy paths covered candidates who answered clearly, confirmed availability, and completed screening successfully. These validated that the baseline flow worked.
Twenty edge cases covered variations. Candidates who asked questions mid-screening. Candidates who changed their answers. Candidates who needed to reschedule. Candidates with unusual availability patterns. This validated flexibility.
Ten adversarial scenarios covered candidates who were uncooperative, confused, or actively trying to break the agent. "I don't want to answer that." "Can you repeat that three more times?" "Actually, forget what I said earlier." These validated resilience.
Ten regression scenarios covered previous bugs. The availability confirmation bug became a permanent test case. Every production failure got added to the suite so it couldn't recur.
Each scenario was a scripted transcript with expected outcomes at key checkpoints. Did the agent collect the right information? Did it call the right tools? Did it reach the correct conclusion? The tests run nightly and before any major deployment.
Test environment
Deterministic tests required a deterministic environment.
Tomás built mock backends for every integration. The availability database returned consistent fake data. The scheduling service accepted any valid request and returned predictable confirmations. The candidate records system held synthetic candidates with known attributes.
Mock backends let tests run without external dependencies. They eliminated flakiness from network latency, data drift, and third-party outages. They also let Tomás simulate failures that couldn't be triggered on demand in production. Timeouts, rate limits, malformed responses, partial outages.
For the LLM itself, Tomás used temperature zero with a fixed seed where the platform supported it. This reduced but didn't eliminate variation in agent responses. Some tests validated exact responses. Others validated response categories or tool calls, which were more deterministic than natural language output.
The conversation state was reproducible. Every test started from a clean state, progressed through defined turns, and validated outcomes at specific checkpoints. No test depended on the outcome of a previous test.
Regression cadence
Different test layers were run at different times.
Unit tests are run on every prompt change, before the change is merged. A developer couldn't deploy a prompt update without the unit tests passing. This gate caught most regressions before they reached production.
Boundary tests are run on every tool or integration change. A change to a tool contract, an API update from a backend team, a new error code from a vendor. These triggered the boundary suite for all affected tools.
End-to-end tests run nightly and before any deployment to production. The full fifty-scenario suite took forty-five minutes. Running it on every change would slow development. Running it nightly caught interaction effects that unit and boundary tests missed.
When an end-to-end test failed, Tomás investigated. If the failure was traced to a specific tool or prompt decision, he added a unit or boundary test to catch it faster next time. The pyramid grew from the bottom up, with end-to-end failures becoming unit tests wherever possible.
What Tomás built
Six months after the availability confirmation incident, the staffing marketplace had a test infrastructure that matched their deployment velocity.
Three hundred unit tests covering every critical decision point. Sixty boundary tests covering every tool contract. Fifty end-to-end scenarios covering every workflow. A CI pipeline that ran the right tests at the right time.
Prompt changes that would have broken production instead broke unit tests. Tool changes that would have caused silent failures broke boundary tests. Interaction effects that would have gone unnoticed for days got caught in the nightly end-to-end run.
The QA review that had caught the original bug was still valuable. Tomás didn't eliminate human review. But he changed what humans reviewed. Instead of scanning transcripts for obvious failures, reviewers looked for subtle quality issues that tests couldn't catch. The tests handled regression. Humans handled refinement.
The team that runs only end-to-end tests is like a software team that runs only integration tests. When something breaks, they know it's broken but not where or why. They debug by reading transcripts and guessing.
Tomás had given his team something better. A pyramid that caught failures at the layer where they occurred, with feedback loops measured in seconds instead of days.

