What you'll learn: The operational discipline required to run voice agents in production without silent degradation.
Key takeaways:
- Voice agents need the same operational rigor as any production service. Daily checks, weekly reviews, monthly business reviews, runbooks, and versioning.
- Daily automated checks catch immediate problems. Weekly human reviews catch quality drift. Monthly business reviews connect performance to outcomes.
- Prompt changes should be treated like code deployments. Versioned, tested, gradually rolled out, and rollback-ready.
- Every operational responsibility needs explicit ownership. Unowned metrics don't get watched.
- The weekly review is the most important operational practice. It's the heartbeat of a program that improves instead of slowly failing.
Three months after launch, Marcus stopped watching the dashboard.
The Install Reminder Agent had been running smoothly. Completion rates were stable. Error rates were low. The team had moved on to building the next agent. The first agent was "done."
Nina noticed the problem during a routine call review. Install rates had dropped from 22% to 16% over two weeks. No alerts had fired because no one had configured alerts for business metrics. The dashboard clearly showed the decline, but no one had looked at it in three weeks.
Marcus investigated. The SMS gateway had added a new required field in an API update. Calls to the old endpoint returned 200 status codes but didn't actually send messages. The agent heard "success" from the API and confirmed delivery. Drivers waited for links that never arrived.
30% of install reminders over 2 weeks failed silently. The agent said, "I've sent you the link," with complete confidence, but the link wasn't being sent.
A weekly operational review that included tool-call success rates would have caught this in a matter of days. Marcus built the review process they should have had from the start.
Daily, weekly, monthly
Marcus established three rhythms.
Daily automated checks run every morning. Tool call success rates across all agents. Transcription confidence averages. Latency percentiles. Error rate trends. These checks ran automatically and posted results to a shared channel. If any metric crossed a threshold, it tagged the relevant owner.
The daily checks caught immediate problems. An API outage at 3am showed up in the morning check as a spike in tool errors. A transcription provider degradation appeared as a confidence drop. The checks didn't require human attention on normal days, but surfaced issues quickly when something went wrong.
Weekly human reviews happened every Thursday. Marcus and Nina sampled calls together. Ten random calls from the week. Five calls with unusually long handle times. Five calls that escalated to humans. They listened, scored against the quality rubric, and identified patterns.
The weekly review caught quality issues that automated checks missed. An agent who technically completed calls but rushed confirmations. A tool that succeeded but returned data that the agent misinterpreted. A new edge case that occurred three times this week and would happen more often at scale.
The weekly review also groomed the improvement backlog. Every issue identified became a ticket. Issues were prioritized by frequency and impact. The highest-priority items got fixed before the next review.
Monthly business reviews connected agent performance to business outcomes. Install completion rates. Cost per successful install. Comparison to human-handled benchmarks. Trend lines over the quarter.
The monthly review involved business stakeholders, not just the technical team. It answered the question leadership cared about: Is this investment working? The review surfaced strategic decisions. Should we expand to new use cases? Should we further optimize the current agent? Should we adjust targets?
Runbooks
When the SMS gateway incident happened, Marcus scrambled to figure out what to do. There was no playbook. He made decisions under pressure without guidance.
He built runbooks afterward.
The tool failure runbook specified what to do when a tool stopped working. Check the provider status page. Run a manual test against the API. Check recent changes to tool contracts or backend systems. If the failure is on the provider side, switch to the backup provider. If it's a contract mismatch, identify the change and deploy a fix.
The quality degradation runbook specified what to do when metrics dropped. Sample recent calls for the affected metric. Identify the common failure pattern. Check for recent prompt changes, provider updates, or backend changes. Isolate the cause before attempting fixes.
The escalation runbook specified who owned which decisions. Operations could disable a single agent autonomously. Disabling multiple agents required engineering approval. Rolling back to a previous version required a fifteen-minute assessment window. A complete shutdown required director approval.
Each runbook existed before the next incident, not in response to it. Marcus added to the runbooks whenever a new failure mode appeared. The runbooks captured organizational learning in actionable form.
Versioning
The SMS gateway incident taught Marcus another lesson. He didn't know what had changed because he wasn't tracking versions carefully.
He established versioning for everything that could change.
Prompt versions tracked every change to system prompts. Each version had a timestamp, an author, a summary of what changed, and a link to the test results that validated the change. Rolling back to a previous prompt version was a one-command operation.
Tool contract versions tracked changes to backend integrations. When the SMS gateway added a required field, the old contract version should have been marked deprecated, and the new contract version should have been validated before the change went live. The version history would have shown exactly when the mismatch began.
Knowledge base versions tracked changes to retrieved content for RAG-enabled agents. New documents, updated documents, deleted documents. If an agent suddenly started giving different answers, the version history showed what content had changed.
Agent configuration versions bundled all components together. This prompt version, with this tool contract version, with this knowledge base version. Rolling back an agent meant reverting to a known-good configuration bundle rather than trying to figure out which individual component caused the problem.
Each version included "what changed" notes. Not just "updated prompt" but "added handling for drivers who report storage issues, adjusted timeout from 5s to 8s for slow network conditions." The notes made debugging faster.
Ownership
The SMS gateway incident happened partly because no one owned the monitoring.
Marcus established explicit ownership for every operational responsibility.
Daily check review was the responsibility of the on-call engineer. They reviewed the automated check results every morning and acknowledged or escalated issues.
The weekly call review was led by Nina. She scheduled the sessions, prepared the sample, facilitated the review, and documented the findings.
Tool health monitoring was owned by Priya. She maintained relationships with backend teams, tracked their change schedules, and ensured tool contracts remained synchronized.
Business metric reporting was owned by Marcus. He prepared the monthly review, maintained the dashboards, and escalated when metrics drifted from targets.
Ownership was written down and visible. When something needed attention, there was no ambiguity about who was responsible. Unowned metrics didn't get watched. The explicit assignments prevented gaps.
Prompt changes as deployments
Early on, Marcus treated prompt changes casually. A quick edit, a manual test, push to production. The feedback cycle was fast and informal.
After a prompt change caused unexpected escalation spikes, he formalized the process.
Prompt changes went through the same rigor as code deployments. Write the change. Run the unit test suite. Run the boundary test suite. Get a review from at least one other person. Deploy to a staging environment. Run end-to-end tests against staging. Deploy to production at 10% traffic. Monitor for an hour. Expand to 100% if metrics are stable.
The process added overhead. A change that used to take ten minutes now takes two hours. But the process caught issues before production. The escalation spike would have appeared in staging, not in front of real callers.
Marcus also established change windows. Prompt changes are deployed on Tuesday through Thursday during business hours. Never on Friday before a weekend. Never during known high-volume periods. The windows ensured that someone was available to respond if a change caused problems.
The agent that had degraded silently for two weeks never degraded silently again.
Marcus checked the install rate dashboard every Thursday before the weekly review. Three months after establishing the operational cadence, the rate held steady at 24%. Higher than before the SMS gateway incident. Higher than when no one was watching.
The weekly reviews had surfaced optimizations that the team missed during development. A confirmation phrasing that reduced confusion. A retry pattern that recovered from transient failures. Small improvements, found because someone was looking.
Nina once asked him why he still checked the dashboard manually when the alerts would catch any real problems. He pulled up the historical chart and pointed to the two-week gap where the line had dropped from 22% to 16%. Nothing had alerted me. The dashboard clearly showed the decline. No one had looked.
He looked now.

