Hacking the IVR
Press 1 for billing. Getting an agent through a phone tree is one of the hardest things you can ask it to do, because the IVR breaks the assumption most agent systems are built on: that you can try again.
- An IVR is one of the hardest environments for an AI agent because it is partially observable, latency-bound, and irreversible: the assumptions most agent systems rely on (a map, time to think, the freedom to retry) all fail at once.
- Most agent recovery assumes you can try again; a live phone call has no back button, so the IVR is where “agents that act” stops being a slogan and becomes a measurable, unsolved problem.
- There is essentially no peer-reviewed benchmark for an agent autonomously driving a live IVR with irreversible side effects. The hardest version of the acting problem is the least measured.
- The IVR is the hardest API: the idea that phone trees are one of the most difficult interfaces for AI agents. They expose one audio menu node at a time (partial observability), demand action inside a few-second window (hard latency), and execute irreversible real-world actions on a single keypress (no undo).
- Reversibility as a planning primitive: treating “can this action be taken back?” as a first-class input to action selection. An agent on a live call should estimate which presses are points of no return and hold those to a higher bar (more confidence, the owner’s okay, or escalation) rather than assuming every action is retriable.
In the last two posts we argued that the hard problem in voice agents is acting over a live call, and that the core of acting is a commit policy: knowing what you can’t take back. The cleanest, most unglamorous test of both claims is one you won’t see in a demo reel — navigating an interactive voice response system on someone’s behalf. Airlines, banks, utilities, insurers, delivery companies. The useful calls are exactly the calls with phone trees. The IVR is where “agents that act” stops being a slogan and starts being hard.
Most agent recovery assumes you can try again
Look at how the field builds agents that recover from mistakes and you find the same primitive under all of it: undo. GUI agents backtrack — BacktrackAgent (Wu et al., 2025) detects when a step went wrong and resets to an earlier decision point to try another path, which works because a screen is a place you can navigate back to. Reasoning agents re-roll. The failure-analysis line — Where LLM Agents Fail and How They Can Learn From Failures (Zhu et al., 2025) — shows a single early error cascading into the whole trajectory, and the fix is to re-run from just before the critical step. Search-based agents simulate before they commit, exploring branches in a tree and keeping the good one. Different mechanisms, same dependency: before a bad action matters, the system needs a rewind point, a fresh rollout, or a simulator where branches can die quietly.
Even with retries on the table, these agents aren’t good yet. τ-bench (Yao et al., Sierra, 2024) puts a tool-using agent in clean, untimed, fully observed customer-service tasks and scores it against the true end-state of the database. Frontier function-calling agents finish fewer than half the tasks. What should worry anyone shipping this even more is inconsistency: in the retail domain, run the same task eight times and fewer than a quarter of tasks pass all eight runs. And that’s the easier setting: text and tools, no live clock, full visibility, and an episode you can simply run again.
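The "pass every time" figure is what τ-bench calls pass^k: the chance that k independent runs of the same task all succeed, averaged over tasks. A minimal sketch of the usual estimator, assuming each task was run n times with c successes; the function name and the sample counts are illustrative, not taken from the paper's code.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k from per-task success counts.

    For a task with c successes out of n trials, the chance that k trials
    drawn without replacement all succeeded is C(c, k) / C(n, k).
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        # math.comb returns 0 when k > c, so flaky tasks contribute nothing.
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)


# Illustrative data: three tasks, eight runs each.
counts = [8, 4, 0]
average_success = pass_hat_k(counts, n_trials=8, k=1)
all_eight_pass = pass_hat_k(counts, n_trials=8, k=8)
```

With these made-up counts, the task that passes half the time adds to the k=1 average but contributes nothing at k=8, which is why a respectable single-run score can sit next to a poor consistency score.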
Now take those away.
A phone tree is a world that commits
On a live call there is no back button. The apparent recovery move is to hang up and re-dial, but that isn’t backtracking. It’s starting a second episode after the first one already changed the world. A fresh place in the queue, maybe a different rep, every earlier side effect already done. If the agent pressed the option that cancelled the booking, re-dialing doesn’t un-cancel it. The closest thing to undo is a second action that tries to compensate for the first, and compensation is an apology, not a rollback.
The actions are real. A keypress can route money, cancel service, or commit the owner to a callback window. The other party isn’t a deterministic web page. It’s a human, or a menu designed decades ago to deflect you, and it interrupts, mishears, and re-prompts.
There’s a clock. Many IVRs give you a few seconds before they repeat, advance, or time out. So the agent’s hardest action — the irreversible keypress — is also its most time-pressured one. Perceiving the menu, deciding, and committing all happen inside the same few seconds. Production voice stacks encode this directly: the open-source LiveKit IVR recipe waits for the prompt to finish, then enforces a cooldown between key presses, because firing early or fast trips the tree. The timing of the irreversible act is part of getting it right.
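The timing rule can be sketched as a small gate that answers one question: is it safe to press right now? This is an illustration only, not LiveKit's API. The class name, the 3-second cooldown, and the 5-second response window are all assumptions, and timestamps are passed in explicitly so the logic can be exercised without a live call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeypressGate:
    cooldown_s: float = 3.0   # assumed minimum gap between two presses
    window_s: float = 5.0     # assumed time the IVR waits after a prompt
    prompt_end: Optional[float] = None
    last_press: Optional[float] = None

    def on_prompt_start(self) -> None:
        # The menu is speaking again; any earlier window is void.
        self.prompt_end = None

    def on_prompt_end(self, now: float) -> None:
        self.prompt_end = now

    def can_press(self, now: float) -> bool:
        if self.prompt_end is None:
            return False  # prompt still playing, or nothing heard yet
        if now - self.prompt_end > self.window_s:
            return False  # window has likely expired; wait for the re-prompt
        if self.last_press is not None and now - self.last_press < self.cooldown_s:
            return False  # too soon after the previous press
        return True

    def record_press(self, now: float) -> None:
        self.last_press = now
        self.prompt_end = None  # the next press needs a fresh prompt


gate = KeypressGate()
gate.on_prompt_end(now=10.0)
if gate.can_press(now=11.0):
    gate.record_press(now=11.0)
```

The gate only says when a press is allowed. Whether the press should happen at all is a separate decision, and it has to be made inside the same window.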
And the agent is half-blind. It never sees the tree. It hears one node at a time and infers the structure as it goes — a partially observable problem with no map, where the observation process is lossy, sequential, and built to route humans, not inform agents. A web agent gets the whole DOM. A phone agent gets “for hours and locations, press 2,” and has to guess what lives behind 3.
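One way to picture the half-blind agent is as a map that only fills in where it has listened. The sketch below is a toy under stated assumptions: prompts arrive as clean text transcripts, menus use either the "for X, press N" or the "press N for X" phrasing, and the names `parse_menu` and `PartialTree` are hypothetical. Real prompts are far messier.

```python
import re

# "For billing, press 1."
FOR_PRESS = re.compile(
    r"for\s+(?P<label>[^,.]+?),?\s+press\s+(?P<key>[0-9*#])", re.IGNORECASE)
# "Press 2 to speak to an agent."
PRESS_FOR = re.compile(
    r"press\s+(?P<key>[0-9*#])\s+(?:for|to)\s+(?P<label>[^,.]+)", re.IGNORECASE)


def parse_menu(transcript: str) -> dict[str, str]:
    """Extract {key: label} from one spoken menu, trying each phrasing in turn."""
    for pattern in (FOR_PRESS, PRESS_FOR):
        options = {m["key"]: m["label"].strip().lower()
                   for m in pattern.finditer(transcript)}
        if options:
            return options
    return {}


class PartialTree:
    """The part of the phone tree heard so far, keyed by the keys pressed to get there."""

    def __init__(self) -> None:
        self.nodes: dict[tuple[str, ...], dict[str, str]] = {}

    def observe(self, path: tuple[str, ...], transcript: str) -> dict[str, str]:
        options = parse_menu(transcript)
        self.nodes.setdefault(path, {}).update(options)
        return options

    def is_known(self, path: tuple[str, ...]) -> bool:
        return path in self.nodes


tree = PartialTree()
root = tree.observe((), "For billing, press 1. For hours and locations, press 2.")
```

After one prompt the map holds two labels at the root and nothing beneath them: `tree.is_known(("1",))` is false until the agent actually presses 1 and listens, and a key the menu never mentioned does not appear at all.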
Web agents operate in a world built to be retried. A phone tree operates in a world that commits.
Reversibility should be a planning primitive
The reinforcement-learning literature already has the concept agent tooling rarely treats as first-class. There Is No Turning Back (Grinsztajn et al., NeurIPS 2021) learns to estimate the reversibility of an action — is this a point of no return? — and uses it two ways: steer exploration away from irreversible transitions, or filter irreversible actions out before they ever reach the environment. That’s the question an IVR agent has to answer before every press. Can I take this back, or am I about to commit the owner to something?
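The filtering idea can be sketched as a threshold over an estimated reversibility score. To be clear about what is assumed: the paper learns its estimator from experience, whereas the `keyword_prior` function below is a hand-written placeholder, and the hint words, scores, and 0.5 threshold are invented for illustration.

```python
from typing import Callable

# Hypothetical cue words suggesting a press may not be undoable.
IRREVERSIBLE_HINTS = ("cancel", "pay", "confirm", "close", "transfer funds")


def keyword_prior(label: str) -> float:
    """Placeholder estimate in [0, 1] of how reversible an option is."""
    text = label.lower()
    return 0.1 if any(hint in text for hint in IRREVERSIBLE_HINTS) else 0.9


def safe_options(options: dict[str, str],
                 estimate: Callable[[str], float] = keyword_prior,
                 threshold: float = 0.5) -> dict[str, str]:
    """Keep only options whose estimated reversibility clears the threshold."""
    return {key: label for key, label in options.items()
            if estimate(label) >= threshold}


menu = {
    "1": "billing questions",
    "2": "cancel your service",
    "3": "hours and locations",
}
allowed = safe_options(menu)
```

Here option 2 is filtered out before it can reach the keypad. A substring prior like this is crude (it would also flag "payment history"), which is part of why a cold call needs a better estimator than a word list.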
The catch: the method assumes lots of episodes to learn from, and a cold outbound call gives you one. Estimating “is the next press a point of no return?” from priors instead of experience, fast enough to beat a few-second timeout, is genuinely unsolved. So is the recovery question — what’s the IVR analog of “back”? Re-dial, escalate to a human, or perform a compensating action. No benchmark even measures whether an agent does this well.
That’s the gap. There is thorough, cited work benchmarking the text version of customer support: τ-bench, τ²-bench (Barres et al., 2025), and Beyond IVR (Balaji et al., 2026), which scores how well an agent stays on a prescribed multi-step journey. There is essentially no peer-reviewed benchmark for an agent autonomously driving a live phone IVR with irreversible side effects, scored against what actually happened in the world. The real know-how lives in engineering recipes and a pile of granted patents on “automatic IVR navigation on behalf of a user.” The hardest version of the acting problem is the least measured.
What it means for an agent that calls on your behalf
For us this isn’t a thought experiment. It’s the case the product stands on. When an AI phone agent places a call to reschedule a delivery or dispute a charge, every keypress and spoken confirmation is an action taken as the business owner, in the real world, that the owner can’t un-take. The academic stakes and the product stakes are the same stakes, and the design rules follow.
Treat reversibility as a first-class input to action selection, not an afterthought. The agent should estimate which presses are likely points of no return and hold those to a higher bar: more confidence, the owner’s okay, or escalation — the commit point from the last post. Make recovery a real surface, not a hope: re-dial cleanly, escalate to a human, or take a compensating action, and then tell the owner precisely what irreversibly happened. And when the caller is acting as the owner, the callee deserves to know who, or what, is acting. The FCC’s 2024 ruling put AI-generated voices under the robocall rules, and Google’s Duplex opened with “I’m calling on behalf of a client” years before it had to. Irreversible action and accountable agency are the same discipline.
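These rules can be sketched as a gate in front of every keypress plus a ledger for the report back to the owner. Everything here is an assumption made for illustration: the names `decide` and `CallLedger`, the three outcomes, and the numeric bars are not a description of any shipped system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PRESS = "press"
    ASK_OWNER = "ask_owner"
    ESCALATE = "escalate"


def decide(confidence: float, reversibility: float, owner_approved: bool,
           routine_bar: float = 0.7, commit_bar: float = 0.95,
           no_return_below: float = 0.5) -> Decision:
    """Pick an outcome for one candidate keypress (all thresholds are assumptions)."""
    if reversibility >= no_return_below:
        # Likely retriable: an ordinary confidence bar is enough.
        return Decision.PRESS if confidence >= routine_bar else Decision.ESCALATE
    # Likely a point of no return: higher bar first, then the owner's okay.
    if confidence < commit_bar:
        return Decision.ESCALATE
    return Decision.PRESS if owner_approved else Decision.ASK_OWNER


@dataclass
class CallLedger:
    """Records every press so the owner can be told what irreversibly happened."""
    entries: list[tuple[str, str, float]] = field(default_factory=list)

    def record(self, key: str, label: str, reversibility: float) -> None:
        self.entries.append((key, label, reversibility))

    def irreversible_summary(self, no_return_below: float = 0.5) -> list[str]:
        return [f"pressed {key} ({label})"
                for key, label, rev in self.entries if rev < no_return_below]


ledger = CallLedger()
if decide(confidence=0.97, reversibility=0.1, owner_approved=True) is Decision.PRESS:
    ledger.record("2", "cancel your service", 0.1)
report = ledger.irreversible_summary()
```

The same press with `owner_approved=False` comes back as `ASK_OWNER` rather than going through, and the ledger is what turns "tell the owner precisely what irreversibly happened" into something the agent can actually produce after the call.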
Better audio gave us an agent that can talk. Acting irreversibly, on a clock, against a tree it can’t see is what it takes to get an agent that can do. That’s why the phone tree, dull as it looks, is the hardest API in the world. And whether an agent can be trusted with it is a measurement problem of its own — that’s where we’re headed next.
- Wu, Gao, Liu, Luan. BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism (2025). arXiv:2505.20660
- Zhu et al. Where LLM Agents Fail and How They Can Learn From Failures (2025). arXiv:2509.25370
- Yao, Shinn, Razavi, Narasimhan (Sierra). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (2024). arXiv:2406.12045
- Barres, Dong, Ray, Si, Narasimhan. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment (2025). arXiv:2506.07982
- Balaji, Mishra, Sachdeva, Agrawal. Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence (2026). arXiv:2601.00596
- Grinsztajn, Ferret, Pietquin, Preux, Geist. There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning (NeurIPS 2021). arXiv:2106.04480
- LiveKit. Building an Automated IVR Menu Caller (engineering recipe).
- FCC. Declaratory Ruling: TCPA applies to AI-generated voice calls (Feb 8, 2024).