Looking Is Already an Act: The Commit Problem in Voice Agents
To talk at human speed, an agent has to start working before it knows exactly what’s needed. Fine when the work is a database read. A trap when it’s placing a call, sending a text, or canceling a booking. What makes an agent trustworthy isn’t what it does. It’s what it won’t do on a guess.
- The commit problem is deciding when an AI voice agent may take an irreversible or intent-disclosing action before it is fully certain. It is the core problem for an agent that acts.
- Tool safety is not a binary of read-only vs. write: a read can be irreversible in the privacy domain (it discloses intent to a third party) even when it is perfectly reversible in the state domain.
- The set of actions an agent can’t take back should be small, fixed, and declared outside the model, because a model’s in-the-moment judgment about whether an action is safe to commit is exactly the judgment you can’t trust.
- The commit problem: The decision of when an AI agent may take an irreversible or intent-disclosing action before it is fully certain. For a phone agent, the product is the policy for when to wait, when to act, and what never to do speculatively.
- Intent disclosure: The way issuing a tool call can leak what an agent is about to do, e.g. a “read-only” question about a vendor’s dispute process that tells the vendor a dispute is coming. A second axis of irreversibility alongside state-reversibility: a read can be costless to undo yet impossible to un-disclose.
- Tool typing by effect: Annotating every tool, outside the model, by its position on two axes, state-reversibility (can the effect be undone?) and intent-disclosure (does issuing it leak what you’re about to do?), and attaching its latency and confirmation policy to that type.
- Outbound voice agent: An AI phone agent that initiates calls to third parties to complete owner-assigned tasks. Placing the call is the one irreversible act, gated behind the owner’s okay: the commit point stated in plain English.
In the last post we said the hard part of a voice agent is acting over a live call, not sounding good, and we ended on a claim we didn’t prove: for a phone agent, the product is the policy for when to wait, when to act, and what never to do on a guess. Here is the argument.
The speed trap
A tool-using agent has a serial dependency baked in: reason, call a tool, wait, reason again. On a screen you can hide the wait behind a spinner. On a phone call you can’t. Gaps between turns in natural conversation run about 200–450 ms, and a pause past a second reads as the agent freezing. A single network round-trip to a calendar or a CRM can blow that budget on its own.
So you stop waiting. Start the likely tool call while the user is still talking and have the answer ready before your turn comes. That’s speculative execution, and for real-time voice it’s one of the few ways to stay inside the clock. Speculative Interaction Agents (Hooper et al., 2026) splits the reason-and-act thread off from waiting on I/O and fires tool calls early, reporting 1.3–2.2× speedups. Speculative Actions (Ye et al., 2025) does the same and calls itself “lossless” — no task-success regression versus running serially.
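A minimal sketch of the idea, assuming an asyncio-based agent loop. The names `lookup_availability` and `listen_for_end_of_turn` and the sleep durations are hypothetical stand-ins, not taken from either paper. The lookup starts while the user is still talking; if the final transcript doesn't match the guess, the speculative result is dropped.

```python
import asyncio

async def lookup_availability(day: str) -> str:
    """Hypothetical slow read, standing in for a calendar or CRM round-trip."""
    await asyncio.sleep(0.3)
    return f"open slots on {day}: 6pm, 7pm"

async def listen_for_end_of_turn() -> str:
    """Hypothetical: resolves with the final transcript once the user stops talking."""
    await asyncio.sleep(0.2)
    return "friday"

async def handle_turn(guessed_day: str) -> str:
    # Start the likely lookup before the user has finished speaking.
    speculative = asyncio.create_task(lookup_availability(guessed_day))
    actual_day = await listen_for_end_of_turn()
    if actual_day == guessed_day:
        # Guess was right: most of the round-trip overlapped with the user's speech.
        return await speculative
    # Guess was wrong: discard the local result and run the real lookup.
    # Cancelling only drops the result on our side; a request already sent stays sent.
    speculative.cancel()
    return await lookup_availability(actual_day)

print(asyncio.run(handle_turn("friday")))
```

With a correct guess, the caller waits only for whatever part of the lookup had not finished by the end of the user's turn. With a wrong guess, the cost is one wasted request plus the full serial lookup.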
The obvious guardrail
Both papers know speculation is dangerous, and both reach for the same rule: only guess on tools that are safe to run early. Reads are cheap to guess. Look up the account, check the table, and if you guessed wrong, throw it away. Writes get held until the agent is sure: book it, charge it, send it. Speculative Interaction Agents makes this its central mechanism — classify tools as safe or unsafe, speculate the safe ones, hold the unsafe ones to a “commit point.”
Reads free, writes gated. That’s what we assumed last post too. It’s wrong, and the way it’s wrong is the best thing we read all week.
Looking is already an act
Ghost Tool Calls (Mohammadi, Klein, Arora, Bindschaedler, 2026) asks what happens to a speculative call the agent later decides it didn’t need. The compute gets thrown away. The disclosure does not. The read still went out to somebody’s server, and that server’s logs now hold whatever the call revealed: the topics, the entities, the destination the agent guessed at. You can’t un-send what someone else already logged.
Usually that’s harmless. A restaurant learning you might book a table can’t do much with it. But say the owner is deciding whether to dispute a vendor’s invoice, and the agent gets ahead by calling the vendor’s billing line to ask how disputes work. Nothing filed, nothing changed — except the vendor now knows a dispute is coming, and gets to prepare before the owner does. Or the agent asks the current supplier about exit terms while the owner is still shopping for a replacement. The supplier now knows you’re leaving before you have anywhere to go. Asking the question shows your hand.
So “read-only is free” is false. A read can be impossible to take back even when it changes nothing: you can undo the effect, you can’t undo the telling. Two questions, not one — can I undo it, and does asking give me away? Every tool call answers both, and the safe-to-guess set is smaller than a read/write flag says it is.
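The two questions can be written down as a pair of flags per call. This is an illustrative sketch only: the `ToolCall` type, the tool names, and the flag values are assumptions made up for the example, not an existing API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    undoable: bool          # state axis: can the effect be reversed?
    discloses_intent: bool  # privacy axis: does issuing it tip off a third party?

def safe_to_guess(call: ToolCall) -> bool:
    """Safe to issue speculatively only if it can be undone AND gives nothing away."""
    return call.undoable and not call.discloses_intent

# Hypothetical tools, chosen to hit each interesting corner of the two axes.
calls = [
    ToolCall("read_own_calendar", undoable=True, discloses_intent=False),
    ToolCall("ask_vendor_about_disputes", undoable=True, discloses_intent=True),
    ToolCall("place_call", undoable=False, discloses_intent=True),
]
for c in calls:
    print(c.name, safe_to_guess(c))
```

A plain read/write flag would have passed the second call as safe. The second axis is what takes it out of the speculative set.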
There is no undo for a placed call
The other axis has a hard floor that text agents mostly don’t hit. When a text agent’s write goes wrong, you can often roll the database back. When a phone agent’s action goes wrong, you can’t. You cannot un-ring a phone, un-send a text, or un-leave a voicemail.
The field’s answer to irreversibility isn’t undo. It’s the make-up call: the saga pattern, where every forward action ships with a second, compensating action meant to cancel it out. Cancel the booking you shouldn’t have made. Call back and apologize. ACRFence (Zheng et al., 2026) draws the line: checkpoint-and-restore can rewind your own process, not what you already did to someone else’s. A make-up call is an apology, not a rollback. The booking happened. The customer was called. The world moved.
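A small sketch of the saga pattern, under the assumption that each step is a pair of plain callables (forward action, compensating action). The step names and the failing deposit are invented for illustration. Note what the log looks like afterward: compensations are appended to history, they do not erase it.

```python
from typing import Callable

# (forward action, compensating action)
Step = tuple[Callable[[], None], Callable[[], None]]

def run_saga(steps: list[Step]) -> bool:
    """Run forward actions in order; on failure, run compensations in reverse order."""
    done: list[Callable[[], None]] = []
    for forward, compensate in steps:
        try:
            forward()
        except Exception:
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True

log: list[str] = []

def fail_deposit() -> None:
    # Hypothetical third step that fails and triggers the compensations.
    raise RuntimeError("deposit declined")

ok = run_saga([
    (lambda: log.append("book table"), lambda: log.append("cancel booking")),
    (lambda: log.append("call customer"), lambda: log.append("call back and apologize")),
    (fail_deposit, lambda: None),
])
print(ok, log)
```

The saga ends in a consistent state, but both the booking and the customer call still appear in the log. That is the difference between compensation and rollback.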
A wrong guess on a read costs a fraction of a cent. A wrong call — to a customer, in the owner’s name, committing to something — costs trust, and you only mostly get to fix it. The commit decision is a bet: act early only if the time you save beats the odds of being wrong times the price of being wrong. Everyone measures the latency. Everyone measures the error rate. Almost nobody has priced being wrong, and on a phone it’s huge and different for every action. As far as we can tell, that price list doesn’t exist yet. Somebody has to write it.
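The bet can be written as a one-line inequality: act early only if (time saved × value of that time) exceeds (probability of being wrong × cost of being wrong). The sketch below assumes all four quantities can be put on one common scale, which is exactly the part nobody has done yet; the function name, parameters, and numbers are placeholders, not measurements.

```python
def should_act_early(seconds_saved: float, value_per_second: float,
                     p_wrong: float, cost_if_wrong: float) -> bool:
    """Act speculatively only if the expected benefit beats the expected loss."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability between 0 and 1")
    benefit = seconds_saved * value_per_second
    expected_loss = p_wrong * cost_if_wrong
    return benefit > expected_loss

# Illustrative numbers only; nothing here is measured.
cheap_read = should_act_early(0.8, 0.01, p_wrong=0.2, cost_if_wrong=0.001)
wrong_call = should_act_early(0.8, 0.01, p_wrong=0.02, cost_if_wrong=50.0)
print(cheap_read, wrong_call)
```

Even with a much lower error rate, the second case loses the bet because the cost term dominates. The latency and error-rate inputs are the easy ones to measure; `cost_if_wrong` is the missing price list.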
The policy
Put those together and the design falls out.
Type the tools, don’t ask the model. Every tool gets a declared effect (read, reversible write, write you can apologize for, write you can’t) and a disclosure level. You write that down before anything runs. Mind the GAP (Cartagena & Teixeira, 2026) is why: across frontier models and regulated domains, a model that refuses a harmful request in text will still turn around and attempt the same thing as a tool call. Refusing in words says nothing about refusing in deeds. The one judgment you can’t outsource to the model is whether the model should be trusted to act, so the can’t-take-back list stays short, fixed, and outside it. The Model Context Protocol reached the same conclusion from the other side: its specification says hosts must obtain explicit user consent before invoking any tool, even though the protocol itself can’t enforce that. Same instinct, written into the spec.
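One way to sketch "declared outside the model" is a read-only table that the runtime consults and the model never edits. Everything below is hypothetical: the tool names, which effect each one is assigned, and the default-deny rule for undeclared tools are assumptions for illustration, not a description of any shipping system.

```python
from enum import Enum
from types import MappingProxyType

class Effect(Enum):
    READ = "read"
    REVERSIBLE_WRITE = "reversible write"
    APOLOGIZABLE_WRITE = "write you can apologize for"
    IRREVERSIBLE_WRITE = "write you can't take back"

class Disclosure(Enum):
    NONE = 0
    THIRD_PARTY = 1

# Declared ahead of time and read-only at runtime; the model never edits this table.
TOOL_TYPES = MappingProxyType({
    "lookup_contact": (Effect.READ, Disclosure.NONE),
    "draft_message": (Effect.REVERSIBLE_WRITE, Disclosure.NONE),
    "cancel_booking": (Effect.APOLOGIZABLE_WRITE, Disclosure.THIRD_PARTY),
    "place_call": (Effect.IRREVERSIBLE_WRITE, Disclosure.THIRD_PARTY),
})

def needs_owner_okay(tool_name: str) -> bool:
    """Decide the gate from the declared type, never from model output."""
    declared = TOOL_TYPES.get(tool_name)
    if declared is None:
        return True  # assumption: an undeclared tool is treated as unsafe
    effect, disclosure = declared
    return (effect in (Effect.APOLOGIZABLE_WRITE, Effect.IRREVERSIBLE_WRITE)
            or disclosure is Disclosure.THIRD_PARTY)

print(needs_owner_okay("lookup_contact"), needs_owner_okay("place_call"))
```

The design choice worth noting is that the gate function takes only a tool name. There is no parameter through which the model's own confidence or reasoning could argue its way past the table.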
Commit late, keep an exit. Guess freely on everything cheap. Hold anything irreversible until the last responsible moment, and keep it cancelable until it’s actually out the door — the owner can say “wait, no” mid-sentence, and the policy has to honor it right up to the instant the call is placed.
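A sketch of "cancelable until it's out the door", assuming the irreversible action is wrapped in a small state machine. The `PendingCommit` class is hypothetical; the lock is there so that a cancel arriving from the audio thread and a commit arriving from the agent loop cannot both win.

```python
import threading

class PendingCommit:
    """Holds an irreversible action that can be cancelled until the moment it fires."""

    def __init__(self, action):
        self._action = action
        self._lock = threading.Lock()
        self._state = "pending"  # pending -> cancelled | committed

    def cancel(self) -> bool:
        """Returns True if the action is cancelled, False if it already went out."""
        with self._lock:
            if self._state == "pending":
                self._state = "cancelled"
            return self._state == "cancelled"

    def commit(self) -> bool:
        """Fires the action at most once, and only if nobody cancelled first."""
        with self._lock:
            if self._state != "pending":
                return False
            self._state = "committed"
            self._action()
            return True

placed = []
call = PendingCommit(lambda: placed.append("call placed"))
print(call.cancel())  # owner says "wait, no" before the call goes out
print(call.commit())  # the commit no longer fires
print(placed)
```

Once `commit()` has succeeded, `cancel()` returns False, which is the honest answer: from that point the only option left is a make-up call.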
Confirm without sounding like a robot. “Shall I proceed? Yes or no?” before every action kills the rhythm you just paid for. Dialogue systems solved this years ago with implicit confirmation: say what you’re doing while you do it. “Calling them now to move it to 7.” The sentence is the confirmation. Interrupting it is the no. You only stop and ask “is that okay?” when the stakes, the cost, and the doubt are all high at the same time.
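The rule in the last sentence is an AND, not an OR, and a sketch makes that explicit. The 0-to-1 scores and the 0.7 threshold are invented for illustration; how stakes, cost, and doubt would actually be scored is left open here.

```python
from enum import Enum

class Confirm(Enum):
    IMPLICIT = "announce while acting"
    EXPLICIT = "stop and ask"

def confirmation_mode(stakes: float, cost: float, doubt: float,
                      threshold: float = 0.7) -> Confirm:
    """Explicit confirmation only when stakes, cost, and doubt are ALL high."""
    if min(stakes, cost, doubt) >= threshold:
        return Confirm.EXPLICIT
    return Confirm.IMPLICIT

# High stakes and cost, but the agent is sure of what was asked: just announce it.
print(confirmation_mode(stakes=0.9, cost=0.9, doubt=0.1).value)
# All three high: stop and ask.
print(confirmation_mode(stakes=0.9, cost=0.9, doubt=0.9).value)
```

Using `min` keeps the explicit prompt rare: a single low score is enough to fall back to announcing the action and letting an interruption serve as the no.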
For an agent that acts on your behalf
This isn’t abstract for us. The thing we build is an AI phone agent that places calls for the owner, on the owner’s okay — and that sentence is the whole policy. “On your okay” is the commit point. Everything before it (who to call, what to say, what’s possible) is guesswork the agent should do early and often, because it can all be taken back. Placing the call is the one act that can’t, and it stays behind the okay.
The list of things the agent can’t take back is four items long: place a call, end or transfer one, leave a voicemail, send a message. Gated, all of them. Same treatment for the lookups that would tip the owner’s hand to a third party — those wait for the okay too.
Answering a call is observation. Placing one is a commit. The okay is where the agent stops guessing and starts being accountable. That policy, more than the voice, is what we’re building.
- Hooper, Kang, Moon, Lee, Wen, Wawrzynek, Mahoney, Shao, Gholami, Keutzer — Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling (2026). arXiv:2605.13360
- Ye, Ahuja, Liargkovas, Lu, Kaffes, Peng — Speculative Actions: A Lossless Framework for Faster Agentic Systems (2025). arXiv:2510.04371
- Mohammadi, Klein, Arora, Bindschaedler — Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools (2026). arXiv:2606.02483
- Cartagena, Teixeira — Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents (2026). arXiv:2602.16943
- Zheng, Yang, Zhang, Quinn — ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore (2026). arXiv:2603.20625
- Model Context Protocol — Specification (2025-11-25).
- Lin, Chen, Chen, Lee — Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency (2026). arXiv:2604.04847