
Outbound voice agent

An outbound voice agent is an AI phone agent that places calls to third parties to complete tasks for its owner — with the call itself gated behind the owner’s okay.

Definition

Where an inbound assistant waits to be called, an outbound voice agent initiates the call: it dials a supplier, an airline, a customer, or a booking line and carries out a task the owner assigned. It can wait on hold, navigate a phone tree, state the request, and then report back what happened.

Placing the call is the one irreversible act — you can’t un-ring a phone or un-leave a voicemail — so an outbound voice agent holds that step behind the owner’s explicit okay. Everything before it (deciding who to call, drafting what to say, checking what’s possible) is reversible work it can do eagerly to stay fast.
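The split between eager, reversible preparation and a gated commit point can be sketched in a few lines. This is a minimal illustration, not a description of any particular product: `CallPlan`, `OutboundAgent`, `ApprovalRequired`, and the `dial` hook are hypothetical names, and the owner's okay is modeled as a simple flag.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CallPlan:
    """Reversible prep work: nothing here touches the phone network."""
    callee: str
    number: str
    script: str
    approved: bool = False


class ApprovalRequired(Exception):
    """Raised when a call is attempted without the owner's explicit okay."""


@dataclass
class OutboundAgent:
    dial: Callable[[str], None]  # hypothetical telephony hook, injected by the caller
    log: List[str] = field(default_factory=list)

    def prepare(self, callee: str, number: str, task: str) -> CallPlan:
        # Eager and reversible: draft the plan, place no call.
        script = f"Hello, this is an AI assistant calling on my owner's behalf about: {task}"
        self.log.append(f"drafted plan for {callee}")
        return CallPlan(callee=callee, number=number, script=script)

    def approve(self, plan: CallPlan) -> None:
        # Assumption: only the owner's explicit okay reaches this method.
        plan.approved = True

    def place_call(self, plan: CallPlan) -> None:
        # The commit point: irreversible, so refuse without approval.
        if not plan.approved:
            raise ApprovalRequired(f"owner has not approved the call to {plan.callee}")
        self.dial(plan.number)
        self.log.append(f"dialed {plan.callee}")
```

In this sketch `prepare` can run as often and as early as needed, because discarding a `CallPlan` costs nothing. Only `place_call` is guarded, which keeps the approval check on the single step that cannot be undone.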

Why it matters

Outbound is the harder, more valuable half of phone work, and the half almost no “answer your calls” product touches. Sitting on hold, chasing an ETA, confirming a booking, telling a customer you’re running behind — these eat an owner’s day, and they’re exactly what an outbound voice agent removes.

It’s also the harder engineering problem. Almost everything an outbound agent does is a tool call against an API built out of another human, or an IVR pretending to be one — under a live clock with irreversible consequences. Three challenges define the work.

The hardest thing an outbound voice agent does is get through the automated menu on the other end. A phone tree (IVR) gives the agent only partial observability: it hears one menu node at a time, in real time, with no map of the branches below, and a wrong selection usually has no undo. Underneath sits an awkward audio constraint — on most telephony stacks an agent cannot send keypad tones (DTMF) and listen at the same time, and many IVRs ignore digits pressed while the prompt is still playing or drop tones sent too fast. Production tooling encodes the workarounds as guardrails rather than guarantees: LiveKit’s reference IVR recipe wraps tone-sending in a multi-second cooldown, and Vapi’s guidance is to wait for the full prompt, space digits with pause characters, and fall back to speaking the option aloud when tones fail.
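The guardrails described above (wait for the prompt to finish, space the digits, enforce a cooldown, fall back to speech) can be combined into one small wrapper. The sketch below is generic Python and does not use LiveKit's or Vapi's actual APIs. `DtmfGuard`, `send_tones`, and `speak` are hypothetical names, the 3-second cooldown and the `w` pause character are assumed values, and treating `send_tones` as returning whether the IVR advanced is a simplification of what would really be inferred by listening.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


def space_digits(digits: str, pause: str = "w") -> str:
    """Interleave a pause character between digits, e.g. '123' -> '1w2w3'."""
    return pause.join(digits)


@dataclass
class DtmfGuard:
    send_tones: Callable[[str], bool]  # hypothetical: True if the IVR advanced
    speak: Callable[[str], None]       # hypothetical: says the option aloud
    cooldown_s: float = 3.0            # assumed value, tune per IVR
    clock: Callable[[], float] = time.monotonic
    _last_sent: Optional[float] = field(default=None, init=False)

    def choose(self, digits: str, spoken_option: str, prompt_finished: bool) -> str:
        # Guardrail 1: never press keys while the prompt is still playing.
        if not prompt_finished:
            return "waiting_for_prompt"
        # Guardrail 2: enforce a cooldown between tone bursts.
        now = self.clock()
        if self._last_sent is not None and now - self._last_sent < self.cooldown_s:
            return "cooling_down"
        self._last_sent = now
        # Guardrail 3: space the digits, then fall back to speech if tones fail.
        if self.send_tones(space_digits(digits)):
            return "tones_accepted"
        self.speak(spoken_option)
        return "spoke_fallback"
```

The clock is injected so the cooldown can be tested without sleeping. None of these checks guarantees the IVR accepts the input, which matches the point above: they are guardrails, not guarantees.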

How big is the gap between a lab and a live line? The τ-Voice benchmark (Ray et al., 2026) finds that full-duplex voice agents complete only 31–51% of grounded tasks under clean audio, and 26–38% with noise and accents, against 85% for the same model in text. Under the noisy condition that is roughly a third to a half of text-mode ability retained (26/85 ≈ 31%, 38/85 ≈ 45%), with most failures traced to the agent rather than the test harness. A purpose-built public benchmark for autonomously driving a live IVR with irreversible side effects barely exists yet, which is a real gap in how the field measures the thing that matters most.
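The retention figure is just the ratio of the voice score to the text score, so it can be recomputed from the numbers quoted above. The snippet below uses only those figures. The function name `retained` is illustrative.

```python
TEXT_SCORE = 85  # percent of tasks completed in text mode, as quoted above


def retained(voice_score: float, text_score: float = TEXT_SCORE) -> float:
    """Share of text-mode task completion a voice agent keeps, as a fraction."""
    return voice_score / text_score


# Voice-mode completion ranges (percent) for each audio condition, as quoted above.
ranges = {"clean": (31, 51), "noisy": (26, 38)}
for label, (low, high) in ranges.items():
    print(f"{label}: {retained(low):.0%} to {retained(high):.0%}")
# prints:
# clean: 36% to 60%
# noisy: 31% to 45%
```

Of the two conditions, the "third to a half" summary matches the noisy range most closely. Under clean audio the upper end is nearer three fifths.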

When an agent calls another agent

As both ends of a call automate, a new case appears: an outbound AI agent reaches an AI answering agent, and two digital systems negotiate over the slowest possible medium — synthesized speech decoded back into text — often without either side knowing the other is a machine. The inefficiency is real enough to have been patented: Capital One’s US 11,356,514 B2 (granted 2022) describes detecting that the other side is also an AI and negotiating a switch to a direct digital channel. The most-watched demonstration of the same idea is Gibberlink, built at a February 2025 hackathon hosted by ElevenLabs and a16z, where two agents recognize each other as AI and drop into ggwave, a data-over-sound protocol; its creators estimate an order-of-magnitude compute saving — though it is a demo, not a deployed standard.

Until any such out-of-band handoff is real and interoperable, agent-to-agent calls show predictable failure modes: redundant human-style pleasantries neither side needs, mutual “are you a bot?” probing, and dialogue loops when neither side advances the task. The practical move for an outbound agent is to detect a non-human counterpart quickly and pick a deliberate strategy — escalate, switch channel, or end the call — rather than burn minutes in a polite machine-to-machine stalemate.
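One way to make "detect quickly and pick a deliberate strategy" concrete is a small decision function over the counterpart's recent turns. Everything here is an illustrative assumption: the probe phrases, the repeat threshold, and the mapping from signals to strategies are placeholders. These signals are crude, since a human can also ask whether they are talking to a bot.

```python
from collections import Counter
from enum import Enum
from typing import List


class Strategy(Enum):
    CONTINUE = "continue"
    SWITCH_CHANNEL = "switch_channel"
    ESCALATE = "escalate"
    END_CALL = "end_call"


# Assumed probe phrases; a real system would need a far more robust signal.
BOT_PROBES = ("are you a bot", "are you a robot", "am i speaking to an ai")


def normalize(utterance: str) -> str:
    """Lowercase and collapse whitespace so repeated turns compare equal."""
    return " ".join(utterance.lower().split())


def pick_strategy(counterpart_turns: List[str], task_advanced: bool,
                  digital_channel_available: bool = False,
                  max_repeats: int = 2) -> Strategy:
    turns = [normalize(t) for t in counterpart_turns]
    # Signal 1: the other side is probing whether we are a machine.
    probing = any(p in t for t in turns for p in BOT_PROBES)
    # Signal 2: the same utterance keeps coming back (a dialogue loop).
    looping = max(Counter(turns).values(), default=0) > max_repeats
    if not (probing or looping):
        return Strategy.CONTINUE
    if digital_channel_available:
        return Strategy.SWITCH_CHANNEL
    if looping and not task_advanced:
        return Strategy.END_CALL  # stalemate: stop burning minutes
    return Strategy.ESCALATE      # hand the decision back to the owner
```

For example, three identical "How can I help?" turns with no task progress yield `Strategy.END_CALL`, while a single "Are you a robot?" on a call that is still progressing yields `Strategy.ESCALATE`.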

How people react to an AI caller

Disclosure is where the engineering meets human behavior, and the evidence is uncomfortable. In a field experiment on more than 6,200 outbound sales calls (Luo, Tong, Fang & Qu, Marketing Science, 2019), undisclosed chatbots were as effective as proficient human agents, but disclosing the bot’s identity before the conversation cut purchase rates by more than 79.7%, because customers who knew they were talking to a bot turned curt and judged it less knowledgeable and less empathetic. The honest behavior is also the costly one, so the temptation is to drop it; disclosure should instead be treated as a floor to design around, not a knob to remove.

The surrounding sentiment points the same way: a 2023 Gartner survey of 5,728 customers found 64% would prefer companies didn’t use AI in customer service, and Pew (2025) found 61% of Americans want more control over how AI is used in their lives. The reference case is Google Duplex, which in 2018 placed human-sounding calls — fillers and all — without disclosure, drew an ethics backlash, and shipped months later identifying itself up front. The norm that settled out: an AI placing calls on someone’s behalf should say so.

Outbound voice agent vs. inbound-only assistant

Outbound voice agent
  • Initiates calls to third parties
  • Sits on hold; navigates phone trees
  • Acts on the owner’s okay; reports back faithfully
  • Discloses it’s AI to the callee
Inbound-only assistant
  • Only answers calls placed to the business
  • Waits to be called
  • Takes messages; limited outbound action
  • No third-party calls on the owner’s behalf

Examples

  • Call the supplier, get the delivery ETA, and report back
  • Sit on hold with the airline or utility, then hand off
  • Call a customer to reschedule and keep the slot
  • Confirm or change a reservation you asked it to handle

Safety & disclosure

  • “On your okay” is the commit point: the call is placed only after explicit approval.
  • The callee hears it’s an AI calling on the owner’s behalf.
  • After acting, the agent reports back precisely what happened — including anything irreversible — rather than an optimistic summary (faithful report-back).
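A faithful report-back can be modeled as a plain list of what was actually done, with irreversible steps and failures surfaced rather than smoothed over. The sketch below is illustrative only: `Action` and `report_back` are hypothetical names, and the output format is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Action:
    description: str
    irreversible: bool
    succeeded: bool


def report_back(actions: List[Action]) -> str:
    """List every action, irreversible ones first, with failures stated plainly."""
    lines = []
    # sorted() is stable, so the original order is kept within each group.
    for a in sorted(actions, key=lambda a: not a.irreversible):
        status = "done" if a.succeeded else "FAILED"
        tag = " [irreversible]" if a.irreversible else ""
        lines.append(f"- {a.description}: {status}{tag}")
    return "\n".join(lines) if lines else "- nothing was done"
```

Building the report from recorded actions, instead of asking a model to summarize the call, is one way to keep an optimistic gloss out of what the owner sees.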

FAQ

Why is navigating an automated phone menu (IVR) so hard for an agent?

Because the agent has only partial observability — it hears one menu option at a time, with no map of the tree, and a wrong selection generally can’t be undone. On top of that, sending keypad tones (DTMF) and listening usually can’t happen at once, IVRs often ignore digits pressed before a prompt finishes, and some systems detect and reject automated callers outright.

How much worse are voice agents than text agents at completing tasks?

The τ-Voice benchmark (2026) found full-duplex voice agents complete only 31–51% of grounded tasks under clean audio (26–38% with noise and accents) versus 85% for the same model in text. With noise and accents, that is about a third to a half of text capability retained, with most failures attributed to agent behavior.

What happens when an AI agent calls another AI agent?

Two digital systems end up talking over slow synthesized speech, often without realizing the other is a machine — producing redundant pleasantries, “are you a bot?” probing, and occasional loops. A 2022 Capital One patent and the 2025 Gibberlink demo both propose detecting this and switching to a direct data channel, though Gibberlink is a demonstration, not a deployed standard.

Do people react differently when they know the caller is AI?

Yes, measurably. In a field experiment on 6,200+ outbound calls, disclosing the chatbot’s identity before the conversation cut purchase rates by more than 79.7%, even though undisclosed bots performed as well as proficient human agents. Disclosure is the ethical and legal floor, so the right move is to design around the effect, not remove it.


Last updated June 2026