All Talk, Some Action
Speech-to-speech models are now good enough that the voice is rarely the first thing that fails. Then you ask one to actually do something — check a calendar, call a supplier back, sit through an IVR — and the seams show. The unsolved frontier in voice agents is acting on the world without breaking the conversation.
- The frontier in voice agents is no longer speech quality; it is real-time tool use and action over a live call without breaking the conversation.
- Speech-native, full-duplex models won on latency by collapsing the ASR→LLM→TTS pipeline, but in doing so they collapsed the natural seam where a tool call lived, so the systems best at turn-taking are the ones still failing at grounded action.
- The production fixes that work (predict the tool call early, talk through the wait) are latency-masking workarounds, not fast grounded action, and they must never speculatively run the one action you can’t take back.
- Full-duplex voice agent: A voice system that listens and speaks at the same time, modeling overlap so interruption (barge-in) and backchanneling are native behaviors the model learns, rather than bolted-on voice-activity detection.
- The action layer: The part of a voice agent that does things in the world (tool calls, lookups, placing calls), as distinct from the voice layer that handles speech. The voice layer is largely solved; the action layer is what’s still unsolved.
For most of the last decade, a voice agent was three boxes in a row. Speech-to-text turned audio into a string. A language model read the string and wrote a reply. Text-to-speech read the reply aloud. The cascade was easy to reason about and easy to debug, but it is also why voice agents felt like a bad international call: every box added latency, and the latency stacked. Whisper-class ASR, a frontier LLM, and a neural vocoder in series routinely landed at 800ms to 2s of round-trip delay even when each component was individually fast. Human conversation runs on a 300–500ms response window. The cascade missed it by a mile, and you could hear the miss.
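To make the stacking concrete, here is a rough latency budget in Python. The per-stage figures and stage names are illustrative assumptions, not measurements from any system named in this piece; only the 300–500ms window comes from the text above.

```python
# Illustrative per-stage latencies in milliseconds (assumed, not measured).
CASCADE_STAGES_MS = {
    "asr_final_transcript": 300,
    "llm_first_token": 350,
    "tts_first_audio": 200,
    "network_and_buffering": 150,
}

HUMAN_WINDOW_MS = (300, 500)  # response window cited in the text


def round_trip_ms(stages: dict) -> int:
    """In a strict cascade each stage waits for the previous one, so delays add."""
    return sum(stages.values())


def fits_conversation(total_ms: int, window_ms: tuple = HUMAN_WINDOW_MS) -> bool:
    """True if the reply starts no later than the upper edge of the window."""
    return total_ms <= window_ms[1]


total = round_trip_ms(CASCADE_STAGES_MS)
print(total, fits_conversation(total))
```

With these assumed figures no single stage is slow, yet the sum lands at a full second, twice the upper edge of the window. Shaving one stage does not help much; the structure is the problem.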
The last two years closed that gap, and they closed it by moving more of the loop into speech-native, streaming models.
Speech-native models solved the latency problem
Moshi (Kyutai, 2024) is the cleanest example of the shift. Instead of ASR→LLM→TTS, Moshi is a single speech-text foundation model that generates speech directly as tokens of a neural audio codec (Mimi), modeling the user’s stream and its own stream in parallel so it can listen and speak at the same time. It reports a theoretical latency of 160ms and a practical figure around 200ms on an L4 GPU. Hertz-dev (Standard Intelligence, 2024), an 8.5B-parameter full-duplex base model, claims 80ms theoretical and 120ms real-world latency on a single RTX 4090. OpenAI’s gpt-realtime collapses the same cascade into a single speech-native realtime interface, landing roughly in the 500ms–1.5s range in production (vendor-reported).
These are full-duplex systems in the real sense. They don’t wait for a turn to end. They model overlap, so barge-in (you interrupting the agent) and backchanneling (the agent saying “mhm” while you talk) become native behaviors the model can learn instead of being bolted on with a voice-activity detector. The field even has a yardstick now: Full-Duplex-Bench (Lin et al., 2025) scores models on pause handling, backchanneling, turn-taking, and interruption management. The conversational mechanics, measured automatically.
The hard problems in voice used to be transcription accuracy and natural-sounding TTS. That framing is a couple of years stale. Those problems are not solved (real phone audio still breaks transcription), but they are no longer the interesting bottleneck in the best systems. Here’s where those systems break.
The moment the agent has to do something, the architecture turns against it
A speech-to-speech model is a closed loop: audio in, audio out, no stop. That loop is exactly what makes tool use hard. To check a reservation, look up an account, or place an order, the agent has to stop generating audio, emit a structured call, wait on a network round-trip, and resume — inside a medium where a one-second silence reads as the agent freezing.
The numbers on this are recent and pointed. Full-Duplex-Bench-v3 (Lin, Chen, Chen, Lee, 2026) is the first benchmark to test multi-step tool use inside full-duplex voice agents, using real human audio annotated for disfluency (the “um”, the self-correction, the restart) and scenarios that require chained API calls. GPT-Realtime led on accuracy and interruption avoidance. Gemini Live was fastest but made the most turn-taking errors. Two failure modes survived across every system: user self-corrections, and multi-step reasoning under the hard scenarios.
The same systems that answer a chitchat turn in a few hundred milliseconds get slow and brittle the moment a task touches the outside world. And these are clean benchmark conditions. Not a held line, not an IVR tree, not a human who changed their mind halfway through the sentence.
There’s a structural reason this is hard rather than just slow. A function call is a discrete, text-shaped, blocking event. A full-duplex speech stream is continuous and non-blocking by design. Bolting one onto the other forces a choice: freeze the audio while the tool runs and sound broken, or keep talking without knowing the answer yet and risk lying. Cascaded pipelines had an obvious insertion point — text was already the native currency, so a function call was just another text step. The end-to-end speech models that won on latency lost ground on grounded action, because in collapsing the pipeline they collapsed the natural seam where a tool call used to live. The benchmark evidence points the same way: the systems tuned for fluid turn-taking are the ones still failing on self-correction, chained tool calls, and scenarios where the next utterance depends on external state.
The fixes that work are all forms of hiding the wait
External action still doesn’t reliably fit the conversational clock. What’s emerging instead is a set of techniques for making the wait inaudible — masking latency the way a good receptionist does when they say “let me pull that up for you” and start typing.
Predict the tool call before the user finishes talking. Stream RAG (Arora et al., 2025) post-trains a speech-in/speech-out model to decide when to fire a retrieval call mid-utterance — issuing the query in parallel with the user’s ongoing speech, before they’ve stopped — then to speak a summary that fuses the audio question with the retrieved text. The headline isn’t an accuracy number. It’s that the model learned tool timing as a first-class skill instead of an afterthought.
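As a rough illustration of the timing idea only, the sketch below fires a retrieval query from partial transcripts before the utterance ends. It is not Stream RAG's method: the paper post-trains the model to make this decision, whereas this uses a hand-written rule as a stand-in. `EarlyTrigger`, its keyword set, and the word-count threshold are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class EarlyTrigger:
    """Fires a retrieval query once, as soon as a partial transcript looks specific.

    Hypothetical stand-in for a learned trigger. The rule is: enough words,
    and at least one keyword from a lookup-worthy set.
    """
    keywords: FrozenSet[str]
    min_words: int = 4
    fired_query: Optional[str] = None

    def observe(self, partial_transcript: str) -> Optional[str]:
        """Return a query the first time the rule is met, otherwise None."""
        if self.fired_query is not None:
            return None  # already issued; do not re-fire on later partials
        words = partial_transcript.lower().split()
        if len(words) >= self.min_words and self.keywords & set(words):
            self.fired_query = partial_transcript
            return partial_transcript
        return None


trigger = EarlyTrigger(keywords=frozenset({"reservation", "order", "balance"}))
partials = [
    "can you",
    "can you check my",
    "can you check my reservation",
    "can you check my reservation for friday",
]
# Collect every query issued while the user is still speaking.
issued = [q for p in partials if (q := trigger.observe(p)) is not None]
print(issued)
```

Here the query goes out on the third partial, one step before the user finishes. A fire-once rule like this is also exactly what a user self-correction defeats: if the speaker changes course after the trigger, the early query is stale, which is one reason a learned policy is more attractive than a fixed heuristic.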
Cover the gap with speech you were going to say anyway. The production pattern splits the loop the instant the user stops: one track streams a conversational acknowledgement — “Checking that for you...” — to TTS, while a second track silently predicts and executes the tool. The filler buys a second or two, and the result is ready by the time the sentence ends. The catch is real: you only speculatively execute read-only, reversible tools. You never eagerly run a transfer, a purchase, a deletion, or — for an outbound agent — actually placing the call before you’re sure. Speculation that touches the world is a bug, not an optimization. There’s even an early privacy literature forming around exactly this hazard: speculative calls that leak a user’s inferred intent to a third party before the agent commits to that branch (see “Ghost Tool Calls,” Mohammadi et al., 2026).
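A minimal sketch of the two-track pattern with the reversibility guard, using Python's `asyncio`. Everything here is an assumption for illustration: `Tool`, `speak`, `handle_turn`, and the two example tools are hypothetical, `speak` only simulates TTS with a sleep, and the tool is taken as already predicted rather than inferred from the user's speech.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[], Awaitable[str]]
    reversible: bool  # True only for read-only lookups with no side effects


async def speak(text: str, seconds: float, log: List[str]) -> None:
    """Stand-in for streaming text to TTS; sleeps for the audio's duration."""
    log.append(f"say: {text}")
    await asyncio.sleep(seconds)


async def handle_turn(tool: Tool, log: List[str]) -> str:
    """Talk through the wait, but only speculate on tools that can be taken back."""
    if not tool.reversible:
        # Irreversible action: never start it speculatively. Ask first.
        await speak(f"Just to confirm, should I go ahead with {tool.name}?", 0.01, log)
        return "awaiting_confirmation"
    # Reversible lookup: run the filler and the tool concurrently.
    _, result = await asyncio.gather(
        speak("Checking that for you...", 0.02, log),
        tool.run(),
    )
    log.append(f"say: {result}")
    return result


async def lookup_reservation() -> str:
    await asyncio.sleep(0.01)  # simulated network round-trip
    return "You have a table for two at 7pm."


async def place_order() -> str:
    # If this ever runs without confirmation, the guard above has failed.
    raise AssertionError("must not run without confirmation")


log: List[str] = []
print(asyncio.run(handle_turn(Tool("lookup_reservation", lookup_reservation, True), log)))
print(asyncio.run(handle_turn(Tool("place_order", place_order, False), log)))
```

The design choice worth noticing is that reversibility is a property declared on the tool, not something decided at call time. The irreversible branch never even creates the coroutine for `tool.run`, so there is nothing in flight to cancel if the user changes their mind.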
These two ideas — predict early and talk through the wait — are the dominant production patterns because they attack the part users hear: dead air. They’re good ideas. They’re also workarounds. We don’t have fast grounded action; we have convincing ways to stall.
What this means for an agent that acts on your behalf over the phone
An outbound AI phone agent is the hardest version of this problem, because almost everything it does is a tool call against an API built out of another human, or an IVR pretending to be one.
It has to sit on hold and stay quiet without hanging up. It has to navigate a phone tree — multi-step planning where each step is a real action with a real, sometimes irreversible consequence; press the wrong option and you’re three menus deep in billing. It has to recover when the person on the other end interrupts, corrects themselves, or trails off — the exact disfluencies the newest benchmarks show the best models still fumble. And it has to do all of that while sounding like someone who knows what they’re doing, which means a multi-second tool gap isn’t an option. You can’t say “let me pull that up” to a restaurant host five times. The masking has to be conversational, and it has to be honest about what’s reversible.
The voices are good. The frontier is an agent that can plan and act over a live call: decide when to commit versus when to wait, mask the unavoidable latency without lying, and never speculatively do the one thing it can’t take back. That’s the part the literature is only now learning to benchmark, and it’s the whole job for an agent that places calls on your behalf. For a phone agent, the product is that policy — when to wait, when to act, and what never to do speculatively.
- Défossez et al. — Moshi: a speech-text foundation model for real-time dialogue (Kyutai, 2024). arXiv:2410.00037
- Standard Intelligence — Hertz-dev (2024).
- OpenAI — Introducing gpt-realtime (2025).
- Lin et al. — Full-Duplex-Bench (2025). arXiv:2503.04721
- Lin, Chen, Chen, Lee — Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency (2026). arXiv:2604.04847
- Arora et al. — Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage (2025). arXiv:2510.02044
- Mohammadi, Klein, Arora, Bindschaedler — Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools (2026). arXiv:2606.02483