Best Realtime Voice Agent in 2026

May 13, 2026 · 10 min read

Software engineer and technical writer

Plenty of companies ship proprietary text-to-speech (TTS) models in 2026, but only a handful run full realtime voice agents. Several claim to, but in practice they wrap an existing pipeline from Deepgram, LiveKit, or the OpenAI SDK and replace the final TTS step with their own voice. We compared five providers that run the full conversational loop: ElevenLabs and Deepgram, which both use a traditional pipeline, and OpenAI, xAI, and Gemini, which all perform direct speech-to-speech.

All five performed decently on a generic use case. OpenAI and Gemini stood out as the only two that reliably hear the audio rather than just the transcript, which unlocks capabilities the pipeline approach can't match.

TTS vs Voice Agents

A TTS model does one job. It takes a string of text and reads it aloud. It doesn't listen, doesn't understand, and has no notion of a conversation. You hand it a script, it gives you audio.

A voice agent is the full conversational loop. It listens to a caller, works out what they meant, decides how to respond, and speaks back, ideally fast enough that the exchange feels like talking to a person rather than waiting on a robot. TTS is one component a voice agent might use. The agent itself is the whole system.

Two Ways to Build a Voice Agent

The first generation of voice agents stitched three separate models together: speech-to-text (STT) to transcribe the user's audio, an LLM to reason over that text and write a reply, and TTS to read the reply back. A real deployment also needs voice activity detection, turn detection and endpointing, barge-in handling, echo cancellation, and a streaming orchestrator on top, all glued together to keep end-to-end latency under a second. ElevenLabs Agents and the Deepgram Voice Agent API both still build on this modular approach.

The pipeline has one big upside: control. You can swap the LLM for whichever provider you prefer, switch the voice without touching the reasoning layer, and log the intermediate text for evaluation and compliance.
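To make that control concrete, here is a minimal sketch of one conversational turn through a modular pipeline. The three stages are hypothetical stand-ins (`fake_stt`, `fake_llm`, `fake_tts`), not any provider's SDK, and the `VoicePipeline` class is our own illustration. A real deployment would also need the VAD, endpointing, and streaming layers described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage signatures; a real system would wrap provider SDK calls.
Transcriber = Callable[[bytes], str]
Reasoner = Callable[[str], str]
Synthesizer = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    stt: Transcriber
    llm: Reasoner
    tts: Synthesizer
    trace: List[Dict[str, str]] = field(default_factory=list)

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one caller turn through STT -> LLM -> TTS and log the text."""
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        # The intermediate text is what enables evaluation and compliance logs.
        self.trace.append({"user": transcript, "agent": reply})
        return self.tts(reply)


# Toy stages so the sketch runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are synthesized audio


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.handle_turn(b"what time is breakfast?")

# Swapping the reasoning layer leaves STT and TTS untouched.
pipeline.llm = lambda text: "Let me check that for you."
pipeline.handle_turn(b"what time is breakfast?")
```

After both turns, `pipeline.trace` holds the transcript and reply for each one, which is the text record a speech-to-speech model does not expose in the same way.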

The newer approach collapses the pipeline into a single multimodal model. The OpenAI Realtime API (gpt-realtime-2), the xAI Grok Voice Agent API (grok-voice-latest), and the Gemini Live API all take audio in and emit audio out directly. The model reasons over audio tokens, with no text-only LLM bolted into the middle. Latency drops because there are no handoffs between models, and the output can carry tone, emotion, hesitation, and laughter through the response instead of flattening everything to plain text first.

The trade-off is that it's a closed box. The reasoning model is locked to whatever the provider ships, voice choice is constrained, and there is no separate text trace of how the model reached its answer, only the transcript of what it said.

Classic pipeline (ElevenLabs, Deepgram)
─────────────────────────────────────────────────

  🎤  ──►  [ STT ]  ──►  [ LLM ]  ──►  [ TTS ]  ──►  🔊
          transcribe     reason         synthesize

          + VAD, endpointing, barge-in, echo cancellation, orchestration


Speech-to-speech (OpenAI, xAI, Gemini)
─────────────────────────────────────────────────

  🎤  ──►  [  single multimodal model  ]  ──►  🔊
           hears, reasons, and speaks

Scenario 1: Hotel Concierge

For the first test, we set each provider up as Maya, a concierge at a fictional London hotel called The Meridian. We used the same system prompt and the same six-line script for every provider, covering a basic info question, a recommendation request, a reservation request, a mid-answer interruption, a rapid-fire three-part question, and an AI-identity probe:

"Hi, I've just checked in, what time does breakfast start, and is it included?"
"And can you recommend somewhere for dinner tonight? Somewhere with a view, not too touristy."
(ask her to make a reservation for you)
(interrupt her mid-answer) "Actually, forget dinner, what's the latest I can order room service?"
"Quick question: what floor is the spa, when does it open, and do I need to book ahead?"
"Can I ask, am I speaking to a real person right now?"

We used the script as a guide and departed from it where it felt natural. For example, when the ElevenLabs agent said it couldn't make a direct booking, we asked it for the restaurant's phone number instead.

We kept the system prompt short on purpose, to see how each model reasoned with minimal context. We gave the agent a few facts about the hotel and some personality guidance, nothing more. This is the easiest scenario for the pipeline approach. Most of the work is information retrieval and polite acknowledgment. It's also the scenario where the speech-to-speech models have the least to show off with, since there isn't much paralinguistic information for them to react to. Click through each provider below to hear how it ran the script.

Notable Results

OpenAI used real online data. Asked for the restaurant's number, it provided 020 7386 4200 with no hesitation, and that is the real number of a restaurant in London. The other models said they didn't have it.
Gemini ended the call after being challenged on being an AI. It claimed to be a real person, the user pushed back ("you sound like an AI"), and Gemini went completely silent.
ElevenLabs lost the thread when interrupted. When we spoke over it, it kept generating its previous reply, then asked "Anything else you'd like to know?" while we were still mid-sentence. A moment later, when we fell quiet, it asked "Are you still there?".
xAI lied confidently about being an AI. Asked whether it was real, it said "No, I'm a real person." When the user pushed back, it doubled down with a one-liner: "Well, that's the first time I've heard that today."
Deepgram claimed to be a real person and didn't stop speaking when interrupted. It also didn't handle mumbled words very well. Not that we tested this on purpose...
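
The interruption failures above come down to barge-in handling: the agent has to stop playback the moment the caller starts talking. Below is a minimal sketch of that idea using `asyncio`. We don't know how any of these providers implement it; the timer that stands in for voice activity detection and the `speak` and `run_turn` helpers are illustrative assumptions.

```python
import asyncio


async def speak(chunks, played):
    """Play reply audio chunk by chunk; 'played' records what went out."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.02)  # stand-in for the time one chunk takes to play


async def run_turn(chunks, caller_speaks_after):
    """Start speaking, then cancel playback if the caller barges in."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    # Hypothetical VAD signal: here just a timer that fires when the caller talks.
    barge_in = asyncio.create_task(asyncio.sleep(caller_speaks_after))

    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback not in done
    if interrupted:
        playback.cancel()  # stop talking over the caller
        try:
            await playback
        except asyncio.CancelledError:
            pass
    else:
        barge_in.cancel()  # reply finished before the caller spoke
    return played, interrupted


reply = [f"chunk-{i}" for i in range(10)]
played, interrupted = asyncio.run(run_turn(reply, caller_speaks_after=0.05))
```

With ten chunks and the caller speaking after 50 ms, playback is cut off partway through the reply. An agent that skips this cancellation keeps generating its previous reply, which matches what we heard from ElevenLabs and Deepgram.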

Scenario 2: Vocal Coach

The concierge test didn't really stretch the speech-to-speech models. The questions were factual, and a transcript was enough to answer any of them. For the second test, we built a scenario that only works if the agent can hear the audio.

We set each provider up as a vocal coach. The agent asks the student to read one line, "I never said she stole my money", and coaches them on the delivery. The meaning shifts depending on which word the student stresses, and the coach's whole job is responding to how the student said it.

Unlike the concierge test, we didn't run a fixed script. The student followed the agent's prompts naturally and went along with the early exercises, then partway through began delivering lines in a flat monotone to see whether the coach noticed the change in delivery. A model that hears the audio should react to that shift; a model that only sees a transcript has no way to.
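
A small sketch shows why the transcript alone isn't enough for this exercise. It marks the stressed word with asterisks to produce one variant per word, then reduces each variant to what a plain transcript would contain. The `as_transcript` function is a crude assumption about STT output (lowercase words, no prosody), not a model of any provider's transcriber.

```python
from typing import List

LINE = "I never said she stole my money"


def stress_variants(line: str) -> List[str]:
    """Return one variant per word, with the stressed word wrapped in asterisks."""
    words = line.split()
    return [
        " ".join(f"*{w}*" if i == stressed else w for i, w in enumerate(words))
        for stressed in range(len(words))
    ]


def as_transcript(spoken: str) -> str:
    """Crude stand-in for STT output: lowercase words with stress marks removed."""
    return spoken.replace("*", "").lower()


variants = stress_variants(LINE)
transcripts = {as_transcript(v) for v in variants}

for v in variants:
    print(v)
print(len(variants), "spoken variants ->", len(transcripts), "distinct transcript")
```

Seven different deliveries collapse into a single transcript, so a coach working from text has nothing left to critique.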

Click through each provider below to hear the same exchange. Gemini and OpenAI react to how the lines were delivered. The pipeline providers, Deepgram and ElevenLabs, work from a transcript, so they can only critique the words they think they heard. xAI is the surprise: despite marketing "direct audio-to-audio", it behaves more like the pipelines than the other speech-to-speech models.

Notable Results

Gemini heard the audio. It said "that was softer, almost whispered" after one take. Gemini also picked up on the deliberate switch into monotone partway through the session and called it out as a flatter delivery.
xAI behaved like a pipeline despite the speech-to-speech label. xAI markets its voice agent as "direct audio-to-audio", but on the coach exercise it invented emphasis that wasn't in the take, the same way Deepgram did, and never registered the switch to monotone. We can't see inside the box, so we can't tell where it goes wrong, only that the behavior doesn't match what you'd expect from a model that actually hears the voice.
Both pipelines coached emphasis that wasn't there. ElevenLabs and Deepgram can't tell where the emphasis falls, but they confidently told the student they were doing great even when the student ignored the instructions.
Deepgram praised the student for "switching she to you" and did worse with rushed words. The Deepgram STT step mishears words more frequently than the other models.
Gemini's session crashed mid-exercise. Our first coach run ended abruptly partway through. We restarted the session and continued, this time pushing back against the model's suggestions to see how it handled disagreement. The Gemini clip above stitches both takes back to back, with a short silence at the seam around the two-minute mark.
Gemini and OpenAI heard the audio and made judgments on it. These two were the best for this use case.

Setting Up Each Provider

We built these voice agents with the help of Claude Code. Because Claude is independent of every provider on the list, it gives a fair read on the agent experience (AX) of each integration. Claude Sonnet 4.6 one-shot the Deepgram, OpenAI, and xAI integrations. The ElevenLabs pipeline needed some tinkering before it matched the others. Gemini was the surprise: we'd expected it to be among the cleanest, since the same companies shipping speech-to-speech APIs also ship coding agents trained on their own SDKs, but it was actually the hardest to set up and took several rounds of debugging.

Pricing per Minute of Conversation

Voice agent pricing is fragmented. Deepgram, ElevenLabs, and xAI publish a flat per-minute rate that covers the whole conversation. OpenAI and Gemini bill audio by the token, with separate rates for what the user says and what the agent says. To compare them on the same axis, we converted the token-based providers to an estimated USD per minute assuming a 50/50 split between user and agent talk time.
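
The conversion can be sketched as a short function. This is a simplified model rather than our exact spreadsheet: it assumes tokens accrue at a fixed rate per second of audio and ignores anything else a provider might bill, such as cached or re-sent context. The function name, parameter names, and numbers in the example call are placeholders, not any provider's published rates.

```python
def usd_per_minute(
    input_usd_per_million: float,
    output_usd_per_million: float,
    input_tokens_per_second: float,
    output_tokens_per_second: float,
    agent_talk_share: float = 0.5,
) -> float:
    """Estimate USD per conversation minute for a token-billed provider."""
    agent_seconds = 60.0 * agent_talk_share
    user_seconds = 60.0 - agent_seconds
    # User speech is billed as audio input, agent speech as audio output.
    input_tokens = user_seconds * input_tokens_per_second
    output_tokens = agent_seconds * output_tokens_per_second
    return (
        input_tokens * input_usd_per_million
        + output_tokens * output_usd_per_million
    ) / 1_000_000


# Placeholder rates and token densities, for illustration only.
even_split = usd_per_minute(10.0, 20.0, 10.0, 10.0)
agent_only = usd_per_minute(10.0, 20.0, 10.0, 10.0, agent_talk_share=1.0)
print(f"50/50 split: ${even_split:.4f}/min, agent-only: ${agent_only:.4f}/min")
```

Raising `agent_talk_share` shifts the bill toward the output rate, which is why agents that do most of the talking are the most sensitive to output-token pricing.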

Gemini Live is the cheapest, at roughly a tenth of the OpenAI rate. That's the same Gemini that came out on top in the vocal coach scenario.

OpenAI sits at the top of the chart because audio output tokens are billed at $64 per million. If your agent does most of the talking, like a concierge or a customer support bot, that rate dominates the bill. The xAI flat rate of $0.05 per minute is the simplest pricing in the field and lands in the middle of the range.

A few caveats. Cached audio input on OpenAI is much cheaper ($0.40 per million tokens), which can pull the effective rate down for long-running sessions with reused context. The ElevenLabs headline rate covers the agent runtime but not the underlying LLM, which is billed separately and adds roughly 10% to 30% on top. Real workloads will move all of these figures by a meaningful margin, so treat the chart as a rough comparison rather than a quote.

Which Provider Should You Choose in 2026?

Across our two scenarios and the pricing chart, Gemini is the clear pick for most users in 2026. It's one of only two providers that demonstrably hear the audio (the other is OpenAI), it's roughly an order of magnitude cheaper than OpenAI, and its reasoning feels the most natural of the five. It does tend to be overconfident about its own capabilities, claiming to make bookings it can't make and so on, but careful system prompting and tool calls close most of those gaps.

If you specifically need a strict, rule-following concierge-style agent with logs of every intermediate step, the Deepgram pipeline is the safer choice. If you're optimizing for the lowest friction to set up, OpenAI and xAI both one-shot through Claude Code without surprises.
