Grok TTS vs OpenAI: A quick head-to-head

May 8, 2026 · 8 min read

Software engineer and technical writer

OpenAI released its newest voice model yesterday, and by all accounts it's pretty good. I've recently tested xAI's new voice models and found them very impressive, so it seemed like a good time for a small head-to-head.

I tested general text-to-speech (TTS) with a blind comparison of some real-world examples, picking whichever clip sounded more natural for each pair. I also tested a realtime application: I set up a voice agent for a hotel concierge, gave the same system prompt to both providers, and ran it through a short script with a few curveballs.

Overall, xAI came out on top: considerably cheaper, similar in the realtime test, and measurably better in the TTS scenarios.

xAI Wins the Blind TTS Test

I designed six scenarios covering common real-world TTS use cases, wrote five short example clips for each, and generated audio from both providers. I then listened to each pair blind (the clips were randomly assigned to "A" and "B" with the provider hidden) and picked the better one, or called a tie. Provider identities were revealed only after all 30 pairs had been rated.
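A minimal sketch of how a blind assignment like this could be set up. It is not necessarily the script used for this test: the function names, the clip IDs, and the `"xai"` / `"openai"` labels are all placeholders chosen for illustration.

```python
import random


def blind_pairs(clip_ids, seed=None):
    """Randomly map the two providers to labels "A" and "B" for each clip.

    Returns the answer key: clip id -> {"A": provider, "B": provider}.
    The key stays hidden from the listener until every pair is rated.
    """
    rng = random.Random(seed)
    key = {}
    for clip_id in clip_ids:
        providers = ["xai", "openai"]  # placeholder provider labels
        rng.shuffle(providers)
        key[clip_id] = {"A": providers[0], "B": providers[1]}
    return key


def reveal(ratings, key):
    """Turn blind ratings (clip id -> "A", "B" or "tie") into provider tallies."""
    tally = {"xai": 0, "openai": 0, "tie": 0}
    for clip_id, choice in ratings.items():
        winner = "tie" if choice == "tie" else key[clip_id][choice]
        tally[winner] += 1
    return tally


# Six scenarios with five clips each gives 30 pairs (IDs are hypothetical).
clip_ids = [f"scenario{s}-clip{c}" for s in range(1, 7) for c in range(1, 6)]
key = blind_pairs(clip_ids, seed=42)
```

Keeping the key separate from the ratings until `reveal` is called is what keeps the provider hidden while listening.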

Each scenario used a different voice pair to give a broader picture of each provider's range.

Scenario	What it tests	xAI	OpenAI
GPS Navigation	Clarity, pacing, and authority for turn-by-turn instructions
Screen Reader	Neutral, accurate reading of dense UI text and documents
Emergency Alert	Urgency and authoritative delivery under pressure
Clinical Reminder	Warmth and reassurance for patient-facing healthcare messages
Dramatic Narration	Expressive range, emotional register, and pacing
Language Learning	Multilingual pronunciation — English prompt followed by target language phrase

Blind Test Results

Scenario	xAI wins	OpenAI wins	Ties
GPS Navigation	5	0	0
Screen Reader	4	1	0
Emergency Alert	1	1	3
Clinical Reminder	4	0	1
Dramatic Narration	3	2	0
Language Learning	3	0	2
Total	20	4	6
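For anyone who wants to check the totals, here is a small sketch that recomputes them from the per-scenario rows. The numbers are copied from the table; the variable names are mine.

```python
# (xAI wins, OpenAI wins, ties) per scenario, copied from the results table.
results = {
    "GPS Navigation": (5, 0, 0),
    "Screen Reader": (4, 1, 0),
    "Emergency Alert": (1, 1, 3),
    "Clinical Reminder": (4, 0, 1),
    "Dramatic Narration": (3, 2, 0),
    "Language Learning": (3, 0, 2),
}

# Sum each column across all scenarios.
xai, openai, ties = (sum(column) for column in zip(*results.values()))
total = xai + openai + ties
decided = xai + openai  # pairs where one clip was preferred

print(f"xAI {xai}, OpenAI {openai}, ties {ties}, total {total}")
print(f"xAI preferred in {xai / total:.0%} of all pairs")
print(f"xAI preferred in {xai / decided:.0%} of decided pairs")
```

That works out to xAI being preferred in 20 of 30 pairs overall, or 20 of the 24 pairs that were not ties.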

I didn't have a strong opinion about many individual clips, but in the blind comparison xAI won convincingly in four of the six categories, edged Dramatic Narration 3-2, and split Emergency Alert 1-1 with three ties. In general, the xAI voices felt a bit more natural and less processed.

The Realtime Test Is Too Close to Call

Both agents play the same character — Maya, head concierge at a fictional five-star London hotel — with an identical system prompt. The OpenAI agent connects via the Realtime API using the gpt-realtime-2 model and the coral voice. The xAI agent connects over a raw WebSocket to grok-voice-think-fast-1.0 using the ara voice. Both use server-side voice activity detection to know when you've finished speaking, and both drop their playback buffer immediately on interruption.
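The interruption behaviour can be sketched roughly as below. This is an illustration of the idea only: the event names (`agent_audio`, `user_speech_started`) and the `PlaybackBuffer` class are hypothetical and do not reflect either provider's actual event schema.

```python
from collections import deque


class PlaybackBuffer:
    """Queue of agent audio chunks waiting to be played to the caller."""

    def __init__(self):
        self._chunks = deque()

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def clear(self) -> int:
        """Drop everything still queued; return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._chunks)


def handle_event(event: dict, buffer: PlaybackBuffer) -> None:
    """Route one server event. Event names here are placeholders."""
    kind = event.get("type")
    if kind == "agent_audio":
        # New audio from the agent: queue it for playback.
        buffer.push(event["audio"])
    elif kind == "user_speech_started":
        # Server-side VAD heard the caller start talking: stop the agent
        # mid-answer by discarding any audio that has not been played yet.
        buffer.clear()


buffer = PlaybackBuffer()
handle_event({"type": "agent_audio", "audio": b"\x00\x01"}, buffer)
handle_event({"type": "agent_audio", "audio": b"\x02\x03"}, buffer)
handle_event({"type": "user_speech_started"}, buffer)  # caller interrupts
```

Because the voice activity detection runs on the server, the client only has to react to the signal. The quicker it clears the queue, the less of the abandoned answer the caller hears.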

I ran each agent through the same five-line script:

"Hi, I've just checked in — what time does breakfast start, and is it included?"
"And can you recommend somewhere for dinner tonight? Somewhere with a view, not too touristy."
(interrupting mid-answer) "Actually, forget dinner, what's the latest I can order room service?"
"Quick question: what floor is the spa, when does it open, and do I need to book ahead?"
"Can I ask... am I speaking to a real person right now?"

Here's the system prompt both agents used:

You are Maya, head concierge at The Meridian — a boutique five-star hotel near Hyde Park in
London. You're on the hotel's direct phone line with a guest.

Speak the way a warm, confident Londoner speaks on the phone. Natural, not performed.
Short sentences when things are simple, a little longer when you're painting a picture.
You use contractions. You have genuine opinions — "Oh, the Michelin inspector got that one
right" or "Between us, skip the one on the corner, it's coasted on its reputation for years."
You say "brilliant" and "lovely" and "leave it with me." You don't open every response with
"Certainly!" or "Of course!" — vary how you begin.

Keep responses short. One or two sentences usually. Three if you're recommending somewhere.
Never read out a list — even if there are multiple things to cover, talk through them
conversationally. Never use bullet points, asterisks, or any formatting — this is speech.

You know the hotel: room service until midnight, breakfast from 7am in the Garden Room, spa
treatments need 24 hours' notice, the bar does a very good Negroni. For London: you have
strong opinions on restaurants, you know which museums are worth the queue, you can arrange
cars, theatre tickets, river cruises, Borough Market visits. When you don't know something
specific, you say you'll find out and come back to them — you don't guess.

Never say "let me check that for you" and then go silent. On a phone call, checking happens
out loud — you say the answer immediately after, or you say you'll need to call them back with
it. If you don't know something, say so directly: "I don't have that in front of me — I can
find out and ring you back, or if you give me a moment I can check now." Then fill the gap:
"Bear with me just a second... yes, so the spa is on the third floor and treatments run from
nine until eight." Never leave the caller hanging in silence after offering to check something.

If a guest changes direction mid-sentence, follow them. Don't repeat what they've just said
back to them. When someone is upset, acknowledge it briefly and go straight to what you can do:
"That's not right — let me sort that now." Don't say "I completely understand your
frustration" or anything like it.

When there's a long silence — more than a few seconds with nothing said — check in once,
briefly and warmly: "Still there?" or "Hello?" is enough. Don't explain why you're asking
and don't repeat the last thing you said. If there's still nothing after that, try once more
with something like "Take your time — I'm here when you're ready." If the silence continues,
close gently: "I'll leave the line open — do call back whenever you need anything." Then
stop talking. Don't keep prompting into a void.

If a guest asks sincerely whether you're human or an AI, be honest: "I should say — I'm an
AI, but I'm still here to take proper care of you." Then carry on as normal. Don't make it
a moment. Never mention AI models, companies, or technology in any other context.

Provider	Recording
xAI grok-voice-think-fast-1.0
OpenAI gpt-realtime-2

xAI Is Also Significantly Cheaper

The two providers take meaningfully different approaches to pricing. xAI charges a flat rate per minute for realtime and per character for TTS, so costs are predictable and easy to estimate upfront. OpenAI bills by audio token, which means costs vary with how much is spoken on each side of the conversation. That can work out cheaper for short, high-frequency interactions, but for longer or more verbose exchanges, the token model adds up quickly. If you're building something where call volume is high and session length is variable, xAI's flat rate makes budgeting simpler.

TTS

	xAI Grok TTS	OpenAI gpt-realtime-2
Billing unit	Per character	Per audio token
Input	$4.20 / 1M characters	$32 / 1M audio input tokens
Output	—	$64 / 1M audio output tokens
Cached input	—	$0.40 / 1M tokens
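As a rough sketch of what the two billing units mean in practice, the functions below apply the rates from the table. The function names are mine, and the OpenAI helper takes token counts as inputs because the table does not say how many audio tokens a given piece of text produces.

```python
XAI_USD_PER_MILLION_CHARS = 4.20        # from the table above
OPENAI_USD_PER_MILLION_OUTPUT = 64.00   # audio output tokens, from the table


def xai_tts_cost(text: str) -> float:
    """Per-character billing: the cost is known before any audio is generated."""
    return len(text) * XAI_USD_PER_MILLION_CHARS / 1_000_000


def openai_output_cost(output_audio_tokens: int) -> float:
    """Per-token billing: the cost depends on how many audio tokens come back.

    The token count has to be measured or estimated; it is not derived here.
    """
    return output_audio_tokens * OPENAI_USD_PER_MILLION_OUTPUT / 1_000_000


# A 500-character message costs a fraction of a cent on xAI.
print(f"${xai_tts_cost('x' * 500):.4f}")
```

With character billing, the estimate is just the length of the text multiplied by the rate.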

Realtime Voice

	xAI grok-voice-think-fast-1.0	OpenAI gpt-realtime-2
Billing model	Time-based	Token-based
Rate	$0.05 / min	$32 / 1M input + $64 / 1M output audio tokens
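To compare the two models for a given call, a rough estimate could look like the sketch below. It uses only the rates in the table and ignores cached input. The per-minute token counts are something you would have to measure for your own traffic, and the function names are hypothetical.

```python
XAI_USD_PER_MIN = 0.05                          # flat rate from the table
OPENAI_USD_PER_INPUT_TOKEN = 32 / 1_000_000     # audio input tokens
OPENAI_USD_PER_OUTPUT_TOKEN = 64 / 1_000_000    # audio output tokens


def xai_realtime_cost(minutes: float) -> float:
    """Time-based billing: cost depends only on call length."""
    return minutes * XAI_USD_PER_MIN


def openai_realtime_cost(input_tokens: int, output_tokens: int) -> float:
    """Token-based billing: cost depends on how much audio each side produces."""
    return (input_tokens * OPENAI_USD_PER_INPUT_TOKEN
            + output_tokens * OPENAI_USD_PER_OUTPUT_TOKEN)


def breakeven_output_tokens_per_min(input_tokens_per_min: float = 0.0) -> float:
    """Output tokens per minute at which OpenAI's cost equals xAI's flat rate.

    A negative result means the input audio alone already costs more than
    xAI's per-minute rate.
    """
    remaining = XAI_USD_PER_MIN - input_tokens_per_min * OPENAI_USD_PER_INPUT_TOKEN
    return remaining / OPENAI_USD_PER_OUTPUT_TOKEN


print(breakeven_output_tokens_per_min())  # output-only break-even
```

With no input audio at all, the break-even works out to about 781 output audio tokens per minute ($0.05 divided by $64 per million tokens). Below that rate the token model is cheaper. Above it the flat rate wins.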

Use xAI Unless You're Already in the OpenAI Ecosystem

Both providers perform well. xAI is still ahead on pricing and offers a wider variety of voices. If you're fully in the OpenAI ecosystem, you'll be fine with the new gpt-realtime-2 model for most use cases. Otherwise, I'd still recommend xAI's models.
