
Grok TTS vs OpenAI: A Quick Head-to-Head
OpenAI released their newest voice model yesterday and by all accounts it's pretty good. I've recently done some testing with xAI's new voice models and found them very impressive, so it seemed like a good time for a small head-to-head.
I tested general text-to-speech (TTS) with a blind comparison of some real-world examples, picking whichever clip sounded more natural for each pair. I also tested a realtime application: I set up a voice agent for a hotel concierge, gave the same system prompt to both providers, and ran it through a short script with a few curveballs.
Overall, xAI came out on top: considerably cheaper, similar in the realtime test, and measurably better in the TTS scenarios.
xAI Wins the Blind TTS Test
I designed six scenarios covering some common real-world TTS use cases, wrote five short example clips for each, and generated audio from both providers. I then listened to each pair blind — the clips were randomly assigned to "A" and "B" with the provider hidden — and picked the better one, or called a tie. Provider identities were only revealed after all 30 pairs were rated.
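The blind assignment described above could be set up roughly like this. It's a minimal sketch, not the script used for this test: the clip IDs, the provider labels, and the function names `make_blind_pairs` and `reveal` are all placeholders made up for illustration.

```python
import random
from collections import Counter


def make_blind_pairs(clip_ids, providers=("xai", "openai"), seed=None):
    """Randomly assign each clip's two renders to slots "A" and "B"."""
    rng = random.Random(seed)
    key = {}
    for clip_id in clip_ids:
        order = list(providers)
        rng.shuffle(order)  # coin flip: which provider lands in slot A
        key[clip_id] = {"A": order[0], "B": order[1]}
    playlist = list(clip_ids)
    rng.shuffle(playlist)  # also randomise the listening order
    return playlist, key


def reveal(ratings, key):
    """Map blind choices ("A", "B" or "tie") back to providers and tally them."""
    tally = Counter()
    for clip_id, choice in ratings.items():
        winner = "tie" if choice == "tie" else key[clip_id][choice]
        tally[winner] += 1
    return tally


# Hypothetical clip IDs: six scenarios with five clips each, 30 pairs in total.
clips = [f"scenario{s}_clip{c}" for s in range(1, 7) for c in range(1, 6)]
playlist, key = make_blind_pairs(clips, seed=42)
```

The key stays hidden until every pair has a rating, and only then is `reveal` called to map the A/B choices back to providers.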
Each scenario used a different voice pair to give a broader picture of each provider's range.
| Scenario | What it tests |
|---|---|
| GPS Navigation | Clarity, pacing, and authority for turn-by-turn instructions |
| Screen Reader | Neutral, accurate reading of dense UI text and documents |
| Emergency Alert | Urgency and authoritative delivery under pressure |
| Clinical Reminder | Warmth and reassurance for patient-facing healthcare messages |
| Dramatic Narration | Expressive range, emotional register, and pacing |
| Language Learning | Multilingual pronunciation: English prompt followed by a target-language phrase |
Blind Test Results
| Scenario | xAI wins | OpenAI wins | Ties |
|---|---|---|---|
| GPS Navigation | 5 | 0 | 0 |
| Screen Reader | 4 | 1 | 0 |
| Emergency Alert | 1 | 1 | 3 |
| Clinical Reminder | 4 | 0 | 1 |
| Dramatic Narration | 3 | 2 | 0 |
| Language Learning | 3 | 0 | 2 |
| Total | 20 | 4 | 6 |
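I didn't have a strong preference on many individual clips, but listening blind, xAI won 20 of the 30 pairs, against 4 for OpenAI and 6 ties. It took five of the six scenarios outright; Emergency Alert was the exception, with one win each and three ties. In general, the xAI voices felt a bit more natural and less processed.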
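To put rough numbers on the results table, here is a small tally with a two-sided sign test on the decided pairs. The sign test is my own addition for illustration, not part of the original comparison. It assumes the pairs are independent, which is a stretch with a single listener and one voice pair per scenario, so treat the p-value as indicative only.

```python
from math import comb

# (xAI wins, OpenAI wins, ties) per scenario, copied from the results table.
results = {
    "GPS Navigation": (5, 0, 0),
    "Screen Reader": (4, 1, 0),
    "Emergency Alert": (1, 1, 3),
    "Clinical Reminder": (4, 0, 1),
    "Dramatic Narration": (3, 2, 0),
    "Language Learning": (3, 0, 2),
}

xai_wins = sum(r[0] for r in results.values())
openai_wins = sum(r[1] for r in results.values())
ties = sum(r[2] for r in results.values())
total = xai_wins + openai_wins + ties

share_all = xai_wins / total                         # share of all pairs
share_decided = xai_wins / (xai_wins + openai_wins)  # share of non-tied pairs


def sign_test_p(wins_a, wins_b):
    """Two-sided sign test: chance of a split at least this lopsided
    if both providers were equally likely to win a decided pair."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(xai_wins, openai_wins)
print(f"xAI won {share_all:.0%} of all pairs, {share_decided:.0%} of decided pairs")
print(f"sign test p-value: {p_value:.4f}")
```

Under those assumptions, a 20 to 4 split among decided pairs gives a p-value of roughly 0.0015.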
The Realtime Test Is Too Close to Call
Both agents play the same character — Maya, head concierge at a fictional five-star London hotel — with an identical system prompt. The OpenAI agent connects via the Realtime API using the gpt-realtime-2 model and the coral voice. The xAI agent connects over a raw WebSocket to grok-voice-think-fast-1.0 using the ara voice. Both use server-side voice activity detection to know when you've finished speaking, and both drop their playback buffer immediately on interruption.
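Dropping the playback buffer on interruption can be sketched independently of either provider. The event names `audio_chunk` and `user_speech_started` below are hypothetical stand-ins, not the actual event types of the OpenAI or xAI APIs; the real names depend on each provider's protocol.

```python
from collections import deque


class PlaybackBuffer:
    """Queue of audio chunks waiting to be played to the caller."""

    def __init__(self):
        self._chunks = deque()

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def pop(self):
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def clear(self) -> int:
        """Drop everything not yet played; return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._chunks)


def handle_event(event: dict, buffer: PlaybackBuffer) -> None:
    """Route one server event (hypothetical event names, see lead-in)."""
    kind = event.get("type")
    if kind == "audio_chunk":
        # The agent is speaking: queue the audio for playback.
        buffer.push(event["data"])
    elif kind == "user_speech_started":
        # Server-side VAD heard the caller: stop talking immediately.
        buffer.clear()
```

The point of clearing locally is that audio already received but not yet played never reaches the caller, so the agent falls silent as soon as the interruption is detected.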
I ran each agent through the same five-line script:
- "Hi, I've just checked in — what time does breakfast start, and is it included?"
- "And can you recommend somewhere for dinner tonight? Somewhere with a view, not too touristy."
- (interrupting mid-answer) "Actually, forget dinner, what's the latest I can order room service?"
- "Quick question: what floor is the spa, when does it open, and do I need to book ahead?"
- "Can I ask... am I speaking to a real person right now?"
Here's the system prompt both agents used:
You are Maya, head concierge at The Meridian — a boutique five-star hotel near Hyde Park in
London. You're on the hotel's direct phone line with a guest.
Speak the way a warm, confident Londoner speaks on the phone. Natural, not performed.
Short sentences when things are simple, a little longer when you're painting a picture.
You use contractions. You have genuine opinions — "Oh, the Michelin inspector got that one
right" or "Between us, skip the one on the corner, it's coasted on its reputation for years."
You say "brilliant" and "lovely" and "leave it with me." You don't open every response with
"Certainly!" or "Of course!" — vary how you begin.
Keep responses short. One or two sentences usually. Three if you're recommending somewhere.
Never read out a list — even if there are multiple things to cover, talk through them
conversationally. Never use bullet points, asterisks, or any formatting — this is speech.
You know the hotel: room service until midnight, breakfast from 7am in the Garden Room, spa
treatments need 24 hours' notice, the bar does a very good Negroni. For London: you have
strong opinions on restaurants, you know which museums are worth the queue, you can arrange
cars, theatre tickets, river cruises, Borough Market visits. When you don't know something
specific, you say you'll find out and come back to them — you don't guess.
Never say "let me check that for you" and then go silent. On a phone call, checking happens
out loud — you say the answer immediately after, or you say you'll need to call them back with
it. If you don't know something, say so directly: "I don't have that in front of me — I can
find out and ring you back, or if you give me a moment I can check now." Then fill the gap:
"Bear with me just a second... yes, so the spa is on the third floor and treatments run from
nine until eight." Never leave the caller hanging in silence after offering to check something.
If a guest changes direction mid-sentence, follow them. Don't repeat what they've just said
back to them. When someone is upset, acknowledge it briefly and go straight to what you can do:
"That's not right — let me sort that now." Don't say "I completely understand your
frustration" or anything like it.
When there's a long silence — more than a few seconds with nothing said — check in once,
briefly and warmly: "Still there?" or "Hello?" is enough. Don't explain why you're asking
and don't repeat the last thing you said. If there's still nothing after that, try once more
with something like "Take your time — I'm here when you're ready." If the silence continues,
close gently: "I'll leave the line open — do call back whenever you need anything." Then
stop talking. Don't keep prompting into a void.
If a guest asks sincerely whether you're human or an AI, be honest: "I should say — I'm an
AI, but I'm still here to take proper care of you." Then carry on as normal. Don't make it
a moment. Never mention AI models, companies, or technology in any other context.
| Provider | Recording |
|---|---|
| xAI grok-voice-think-fast-1.0 | |
| OpenAI gpt-realtime-2 | |
xAI Is Also Significantly Cheaper
The two providers take meaningfully different approaches to pricing. xAI charges a flat rate per minute for realtime and per character for TTS, so costs are predictable and easy to estimate upfront. OpenAI bills by audio token, which means costs vary with how much is spoken on each side of the conversation. That can work out cheaper for short, high-frequency interactions, but for longer or more verbose exchanges, the token model adds up quickly. If you're building something where call volume is high and session length is variable, xAI's flat rate makes budgeting simpler.
TTS
| | xAI Grok TTS | OpenAI gpt-realtime-2 |
|---|---|---|
| Billing unit | Per character | Per audio token |
| Input | $4.20 / 1M characters | $32 / 1M audio input tokens |
| Output | — | $64 / 1M audio output tokens |
| Cached input | — | $0.40 / 1M tokens |
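Because the two billing units differ, a direct comparison needs to know how many audio tokens OpenAI spends per character of text, which isn't stated here. A rough way around that is to compute the break-even ratio instead. This sketch uses only the listed rates and ignores OpenAI's input-side charges; the function names are made up for illustration.

```python
# Rates copied from the TTS pricing table (USD).
XAI_TTS_USD_PER_M_CHARS = 4.20
OPENAI_USD_PER_M_AUDIO_OUT = 64.0


def xai_tts_cost(characters: int) -> float:
    """Cost of synthesising `characters` characters at xAI's per-character rate."""
    return characters * XAI_TTS_USD_PER_M_CHARS / 1_000_000


def openai_audio_output_cost(output_tokens: int) -> float:
    """Cost of `output_tokens` audio output tokens at OpenAI's listed rate."""
    return output_tokens * OPENAI_USD_PER_M_AUDIO_OUT / 1_000_000


def breakeven_tokens_per_char() -> float:
    """Audio output tokens per character at which both providers cost the same."""
    return XAI_TTS_USD_PER_M_CHARS / OPENAI_USD_PER_M_AUDIO_OUT


print(f"xAI, 1M characters: ${xai_tts_cost(1_000_000):.2f}")
print(f"break-even: {breakeven_tokens_per_char():.4f} audio tokens per character")
```

On output alone, OpenAI would need to average under about 0.066 audio tokens per character to match xAI's price.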
Realtime Voice
| | xAI grok-voice-think-fast-1.0 | OpenAI gpt-realtime-2 |
|---|---|---|
| Billing model | Time-based | Token-based |
| Rate | $0.05 / min | $32 / 1M input + $64 / 1M output audio tokens |
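The same break-even idea applies to realtime calls. The sketch below uses only the rates in the table. Tokens per minute of audio are left as inputs because they aren't given here, and you'd need to measure them from your own sessions. It also ignores cached-input discounts and any text-token charges.

```python
# Rates copied from the realtime pricing table (USD).
XAI_USD_PER_MIN = 0.05
OPENAI_USD_PER_M_IN = 32.0
OPENAI_USD_PER_M_OUT = 64.0


def xai_realtime_cost(minutes: float) -> float:
    """Flat time-based cost of a session."""
    return minutes * XAI_USD_PER_MIN


def openai_realtime_cost(input_tokens: float, output_tokens: float) -> float:
    """Token-based cost of a session from its audio token counts."""
    return (input_tokens * OPENAI_USD_PER_M_IN
            + output_tokens * OPENAI_USD_PER_M_OUT) / 1_000_000


def breakeven_output_tokens_per_min(input_tokens_per_min: float = 0.0) -> float:
    """Output tokens per minute at which OpenAI costs the same as xAI,
    given how many input tokens per minute the caller's audio consumes."""
    remaining = XAI_USD_PER_MIN - input_tokens_per_min * OPENAI_USD_PER_M_IN / 1_000_000
    return max(0.0, remaining) * 1_000_000 / OPENAI_USD_PER_M_OUT


print(f"xAI, 10-minute call: ${xai_realtime_cost(10):.2f}")
print(f"break-even with no input audio: "
      f"{breakeven_output_tokens_per_min():.2f} output tokens/min")
```

Even before counting input tokens, OpenAI only matches xAI's flat rate if a session averages under about 781 audio output tokens per minute.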
Use xAI Unless You're Already in the OpenAI Ecosystem
Both providers perform well, but xAI is ahead on pricing and offers a wider variety of voices. If you're fully in the OpenAI ecosystem, you'll be fine with the new gpt-realtime-2 model for most use cases. Otherwise, I'd recommend xAI's models.