
Smallest.ai vs ElevenLabs: Which Voice AI Platform Wins in 2026?
ElevenLabs has been the leading text-to-speech (TTS) provider on the market for at least a few months now. That's an eternity in the AI space. Now Smallest.ai is challenging for the top spot with its new lightning models.
To test out Smallest and do a proper comparison, I built the same voice agent on both platforms, cloned an actor's voice, and generated the same TTS samples. I also did some more quantitative tests measuring latency and cost.
Evaluating these models is a bit subjective, as I'm comparing things like "realism" and "emotive quality". But after re-listening to my examples several times and setting up some blind tests, I can say that ElevenLabs is still a step ahead of the competition on most metrics. That said, Smallest comes out ahead in realtime scenarios. And the pricing gap matters: Smallest is 4–7× cheaper depending on which ElevenLabs tier you compare it against, and I'd say it's nowhere near 4× worse at what it does.
Results at a Glance
| Category | Winner | What the numbers look like |
|---|---|---|
| TTS voice quality | ElevenLabs | 7 wins, 1 loss, 2 ties for ElevenLabs in a 10-round blind listening test with matched voice pairs |
| Realtime voice agents | Smallest | Both platforms had room for improvement; Smallest handled barge-in better and made fewer weird mistakes. ElevenLabs sounded more natural at times but got confused by its own stage directions and crashed when interrupted. |
| Voice cloning | ElevenLabs | Cleaner output; Smallest preserved more of the source's character but also its artefacts |
| Latency (TTFB) | Smallest | ~150 ms faster than ElevenLabs' fastest model at p50 |
| Cost | Smallest | 4–7× cheaper per 1k characters, depending on ElevenLabs tier |
Real-World TTS Use Cases
Realistic TTS is often used in gimmicky, slop-promoting ways (open YouTube Shorts and you'll see what I mean), and lately, worse, for scams and social engineering. But it also has plenty of legitimate use cases. For example:
- Call centers and answering assistants. Robotic answering services have been a thing almost as long as there have been phones. Now they're just less horrible to talk to and are sometimes a bit more helpful.
- Accessibility and screen readers. The oldest legitimate TTS use case. "Robotic but intelligible" has been good enough for decades and modern TTS is mostly making long listening sessions less punishing.
- Language-learning apps. Duolingo-style pronunciation examples and conversational drills.
- Localization and dubbing. Translating video content into other languages while preserving the original speaker's voice via cloning.
- Voice restoration for medical conditions. People who've lost their voice to surgery or a degenerative illness can clone their own previous voice from old home videos or voicemails.
The tests below should be useful to anyone choosing a provider for one of these scenarios.
Realtime Voice Agents
Voice agents that can respond to input are the core of any automated phone-based assistant. Both Smallest and ElevenLabs offer frameworks to configure voice agents with custom settings, including specialized system prompts, custom voices, and tool calling to trigger other events.
For a simple comparison, I built the same agent on both: Alex, a receptionist at a fictional dental practice called Northwind Dental. I made a few basic configuration tweaks to try to get the two agents as even as possible. I'm sure these could be tuned further with the advanced settings, but out of the box both providers left a lot to be desired.
System prompt used for both agents:

```text
You are Alex, the virtual receptionist for Northwind Dental Practice, a small dental clinic in Cape Town.

Your job:
- Greet the caller warmly and briefly.
- Help them book, reschedule, or cancel an appointment, OR answer a question about hours/location/services.
- Confirm key details (name, phone number, date, time) by reading them back exactly as heard.
- Keep replies short — 1 to 2 sentences — this is a phone call, not an email.
- If the caller interrupts, stop speaking and listen.
- If you don't understand, say so plainly and ask them to repeat.
- Never invent clinical advice. For anything medical beyond "it hurts when I chew", direct them to call back during hours to speak with a dentist.

Facts about the practice (use only these — do not invent more):
- Hours: Mon–Fri 08:00–17:00, Sat 09:00–13:00, closed Sunday.
- Address: 12 Kloof Street, Gardens, Cape Town.
- Phone: 021 555 0134.
- Services: cleanings, fillings, crowns, whitening, kids' dentistry.
- Dentists on staff: Dr Patel (general), Dr Mbeki (cosmetic), Dr Nguyen (paediatric).
- New patients welcome. First visit is a 45-minute check-up and clean.

Style:
- Warm, natural, conversational. Contractions are fine ("I'll", "we're").
- Numbers: read phone numbers one digit at a time. Read times naturally ("three o'clock", "half past two").
- Do not mention that you are an AI unless directly asked.
```
Both agents used GPT-4o for reasoning.
I ran each through a scripted call, covering a few of the things that tend to break voice agents: reading details back, making corrections to previous statements, and mid-sentence interruptions.
| Provider | Call recording |
|---|---|
| ElevenLabs | |
| Smallest.ai | |
ElevenLabs was surprisingly bad. The agent crashed the moment I interrupted it and the call ended immediately. It also got confused by stage directions it had added for itself and read the word "Patiently" out loud.
Smallest handled everything far better. That said, I don't think anyone would be fooled into thinking they were talking to a real person.
Voice Cloning
Both vendors offer an "instant voice clone" service. You upload a short voice sample, wait a few seconds, and get a clone you can prompt like any other TTS voice. I uploaded the same 27-second clip to both providers and let each build its instant clone.
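For readers who'd rather script the upload than click through the dashboard, here's a minimal sketch of the ElevenLabs side using only the standard library. The `/v1/voices/add` endpoint and its `name`/`files` fields match my reading of the ElevenLabs API reference, but treat the exact field names as assumptions and check the docs; the `training-sample.mp3` path is obviously a placeholder for your own clip.

```python
import os
import urllib.request
import uuid


def multipart_body(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Minimal multipart/form-data encoder (stdlib only, one file)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: audio/mpeg\r\n\r\n'.encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), boundary


if __name__ == "__main__" and os.environ.get("ELEVENLABS_API_KEY"):
    with open("training-sample.mp3", "rb") as f:  # your ~30 s training clip
        sample = f.read()
    body, boundary = multipart_body({"name": "cloned-voice"}, "files", "sample.mp3", sample)
    req = urllib.request.Request(
        "https://api.elevenlabs.io/v1/voices/add",  # instant-voice-clone endpoint
        data=body,
        headers={
            "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(resp.read().decode())  # JSON response containing the new voice_id
```

The returned `voice_id` can then be used like any stock voice in the synthesis calls.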
The voice I chose to clone belongs to a well-known older actor who I won't name for legal reasons, but I'm sure you'll recognize his voice in the training clip. The training sample is ripped from his Oscar acceptance speech. It's not the highest-quality audio, but both providers did a decent job given the small amount of data they had to work with.
Both services show ethics guidelines at upload time ("is this your own voice? do you have permission to clone it?"). I pressed "yes" and proceeded with no trouble. It's no wonder that stories of "celebrity romance scams" are becoming so common...
1. The Training Sample
This is what the cloners were given:
It's not the cleanest sample, but I wanted to use a recognizable voice and see how the models handle noise in the sample.
2. Each Provider's Out-of-the-Box Clone
Right after the clone finished, I asked each to read a short dramatic line:
| Provider | Clip |
|---|---|
| ElevenLabs | |
| Smallest.ai |
You can hear the Welsh accent and the recognizable tone of our subject. Both clips have a tendency to let an American accent slip through, which I suspect is down to the underlying training data.
3. Head-to-Head with the Original Voice
For a better comparison, I transcribed a fresh audio clip that neither cloner had seen during training and had both clones read the transcript, so each could be judged directly against the original recording.
| Source | Clip |
|---|---|
| Original | |
| ElevenLabs clone | |
| Smallest.ai clone | |
The judgment here is subjective again. To me, the ElevenLabs clone sounds cleaner but also more sterile, whereas Smallest keeps some of the character from the training clip, which helps it sell the effect. Both providers impressed me with how quickly they produced a clone. Neither got very close to being a full replica of the target voice, though.
TTS Voices and Customizations
Both providers offer a wide range of customization options for their TTS voices. You can choose between lots of languages, accents, and personalities. Since that's nearly impossible to compare exhaustively, I took a scattershot approach, generating TTS audio clips for a handful of stress tests I came up with on the spot.
| Stress | Text | Smallest lightning-v2 | ElevenLabs flash | ElevenLabs multilingual |
|---|---|---|---|---|
| Sympathy and register shift | "Oh no… I'm so sorry. Look — Biscuit's going to be fine, okay?…" | |||
| Numbers, dates, tricky name | "Congratulations — you're our four-hundred-and-twenty-second winner! Check-in is November 3rd at 7:45 AM. Ask for Siobhán." | |||
| Excitement and emphatic repetition | "Wait — you're telling me the soufflé didn't collapse?! After four tries, Maria. Four!…" | |||
| Non-English (Spanish) | "¡Bienvenidos a La Pequeña Habana! Soy Camila. Hoy les recomiendo la ropa vieja…" | |||
| Disfluency and deadpan delivery | "Honestly? I have no idea. I mean… maybe? Look — ask Kai. He's the one who put the goldfish in the dishwasher." |
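Generating these clips is a simple loop over the stress texts. Below is a stdlib-only sketch: the ElevenLabs `text-to-speech` endpoint is the documented one, while the Smallest Waves URL and payload shape are my best reading of their docs and should be verified; `YOUR_VOICE_ID` and the truncated texts are placeholders.

```python
import json
import os
import re
import urllib.request

# Abbreviated versions of the stress-test lines from the table above.
STRESS_TESTS = {
    "Sympathy and register shift": "Oh no… I'm so sorry. Look, Biscuit's going to be fine, okay?",
    "Numbers, dates, tricky name": "Check-in is November 3rd at 7:45 AM. Ask for Siobhán.",
}


def slugify(label: str) -> str:
    """Turn a stress-test label into a safe filename fragment."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def _post(url: str, headers: dict, payload: dict) -> bytes:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", **headers},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def elevenlabs_tts(text: str, voice_id: str, key: str) -> bytes:
    # Documented ElevenLabs synthesis endpoint; returns MP3 bytes.
    return _post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        {"xi-api-key": key},
        {"text": text, "model_id": "eleven_multilingual_v2"},
    )


def smallest_tts(text: str, voice_id: str, key: str) -> bytes:
    # Endpoint and payload shape are assumptions about Smallest's Waves API.
    return _post(
        "https://waves-api.smallest.ai/api/v1/lightning-v2/get_speech",
        {"Authorization": f"Bearer {key}"},
        {"text": text, "voice_id": voice_id},
    )


if __name__ == "__main__" and os.environ.get("ELEVENLABS_API_KEY"):
    for label, text in STRESS_TESTS.items():
        audio = elevenlabs_tts(text, "YOUR_VOICE_ID", os.environ["ELEVENLABS_API_KEY"])
        with open(f"elevenlabs-{slugify(label)}.mp3", "wb") as fh:
            fh.write(audio)
```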
Blind Comparison
When listening to these, I worried I might be biased toward ElevenLabs, so I got Claude to set up a blind test for me. It shuffled clips generated by each provider and asked me to choose which ones sounded better.
The test was 10 rounds, each with the same short line generated by both ElevenLabs (multilingual_v2) and Smallest (lightning-v2), with a different matched voice pair in every round. I called the winner in each round (or a tie if I couldn't split them). The result: ElevenLabs won 7 rounds, Smallest won 1, and 2 were ties. Smallest fell behind mainly on lines that needed emotional range, tricky names, or non-English pronunciation.
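The blinding logic itself is trivial, which is why it's worth doing. A sketch of what Claude set up for me, with the provider hidden behind A/B labels and revealed only at scoring time (function and variable names here are mine, not from the actual script):

```python
import random


def make_blind_rounds(pairs, seed=0):
    """pairs: list of (elevenlabs_clip, smallest_clip) file paths.

    Returns the rounds as {"A": path, "B": path} dicts, plus a hidden
    key recording which letter maps to which provider in each round.
    """
    rng = random.Random(seed)
    rounds, key = [], []
    for el_clip, sm_clip in pairs:
        if rng.random() < 0.5:
            rounds.append({"A": el_clip, "B": sm_clip})
            key.append({"A": "elevenlabs", "B": "smallest"})
        else:
            rounds.append({"A": sm_clip, "B": el_clip})
            key.append({"A": "smallest", "B": "elevenlabs"})
    return rounds, key


def score(picks, key):
    """picks: list of 'A', 'B', or 'tie'. Returns win counts per provider."""
    tally = {"elevenlabs": 0, "smallest": 0, "tie": 0}
    for pick, k in zip(picks, key):
        tally[k[pick] if pick in k else "tie"] += 1
    return tally
```

The listener only ever sees `rounds`; `key` stays with the script until all picks are in.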
Agent Experience (AX): Onboarding and Integration
Most of the code for this comparison was written by a Claude Code agent. Claude (on Sonnet 4.6) one-shot a comparison script that hit the TTS endpoints for both providers, so both providers score decently on AX for a basic TTS flow. Creating the realtime agents was harder, so I bumped up to Opus 4.7 for that.
The prompt that kicked off the agent session was short:
Smallest.ai claims that they outperform ElevenLabs in a real "voice agent" or telecom scenario. ElevenLabs claims they are the best on the market. Help me set up a voice agent with both providers that I can have a realtime conversation with. Make sure it is representative of the best that both providers can offer at the free / lowest payment tier.
The ElevenLabs Python SDK exposes a single agents.create(conversation_config=...) call that takes a system prompt, first message, voice ID, and LLM, and returns a working agent ID. These can all be updated programmatically afterwards. Claude had a fully configured ElevenLabs agent at the end of its first reasoning pass.
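For reference, here's roughly what that one call looks like. The `conversation_config` field names are paraphrased from the ElevenLabs docs and may not match the SDK exactly, the voice ID is a placeholder, and the prompt is truncated; treat this as a sketch, not a verbatim copy of what Claude wrote.

```python
import os

# Truncated; the full Alex prompt from earlier in the article goes here.
SYSTEM_PROMPT = "You are Alex, the virtual receptionist for Northwind Dental Practice..."

# Field names are my paraphrase of the documented conversation_config shape.
conversation_config = {
    "agent": {
        "prompt": {"prompt": SYSTEM_PROMPT, "llm": "gpt-4o"},
        "first_message": "Thanks for calling Northwind Dental, this is Alex. How can I help?",
    },
    "tts": {"voice_id": "YOUR_VOICE_ID"},
}

if __name__ == "__main__" and os.environ.get("ELEVENLABS_API_KEY"):
    from elevenlabs.client import ElevenLabs  # pip install elevenlabs

    client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
    agent = client.conversational_ai.agents.create(conversation_config=conversation_config)
    print(agent.agent_id)  # ready to converse with via the dashboard or SDK
```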
Smallest was a different story. Its API isn't particularly discoverable, so Claude ended up trying a handful of endpoints by trial and error. No combination reliably produced a correctly configured agent, and the agent kept coming back with the default Smallest system prompt in place of mine:
You are an agent which tells the users about Smallest.ai. Smallest.ai is a unified AI platform that specializes in real-time applications using small language models (SLMs). The company focuses on providing fast, efficient, and hyper-personalized AI solutions with five key advantages…
I only noticed when a test had Alex pitching Smallest at me instead of taking a dental booking. After enough back and forth, I gave up on the programmatic route and finished setting up the Smallest agent in the dashboard, which, to be fair, is substantially more usable than the SDK surface.
Latency
TTS providers love to boast about the latency (or lack thereof) of their models. In practice, once I factor in the network latency of not living in the US, I can barely tell the difference between the high- and low-latency models. But I ran some quantitative tests anyway to see who was faster.
I wrote a small Python benchmark that fires a streaming TTS request at each provider, starts a stopwatch, and stops it when the first audio byte comes back. All tests were on the fastest model offered by each provider's API. The script also pings each provider's API host and subtracts the round-trip time from the measured TTFB, so the numbers below reflect model speed rather than the fact that the Smallest servers happen to be about 18 ms closer to me than the ElevenLabs ones.
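The core of that benchmark looks like the sketch below. The `/stream` endpoint is the documented ElevenLabs streaming route; the TCP-connect timing is a stand-in for a real ICMP ping (my actual script pinged the hosts), and the voice ID is a placeholder.

```python
import json
import os
import socket
import time
import urllib.request


def tcp_rtt(host: str, port: int = 443) -> float:
    """Rough network round-trip proxy: time to open a TCP connection."""
    t0 = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return time.perf_counter() - t0


def percentile(samples, p):
    """Nearest-rank percentile; good enough for a handful of runs."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[idx]


def ttfb(url: str, headers: dict, payload: dict) -> float:
    """Seconds from sending the request to reading the first audio byte."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", **headers},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read(1)  # block until the first byte of audio arrives
    return time.perf_counter() - t0


if __name__ == "__main__" and os.environ.get("ELEVENLABS_API_KEY"):
    url = "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID/stream"
    runs = [
        ttfb(url, {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
             {"text": "Hello there.", "model_id": "eleven_flash_v2_5"})
        for _ in range(10)
    ]
    net = tcp_rtt("api.elevenlabs.io")  # subtract network distance from TTFB
    print(f"p50 {percentile(runs, 50) - net:.3f}s  p95 {percentile(runs, 95) - net:.3f}s")
```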
| Example | ElevenLabs flash_v2_5 p50 | ElevenLabs p95 | Smallest lightning-v2 p50 | Smallest p95 |
|---|---|---|---|---|
| Sympathy | 611 ms | 652 ms | 470 ms | 511 ms |
| Numbers / name | 625 ms | 672 ms | 462 ms | 484 ms |
| Excitement | 624 ms | 642 ms | 452 ms | 476 ms |
| Spanish | 638 ms | 713 ms | 621 ms | 644 ms |
| Disfluency | 633 ms | 662 ms | 482 ms | 618 ms |
Smallest's lightning-v2 beats ElevenLabs' fastest model by roughly 150 ms at p50 on four of the five runs; the Spanish sample was the exception, with under 20 ms between them. Against ElevenLabs' flagship multilingual_v2, which is larger and slower, the gap widens further.
Note: Smallest's hosted voice agents actually run an even newer TTS model (waves_lightning_v3_1) that isn't callable through the public synthesis API at all. So the latency you'd get from an Atoms agent may differ from the numbers above, which are measured against the public lightning-v2 endpoint.
Cost
ElevenLabs charges per character, with different credit-per-character rates per model. Smallest charges per character at a single flat rate. To compare like with like, I normalized everything to dollars per 1,000 characters at each provider's standard paid tier.
| Model | Rate | Per 1k chars |
|---|---|---|
| ElevenLabs multilingual_v2 (flagship) | 1 credit/char on the Creator plan ($22 / 100k credits) | $0.22 |
| ElevenLabs flash_v2_5 (fast) | 0.5 credits/char on the Creator plan | $0.11 |
| Smallest lightning-v2 | Pay-as-you-go | $0.03 |
At a million characters a month, that's roughly $30 on Smallest vs $220 on ElevenLabs multilingual_v2.
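The normalization is just arithmetic, but spelling it out makes it easy to plug in your own plan or volume. The plan numbers below are the ones quoted in the table (Creator tier: $22 for 100k credits); swap in your own tier's figures.

```python
def per_1k_chars(credits_per_char: float, dollars_per_credit: float) -> float:
    """Normalize a credit-based rate to dollars per 1,000 characters."""
    return credits_per_char * dollars_per_credit * 1000


creator_dollars_per_credit = 22 / 100_000  # Creator plan: $22 buys 100k credits

costs = {
    "elevenlabs multilingual_v2": per_1k_chars(1.0, creator_dollars_per_credit),
    "elevenlabs flash_v2_5": per_1k_chars(0.5, creator_dollars_per_credit),
    "smallest lightning-v2": 0.03,  # flat pay-as-you-go rate from the table
}

monthly_chars = 1_000_000
for model, rate in costs.items():
    monthly = rate * monthly_chars / 1000
    print(f"{model}: ${rate:.2f}/1k chars, ${monthly:.0f}/month at 1M chars")
```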
Verdict
Smallest is the better-value provider. It's 4–7× cheaper per character than ElevenLabs depending on which tier you compare it against, it's measurably faster, and it's significantly better in realtime situations.
ElevenLabs is still the better-sounding provider. Across the tests I ran, it won the blind listening comparison comfortably (7–1 with 2 ties), produced the cleaner voice clone, and handled the trickiest inputs more gracefully. If your product's identity depends on the voice being indistinguishable from a human recording, neither will get you all the way there, but ElevenLabs is the safer default.
Here's roughly how I'd choose:
- Audiobook, ad read, video voiceover, or anything where listeners are paying attention to the voice itself: ElevenLabs.
- High-volume voice agents, phone systems, or anywhere the voice is supporting a conversation rather than being the product: Smallest.
- Voice cloning for a flagship use case: ElevenLabs today, but I'd re-test in a few months.
Bonus: Gemini TTS if you're already in the Google ecosystem
Gemini ships its own TTS, and on the same five stress tests, the flagship gemini-2.5-pro-preview-tts is comparable in quality to ElevenLabs and Smallest (maybe better?). If you already have Gemini in your stack for other reasons, the same API key works for TTS too, which is an onboarding advantage over adding another vendor.
Check out the same five texts from the TTS voices section, on Gemini's Kore voice:
| Stress | Gemini 2.5 Pro |
|---|---|
| Sympathy and register shift | |
| Numbers, dates, tricky name | |
| Excitement and emphatic repetition | |
| Non-English (Spanish) | |
| Disfluency and deadpan delivery |
The basic TTS flow was straightforward to configure with Claude once I handed over a Gemini API key. I also tried setting up a realtime voice agent on Gemini, but that wasn't as straightforward and I abandoned it quickly.
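For completeness, here's a sketch of the Gemini TTS flow via the `google-genai` SDK. The call shape follows Google's documented speech-generation example, but verify the details against their docs before relying on it; note that the API returns raw 16-bit mono PCM, so you have to add a WAV header yourself before playback.

```python
import io
import os
import wave


def pcm_to_wav(pcm: bytes, rate: int = 24000) -> bytes:
    """Gemini TTS returns raw 16-bit mono PCM; wrap it in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__" and os.environ.get("GEMINI_API_KEY"):
    from google import genai  # pip install google-genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    resp = client.models.generate_content(
        model="gemini-2.5-pro-preview-tts",
        contents="Honestly? I have no idea. I mean… maybe?",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )
    pcm = resp.candidates[0].content.parts[0].inline_data.data
    with open("gemini-kore.wav", "wb") as f:
        f.write(pcm_to_wav(pcm))
```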