On Hacker News

MiMo-V2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

mimo.xiaomi.com | submitted by gainsurier | June 8, 2026

590 points | 435 comments


The 5-second version

Xiaomi released MiMo-V2.5-Pro-UltraSpeed, a 1-trillion-parameter model achieving 1000+ tokens/s decode speed on commodity GPUs through collaboration with TileRT.
The API costs 3× the standard MiMo-V2.5-Pro price but delivers ~10× the generation speed. It is available by application only, from June 9 to June 23, 2026, at platform.xiaomimimo.com/ultraspeed.
Approved users get free Chat access during the trial with limits: 10 queue entries/day, 30-minute sessions, and auto-release after 5 minutes idle.
The speed breakthrough enables parallel reasoning paths (Best-of-N/Tree Search), real-time coding agents, and time-critical applications like trading, anti-fraud, and surgical assistance.
The achievement relies on model-system codesign: FP4 (MXFP4) quantization for bandwidth reduction and DFlash speculative decoding, optimized by TileRT's custom compilation engine and kernels for a single 8-GPU node.
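
To make the price/speed trade-off above concrete, here is a minimal sketch in Python. It uses only the ratios from the summary (3× price, ~10× speed, 1000 tokens/s). The 100 tokens/s standard-tier speed is inferred from the "~10×" claim rather than a published figure, prices are in relative units, and the `Tier` type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    tokens_per_second: float  # decode speed
    relative_price: float     # price per token, relative to the standard tier


# Ratios from the summary: 1000 tokens/s, ~10x the standard speed, 3x the price.
# ASSUMPTION: the 100 tokens/s baseline is inferred from "~10x", not published.
STANDARD = Tier("standard", tokens_per_second=100.0, relative_price=1.0)
ULTRASPEED = Tier("ultraspeed", tokens_per_second=1000.0, relative_price=3.0)


def decode_seconds(tier: Tier, tokens: int) -> float:
    """Wall-clock decode time for one sequential generation."""
    return tokens / tier.tokens_per_second


def relative_cost(tier: Tier, tokens: int, samples: int = 1) -> float:
    """Cost of `samples` generations, in units of one standard-tier token."""
    return tier.relative_price * tokens * samples


def sequential_samples_within(tier: Tier, tokens: int, budget_seconds: float) -> int:
    """How many generations of `tokens` length fit back to back in a time budget."""
    return int(budget_seconds // decode_seconds(tier, tokens))


if __name__ == "__main__":
    tokens, budget = 2000, 20.0
    for tier in (STANDARD, ULTRASPEED):
        n = sequential_samples_within(tier, tokens, budget)
        print(tier.name, decode_seconds(tier, tokens), n, relative_cost(tier, tokens, n))
```

Under these assumptions, a 20-second budget fits one 2000-token generation on the standard tier and ten on the faster tier, which is the room a Best-of-N approach would use. Those ten samples cost 30× the single standard-tier generation, so the extra speed buys more candidates per unit of time, not cheaper tokens.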

Top voices

Verbatim comments from the thread's most notable / highest-karma participants.

SwellJoe | notable | 33.8k karma | 4 comments

In recent benchmarking I've been doing, DeepSeek V4 Pro was the fastest of 21 models, by a comfortable margin (https://swelljoe.com/html/bench-report-final.html). Faster than Claude Opus 4.8, which was the second fastest (Mistral doesn't count because it seems to have refused to participate). But, it's a limited data set, just a few benchmark runs of a limited set of tasks. It's entirely possible I happened to be calling the API at its least busy time and maybe Claude got hit during a busy time.


nl | notable | 32.6k karma | 5 comments

Their models are much smaller: 1T vs 5T for the frontier models. 1T is Sonnet/Google Flash size, not Opus size. The $0.87/M tokens price for Mimo Pro is probably subsidized. Mimo models aren't widely available on western providers, but Kimi and Deepseek are similar sizes and cost about the same to run. They are priced $3-$4/M tokens (which is right were Google's very confused range of Flash models are priced at: between $0.40/M tokens and $9/M tokens depending on exactly which model - and you…


Terretta | notable | 20.5k karma | 3 comments

You think someone is, or even should, special case things like estimates? What else deserves that level of intervention so they look less dumb? Logistics for getting to the car wash next door? In the mean time, alas, no, we can see from actual prompts sent directly or through sub-agents, and actual replies, estimates remain LLM generated. Though, this discussion here could change that, because indeed there is a lot of special casing and context stuffing going on, one of the oldest being toda…


andai | 14.5k karma | 9 comments

> ... you have to be very specific about what you want. I found that Opus, for example, is much better at asking me to clear up ambiguity in a request before starting, whereas the Chinese models tend to "fill in the blanks" and make their own assumptions. That's the main thing I've noticed. Small models can follow instructions just fine. If the instructions are very specific. Then I often have to spend more time explaining a task than it would have taken me to do it myself. The bigger models h…

