Xiaomi: MiMo-V2.5

MiMo-V2.5 is a multimodal model from Xiaomi that accepts text, audio, image, and video inputs and returns text, giving it one of the broader input ranges listed on this site. Its context window is 1,048,576 tokens, and it supports tool calling and a reasoning mode, which suits multi-step agentic workflows. The spec sheet also lists structured output as supported. At $0.14 per million input tokens and $0.28 per million output tokens, pricing sits at the low end of the multimodal tier, which makes the model worth considering for high-volume or cost-sensitive pipelines. The tradeoff is transparency: MiMo-V2.5 has no independent benchmark coverage yet, so its quality relative to peers is unproven. Buyers who can run their own evals on their own tasks, and who are drawn by the input range and the price, have the clearest reason to shortlist it; those who need third-party validation before committing should wait for coverage to emerge.
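For cost-sensitive pipelines, the listed per-token rates translate into a simple estimate. The sketch below is illustrative only: the function name and the example workload (10,000 requests at 2,000 input and 500 output tokens each) are assumptions, and actual billing may differ, for example in how audio, image, or video inputs are counted as tokens.

```python
# Listed prices from this page, in USD per 1M tokens.
INPUT_PRICE_PER_M = 0.14
OUTPUT_PRICE_PER_M = 0.28


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost at the listed rates (hypothetical helper)."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Assumed workload, not a measured one: 10,000 requests,
# each with 2,000 input tokens and 500 output tokens.
requests = 10_000
cost = estimate_cost(requests * 2_000, requests * 500)
print(f"Estimated cost: ${cost:.2f}")
```

At these rates, the assumed workload of 20 million input tokens and 5 million output tokens comes to about $4.20.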

Quality Score: 100/100 (price + capability + benchmarks)
Input Price: $0.14 per 1M tokens
Output Price: $0.28 per 1M tokens
Context Window: 1,048,576 tokens
Model ID: xiaomi/mimo-v2.5
Vendor: xiaomi
Tokenizer: Other
Input Modalities: text, audio, image, video
Output Modalities: text
Max Output: 131,072 tokens
Tool Calling: ✓ supported
Structured Output: ✓ supported
Reasoning Mode: ✓ supported
Vision: ✓ accepts images
Audio: ✓ accepts audio
Moderated: no
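
The context window and max output figures above can be turned into a quick pre-flight check before sending a long request. This is a minimal sketch: the helper name is hypothetical, and whether output tokens count against the same 1,048,576-token window is an assumption that this page does not confirm, so it is exposed as a flag.

```python
# Limits taken from the spec sheet on this page.
CONTEXT_WINDOW = 1_048_576  # tokens
MAX_OUTPUT = 131_072        # tokens


def fits_limits(prompt_tokens: int, requested_output: int,
                shared_window: bool = True) -> bool:
    """Check a request against the listed limits (hypothetical helper).

    shared_window=True ASSUMES prompt and output share one context
    window; the listing does not state this either way.
    """
    if requested_output > MAX_OUTPUT:
        return False
    if shared_window:
        return prompt_tokens + requested_output <= CONTEXT_WINDOW
    return prompt_tokens <= CONTEXT_WINDOW


# A 900,000-token prompt with the full output allowance fits either way.
print(fits_limits(900_000, MAX_OUTPUT))
# A 1,000,000-token prompt fits only if output is not counted in the window.
print(fits_limits(1_000_000, MAX_OUTPUT))
print(fits_limits(1_000_000, MAX_OUTPUT, shared_window=False))
```

Treating the window as shared is the more conservative choice when the provider's accounting is unknown.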

Category rankings

Where Xiaomi: MiMo-V2.5 places across the 10 categories it ranks in.

#    Category                                        Score
#2   Real-Time Chat (Latency · of 25 ranked)         118
#3   Transcription (Voice · of 19 ranked)            123
#4   Self-Hosted / Local (Cost · of 25 ranked)       117
#5   Social Media Posts (Writing · of 25 ranked)     119
#5   Voice Assistant Backend (Voice · of 25 ranked)  123
#5   Cheap Bulk Inference (Cost · of 25 ranked)      137
#6   TTS Replacement (Voice · of 19 ranked)          115
#9   Video Auto-Tagging (Video · of 25 ranked)       123
#12  Audio Summarization (Voice · of 19 ranked)      139
#24  Code Completion (Code · of 25 ranked)           131

Similar models