Xiaomi: MiMo-V2-Omni

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capability: visual grounding, multi-step...

Quality Score: 100/100 (composite of price, context, capability)
Input Price: $0.40 per 1M tokens
Output Price: $2.00 per 1M tokens
Context Window: 262,144 tokens
Model ID: xiaomi/mimo-v2-omni
Vendor: xiaomi
Tokenizer: Other
Input Modalities: text, audio, image, video
Output Modalities: text
Max Output: 65,536 tokens
Tool Calling: ✓ supported
Structured Output: ✓ supported
Reasoning Mode: ✓ supported
Vision: ✓ accepts images
Audio: ✓ accepts audio
Moderated: no
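
The listed prices and limits can be turned into a quick per-request estimate. The sketch below is illustrative only: the function and constant names are hypothetical, the figures are copied from this listing, and it assumes that prompt and completion tokens share the 262,144-token context window, which the listing does not state.

```python
# Figures copied from the listing; all names here are illustrative.
INPUT_PRICE_PER_M = 0.40    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 2.00   # USD per 1M output tokens
CONTEXT_WINDOW = 262_144    # tokens
MAX_OUTPUT = 65_536         # tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def fits_limits(input_tokens: int, output_tokens: int) -> bool:
    """Check a request against the listed limits.

    Assumption: input and output tokens count against the same context window.
    """
    return (output_tokens <= MAX_OUTPUT
            and input_tokens + output_tokens <= CONTEXT_WINDOW)
```

For example, a request with 100,000 input tokens and 4,000 output tokens would come to about $0.048 at these rates and would fit within both limits under the stated assumption.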
