
ByteDance: UI-TARS 7B

UI-TARS 7B is a multimodal model from ByteDance that accepts image and text inputs and returns text. It supports a 128,000-token context window with up to 2,048 completion tokens per response. It does not support tool calling or reasoning modes, so its feature set is narrower than more full-featured alternatives in its weight class; the specifications list structured output as supported. At $0.10 per million input tokens and $0.20 per million output tokens, pricing is low, which may appeal to teams running high-volume vision and text workloads on a budget. The main caveat is that there is currently no independent benchmark coverage, so performance relative to competing models is unverified. Buyers who need documented quality baselines before committing should treat this model as unproven and run their own evaluations rather than relying on published scores.
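To make the pricing concrete, the following is a minimal sketch of a cost estimate using the listed per-token rates. The function name `estimate_cost_usd` and the workload figures are hypothetical, chosen only for illustration. How image inputs are converted to billable tokens is not stated in this listing, so the sketch takes raw token counts as given.

```python
# Prices from the listing, in USD per 1M tokens.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.20


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload from raw token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests, each with
    # 1,500 input tokens and 200 output tokens.
    requests = 10_000
    cost = estimate_cost_usd(requests * 1_500, requests * 200)
    print(f"Estimated cost: ${cost:.2f}")
```

Under these assumed volumes the input side dominates the bill even though output tokens cost twice as much per token, because the 2,048-token output cap keeps completions short relative to prompts.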

Quality Score: 80/100 (price + capability + benchmarks)
Input Price: $0.10 per 1M tokens
Output Price: $0.20 per 1M tokens
Context Window: 128,000 tokens
Model ID: bytedance/ui-tars-1.5-7b
Vendor: bytedance
Tokenizer: Other
Input Modalities: image, text
Output Modalities: text
Max Output: 2,048 tokens
Tool Calling: not supported
Structured Output: supported
Reasoning Mode: not supported
Vision: accepts images
Audio: no
Moderated: no
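
The context window and max output limits above interact when sizing a request. The sketch below shows one way to clamp a requested completion length before sending a request. It assumes that prompt and completion tokens share the 128,000-token window, which this listing does not confirm, and the helper name `clamp_max_tokens` is hypothetical.

```python
# Limits taken from the specifications above.
CONTEXT_WINDOW = 128_000
MAX_OUTPUT_TOKENS = 2_048


def clamp_max_tokens(prompt_tokens: int, requested: int) -> int:
    """Return a completion budget that respects both listed limits.

    Assumption: prompt and completion tokens count against the same
    context window.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not leave room for a completion")
    # The budget is capped by the request, the per-response limit,
    # and whatever space is left in the context window.
    return min(requested, MAX_OUTPUT_TOKENS, remaining)


if __name__ == "__main__":
    print(clamp_max_tokens(prompt_tokens=1_000, requested=4_096))
    print(clamp_max_tokens(prompt_tokens=127_000, requested=4_096))
```

For most prompts the 2,048-token per-response cap is the binding limit; the context window only matters once the prompt comes within about 2,000 tokens of 128,000.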