bytedance

ByteDance: UI-TARS 7B

UI-TARS 7B is a multimodal model from ByteDance that accepts both image and text inputs and supports a 128,000-token context window with up to 2,048 completion tokens per response. It does not support tool use, reasoning modes, or structured output, so its feature set is relatively narrow compared to more full-featured alternatives in its weight class. At $0.10 per million input tokens and $0.20 per million output tokens, the pricing is low, which may appeal to teams running high-volume vision and text workloads on a budget. The significant caveat is that there is currently no independent benchmark coverage, so performance relative to competing models is unverified. Buyers who need documented quality baselines before committing should treat this model as unproven and may want to run their own evaluations rather than relying on published scores.

Query via API → View on bytedance → Estimate cost

Quality Score

80/100

price + capability + benchmarks

Input Price

$0.10

per 1M tokens

Output Price

$0.20

per 1M tokens

Context Window

128,000

tokens

Model ID: bytedance/ui-tars-1.5-7b
Vendor: bytedance
Tokenizer: Other
Input Modalities: image, text
Output Modalities: text
Max Output: 2,048 tokens
Tool Calling: not supported
Structured Output: ✓ supported
Reasoning Mode: not supported
Vision: ✓ accepts images
Audio: no
Moderated: no