qwen

Qwen: Qwen3 VL 8B Instruct

Qwen3 VL 8B Instruct is a vision-language model from Qwen that accepts both image and text as inputs and returns text outputs. It supports tool use and carries a 256,000-token context window, which accommodates long documents or extended multi-turn conversations. Maximum output is capped at 32,768 tokens. It does not include a built-in reasoning mode, and structured output support is unconfirmed. At $0.08 per million input tokens and $0.50 per million output tokens, it sits at the lower end of multimodal model pricing, making it worth considering for teams running high-volume image-plus-text workloads on a budget. The tradeoff is transparency: there is no independent benchmark coverage available yet, so performance relative to competing models is unproven. Buyers who need validated accuracy benchmarks before committing should wait or run their own evaluations before deploying Qwen3 VL 8B Instruct in production.

Quality Score
99/100
price + capability + benchmarks
Input Price
$0.08
per 1M tokens
Output Price
$0.50
per 1M tokens
Context Window
256,000
tokens
Model ID
qwen/qwen3-vl-8b-instruct
Vendor
qwen
Tokenizer
Qwen3
Input Modalities
image, text
Output Modalities
text
Max Output
32,768 tokens
Tool Calling
✓ supported
Structured Output
✓ supported
Reasoning Mode
not supported
Vision
✓ accepts images
Audio
no
Moderated
no

Similar models