qwen

Qwen: Qwen3 VL 8B Instruct

Qwen3 VL 8B Instruct is a vision-language model from Qwen that accepts both image and text as inputs and returns text outputs. It supports tool use and carries a 256,000-token context window, which accommodates long documents or extended multi-turn conversations. Maximum output is capped at 32,768 tokens. It does not include a built-in reasoning mode, and structured output support is unconfirmed. At $0.08 per million input tokens and $0.50 per million output tokens, it sits at the lower end of multimodal model pricing, making it worth considering for teams running high-volume image-plus-text workloads on a budget. The tradeoff is transparency: there is no independent benchmark coverage available yet, so performance relative to competing models is unproven. Buyers who need validated accuracy benchmarks before committing should wait or run their own evaluations before deploying Qwen3 VL 8B Instruct in production.

Query via API → View on qwen → Estimate cost

Quality Score

99/100

price + capability + benchmarks

Input Price

$0.08

per 1M tokens

Output Price

$0.50

per 1M tokens

Context Window

256,000

tokens

Model ID: qwen/qwen3-vl-8b-instruct
Vendor: qwen
Tokenizer: Qwen3
Input Modalities: image, text
Output Modalities: text
Max Output: 32,768 tokens
Tool Calling: ✓ supported
Structured Output: ✓ supported
Reasoning Mode: not supported
Vision: ✓ accepts images
Audio: no
Moderated: no

Similar models

qwen

Qwen: Qwen3 VL 8B Instruct

Similar models

Qwen: Qwen3 VL 32B Instruct

Qwen: Qwen3 VL 30B A3B Instruct

Qwen: Qwen3 Next 80B A3B Thinking

Qwen: Qwen3 235B A22B Thinking 2507

Qwen: Qwen3 VL 235B A22B Instruct

Qwen: Qwen Plus 0728 (thinking)