meta-llama

Meta: Llama 3.2 11B Vision Instruct

Meta: Llama 3.2 11B Vision Instruct is a text-and-image model from Meta, accepting both modalities as input with a 131,072-token context window and up to 16,384 tokens of output per completion. It does not support tool use, reasoning modes, or structured output, so workflows requiring function calling or guaranteed schema responses will need a different option. At $0.345 per million tokens for both input and output, the pricing is modest, which makes it worth considering for vision tasks where cost efficiency matters more than top-tier accuracy. That said, its blended benchmark score of 3.8 is drawn from only one independent benchmark, so its general capability profile is largely unproven relative to models with broader coverage. Buyers who need a low-cost multimodal option for lighter image-understanding tasks may find it serviceable, but those prioritizing demonstrated reliability across diverse tasks should treat that thin benchmark coverage as a meaningful caution.

Query via API → View on meta-llama → Estimate cost

Quality Score

81/100

price + capability + benchmarks

Input Price

$0.34

per 1M tokens

Output Price

$0.34

per 1M tokens

Context Window

131,072

tokens

Model ID: meta-llama/llama-3.2-11b-vision-instruct
Vendor: meta-llama
Tokenizer: Llama3
Input Modalities: text, image
Output Modalities: text
Max Output: 16,384 tokens
Tool Calling: not supported
Structured Output: ✓ supported
Reasoning Mode: not supported
Vision: ✓ accepts images
Audio: no
Moderated: no

Similar models

meta-llama

Meta: Llama 3.2 11B Vision Instruct

Similar models

Meta: Llama Guard 4 12B

Meta: Llama 3.1 70B Instruct

Meta: Llama 3.3 70B Instruct

Meta: Llama 3.1 8B Instruct

Meta: Llama 3.3 70B Instruct (free)

Meta: Llama 3.2 1B Instruct