meta-llama

Meta: Llama 3.2 11B Vision Instruct

Meta: Llama 3.2 11B Vision Instruct is a text-and-image model from Meta, accepting both modalities as input with a 131,072-token context window and up to 16,384 tokens of output per completion. It does not support tool use, reasoning modes, or structured output, so workflows requiring function calling or guaranteed schema responses will need a different option. At $0.345 per million tokens for both input and output, the pricing is modest, which makes it worth considering for vision tasks where cost efficiency matters more than top-tier accuracy. That said, its blended benchmark score of 3.8 is drawn from only one independent benchmark, so its general capability profile is largely unproven relative to models with broader coverage. Buyers who need a low-cost multimodal option for lighter image-understanding tasks may find it serviceable, but those prioritizing demonstrated reliability across diverse tasks should treat that thin benchmark coverage as a meaningful caution.

Quality Score
81/100
price + capability + benchmarks
Input Price
$0.34
per 1M tokens
Output Price
$0.34
per 1M tokens
Context Window
131,072
tokens
Model ID
meta-llama/llama-3.2-11b-vision-instruct
Vendor
meta-llama
Tokenizer
Llama3
Input Modalities
text, image
Output Modalities
text
Max Output
16,384 tokens
Tool Calling
not supported
Structured Output
✓ supported
Reasoning Mode
not supported
Vision
✓ accepts images
Audio
no
Moderated
no

Similar models