Google: Gemma 3 12B

Gemma 3 12B is a Google model that accepts both text and image inputs, so it can handle multimodal tasks without a separate vision model. It has a 131,072-token (131K) context window, enough for long documents or extended conversations, and a maximum output of 16,384 tokens. It supports tool calling and, according to the specification listing below, structured output; it does not offer a native reasoning mode. At $0.05 per million input tokens and $0.15 per million output tokens, Gemma 3 12B sits at the budget end of the pricing spectrum. Its blended benchmark score of 3.9 is based on a single benchmark, so performance claims should be treated as preliminary rather than well established. Developers running high-volume, cost-sensitive workloads who also need image understanding may find it worth testing, but buyers who require strong benchmark validation before committing should wait for broader coverage.
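
To make the pricing concrete, here is a minimal sketch of a cost estimate using only the two per-million-token prices quoted above. The function name and the sample workload are hypothetical, and the sketch assumes billing is purely per token; how image inputs are counted is not stated in this listing, so they are left out.

```python
from decimal import Decimal

# Prices per 1M tokens, taken from the listing.
INPUT_PRICE_PER_M = Decimal("0.05")
OUTPUT_PRICE_PER_M = Decimal("0.15")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of a request or batch (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (Decimal(input_tokens) * INPUT_PRICE_PER_M
             + Decimal(output_tokens) * OUTPUT_PRICE_PER_M)
    return total / TOKENS_PER_M


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests with 2,000 input
    # and 500 output tokens each.
    total = estimate_cost_usd(10_000 * 2_000, 10_000 * 500)
    print(f"${total:.2f}")  # prints $1.75
```

Decimal is used instead of float so that small per-token amounts add up without rounding drift across large batches.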

Quality Score: 91/100 (price + capability + benchmarks)
Input Price: $0.05 per 1M tokens
Output Price: $0.15 per 1M tokens
Context Window: 131,072 tokens
Model ID: google/gemma-3-12b-it
Vendor: google
Tokenizer: Gemini
Input Modalities: text, image
Output Modalities: text
Max Output: 16,384 tokens
Tool Calling: ✓ supported
Structured Output: ✓ supported
Reasoning Mode: not supported
Vision: ✓ accepts images
Audio: no
Moderated: no
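
The context window and max output figures listed above can be combined into a simple pre-flight check. The sketch below assumes that prompt and completion tokens share the 131,072-token window, which is common for hosted models but is not confirmed by this listing; the function name is hypothetical.

```python
# Limits taken from the listing.
CONTEXT_WINDOW = 131_072
MAX_OUTPUT = 16_384


def max_completion_tokens(prompt_tokens: int, requested: int = MAX_OUTPUT) -> int:
    """Return the largest completion budget that fits (hypothetical helper).

    Assumes prompt and completion tokens share one context window.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    # Cap by the caller's request, the model's output limit,
    # and whatever room the prompt leaves in the window.
    return min(requested, MAX_OUTPUT, remaining)
```

For example, a 120,000-token prompt leaves room for at most 11,072 output tokens under this assumption, below the 16,384-token output cap.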