Google: Gemma 3 4B
Gemma 3 4B is a text-and-image input model from Google with a 131,072-token context window and a 16,384-token output ceiling. It does not support tool use or reasoning modes, so workflows that depend on function calling will need a different option; structured output is listed as supported. At $0.05 per million input tokens and $0.10 per million output tokens, it sits at the budget end of the market. Its blended benchmark score of 20.3 rests on only two benchmarks, which is too thin a basis for judging general capability. Buyers who need a low-cost multimodal model for straightforward text and image tasks, and who can tolerate limited third-party validation, may find it worth testing. Teams with stricter performance requirements or tool-dependent pipelines should compare it against better-documented alternatives before committing.
- Model ID: google/gemma-3-4b-it
- Vendor: Google
- Tokenizer: Gemini
- Input Modalities: text, image
- Output Modalities: text
- Max Output: 16,384 tokens
- Tool Calling: not supported
- Structured Output: ✓ supported
- Reasoning Mode: not supported
- Vision: ✓ accepts images
- Audio: no
- Moderated: no
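
To make the quoted pricing and token limits concrete, here is a minimal Python sketch that estimates the cost of a single request. The helper name `estimate_cost` is hypothetical, and the check that input and output together must fit inside the context window is an assumption about how the limit is applied, not something this page states.

```python
# Figures taken from the summary on this page.
INPUT_PRICE_PER_M = 0.05    # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.10   # USD per million output tokens
CONTEXT_WINDOW = 131_072    # tokens
MAX_OUTPUT = 16_384         # tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_OUTPUT:
        raise ValueError("output exceeds the 16,384-token ceiling")
    # Assumption: prompt and completion share the context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW:
        raise ValueError("request exceeds the 131,072-token context window")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 100,000-token prompt with a 10,000-token reply.
    print(f"${estimate_cost(100_000, 10_000):.4f}")
```

At these rates, even a prompt that nearly fills the context window costs well under one cent, so the output ceiling and the missing tool support are likely to matter more than price when evaluating fit.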