Gemma 3 27B IT
Released March 12, 2025
About
Gemma-3-27B-IT is Google DeepMind’s 27 billion parameter instruction-tuned model, released March 2025 under the commercially permissive Gemma license. It’s the largest model in the Gemma 3 family and introduces three capabilities new to the series: vision understanding via a SigLIP encoder, 128K token context, and expanded multilingual support. The paper (arXiv:2503.19786) describes it as “comparable to Gemini-1.5-Pro across benchmarks” — and the numbers back this up.
Why it matters
The Gemma 2 → Gemma 3 jump is one of the largest generation-over-generation improvements in open models:
- MATH: 55.6% → 89.0% (+33.4 points)
- HiddenMath: 14.8% → 60.3% (+45.5 points)
- MMLU-Pro: 56.9% → 67.5% (+10.6 points)
- GPQA Diamond: 34.3% → 42.4% (+8.1 points)
- LiveCodeBench: 20.4% → 29.7% (+9.3 points)
On MATH, Gemma 3 27B at 89.0% actually surpasses Gemini 1.5 Pro (86.5%). On Chatbot Arena, it reached Elo 1338 at rank #9 — tied with Qwen2.5-Max and o1-preview, ahead of o3-mini-high and DeepSeek-V3. For a 27B open model, this is exceptional placement.
The model also adds native vision: it processes images through a frozen 400M SigLIP encoder with Pan & Scan for flexible resolutions, scoring 64.9% on MMMU — competitive with Gemini 1.5 Pro (65.9%).
Architecture
| Spec | Value |
|---|---|
| Non-embedding Parameters | 25.6B |
| Embedding Parameters | 1.4B |
| Vision Encoder | SigLIP 400M (frozen, shared across 4B/12B/27B) |
| Attention | GQA with QK-norm (replaces Gemma 2’s soft-capping) |
| Normalization | RMSNorm (post-norm + pre-norm) |
| Layer Pattern | 5:1 local-to-global attention ratio |
| Local Attention Span | 1,024 tokens |
| Local RoPE Base | 10K |
| Global RoPE Base | 1M (for 128K context) |
| Context Length | 128K tokens |
| Vocabulary | 262K entries (SentencePiece, same as Gemini 2.0) |
| License | Gemma (commercially permissive) |
The 5:1 local/global attention pattern
The key architectural innovation is interleaving 5 local sliding-window layers for every 1 global attention layer. Local layers attend to only 1,024 tokens; global layers attend to the full 128K context. This dramatically reduces KV-cache memory — only 1/6 of the layers need to store full-context key-value pairs. The result: 128K context without the memory explosion that makes long-context impractical on consumer hardware.
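The savings can be sketched with back-of-the-envelope arithmetic. The 62-layer depth, KV-head count (16), and head dimension (128) below are assumptions chosen to be plausible for a 27B config, not values stated in this document:

```python
# Sketch: KV-cache footprint, all-global attention vs the 5:1 local/global mix.
# Layer count, KV-head count, and head dim are illustrative assumptions.

def kv_cache_bytes(n_global, n_local, seq_len, window=1024,
                   n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    """Total K+V storage: keys and values per layer, each [n_kv_heads, len, head_dim]."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return (n_global * seq_len + n_local * min(seq_len, window)) * per_token

LAYERS, SEQ = 62, 128_000          # assumed depth, full 128K context
all_global = kv_cache_bytes(LAYERS, 0, SEQ)                      # every layer global
mixed = kv_cache_bytes(LAYERS // 6, LAYERS - LAYERS // 6, SEQ)   # 5:1 pattern
print(f"all-global: {all_global / 1e9:.1f} GB, 5:1 mix: {mixed / 1e9:.1f} GB")
```

Under these assumptions the mixed pattern needs roughly a sixth of the all-global cache, since only ~10 of 62 layers store full-context keys and values.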
Vision
Images are processed by a frozen 400M SigLIP Vision Transformer at 896×896 resolution, condensed into 256 vectors per image. Pan & Scan (P&S) handles non-square and high-resolution images by segmenting them into non-overlapping crops at inference time — this is an inference-time optimization that can be disabled for speed. The vision encoder adds zero cost during language model training since embeddings are pre-computed.
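The per-image token budget can be reconstructed as follows. The patch size (14) and the 4×4 average-pooling factor are assumptions about the SigLIP setup, not stated above:

```python
# Sketch: how an 896x896 image becomes 256 vision tokens.
# Patch size and pooling factor are assumptions for illustration.
image_side = 896
patch_size = 14                    # assumed ViT patch size
grid = image_side // patch_size    # 64 patches per side
raw_tokens = grid * grid           # 4096 patch embeddings out of the encoder
pool = 4                           # assumed 4x4 average pooling
condensed = (grid // pool) ** 2    # 256 vectors handed to the language model
print(raw_tokens, condensed)
```

With Pan & Scan, each crop of a non-square image is resized and encoded the same way, so every crop contributes another fixed-size block of vision tokens.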
Training
Pretraining
- Data: 14T tokens (text + images), with expanded multilingual data vs Gemma 2
- Hardware: 6,144 TPUv5p chips
- Method: Knowledge distillation — sampling 256 logits per token weighted by teacher probabilities, student learns via cross-entropy loss
- Decontamination: Evaluation sets removed from pretraining data
- Safety filtering: CBRN content filters, personal information removal
All Gemma 3 models are trained entirely with knowledge distillation from a larger teacher. The tokenizer is shared with Gemini 2.0, providing better balance for non-English languages.
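A minimal NumPy sketch of the distillation objective described above. Renormalizing the teacher distribution over the kept top-256 logits is my reading of the sampled-logits scheme, not a published implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(teacher_logits, student_logits, k=256):
    """Cross-entropy of the student against the teacher's top-k distribution.

    Only the teacher's k highest-probability tokens carry weight; their
    probabilities are renormalized over the kept set (an assumption about
    how the sampled-logits scheme works)."""
    topk = np.argsort(teacher_logits)[..., -k:]              # indices of top-k logits
    t = softmax(np.take_along_axis(teacher_logits, topk, -1))  # renormalized teacher
    log_s = np.log(softmax(student_logits))
    log_s_k = np.take_along_axis(log_s, topk, -1)
    return float(-(t * log_s_k).sum(-1).mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 1024))   # toy vocab of 1024, batch of 4 positions
student = rng.normal(size=(4, 1024))
print(distill_loss(teacher, student, k=256))
```

A student whose logits match the teacher's scores markedly lower on this loss than a random student, which is the gradient signal driving distillation.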
Post-training
The paper describes a novel post-training recipe that “significantly improves math, chat, instruction-following and multilingual abilities”:
- Knowledge distillation from a large instruction-tuned teacher (improved approach from Agarwal et al., 2024)
- Reinforcement learning using improved versions of BOND, WARM, and WARP algorithms
- Reward signals: weight-averaged reward models trained on human feedback, code execution feedback, ground-truth math rewards
- Data filtering: personal information, unsafe outputs, self-identification errors, duplicates removed. Attribution and hedging data included to reduce hallucinations.
Quantization Aware Training
Google provides quantized checkpoints via 5,000 steps of QAT fine-tuning:
| Format | Weights Only | With KV Cache (32K) |
|---|---|---|
| BF16 | 54.0 GB | 72.7 GB |
| Int4 | 14.1 GB | 32.8 GB |
| Int4 (blocks=32) | 15.3 GB | 34.0 GB |
| SFP8 | 27.4 GB | 46.1 GB |
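The weights-only column is consistent with simple bytes-per-parameter arithmetic. This is a rough check using decimal GB (10^9 bytes), which matches the SFP8 row exactly; the small Int4 and BF16 gaps are presumably quantization scales and checkpoint overhead:

```python
# Rough check of the weights-only sizes against bytes-per-parameter math.
params = 27.4e9                    # total parameters (from the family table below)
GB = 1e9                           # decimal GB, apparently what the table uses

bf16_gb = params * 2 / GB          # 2 bytes/weight  -> 54.8 (table: 54.0)
int4_gb = params * 0.5 / GB        # 4 bits/weight   -> 13.7 (table: 14.1)
sfp8_gb = params * 1 / GB          # 1 byte/weight   -> 27.4 (table: 27.4)
print(bf16_gb, int4_gb, sfp8_gb)
```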
Benchmarks
All results from the paper’s Table 6, zero-shot evaluation.
vs Gemma 2 and Gemini
| Benchmark | Gemma 3 27B | Gemma 2 27B | Gemini 1.5 Flash | Gemini 1.5 Pro |
|---|---|---|---|---|
| MMLU-Pro | 67.5 | 56.9 | 67.3 | 75.8 |
| MATH | 89.0 | 55.6 | 77.9 | 86.5 |
| HiddenMath | 60.3 | 14.8 | 47.2 | 52.0 |
| GPQA Diamond | 42.4 | 34.3 | 51.0 | 59.1 |
| LiveCodeBench | 29.7 | 20.4 | 30.7 | 34.2 |
| Bird-SQL | 54.4 | 46.7 | 45.6 | 54.4 |
| SimpleQA | 10.0 | 9.2 | 8.6 | 24.9 |
| FACTS Grounding | 74.9 | 62.4 | 82.9 | 80.0 |
| Global MMLU-Lite | 75.1 | 68.6 | 73.7 | 80.8 |
| MMMU (vision) | 64.9 | — | 62.3 | 65.9 |
Gemma 3 27B surpasses Gemini 1.5 Pro on MATH (89.0 vs 86.5) and HiddenMath (60.3 vs 52.0), and ties it on Bird-SQL (54.4 each). It matches Gemini 1.5 Flash on MMLU-Pro. The model remains weaker on GPQA Diamond and SimpleQA, knowledge-heavy tasks where the larger Gemini models benefit from more pretraining data.
Chatbot Arena (human evaluation)
| Model | Elo | Rank |
|---|---|---|
| Grok-3-Preview | 1412 | 1 |
| GPT-4.5-Preview | 1411 | 1 |
| DeepSeek-R1 | 1363 | 6 |
| Gemma-3-27B-IT | 1338 | 9 |
| o1-preview | 1335 | 9 |
| o3-mini-high | 1329 | 13 |
| DeepSeek-V3 | 1318 | 14 |
An open 27B model at rank #9, above o3-mini and DeepSeek-V3 in human preference.
Deployment
Hardware requirements
| Format | Size | Minimum Hardware |
|---|---|---|
| BF16 | 54.0 GB | 2× RTX 4090 or A100-80GB |
| Int4 | 14.1 GB | Single RTX 4090 (24 GB) |
| SFP8 | 27.4 GB | A100-40GB or 2× RTX 3090 |
The Int4 quantized version at 14.1 GB is the practical sweet spot: it fits on a single consumer GPU with room for a KV cache at shorter contexts.
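Using the QAT table above, the per-token KV-cache cost and the context that fits next to Int4 weights on a 24 GB card can be estimated. This is a sketch; the runtime overhead figure is an assumption, and activation memory is ignored:

```python
# Estimate usable context for Int4 weights on a 24 GB GPU, from the QAT table.
weights_gb = 14.1                  # Int4, weights only
with_kv_32k_gb = 32.8              # Int4 weights + 32K-token KV cache
kv_per_token_gb = (with_kv_32k_gb - weights_gb) / 32_768   # ~0.57 MB per token

gpu_gb = 24.0
overhead_gb = 1.5                  # assumed CUDA/runtime overhead
headroom_gb = gpu_gb - weights_gb - overhead_gb
max_context = int(headroom_gb / kv_per_token_gb)
print(max_context)
```

Under these assumptions a 24 GB card supports a context in the low tens of thousands of tokens, comfortably beyond typical chat use but well short of the full 128K.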
Transformers
```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()
processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "url": "https://example.com/image.jpg"},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```
Serving
Compatible with vLLM, SGLang, and Ollama. Available through Google AI Studio, Vertex AI, and OpenRouter.
The Gemma 3 family
| Model | Vision | Total Params | Pretraining Tokens | Context |
|---|---|---|---|---|
| Gemma 3 1B | No | 1.0B | 2T | 32K |
| Gemma 3 4B | Yes (SigLIP) | 4.3B | 4T | 128K |
| Gemma 3 12B | Yes (SigLIP) | 12.2B | 12T | 128K |
| Gemma 3 27B | Yes (SigLIP) | 27.4B | 14T | 128K |
All models share the same frozen SigLIP vision encoder, the Gemini 2.0 tokenizer, and the 5:1 local/global attention pattern; the exception is the 1B model, which is text-only and capped at a 32K context. All were trained with knowledge distillation.
Languages
Expanded multilingual support compared to Gemma 2, using a tokenizer that is “more balanced for non-English languages.” Global MMLU-Lite score of 75.1 (up from 68.6 in Gemma 2) confirms improved cross-lingual performance.
References
- HuggingFace huggingface.co/google/gemma-3-27b-it
- Paper arxiv.org/abs/2503.19786