Gemma 3 27B IT
Released March 12, 2025
About
Gemma-3-27B-IT is Google DeepMind’s 27 billion parameter instruction-tuned model, released March 2025 under the commercially permissive Gemma license. It’s the largest model in the Gemma 3 family and introduces three capabilities new to the series: vision understanding via a SigLIP encoder, 128K token context, and expanded multilingual support. The paper (arXiv:2503.19786) describes it as “comparable to Gemini-1.5-Pro across benchmarks” — and the numbers back this up.
Why it matters
The Gemma 2 → Gemma 3 jump is one of the largest generation-over-generation improvements in open models:
- MATH: 55.6% → 89.0% (+33.4 points)
- HiddenMath: 14.8% → 60.3% (+45.5 points)
- MMLU-Pro: 56.9% → 67.5% (+10.6 points)
- GPQA Diamond: 34.3% → 42.4% (+8.1 points)
- LiveCodeBench: 20.4% → 29.7% (+9.3 points)
On MATH, Gemma 3 27B at 89.0% actually surpasses Gemini 1.5 Pro (86.5%). On Chatbot Arena, it reached Elo 1338 at rank #9 — tied with Qwen2.5-Max and o1-preview, ahead of o3-mini-high and DeepSeek-V3. For a 27B open model, this is exceptional placement.
The model also adds native vision: it processes images through a frozen 400M SigLIP encoder with Pan & Scan for flexible resolutions, scoring 64.9% on MMMU — competitive with Gemini 1.5 Pro (65.9%).
Architecture
| Spec | Value |
|---|---|
| Non-embedding Parameters | 25.6B |
| Embedding Parameters | 1.4B |
| Vision Encoder | SigLIP 400M (frozen, shared across 4B/12B/27B) |
| Attention | GQA with QK-norm (replaces Gemma 2’s soft-capping) |
| Normalization | RMSNorm (post-norm + pre-norm) |
| Layer Pattern | 5:1 local-to-global attention ratio |
| Local Attention Span | 1,024 tokens |
| Local RoPE Base | 10K |
| Global RoPE Base | 1M (for 128K context) |
| Context Length | 128K tokens |
| Vocabulary | 262K entries (SentencePiece, same as Gemini 2.0) |
| License | Gemma (commercially permissive) |
The 5:1 local/global attention pattern
The key architectural innovation is interleaving 5 local sliding-window layers for every 1 global attention layer. Local layers attend to only 1,024 tokens; global layers attend to the full 128K context. This dramatically reduces KV-cache memory — only 1/6 of the layers need to store full-context key-value pairs. The result: 128K context without the memory explosion that makes long-context impractical on consumer hardware.
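The savings can be sketched with back-of-the-envelope arithmetic. The 62-layer depth, KV-head count (16), and head dimension (128) below are assumptions chosen to be plausible for a 27B config, not values stated in this document:

```python
# Sketch: KV-cache footprint, all-global attention vs the 5:1 local/global mix.
# Layer count, KV-head count, and head dim are illustrative assumptions.

def kv_cache_bytes(n_global, n_local, seq_len, window=1024,
                   n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    """Total K+V storage: keys and values per layer, each [n_kv_heads, len, head_dim]."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return (n_global * seq_len + n_local * min(seq_len, window)) * per_token

LAYERS, SEQ = 62, 128_000          # assumed depth, full 128K context
all_global = kv_cache_bytes(LAYERS, 0, SEQ)                      # every layer global
mixed = kv_cache_bytes(LAYERS // 6, LAYERS - LAYERS // 6, SEQ)   # 5:1 pattern
print(f"all-global: {all_global / 1e9:.1f} GB, 5:1 mix: {mixed / 1e9:.1f} GB")
```

Under these assumptions the mixed pattern needs roughly a sixth of the all-global cache, since only ~10 of 62 layers store full-context keys and values.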
Vision
Images are processed by a frozen 400M SigLIP Vision Transformer at 896×896 resolution, condensed into 256 vectors per image. Pan & Scan (P&S) handles non-square and high-resolution images by segmenting them into non-overlapping crops at inference time — this is an inference-time optimization that can be disabled for speed. The vision encoder adds zero cost during language model training since embeddings are pre-computed.
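The per-image token budget can be reconstructed as follows. The patch size (14) and the 4×4 average-pooling factor are assumptions about the SigLIP setup, not stated above:

```python
# Sketch: how an 896x896 image becomes 256 vision tokens.
# Patch size and pooling factor are assumptions for illustration.
image_side = 896
patch_size = 14                    # assumed ViT patch size
grid = image_side // patch_size    # 64 patches per side
raw_tokens = grid * grid           # 4096 patch embeddings out of the encoder
pool = 4                           # assumed 4x4 average pooling
condensed = (grid // pool) ** 2    # 256 vectors handed to the language model
print(raw_tokens, condensed)
```

With Pan & Scan, each crop of a non-square image is resized and encoded the same way, so every crop contributes another fixed-size block of vision tokens.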
Training
Pretraining
- Data: 14T tokens (text + images), with expanded multilingual data vs Gemma 2
- Hardware: 6,144 TPUv5p chips
- Method: Knowledge distillation — sampling 256 logits per token weighted by teacher probabilities, student learns via cross-entropy loss
- Decontamination: Evaluation sets removed from pretraining data
- Safety filtering: CBRN content filters, personal information removal
All Gemma 3 models are trained entirely with knowledge distillation from a larger teacher. The tokenizer is shared with Gemini 2.0, providing better balance for non-English languages.
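A minimal NumPy sketch of the distillation objective described above. Renormalizing the teacher distribution over the kept top-256 logits is my reading of the sampled-logits scheme, not a published implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(teacher_logits, student_logits, k=256):
    """Cross-entropy of the student against the teacher's top-k distribution.

    Only the teacher's k highest-probability tokens carry weight; their
    probabilities are renormalized over the kept set (an assumption about
    how the sampled-logits scheme works)."""
    topk = np.argsort(teacher_logits)[..., -k:]              # indices of top-k logits
    t = softmax(np.take_along_axis(teacher_logits, topk, -1))  # renormalized teacher
    log_s = np.log(softmax(student_logits))
    log_s_k = np.take_along_axis(log_s, topk, -1)
    return float(-(t * log_s_k).sum(-1).mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 1024))   # toy vocab of 1024, batch of 4 positions
student = rng.normal(size=(4, 1024))
print(distill_loss(teacher, student, k=256))
```

A student whose logits match the teacher's scores markedly lower on this loss than a random student, which is the gradient signal driving distillation.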
Post-training
The paper describes a novel post-training recipe that “significantly improves math, chat, instruction-following and multilingual abilities”:
- Knowledge distillation from a large instruction-tuned teacher (improved approach from Agarwal et al., 2024)
- Reinforcement learning using improved versions of BOND, WARM, and WARP algorithms
- Reward signals: weight-averaged reward models trained on human feedback, code execution feedback, ground-truth math rewards
- Data filtering: personal information, unsafe outputs, self-identification errors, duplicates removed. Attribution and hedging data included to reduce hallucinations.
Quantization Aware Training
Google provides quantized checkpoints via 5,000 steps of QAT fine-tuning:
| Format | Weights Only | With KV Cache (32K) |
|---|---|---|
| BF16 | 54.0 GB | 72.7 GB |
| Int4 | 14.1 GB | 32.8 GB |
| Int4 (blocks=32) | 15.3 GB | 34.0 GB |
| SFP8 | 27.4 GB | 46.1 GB |
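The weights-only column is consistent with simple bytes-per-parameter arithmetic. This is a rough check using decimal GB (10^9 bytes), which matches the SFP8 row exactly; the small Int4 and BF16 gaps are presumably quantization scales and checkpoint overhead:

```python
# Rough check of the weights-only sizes against bytes-per-parameter math.
params = 27.4e9                    # total parameters (from the family table below)
GB = 1e9                           # decimal GB, apparently what the table uses

bf16_gb = params * 2 / GB          # 2 bytes/weight  -> 54.8 (table: 54.0)
int4_gb = params * 0.5 / GB        # 4 bits/weight   -> 13.7 (table: 14.1)
sfp8_gb = params * 1 / GB          # 1 byte/weight   -> 27.4 (table: 27.4)
print(bf16_gb, int4_gb, sfp8_gb)
```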
Benchmarks
All results from the paper’s Table 6, zero-shot evaluation.
vs Gemma 2 and Gemini
| Benchmark | Gemma 3 27B | Gemma 2 27B | Gemini 1.5 Flash | Gemini 1.5 Pro |
|---|---|---|---|---|
| MMLU-Pro | 67.5 | 56.9 | 67.3 | 75.8 |
| MATH | 89.0 | 55.6 | 77.9 | 86.5 |
| HiddenMath | 60.3 | 14.8 | 47.2 | 52.0 |
| GPQA Diamond | 42.4 | 34.3 | 51.0 | 59.1 |
| LiveCodeBench | 29.7 | 20.4 | 30.7 | 34.2 |
| Bird-SQL | 54.4 | 46.7 | 45.6 | 54.4 |
| SimpleQA | 10.0 | 9.2 | 8.6 | 24.9 |
| FACTS Grounding | 74.9 | 62.4 | 82.9 | 80.0 |
| Global MMLU-Lite | 75.1 | 68.6 | 73.7 | 80.8 |
| MMMU (vision) | 64.9 | — | 62.3 | 65.9 |
Gemma 3 27B surpasses Gemini 1.5 Pro on MATH (89.0 vs 86.5) and HiddenMath (60.3 vs 52.0), and ties it on Bird-SQL (54.4 each). It matches Gemini 1.5 Flash on MMLU-Pro. The model remains weaker on GPQA Diamond and SimpleQA, knowledge-heavy tasks where the larger Gemini models benefit from more pretraining data.
Chatbot Arena (human evaluation)
| Model | Elo | Rank |
|---|---|---|
| Grok-3-Preview | 1412 | 1 |
| GPT-4.5-Preview | 1411 | 1 |
| DeepSeek-R1 | 1363 | 6 |
| Gemma-3-27B-IT | 1338 | 9 |
| o1-preview | 1335 | 9 |
| o3-mini-high | 1329 | 13 |
| DeepSeek-V3 | 1318 | 14 |
An open 27B model at rank #9, above o3-mini and DeepSeek-V3 in human preference.
Deployment
Hardware requirements
| Format | Size | Minimum Hardware |
|---|---|---|
| BF16 | 54.0 GB | 2× RTX 4090 or A100-80GB |
| Int4 | 14.1 GB | Single RTX 4090 (24 GB) |
| SFP8 | 27.4 GB | A100-40GB or 2× RTX 3090 |
The Int4 quantized version at 14.1 GB is the practical sweet spot: it fits on a single consumer GPU with room for a KV cache at shorter contexts.
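Using the QAT table above, the per-token KV-cache cost and the context that fits next to Int4 weights on a 24 GB card can be estimated. This is a sketch; the runtime overhead figure is an assumption, and activation memory is ignored:

```python
# Estimate usable context for Int4 weights on a 24 GB GPU, from the QAT table.
weights_gb = 14.1                  # Int4, weights only
with_kv_32k_gb = 32.8              # Int4 weights + 32K-token KV cache
kv_per_token_gb = (with_kv_32k_gb - weights_gb) / 32_768   # ~0.57 MB per token

gpu_gb = 24.0
overhead_gb = 1.5                  # assumed CUDA/runtime overhead
headroom_gb = gpu_gb - weights_gb - overhead_gb
max_context = int(headroom_gb / kv_per_token_gb)
print(max_context)
```

Under these assumptions a 24 GB card supports a context in the low tens of thousands of tokens, comfortably beyond typical chat use but well short of the full 128K.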
Transformers
```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()
processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "url": "https://example.com/image.jpg"},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```
Serving
Compatible with vLLM, SGLang, and Ollama. Available through Google AI Studio, Vertex AI, and OpenRouter.
The Gemma 3 family
| Model | Vision | Total Params | Pretraining Tokens | Context |
|---|---|---|---|---|
| Gemma 3 1B | No | 1.0B | 2T | 32K |
| Gemma 3 4B | Yes (SigLIP) | 4.3B | 4T | 128K |
| Gemma 3 12B | Yes (SigLIP) | 12.2B | 12T | 128K |
| Gemma 3 27B | Yes (SigLIP) | 27.4B | 14T | 128K |
All models share the same frozen SigLIP vision encoder, the Gemini 2.0 tokenizer, and the 5:1 local/global attention pattern; the exception is the 1B model, which is text-only and capped at a 32K context. All were trained with knowledge distillation.
Languages
Expanded multilingual support compared to Gemma 2, using a tokenizer that is “more balanced for non-English languages.” Global MMLU-Lite score of 75.1 (up from 68.6 in Gemma 2) confirms improved cross-lingual performance.
References
- HuggingFace huggingface.co/google/gemma-3-27b-it
- Paper arxiv.org/abs/2503.19786