Google DeepMind · 27B · text-generation

Gemma 3 27B IT

Released March 12, 2025

Context Window: 128K (131,072) tokens ≈ 98 pages of text
Disk Size: 54.0 GB (BF16)

About

Gemma-3-27B-IT is Google DeepMind’s 27 billion parameter instruction-tuned model, released March 2025 under the commercially permissive Gemma license. It’s the largest model in the Gemma 3 family and introduces three capabilities new to the series: vision understanding via a SigLIP encoder, 128K token context, and expanded multilingual support. The paper (arXiv:2503.19786) describes it as “comparable to Gemini-1.5-Pro across benchmarks” — and the numbers back this up.

Why it matters

The Gemma 2 → Gemma 3 jump is one of the largest generation-over-generation improvements in open models:

  • MATH: 55.6% → 89.0% (+33.4 points)
  • HiddenMath: 14.8% → 60.3% (+45.5 points)
  • MMLU-Pro: 56.9% → 67.5% (+10.6 points)
  • GPQA Diamond: 34.3% → 42.4% (+8.1 points)
  • LiveCodeBench: 20.4% → 29.7% (+9.3 points)

On MATH, Gemma 3 27B at 89.0% actually surpasses Gemini 1.5 Pro (86.5%). On Chatbot Arena, it reached Elo 1338 at rank #9 — tied with Qwen2.5-Max and o1-preview, ahead of o3-mini-high and DeepSeek-V3. For a 27B open model, this is exceptional placement.

The model also adds native vision: it processes images through a frozen 400M SigLIP encoder with Pan & Scan for flexible resolutions, scoring 64.9% on MMMU — competitive with Gemini 1.5 Pro (65.9%).

Architecture

| Spec | Value |
|---|---|
| Non-embedding Parameters | 25.6B |
| Embedding Parameters | 1.4B |
| Vision Encoder | SigLIP 400M (frozen, shared across 4B/12B/27B) |
| Attention | GQA with QK-norm (replaces Gemma 2's soft-capping) |
| Normalization | RMSNorm (pre-norm + post-norm) |
| Layer Pattern | 5:1 local-to-global attention ratio |
| Local Attention Span | 1,024 tokens |
| Local RoPE Base | 10K |
| Global RoPE Base | 1M (for 128K context) |
| Context Length | 128K tokens |
| Vocabulary | 262K entries (SentencePiece, same as Gemini 2.0) |
| License | Gemma (commercially permissive) |

The 5:1 local/global attention pattern

The key architectural innovation is interleaving five local sliding-window layers for every one global attention layer. Local layers attend only to a 1,024-token window; global layers attend to the full 128K context. This dramatically reduces KV-cache memory: only one layer in six needs to store full-context key-value pairs. The result is 128K context without the memory blow-up that usually makes long contexts impractical on consumer hardware.
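The arithmetic behind that saving can be sketched directly. A rough estimate, where the layer count, number of global layers, KV-head count, and head dimension are assumptions about the 27B configuration rather than figures stated above:

```python
def kv_cache_gib(n_layers, n_global, context, local_window,
                 n_kv_heads=16, head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size: global layers cache keys/values for the
    full context, local layers only for their sliding window."""
    n_local = n_layers - n_global
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value  # K + V, bf16
    cached_tokens = n_global * context + n_local * min(local_window, context)
    return cached_tokens * per_token / 2**30

# All-global baseline vs the 5:1 interleave at the full 128K context
# (62 layers with ~10 global ones assumed for the 27B model).
full = kv_cache_gib(62, 62, 131_072, 1_024)
mixed = kv_cache_gib(62, 10, 131_072, 1_024)
print(f"all-global: {full:.1f} GiB  interleaved: {mixed:.1f} GiB")
# → all-global: 62.0 GiB  interleaved: 10.4 GiB
```

Under these assumptions the interleaved cache is roughly a sixth of the all-global one, which is exactly the "1/6 of the layers" intuition above.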

Vision

Images are processed by a frozen 400M SigLIP Vision Transformer at 896×896 resolution, condensed into 256 vectors per image. Pan & Scan (P&S) handles non-square and high-resolution images by segmenting them into non-overlapping crops at inference time — this is an inference-time optimization that can be disabled for speed. The vision encoder adds zero cost during language model training since embeddings are pre-computed.
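The 256-vector figure follows from simple patch arithmetic; the 14-pixel SigLIP patch size used below is an assumption, not stated in the text above:

```python
# Token budget per image (SigLIP patch size of 14 px is assumed).
patch_px = 14
patches_per_side = 896 // patch_px        # 64 patches per side
raw_patches = patches_per_side ** 2       # 4096 patch embeddings
image_tokens = 256                        # condensed vectors per image
pool_ratio = raw_patches // image_tokens  # 16, e.g. a 4x4 average pool
print(raw_patches, pool_ratio)  # → 4096 16
```

So each image costs the language model a flat 256 tokens regardless of its original resolution; P&S crops each add their own 256-token budget.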

Training

Pretraining

  • Data: 14T tokens (text + images), with expanded multilingual data vs Gemma 2
  • Hardware: 6,144 TPUv5p chips
  • Method: Knowledge distillation — 256 logits are sampled per token, weighted by teacher probabilities, and the student learns via a cross-entropy loss
  • Decontamination: Evaluation sets removed from pretraining data
  • Safety filtering: CBRN content filters, personal information removal

All Gemma 3 models are trained entirely with knowledge distillation from a larger teacher. The tokenizer is shared with Gemini 2.0, providing better balance for non-English languages.
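A minimal sketch of that sampled-distillation objective in plain Python. The sampling and renormalization details here are assumptions about one reasonable implementation, not the paper's exact recipe:

```python
import math
import random

def distill_loss(teacher_logits, student_logprobs, k=256, rng=random):
    """Sampled distillation sketch: draw k vocabulary ids with probability
    proportional to the teacher distribution, renormalize the teacher mass
    over that slice, and take the cross-entropy against the student's
    log-probabilities on the same ids."""
    m = max(teacher_logits)
    exps = [math.exp(x - m) for x in teacher_logits]
    total = sum(exps)
    probs = [e / total for e in exps]                        # teacher softmax
    ids = sorted(set(rng.choices(range(len(probs)), weights=probs,
                                 k=min(k, len(probs)))))     # sampled slice
    mass = sum(probs[i] for i in ids)
    return -sum(probs[i] / mass * student_logprobs[i] for i in ids)

# Toy 6-token vocabulary; student initialized to match the teacher.
teacher = [2.0, 1.0, 0.5, 0.0, -1.0, -3.0]
log_z = math.log(sum(math.exp(x) for x in teacher))
student = [x - log_z for x in teacher]                       # log-softmax
loss = distill_loss(teacher, student)
print(f"{loss:.3f}")
```

With the student equal to the teacher, the loss reduces to the entropy of the renormalized teacher slice; in training, minimizing it pulls the student toward the teacher on the sampled vocabulary.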

Post-training

The paper describes a novel post-training recipe that “significantly improves math, chat, instruction-following and multilingual abilities”:

  1. Knowledge distillation from a large instruction-tuned teacher (improved approach from Agarwal et al., 2024)
  2. Reinforcement learning using improved versions of BOND, WARM, and WARP algorithms
  3. Reward signals: weight-averaged reward models trained on human feedback, code execution feedback, ground-truth math rewards
  4. Data filtering: personal information, unsafe outputs, self-identification errors, duplicates removed. Attribution and hedging data included to reduce hallucinations.

Quantization-Aware Training

Google provides quantized checkpoints produced with 5,000 steps of quantization-aware training (QAT) fine-tuning:

| Format | Weights Only | With KV Cache (32K) |
|---|---|---|
| BF16 | 54.0 GB | 72.7 GB |
| Int4 | 14.1 GB | 32.8 GB |
| Int4 (blocks=32) | 15.3 GB | 34.0 GB |
| SFP8 | 27.4 GB | 46.1 GB |
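These sizes line up with simple parameters-times-precision arithmetic; small deviations are expected from block scales, mixed-precision pieces, and GB/GiB rounding:

```python
# Rough size check: total parameters x bytes per parameter.
params = 27.4e9  # non-embedding + embedding + vision encoder
sizes_gb = {name: params * bits / 8 / 1e9
            for name, bits in [("bf16", 16), ("sfp8", 8), ("int4", 4)]}
print(sizes_gb)  # bf16 ≈ 54.8, sfp8 ≈ 27.4, int4 ≈ 13.7
```

The SFP8 figure matches exactly, and Int4's extra ~0.4 GB over the naive 13.7 GB is consistent with per-block quantization scales and a few tensors kept at higher precision.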

Benchmarks

All results from the paper’s Table 6, zero-shot evaluation.

vs Gemma 2 and Gemini

| Benchmark | Gemma 3 27B | Gemma 2 27B | Gemini 1.5 Flash | Gemini 1.5 Pro |
|---|---|---|---|---|
| MMLU-Pro | 67.5 | 56.9 | 67.3 | 75.8 |
| MATH | 89.0 | 55.6 | 77.9 | 86.5 |
| HiddenMath | 60.3 | 14.8 | 47.2 | 52.0 |
| GPQA Diamond | 42.4 | 34.3 | 51.0 | 59.1 |
| LiveCodeBench | 29.7 | 20.4 | 30.7 | 34.2 |
| Bird-SQL | 54.4 | 46.7 | 45.6 | 54.4 |
| SimpleQA | 10.0 | 9.2 | 8.6 | 24.9 |
| FACTS Grounding | 74.9 | 62.4 | 82.9 | 80.0 |
| Global MMLU-Lite | 75.1 | 68.6 | 73.7 | 80.8 |
| MMMU (vision) | 64.9 | — | 62.3 | 65.9 |

Gemma 3 27B surpasses Gemini 1.5 Pro on MATH (89.0 vs 86.5) and HiddenMath (60.3 vs 52.0), and matches it on Bird-SQL (54.4 each). It also effectively matches Gemini 1.5 Flash on MMLU-Pro (67.5 vs 67.3). The model remains weaker on GPQA Diamond and SimpleQA, knowledge-heavy tasks where the larger Gemini models benefit from more pretraining data.

Chatbot Arena (human evaluation)

| Model | Elo | Rank |
|---|---|---|
| Grok-3-Preview | 1412 | 1 |
| GPT-4.5-Preview | 1411 | 1 |
| DeepSeek-R1 | 1363 | 6 |
| Gemma-3-27B-IT | 1338 | 9 |
| o1-preview | 1335 | 9 |
| o3-mini-high | 1329 | 13 |
| DeepSeek-V3 | 1318 | 14 |

An open 27B model at rank #9, above o3-mini and DeepSeek-V3 in human preference.

Deployment

Hardware requirements

| Format | Size | Minimum Hardware |
|---|---|---|
| BF16 | 54.0 GB | A100-80GB or 2× A100-40GB |
| Int4 | 14.1 GB | Single RTX 4090 (24 GB) |
| SFP8 | 27.4 GB | A100-40GB or 2× RTX 3090 |

The Int4 quantized version at 14.1 GB is the practical sweet spot — fits on a single consumer GPU with room for KV cache at shorter contexts.
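A quick headroom check for that sweet-spot claim, using only the numbers from the tables above:

```python
# VRAM headroom on a 24 GB card with the Int4 checkpoint.
vram_gb = 24.0
int4_weights = 14.1
kv_32k = 32.8 - 14.1          # 32K-context KV cost implied by the QAT table
headroom = vram_gb - int4_weights
print(f"{headroom:.1f} GB free; a full 32K KV cache needs {kv_32k:.1f} GB")
# → 9.9 GB free; a full 32K KV cache needs 18.7 GB
```

So on a single 24 GB GPU the usable context tops out well below 32K unless the KV cache is also quantized or offloaded.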

Transformers

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()
processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "url": "https://example.com/image.jpg"},
    ]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,   # needed so **inputs unpacks into generate()
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))

Serving

Compatible with vLLM, SGLang, and Ollama. Available through Google AI Studio, Vertex AI, and OpenRouter.

The Gemma 3 family

| Model | Vision | Total Params | Pretraining Tokens | Context |
|---|---|---|---|---|
| Gemma 3 1B | No | 1.0B | 2T | 32K |
| Gemma 3 4B | Yes (SigLIP) | 4.3B | 4T | 128K |
| Gemma 3 12B | Yes (SigLIP) | 12.2B | 12T | 128K |
| Gemma 3 27B | Yes (SigLIP) | 27.4B | 14T | 128K |

All models share the same frozen SigLIP vision encoder, the Gemini 2.0 tokenizer, and the 5:1 local/global attention pattern (except the 1B, which has no vision encoder and a 32K context). All are trained with knowledge distillation.

Languages

Expanded multilingual support compared to Gemma 2, using a tokenizer that is “more balanced for non-English languages.” Global MMLU-Lite score of 75.1 (up from 68.6 in Gemma 2) confirms improved cross-lingual performance.

References

  • 🤗 HuggingFace huggingface.co/google/gemma-3-27b-it
  • 📄 Paper arxiv.org/abs/2503.19786