Mistral AI · 24B · Apache-2.0 · text-generation

Mistral Small 24B Instruct 2501

Released January 30, 2025

Context Window 32K tokens
≈ 25 pages of text

About

Mistral Small 3 is a 24 billion parameter dense model from Mistral AI, released January 2025 under Apache 2.0. It’s designed for one thing above all: latency. Mistral describes it as “on par with Llama 3.3 70B instruct, while being more than 3× faster on the same hardware.” The model uses far fewer layers than competing architectures, achieving over 150 tokens/sec output speed. At 24B parameters, it fits on a single RTX 4090 or a 32 GB RAM MacBook when quantized — positioned as an open-source replacement for GPT-4o-mini.

Why it matters

Mistral Small 3 occupies a specific niche: the fastest competitive model you can run locally. The design philosophy is deliberately different from the reasoning-model trend — this model is not trained with RL or synthetic data. Mistral positions it as “earlier in the model production pipeline” than models like DeepSeek-R1, offering it as a strong base for the community to build reasoning capabilities on top of.

The human evaluation results tell the competitive story. In blind side-by-side tests with 1,000+ coding and generalist prompts evaluated by a third-party vendor:

| vs Model | Mistral Preferred | Tie | Other Preferred |
|---|---|---|---|
| Gemma-2-27B | 73.2% | 5.2% | 21.6% |
| Qwen-2.5-32B | 68.0% | 6.0% | 26.0% |
| Llama-3.3-70B | 35.6% | 23.6% | 40.8% |
| GPT-4o-mini | 40.4% | 16.0% | 43.6% |

It decisively beats same-size competitors (Gemma-2-27B, Qwen-2.5-32B) and roughly ties with models 3× its size (Llama-3.3-70B) and GPT-4o-mini. At the latency it offers, that’s a strong proposition.

Architecture

| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer |
| Parameters | 24B |
| Design Emphasis | Minimal layer count for latency |
| Tokenizer | Tekken (131K vocabulary) |
| Context Length | 32K tokens |
| Output Speed | 150+ tokens/sec |
| GPU Memory (BF16) | ~55 GB |
| License | Apache 2.0 |

Mistral has not published a detailed architecture paper for this model. The key design decision: far fewer layers than competing models of similar parameter count, which substantially reduces time per forward pass. The model achieves over 81% MMLU accuracy at 150 tokens/sec — Mistral calls it “the most efficient model of its category.”

Training

Pretraining

No detailed pretraining paper has been published. Known facts:

  • Base model: Mistral-Small-24B-Base-2501
  • Knowledge cutoff: October 2023
  • Not trained with RL — no reinforcement learning from human feedback
  • Not trained with synthetic data — trained on curated natural data only
  • Released as both pretrained and instruction-tuned checkpoints

Post-training

Instruction tuning without RL or synthetic data is notable: most competitive models at this level (Llama 3.3, Qwen 2.5, phi-4) use RLHF or DPO with synthetic data. Mistral deliberately offers the SFT-only checkpoint as a base for the community to build on.

Benchmarks

All scores from Mistral’s internal evaluation pipeline (same pipeline applied to all models for fair comparison).

Reasoning and knowledge

| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU-Pro (5-shot CoT) | 66.3% | 53.6% | 66.6% | 68.3% | 61.7% |
| GPQA Main (5-shot CoT) | 45.3% | 34.4% | 53.1% | 40.4% | 37.7% |

MMLU-Pro is essentially tied with Llama-3.3-70B (66.3% vs 66.6%) with roughly a third of the parameters. GPQA shows a gap: knowledge-heavy science tasks benefit from the larger model's broader pretraining.

Math and coding

| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| HumanEval (pass@1) | 84.8% | 73.2% | 85.4% | 90.9% | 89.0% |
| MATH | 70.6% | 53.5% | 74.3% | 81.9% | 76.1% |

Competitive on HumanEval (within 1% of Llama-3.3-70B). MATH shows a clearer gap against Qwen2.5-32B, which benefits from extensive synthetic math data in its training.

Instruction following and chat

| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| MT-Bench | 8.35 | 7.86 | 7.96 | 8.26 | 8.33 |
| WildBench | 52.27 | 48.21 | 50.04 | 52.73 | 56.13 |
| ArenaHard | 87.3% | 78.8% | 84.0% | 86.0% | 89.7% |
| IFEval | 82.9% | 80.7% | 88.4% | 84.0% | 85.0% |

MT-Bench leader across all comparisons at 8.35 — beating both Llama-3.3-70B (7.96) and GPT-4o-mini (8.33). ArenaHard at 87.3% is the strongest in the open-weight category here.

Key features

  • Native function calling — built-in support for tool use with JSON schema definitions
  • JSON output mode — structured output for agent workflows
  • Strong system prompt adherence — reliable persona and instruction following
  • Multilingual — English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, and more
  • Recommended temperature: 0.15 for most tasks
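As a sketch of how native function calling is wired up, the hypothetical `get_weather` tool below uses the OpenAI-style JSON schema format that Mistral's API and OpenAI-compatible servers accept (the tool name and fields are illustrative, not from the model card):

```python
# Hypothetical tool definition in the OpenAI-style JSON schema format
# accepted by Mistral's API and OpenAI-compatible serving stacks.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative name, not from the model card
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A chat request that offers the tool to the model; the server decides
# whether to answer directly or emit a tool call.
request_body = {
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [get_weather_tool],
    "temperature": 0.15,  # recommended setting from above
}
```

If the model chooses to call the tool, the response carries a structured tool call with JSON arguments rather than free text, which is what makes the JSON output mode useful for agent workflows.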

Deployment

Hardware

| Format | Memory |
|---|---|
| BF16/FP16 | ~55 GB (2× RTX 4090 or A100-80GB) |
| 8-bit | ~28 GB (RTX 4090 or A100-40GB) |
| 4-bit (GGUF) | ~14 GB (single RTX 4090, or MacBook with 32GB RAM) |
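The memory figures above follow from simple parameter arithmetic: parameter count × bytes per weight gives the raw weight storage, and the listed totals add a few GB of runtime overhead for activations and KV cache. A back-of-envelope check:

```python
PARAMS = 24e9  # 24B parameters

def weight_memory_gb(bits_per_param: float) -> float:
    """Raw weight storage in GB, ignoring activation/KV-cache overhead."""
    return PARAMS * bits_per_param / 8 / 1e9

print(round(weight_memory_gb(16)))  # BF16: 48 GB weights -> ~55 GB total in practice
print(round(weight_memory_gb(8)))   # 8-bit: 24 GB weights -> ~28 GB total
print(round(weight_memory_gb(4)))   # 4-bit: 12 GB weights -> ~14 GB total
```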

Ollama

```shell
ollama run mistral-small                           # 4-bit default
ollama run mistral-small:24b-instruct-2501-q8_0    # 8-bit
ollama run mistral-small:24b-instruct-2501-fp16    # full precision
```

vLLM

```shell
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
```
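Once the server is up, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal client sketch using only the standard library; the endpoint path and port are vLLM defaults, and the final call is commented out because it needs a running server:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM default endpoint

def build_request(prompt: str) -> dict:
    """Assemble a chat-completion body for the served model."""
    return {
        "model": "mistralai/Mistral-Small-24B-Instruct-2501",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.15,  # recommended setting for most tasks
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarize the Apache 2.0 license in one sentence.")  # needs a live server
```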

API

Available on Mistral’s la Plateforme as mistral-small-latest, plus Together AI, Fireworks AI, IBM WatsonX, Kaggle, and OpenRouter.

Prompt format

Uses the V7-Tekken template:

```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]
```

Use the mistral-common Python library as the canonical reference for tokenization and formatting.
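For illustration only, a single-turn prompt in this template can be rendered as a plain string; this sketch skips the multi-turn history and special-token handling that mistral-common covers:

```python
def format_v7_tekken(system: str, user: str) -> str:
    """Render a single-turn prompt in the V7-Tekken template.
    Illustration only: use the mistral-common library for real
    tokenization, multi-turn history, and special-token edge cases."""
    return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"

prompt = format_v7_tekken("You are a concise assistant.", "What is 2 + 2?")
print(prompt)
```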

Use cases

Mistral positions the model for four primary scenarios:

  1. Fast-response conversational agents — where latency matters more than maximum capability
  2. Low-latency function calling — rapid tool use in automated/agentic workflows
  3. Domain-specific fine-tuning — base for specialized models in legal, medical, financial, and technical domains
  4. Local inference — for privacy-sensitive data or hobbyist use on consumer hardware

Competitive position

Mistral Small 3 competes in the “efficient mid-range” tier — larger than phi-4 (14B) but smaller than Llama-3.3 (70B). Its distinctive value proposition is latency: 150+ tokens/sec with Llama-3.3-level performance makes it the optimal choice when response speed is the primary constraint. The Apache 2.0 license with no RL in the training pipeline also makes it an attractive fine-tuning base — downstream trainers get a clean SFT checkpoint without inherited RL behaviors.

References

  • 🤗 HuggingFace huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501