Mistral AI · 24B · Apache-2.0 · text-generation

Mistral Small 24B Instruct 2501

Released January 30, 2025

Context Window 32K tokens
≈ 25 pages of text

About

Mistral Small 3 is a 24 billion parameter dense model from Mistral AI, released January 2025 under Apache 2.0. It’s designed for one thing above all: latency. Mistral describes it as “on par with Llama 3.3 70B instruct, while being more than 3× faster on the same hardware.” The model uses far fewer layers than competing architectures, achieving over 150 tokens/sec output speed. At 24B parameters, it fits on a single RTX 4090 or a 32 GB RAM MacBook when quantized — positioned as an open-source replacement for GPT-4o-mini.

Why it matters

Mistral Small 3 occupies a specific niche: the fastest competitive model you can run locally. The design philosophy is deliberately different from the reasoning-model trend — this model is not trained with RL or synthetic data. Mistral positions it as “earlier in the model production pipeline” than models like DeepSeek-R1, offering it as a strong base for the community to build reasoning capabilities on top of.

The human evaluation results tell the competitive story. In blind side-by-side tests with 1,000+ coding and generalist prompts evaluated by a third-party vendor:

| vs Model | Mistral Preferred | Tie | Other Preferred |
|---|---|---|---|
| Gemma-2-27B | 73.2% | 5.2% | 21.6% |
| Qwen-2.5-32B | 68.0% | 6.0% | 26.0% |
| Llama-3.3-70B | 35.6% | 23.6% | 40.8% |
| GPT-4o-mini | 40.4% | 16.0% | 43.6% |

It decisively beats same-size competitors (Gemma-2-27B, Qwen-2.5-32B) and roughly ties with models 3× its size (Llama-3.3-70B) and GPT-4o-mini. At the latency it offers, that’s a strong proposition.

Architecture

| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer |
| Parameters | 24B |
| Design Emphasis | Minimal layer count for latency |
| Tokenizer | Tekken (131K vocabulary) |
| Context Length | 32K tokens |
| Output Speed | 150+ tokens/sec |
| GPU Memory (BF16) | ~55 GB |
| License | Apache 2.0 |

Mistral has not published a detailed architecture paper for this model. The key design decision: far fewer layers than competing models of similar parameter count, which substantially reduces time per forward pass. The model achieves over 81% MMLU accuracy at 150 tokens/sec — Mistral calls it “the most efficient model of its category.”

Training

Pretraining

No detailed pretraining paper has been published. Known facts:

  • Base model: Mistral-Small-24B-Base-2501
  • Knowledge cutoff: October 2023
  • Not trained with RL — no reinforcement learning from human feedback
  • Not trained with synthetic data — trained on curated natural data only
  • Released as both pretrained and instruction-tuned checkpoints

Post-training

Instruction tuning without RL or synthetic data is notable: most competitive models at this level (Llama 3.3, Qwen 2.5, phi-4) use RLHF or DPO with synthetic data. Mistral deliberately offers the SFT-only checkpoint as a base for the community to build on.

Benchmarks

All scores from Mistral’s internal evaluation pipeline (same pipeline applied to all models for fair comparison).

Reasoning and knowledge

| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU-Pro (5-shot CoT) | 66.3% | 53.6% | 66.6% | 68.3% | 61.7% |
| GPQA Main (5-shot CoT) | 45.3% | 34.4% | 53.1% | 40.4% | 37.7% |

MMLU-Pro is essentially tied with Llama-3.3-70B (66.3% vs 66.6%) with roughly a third of the parameters. GPQA shows a gap: knowledge-heavy science tasks benefit from the larger model's broader pretraining.

Math and coding

| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| HumanEval (pass@1) | 84.8% | 73.2% | 85.4% | 90.9% | 89.0% |
| MATH | 70.6% | 53.5% | 74.3% | 81.9% | 76.1% |

Competitive on HumanEval (within 1% of Llama-3.3-70B). MATH shows a clearer gap against Qwen2.5-32B, which benefits from extensive synthetic math data in its training.

Instruction following and chat

| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| MT-Bench | 8.35 | 7.86 | 7.96 | 8.26 | 8.33 |
| WildBench | 52.27 | 48.21 | 50.04 | 52.73 | 56.13 |
| ArenaHard | 87.3% | 78.8% | 84.0% | 86.0% | 89.7% |
| IFEval | 82.9% | 80.7% | 88.4% | 84.0% | 85.0% |

MT-Bench leader across all comparisons at 8.35 — beating both Llama-3.3-70B (7.96) and GPT-4o-mini (8.33). ArenaHard at 87.3% is the strongest in the open-weight category here.

Key features

  • Native function calling — built-in support for tool use with JSON schema definitions
  • JSON output mode — structured output for agent workflows
  • Strong system prompt adherence — reliable persona and instruction following
  • Multilingual — English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, and more
  • Recommended temperature: 0.15 for most tasks
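As a sketch of how native function calling is wired up, the hypothetical `get_weather` tool below uses the OpenAI-style JSON schema format that Mistral's API and OpenAI-compatible servers accept (the tool name and fields are illustrative, not from the model card):

```python
# Hypothetical tool definition in the OpenAI-style JSON schema format
# accepted by Mistral's API and OpenAI-compatible serving stacks.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative name, not from the model card
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A chat request that offers the tool to the model; the server decides
# whether to answer directly or emit a tool call.
request_body = {
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [get_weather_tool],
    "temperature": 0.15,  # recommended setting from above
}
```

If the model chooses to call the tool, the response carries a structured tool call with JSON arguments rather than free text, which is what makes the JSON output mode useful for agent workflows.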

Deployment

Hardware

| Format | Memory |
|---|---|
| BF16/FP16 | ~55 GB (2× RTX 4090 or A100-80GB) |
| 8-bit | ~28 GB (RTX 4090 or A100-40GB) |
| 4-bit (GGUF) | ~14 GB (single RTX 4090, or MacBook with 32GB RAM) |
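The memory figures above follow from simple parameter arithmetic: parameter count × bytes per weight gives the raw weight storage, and the listed totals add a few GB of runtime overhead for activations and KV cache. A back-of-envelope check:

```python
PARAMS = 24e9  # 24B parameters

def weight_memory_gb(bits_per_param: float) -> float:
    """Raw weight storage in GB, ignoring activation/KV-cache overhead."""
    return PARAMS * bits_per_param / 8 / 1e9

print(round(weight_memory_gb(16)))  # BF16: 48 GB weights -> ~55 GB total in practice
print(round(weight_memory_gb(8)))   # 8-bit: 24 GB weights -> ~28 GB total
print(round(weight_memory_gb(4)))   # 4-bit: 12 GB weights -> ~14 GB total
```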

Ollama

```shell
ollama run mistral-small                           # 4-bit default
ollama run mistral-small:24b-instruct-2501-q8_0    # 8-bit
ollama run mistral-small:24b-instruct-2501-fp16    # full precision
```

vLLM

```shell
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
```
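Once the server is up, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal client sketch using only the standard library; the endpoint path and port are vLLM defaults, and the final call is commented out because it needs a running server:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM default endpoint

def build_request(prompt: str) -> dict:
    """Assemble a chat-completion body for the served model."""
    return {
        "model": "mistralai/Mistral-Small-24B-Instruct-2501",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.15,  # recommended setting for most tasks
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarize the Apache 2.0 license in one sentence.")  # needs a live server
```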

API

Available on Mistral’s la Plateforme as mistral-small-latest, plus Together AI, Fireworks AI, IBM WatsonX, Kaggle, and OpenRouter.

Prompt format

Uses the V7-Tekken template:

```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]
```

Use the mistral-common Python library as the canonical reference for tokenization and formatting.
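For illustration only, a single-turn prompt in this template can be rendered as a plain string; this sketch skips the multi-turn history and special-token handling that mistral-common covers:

```python
def format_v7_tekken(system: str, user: str) -> str:
    """Render a single-turn prompt in the V7-Tekken template.
    Illustration only: use the mistral-common library for real
    tokenization, multi-turn history, and special-token edge cases."""
    return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"

prompt = format_v7_tekken("You are a concise assistant.", "What is 2 + 2?")
print(prompt)
```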

Use cases

Mistral positions the model for four primary scenarios:

  1. Fast-response conversational agents — where latency matters more than maximum capability
  2. Low-latency function calling — rapid tool use in automated/agentic workflows
  3. Domain-specific fine-tuning — base for specialized models in legal, medical, financial, and technical domains
  4. Local inference — for privacy-sensitive data or hobbyist use on consumer hardware

Competitive position

Mistral Small 3 competes in the “efficient mid-range” tier — larger than phi-4 (14B) but smaller than Llama-3.3 (70B). Its distinctive value proposition is latency: 150+ tokens/sec with Llama-3.3-level performance makes it the optimal choice when response speed is the primary constraint. The Apache 2.0 license with no RL in the training pipeline also makes it an attractive fine-tuning base — downstream trainers get a clean SFT checkpoint without inherited RL behaviors.

References

  • 🤗 HuggingFace huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501