Mistral Small 24B Instruct 2501
Released January 30, 2025
About
Mistral Small 3 is a 24 billion parameter dense model from Mistral AI, released January 2025 under Apache 2.0. It’s designed for one thing above all: latency. Mistral describes it as “on par with Llama 3.3 70B instruct, while being more than 3× faster on the same hardware.” The model uses far fewer layers than competing architectures, achieving over 150 tokens/sec output speed. At 24B parameters, it fits on a single RTX 4090 or a 32 GB RAM MacBook when quantized — positioned as an open-source replacement for GPT-4o-mini.
Why it matters
Mistral Small 3 occupies a specific niche: the fastest competitive model you can run locally. The design philosophy is deliberately different from the reasoning-model trend — this model is not trained with RL or synthetic data. Mistral positions it as “earlier in the model production pipeline” than models like DeepSeek-R1, offering it as a strong base for the community to build reasoning capabilities on top of.
The human evaluation results tell the competitive story. In blind side-by-side tests with 1,000+ coding and generalist prompts evaluated by a third-party vendor:
| vs Model | Mistral Preferred | Tie | Other Preferred |
|---|---|---|---|
| Gemma-2-27B | 73.2% | 5.2% | 21.6% |
| Qwen-2.5-32B | 68.0% | 6.0% | 26.0% |
| Llama-3.3-70B | 35.6% | 23.6% | 40.8% |
| GPT-4o-mini | 40.4% | 16.0% | 43.6% |
It decisively beats same-size competitors (Gemma-2-27B, Qwen-2.5-32B) and is roughly even with Llama-3.3-70B, a model nearly 3× its size, and with GPT-4o-mini. At the latency it offers, that's a strong proposition.
Architecture
| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer |
| Parameters | 24B |
| Design Emphasis | Minimal layer count for latency |
| Tokenizer | Tekken (131K vocabulary) |
| Context Length | 32K tokens |
| Output Speed | 150+ tokens/sec |
| GPU Memory (BF16) | ~55 GB |
| License | Apache 2.0 |
Mistral has not published a detailed architecture paper for this model. The key design decision: far fewer layers than competing models of similar parameter count, which substantially reduces time per forward pass. The model achieves over 81% MMLU accuracy at 150 tokens/sec — Mistral calls it “the most efficient model of its category.”
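To make the latency claim concrete, here is a rough back-of-envelope calculation. The 150 tokens/sec figure is Mistral's published number; the ~50 tokens/sec rate for a 70B-class model is an assumption implied by the "more than 3× faster" claim, not a measured value:

```python
# Rough time-to-complete arithmetic for a 500-token response.
# 150 tok/s is Mistral's published figure; ~50 tok/s for the 70B-class
# comparison is an assumption derived from the "3x faster" claim.
def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec

small3 = seconds_for(500, 150)   # ~3.3 s
llama70b = seconds_for(500, 50)  # ~10 s
print(f"Mistral Small 3: {small3:.1f}s, 70B-class: {llama70b:.1f}s")
```

For interactive agents, the difference between a ~3-second and a ~10-second reply is the difference between conversational and sluggish, which is the whole pitch.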
Training
Pretraining
No detailed pretraining paper has been published. Known facts:
- Base model: Mistral-Small-24B-Base-2501
- Knowledge cutoff: October 2023
- Not trained with RL — no reinforcement learning from human feedback
- Not trained with synthetic data — trained on curated natural data only
- Released as both pretrained and instruction-tuned checkpoints
Post-training
Instruction tuning without RL or synthetic data is notable — most competitive models at this level (Llama 3.3, Qwen 2.5, phi-4) use RLHF or DPO together with synthetic data. Mistral is deliberately offering the SFT-only checkpoint as a base for the community.
Benchmarks
All scores from Mistral’s internal evaluation pipeline (same pipeline applied to all models for fair comparison).
Reasoning and knowledge
| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU-Pro (5-shot CoT) | 66.3% | 53.6% | 66.6% | 68.3% | 61.7% |
| GPQA Main (5-shot CoT) | 45.3% | 34.4% | 53.1% | 40.4% | 37.7% |
MMLU-Pro is essentially tied with Llama-3.3-70B (66.3% vs 66.6%) with roughly a third of the parameters. GPQA shows a gap — knowledge-heavy science tasks benefit from the larger model's broader pretraining.
Math and coding
| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| HumanEval (pass@1) | 84.8% | 73.2% | 85.4% | 90.9% | 89.0% |
| MATH | 70.6% | 53.5% | 74.3% | 81.9% | 76.1% |
Competitive on HumanEval (within 1% of Llama-3.3-70B). MATH shows a clearer gap against Qwen2.5-32B, which benefits from extensive synthetic math data in its training.
Instruction following and chat
| Benchmark | Mistral Small 3 | Gemma-2-27B | Llama-3.3-70B | Qwen2.5-32B | GPT-4o-mini |
|---|---|---|---|---|---|
| MT-Bench | 8.35 | 7.86 | 7.96 | 8.26 | 8.33 |
| WildBench | 52.27 | 48.21 | 50.04 | 52.73 | 56.13 |
| ArenaHard | 87.3% | 78.8% | 84.0% | 86.0% | 89.7% |
| IFEval | 82.9% | 80.7% | 88.4% | 84.0% | 85.0% |
Mistral Small 3 tops MT-Bench at 8.35, edging out both Llama-3.3-70B (7.96) and GPT-4o-mini (8.33). Its ArenaHard score of 87.3% is the strongest among the open-weight models here.
Key features
- Native function calling — built-in support for tool use with JSON schema definitions
- JSON output mode — structured output for agent workflows
- Strong system prompt adherence — reliable persona and instruction following
- Multilingual — English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, and more
- Recommended temperature: 0.15 for most tasks
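Native function calling follows the JSON-schema convention used by OpenAI-compatible APIs. A minimal sketch of a tool definition; the `get_weather` function and its parameters are hypothetical examples, not part of Mistral's API:

```python
# Hypothetical tool definition in the JSON-schema style used by
# OpenAI-compatible function-calling endpoints. When the model decides
# to use the tool, it returns a structured tool call instead of free text.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```

Passed in the `tools` list of a chat-completion request, the model can answer with a tool call naming `get_weather` and JSON arguments matching this schema.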
Deployment
Hardware
| Format | Memory |
|---|---|
| BF16/FP16 | ~55 GB (2× RTX 4090 or A100-80GB) |
| 8-bit | ~28 GB (RTX 4090 or A100-40GB) |
| 4-bit (GGUF) | ~14 GB (single RTX 4090, or MacBook with 32GB RAM) |
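The table's figures follow from simple arithmetic: weight memory is parameter count × bytes per parameter, plus runtime overhead for the KV cache and activations (the gap between raw weights and the table's totals; the overhead itself is not a published number):

```python
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Raw weight memory in GB: parameter count x bytes per parameter."""
    return params_b * 1e9 * (bits_per_param / 8) / 1e9

print(weight_gb(24, 16))  # BF16: 48 GB weights -> ~55 GB with KV cache/overhead
print(weight_gb(24, 8))   # 8-bit: 24 GB weights -> ~28 GB total
print(weight_gb(24, 4))   # 4-bit: 12 GB weights -> ~14 GB total
```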
Ollama
```shell
ollama run mistral-small                          # 4-bit default
ollama run mistral-small:24b-instruct-2501-q8_0   # 8-bit
ollama run mistral-small:24b-instruct-2501-fp16   # full precision
```
vLLM
```shell
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
```
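Once serving, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal stdlib-only sketch of a chat request; the localhost URL is vLLM's default and the 0.15 temperature follows the recommendation above. The network call is left commented out since it needs a running server:

```python
import json
import urllib.request

def build_chat_request(prompt: str) -> dict:
    # OpenAI-compatible chat payload; 0.15 is Mistral's recommended temperature.
    return {
        "model": "mistralai/Mistral-Small-24B-Instruct-2501",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.15,
    }

def send(payload: dict,
         url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    # POST the payload as JSON and return the decoded response.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Say hello in French.")
# reply = send(payload)  # uncomment with a running vLLM server
# print(reply["choices"][0]["message"]["content"])
```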
API
Available on Mistral’s la Plateforme as mistral-small-latest, plus Together AI, Fireworks AI, IBM WatsonX, Kaggle, and OpenRouter.
Prompt format
Uses the V7-Tekken template:
```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]
```
Use the mistral-common Python library as the canonical reference for tokenization and formatting.
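For illustration, the template can be rendered with plain string formatting. This sketch mirrors the V7-Tekken layout shown above; for production use, mistral-common remains the canonical implementation:

```python
def render_v7_tekken(system: str, user: str) -> str:
    # Mirrors the V7-Tekken layout: BOS token, system block, then the
    # instruction-wrapped user message.
    return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"

prompt = render_v7_tekken("You are a concise assistant.", "What is 2+2?")
print(prompt)
```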
Use cases
Mistral positions the model for four primary scenarios:
- Fast-response conversational agents — where latency matters more than maximum capability
- Low-latency function calling — rapid tool use in automated/agentic workflows
- Domain-specific fine-tuning — base for specialized models in legal, medical, financial, and technical domains
- Local inference — for privacy-sensitive data or hobbyist use on consumer hardware
Competitive position
Mistral Small 3 competes in the “efficient mid-range” tier — larger than phi-4 (14B) but smaller than Llama-3.3 (70B). Its distinctive value proposition is latency: 150+ tokens/sec with Llama-3.3-level performance makes it the optimal choice when response speed is the primary constraint. The Apache 2.0 license with no RL in the training pipeline also makes it an attractive fine-tuning base — downstream trainers get a clean SFT checkpoint without inherited RL behaviors.
References
- HuggingFace huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501