Hermes 3 Llama 3.1 70B
Released July 29, 2024
Open LLM Leaderboard v2 average: 38.50
About
Hermes-3-Llama-3.1-70B is a 70-billion-parameter instruct model from Nous Research, fine-tuned from Meta’s Llama-3.1-70B base model. Released August 2024 under the Llama 3.1 Community License, it is the mid-range model in the Hermes 3 family (8B, 70B, 405B). Hermes 3 is a neutrally-aligned generalist — the paper’s stated philosophy is “for Hermes, there is no such thing as latent thoughtcrime.” The model is designed to faithfully follow system prompts and user instructions without refusing on moral grounds, placing responsibility for guardrails at the system level rather than the model level.
Why it matters
Hermes 3 represents a deliberate alternative to the safety-first alignment approach used by Meta’s official instruct tunes. Where Llama-3.1-70B-Instruct is tuned to refuse certain categories of requests, Hermes 3 is trained to be steerable — it adopts whatever persona its system prompt defines and responds accordingly. The paper notes that the 405B version is so sensitive to system prompts that an empty system prompt doesn’t default to “helpful assistant” behavior.
The practical result is a model optimized for:
- Agentic workflows — built-in support for structured reasoning via special tokens (<SCRATCHPAD>, <REASONING>, <PLAN>, <EXECUTION>, <REFLECTION>, <THINKING>, <SOLUTION>, etc.)
- Tool use — the Hermes Function Calling standard uses JSON schemas in <tools> tags, with invocations in <tool_call> and responses in <tool_response>
- RAG — trained to cite retrieval sources using <co> tags
- Roleplaying — consistent persona maintenance across long multi-turn conversations using the full 128K context window
- Creative writing — reduced refusal behavior means the model engages with creative scenarios that over-aligned models decline
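The structured-reasoning tags above are typically elicited through the system prompt. A minimal sketch, assuming only the tag names from the paper (the prompt wording itself is illustrative, not the official one):

```python
# Build a system prompt that asks the model to reason inside the special
# tags Hermes 3 was trained on. Tag names come from the paper; the
# surrounding instructions are a hypothetical example.
REASONING_TAGS = ["SCRATCHPAD", "REASONING", "PLAN", "EXECUTION", "REFLECTION"]

def make_reasoning_system_prompt(tags=REASONING_TAGS):
    tag_list = ", ".join(f"<{t}>...</{t}>" for t in tags)
    return (
        "You are a deliberate problem solver. Before answering, think "
        f"step by step inside these blocks: {tag_list}. "
        "Put the final answer inside <SOLUTION>...</SOLUTION>."
    )

system_prompt = make_reasoning_system_prompt()
```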
The benchmark tradeoff is real: Hermes 3 70B beats Llama-3.1-Instruct on AGIEval, HellaSwag, MT-Bench, MuSR, and TruthfulQA, but loses on IFEval, MATH, MMLU, and GPQA. It optimizes for steerability and instruction faithfulness over raw benchmark performance.
Architecture
| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer (LlamaForCausalLM) |
| Parameters | 70B |
| Base Model | Meta Llama-3.1-70B |
| Context Length | 128K tokens |
| Prompt Format | ChatML (<|im_start|> / <|im_end|>) |
| Training Sequence Length | 8,192 tokens |
| Disk Size | 141.1 GB |
| License | Llama 3.1 Community License |
Standard Llama 3.1 architecture with no structural modifications; all improvements come from post-training.
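The ChatML prompt format from the table above can be assembled by hand when not using a chat template. A minimal sketch:

```python
# Minimal ChatML prompt builder (<|im_start|> / <|im_end|>, as in the
# table above). Ending with an opened assistant turn is the standard way
# to cue generation in ChatML.
def to_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are Hermes, a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
```

In practice the tokenizer's built-in chat template produces the same layout; this just makes the token structure explicit.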
Training
Data (from paper Table 1)
The Hermes 3 dataset totals approximately 390 million tokens (270M output tokens contributing to loss, 120M input tokens). Data was curated between March and August 2024.
| Category | Proportion | Tokens (M) |
|---|---|---|
| General Instructions | 60.6% | 236 |
| Domain Expert | 12.8% | 50 |
| Math | 6.7% | 26 |
| Roleplaying | 6.1% | 24 |
| Coding | 4.5% | 18 |
| Tool Use, Agentic, and RAG | 4.3% | 17 |
| Content Generation | 3.0% | 12 |
| Steering and Alignment | 2.5% | 10 |
Data sources include existing curated datasets and domain-specific synthetic data generated with Evol-Instruct-inspired schemes. Filtering removed refusals, improperly formatted responses, and empty turns; curation prioritized outputs from the strongest generator models.
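The per-category token counts in the table are consistent with the stated ~390M total (270M output + 120M input), up to rounding:

```python
# Sanity check of the data-mix table: category token counts (in millions)
# should sum to roughly the stated ~390M total.
category_tokens_m = {
    "General Instructions": 236,
    "Domain Expert": 50,
    "Math": 26,
    "Roleplaying": 24,
    "Coding": 18,
    "Tool Use, Agentic, and RAG": 17,
    "Content Generation": 12,
    "Steering and Alignment": 10,
}
total = sum(category_tokens_m.values())  # 393M, within rounding of ~390M
```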
SFT phase
| Detail | Value |
|---|---|
| Optimizer | AdamW, weight decay 0.01 |
| Learning Rate | 7e-6 (cosine decay after 300-step warmup) |
| Epochs | 4 (selected epoch 3 for 70B based on benchmark scores) |
| Batch Size | 48 |
| GPUs | 48 (6 HGX nodes, 8×H100 SXM5 each) |
| Interconnect | Quantum-2 InfiniBand |
| Training Time | 648 GPU-hours |
| Packing | Flash Attention 2 sample packing at 96% efficiency |
| Sequence Length | 8,192 tokens |
| Framework | Modified Axolotl |
| Loss | Cross-entropy on response/tool-use tokens only (instruction tokens masked) |
Learning rate was selected via hyperparameter sweep on 8B models. The 70B model was distributed across nodes using PyTorch FSDP.
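The loss masking in the table above is the standard SFT recipe: instruction tokens are assigned the cross-entropy ignore index so only response and tool-use tokens contribute to the loss. A minimal sketch (the -100 ignore index is the PyTorch convention; the token ids here are toy values):

```python
# Mask instruction tokens out of the loss by setting their labels to the
# ignore index; only response positions keep their token ids.
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def mask_labels(token_ids, is_response_mask):
    """Return labels where non-response positions are ignored by the loss."""
    return [tok if resp else IGNORE_INDEX
            for tok, resp in zip(token_ids, is_response_mask)]

tokens = [101, 7592, 2088, 102, 3000, 999]   # toy ids: 4 instruction, 2 response
mask   = [False, False, False, False, True, True]
labels = mask_labels(tokens, mask)
# labels == [-100, -100, -100, -100, 3000, 999]
```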
DPO phase
For the 70B model, DPO provided only negligible performance improvements. The final released model is the SFT-only checkpoint — no DPO applied to the 70B.
DPO was applied to the 8B model (LoRA, r=32, α=16, dropout 0.05, RMSProp optimizer, NEFTune α=5) with moderate positive impact, but wasn’t worth the added complexity at 70B scale.
Benchmarks
From the paper’s Table 5. All evaluations performed by the authors.
| Benchmark | Hermes 3 70B | Llama 3.1 Instruct 70B |
|---|---|---|
| AGIEval (0-shot) | 56.18 | 48.26 |
| ARC-C (0-shot) | 65.53 | 63.40 |
| BoolQ (0-shot) | 88.04 | 87.76 |
| BBH (3-shot) | 67.82 | 69.24 |
| GPQA (0-shot) | 37.67 | 40.09 |
| HellaSwag (10-shot) | 88.19 | 86.42 |
| IFEval (Strict) | 81.21 | 87.25 |
| MATH Lvl 5 (4-shot) | 20.80 | 29.24 |
| MMLU (5-shot) | 79.09 | 82.27 |
| MMLU-PRO (5-shot) | 47.24 | 52.94 |
| MT-Bench (Avg) | 8.99 | 8.93 |
| MuSR (0-shot) | 50.67 | 47.08 |
| TruthfulQA (MC2) | 63.29 | 59.91 |
| WinoGrande (5-shot) | 83.19 | 85.00 |
The pattern is consistent with Hermes 3’s design priorities. It wins on reasoning-adjacent benchmarks (AGIEval, MuSR), conversation quality (MT-Bench), and honesty (TruthfulQA +3.4 points). It loses on academic benchmarks (MMLU, MMLU-PRO, MATH) and instruction following (IFEval -6 points). The IFEval gap likely reflects Meta’s specific optimization for structured instruction formats.
Tool use
Hermes 3 uses the Hermes Function Calling standard (documented at NousResearch/Hermes-Function-Calling):
```
<tools>
[{"type": "function", "function": {"name": "get_weather", "parameters": {...}}}]
</tools>
```

The model responds with:

```
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call>
```
Tool responses are fed back via the tool role wrapped in <tool_response> tags. Multi-turn tool chains are supported. The 4.3% of training data dedicated to tool use, agentic patterns, and RAG (17M tokens) establishes these capabilities.
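The round trip above can be sketched in a few lines. The tag names (<tools>, <tool_call>, <tool_response>) are from the standard; the surrounding prompt wording and helper names are illustrative:

```python
import json
import re

# Sketch of the Hermes Function Calling flow: advertise tools in the
# system prompt, parse <tool_call> blocks from the completion, and wrap
# results in <tool_response> for the tool role.
def tools_system_prompt(tools):
    return ("You may call functions. Available tools:\n"
            f"<tools>\n{json.dumps(tools)}\n</tools>\n"
            "Emit calls as <tool_call>{...}</tool_call>.")

def parse_tool_calls(completion):
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, completion, re.S)]

def tool_response_message(result):
    return {"role": "tool",
            "content": f"<tool_response>\n{json.dumps(result)}\n</tool_response>"}

completion = ('<tool_call>\n'
              '{"name": "get_weather", "arguments": {"location": "Paris"}}\n'
              '</tool_call>')
calls = parse_tool_calls(completion)
```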
Deployment
Hardware
| Format | Size | Minimum Hardware |
|---|---|---|
| FP16 | 141.1 GB | 2× A100-80GB |
| FP8 | ~70 GB | A100-80GB |
| 4-bit GGUF | ~35 GB | RTX 4090 (24 GB) with partial CPU offload |
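The sizes in the table follow directly from parameter count times bytes per parameter (runtime use adds KV cache and activation overhead on top, and real GGUF files run somewhat larger once quantization scales and metadata are included):

```python
# Back-of-envelope weight memory: parameter count x bits per parameter.
def weight_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

fp16 = weight_gb(70e9, 16)  # 140 GB -> matches the 141.1 GB disk size
fp8  = weight_gb(70e9, 8)   # 70 GB
q4   = weight_gb(70e9, 4)   # 35 GB; actual GGUF quants add scale metadata
```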
vLLM
At FP16 the weights do not fit on a single 80 GB GPU, so shard across two with tensor parallelism:

```
vllm serve NousResearch/Hermes-3-Llama-3.1-70B --tensor-parallel-size 2
```
API
Available on OpenRouter at $0.30/M tokens (input and output).
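A hedged sketch of calling the model through OpenRouter, which exposes an OpenAI-compatible chat completions endpoint; the model slug and endpoint path below are assumptions based on OpenRouter's usual conventions, so check their docs before relying on them:

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "nousresearch/hermes-3-llama-3.1-70b"  # assumed slug, verify on OpenRouter

def build_request(api_key, user_message, system_prompt="You are Hermes."):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_request("sk-or-your-key", "What is the capital of France?")
# urllib.request.urlopen(req)  # uncomment to actually send (network call)
```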
The Hermes 3 family
| Size | GPUs Used | GPU Hours | Selected Epoch | DPO Applied |
|---|---|---|---|---|
| 8B | 48 | 147 | 4 | Yes (LoRA) |
| 70B | 48 | 648 | 3 | No |
| 405B | 128 | 2,086 | 4 | No |
The 405B model required 16 HGX nodes (128 H100s) and a reduced learning rate (3.5e-6 vs 7e-6). At the minimum viable configuration of 7 nodes, CPU parameter offloading was required, cutting training efficiency by 45%. 405B evaluations were performed under FP8 quantization.
Nous Research
Nous Research is an independent AI lab founded by Ryan Teknium, Jeffrey Quesnelle, and Chen Guang. Their alignment philosophy — put guardrails at the system level, not the model level — has made Hermes a community favorite for use cases where commercial models over-refuse: creative writing, autonomous agents, roleplaying, and research applications. Hermes 4 has since been released as the successor.
References
- HuggingFace: huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B
- Paper: arxiv.org/abs/2408.11857