Hermes 3 Llama 3.1 70B
Released July 29, 2024
Open LLM Leaderboard v2 average: 38.50
About
Hermes-3-Llama-3.1-70B is a 70-billion-parameter instruct model from Nous Research, fine-tuned from Meta’s Llama-3.1-70B base model. Released August 2024 under the Llama 3.1 Community License, it is the mid-range model in the Hermes 3 family (8B, 70B, 405B). Hermes 3 is a neutrally-aligned generalist — the paper’s stated philosophy is “for Hermes, there is no such thing as latent thoughtcrime.” The model is designed to faithfully follow system prompts and user instructions without refusing on moral grounds, placing responsibility for guardrails at the system level rather than the model level.
Why it matters
Hermes 3 represents a deliberate alternative to the safety-first alignment approach used by Meta’s official instruct tunes. Where Llama-3.1-70B-Instruct is tuned to refuse certain categories of requests, Hermes 3 is trained to be steerable — it adopts whatever persona its system prompt defines and responds accordingly. The paper notes that the 405B version is so sensitive to system prompts that an empty system prompt doesn’t default to “helpful assistant” behavior.
The practical result is a model optimized for:
- Agentic workflows — built-in support for structured reasoning via special tokens (<SCRATCHPAD>, <REASONING>, <PLAN>, <EXECUTION>, <REFLECTION>, <THINKING>, <SOLUTION>, etc.)
- Tool use — the Hermes Function Calling standard uses JSON schemas in <tools> tags, with invocations in <tool_call> and responses in <tool_response>
- RAG — trained to cite retrieval sources using <co> tags
- Roleplaying — consistent persona maintenance across long multi-turn conversations using the full 128K context window
- Creative writing — reduced refusal behavior means the model engages with creative scenarios that over-aligned models decline
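The structured-reasoning tags above are typically elicited through the system prompt. A minimal sketch, assuming only the tag names from the paper (the prompt wording itself is illustrative, not the official one):

```python
# Build a system prompt that asks the model to reason inside the special
# tags Hermes 3 was trained on. Tag names come from the paper; the
# surrounding instructions are a hypothetical example.
REASONING_TAGS = ["SCRATCHPAD", "REASONING", "PLAN", "EXECUTION", "REFLECTION"]

def make_reasoning_system_prompt(tags=REASONING_TAGS):
    tag_list = ", ".join(f"<{t}>...</{t}>" for t in tags)
    return (
        "You are a deliberate problem solver. Before answering, think "
        f"step by step inside these blocks: {tag_list}. "
        "Put the final answer inside <SOLUTION>...</SOLUTION>."
    )

system_prompt = make_reasoning_system_prompt()
```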
The benchmark tradeoff is real: Hermes 3 70B beats Llama-3.1-Instruct on AGIEval, HellaSwag, MT-Bench, MuSR, and TruthfulQA, but loses on IFEval, MATH, MMLU, and GPQA. It optimizes for steerability and instruction faithfulness over raw benchmark performance.
Architecture
| Spec | Value |
|---|---|
| Architecture | Dense decoder-only Transformer (LlamaForCausalLM) |
| Parameters | 70B |
| Base Model | Meta Llama-3.1-70B |
| Context Length | 128K tokens |
| Prompt Format | ChatML (<|im_start|> / <|im_end|>) |
| Training Sequence Length | 8,192 tokens |
| Disk Size | 141.1 GB |
| License | Llama 3.1 Community License |
Standard Llama 3.1 architecture with no structural modifications; all improvements come from post-training.
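The ChatML prompt format from the table above can be assembled by hand when not using a chat template. A minimal sketch:

```python
# Minimal ChatML prompt builder (<|im_start|> / <|im_end|>, as in the
# table above). Ending with an opened assistant turn is the standard way
# to cue generation in ChatML.
def to_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are Hermes, a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
```

In practice the tokenizer's built-in chat template produces the same layout; this just makes the token structure explicit.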
Training
Data (from paper Table 1)
The Hermes 3 dataset totals approximately 390 million tokens (270M output tokens contributing to loss, 120M input tokens). Data was curated between March and August 2024.
| Category | Proportion | Tokens (M) |
|---|---|---|
| General Instructions | 60.6% | 236 |
| Domain Expert | 12.8% | 50 |
| Math | 6.7% | 26 |
| Roleplaying | 6.1% | 24 |
| Coding | 4.5% | 18 |
| Tool Use, Agentic, and RAG | 4.3% | 17 |
| Content Generation | 3.0% | 12 |
| Steering and Alignment | 2.5% | 10 |
Data sources include existing curated datasets and domain-specific synthetic data generated with Evol-Instruct-inspired schemes. Filtering removed refusals, improperly formatted responses, and empty turns; curation prioritized outputs from the strongest generator models.
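The per-category token counts in the table are consistent with the stated ~390M total (270M output + 120M input), up to rounding:

```python
# Sanity check of the data-mix table: category token counts (in millions)
# should sum to roughly the stated ~390M total.
category_tokens_m = {
    "General Instructions": 236,
    "Domain Expert": 50,
    "Math": 26,
    "Roleplaying": 24,
    "Coding": 18,
    "Tool Use, Agentic, and RAG": 17,
    "Content Generation": 12,
    "Steering and Alignment": 10,
}
total = sum(category_tokens_m.values())  # 393M, within rounding of ~390M
```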
SFT phase
| Detail | Value |
|---|---|
| Optimizer | AdamW, weight decay 0.01 |
| Learning Rate | 7e-6 (cosine decay after 300-step warmup) |
| Epochs | 4 (selected epoch 3 for 70B based on benchmark scores) |
| Batch Size | 48 |
| GPUs | 48 (6 HGX nodes, 8×H100 SXM5 each) |
| Interconnect | Quantum-2 InfiniBand |
| Training Time | 648 GPU-hours |
| Packing | Flash Attention 2 sample packing at 96% efficiency |
| Sequence Length | 8,192 tokens |
| Framework | Modified Axolotl |
| Loss | Cross-entropy on response/tool-use tokens only (instruction tokens masked) |
Learning rate was selected via hyperparameter sweep on 8B models. The 70B model was distributed across nodes using PyTorch FSDP.
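The loss masking in the table above is the standard SFT recipe: instruction tokens are assigned the cross-entropy ignore index so only response and tool-use tokens contribute to the loss. A minimal sketch (the -100 ignore index is the PyTorch convention; the token ids here are toy values):

```python
# Mask instruction tokens out of the loss by setting their labels to the
# ignore index; only response positions keep their token ids.
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def mask_labels(token_ids, is_response_mask):
    """Return labels where non-response positions are ignored by the loss."""
    return [tok if resp else IGNORE_INDEX
            for tok, resp in zip(token_ids, is_response_mask)]

tokens = [101, 7592, 2088, 102, 3000, 999]   # toy ids: 4 instruction, 2 response
mask   = [False, False, False, False, True, True]
labels = mask_labels(tokens, mask)
# labels == [-100, -100, -100, -100, 3000, 999]
```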
DPO phase
For the 70B model, DPO provided only negligible performance improvements. The final released model is the SFT-only checkpoint — no DPO applied to the 70B.
DPO was applied to the 8B model (LoRA, r=32, α=16, dropout 0.05, RMSProp optimizer, NEFTune α=5) with moderate positive impact, but wasn’t worth the added complexity at 70B scale.
Benchmarks
From the paper’s Table 5. All evaluations performed by the authors.
| Benchmark | Hermes 3 70B | Llama 3.1 Instruct 70B |
|---|---|---|
| AGIEval (0-shot) | 56.18 | 48.26 |
| ARC-C (0-shot) | 65.53 | 63.40 |
| BoolQ (0-shot) | 88.04 | 87.76 |
| BBH (3-shot) | 67.82 | 69.24 |
| GPQA (0-shot) | 37.67 | 40.09 |
| HellaSwag (10-shot) | 88.19 | 86.42 |
| IFEval (Strict) | 81.21 | 87.25 |
| MATH Lvl 5 (4-shot) | 20.80 | 29.24 |
| MMLU (5-shot) | 79.09 | 82.27 |
| MMLU-PRO (5-shot) | 47.24 | 52.94 |
| MT-Bench (Avg) | 8.99 | 8.93 |
| MuSR (0-shot) | 50.67 | 47.08 |
| TruthfulQA (MC2) | 63.29 | 59.91 |
| WinoGrande (5-shot) | 83.19 | 85.00 |
The pattern is consistent with Hermes 3’s design priorities. It wins on reasoning-adjacent benchmarks (AGIEval, MuSR), conversation quality (MT-Bench), and honesty (TruthfulQA +3.4 points). It loses on academic benchmarks (MMLU, MMLU-PRO, MATH) and instruction following (IFEval -6 points). The IFEval gap likely reflects Meta’s specific optimization for structured instruction formats.
Tool use
Hermes 3 uses the Hermes Function Calling standard (documented at NousResearch/Hermes-Function-Calling):
```
<tools>
[{"type": "function", "function": {"name": "get_weather", "parameters": {...}}}]
</tools>
```

The model responds with:

```
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call>
```
Tool responses are fed back via the tool role wrapped in <tool_response> tags. Multi-turn tool chains are supported. The 4.3% of training data dedicated to tool use, agentic patterns, and RAG (17M tokens) establishes these capabilities.
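The round trip above can be sketched in a few lines. The tag names (<tools>, <tool_call>, <tool_response>) are from the standard; the surrounding prompt wording and helper names are illustrative:

```python
import json
import re

# Sketch of the Hermes Function Calling flow: advertise tools in the
# system prompt, parse <tool_call> blocks from the completion, and wrap
# results in <tool_response> for the tool role.
def tools_system_prompt(tools):
    return ("You may call functions. Available tools:\n"
            f"<tools>\n{json.dumps(tools)}\n</tools>\n"
            "Emit calls as <tool_call>{...}</tool_call>.")

def parse_tool_calls(completion):
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, completion, re.S)]

def tool_response_message(result):
    return {"role": "tool",
            "content": f"<tool_response>\n{json.dumps(result)}\n</tool_response>"}

completion = ('<tool_call>\n'
              '{"name": "get_weather", "arguments": {"location": "Paris"}}\n'
              '</tool_call>')
calls = parse_tool_calls(completion)
```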
Deployment
Hardware
| Format | Size | Minimum Hardware |
|---|---|---|
| FP16 | 141.1 GB | 2× A100-80GB |
| FP8 | ~70 GB | A100-80GB |
| 4-bit GGUF | ~35 GB | RTX 4090 (24 GB) with partial CPU offload |
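The sizes in the table follow directly from parameter count times bytes per parameter (runtime use adds KV cache and activation overhead on top, and real GGUF files run somewhat larger once quantization scales and metadata are included):

```python
# Back-of-envelope weight memory: parameter count x bits per parameter.
def weight_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

fp16 = weight_gb(70e9, 16)  # 140 GB -> matches the 141.1 GB disk size
fp8  = weight_gb(70e9, 8)   # 70 GB
q4   = weight_gb(70e9, 4)   # 35 GB; actual GGUF quants add scale metadata
```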
vLLM
At FP16 the weights do not fit on a single 80 GB GPU, so shard across two with tensor parallelism:

```
vllm serve NousResearch/Hermes-3-Llama-3.1-70B --tensor-parallel-size 2
```
API
Available on OpenRouter at $0.30/M tokens (input and output).
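A hedged sketch of calling the model through OpenRouter, which exposes an OpenAI-compatible chat completions endpoint; the model slug and endpoint path below are assumptions based on OpenRouter's usual conventions, so check their docs before relying on them:

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "nousresearch/hermes-3-llama-3.1-70b"  # assumed slug, verify on OpenRouter

def build_request(api_key, user_message, system_prompt="You are Hermes."):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_request("sk-or-your-key", "What is the capital of France?")
# urllib.request.urlopen(req)  # uncomment to actually send (network call)
```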
The Hermes 3 family
| Size | GPUs Used | GPU Hours | Selected Epoch | DPO Applied |
|---|---|---|---|---|
| 8B | 48 | 147 | 4 | Yes (LoRA) |
| 70B | 48 | 648 | 3 | No |
| 405B | 128 | 2,086 | 4 | No |
The 405B model required 16 HGX nodes (128 H100s) and a reduced learning rate (3.5e-6 vs 7e-6). At the minimum viable configuration of 7 nodes, CPU parameter offloading was required, cutting training efficiency by 45%. 405B evaluations were performed under FP8 quantization.
Nous Research
Nous Research is an independent AI lab founded by Ryan Teknium, Jeffrey Quesnelle, and Chen Guang. Their alignment philosophy — put guardrails at the system level, not the model level — has made Hermes a community favorite for use cases where commercial models over-refuse: creative writing, autonomous agents, roleplaying, and research applications. Hermes 4 has since been released as the successor.
References
- HuggingFace: huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B
- Paper: arxiv.org/abs/2408.11857