Hermes 3 Llama 3.1 70B

Released July 29, 2024

Context Window 66K tokens
≈ 49 pages of text
1.5K Downloads
122 Likes
141.1 GB Disk Size

Pricing

Input $0.30 per million tokens
Output $0.30 per million tokens

Benchmarks

Open LLM Leaderboard v2

IFEval 76.6
BBH 53.8
MATH Lvl 5 21.0
GPQA 14.9
MUSR 23.4
MMLU-PRO 41.4

Average: 38.50

About

Hermes-3-Llama-3.1-70B is a 70 billion parameter instruct model from Nous Research, fine-tuned from Meta’s Llama-3.1-70B. Released August 2024 under the Llama 3.1 Community License, it’s the mid-range model in the Hermes 3 family (8B, 70B, 405B). Hermes 3 is a neutrally-aligned generalist — the paper’s stated philosophy is “for Hermes, there is no such thing as latent thoughtcrime.” The model is designed to faithfully follow system prompts and user instructions without refusing on moral grounds, placing responsibility for guardrails at the system level rather than the model level.

Why it matters

Hermes 3 represents a deliberate alternative to the safety-first alignment approach used by Meta’s official instruct tunes. Where Llama-3.1-70B-Instruct is tuned to refuse certain categories of requests, Hermes 3 is trained to be steerable — it adopts whatever persona its system prompt defines and responds accordingly. The paper notes that the 405B version is so sensitive to system prompts that an empty system prompt doesn’t default to “helpful assistant” behavior.

The practical result is a model optimized for:

  • Agentic workflows — built-in support for structured reasoning via special tokens (<SCRATCHPAD>, <REASONING>, <PLAN>, <EXECUTION>, <REFLECTION>, <THINKING>, <SOLUTION>, etc.)
  • Tool use — the Hermes Function Calling standard uses JSON schemas in <tools> tags with invocations in <tool_call> and responses in <tool_response>
  • RAG — trained to cite retrieval sources using <co> tags
  • Roleplaying — consistent persona maintenance across long multi-turn conversations using the full 128K context window
  • Creative writing — reduced refusal behavior means the model engages with creative scenarios that over-aligned models decline
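Concretely, persona steering is just a system message in the model's ChatML prompt format; a minimal sketch of assembling such a prompt (the persona text is an invented example, not from the paper):

```python
# Build a ChatML prompt string for Hermes 3. The special tokens
# (<|im_start|>, <|im_end|>) are the documented ChatML delimiters;
# the persona below is a made-up example.
def chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt(
    "You are a terse maritime historian. Stay in character.",
    "Why did square-rigged ships fall out of use?",
)
```

In practice you would pass this through the tokenizer's chat template rather than formatting by hand, but the wire format is the same.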

The benchmark tradeoff is real: Hermes 3 70B beats Llama-3.1-Instruct on AGIEval, HellaSwag, MT-Bench, MuSR, and TruthfulQA, but loses on IFEval, MATH, MMLU, and GPQA. It optimizes for steerability and instruction faithfulness over raw benchmark performance.

Architecture

| Spec | Value |
| --- | --- |
| Architecture | Dense decoder-only Transformer (LlamaForCausalLM) |
| Parameters | 70B |
| Base Model | Meta Llama-3.1-70B |
| Context Length | 128K tokens |
| Prompt Format | ChatML (`<\|im_start\|>` / `<\|im_end\|>`) |
| Training Sequence Length | 8,192 tokens |
| Disk Size | 141.1 GB |
| License | Llama 3.1 Community License |

Standard Llama 3.1 architecture with no structural modifications. All improvements from post-training.
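The 141.1 GB FP16 footprint follows directly from the parameter count. A back-of-envelope check, assuming the standard Llama 3.1 70B hyperparameters (these dimensions come from Meta's published config, not from this card):

```python
# Parameter count from assumed Llama 3.1 70B dimensions: 80 layers,
# hidden 8192, 64 heads / 8 KV heads (KV dim 1024), FFN 28672,
# vocab 128256, untied input/output embeddings. Norm weights omitted.
h, layers, ffn, vocab, kv_dim = 8192, 80, 28672, 128256, 1024

attn = h * h * 2 + h * kv_dim * 2           # Q and O + K and V projections
mlp = 3 * h * ffn                           # gate, up, down projections
total = layers * (attn + mlp) + 2 * vocab * h  # + embeddings and LM head

print(f"{total / 1e9:.1f}B params")         # → 70.6B params
print(f"FP16: {total * 2 / 1e9:.1f} GB")    # → FP16: 141.1 GB
```

Two bytes per parameter at FP16 lands almost exactly on the listed disk size.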

Training

Data (from paper Table 1)

The Hermes 3 dataset totals approximately 390 million tokens (270M output tokens contributing to loss, 120M input tokens). Data was curated between March and August 2024.

| Category | Proportion | Tokens (M) |
| --- | --- | --- |
| General Instructions | 60.6% | 236 |
| Domain Expert | 12.8% | 50 |
| Math | 6.7% | 26 |
| Roleplaying | 6.1% | 24 |
| Coding | 4.5% | 18 |
| Tool Use, Agentic, and RAG | 4.3% | 17 |
| Content Generation | 3.0% | 12 |
| Steering and Alignment | 2.5% | 10 |

Data sources include existing curated datasets and domain-specific synthetic data generated using Evol-Instruct-inspired schemes. Filtering removed refusals, improperly formatted responses, and empty turns, and prioritized outputs from the strongest generator models.
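A toy sketch of what refusal filtering can look like (the marker phrases are invented; the paper does not publish its exact filter rules):

```python
# Hypothetical refusal/empty-turn filter for SFT data curation.
# The marker list is illustrative, not Nous Research's actual filter.
REFUSAL_MARKERS = (
    "i'm sorry, but",
    "as an ai language model",
    "i cannot assist",
)

def keep_sample(response: str) -> bool:
    """Drop empty turns and responses that open with refusal boilerplate."""
    text = response.strip().lower()
    return bool(text) and not any(m in text for m in REFUSAL_MARKERS)
```

Real pipelines also check formatting (e.g. well-formed ChatML turns) and rank samples by generator-model quality, per the description above.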

SFT phase

| Detail | Value |
| --- | --- |
| Optimizer | AdamW, weight decay 0.01 |
| Learning Rate | 7e-6 (cosine decay after 300-step warmup) |
| Epochs | 4 (epoch 3 selected for 70B based on benchmark scores) |
| Batch Size | 48 |
| GPUs | 48 (6 HGX nodes, 8× H100 SXM5 each) |
| Interconnect | Quantum-2 InfiniBand |
| Training Time | 648 GPU-hours |
| Packing | Flash Attention 2 sample packing at 96% efficiency |
| Sequence Length | 8,192 tokens |
| Framework | Modified Axolotl |
| Loss | Cross-entropy on response/tool-use tokens only (instruction tokens masked) |

Learning rate was selected via hyperparameter sweep on 8B models. The 70B model was distributed across nodes using PyTorch FSDP.
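The response-only loss can be sketched with the usual `-100` ignore-index convention used by most SFT frameworks (a generic illustration, not Axolotl's actual implementation):

```python
# Mask instruction tokens so cross-entropy is computed on response
# tokens only. -100 is the standard ignore_index for cross-entropy
# losses in PyTorch-style frameworks. Token IDs here are invented.
IGNORE_INDEX = -100

def mask_instruction_tokens(input_ids, response_start):
    """Labels copy input_ids, with all tokens before the response masked."""
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

labels = mask_instruction_tokens([11, 22, 33, 44, 55], response_start=3)
print(labels)  # → [-100, -100, -100, 44, 55]
```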

DPO phase

For the 70B model, DPO provided only negligible performance improvements. The final released model is the SFT-only checkpoint — no DPO applied to the 70B.

DPO was applied to the 8B model (LoRA, r=32, α=16, dropout 0.05, RMSProp optimizer, NEFTune α=5) with moderate positive impact, but wasn’t worth the added complexity at 70B scale.
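For reference, the DPO objective applied at 8B scale reduces to a log-sigmoid over the policy-vs-reference log-ratio margin; a minimal numeric sketch (β and the log-probabilities below are invented values, not from the paper):

```python
import math

# Direct Preference Optimization loss (Rafailov et al.):
#   L = -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))
# where logp_* are policy log-probs and ref_* are frozen-reference log-probs
# of the chosen/rejected responses. All numbers here are illustrative.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does,
# so the loss drops below -log(0.5):
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

At margin zero the loss is exactly log 2; training pushes the chosen/rejected margin positive.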

Benchmarks

From the paper’s Table 5. All evaluations performed by the authors.

| Benchmark | Hermes 3 70B | Llama 3.1 Instruct 70B |
| --- | --- | --- |
| AGIEval (0-shot) | 56.18 | 48.26 |
| ARC-C (0-shot) | 65.53 | 63.40 |
| BoolQ (0-shot) | 88.04 | 87.76 |
| BBH (3-shot) | 67.82 | 69.24 |
| GPQA (0-shot) | 37.67 | 40.09 |
| HellaSwag (10-shot) | 88.19 | 86.42 |
| IFEval (Strict) | 81.21 | 87.25 |
| MATH Lvl 5 (4-shot) | 20.80 | 29.24 |
| MMLU (5-shot) | 79.09 | 82.27 |
| MMLU-PRO (5-shot) | 47.24 | 52.94 |
| MT-Bench (Avg) | 8.99 | 8.93 |
| MuSR (0-shot) | 50.67 | 47.08 |
| TruthfulQA (MC2) | 63.29 | 59.91 |
| WinoGrande (5-shot) | 83.19 | 85.00 |

The pattern is consistent with Hermes 3’s design priorities. It wins on reasoning-adjacent benchmarks (AGIEval, MuSR), conversation quality (MT-Bench), and honesty (TruthfulQA +3.4 points). It loses on academic benchmarks (MMLU, MMLU-PRO, MATH) and instruction following (IFEval -6 points). The IFEval gap likely reflects Meta’s specific optimization for structured instruction formats.

Tool use

Hermes 3 uses the Hermes Function Calling standard (documented at NousResearch/Hermes-Function-Calling):

<tools>
  [{"type": "function", "function": {"name": "get_weather", "parameters": {...}}}]
</tools>

The model responds with:

<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call>

Tool responses are fed back via the tool role wrapped in <tool_response> tags. Multi-turn tool chains are supported. The 4.3% of training data dedicated to tool use, agentic patterns, and RAG (17M tokens) establishes these capabilities.
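A minimal sketch of extracting such calls from raw model output (the regex-based approach and the sample string are illustrative, not an official Nous parser):

```python
import json
import re

# Pull JSON payloads out of <tool_call>...</tool_call> spans.
# Assumes well-formed output; production code should handle malformed JSON.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str):
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

output = (
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"location": "Paris"}}\n'
    '</tool_call>'
)
calls = parse_tool_calls(output)
print(calls[0]["name"])  # → get_weather
```

The parsed `name`/`arguments` pair is then dispatched to the matching function, and the result is sent back inside `<tool_response>` tags on the tool role.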

Deployment

Hardware

| Format | Size | Minimum Hardware |
| --- | --- | --- |
| FP16 | 141.1 GB | 2× A100-80GB |
| FP8 | ~70 GB | A100-80GB |
| 4-bit GGUF | ~35 GB | RTX 4090 (24 GB, with partial CPU offload) |
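These sizes are roughly parameter bytes alone; a quick sanity check (ignoring KV cache and activation overhead, so real memory requirements are somewhat higher):

```python
# Back-of-envelope VRAM math: parameter count times bytes per parameter.
# 70.55e9 is the approximate parameter count of the 70B model.
params = 70.55e9

for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{fmt}: ~{gb:.0f} GB")  # FP16: ~141 GB, FP8: ~71 GB, 4-bit: ~35 GB
```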

vLLM

# FP16 weights exceed a single 80 GB GPU; shard across two with tensor parallelism
vllm serve NousResearch/Hermes-3-Llama-3.1-70B --tensor-parallel-size 2

API

Available on OpenRouter at $0.30/M tokens (input and output).
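A request body for the OpenAI-compatible chat completions endpoint might look like this (the model slug is assumed from OpenRouter's usual naming convention and should be verified against their catalog):

```json
{
  "model": "nousresearch/hermes-3-llama-3.1-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Hermes 3 alignment philosophy."}
  ]
}
```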

The Hermes 3 family

| Size | GPUs Used | GPU Hours | Selected Epoch | DPO Applied |
| --- | --- | --- | --- | --- |
| 8B | 48 | 147 | 4 | Yes (LoRA) |
| 70B | 48 | 648 | 3 | No |
| 405B | 128 | 2,086 | 4 | No |

The 405B model required 16 HGX nodes (128 H100s) and a reduced learning rate (3.5e-6 vs 7e-6). With CPU parameter offloading, training was possible on as few as 7 nodes, at the cost of a 45% drop in training efficiency. 405B evaluations were performed under FP8 quantization.

Nous Research

Nous Research is an independent AI lab founded by Ryan Teknium, Jeffrey Quesnelle, and Chen Guang. Their alignment philosophy — put guardrails at the system level, not the model level — has made Hermes a community favorite for use cases where commercial models over-refuse: creative writing, autonomous agents, roleplaying, and research applications. Hermes 4 has since been released as the successor.

References

  • 🤗 HuggingFace huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B
  • 📄 Paper arxiv.org/abs/2408.11857