Nous Research released Hermes 4 in August 2025 - a family at 14B, 70B, and 405B parameters - built entirely on Llama 3.1 checkpoints, reaching frontier-level performance through pure post-training. No pretraining run. No dedicated GPU cluster building a foundation from scratch. That is the bet the whole Hermes lineage rests on, and it is worth understanding precisely because it is still unusual.
Nous Research is one of the more quietly respected open-source AI labs. They do not publish paper volume like DeepMind. They do not raise Series C money like Anthropic. What they do is ship well-tuned open-weight models - the Hermes 3 family is considered one of the best fine-tune lineages in 2025-2026 - entirely through post-training rather than building foundations from scratch.
That is the setup. The interesting question is whether the approach has a ceiling, and what Hermes 4 reveals about where that ceiling sits.
What the post-training-only bet actually means
Most teams encounter "open-weight model" as a category and treat all of them similarly: download weights, point your inference stack at them, compare benchmark scores. The Hermes line is worth understanding more specifically than that.
The Hermes 3 models use a simple post-training stack of one large supervised fine-tuning mix followed by direct preference optimization. Nous Research is best known for its large, general chat, SFT data mixes. The consequence of this architecture decision is real: Nous can move fast because they are not building foundations. When Meta releases a new Llama checkpoint, Nous can post-train on top of it within weeks. When Qwen releases a strong 14B base, Nous can port their training approach there too - which is exactly what happened with Hermes 4's 14B variant.
Hermes 4 14B is a frontier, hybrid-mode reasoning model based on Qwen 3 14B by Nous Research. The 70B and 405B variants stay on Llama 3.1. This base-model flexibility - pick the strongest publicly available checkpoint, post-train it with your own data mix - is structurally different from what OpenAI or Anthropic do, and it explains why the Hermes line has been able to keep pace with closed-model capability curves at a fraction of the cost.
The trade-off is also real: Nous inherits whatever the base model gets wrong. If Llama 3.1 has gaps in multilingual instruction following, post-training can soften them but rarely closes them fully.
What changed from Hermes 3 to Hermes 4
Hermes 3 was already a meaningful model.
The primary improvements were advanced agentic capabilities, significantly better multi-turn coherence, and more reliable structured output. The tool-calling format matured: the model learned to emit <tool_call> tagged JSON reliably within a single assistant turn, rather than requiring external parsing hacks. This reliability shift made Hermes 3 the first version of the model family that teams seriously considered for production agentic pipelines.
Hermes 4 builds on that but introduces something structurally new: hybrid reasoning mode.
Hermes 4 models can toggle between standard responses and explicit reasoning using <think>...</think> tags when complex problems require deeper deliberation.
This is not just a UX feature. It means you can run the model in fast mode for routine tool calls and in reasoning mode for multi-step planning - within the same session, without switching models.
The hybrid reasoning mode lets you toggle chain-of-thought on for complex planning steps and off for routine tool calls, balancing quality and cost within the same session.
The training corpus expansion explains why this works as well as it does. Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, with the dataset size increasing from 1M samples and 1.2B tokens to roughly 5M samples and 60B tokens blended across reasoning and non-reasoning data. That is a 50x expansion in fine-tuning token count. Most open-weight model releases change the base model and call it progress. Hermes 4 kept the same Llama 3.1 base and got meaningfully better by investing in data.
The model is instruction-tuned with the expanded corpus emphasizing reasoning traces, improving performance in math, code, STEM, and logical reasoning. It supports structured outputs including JSON mode, schema adherence, function calling, and tool use. Hermes 4 is trained for steerability, lower refusal rates, and alignment toward neutral, user-directed behavior.
The cost math for teams running inference at volume
API costs for frontier closed models can run $5-$15 per million tokens. Hermes 4 70B, available on providers like Nebius via OpenRouter, is priced well below that range. Hermes 4 70B costs $0.13 per 1M input tokens and $0.4 per 1M output tokens at one listed provider - roughly a 10-30x cost reduction against Claude Sonnet or GPT-4o class pricing.
Inference costs are falling at roughly 10x per year for the same capability level , which means the cost argument for open-weight models is not static - it keeps improving. But the per-token price is only half the story. On the most complex reasoning tasks and in long agentic workflows, closed frontier models still hold an edge. The gap has narrowed significantly in 2024-2025, but "catching up" doesn't mean "equivalent across all tasks."
For teams building pipelines where the task is well-defined - structured extraction, tool dispatch, multi-turn document Q&A - the Hermes 70B hitting at $0.40 per million output tokens while reliably emitting structured JSON is a real operational option. For open-ended reasoning, complex code generation with minimal scaffolding, or tasks where a single model failure is expensive, the closed frontier models still justify their price premium.
Managed inference services like Together AI, Fireworks, and Groq make it straightforward to call open-weight models via API without managing your own GPU infrastructure. Running Hermes 4 does not require owning GPUs. A teammate routing Slack questions through Beagle to a Hermes 70B endpoint looks identical in the API layer to routing them through Claude - the choice is a config value and a price point.
Where Hermes is going, and what to watch honestly
The most interesting signal in the Hermes roadmap is not another model size - it is the infrastructure experiment underneath Hermes 4.3. Hermes 4.3 was trained using Nous Research's Psyche decentralized training network rather than a traditional centralized GPU cluster. The model card explicitly calls this out as the first Hermes model trained this way. Whether decentralized training at this scale becomes a repeatable production method or remains an experiment is worth watching. The training results are competitive, which at minimum demonstrates that the approach is viable.
Benchmark scores from the Hermes 4.3 36B Psyche model card show MATH-500 at 93.8%, MMLU at 87.7%, BBH at 86.4%, AIME 24 at 71.9%, and GPQA Diamond at 65.5%. Those numbers from a 36B model trained on a decentralized network are not hype. They are a proof of concept for a different model of who gets to participate in frontier training.
In parallel, Nous Research is training Consilience, a 40B-parameter model designed with multi-head latent attention and trained on roughly 20T tokens. While Hermes validates Nous' ability to ship competitive open-source models, Consilience advances a distinct objective: cultivating original thought and creative synthesis rather than optimizing for conventional leaderboard metrics.
The honest ceiling of the post-training-only bet is this: Nous can improve a base model significantly, but they cannot build capabilities the base model entirely lacks. DeepSeek's open-weight performance in close range with state-of-the-art closed models suggests a fast-paced catch-up dynamic, with open-weight and closed models advancing in a tight loop. Hermes benefits from that loop - when Meta or Qwen releases a stronger base, Nous can post-train it - but they are downstream of it.
For teams evaluating the Hermes line: the 70B is worth a serious trial on any agentic pipeline where reliable tool-call formatting and low refusal rates matter, the inference cost math is genuinely favorable at volume, and Hermes 4's hybrid reasoning mode is a real operational feature rather than a marketing claim. Approach the benchmark comparisons with open-frontier models critically - the gap against closed models on simple tasks is narrow; on complex multi-step reasoning, the gap is real and the honest answer is to test both.
If you are evaluating Hermes 4 for a production pipeline, test it specifically on tool-calling reliability and JSON schema adherence rather than MMLU scores. That is the workload it was optimized for, and the gap between Hermes and closed models is smallest there.