
A specialized 9B dense model tuned specifically for terminal execution, file editing, and precise tool calling within the Hermes Agent harness.
A solid 9B-parameter dense language model from kai-os. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.
No benchmark data available for this model yet.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 4.1 GB | Low |
| Q4_K_M (recommended) | 6.0 GB | Good |
| Q5_K_M | 6.9 GB | Very Good |
| Q6_K | 8.0 GB | Excellent |
| Q8_0 | 10.2 GB | Near Perfect |
| FP16 | 18.8 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Grade | Speed | VRAM Required |
|---|---|---|---|---|
| | | SS | 57.8 tok/s | 6.0 GB |
| Intel Arc B580 | Intel | SS | 61.0 tok/s | 6.0 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | SS | 67.5 tok/s | 6.0 GB |
| | | SS | 67.5 tok/s | 6.0 GB |
| | | SS | 60.0 tok/s | 6.0 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | SS | 89.9 tok/s | 6.0 GB |
| | | SS | 38.5 tok/s | 6.0 GB |
| | | SS | 68.5 tok/s | 6.0 GB |
| | | SS | 83.5 tok/s | 6.0 GB |
| | | SS | 85.7 tok/s | 6.0 GB |
| | | SS | 85.7 tok/s | 6.0 GB |
| Google Cloud TPU v5e | Google | SS | 109.6 tok/s | 6.0 GB |
| Intel Arc A770 16GB | Intel | SS | 74.9 tok/s | 6.0 GB |
| | | SS | 128.5 tok/s | 6.0 GB |
| | | SS | 89.9 tok/s | 6.0 GB |
| | | SS | 98.5 tok/s | 6.0 GB |
| | | SS | 60.0 tok/s | 6.0 GB |
| | | SS | 119.9 tok/s | 6.0 GB |
| | | SS | 128.5 tok/s | 6.0 GB |
| NVIDIA GeForce RTX 4060 | NVIDIA | SS | 36.4 tok/s | 6.0 GB |
| | | SS | 38.5 tok/s | 6.0 GB |
| | | SS | 107.1 tok/s | 6.0 GB |
| | | SS | 128.5 tok/s | 6.0 GB |
| NVIDIA GeForce RTX 3090 | NVIDIA | SS | 125.3 tok/s | 6.0 GB |
| | | SS | 134.9 tok/s | 6.0 GB |
Energy cost on AMD Radeon RX 7600 8GB (~39 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): Carnice-9b for Hermes agent on AMD Radeon RX 7600 8GB · ~39 tok/s · 165W | $0.143 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.5 Flash (Google) · in $1.50 · out $9.00 | $3.75 |
| Grok 4.3 (xAI) · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
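The figures in the cost table can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the electricity rate of $0.12/kWh and the unrounded speed of 38.5 tok/s are assumptions (the page states neither; it shows "~39 tok/s"), chosen because together they land on the listed $0.143.

```python
def blended_price(input_price: float, output_price: float,
                  input_share: float = 0.7) -> float:
    """Blend per-1M-token input and output prices at a fixed input share."""
    return input_share * input_price + (1.0 - input_share) * output_price


def local_energy_cost(tok_per_s: float, watts: float,
                      usd_per_kwh: float, tokens: float = 1_000_000) -> float:
    """Energy-only cost of generating `tokens` at steady speed and power draw."""
    hours = tokens / tok_per_s / 3600.0          # wall-clock time to generate
    return hours * (watts / 1000.0) * usd_per_kwh  # kWh consumed times rate


# API prices per 1M tokens, taken from the table above.
print(f"GPT-5.5 blended:         ${blended_price(5.00, 30.00):.2f}")
print(f"Gemini 3.5 Flash blended: ${blended_price(1.50, 9.00):.2f}")

# 38.5 tok/s and $0.12/kWh are assumptions, not figures stated on the page.
print(f"Local energy only:       ${local_energy_cost(38.5, 165, 0.12):.3f}")
```

Changing `usd_per_kwh` to your own tariff is the quickest way to see how sensitive the local figure is to electricity prices.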
Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3080 (Vast.ai · Spot · 10 GB VRAM) | $0.03 |
| NVIDIA GeForce RTX 3080 (Vast.ai · On-Demand · 10 GB VRAM) | $0.03 |
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
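To compare an hourly rental rate with per-token API pricing, convert it to a cost per 1M tokens. A rough sketch, assuming the GPU generates continuously for the whole billed hour; the function name is hypothetical, and the 60 tok/s throughput is an illustrative assumption because the page does not list a speed for the rented cards.

```python
def rental_cost_per_million(usd_per_gpu_hour: float, tok_per_s: float) -> float:
    """Cost of 1M generated tokens on a rented GPU at full utilization."""
    tokens_per_hour = tok_per_s * 3600.0
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000


# $0.13/GPU-hour is the RTX 3070 rate from the table; 60 tok/s is assumed.
cost = rental_cost_per_million(0.13, 60.0)
print(f"~${cost:.2f} per 1M tokens")
```

Idle time between agent steps is still billed, so the effective cost for interactive agent use will be higher than this estimate.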
Carnice-9b for Hermes agent is a 9-billion-parameter dense model built specifically for the Hermes Agent harness. Created by kai-os, it is a standalone merged checkpoint derived from Qwen/Qwen3.5-9B, but it is not a generic chat model. The training objective was to improve behavior inside Hermes Agent—tool calling, terminal execution, file editing, browser use, and multi-step agent workflows. If you are building autonomous agents with Hermes, this model is tuned to speak its language natively.
The model occupies a narrow but critical niche: local agent execution. It competes with other small models used for local agents, such as NousResearch’s Hermes 3 (8B) and Microsoft’s Phi-3.5-mini (3.8B), but Carnice-9b distinguishes itself by prioritizing harness-native trajectory quality over generic benchmark optimization. Its Apache 2.0 license makes it practical for commercial deployment and derivative work.
Carnice-9b uses a dense transformer architecture with 9 billion parameters. Unlike mixture-of-experts models, every forward pass activates all parameters, which simplifies memory planning and avoids routing overhead. For a 9B dense model, weight memory is predictable: roughly 18 GB at 16-bit precision (FP16/BF16), shrinking roughly in proportion to the bits per weight under quantization.
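The 18 GB figure follows from parameters times bytes per weight. A minimal sketch of that estimate; the function name is illustrative, and the result covers weights only, not the KV cache or runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8.0


for name, bits in [("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(9, bits):.1f} GB")
```

Real GGUF quantizations use somewhat more than their nominal bits per weight, and inference also needs room for the KV cache, which is why the quantization tables on this page list higher numbers than this lower bound (for example, 6.0 GB for Q4_K_M).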
The model is built on Qwen3.5-9B, but kai-os performed a two-stage fine-tuning process. Stage A was a reasoning repair pass using high-signal reasoning data (Bespoke-Stratos-17k, NuminaMath-CoT) to recover general reasoning ability that can degrade during agent specialization. Stage B—the defining step—was a Hermes-specific refresh pass using harness-native traces and action structure from datasets like kai-os/carnice-glm5-hermes-traces and OpenThoughts-Agent-v1-SFT. The result is a checkpoint that expects Hermes-native message formatting and tool-call patterns, not generic OpenAI-style function definitions.
Context length is not officially specified. As a Qwen3.5 derivative, it likely supports at least 32K tokens, assuming the base model’s capacity carries over. Short agent trajectories often fit within a few thousand tokens, so this is unlikely to be a bottleneck for typical use, though long sessions with large tool outputs consume context faster.
Carnice-9b is tuned for three areas: code, reasoning, and function calling. These capabilities are tied to concrete Hermes Agent workflows: tool calling, terminal execution, file editing, browser use, and multi-step agent tasks.
For developers: use this model when you need an agent that reliably calls tools in Hermes format, not when you want a general-purpose chatbot. It is optimized for the Hermes runtime, so if your stack uses Hermes Agent (e.g., for autonomous coding agents, browser agents, or DevOps automation), it is a strong 9B option to evaluate first.
This is a practical choice for local deployment because 9B dense models fit reasonably on consumer hardware with quantization. Here’s what you need to know:
VRAM requirements (GGUF quantization):
| Quant | Size | Minimum VRAM | Recommended VRAM |
|---|---|---|---|
| Q4_K_M (4-bit) | 5.3 GB | 6 GB | 8-12 GB |
| Q6_K (6-bit) | 6.9 GB | 8 GB | 12 GB |
| Q8_0 (8-bit) | 8.9 GB | 10 GB | 16 GB |
For most users, Q4_K_M offers the best tradeoff between quality and local performance. Q6_K provides a meaningful quality bump if you have 12 GB of VRAM (e.g., RTX 4070 Ti, RTX 3080 12GB, or an Apple silicon Mac with 16 GB of unified memory). Q8_0 is only necessary if you are doing evaluation or need maximum fidelity.
Quickest way to start: Use Ollama. GGUF versions are available (from kai-os/Carnice-9b-GGUF), and you can pull a quantized variant directly. Alternatively, use llama.cpp or LM Studio for full control. The source checkpoint is also loadable via HuggingFace Transformers in BF16 on any GPU with 18 GB+.
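Once a GGUF variant is loaded in Ollama, a quick smoke test can go through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port. The model tag is an assumption built from the repository name above and a Q4_K_M quant label; replace it with whatever name `ollama list` reports on your machine. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
# Assumed tag; check `ollama list` for the actual name on your machine.
MODEL_TAG = "hf.co/kai-os/Carnice-9b-GGUF:Q4_K_M"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("List the files in the current directory.")` only confirms that the model loads and responds. It does not exercise Hermes-native tool formatting, which requires running the model inside the Hermes Agent harness.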
Expected tokens per second varies heavily by hardware, quantization, and context size. As a rule of thumb, a 9B dense model at Q4_K_M on a modern GPU provides real-time interactivity (20+ tok/s). For agent execution, the bottleneck is usually tool-call round trips, not raw throughput, so even 10-15 tok/s is acceptable for multi-step tasks.
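The point about round trips can be made concrete with a back-of-the-envelope latency model. This is a simplified sketch: the function name and all input numbers (steps, tokens per step, tool latency) are illustrative assumptions, and prompt processing time is ignored.

```python
def task_latency_s(steps: int, tokens_per_step: int,
                   tok_per_s: float, tool_round_trip_s: float) -> float:
    """Total time for an agent task: generation time plus tool round trips."""
    generation = steps * tokens_per_step / tok_per_s
    tools = steps * tool_round_trip_s
    return generation + tools


# Assumed workload: 10 steps, 60 generated tokens each, 5 s per tool call.
for speed in (15.0, 60.0):
    total = task_latency_s(10, 60, speed, 5.0)
    print(f"{speed:.0f} tok/s -> {total:.0f} s total")
```

Under these assumptions, quadrupling throughput from 15 to 60 tok/s cuts total task time only from 90 s to 60 s, because the 50 s spent waiting on tools is unchanged.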
vs. Hermes 3 (NousResearch, 8B): Hermes 3 is a general-purpose instruct model also tuned for tool use, but its training was broader. Carnice-9b has a tighter focus on Hermes Agent formatting and terminal trajectories. If you use Hermes Agent and find Hermes 3 producing awkward tool outputs or failing on multi-step execution, Carnice-9b is the targeted fix. Hermes 3 may perform better on generic reasoning benchmarks, but that is not the metric that matters here.
vs. Phi-3.5-mini (Microsoft, 3.8B): Phi-3.5-mini is smaller and less capable for complex agent workflows. It lacks dedicated agent training and struggles with multi-turn tool sequences. Carnice-9b is the better choice if you need reliable execution over many steps. Phi-3.5-mini wins on VRAM (can run on 4-6 GB) and speed, but not on agent quality.
vs. Qwen2.5-7B-Instruct: Stock Qwen instruct models have strong general reasoning but no Hermes-specific agent tuning. Carnice-9b inherits Qwen’s reasoning strength (via the repair stage) and adds harness-native tool behavior. If you are already using Hermes Agent, the tuned version saves you the effort of prompt engineering for tool formatting.
Choose Carnice-9b when your agent stack demands precision in tool invocation and resilience over long execution chains. Do not choose it if you need broad chat performance or multimodal capabilities. For its target use case, it is purpose-built and effective.