kai-os

Carnice-9b for Hermes agent

A specialized 9B dense model tuned specifically for terminal execution, file editing, and precise tool calling within the Hermes Agent harness.

9B params · Dense

Our Take

Best for: Local inference under 16 GB VRAM

A solid 9B-parameter dense language model from kai-os. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.

Run this on: AMD Radeon RX 7600 8GB. Cheapest card in our directory with comfortable headroom (8 GB) for this model at Q4 (~6.0 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Code Generation

Reasoning

Function Calling

Model Specifications

Parameters: 9B
Active Params: 9B
Architecture: Dense
Modality: Text Only
Provider: kai-os
Download Size: 17.9 GB

Community

Monthly Downloads: 131
Likes: 184
Last Updated: 2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Apache 2.0

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

57.4 (tier B)

Component	Weight	Score
Benchmark	40%	75.0
Popularity	25%	3.4
Efficiency	20%	97.2
Versatility	15%	47.5
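
The component weights and scores listed here reproduce the headline score if it is a plain weighted sum. That is an assumption, since the page does not state the formula. A minimal sketch, using only the numbers shown in this section:

```python
# Component weights and scores as listed in this section.
COMPONENTS = {
    "Benchmark": (0.40, 75.0),
    "Popularity": (0.25, 3.4),
    "Efficiency": (0.20, 97.2),
    "Versatility": (0.15, 47.5),
}


def weighted_score(components):
    """Weighted average of the component scores (assumed formula)."""
    total_weight = sum(weight for weight, _ in components.values())
    weighted_sum = sum(weight * score for weight, score in components.values())
    return weighted_sum / total_weight


score = weighted_score(COMPONENTS)
print(f"MBA Open Score (reconstructed): {score:.1f}")
```

Under that assumption the result rounds to the 57.4 shown on the page, with the low Popularity component pulling the total down the most.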

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	4.1 GB	Low	Aggressive quantization; smallest size, noticeable quality loss
Q4_K_M (recommended)	6.0 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	6.9 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	8.0 GB	Excellent	Near-lossless quality with manageable size
Q8_0	10.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	18.8 GB	Full	Full 16-bit floating point; maximum quality, largest size
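
One way to read the table is to pick the largest format that still fits a given card. The sketch below does that with the VRAM figures listed above. The `best_quant` helper and its 1 GB default headroom (for context and runtime overhead) are illustrative assumptions, not part of the directory.

```python
# (format, VRAM required in GB) from the table above, smallest to largest.
QUANT_VRAM_GB = [
    ("Q2_K", 4.1),
    ("Q4_K_M", 6.0),
    ("Q5_K_M", 6.9),
    ("Q6_K", 8.0),
    ("Q8_0", 10.2),
    ("FP16", 18.8),
]


def best_quant(vram_gb, headroom_gb=1.0):
    """Return the largest format that fits in vram_gb minus headroom_gb.

    headroom_gb is a hypothetical safety margin; tune it for your context size.
    Returns None when nothing fits.
    """
    budget = vram_gb - headroom_gb
    fitting = [name for name, need in QUANT_VRAM_GB if need <= budget]
    return fitting[-1] if fitting else None


for card_gb in (6, 8, 12, 24):
    print(f"{card_gb} GB card -> {best_quant(card_gb)}")
```

A larger headroom is the safer choice for long agent sessions, because the context cache grows with every tool result appended to the conversation.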

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices (first 25 shown)

Device	Vendor	Tier	Speed	VRAM Required
AMD Radeon RX 7700 XT	AMD	S	57.8 tok/s	6.0 GB
Intel Arc B580	Intel	S	61.0 tok/s	6.0 GB
NVIDIA GeForce RTX 4070	NVIDIA	S	67.5 tok/s	6.0 GB
NVIDIA GeForce RTX 4070 SUPER	NVIDIA	S	67.5 tok/s	6.0 GB
NVIDIA GeForce RTX 5060 Ti 8GB	NVIDIA	S	60.0 tok/s	6.0 GB
NVIDIA GeForce RTX 5070	NVIDIA	S	89.9 tok/s	6.0 GB
AMD Radeon RX 7600 8GB	AMD	S	38.5 tok/s	6.0 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	S	68.5 tok/s	6.0 GB
AMD Radeon RX 7800 XT	AMD	S	83.5 tok/s	6.0 GB
AMD Radeon RX 9070	AMD	S	85.7 tok/s	6.0 GB
AMD Radeon RX 9070 XT	AMD	S	85.7 tok/s	6.0 GB
Google Cloud TPU v5e	Google	S	109.6 tok/s	6.0 GB
Intel Arc A770 16GB	Intel	S	74.9 tok/s	6.0 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)	NOVATECH	S	128.5 tok/s	6.0 GB
NVIDIA GeForce RTX 4070 Ti SUPER	NVIDIA	S	89.9 tok/s	6.0 GB
NVIDIA GeForce RTX 4080 SUPER	NVIDIA	S	98.5 tok/s	6.0 GB
NVIDIA GeForce RTX 5060 Ti 16GB	NVIDIA	S	60.0 tok/s	6.0 GB
NVIDIA GeForce RTX 5070 Ti	NVIDIA	S	119.9 tok/s	6.0 GB
NVIDIA GeForce RTX 5080 Founders Edition	NVIDIA	S	128.5 tok/s	6.0 GB
NVIDIA GeForce RTX 4060	NVIDIA	S	36.4 tok/s	6.0 GB
NVIDIA GeForce RTX 4060 Ti 16GB	NVIDIA	S	38.5 tok/s	6.0 GB
AMD Radeon RX 7900 XT	AMD	S	107.1 tok/s	6.0 GB
AMD Radeon RX 7900 XTX	AMD	S	128.5 tok/s	6.0 GB
NVIDIA GeForce RTX 3090	NVIDIA	S	125.3 tok/s	6.0 GB
NVIDIA GeForce RTX 4090 Founders Edition	NVIDIA	S	134.9 tok/s	6.0 GB

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~39 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	Carnice-9b for Hermes agent on AMD Radeon RX 7600 8GB · ~39 tok/s · 165W	$0.143
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
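
The blended figures follow directly from the listed input and output prices. A short sketch of that arithmetic, using the 70/30 split stated above (the function name is illustrative):

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices by token share."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


# (input, output) prices in USD per 1M tokens, as listed in the table.
API_PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}

for name, (inp, out) in API_PRICES.items():
    print(f"{name}: ${blended_price(inp, out):.3f} per 1M tokens")
```

Agent workloads that re-send long tool outputs on every turn tend to be more input-heavy than 70/30, so it is worth re-running the blend with your own `input_share`.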

Hardware amortisation not included. Run the full ROI calculator for payback math.
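
The energy-only figure can be reconstructed from throughput and power draw. The page does not state its electricity rate; the $0.12/kWh used below is an assumption, chosen because it reproduces the listed $0.143 with the 38.5 tok/s and 165 W figures for the RX 7600.

```python
def energy_cost_per_million(tok_per_s, watts, usd_per_kwh):
    """Electricity cost in USD to generate 1M tokens at a steady rate."""
    seconds = 1_000_000 / tok_per_s
    kwh = (watts / 1000) * (seconds / 3600)
    return kwh * usd_per_kwh


# 38.5 tok/s and 165 W come from this page; 0.12 USD/kWh is assumed.
cost = energy_cost_per_million(tok_per_s=38.5, watts=165, usd_per_kwh=0.12)
print(f"${cost:.3f} per 1M tokens")
```

Substituting your own tariff and measured wall power gives a more realistic number, since a card rarely draws its full rated power during inference.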

Rent in the Cloud

Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.

Option	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA GeForce RTX 3080	Vast.ai · Spot · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 3080	Vast.ai · On-Demand · 10 GB VRAM	$0.03
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070	RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070	RunPod · Spot · 8 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
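
To compare a rental against the per-token figures elsewhere on this page, an hourly rate can be converted to a cost per 1M tokens. The sketch below assumes the GPU generates continuously for the whole billed hour at the directory's estimated speed, which is optimistic for an interactive agent. The RTX 5070 Ti numbers ($0.11/hour, 119.9 tok/s) are taken from this page.

```python
def rental_cost_per_million(usd_per_hour, tok_per_s):
    """USD per 1M generated tokens on a rented GPU kept fully busy."""
    tokens_per_hour = tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


# RTX 5070 Ti: rental rate and estimated speed as listed on this page.
rtx_5070_ti = rental_cost_per_million(usd_per_hour=0.11, tok_per_s=119.9)
print(f"${rtx_5070_ti:.3f} per 1M tokens")
```

Idle time while the agent waits on tools is still billed, so the effective cost rises in proportion to how little of each hour is spent generating.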

About This Model

Overview

Carnice-9b for Hermes agent is a 9-billion-parameter dense model built specifically for the Hermes Agent harness. Created by kai-os, it is a standalone merged checkpoint derived from Qwen/Qwen3.5-9B, but it is not a generic chat model. The training objective was to improve behavior inside Hermes Agent—tool calling, terminal execution, file editing, browser use, and multi-step agent workflows. If you are building autonomous agents with Hermes, this model is tuned to speak its language natively.

The model occupies a narrow niche: local agent execution. It competes with other small models used for agent work, such as NousResearch’s Hermes 3 (8B) and Microsoft’s Phi-3.5-mini (3.8B), but it prioritizes harness-native trajectory quality over generic benchmark optimization. Its Apache 2.0 license makes it practical for commercial deployment and derivative work.

Architecture & Technical Details

Carnice-9b uses a dense transformer architecture with 9 billion parameters. Unlike mixture-of-experts models, every forward pass activates all parameters, which simplifies memory planning and avoids routing overhead. Weight memory is therefore predictable: roughly 18 GB at 16-bit precision (FP16/BF16, 2 bytes per parameter), shrinking roughly in proportion to bits per weight under quantization.
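
The 18 GB figure is parameter count times bytes per parameter. A minimal sketch of that estimate follows; it counts weights only, using decimal gigabytes and nominal bit widths. Real GGUF formats such as Q4_K_M store somewhat more than their nominal bits per weight, and inference adds context cache and runtime overhead, so practical VRAM figures are higher.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(9, bits):.1f} GB")
```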

The model is built on Qwen3.5-9B, but kai-os performed a two-stage fine-tuning process. Stage A was a reasoning repair pass using high-signal reasoning data (Bespoke-Stratos-17k, NuminaMath-CoT) to recover general reasoning ability that can degrade during agent specialization. Stage B—the defining step—was a Hermes-specific refresh pass using harness-native traces and action structure from datasets like kai-os/carnice-glm5-hermes-traces and OpenThoughts-Agent-v1-SFT. The result is a checkpoint that expects Hermes-native message formatting and tool-call patterns, not generic OpenAI-style function definitions.

Context length is not officially specified. As a Qwen3.5 derivative, it should support at least 32K tokens if it inherits the base model’s capacity. Agent trajectories of a few thousand tokens fit well within that, so context is unlikely to be a bottleneck for its intended use.

Capabilities & Use Cases

Carnice-9b excels in three areas: code, reasoning, and function-calling. Its capabilities are not abstract—they are tied to concrete Hermes Agent workflows:

Terminal-heavy execution: The model can compose shell commands, run scripts through the harness, and interpret their output across multiple turns. It was trained on terminal traces, so it is better prepared for real-world edge cases like error recovery and permission prompts.
File editing and structured tool use: It understands read/write/edit operations within a filesystem, making it suitable for coding assistants that modify project files, refactor code, or perform git operations.
Multi-turn tool calling: The model maintains context across tool calls without collapsing into incoherent action sequences. This is critical for browser automation, API chaining, and interactive debugging.
Reasoning for agent decision-making: The reasoning repair stage ensures the model can think through a problem before selecting a tool, rather than blindly guessing tool invocations.

For developers: use this model when you need an agent that reliably calls tools in Hermes format, not when you want a general-purpose chatbot. It is optimized for the Hermes runtime, so if your stack uses Hermes Agent (e.g., for autonomous coding agents, browser agents, or DevOps automation), it is the 9B option built for that purpose.

Running Carnice-9b for Hermes Agent Locally

This is a practical choice for local deployment because 9B dense models fit reasonably on consumer hardware with quantization. Here’s what you need to know:

VRAM requirements (GGUF quantization):

Quant	Size	Minimum VRAM	Recommended VRAM
Q4_K_M (4-bit)	5.3 GB	6 GB	8-12 GB
Q6_K (6-bit)	6.9 GB	8 GB	12 GB
Q8_0 (8-bit)	8.9 GB	10 GB	16 GB

For most users, Q4_K_M offers the best tradeoff between quality and local performance. Q6_K provides a meaningful quality bump if you have 12 GB of VRAM (e.g., RTX 4070 Ti, RTX 3080 12GB) or an Apple Silicon Mac with 16 GB or more of unified memory. Q8_0 is only necessary if you are doing evaluation or need maximum fidelity.

Hardware compatibility:

RTX 4090 (24 GB): Runs Q8_0 comfortably with room for context. Expect 25-40 tokens/second depending on context length.
RTX 3090 (24 GB) / RTX 4080 (16 GB): Run Q6_K or Q8_0. 20-35 tok/s.
RTX 4070 / 3080 12GB: Q4_K_M or Q6_K. 15-25 tok/s.
M4 Max 48 GB (unified): Q8_0 with MLX or llama.cpp. 15-25 tok/s.
M3 Pro 18 GB: Q4_K_M works, but expect 10-15 tok/s.
Apple M2 (8 GB): Not recommended; even Q4_K_M may struggle with overhead.

Quickest way to start: Use Ollama. GGUF versions are available (from kai-os/Carnice-9b-GGUF), and you can pull a quantized variant directly. Alternatively, use llama.cpp or LM Studio for full control. The source checkpoint can also be loaded with Hugging Face Transformers in BF16, which needs a GPU with more than 18 GB of VRAM for the weights alone (in practice a 24 GB card).

Expected throughput varies heavily by hardware, quantization, and context size, and the ranges above are conservative next to the per-device estimates in the Hardware Compatibility table. As a rule of thumb, a 9B dense model at Q4_K_M on a modern GPU provides real-time interactivity (20+ tok/s). For agent execution, the bottleneck is usually tool-call round trips, not raw throughput, so even 10-15 tok/s is acceptable for multi-step tasks.
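
Whether tool round trips or generation speed dominates depends on the workload, and a rough estimate is easy to make. In the sketch below, the step count, tokens per step, and tool latency are hypothetical placeholders rather than measurements of Carnice-9b or Hermes Agent.

```python
def task_seconds(steps, tokens_per_step, tok_per_s, tool_latency_s):
    """Rough wall-clock time for an agent task: generation plus tool waits."""
    generation = steps * tokens_per_step / tok_per_s
    tool_waits = steps * tool_latency_s
    return generation + tool_waits


# Hypothetical workload: 10 tool calls, 60 generated tokens each, 5 s per tool.
slow_gpu = task_seconds(steps=10, tokens_per_step=60, tok_per_s=15, tool_latency_s=5.0)
fast_gpu = task_seconds(steps=10, tokens_per_step=60, tok_per_s=40, tool_latency_s=5.0)
print(f"15 tok/s: {slow_gpu:.0f} s, 40 tok/s: {fast_gpu:.0f} s")
```

With these placeholder values, nearly tripling generation speed shortens the task by less than a third, because the tool waits are unchanged. Workloads with long generated outputs per step shift the balance back toward raw throughput.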

How It Compares

vs. Hermes 3 (NousResearch, 8B): Hermes 3 is a general-purpose instruct model also tuned for tool use, but its training was broader. Carnice-9b has a tighter focus on Hermes Agent formatting and terminal trajectories. If you use Hermes Agent and find Hermes 3 producing awkward tool outputs or failing on multi-step execution, Carnice-9b is the targeted fix. Hermes 3 may perform better on generic reasoning benchmarks, but that is not the metric that matters here.

vs. Phi-3.5-mini (Microsoft, 3.8B): Phi-3.5-mini is smaller and less capable for complex agent workflows. It lacks dedicated agent training and struggles with multi-turn tool sequences. Carnice-9b is the better choice if you need reliable execution over many steps. Phi-3.5-mini wins on VRAM (can run on 4-6 GB) and speed, but not on agent quality.

vs. Qwen2.5-7B-Instruct: General-purpose Qwen instruct models have strong general reasoning but no Hermes-specific agent tuning. Carnice-9b inherits Qwen’s reasoning strength (via the repair stage) but adds harness-native tool behavior. If you are already using Hermes Agent, the tuned version saves you the effort of prompt engineering for tool formatting.

Choose Carnice-9b when your agent stack demands precision in tool invocation and resilience over long execution chains. Do not choose it if you need broad chat performance or multimodal capabilities. For its target use case, it is purpose-built and effective.

Related Models

kai-os

Carnice-V2-27b

27B · Dense

