kai-os

Carnice-V2-27b

A community-tuned, fully merged BF16 supervised fine-tune of the Qwen3.6-27B base model. Optimized specifically for Hermes-style agent traces and tool-oriented workflows.

27B params · Dense · 262K ctx


Our Take

Best for: Workstation-class chat and code with usable context

A solid 27B-parameter dense language model from kai-os. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.

Run this on: Google Cloud TPU v5p. Cheapest card in our directory with comfortable headroom (95 GB) for this model at Q4 (~72.8 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Instruction Following

Model Specifications

Parameters: 27B

Architecture: Dense

Context Length: 262K tokens

Modality: Text Only

Provider: kai-os

Download Size: 106.9 GB

Community

Monthly Downloads: 140

Likes: 67

Last Updated: 2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

57.3 BB

Component	Weight	Score
Benchmark	40%	89.0
Popularity	25%	2.1
Efficiency	20%	59.2
Versatility	15%	62.5
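The overall score appears to be a weighted sum of the four component scores. The following is a minimal sketch that checks this reading; the aggregation rule itself is an assumption, since the page lists only the weights and the per-component values.

```python
# Weights and component scores exactly as listed on this page.
COMPONENTS = {
    "Benchmark": (0.40, 89.0),
    "Popularity": (0.25, 2.1),
    "Efficiency": (0.20, 59.2),
    "Versatility": (0.15, 62.5),
}


def open_score(components):
    """Weighted sum of component scores (assumed aggregation rule)."""
    return sum(weight * score for weight, score in components.values())


score = open_score(COMPONENTS)
print(f"Weighted sum: {score:.2f}")
```

Under that assumption the sum comes to about 57.34, which agrees with the published 57.3 after rounding to one decimal place.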

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	67.1 GB	Low	Aggressive quantization; smallest size, noticeable quality loss
Q4_K_M (recommended)	72.8 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	75.5 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	78.7 GB	Excellent	Near-lossless quality with manageable size
Q8_0	85.5 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	111.1 GB	Full	Full 16-bit floating point; maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Speed	VRAM Required
Intel Gaudi 3 AI Accelerator	Intel	SS	40.9 tok/s	72.8 GB
NVIDIA H200 SXM 141GB	NVIDIA	SS	53.1 tok/s	72.8 GB
AMD Instinct MI300X	AMD	SS	58.6 tok/s	72.8 GB
Google TPU v7 (Ironwood)	Google	SS	81.6 tok/s	72.8 GB
NVIDIA B200 GPU	NVIDIA	SS	88.5 tok/s	72.8 GB
AMD Instinct MI325X	AMD	SS	66.4 tok/s	72.8 GB
AMD Instinct MI355X	AMD	SS	88.5 tok/s	72.8 GB
Google Cloud TPU v5p	Google	SS	30.6 tok/s	72.8 GB
ASUS ExpertCenter Pro ET900N G3	ASUS	SS	78.5 tok/s	72.8 GB
Dell Pro Max with GB300	Dell	SS	78.5 tok/s	72.8 GB
Gigabyte W775-V10-L01	Gigabyte	SS	78.5 tok/s	72.8 GB
HP ZGX Fury AI Station	HP	SS	78.5 tok/s	72.8 GB
MSI XpertStation WS300	MSI	SS	78.5 tok/s	72.8 GB
SuperMicro Super AI Station	SuperMicro	SS	78.5 tok/s	72.8 GB
Intel Gaudi 2 AI Accelerator	Intel	AA	27.1 tok/s	72.8 GB
NVIDIA H100 SXM5 80GB	NVIDIA	AA	37.1 tok/s	72.8 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	BB	8.8 tok/s	72.8 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	BB	6.8 tok/s	72.8 GB
MacBook Pro 16-inch M5 Max (2026)	Apple	BB	6.8 tok/s	72.8 GB
MacBook Pro 16" M5 Max (2026)	Apple	BB	6.8 tok/s	72.8 GB
Apple M4 Max (40-core GPU)	Apple	BB	6.0 tok/s	72.8 GB
Apple Mac Studio (M4 Max, 2025)	Apple	BB	6.0 tok/s	72.8 GB
MacBook Pro 14-inch M4 Max (2024)	Apple	BB	6.0 tok/s	72.8 GB
MacBook Pro 16" M4 Max (2024)	Apple	BB	6.0 tok/s	72.8 GB
Corsair AI Workstation 300 (Ryzen AI Max+ 395)	Corsair	BB	5.7 tok/s	72.8 GB


Run Locally vs API

Energy cost on NVIDIA A100 SXM4 80GB (~23 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	Carnice-V2-27b on NVIDIA A100 SXM4 80GB · ~23 tok/s · 400W	$0.591
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
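As a rough check on the table, the sketch below recomputes the blended API prices from the stated 70/30 split and estimates the energy needed to generate 1M tokens locally at the quoted throughput and power draw. The function names are illustrative, and the electricity rate is not stated on this page, so the code reports kWh rather than dollars.

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices at the stated 70/30 split."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


def energy_kwh_per_million(tokens_per_sec, watts):
    """Energy in kWh to generate 1M tokens at a steady rate and power draw."""
    seconds = 1_000_000 / tokens_per_sec
    return watts * seconds / 3_600_000  # watt-seconds to kilowatt-hours


# (input, output) prices per 1M tokens, copied from the table above.
API_PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}

for name, (inp, out) in API_PRICES.items():
    print(name, blended_price(inp, out))

print("kWh per 1M tokens:", energy_kwh_per_million(23, 400))
```

The blended figures match the table to within rounding. The local run works out to roughly 4.8 kWh per 1M tokens, so the $0.591 figure would correspond to an electricity rate of about $0.12 per kWh; that rate is inferred here and is not given by the source.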

Hardware amortisation not included. Run the full ROI calculator for payback math.


Rent in the Cloud

Cheapest current cloud rentals with at least 73 GB VRAM, refreshed hourly.

Option	Provider · Tier · VRAM	Cost / GPU-hour
AMD Instinct MI300X	RunPod · Community · 192 GB VRAM	$0.50
NVIDIA H200 NVL	RunPod · Community · 141 GB VRAM	$0.50
NVIDIA A100 80GB PCIe	Vast.ai · Spot · 80 GB VRAM	$0.59
NVIDIA A100 80GB PCIe	Vast.ai · On-Demand · 80 GB VRAM	$0.59
NVIDIA H100 SXM	Vast.ai · Spot · 80 GB VRAM	$1.07

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
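To compare a rental against the per-token figures elsewhere on this page, an hourly rate can be converted into a cost per 1M generated tokens. The sketch below assumes a single generation stream running continuously at the throughput estimates from the hardware compatibility table; real rented instances may be faster or slower, and idle time is not counted.

```python
def rental_cost_per_million(usd_per_hour, tokens_per_sec):
    """Rental cost in USD to generate 1M tokens at a steady throughput."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return usd_per_hour * hours


# Hourly rates from the rental table; throughput from the hardware table.
mi300x = rental_cost_per_million(0.50, 58.6)
h100 = rental_cost_per_million(1.07, 37.1)
print(f"MI300X: ${mi300x:.2f} per 1M tokens")
print(f"H100 SXM: ${h100:.2f} per 1M tokens")
```

Under these assumptions the MI300X comes to roughly $2.37 and the H100 SXM to roughly $8.01 per 1M tokens. Batching several requests on the same GPU would lower both figures.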


About This Model

Overview

Carnice-V2-27b is a full-merge BF16 supervised fine-tune of Qwen3.6-27B, created by Kai Stephens (kai-os) specifically for Hermes-style agent traces and tool-oriented workflows. This is not a LoRA adapter or a partial merge — it’s a standalone set of weights you can load directly into Transformers or convert to GGUF for local inference. The model targets practitioners building agent frameworks that rely on function calling, structured instruction following, and multi-turn reasoning, rather than general-purpose chat.

At 27 billion parameters, Carnice-V2-27b occupies a sweet spot between the 7B–9B class models that fit on consumer hardware and the 70B+ models that require enterprise-level memory. It is a dense architecture (no MoE), meaning all 27B parameters are active for every token. This gives predictable VRAM usage and consistent inference latency, but it also means you need enough GPU memory to hold the full model weight footprint — or use quantization to reduce it.
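Because every parameter in a dense model must be stored, the weight footprint follows directly from the parameter count and the bits used per weight. The sketch below makes that arithmetic explicit; the bits-per-weight values are approximate figures for llama.cpp quantization types and should be read as assumptions, not exact numbers for this model's files.

```python
# Approximate bits per weight for common GGUF quantization types.
# These are rough figures used for illustration, not measured values.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "BF16": 16.0,
}


def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes), excluding KV cache."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bpw in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_footprint_gb(27e9, bpw):.1f} GB")
```

The estimate covers weights only. The KV cache and runtime buffers come on top and grow with context length, which is why a model that nominally fits can still run out of memory on long prompts.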

The model is licensed Apache 2.0, making it safe for commercial use and derivative works.

Architecture & Technical Details

Carnice-V2-27b inherits the Qwen3.6 architecture: a dense transformer with 27B parameters, using hybrid attention and SSM layers (selective state-space model) blended into the transformer blocks. This hybrid design appears in the GGUF as the qwen35 architecture type, so you need a recent llama.cpp build (late 2025 or newer) to load Q4+ quantized versions correctly.

Context length: 262,144 tokens. This is the full native context of the base Qwen3.6 model, preserved in the SFT merge. For agent traces that span long conversation histories, tool call chains, or large document contexts, this headroom is valuable. However, on consumer GPUs you will typically run out of VRAM before hitting that limit: practical context for a 24GB card at Q4_K_M is around 8,000–16,000 tokens, depending on batch size.

Training details: The SFT used 3,473 training rows, split into 6,554 windows of 8,192 tokens each with a 1,024-token overlap. Loss was computed on assistant tokens only. The source mix includes 1,508 Carnice-specific agent traces, 1,015 rows from DJLougen’s Hermes dataset, and 950 rows from Lambda GLM-5.1 Hermes runs. This targeted data selection is the main reason the model outperforms the base Qwen3.6-27B on instruction-following and tool-use benchmarks.
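The windowing and loss masking described above can be sketched as follows. This is an illustration of the general technique under the stated window and overlap sizes, not the author's training code; the function names are hypothetical, and the ignore value of -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def window_tokens(token_ids, window=8192, overlap=1024):
    """Split a token sequence into fixed-size windows that overlap."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if not token_ids:
        return []
    stride = window - overlap  # 7,168 tokens with the sizes above
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the row
        start += stride
    return windows


def mask_non_assistant(token_ids, is_assistant):
    """Build labels that keep assistant tokens and ignore everything else."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(token_ids, is_assistant)]


row = list(range(20_000))  # stand-in for one tokenized training row
chunks = window_tokens(row)
print(len(chunks), [len(c) for c in chunks])
print(mask_non_assistant([5, 6, 7], [False, True, True]))
```

Each window after the first repeats the last 1,024 tokens of its predecessor, so a long trace is never cut without shared context on both sides of the boundary.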

BF16 loading fix: Early weights had an extra Unsloth wrapper prefix in the safetensors. The repo was corrected — current BF16 files load cleanly with AutoModelForCausalLM. If you are using an older snapshot, check the commit history. GGUF exports were never affected because the conversion path normalized those prefixes automatically.

Capabilities & Use Cases

Carnice-V2-27b is built for one thing: running agent loops that require structured tool calls, function definitions, and multi-step reasoning. It is a text-only model — no vision or multimodal support — so it belongs in pure NLP agent pipelines.

Function-calling: The model was trained on Hermes-format agent traces, so it is tuned to emit JSON tool calls, handle parallel function calls, and follow system-level tool definitions. This makes it a drop-in replacement for Hermes-2-Pro or Nous-Hermes-2 in agentic harnesses such as the hermes-agent framework.
Instruction-following: IFEval scores improved by 3–5 points over the base model (90% prompt-level strict accuracy, run with a limit of 20 prompts). For structured tasks like data extraction, formatting, or schema generation, Carnice follows instructions more consistently.
Reasoning and code: While not a specialized code model, its reasoning capability is solid for GPT-like agent reasoning chains — e.g., planning a sequence of API calls or decomposing a user request into function-eligible subproblems. Expect competitive performance on standard code generation and explanation tasks relative to other 27B SFTs.
Chat: General conversation is not the primary target. The model can handle chat, but its strengths are in tool-oriented, multi-turn assistant roles. If you need a pure conversationalist, consider the base Qwen3.6 or a general-purpose fine-tune.
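On the harness side, Hermes-style tool calls are usually parsed out of the model's text output. The sketch below assumes the convention used by Hermes-2-Pro, where each call is a JSON object with `name` and `arguments` keys wrapped in `<tool_call>` tags; whether Carnice emits exactly this wrapper should be confirmed against the model card, and `extract_tool_calls` is a hypothetical helper name.

```python
import json
import re

# Matches each <tool_call>...</tool_call> block, including multi-line JSON.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of {'name', 'arguments'} dicts found in model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of failing the whole turn
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


sample = (
    'Checking both sources.\n'
    '<tool_call>\n{"name": "search", "arguments": {"query": "gguf"}}\n</tool_call>\n'
    '<tool_call>\n{"name": "read_file", "arguments": {"path": "notes.txt"}}\n</tool_call>'
)
for call in extract_tool_calls(sample):
    print(call["name"], call["arguments"])
```

Returning a list instead of a single call keeps parallel function calls intact, and skipping malformed JSON lets the harness ask the model to retry rather than crash mid-loop.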

Concrete use cases: building a local coding assistant that calls a search API, a calculator tool, and a file reader sequentially; deploying a data pipeline agent that transforms CSV rows based on LLM-written transformations; running a personal research assistant that queries a local RAG database via function calls.

Running Carnice-V2-27b Locally

This is a 27B dense model. VRAM is the primary constraint. Below are the practical configurations for consumer GPUs, based on available GGUF quantizations from the provider’s official repo (kai-os/Carnice-V2-27b-GGUF).

Quantization	File Size	Minimum GPU VRAM	Recommended Hardware	Expected Tokens/sec (estimate)
IQ2_M	9.4 GB	16 GB	RTX 4060 Ti 16GB, M4 Max 24GB (offload)	15–25 t/s
Q2_K	10 GB	16 GB	Same, fallback for older runtimes	15–25 t/s
Q4_K_M	16 GB	20–24 GB	RTX 3090/4090, RTX 4070 Ti Super	20–35 t/s
Q5_K_M	18 GB	24 GB+	RTX 4090, M4 Max 40+GB	15–25 t/s
Q8_0	27 GB	32 GB+	Dual GPU, or CPU offload	10–20 t/s
BF16	51 GB	60 GB+	A6000, multi-GPU, or CPU-only	5–10 t/s

Best starting point: For a single 16GB GPU (RTX 4060 Ti 16GB, M4 Max 24GB unified memory), use carnice-v2-27b-IQ2_M.gguf (9.4 GB) if your runtime supports IQ quants. If not, Q2_K is the safe fallback. For a 24GB GPU (RTX 4090), Q4_K_M at 16 GB leaves 8 GB for context and KV cache. This delivers the best quality‑to‑speed ratio for agent workloads with 8–16K context.

Longer context on Q4_K_M: With a 24GB card you can often run 32K context by reducing batch size or offloading some layers to CPU, but expect a slowdown. For long agent traces, keep context under 16K unless you have 48GB+.

Software: The fastest way to start is Ollama. As of Ollama 0.6+, you can import the GGUF directly:

ollama create Carnice-V2-27b -f Modelfile

where Modelfile points to the local GGUF path and sets the system prompt for Hermes-style tool use. Alternatively, use llama-cli from a recent llama.cpp build (commit after July 2025). Ensure you pass -ngl all for full GPU offload.
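A Modelfile for this setup needs little more than a `FROM` line, a context size, and a system prompt. The sketch below generates one as text; the GGUF filename is the IQ2_M file mentioned above, while the system prompt wording and the 16,384-token context are placeholder assumptions to adjust for your harness.

```python
def build_modelfile(gguf_path, system_prompt, num_ctx=16384):
    """Return Ollama Modelfile text for a local GGUF with a system prompt."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER num_ctx {num_ctx}",
        f'SYSTEM """{system_prompt}"""',
    ]
    return "\n".join(lines) + "\n"


modelfile_text = build_modelfile(
    "./carnice-v2-27b-IQ2_M.gguf",
    # Placeholder prompt; replace with the tool schema your harness expects.
    "You are a function-calling assistant. Use the provided tools when needed.",
)
print(modelfile_text)
```

Save the printed text as `Modelfile` next to the GGUF before running the `ollama create` command. Setting `num_ctx` explicitly matters because the runtime default context is much smaller than what agent traces typically need.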

Performance notes: Tokens per second vary widely with context length, quantization, and GPU memory bandwidth. On an RTX 4090 with Q4_K_M, expect 25–35 t/s for a single-turn prompt. With a 16K context, speed drops to 15–20 t/s due to KV cache pressure. The hybrid SSM/attention layers are slightly more compute-intensive than pure attention, but the difference is marginal in practice.

How It Compares

vs. Qwen3.6-27B (base model): Carnice-V2-27b is a targeted SFT of this base. The IFEval improvements are modest (3–5 points) on a short smoke benchmark, but the real gain is in agentic behavior: the model emits tool calls more reliably, follows function definitions with fewer hallucinations, and maintains chain-of-thought structure over long traces. If you only need general chat and coding, the base Qwen3.6 is a strong choice at the same memory footprint, since both models share the same architecture and parameter count. For any deployment that uses a Hermes-style agent harness, Carnice is the better pick.

vs. other Hermes-style SFT models (e.g., Nous-Hermes-2-Mixtral-8x7B): Mixtral-8x7B is a MoE model with roughly 47B total parameters, of which ~13B are active per token, so it behaves differently in terms of latency and memory. Per-token compute is lower, but all experts must stay resident, so its Q4 weights (around 26 GB) need more memory than Carnice's. Carnice uses all 27B parameters per token, which gives higher quality per token at a higher compute cost. For agentic tasks, Carnice’s specialized training on tool traces gives it an edge over general-purpose SFTs like Nous-Hermes-2. If per-token latency matters more than memory footprint, the MoE alternative may be more practical.

Bottom line: Choose Carnice-V2-27b when you need a 27B dense model that reliably executes tool‑calling workflows on hardware that can handle its memory demand. For generic coding or conversation, consider the base Qwen3.6 or a MoE model.

Related Models

kai-os

Carnice-9b for Hermes agent

9B · Dense
