A community-tuned, fully merged BF16 supervised fine-tune of the Qwen3.6-27B base model. Optimized specifically for Hermes-style agent traces and tool-oriented workflows.
A workable 27B-parameter dense language model from kai-os. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
No benchmark data available for this model yet.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 67.1 GB | Low | |
| Q4_K_MRecommended | 72.8 GB | Good | |
| Q5_K_M | 75.5 GB | Very Good | |
| Q6_K | 78.7 GB | Excellent | |
| Q8_0 | 85.5 GB | Near Perfect | |
| FP16 | 111.1 GB | Full |
See which devices can run this model and at what quality level.
| SS | 40.9 tok/s | 72.8 GB | ||
NVIDIA H200 SXM 141GBNVIDIA | SS | 53.1 tok/s | 72.8 GB | |
| SS | 58.6 tok/s | 72.8 GB | ||
Google TPU v7 (Ironwood)Google | SS | 81.6 tok/s | 72.8 GB | |
NVIDIA B200 GPUNVIDIA | SS | 88.5 tok/s | 72.8 GB | |
| SS | 66.4 tok/s | 72.8 GB | ||
| SS | 88.5 tok/s | 72.8 GB | ||
Google Cloud TPU v5pGoogle | SS | 30.6 tok/s | 72.8 GB | |
| SS | 78.5 tok/s | 72.8 GB | ||
| SS | 78.5 tok/s | 72.8 GB | ||
Gigabyte W775-V10-L01Gigabyte | SS | 78.5 tok/s | 72.8 GB | |
| SS | 78.5 tok/s | 72.8 GB | ||
| SS | 78.5 tok/s | 72.8 GB | ||
SuperMicro Super AI StationSuperMicro | SS | 78.5 tok/s | 72.8 GB | |
| AA | 27.1 tok/s | 72.8 GB | ||
NVIDIA H100 SXM5 80GBNVIDIA | AA | 37.1 tok/s | 72.8 GB | |
| BB | 8.8 tok/s | 72.8 GB | ||
| BB | 6.8 tok/s | 72.8 GB | ||
| BB | 6.8 tok/s | 72.8 GB | ||
| BB | 6.8 tok/s | 72.8 GB | ||
| BB | 6.0 tok/s | 72.8 GB | ||
| BB | 6.0 tok/s | 72.8 GB | ||
| BB | 6.0 tok/s | 72.8 GB | ||
| BB | 6.0 tok/s | 72.8 GB | ||
| BB | 5.7 tok/s | 72.8 GB |
Carnice-V2-27b is a community-tuned 27B parameter dense language model built by kai-os (Kai Stephens) as a full BF16 supervised fine-tune of the Qwen3.6-27B base. It is designed specifically for Hermes-style agentic workflows — think structured tool calls, multi-step reasoning chains, and instruction-heavy interactions where the model must follow precise formatting rules and return function calls in a defined schema.
This is not a general-purpose chat model with a thin instruction layer. It is a merged SFT (not a LoRA adapter) that was trained on a curated mix of 1,508 Carnice-specific agent traces, 1,015 DJLougen Hermes rows, and 950 Lambda GLM-5.1 Hermes rows. The training used assistant-token-only loss masking with 8,192-token windows (1,024 token overlap), yielding 6,554 training windows from 3,473 source rows. The result is a model that competes directly with other 27B-class agent-oriented models (e.g., NousResearch Hermes 2 34B, Qwen3.6-27B base) but with documented gains in instruction-following accuracy and lower held-out loss.
For developers building local agent frameworks — especially those using function-calling or tool-use protocols — Carnice-V2-27b offers a drop-in upgrade over the Qwen3.6 base without changing architecture or hardware requirements.
Carnice-V2-27b is a dense transformer model with 27B active parameters. Unlike mixture-of-experts architectures where only a subset of parameters route per token, dense models use all 27B parameters on every forward pass. The tradeoff is straightforward: you get consistent, predictable inference speeds at the cost of higher VRAM usage relative to MoE models of similar total parameter count (e.g., Mixtral 8x7B has ~47B total but only ~13B active).
Key architectural specs:
qwen35 in GGUF) The 262K context window is a significant capability for agentic workflows that require long conversation histories, multi-turn tool calls, or document-level context. In practice, most local hardware will context-length limited by VRAM, but the model supports full-length context if you have the memory (e.g., 48GB+ GPUs or CPU offload).
A note on the BF16 loading fix: the original merged weights had an extra Unsloth wrapper prefix that caused Hugging Face Transformers to treat real weights as “unexpected”. The repo now has corrected safetensors keys, so AutoModelForImageTextToText (or AutoModelForCausalLM for text-only) loads cleanly. GGUF exports were never affected.
Carnice-V2-27b is optimized for structured, tool-oriented agent interactions. Its primary capabilities — chat, code, reasoning, function-calling, instruction-following — map directly to these use cases:
<functioncall> tags), Carnice-V2-27b was built exactly for that. The SFT data heavily weights properly formatted tool invocations with required parameters.It is not a multimodal model — the image-text-to-text tag on Hugging Face is misleading; the weights are Qwen3.6-27B base (text-only) and the pipeline tag inherited from the base repo. Use it for text-only agent applications.
| Quantization | File Size | Minimum VRAM (16-bit) | Recommended GPU |
|---|---|---|---|
| IQ2_M | 9.4 GB | ~10 GB | RTX 4060 Ti 16GB, RTX 3060 12GB |
| Q2_K | 10.0 GB | ~11 GB | RTX 4060 Ti 16GB |
| Q4_K_M | 16 GB | ~18 GB | RTX 3090 / 4070 Ti Super (16GB with CPU offload) |
| Q5_K_M | 18 GB | ~20 GB | RTX 3090 24GB / A4000 16GB (offload) |
| Q8_0 | 27 GB | ~30 GB | RTX 4090 24GB (with CPU offload or split) |
| BF16 GGUF | 51 GB | ~54 GB | Dual 4090 / 2x A6000 |
For a 16GB GPU (RTX 4060 Ti 16GB, RTX 4080 Super): start with IQ2_M. This quantization uses an imatrix calibration pass tuned for Carnice/Hermes data, so it retains more quality than generic Q2_K. If your runtime (e.g., older llama.cpp) doesn’t support IQ quants, fall back to Q2_K (10GB). You can run with -c 8192 comfortably.
For a 24GB GPU (RTX 4090, 3090): Q4_K_M is the sweet spot — 16GB file, fits entirely on VRAM with 8K context. You can push to Q5_K_M (18GB) if you reduce context to 4K-6K or enable partial CPU offload for the KV cache.
For multi-GPU or 48GB+: Q8_0 or BF16 GGUF offers near-lossless quality.
These are rough estimates on an RTX 4090 (24GB) with q4_K_M quantization, 8K context:
Actual throughput varies by backend (llama.cpp, Ollama, ExLlamaV2), prompt length, and context size. Longer contexts will slow down due to KV cache overhead.
The quickest way: Ollama supports this model via the GGUF quantizations at kai-os/Carnice-V2-27b-GGUF.
1ollama run kai-os/carnice-v2-27b:q4_K_M
If you need function-calling support, ensure your Ollama version is recent enough to handle the qwen35 GGUF architecture (hybrid attention/SSM). For direct Python usage, use Hugging Face Transformers with AutoModelForCausalLM and the corrected safetensors.
vs. Qwen3.6-27B (base)
The base model is a capable general-purpose agent, but Carnice-V2-27b shows clear improvements in instruction-following (IFEval +5% strict) and lower assistant-token perplexity (1.513 vs 1.835). If you’re already using Qwen3.6-27B for agent workloads, expect fewer formatting errors and better adherence to tool-calling schemas. The tradeoff: the SFT is specialized — for pure code completion or open-ended chat, the base may feel more “creative” or less constrained.
vs. NousResearch Hermes 2 34B
Hermes 2 34B (based on Yi-34B) is a 34B dense model with similar agentic focus. Carnice-V2-27b is smaller (27B vs 34B) and runs on lower VRAM (Q4_K_M 16GB vs ~20GB for Hermes 2 34B Q4_K_M). Benchmarks are not directly comparable, but the held-out loss metric suggests Carnice-V2-27b is better regularized for agent traces. If you’re constrained to 16GB GPUs, Carnice is a better fit; if you have 24GB+ and want maximum capability, Hermes 2 34B is worth trying side-by-side.
vs. Mistral Small 3.1 24B
A 24B dense model with 128K context. Carnice-V2-27b has more parameters and a larger context window (262K vs 128K). For pure instruction-following, Carnice’s Hermes-focused training likely outperforms Mistral’s general SFT. However, Mistral Small 3.1 is more multilingual and works better out-of-box for general chat. Choose Carnice if your primary use case is structured tool use; choose Mistral Small 3.1 if you need broader language coverage.
Energy cost on NVIDIA A100 SXM4 80GB (~23 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)Carnice-V2-27b on NVIDIA A100 SXM4 80GB · ~23 tok/s · 400W | $0.591 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
Cheapest current cloud rentals with at least 73 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA A100 80GB SXMVast.ai · On-Demand · 80 GB VRAM | $0.27 |
AMD Instinct MI300XRunPod · Community · 192 GB VRAM | $0.50 |
NVIDIA H200 NVLRunPod · Community · 141 GB VRAM | $0.50 |
NVIDIA A100 80GB SXMVast.ai · Spot · 80 GB VRAM | $0.53 |
NVIDIA H100 SXMVast.ai · Spot · 80 GB VRAM | $1.07 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.