Cohere's first model built for developers, a sparse Mixture-of-Experts coder with 30B total parameters and 3B active per token (128 experts, 8 activated). It handles a 256K-token context with up to 64K tokens of output and ships open-weight under Apache 2.0. The model targets agentic software engineering and scores 67.6 on SWE-Bench Verified, 40.2 on SWE-Bench Pro, 36 on Terminal-Bench v2, and 33.4 on the Artificial Analysis Coding Index. Cohere reports it runs on a single H100 and delivers up to 2.8x higher output throughput than Devstral Small 2.
A workable 30B-parameter MoE language model from Cohere. Pulls ahead on graduate-level reasoning (GPQA) (76/100), so reach for it when that's the dimension that matters. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 7.8 GB | Low | |
| Q4_K_MRecommended | 8.4 GB | Good | |
| Q5_K_M | 8.7 GB | Very Good | |
| Q6_K | 9.0 GB | Excellent | |
| Q8_0 | 9.8 GB | Near Perfect | |
| FP16 | 12.6 GB | Full |
See which devices can run this model and at what quality level.
| SS | 49.2 tok/s | 8.4 GB | ||
| SS | 41.5 tok/s | 8.4 GB | ||
| SS | 59.9 tok/s | 8.4 GB | ||
| SS | 61.5 tok/s | 8.4 GB | ||
| SS | 61.5 tok/s | 8.4 GB | ||
Google Cloud TPU v5eGoogle | SS | 78.6 tok/s | 8.4 GB | |
Intel Arc A770 16GBIntel | SS | 53.8 tok/s | 8.4 GB | |
Intel Arc B580Intel | SS | 43.8 tok/s | 8.4 GB | |
| SS | 92.2 tok/s | 8.4 GB | ||
NVIDIA GeForce RTX 4070NVIDIA | SS | 48.4 tok/s | 8.4 GB | |
| SS | 48.4 tok/s | 8.4 GB | ||
| SS | 64.5 tok/s | 8.4 GB | ||
| SS | 70.7 tok/s | 8.4 GB | ||
| SS | 43.0 tok/s | 8.4 GB | ||
NVIDIA GeForce RTX 5070NVIDIA | SS | 64.5 tok/s | 8.4 GB | |
| SS | 86.0 tok/s | 8.4 GB | ||
| SS | 92.2 tok/s | 8.4 GB | ||
| SS | 76.8 tok/s | 8.4 GB | ||
| SS | 92.2 tok/s | 8.4 GB | ||
NVIDIA GeForce RTX 3090NVIDIA | SS | 89.9 tok/s | 8.4 GB | |
| SS | 96.8 tok/s | 8.4 GB | ||
| SS | 157.5 tok/s | 8.4 GB | ||
| SS | 172.1 tok/s | 8.4 GB | ||
Origin PC M-CLASS v2Origin PC | SS | 172.1 tok/s | 8.4 GB | |
NVIDIA L40SNVIDIA | SS | 83.0 tok/s | 8.4 GB |
Energy cost on AMD Radeon RX 7600 8GB (~28 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)North Mini Code on AMD Radeon RX 7600 8GB · ~28 tok/s · 165W | $0.199 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
Cheapest current cloud rentals with at least 8 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA L4Vast.ai · Spot · 24 GB VRAM | $0.03 |
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM | $0.04 |
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM | $0.09 |
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM | $0.10 |
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
North Mini Code is Cohere’s first model built specifically for developers who need to run agentic coding workloads locally or in private infrastructure. At 30B total parameters with only 3B active per token, it’s a sparse Mixture-of-Experts (MoE) model that delivers coding performance competitive with much larger dense models while keeping inference cost low. Released under Apache 2.0, the weights are fully open and available on Hugging Face in bf16, FP8, and W4A16 quantizations.
This model targets a specific gap: local code agents that can reason about software engineering tasks, orchestrate sub-agents, and handle 256K-token contexts without relying on cloud APIs. It scores 67.6 on SWE-Bench Verified, 40.2 on SWE-Bench Pro, and 36 on Terminal-Bench v2. On the Artificial Analysis Coding Index, it posts 33.4 — competitive with Qwen3.6 35B A3B and ahead of Devstral Small 2, Gemma 4, and even larger models like Mistral Small 4.
North Mini Code is not a general-purpose chatbot. It is a coding agent engine.
The model uses a MoE architecture with 128 experts, of which 8 are activated per token. The 30B total parameters are sparsely distributed — only 3B participate in any forward pass. This has two practical consequences:
Context length is 256K tokens total, with a maximum generation of 64K tokens. That’s enough to dump an entire codebase into context and still have headroom for multi-turn debugging and patch generation. The architecture pair 30B total / 3B active is a deliberate tradeoff: you get the capacity of a 30B model (knowledge of libraries, language idioms, reasoning patterns) while paying the compute cost of a 3B model only at inference time.
The model is text-only. No vision, no audio. It’s optimized for code generation, function-calling, and structured reasoning.
Cohere recommends running on a single H100 at FP8 (minimum), with FP4 as a lighter option. The weights are available in bf16, FP8, and W4A16 formats on Hugging Face.
North Mini Code excels at tasks that require tool use, long-context reasoning, and multi-step code changes. Its primary strengths:
Where it underperforms: non-coding agentic tasks. On GDPval-AA (a generalist agent benchmark) it scores 14%, and on τ²-Bench Telecom it scores 37%. If your workload is purely conversational or domain-specific outside software engineering, this isn’t the best model for the job.
For developers building code agents — PR reviewers, bug-fixing bots, CI/CD assistants, or autonomous refactoring tools — North Mini Code is purpose-built.
The key question for practitioners: can I run this on my hardware? The answer depends on quantization.
| Quantization | Approximate VRAM | Notes |
|---|---|---|
| bf16 (full) | ~60 GB | 30B parameters × 2 bytes. Requires datacenter GPU (H100, A100). |
| FP8 | ~30 GB | 1 byte per parameter. Single H100 at FP8 is the minimum recommendation. |
| W4A16 (4-bit weights, 16-bit activations) | ~16 GB | Fits on an RTX 4090 (24 GB). This is the most practical setup for a consumer GPU. |
| Q4_K_M (GGUF format, 4-bit) | ~16–18 GB | Typical GGUF memory usage. Works on RTX 4090, M4 Max (64 GB unified), or dual RTX 3090s. |
Start with Ollama. The North Mini Code model is available in GGUF format and can be pulled with ollama pull cohere/north-mini-code. This is the fastest way to test on consumer GPUs. For production agent workloads, consider deploying the FP8 version on a single H100 via vLLM or SGLang.
At FP8 on an H100, Cohere reported ~199 output tokens/sec on the Artificial Analysis evaluation suite. On an RTX 4090 with Q4_K_M, you’ll get roughly 10–20 tok/s — slower, but still viable for non-real-time agent loops where latency is less critical than throughput over many calls.
Devstral Small 2 is a dense 24B model from the same agentic coding niche. Cohere’s internal tests show North Mini Code delivers up to 2.8x higher output throughput under identical concurrency, with 30% better inter-token latency. On the Artificial Analysis Coding Index, North Mini Code (33.4) beats Devstral Small 2 (slightly above 25, per Cohere’s figure). If you’re choosing between the two and have a single H100, North Mini Code gives better throughput and coding benchmarks.
Tradeoff: Devstral Small 2 may have better general-purpose reasoning or broader knowledge, since it’s a larger dense model. For agentic coding specifically, North Mini Code is ahead.
Qwen3.6 35B A3B is the same MoE size class (35B total, 3.6B active). It scores 35.2 on the Artificial Analysis Coding Index — slightly above North Mini Code’s 33.4. It also has broad multilingual support and strong general reasoning. North Mini Code is faster (199 tok/s on Cohere’s API vs Qwen3.6’s typical 120–150), and it’s specifically tuned for agentic tool use.
Tradeoff: Choose Qwen3.6 if you need better general coding scores and multilingual coverage. Choose North Mini Code if you need maximum throughput for a single-agent loop, especially on constrained hardware.
This model does not replace a 70B dense coder like DeepSeek-Coder-V2, but it occupies a sweet spot: near-state-of-the-art coding performance at a fraction of the resource cost.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Cohere model we track.