Cohere

North Mini Code

Cohere's first model built for developers, a sparse Mixture-of-Experts coder with 30B total parameters and 3B active per token (128 experts, 8 activated). It handles a 256K-token context with up to 64K tokens of output and ships open-weight under Apache 2.0. The model targets agentic software engineering and scores 67.6 on SWE-Bench Verified, 40.2 on SWE-Bench Pro, 36 on Terminal-Bench v2, and 33.4 on the Artificial Analysis Coding Index. Cohere reports it runs on a single H100 and delivers up to 2.8x higher output throughput than Devstral Small 2.

30B params · MoE · 256K ctx

Our Take

Best for: graduate-level reasoning (GPQA), where it is the strongest model in its size class.

A workable 30B-parameter MoE language model from Cohere. It pulls ahead on graduate-level reasoning (GPQA, 76/100), so reach for it when that is the dimension that matters. It is newly released, so production readiness is still unproven.

Run this on: AMD Radeon RX 7700 XT. Cheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~8.4 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Code Generation

Reasoning

Function Calling

Instruction Following

Model Specifications

Parameters: 30B

Active Params: 3B

Architecture: MoE

Context Length: 256K tokens

Modality: Text Only

Provider: Cohere

Download Size: 61.0 GB

Community

Monthly Downloads: 26.0K

Likes: 488

Last Updated: 11 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Apache 2.0

Performance & Scoring

Benchmarks

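[Benchmark chart; series labels not preserved. Values: 75.7, 67.6, 9.9, 36.0, 40.2]

AA Intelligence Index

[AA Intelligence Index chart; series labels not preserved. Values: 20.6, 38.2, 31.1, 57.5, 32.3]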

MBA Open Score

49.2 (grade C)

Component	Weight	Score
Benchmark	40%	40.9
Popularity	25%	23.9
Efficiency	20%	87.3
Versatility	15%	62.5
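The composite can be reproduced from the four components listed above. The following is a minimal sketch, assuming the MBA Open Score is a plain weighted sum of the component scores; the names are illustrative and not the site's own code.

```python
# Weights and component scores copied from the listing above.
# Assumption: the composite is a plain weighted sum of the components.
COMPONENTS = {
    "benchmark": (0.40, 40.9),
    "popularity": (0.25, 23.9),
    "efficiency": (0.20, 87.3),
    "versatility": (0.15, 62.5),
}


def open_score(components):
    """Return the weighted sum of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * score for weight, score in components.values())


print(round(open_score(COMPONENTS), 1))  # 49.2
```

Under that assumption the weighted sum is 49.17, which rounds to the 49.2 shown.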

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	7.8 GB	Low	Aggressive quantization: smallest size, noticeable quality loss
Q4_K_M (recommended)	8.4 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	8.7 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	9.0 GB	Excellent	Near-lossless quality with manageable size
Q8_0	9.8 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.6 GB	Full	Full 16-bit floating point: maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


Device	Vendor	Tier	Speed	VRAM Required
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	S	49.2 tok/s	8.4 GB
AMD Radeon RX 7700 XT	AMD	S	41.5 tok/s	8.4 GB
AMD Radeon RX 7800 XT	AMD	S	59.9 tok/s	8.4 GB
AMD Radeon RX 9070	AMD	S	61.5 tok/s	8.4 GB
AMD Radeon RX 9070 XT	AMD	S	61.5 tok/s	8.4 GB
Google Cloud TPU v5e	Google	S	78.6 tok/s	8.4 GB
Intel Arc A770 16GB	Intel	S	53.8 tok/s	8.4 GB
Intel Arc B580	Intel	S	43.8 tok/s	8.4 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)	NOVATECH	S	92.2 tok/s	8.4 GB
NVIDIA GeForce RTX 4070	NVIDIA	S	48.4 tok/s	8.4 GB
NVIDIA GeForce RTX 4070 SUPER	NVIDIA	S	48.4 tok/s	8.4 GB
NVIDIA GeForce RTX 4070 Ti SUPER	NVIDIA	S	64.5 tok/s	8.4 GB
NVIDIA GeForce RTX 4080 SUPER	NVIDIA	S	70.7 tok/s	8.4 GB
NVIDIA GeForce RTX 5060 Ti 16GB	NVIDIA	S	43.0 tok/s	8.4 GB
NVIDIA GeForce RTX 5070	NVIDIA	S	64.5 tok/s	8.4 GB
NVIDIA GeForce RTX 5070 Ti	NVIDIA	S	86.0 tok/s	8.4 GB
NVIDIA GeForce RTX 5080 Founders Edition	NVIDIA	S	92.2 tok/s	8.4 GB
AMD Radeon RX 7900 XT	AMD	S	76.8 tok/s	8.4 GB
AMD Radeon RX 7900 XTX	AMD	S	92.2 tok/s	8.4 GB
NVIDIA GeForce RTX 3090	NVIDIA	S	89.9 tok/s	8.4 GB
NVIDIA GeForce RTX 4090 Founders Edition	NVIDIA	S	96.8 tok/s	8.4 GB
Google Cloud TPU v6e (Trillium)	Google	S	157.5 tok/s	8.4 GB
NVIDIA GeForce RTX 5090 Founders Edition	NVIDIA	S	172.1 tok/s	8.4 GB
Origin PC M-CLASS v2	Origin PC	S	172.1 tok/s	8.4 GB
NVIDIA L40S	NVIDIA	S	83.0 tok/s	8.4 GB

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~28 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	North Mini Code on AMD Radeon RX 7600 8GB · ~28 tok/s · 165W	$0.199
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
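The blended API figures can be checked with a few lines of arithmetic. This is a small sketch of the 70/30 blend described above; the function name and dictionary are illustrative, and the prices are the per-1M-token values from the table.

```python
def blended_price(input_price, output_price, input_share=0.70):
    """Blend per-1M-token input and output prices by traffic share."""
    return input_price * input_share + output_price * (1.0 - input_share)


# (input $, output $) per 1M tokens, copied from the table above.
API_PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}

for name, (inp, out) in API_PRICES.items():
    print(f"{name}: ${blended_price(inp, out):.3f} per 1M tokens")
```

A workload that is more output-heavy than 70/30 shifts the blend toward the output price, which widens the gap to local inference; pass a different input_share to see the effect.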

Hardware amortisation not included. Run the full ROI calculator for payback math.
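The local energy figure depends on an electricity price that the page does not state. The sketch below derives the energy per 1M tokens from the listed power draw and throughput; the $0.12 per kWh rate is an assumption chosen for illustration, and the function names are hypothetical.

```python
def energy_kwh_per_million_tokens(power_watts, tokens_per_second):
    """Energy drawn while generating 1M tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    return power_watts / 1000 * seconds / 3600


def energy_cost_per_million_tokens(power_watts, tokens_per_second, usd_per_kwh):
    """Electricity cost only; hardware amortisation is not included."""
    return energy_kwh_per_million_tokens(power_watts, tokens_per_second) * usd_per_kwh


ASSUMED_USD_PER_KWH = 0.12  # assumption: not stated in the source

kwh = energy_kwh_per_million_tokens(165, 28)  # 165 W at ~28 tok/s, from the table
cost = energy_cost_per_million_tokens(165, 28, ASSUMED_USD_PER_KWH)
print(f"{kwh:.2f} kWh, ${cost:.3f} per 1M tokens")
```

At 165 W and 28 tok/s, 1M tokens take roughly 9.9 hours and about 1.64 kWh, so the listed $0.199 corresponds to an electricity price of roughly $0.12 per kWh.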

Rent in the Cloud

Cheapest current cloud rentals with at least 8 GB VRAM, refreshed hourly.

Option	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4	Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
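One way to weigh spot against on-demand is to convert the hourly rate into a cost per 1M tokens and discount the time lost to interruptions. This is a rough sketch; the restart_overhead parameter and the 30 tok/s throughput in the example are assumptions, since the page lists no throughput for rented GPUs.

```python
def rental_cost_per_million_tokens(usd_per_gpu_hour, tokens_per_second,
                                   restart_overhead=0.0):
    """Cost per 1M generated tokens on a rented GPU.

    restart_overhead is the fraction of paid time lost to interruptions
    and restarts (0.1 means 10% of each paid hour produces no tokens).
    """
    if not 0.0 <= restart_overhead < 1.0:
        raise ValueError("restart_overhead must be in [0, 1)")
    useful_tokens_per_hour = tokens_per_second * 3600 * (1.0 - restart_overhead)
    return usd_per_gpu_hour / useful_tokens_per_hour * 1_000_000


def break_even_overhead(spot_price, on_demand_price):
    """Largest overhead fraction at which spot still matches on-demand cost."""
    return 1.0 - spot_price / on_demand_price


ASSUMED_TOK_S = 30  # hypothetical throughput, for illustration only

print(rental_cost_per_million_tokens(0.04, ASSUMED_TOK_S))        # on-demand L4
print(rental_cost_per_million_tokens(0.03, ASSUMED_TOK_S, 0.10))  # spot L4, 10% lost
print(break_even_overhead(0.03, 0.04))
```

With the L4 prices above, spot stays cheaper as long as interruptions cost less than about a quarter of the paid time.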

About This Model

Overview

North Mini Code is Cohere’s first model built specifically for developers who need to run agentic coding workloads locally or in private infrastructure. It is a sparse Mixture-of-Experts (MoE) model with 30B total parameters, of which only 3B are active per token, so it delivers coding performance competitive with much larger dense models while keeping inference cost low. Released under Apache 2.0, the weights are fully open and available on Hugging Face in bf16, FP8, and W4A16 formats.

This model targets a specific gap: local code agents that can reason about software engineering tasks, orchestrate sub-agents, and handle 256K-token contexts without relying on cloud APIs. It scores 67.6 on SWE-Bench Verified, 40.2 on SWE-Bench Pro, and 36 on Terminal-Bench v2. On the Artificial Analysis Coding Index, it posts 33.4 — competitive with Qwen3.6 35B A3B and ahead of Devstral Small 2, Gemma 4, and even larger models like Mistral Small 4.

North Mini Code is not a general-purpose chatbot. It is a coding agent engine.

Architecture & Technical Details

The model uses a MoE architecture with 128 experts, of which 8 are activated per token. The 30B total parameters are sparsely distributed — only 3B participate in any forward pass. This has two practical consequences:

Weight memory matches a 30B dense model, not a 3B one. Every expert has to be resident in memory, so the savings are in compute and memory traffic per token rather than in footprint. Quantized variants (e.g., W4A16) are what bring the weights within reach of consumer GPUs.
Inference throughput is high. Cohere reports up to 2.8x higher output throughput than Devstral Small 2 under identical concurrency, and a 30% advantage in inter-token latency. In practice, that means faster iteration on agent rollouts.
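The 8-of-128 selection can be illustrated with a toy top-k router. This is a generic sketch of MoE gating, not Cohere's implementation: the softmax-over-selected-logits normalisation, the scalar "experts", and the random gate logits are all assumptions made to keep the example self-contained. In a real model the gate logits come from a learned projection of each token's hidden state.

```python
import math
import random

NUM_EXPERTS = 128  # from the text above
TOP_K = 8          # from the text above


def route(gate_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest gate logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    peak = gate_logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, gate_logits, experts):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    return sum(weight * experts[i](x) for i, weight in route(gate_logits))


# Toy setup: each "expert" is a scalar function and the gate logits are random.
rng = random.Random(0)
experts = [lambda x, scale=scale: scale * x for scale in range(NUM_EXPERTS)]
gate_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route(gate_logits)
print(len(selected), "of", NUM_EXPERTS, "experts run for this token")
```

Only the selected experts are evaluated in moe_forward, which is why per-token compute tracks the active parameter count while all 128 experts still have to sit in memory.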

Context length is 256K tokens total, with a maximum generation of 64K tokens. That’s enough to hold a sizable codebase in context and still have headroom for multi-turn debugging and patch generation. The 30B-total / 3B-active split is a deliberate tradeoff: you get the capacity of a 30B model (knowledge of libraries, language idioms, reasoning patterns) while paying roughly the per-token compute cost of a 3B model at inference time.
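When packing a repository into the prompt, it helps to reserve the reply budget first. The sketch below assumes that prompt and output share the 256K window and that "256K" and "64K" mean 262,144 and 65,536 tokens; both are assumptions, so check the model card for the exact limits.

```python
CONTEXT_WINDOW = 256 * 1024  # assumption: "256K" means 262,144 tokens
MAX_OUTPUT = 64 * 1024       # assumption: "64K" means 65,536 tokens


def prompt_budget(requested_output, context_window=CONTEXT_WINDOW,
                  max_output=MAX_OUTPUT):
    """Tokens left for the prompt after reserving room for the reply."""
    if requested_output < 0:
        raise ValueError("requested_output must be non-negative")
    if requested_output > max_output:
        raise ValueError("requested output exceeds the generation limit")
    return context_window - requested_output


print(prompt_budget(MAX_OUTPUT))  # prompt room when reserving the full reply
print(prompt_budget(4096))        # prompt room for a short patch-sized reply
```

Reserving the full 64K reply leaves three quarters of the window for code and conversation history under these assumptions.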

The model is text-only. No vision, no audio. It’s optimized for code generation, function-calling, and structured reasoning.

Cohere recommends running on a single H100 at FP8 (minimum), with FP4 as a lighter option. The weights are available in bf16, FP8, and W4A16 formats on Hugging Face.

Capabilities & Use Cases

North Mini Code excels at tasks that require tool use, long-context reasoning, and multi-step code changes. Its primary strengths:

Agentic software engineering. It can interpret bug reports, navigate repository structures, write patches, and run tests. The SWE-Bench scores reflect real-world pull-request resolution, not toy problems.
Terminal-based agents. Terminal-Bench v2 and Terminal-Bench Hard measure the ability to chain shell commands, read logs, install dependencies, and recover from errors. North Mini Code scores 36 on v2 and 15.7 on Hard.
Complex code generation. LiveCodeBench v6 and SciCode test the model’s grasp of algorithms, data structures, and scientific computing. It handles these without needing to invoke external tools.
Function-calling and instruction following. The model supports structured tool definitions and can route between tool calls and open-ended responses. This makes it suitable for frameworks like LangChain, CrewAI, or custom agent harnesses.
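Routing between tool calls and open-ended responses can be sketched in a custom harness as follows. Everything here is hypothetical: the run_tests tool, the JSON-schema style definition, and the shape of the reply dictionary are assumptions for illustration, not the model's documented output format.

```python
import json

# Hypothetical tool definition in the JSON-schema style many chat APIs accept.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite under a given path.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def run_tests(path):
    """Stand-in implementation; a real harness would invoke the test runner."""
    return {"path": path, "status": "not executed (stub)"}


TOOL_REGISTRY = {"run_tests": run_tests}


def handle_reply(reply):
    """Dispatch a tool call if present, otherwise return the plain text.

    Assumed reply shape: {"content": str} or
    {"tool_call": {"name": str, "arguments": "<JSON string>"}}.
    """
    call = reply.get("tool_call")
    if call is None:
        return reply.get("content", "")
    func = TOOL_REGISTRY.get(call["name"])
    if func is None:
        return json.dumps({"error": "unknown tool " + call["name"]})
    arguments = json.loads(call["arguments"])
    return json.dumps(func(**arguments))


print(handle_reply({"content": "The failing test is in tests/test_io.py."}))
print(handle_reply({"tool_call": {"name": "run_tests",
                                  "arguments": '{"path": "tests/"}'}}))
```

Returning an error object for unknown tool names, rather than raising, lets the agent loop feed the failure back to the model so it can correct itself.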

Where it underperforms: non-coding agentic tasks. On GDPval-AA (a generalist agent benchmark) it scores 14%, and on τ²-Bench Telecom it scores 37%. If your workload is purely conversational or domain-specific outside software engineering, this isn’t the best model for the job.

For developers building code agents — PR reviewers, bug-fixing bots, CI/CD assistants, or autonomous refactoring tools — North Mini Code is purpose-built.

Running North Mini Code Locally

The key question for practitioners: can I run this on my hardware? The answer depends on quantization.

VRAM Requirements by Quantization

Quantization	Approximate VRAM	Notes
bf16 (full)	~60 GB	30B parameters × 2 bytes. Requires datacenter GPU (H100, A100).
FP8	~30 GB	1 byte per parameter. Single H100 at FP8 is the minimum recommendation.
W4A16 (4-bit weights, 16-bit activations)	~16 GB	Fits on an RTX 4090 (24 GB). This is the most practical setup for a consumer GPU.
Q4_K_M (GGUF format, 4-bit)	~16–18 GB	Typical GGUF memory usage. Works on RTX 4090, M4 Max (64 GB unified), or dual RTX 3090s.
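The weight figures in this table follow from parameter count times bytes per parameter. Here is a short sketch of that arithmetic; it covers weights only, so the KV cache, activations, and quantization metadata (scales, zero points) come on top, which is one reason the 4-bit rows land above 15 GB.

```python
TOTAL_PARAMS = 30e9  # total parameters, from the text

# Nominal storage per parameter for each weight format.
BYTES_PER_PARAM = {"bf16": 2.0, "FP8": 1.0, "W4A16": 0.5}


def weight_memory_gb(total_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{weight_memory_gb(TOTAL_PARAMS, nbytes):.0f} GB")
```

The total parameter count is what matters here, not the 3B active per token, because every expert has to be loaded.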

Consumer Hardware Scenarios

RTX 4090 (24 GB). Run the W4A16 variant or a GGUF Q4_K_M. Expect 15–30 tokens per second depending on context length and prompt size. This is functional for interactive agent sessions but not real-time.
RTX 3090 / 3090 Ti (24 GB). Same VRAM on an older, slower architecture. Expect 10–20 tok/s. Still usable for background agent tasks.
Apple Silicon (M4 Max with 64 GB unified). The ~60 GB bf16 weights leave almost no room for the KV cache and the OS in 64 GB, so an 8-bit or 4-bit variant is the realistic choice. Inference is slower than on an H100, but unified memory leaves plenty of room for long contexts once the weights are quantized.
Dual consumer GPUs (e.g., two RTX 3090). With tensor parallelism, ideally over NVLink, the combined 48 GB is enough to hold the ~30 GB FP8 variant. Ampere cards have no native FP8 compute, so expect throughput below an H100.

Recommended Approach

Start with Ollama. The North Mini Code model is available in GGUF format and can be pulled with ollama pull cohere/north-mini-code. This is the fastest way to test on consumer GPUs. For production agent workloads, consider deploying the FP8 version on a single H100 via vLLM or SGLang.
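Once the model is pulled, it can be driven from a script through Ollama's local HTTP API. The sketch below uses only the standard library and assumes a default Ollama install listening on port 11434. The model tag is the one given above and has not been verified here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
MODEL_TAG = "cohere/north-mini-code"  # tag as given in the text; not verified


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Write a function that reverses a string.") returns the completion as a single string. Setting "stream" to False trades incremental output for simpler parsing, which suits batch-style agent steps.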

Performance Expectations

At FP8 on an H100, Cohere reported ~199 output tokens/sec on the Artificial Analysis evaluation suite. On an RTX 4090 with Q4_K_M, expect roughly 15–30 tok/s, in line with the hardware scenario above. That is slower, but still viable for non-real-time agent loops where throughput over many calls matters more than latency.
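To translate tokens per second into wall-clock time for an agent run, a rough decode-time estimate is enough. In the sketch below, the 20-step loop and 500 tokens per step are hypothetical workload numbers, and prompt processing and tool latency are ignored.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time to decode a given number of output tokens at a steady rate."""
    return output_tokens / tokens_per_second


def loop_seconds(steps, tokens_per_step, tokens_per_second):
    """Decode time for an agent loop; ignores prompt processing and tools."""
    return steps * generation_seconds(tokens_per_step, tokens_per_second)


# Hypothetical workload: 20 agent steps of 500 output tokens each.
STEPS, TOKENS_PER_STEP = 20, 500

for label, rate in (("H100 at FP8, ~199 tok/s", 199),
                    ("consumer GPU at 20 tok/s", 20)):
    print(f"{label}: {loop_seconds(STEPS, TOKENS_PER_STEP, rate):.0f} s")
```

For this workload the consumer GPU needs about eight minutes of decoding against under one minute on the H100, which is acceptable for background jobs but noticeable in an interactive session.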

How It Compares

vs. Devstral Small 2 (24B Dense)

Devstral Small 2 is a dense 24B model from the same agentic coding niche. Cohere’s internal tests show North Mini Code delivers up to 2.8x higher output throughput under identical concurrency, with 30% better inter-token latency. On the Artificial Analysis Coding Index, North Mini Code (33.4) beats Devstral Small 2 (slightly above 25, per Cohere’s figure). If you’re choosing between the two and have a single H100, North Mini Code gives better throughput and coding benchmarks.

Tradeoff: Devstral Small 2 may have better general-purpose reasoning or broader knowledge, since all 24B of its parameters are active on every token, versus 3B here. For agentic coding specifically, North Mini Code is ahead.

vs. Qwen3.6 35B A3B

Qwen3.6 35B A3B is in the same MoE size class (35B total, about 3B active, as the A3B suffix indicates). It scores 35.2 on the Artificial Analysis Coding Index, slightly above North Mini Code’s 33.4. It also has broad multilingual support and strong general reasoning. North Mini Code is faster (199 tok/s on Cohere’s API vs Qwen3.6’s typical 120–150), and it’s specifically tuned for agentic tool use.

Tradeoff: Choose Qwen3.6 if you need better general coding scores and multilingual coverage. Choose North Mini Code if you need maximum throughput for a single-agent loop, especially on constrained hardware.

When to Pick North Mini Code

You need an open-weight model that can run locally under Apache 2.0.
Your workload is agentic software engineering (SWE-Bench style) with long contexts.
You have a single H100 or a 24 GB consumer GPU and want the best coding/throughput ratio in this size class.
You want to avoid cloud API vendor lock-in.

This model does not replace a far larger coder like DeepSeek-Coder-V2 (a 236B-parameter MoE), but it occupies a sweet spot: near-state-of-the-art coding performance at a fraction of the resource cost.
