Compact desktop with AMD Ryzen AI Max+ 395 (Strix Halo), 96GB unified LPDDR5X-8000, and Radeon 8060S iGPU. Up to 64GB allocatable as VRAM in a near-silent 140W chassis. The cost-conscious tier below the maxed-out 128GB SKU.
The first tier where 70B-class models stop feeling cramped. Headroom for KV cache means 32K+ context on Q4 quants without falling off the GPU.
The GMKtec EVO-X2 (Ryzen AI Max+ 395 96GB) is a compact mini PC that bridges the gap between consumer desktop hardware and workstation-class unified memory for local AI inference. At $1,799 MSRP, it’s a cost-conscious entry point into running large language models (LLMs) that require more than the 24GB VRAM ceiling of most consumer GPUs. This is the 96GB variant of GMKtec’s flagship Strix Halo system, sitting just below the 128GB SKU — a deliberate tradeoff for practitioners who value unified memory capacity over raw GPU throughput.
The hardware is built around AMD’s Ryzen AI Max+ 395 APU (codenamed Strix Halo), a chiplet design that combines 16 Zen 5 CPU cores, a Radeon 8060S iGPU with 40 RDNA 3.5 compute units, and a 50 TOPS XDNA 2 NPU. All memory is unified LPDDR5X-8000, meaning the CPU and GPU share the same 96GB pool. In AI workloads, up to 64GB can be allocated to the iGPU as VRAM — a configuration that enables loading models like Llama 3.1 70B at low quantizations or running multi-model agent workflows without cold starts.
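On most Strix Halo systems the VRAM carve-out is set in firmware (a UMA frame-buffer option), with the Linux amdgpu driver exposing the remainder as shared GTT memory. A minimal sketch for verifying the split on Linux, assuming the standard amdgpu sysfs nodes and that the iGPU is `card0` (the index can differ per system):

```python
# Sketch: read the VRAM / GTT split the amdgpu driver reports for the iGPU.
# Assumes Linux with the amdgpu driver; "card0" is an assumption and may
# need adjusting on systems with multiple DRM devices.
from pathlib import Path

def read_gib(name: str, card: str = "card0") -> float:
    raw = Path(f"/sys/class/drm/{card}/device/{name}").read_text()
    return int(raw) / 2**30  # sysfs reports bytes

print(f"VRAM (dedicated carve-out): {read_gib('mem_info_vram_total'):.1f} GiB")
print(f"GTT (shared, driver-managed): {read_gib('mem_info_gtt_total'):.1f} GiB")
```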
This unit is not a discrete GPU replacement. It’s a purpose-built, near-silent, 140W chassis designed for developers and engineers who need to serve local inference at modest token rates, run multiple models concurrently, or deploy AI agents at the edge. It fills a specific niche: high VRAM capacity in a small, energy-efficient footprint, with the caveat that memory bandwidth (256 GB/s) is the primary bottleneck for large models.
| Specification | Value |
|---|---|
| VRAM (allocatable) | 64 GB (max from 96GB unified pool) |
| Memory bandwidth | 256 GB/s (LPDDR5X-8000, 256-bit) |
| INT8 compute | 50 TOPS (XDNA 2 NPU) |
| TDP | 120W sustained / 140W peak |
| APU | AMD Ryzen AI Max+ 395 (16C/32T Zen 5, Radeon 8060S 40 CUs, XDNA 2 NPU) |
The critical metric for LLM inference is memory bandwidth. At 256 GB/s, the EVO-X2 sits well below a mobile RTX 4090 (576 GB/s) and far behind a desktop RTX 4090 (1,008 GB/s) or an Apple M2 Ultra (800 GB/s). This means token generation speeds will be lower than discrete GPU solutions, but the unified architecture offers unique advantages.
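Why bandwidth dominates: generating one token streams every active weight through memory once, so decode speed is bounded by bandwidth divided by model size. A back-of-the-envelope sketch (the bit-widths below are rough approximations for each quant family, not measured values):

```python
# Rough decode-speed ceiling: tok/s <= bandwidth / weight bytes.
# Real rates land below the ceiling because of KV-cache traffic,
# activations, and scheduling overhead.
BANDWIDTH_GB_S = 256.0  # LPDDR5X-8000 on a 256-bit bus

def decode_ceiling_tok_s(params_billions: float, bits_per_weight: float) -> float:
    weight_gb = params_billions * bits_per_weight / 8.0
    return BANDWIDTH_GB_S / weight_gb

print(f"70B @ ~Q3: {decode_ceiling_tok_s(70, 3.0):.1f} tok/s ceiling")  # ~9.8
print(f"32B @ ~Q5: {decode_ceiling_tok_s(32, 5.5):.1f} tok/s ceiling")  # ~11.6
print(f"8B  @ ~Q4: {decode_ceiling_tok_s(8, 4.5):.1f} tok/s ceiling")   # ~56.9
```

The measured rates below sit under these ceilings, as expected for a bandwidth-bound workload.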
What the 64GB VRAM allocation enables:
- 32B-class models at Q5_K_M quantization with room for context.
- 70B-class models at Q3 with tight context windows (4K–8K tokens).

Real-world token rates for this hardware (based on testing with llama.cpp and Ollama):
| Model & Quantization | Tokens per second (estimated) |
|---|---|
| Llama 3.1 8B (Q4_K_M) | 35–50 tok/s |
| Qwen 2.5 14B (Q4_K_M) | 25–35 tok/s |
| Command R 32B (Q5_K_M) | 10–14 tok/s |
| Llama 3.1 70B (Q3_K_S) | 3–5 tok/s |
| DeepSeek-R1 70B (Q3) | 2–4 tok/s |
These numbers make the EVO-X2 a comfortable experience for 8B–14B models (interactive chatbot speeds above 20 tok/s) and usable for 32B models. 70B models are functional but require patience — suitable for batch processing, offline agents, or research where throughput matters less than model access.
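These estimates are straightforward to reproduce: Ollama’s generate endpoint reports token counts and decode time in its final response. A minimal sketch, assuming a local Ollama server on the default port and that the quoted model tag (illustrative) has already been pulled:

```python
import requests

# Sketch: measure decode speed through Ollama's REST API.
# Model tag and prompt are illustrative placeholders.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b-instruct-q4_K_M",
        "prompt": "Explain memory bandwidth in one paragraph.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()
# eval_count = generated tokens; eval_duration = decode time in nanoseconds
tok_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")
```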
With a 140W peak TDP and a well-tuned thermal solution, the EVO-X2 operates near-silently even under sustained load. This is a major advantage over GPU-based solutions that often require loud fans or liquid cooling. For edge deployments or office environments where noise matters, this is a practical choice.
The hardware’s best performance-to-quality ratio comes from models that fit entirely within the iGPU’s 64GB allocation with headroom for context. For practical use:
- 8B-class models at Q4_K_M (approx. 5.6GB VRAM) — run at 45+ tok/s. Ideal for real-time chatbots.
- 14B-class models at Q5_K_M (approx. 9.5GB VRAM) — run at 30 tok/s. Good for instruction-following tasks.
- 32B-class models at Q5_K_M (approx. 22GB VRAM) — run at 12 tok/s. Viable for RAG and document analysis.
- Larger Q4_K_M quants (approx. 28GB) with context up to 32K tokens.

The 64GB allocation is enough to load a 70B model at Q3 (roughly 24–28GB), but memory bandwidth becomes the bottleneck. Expect 2–5 tok/s — slow enough that you won’t want interactive chat, but fine for offline summarization, code generation, or agent pipelines that don’t need real-time output.
Models like Llama 3.1 70B, DeepSeek-V2, and Qwen 2.5 72B are all loadable. Use `--ctx-size 4096` (llama.cpp) to keep KV-cache overhead low. Concurrent multi-model setups — e.g., running a 32B Q5 model alongside a 7B embedding model — are also feasible without reloading.
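The same small-context setup in llama-cpp-python looks like the sketch below; the GGUF path is a placeholder, and `n_ctx` mirrors the `--ctx-size` flag:

```python
# Sketch with llama-cpp-python (pip install llama-cpp-python, built with a
# ROCm or Vulkan backend so layers land on the 8060S). Path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-instruct.Q3_K_S.gguf",  # placeholder
    n_ctx=4096,        # small context keeps KV-cache overhead low
    n_gpu_layers=-1,   # offload every layer into the iGPU allocation
)
out = llm("Summarize the report in five bullet points:", max_tokens=256)
print(out["choices"][0]["text"])
```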
With 64GB unified memory, multimodal models that process images, audio, or video frames can allocate both the LLM weights and large encoder features simultaneously. For example, LLaVA-NeXT 13B (Q4) leaves ~40GB free for high-resolution image processing and streaming. This is an edge over consumer GPUs that often run out of VRAM when handling large visual inputs.
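As an illustration, Ollama accepts base64-encoded images alongside the prompt for vision models; the model tag and file name below are placeholders:

```python
import base64
import requests

# Sketch: send an image to a LLaVA-class model through Ollama's REST API.
# Model tag and input file are illustrative.
with open("photo.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:13b",
        "prompt": "Describe this image.",
        "images": [img_b64],
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
```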
For long-context tasks (128K+ tokens), memory bandwidth again limits speed. A 70B model with 128K context will generate tokens at sub-1 tok/s. More practical: use a 32B model with 32K context at Q5 for retrieval-augmented generation (RAG) pipelines.
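A sketch of that 32B/32K configuration through Ollama, where `num_ctx` widens the context window (the model tag and input file are illustrative; the 32K KV cache adds several GB on top of the Q5 weights, still inside the 64GB allocation):

```python
import requests

# Sketch: 32B model with a 32K context window for RAG-style prompts.
context = open("retrieved_chunks.txt").read()  # placeholder retrieved corpus

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b-instruct-q5_K_M",
        "prompt": f"{context}\n\nQuestion: What are the key findings?",
        "options": {"num_ctx": 32768},  # raise the context length
        "stream": False,
    },
    timeout=1800,
)
print(resp.json()["response"])
```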
| Feature | GMKtec EVO-X2 (96GB) | Apple Mac Studio (M2 Ultra 192GB) | Desktop RTX 4090 (24GB) + CPU |
|---|---|---|---|
| VRAM / Unified Memory | 64GB allocatable | 192GB unified (approx. 128GB GPU) | 24GB |
| Memory bandwidth | 256 GB/s | 800 GB/s | 1,008 GB/s |
| 70B tokens/s (est.) | 3–5 | 15–25 | 15–25 |
| 32B tokens/s (est.) | 10–14 | 40–60 | 40–60 |
| Price (approx.) | $1,799 | $3,999+ | $1,800+ (GPU only) |
| Power / Noise | 140W, near-silent | 200W+, silent | 600W+, moderate noise |
| Best for | Cost-conscious 70B testing, multi-model agents, edge deployment | High-performance unified memory workflows | High-throughput inference, fine-tuning |
Choose the EVO-X2 when: you need more than 48GB of VRAM for under $2,000, you value silence and small footprint, and you prioritize model flexibility over raw speed.
Choose a Mac Studio when: you have a larger budget, need 128GB+ unified memory, and require higher bandwidth for faster inference on large models.
Choose an RTX 4090 when: your primary models fit within 24GB, you need maximum tokens per second (especially for batch generation or real-time applications), and power/noise aren’t constraints.
The EVO-X2 is not the fastest inference box you can buy. It’s the most accessible way to load a 70B model on consumer hardware without building a custom multi-GPU rig. For practitioners who trade speed for capacity, silence, and simplicity, it’s a compelling entry into the Strix Halo ecosystem.
Per-model benchmark listings for this configuration:

| Model | Maker | Params (active) | Grade | Speed | Memory |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Alibaba | 30B (3B active) | A | 38.3 tok/s | 5.4 GB |
| — | — | 8B | A | 36.4 tok/s | 5.7 GB |
| Llama 2 7B Chat | Meta | 7B | A | 43.0 tok/s | 4.8 GB |
| — | — | 9B | A | 34.3 tok/s | 6.0 GB |
| Gemma 4 E2B IT | Google | 2B | A | 55.6 tok/s | 3.7 GB |
| Qwen3.6 35B-A3B | Alibaba | 35B (3B active) | A | 24.2 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba | 35B (3B active) | A | 24.2 tok/s | 8.5 GB |
| Mistral 7B Instruct | Mistral AI | 7B | A | 32.2 tok/s | 6.4 GB |
| Llama 2 13B Chat | Meta | 13B | A | 24.3 tok/s | 8.5 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | A | 18.1 tok/s | 11.4 GB |
| Gemma 4 E4B IT | Google | 4B | A | 29.8 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | A | 29.8 tok/s | 6.9 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | B | 18.7 tok/s | 11.0 GB |
| Qwen3-235B-A22B | Alibaba | 235B (22B active) | B | 5.7 tok/s | 36.3 GB |
| Qwen3.5-122B-A10B | Alibaba | 122B (10B active) | B | 7.6 tok/s | 27.3 GB |
| minimax-m2.5 | MiniMax | 230B (10B active) | B | 9.1 tok/s | 22.7 GB |
| Llama 2 70B Chat | Meta | 70B | B | 4.7 tok/s | 43.4 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | B | 4.7 tok/s | 43.6 GB |
| Qwen 3.5 Omni | Alibaba | 397B (17B active) | B | 4.6 tok/s | 45.2 GB |
| — | — | 70B | B | 4.5 tok/s | 45.7 GB |
| Qwen3.5-397B-A17B | Alibaba | 397B (17B active) | B | 4.5 tok/s | 46.0 GB |
| Mistral Small 3 24B | Mistral AI | 24B | B | 5.3 tok/s | 39.0 GB |
| Gemma 3 27B IT | Google | 27B | B | 4.7 tok/s | 43.8 GB |
| LLaMA 65B | Meta | 65B | B | 5.2 tok/s | 39.3 GB |
| — | — | 8B | B | 15.5 tok/s | 13.3 GB |