Compact desktop with AMD Ryzen AI Max+ 395 (Strix Halo), 96GB unified LPDDR5X-8000, and Radeon 8060S iGPU. Up to 64GB allocatable as VRAM in a near-silent 140W chassis. The cost-conscious tier below the maxed-out 128GB SKU.
The first tier where 70B-class models stop feeling cramped. Headroom for KV cache means 32K+ context on Q4 quants without falling off the GPU.
The GMKtec EVO-X2 (Ryzen AI Max+ 395 96GB) is a compact mini PC that bridges the gap between consumer desktop hardware and workstation-class unified memory for local AI inference. At $1,799 MSRP, it’s a cost-conscious entry point into running large language models (LLMs) that require more than the 24GB VRAM ceiling of most consumer GPUs. This is the 96GB variant of GMKtec’s flagship Strix Halo system, sitting just below the 128GB SKU — a deliberate tradeoff for practitioners who value unified memory capacity over raw GPU throughput.
The hardware is built around AMD’s Ryzen AI Max+ 395 APU (codenamed Strix Halo), a chiplet design that combines 16 Zen 5 CPU cores, a Radeon 8060S iGPU with 40 RDNA 3.5 compute units, and a 50 TOPS XDNA 2 NPU. All memory is unified LPDDR5X-8000, meaning the CPU and GPU share the same 96GB pool. In AI workloads, up to 64GB can be allocated to the iGPU as VRAM — a configuration that enables loading models like Llama 3.1 70B at low quantizations or running multi-model agent workflows without cold starts.
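On most Strix Halo systems the VRAM carve-out is set in firmware (a UMA frame-buffer option), with the Linux amdgpu driver exposing the remainder as shared GTT memory. A minimal sketch for verifying the split on Linux, assuming the standard amdgpu sysfs nodes and that the iGPU is `card0` (the index can differ per system):

```python
# Sketch: read the VRAM / GTT split the amdgpu driver reports for the iGPU.
# Assumes Linux with the amdgpu driver; "card0" is an assumption and may
# need adjusting on systems with multiple DRM devices.
from pathlib import Path

def read_gib(name: str, card: str = "card0") -> float:
    raw = Path(f"/sys/class/drm/{card}/device/{name}").read_text()
    return int(raw) / 2**30  # sysfs reports bytes

print(f"VRAM (dedicated carve-out): {read_gib('mem_info_vram_total'):.1f} GiB")
print(f"GTT (shared, driver-managed): {read_gib('mem_info_gtt_total'):.1f} GiB")
```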
This unit is not a discrete GPU replacement. It’s a purpose-built, near-silent, 140W chassis designed for developers and engineers who need to serve local inference at modest token rates, run multiple models concurrently, or deploy AI agents at the edge. It fills a specific niche: high VRAM capacity in a small, energy-efficient footprint, with the caveat that memory bandwidth (256 GB/s) is the primary bottleneck for large models.
| Specification | Value |
|---|---|
| VRAM (allocatable) | 64 GB (max from 96GB unified pool) |
| Memory bandwidth | 256 GB/s (LPDDR5X-8000, 256-bit) |
| INT8 compute | 50 TOPS (XDNA 2 NPU) |
| TDP | 120W sustained / 140W peak |
| APU | AMD Ryzen AI Max+ 395 (16C/32T Zen 5, Radeon 8060S 40 CUs, XDNA 2 NPU) |
The critical metric for LLM inference is memory bandwidth. At 256 GB/s, the EVO-X2 sits well below a mobile RTX 4090 (576 GB/s) and far behind a desktop RTX 4090 (1,008 GB/s) or an Apple M2 Ultra (800 GB/s). This means token generation speeds will be lower than discrete GPU solutions, but the unified architecture offers unique advantages.
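Why bandwidth dominates: generating one token streams every active weight through memory once, so decode speed is bounded by bandwidth divided by model size. A back-of-the-envelope sketch (the bit-widths below are rough approximations for each quant family, not measured values):

```python
# Rough decode-speed ceiling: tok/s <= bandwidth / weight bytes.
# Real rates land below the ceiling because of KV-cache traffic,
# activations, and scheduling overhead.
BANDWIDTH_GB_S = 256.0  # LPDDR5X-8000 on a 256-bit bus

def decode_ceiling_tok_s(params_billions: float, bits_per_weight: float) -> float:
    weight_gb = params_billions * bits_per_weight / 8.0
    return BANDWIDTH_GB_S / weight_gb

print(f"70B @ ~Q3: {decode_ceiling_tok_s(70, 3.0):.1f} tok/s ceiling")  # ~9.8
print(f"32B @ ~Q5: {decode_ceiling_tok_s(32, 5.5):.1f} tok/s ceiling")  # ~11.6
print(f"8B  @ ~Q4: {decode_ceiling_tok_s(8, 4.5):.1f} tok/s ceiling")   # ~56.9
```

The measured rates below sit under these ceilings, as expected for a bandwidth-bound workload.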
What the 64GB VRAM allocation enables:
- 32B-class models at Q5_K_M quantization with room for context.
- 70B-class models at Q3 with tight context windows (4K–8K tokens).

Real-world token rates for this hardware (based on testing with llama.cpp and Ollama):
| Model & Quantization | Tokens per second (estimated) |
|---|---|
| Llama 3.1 8B (Q4_K_M) | 35–50 tok/s |
| Qwen 2.5 14B (Q4_K_M) | 25–35 tok/s |
| Command R 32B (Q5_K_M) | 10–14 tok/s |
| Llama 3.1 70B (Q3_K_S) | 3–5 tok/s |
| DeepSeek-R1 70B (Q3) | 2–4 tok/s |
These numbers make the EVO-X2 a comfortable experience for 8B–14B models (interactive chatbot speeds above 20 tok/s) and usable for 32B models. 70B models are functional but require patience — suitable for batch processing, offline agents, or research where throughput matters less than model access.
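These estimates are straightforward to reproduce: Ollama’s generate endpoint reports token counts and decode time in its final response. A minimal sketch, assuming a local Ollama server on the default port and that the quoted model tag (illustrative) has already been pulled:

```python
import requests

# Sketch: measure decode speed through Ollama's REST API.
# Model tag and prompt are illustrative placeholders.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b-instruct-q4_K_M",
        "prompt": "Explain memory bandwidth in one paragraph.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()
# eval_count = generated tokens; eval_duration = decode time in nanoseconds
tok_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")
```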
With a 140W peak TDP and a well-tuned thermal solution, the EVO-X2 operates near-silently even under sustained load. This is a major advantage over GPU-based solutions that often require loud fans or liquid cooling. For edge deployments or office environments where noise matters, this is a practical choice.
The hardware’s best performance-to-quality ratio comes from models that fit entirely within the iGPU’s 64GB allocation with headroom for context. For practical use:
- 8B-class models at Q4_K_M (approx. 5.6GB VRAM) — run at 45+ tok/s. Ideal for real-time chatbots.
- 14B-class models at Q5_K_M (approx. 9.5GB VRAM) — run at 30 tok/s. Good for instruction-following tasks.
- 32B-class models at Q5_K_M (approx. 22GB VRAM) — run at 12 tok/s. Viable for RAG and document analysis.
- Larger Q4_K_M quants (approx. 28GB) with context up to 32K tokens.

The 64GB allocation is enough to load a 70B model at Q3 (roughly 24–28GB), but memory bandwidth becomes the bottleneck. Expect 2–5 tok/s — slow enough that you won’t want interactive chat, but fine for offline summarization, code generation, or agent pipelines that don’t need real-time output.
Models like Llama 3.1 70B, DeepSeek-V2, and Qwen 2.5 72B are all loadable. Use `--ctx-size 4096` (llama.cpp) to keep KV-cache overhead low. Concurrent multi-model setups — e.g., running a 32B Q5 model alongside a 7B embedding model — are also feasible without reloading.
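The same small-context setup in llama-cpp-python looks like the sketch below; the GGUF path is a placeholder, and `n_ctx` mirrors the `--ctx-size` flag:

```python
# Sketch with llama-cpp-python (pip install llama-cpp-python, built with a
# ROCm or Vulkan backend so layers land on the 8060S). Path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-instruct.Q3_K_S.gguf",  # placeholder
    n_ctx=4096,        # small context keeps KV-cache overhead low
    n_gpu_layers=-1,   # offload every layer into the iGPU allocation
)
out = llm("Summarize the report in five bullet points:", max_tokens=256)
print(out["choices"][0]["text"])
```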
With 64GB unified memory, multimodal models that process images, audio, or video frames can allocate both the LLM weights and large encoder features simultaneously. For example, LLaVA-NeXT 13B (Q4) leaves ~40GB free for high-resolution image processing and streaming. This is an edge over consumer GPUs that often run out of VRAM when handling large visual inputs.
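As an illustration, Ollama accepts base64-encoded images alongside the prompt for vision models; the model tag and file name below are placeholders:

```python
import base64
import requests

# Sketch: send an image to a LLaVA-class model through Ollama's REST API.
# Model tag and input file are illustrative.
with open("photo.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:13b",
        "prompt": "Describe this image.",
        "images": [img_b64],
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
```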
For long-context tasks (128K+ tokens), memory bandwidth again limits speed. A 70B model with 128K context will generate tokens at sub-1 tok/s. More practical: use a 32B model with 32K context at Q5 for retrieval-augmented generation (RAG) pipelines.
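A sketch of that 32B/32K configuration through Ollama, where `num_ctx` widens the context window (the model tag and input file are illustrative; the 32K KV cache adds several GB on top of the Q5 weights, still inside the 64GB allocation):

```python
import requests

# Sketch: 32B model with a 32K context window for RAG-style prompts.
context = open("retrieved_chunks.txt").read()  # placeholder retrieved corpus

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b-instruct-q5_K_M",
        "prompt": f"{context}\n\nQuestion: What are the key findings?",
        "options": {"num_ctx": 32768},  # raise the context length
        "stream": False,
    },
    timeout=1800,
)
print(resp.json()["response"])
```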
| Feature | GMKtec EVO-X2 (96GB) | Apple Mac Studio (M2 Ultra 192GB) | Desktop RTX 4090 (24GB) + CPU |
|---|---|---|---|
| VRAM / Unified Memory | 64GB allocatable | 192GB unified (approx. 128GB GPU) | 24GB |
| Memory bandwidth | 256 GB/s | 800 GB/s | 1,008 GB/s |
| 70B tokens/s (est.) | 3–5 | 15–25 | 15–25 |
| 32B tokens/s (est.) | 10–14 | 40–60 | 40–60 |
| Price (approx.) | $1,799 | $3,999+ | $1,800+ (GPU only) |
| Power / Noise | 140W, near-silent | 200W+, silent | 600W+, moderate noise |
| Best for | Cost-conscious 70B testing, multi-model agents, edge deployment | High-performance unified memory workflows | High-throughput inference, fine-tuning |
Choose the EVO-X2 when: you need more than 48GB of VRAM for under $2,000, you value silence and small footprint, and you prioritize model flexibility over raw speed.
Choose a Mac Studio when: you have a larger budget, need 128GB+ unified memory, and require higher bandwidth for faster inference on large models.
Choose an RTX 4090 when: your primary models fit within 24GB, you need maximum tokens per second (especially for batch generation or real-time applications), and power/noise aren’t constraints.
The EVO-X2 is not the fastest inference box you can buy. It’s the most accessible way to load a 70B model on consumer hardware without building a custom multi-GPU rig. For practitioners who trade speed for capacity, silence, and simplicity, it’s a compelling entry into the Strix Halo ecosystem.
Per-model benchmark listings for this configuration:

| Model | Maker | Params (active) | Grade | Speed | Memory |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Alibaba | 30B (3B active) | A | 38.3 tok/s | 5.4 GB |
| — | — | 8B | A | 36.4 tok/s | 5.7 GB |
| Llama 2 7B Chat | Meta | 7B | A | 43.0 tok/s | 4.8 GB |
| — | — | 9B | A | 34.3 tok/s | 6.0 GB |
| Gemma 4 E2B IT | Google | 2B | A | 55.6 tok/s | 3.7 GB |
| Qwen3.6 35B-A3B | Alibaba | 35B (3B active) | A | 24.2 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba | 35B (3B active) | A | 24.2 tok/s | 8.5 GB |
| Mistral 7B Instruct | Mistral AI | 7B | A | 32.2 tok/s | 6.4 GB |
| Llama 2 13B Chat | Meta | 13B | A | 24.3 tok/s | 8.5 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | A | 18.1 tok/s | 11.4 GB |
| Gemma 4 E4B IT | Google | 4B | A | 29.8 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | A | 29.8 tok/s | 6.9 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | B | 18.7 tok/s | 11.0 GB |
| Qwen3-235B-A22B | Alibaba | 235B (22B active) | B | 5.7 tok/s | 36.3 GB |
| Qwen3.5-122B-A10B | Alibaba | 122B (10B active) | B | 7.6 tok/s | 27.3 GB |
| minimax-m2.5 | MiniMax | 230B (10B active) | B | 9.1 tok/s | 22.7 GB |
| Llama 2 70B Chat | Meta | 70B | B | 4.7 tok/s | 43.4 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | B | 4.7 tok/s | 43.6 GB |
| Qwen 3.5 Omni | Alibaba | 397B (17B active) | B | 4.6 tok/s | 45.2 GB |
| — | — | 70B | B | 4.5 tok/s | 45.7 GB |
| Qwen3.5-397B-A17B | Alibaba | 397B (17B active) | B | 4.5 tok/s | 46.0 GB |
| Mistral Small 3 24B | Mistral AI | 24B | B | 5.3 tok/s | 39.0 GB |
| Gemma 3 27B IT | Google | 27B | B | 4.7 tok/s | 43.8 GB |
| LLaMA 65B | Meta | 65B | B | 5.2 tok/s | 39.3 GB |
| — | — | 8B | B | 15.5 tok/s | 13.3 GB |