Cluster-friendly Strix Halo mini workstation. 128GB LPDDR5X-8000, Radeon 8060S, dual 10GbE, USB4 V2, and 2U rack-mount support. Best pick when planning a 2-4 node Strix Halo cluster.
Sized for local serving of 70B–120B class models at moderate quantization (Q4–Q8). Overkill for a homelab; the right call when the workload pays for itself in token volume.
The MINISFORUM MS-S1 Max (Ryzen AI Max+ 395) is a mini workstation built around AMD’s Strix Halo SoC—a single-chip design that combines a 16-core Zen 5 CPU, a 40-CU Radeon 8060S integrated GPU, and a 50 TOPS XDNA 2 NPU. This isn’t a consumer desktop or a thin laptop; it’s a production-ready edge node designed for practitioners who need unified memory capacity that discrete GPUs can’t match. At $1,899, it sits in the prosumer-to-enterprise sweet spot, competing directly with Apple’s Mac Studio M3 Ultra and high-end mini PCs equipped with external GPUs.
What makes the MS-S1 Max stand out for AI workloads is its unified 128GB LPDDR5X-8000 memory pool. The iGPU can address up to 96GB of that as VRAM—no PCIe bottleneck, no separate memory bus. Combined with 256 GB/s bandwidth and dual 10GbE networking, this machine is built for local inference clusters, agentic workflows, and multimodal model serving in constrained spaces. It’s the best pick today when planning a 2–4 node Strix Halo cluster.
The MS-S1 Max ships with 128GB of soldered LPDDR5X-8000 on a 256-bit bus. The iGPU dynamically allocates up to 96GB as unified VRAM. That’s four times the VRAM of an RTX 4090 (24GB) and double that of workstation cards like the RTX 6000 Ada (48GB). For inference, this means you can load large models entirely in GPU-accessible memory without offloading layers to system RAM or disk.
| Spec | Value |
|---|---|
| Total system RAM | 128 GB LPDDR5X-8000 |
| GPU-accessible VRAM | Up to 96 GB |
| Memory bandwidth | 256 GB/s |
| Memory bus width | 256-bit |
256 GB/s is roughly half the memory bandwidth of a mid-range discrete GPU (RTX 4070 Super: ~504 GB/s), though those cards carry far less VRAM. For transformer inference, bandwidth bounds single-stream token generation: every generated token streams the active weights through the memory bus. In practice that works out to roughly 35–45 tokens/s on 7B–8B models, ~25 tokens/s on 13B models, and about 4–5 tokens/s on dense 70B models at Q5_K_M, in line with the per-model figures in the table at the end of this page.
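To see why, a back-of-envelope check helps: each token has to read the model's active weights once, so decode speed is capped at bandwidth divided by model size. A minimal sketch, assuming an effective bandwidth of roughly 80% of the 256 GB/s peak; the efficiency factor and model sizes are illustrative assumptions, not measurements:

```python
# Bandwidth-bound decode estimate: each token streams the (dense) model's
# weights once, so tokens/s <= effective bandwidth / quantized model size.
# The efficiency factor and model sizes below are assumptions, not measurements.

PEAK_BW_GBPS = 256.0   # LPDDR5X-8000 on a 256-bit bus
EFFICIENCY = 0.8       # fraction of peak bandwidth typically achieved

def max_tokens_per_s(model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    return PEAK_BW_GBPS * EFFICIENCY / model_size_gb

for name, size_gb in [("8B @ Q4_K_M", 4.8),
                      ("13B @ Q4_K_M", 8.5),
                      ("70B @ Q5_K_M", 48.0)]:
    print(f"{name:>13}: <= {max_tokens_per_s(size_gb):.1f} tok/s")
# Prints roughly 42.7, 24.1, and 4.3 tok/s respectively.
```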
The integrated Radeon 8060S (40 RDNA 3.5 CUs) delivers approximately 26 TFLOPS of FP16 compute, and the XDNA 2 NPU contributes 50 TOPS of INT8 on its own. Combined platform AI performance reaches 126 TOPS (CPU + iGPU + NPU). In Geekbench AI tests, the Radeon 8060S scored 25,316 (single precision) and 31,296 (half precision), within striking distance of an RTX 4070 Super. The CPU’s OpenVINO quantized score of 18,690 is among the highest in the mini-PC class.
TDP and cooling: 130W sustained, 160W peak. The six-pipe vapor chamber and dual centrifugal fans keep thermals in check even in Performance mode. Balanced and Quiet modes trade ~5–14% CPU performance for lower noise—useful for always-on inference servers.
For cluster deployments, the dual 10GbE ports allow direct node-to-node communication without a switch, or connection to a 10G backbone. USB4 V2 (80 Gbps) can daisy-chain external storage or additional compute.
The MS-S1 Max’s 96GB VRAM unlocks model sizes that are impractical on consumer hardware. Here’s what fits at common quantization levels:
| Model Family | Quantization | Fits in 96 GB VRAM? | Est. Tokens/s (single-stream) |
|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | Yes | 35–45 |
| Llama 3.1 70B | Q5_K_M | Yes (sweet spot) | 4–5 |
| Llama 3.1 70B | Q8_0 | Yes | 2–3 |
| 120B-class dense (e.g., Mistral Large 123B) | Q4_K_M | Yes, with effort | 2–3 |
| DeepSeek-R1 Distill 32B | Q8_0 | Yes | 5–7 |
| DeepSeek-R1 Distill 70B | Q4_K_M | Yes | 4–6 |
| Qwen 2.5 72B | Q5_K_M | Yes | 3–5 |
| Mixtral 8x22B (39B active) | Q4_K_M | Yes | 5–9 |
| Qwen 2.5-VL 72B (multimodal) | Q4_K_M | Yes | 4–6 |
Sweet spot: 70B parameters at Q5_K_M. This quantization preserves most of the model’s quality while fitting comfortably in the unified memory pool; decode speed is bandwidth-bound, so expect conversational rather than snappy responses. The 96GB VRAM leaves headroom for long-context tasks (128K+ tokens) and multimodal models that need additional memory for vision encoders.
What you can’t run: dense 200B+ models like Llama 3.1 405B (even at Q2) or full-precision 70B+ models with very long contexts. For those, you’d need a multi-node cluster or a server with 2–4 of these units.
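For planning purposes, a model’s footprint is roughly weights (parameters × bits per weight ÷ 8) plus KV cache, which grows linearly with context length. A minimal sketch of that arithmetic follows; the bits-per-weight averages and the Llama 3.1 70B attention layout (80 layers, 8 KV heads, head dimension 128) are approximations used only for illustration:

```python
# Rough memory planner for quantized LLM inference: weights + KV cache, in GiB.
# Bits-per-weight values are approximate GGUF averages; the KV-cache layout
# constants default to Llama 3.1 70B (GQA) and are illustrative assumptions.

GiB = 1024 ** 3
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

def weights_gib(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GiB

def kv_cache_gib(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # Per token per layer: one K and one V vector of kv_heads * head_dim elements.
    return ctx_tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem / GiB

total = weights_gib(70, "Q5_K_M") + kv_cache_gib(128_000)
print(f"Llama 3.1 70B @ Q5_K_M with 128K context: ~{total:.0f} GiB of the 96 GB budget")
# Roughly 46 GiB of weights plus ~39 GiB of FP16 KV cache: tight, but it fits.
```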
AI engineers building agentic workflows. When your agent needs to call multiple models (e.g., a 70B planner + a 7B coder + a vision model), unified memory lets you load all three simultaneously without swapping. Dual 10GbE means you can distribute inference across a cluster of MS-S1 Max nodes.
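As a concrete illustration of that pattern, the sketch below routes one agent step to a large planner model and a small coder model, each served behind an OpenAI-compatible chat endpoint (for example llama.cpp’s server or vLLM). The hostnames, ports, and model aliases are placeholders; whether both models live on one node or on two nodes linked over 10GbE only changes the URLs.

```python
# Hypothetical routing of a two-step agent across two locally served models.
# Endpoints, ports, and model aliases are placeholders; any OpenAI-compatible
# server (llama.cpp server, vLLM, etc.) exposing /v1/chat/completions works.
import requests

NODES = {
    "planner": "http://10.0.0.1:8080/v1/chat/completions",  # 70B model
    "coder":   "http://10.0.0.2:8080/v1/chat/completions",  # 7B model
}

def ask(role: str, prompt: str) -> str:
    resp = requests.post(
        NODES[role],
        json={
            "model": role,  # alias the server maps to its loaded model
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

plan = ask("planner", "Outline the steps to add retry logic to our HTTP fetcher.")
patch = ask("coder", f"Write the Python code for step 1 of this plan:\n{plan}")
print(patch)
```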
Teams running local inference servers. The 2U rack-mount form factor fits standard server racks. 160W peak per node makes it viable for dense deployments—2–4 nodes in a short-depth rack consume less power than a single GPU server. Perfect for edge inference where power and space are constrained.
Hobbyists running large local LLMs. If you want to run a 70B model at Q5_K_M with a 128K context window, this is the most cost-effective way to do it without renting cloud GPUs. The $1,899 price is less than a single RTX 4090 + high-end CPU build, and you get 4x the VRAM.
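A minimal sketch of what that looks like in practice, using llama-cpp-python to offload every layer into the iGPU-addressable memory pool; it assumes a Vulkan or ROCm build of the library, and the model path is a placeholder:

```python
# Hypothetical example: load a 70B GGUF fully into the unified memory pool.
# Assumes llama-cpp-python built with a GPU backend (Vulkan or ROCm/HIP);
# the model path is a placeholder for whichever Q5_K_M file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-instruct.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the Radeon 8060S
    n_ctx=32768,       # context length; raise toward 128K if memory allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the attached meeting notes."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```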
Multimodal and long-context researchers. Models like Qwen 2.5-VL 72B or Llama 3.1 70B with 128K context need >48GB VRAM. The MS-S1 Max handles them natively, with enough memory for video-frame analysis or long-document processing, though throughput remains bandwidth-limited.
The Mac Studio M3 Ultra offers up to 512GB of unified memory with ~800 GB/s bandwidth, superior for extremely large models (200B+). However, the MS-S1 Max costs less than half the price ($1,899 vs. $3,999+ for even the base 96GB M3 Ultra) and provides dual 10GbE, USB4 V2, and a rack-mount option. If you need cluster deployment or Windows-native tooling (DirectML, ONNX Runtime, OpenVINO), the MS-S1 Max is the better fit. For a pure macOS ecosystem and maximum single-node capacity, the Mac Studio wins.
An RTX 4090 system (~$2,500–$3,000) offers higher raw compute (82 TFLOPS FP16) and faster memory bandwidth (1,008 GB/s). But it’s limited to 24GB VRAM—you can’t load a 70B model at any reasonable quantization. The MS-S1 Max trades some speed for 4x the VRAM. If your workload fits in 24GB, the 4090 is faster. If you need larger models, the MS-S1 Max is the only option under $2,000.
Two used RTX 3090s (48GB total) cost roughly the same as the MS-S1 Max but require a full tower, 700W+ power supply, and complex multi-GPU inference setup. The unified memory on Strix Halo eliminates PCIe transfer overhead and simplifies model loading. For 70B models at Q5, the MS-S1 Max is more practical and energy-efficient.
Bottom line: The MINISFORUM MS-S1 Max is the most cost-effective way to run 70B–120B parameter models locally in a compact, cluster-friendly form factor. If your AI workloads demand large model sizes and you value deployment simplicity over raw FLOPS, this is the hardware to buy in 2026.
Per-model throughput and memory footprint on this hardware:

| Model | Publisher | Parameters | Rating | Speed | Memory |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | AA | 38.3 tok/s | 5.4 GB |
| | | 8B | AA | 36.4 tok/s | 5.7 GB |
| Llama 2 7B Chat | Meta | 7B | AA | 43.0 tok/s | 4.8 GB |
| | | 9B | AA | 34.3 tok/s | 6.0 GB |
| Gemma 4 E2B IT | Google | 2B | AA | 55.6 tok/s | 3.7 GB |
| Qwen3.6 35B-A3B | Alibaba Cloud | 35B (3B active) | AA | 24.2 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | AA | 24.2 tok/s | 8.5 GB |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 32.2 tok/s | 6.4 GB |
| Llama 2 13B Chat | Meta | 13B | AA | 24.3 tok/s | 8.5 GB |
| Gemma 4 E4B IT | Google | 4B | BB | 29.8 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | BB | 29.8 tok/s | 6.9 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | BB | 18.1 tok/s | 11.4 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | BB | 18.7 tok/s | 11.0 GB |
| GLM-4.5 | Z.ai | 355B (32B active) | BB | 4.0 tok/s | 51.8 GB |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | BB | 4.0 tok/s | 51.8 GB |
| | | 70B | BB | 4.5 tok/s | 45.7 GB |
| GLM-4.7 | Z.ai | 358B (32B active) | BB | 3.9 tok/s | 52.6 GB |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | BB | 4.5 tok/s | 46.0 GB |
| Qwen 3.5 Omni | Alibaba Cloud | 397B (17B active) | BB | 4.6 tok/s | 45.2 GB |
| Llama 2 70B Chat | Meta | 70B | BB | 4.7 tok/s | 43.4 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | BB | 4.7 tok/s | 43.6 GB |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-R1 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-V3.1 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-V3.2 | DeepSeek | 685B (37B active) | BB | 3.4 tok/s | 59.8 GB |