Cluster-friendly Strix Halo mini workstation. 128GB LPDDR5X-8000, Radeon 8060S, dual 10GbE, USB4 V2, and 2U rack-mount support. Best pick when planning a 2-4 node Strix Halo cluster.
Sized for local serving of 70B–120B class models at moderate quantization (Q4–Q8). Overkill for a homelab; the right call when the workload pays for itself in token volume.
The MINISFORUM MS-S1 Max (Ryzen AI Max+ 395) is a mini workstation built around AMD’s Strix Halo SoC—a single-chip design that combines a 16-core Zen 5 CPU, a 40-CU Radeon 8060S integrated GPU, and a 50 TOPS XDNA 2 NPU. This isn’t a consumer desktop or a thin laptop; it’s a production-ready edge node designed for practitioners who need unified memory capacity that discrete GPUs can’t match. At $1,899, it sits in the prosumer-to-enterprise sweet spot, competing directly with Apple’s Mac Studio M3 Ultra and high-end mini PCs equipped with external GPUs.
What makes the MS-S1 Max stand out for AI workloads is its unified 128GB LPDDR5X-8000 memory pool. The iGPU can address up to 96GB of that as VRAM—no PCIe bottleneck, no separate memory bus. Combined with 256 GB/s bandwidth and dual 10GbE networking, this machine is built for local inference clusters, agentic workflows, and multimodal model serving in constrained spaces. It’s the best pick today when planning a 2–4 node Strix Halo cluster.
The MS-S1 Max ships with 128GB of soldered LPDDR5X-8000 on a 256-bit bus. The iGPU dynamically allocates up to 96GB as unified VRAM. That’s four times the VRAM of an RTX 4090 (24GB) and double that of workstation cards like the RTX 6000 Ada (48GB). For inference, this means you can load large models entirely in GPU-accessible memory without offloading layers to system RAM or disk.
| Spec | Value |
|---|---|
| Total system RAM | 128 GB LPDDR5X-8000 |
| GPU-accessible VRAM | Up to 96 GB |
| Memory bandwidth | 256 GB/s |
| Memory bus width | 256-bit |
256 GB/s is roughly half the memory bandwidth of a mid-range discrete GPU (RTX 4070 Super: ~504 GB/s), though those cards carry far less VRAM. For transformer inference, bandwidth bounds single-stream token generation: every generated token streams the active weights through the memory bus. In practice that works out to roughly 35–45 tokens/s on 7B–8B models, ~25 tokens/s on 13B models, and about 4–5 tokens/s on dense 70B models at Q5_K_M, in line with the per-model figures in the table at the end of this page.
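To see why, a back-of-envelope check helps: each token has to read the model's active weights once, so decode speed is capped at bandwidth divided by model size. A minimal sketch, assuming an effective bandwidth of roughly 80% of the 256 GB/s peak; the efficiency factor and model sizes are illustrative assumptions, not measurements:

```python
# Bandwidth-bound decode estimate: each token streams the (dense) model's
# weights once, so tokens/s <= effective bandwidth / quantized model size.
# The efficiency factor and model sizes below are assumptions, not measurements.

PEAK_BW_GBPS = 256.0   # LPDDR5X-8000 on a 256-bit bus
EFFICIENCY = 0.8       # fraction of peak bandwidth typically achieved

def max_tokens_per_s(model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    return PEAK_BW_GBPS * EFFICIENCY / model_size_gb

for name, size_gb in [("8B @ Q4_K_M", 4.8),
                      ("13B @ Q4_K_M", 8.5),
                      ("70B @ Q5_K_M", 48.0)]:
    print(f"{name:>13}: <= {max_tokens_per_s(size_gb):.1f} tok/s")
# Prints roughly 42.7, 24.1, and 4.3 tok/s respectively.
```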
The integrated Radeon 8060S (40 RDNA 3.5 CUs) delivers approximately 26 TFLOPS of FP16 compute, and the XDNA 2 NPU contributes 50 TOPS of INT8 on its own. Combined platform AI performance reaches 126 TOPS (CPU + iGPU + NPU). In Geekbench AI tests, the Radeon 8060S scored 25,316 (single precision) and 31,296 (half precision), within striking distance of an RTX 4070 Super. The CPU’s OpenVINO quantized score of 18,690 is among the highest in the mini-PC class.
TDP and cooling: 130W sustained, 160W peak. The six-pipe vapor chamber and dual centrifugal fans keep thermals in check even in Performance mode. Balanced and Quiet modes trade ~5–14% CPU performance for lower noise—useful for always-on inference servers.
For cluster deployments, the dual 10GbE ports allow direct node-to-node communication without a switch, or connection to a 10G backbone. USB4 V2 (80 Gbps) can daisy-chain external storage or additional compute.
The MS-S1 Max’s 96GB VRAM unlocks model sizes that are impractical on consumer hardware. Here’s what fits at common quantization levels:
| Model Family | Quantization | Fits in 96 GB VRAM? | Est. Tokens/s (single-stream) |
|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | Yes | 35–45 |
| Llama 3.1 70B | Q5_K_M | Yes (sweet spot) | 4–5 |
| Llama 3.1 70B | Q8_0 | Yes | 2–3 |
| 120B-class dense (e.g., Mistral Large 123B) | Q4_K_M | Yes, with effort | 2–3 |
| DeepSeek-R1 Distill 32B | Q8_0 | Yes | 5–7 |
| DeepSeek-R1 Distill 70B | Q4_K_M | Yes | 4–6 |
| Qwen 2.5 72B | Q5_K_M | Yes | 3–5 |
| Mixtral 8x22B (39B active) | Q4_K_M | Yes | 5–9 |
| Qwen 2.5-VL 72B (multimodal) | Q4_K_M | Yes | 4–6 |
Sweet spot: 70B parameters at Q5_K_M. This quantization preserves most of the model’s quality while fitting comfortably in the unified memory pool; decode speed is bandwidth-bound, so expect conversational rather than snappy responses. The 96GB VRAM leaves headroom for long-context tasks (128K+ tokens) and multimodal models that need additional memory for vision encoders.
What you can’t run: dense 200B+ models like Llama 3.1 405B (even at Q2) or full-precision 70B+ models with very long contexts. For those, you’d need a multi-node cluster or a server with 2–4 of these units.
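For planning purposes, a model’s footprint is roughly weights (parameters × bits per weight ÷ 8) plus KV cache, which grows linearly with context length. A minimal sketch of that arithmetic follows; the bits-per-weight averages and the Llama 3.1 70B attention layout (80 layers, 8 KV heads, head dimension 128) are approximations used only for illustration:

```python
# Rough memory planner for quantized LLM inference: weights + KV cache, in GiB.
# Bits-per-weight values are approximate GGUF averages; the KV-cache layout
# constants default to Llama 3.1 70B (GQA) and are illustrative assumptions.

GiB = 1024 ** 3
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

def weights_gib(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GiB

def kv_cache_gib(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # Per token per layer: one K and one V vector of kv_heads * head_dim elements.
    return ctx_tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem / GiB

total = weights_gib(70, "Q5_K_M") + kv_cache_gib(128_000)
print(f"Llama 3.1 70B @ Q5_K_M with 128K context: ~{total:.0f} GiB of the 96 GB budget")
# Roughly 46 GiB of weights plus ~39 GiB of FP16 KV cache: tight, but it fits.
```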
AI engineers building agentic workflows. When your agent needs to call multiple models (e.g., a 70B planner + a 7B coder + a vision model), unified memory lets you load all three simultaneously without swapping. Dual 10GbE means you can distribute inference across a cluster of MS-S1 Max nodes.
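As a concrete illustration of that pattern, the sketch below routes one agent step to a large planner model and a small coder model, each served behind an OpenAI-compatible chat endpoint (for example llama.cpp’s server or vLLM). The hostnames, ports, and model aliases are placeholders; whether both models live on one node or on two nodes linked over 10GbE only changes the URLs.

```python
# Hypothetical routing of a two-step agent across two locally served models.
# Endpoints, ports, and model aliases are placeholders; any OpenAI-compatible
# server (llama.cpp server, vLLM, etc.) exposing /v1/chat/completions works.
import requests

NODES = {
    "planner": "http://10.0.0.1:8080/v1/chat/completions",  # 70B model
    "coder":   "http://10.0.0.2:8080/v1/chat/completions",  # 7B model
}

def ask(role: str, prompt: str) -> str:
    resp = requests.post(
        NODES[role],
        json={
            "model": role,  # alias the server maps to its loaded model
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

plan = ask("planner", "Outline the steps to add retry logic to our HTTP fetcher.")
patch = ask("coder", f"Write the Python code for step 1 of this plan:\n{plan}")
print(patch)
```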
Teams running local inference servers. The 2U rack-mount form factor fits standard server racks. 160W peak per node makes it viable for dense deployments—2–4 nodes in a short-depth rack consume less power than a single GPU server. Perfect for edge inference where power and space are constrained.
Hobbyists running large local LLMs. If you want to run a 70B model at Q5_K_M with a 128K context window, this is the most cost-effective way to do it without renting cloud GPUs. The $1,899 price is less than a single RTX 4090 + high-end CPU build, and you get 4x the VRAM.
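A minimal sketch of what that looks like in practice, using llama-cpp-python to offload every layer into the iGPU-addressable memory pool; it assumes a Vulkan or ROCm build of the library, and the model path is a placeholder:

```python
# Hypothetical example: load a 70B GGUF fully into the unified memory pool.
# Assumes llama-cpp-python built with a GPU backend (Vulkan or ROCm/HIP);
# the model path is a placeholder for whichever Q5_K_M file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-instruct.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the Radeon 8060S
    n_ctx=32768,       # context length; raise toward 128K if memory allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the attached meeting notes."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```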
Multimodal and long-context researchers. Models like Qwen 2.5-VL 72B or Llama 3.1 70B with 128K context need >48GB VRAM. The MS-S1 Max handles them natively, with enough memory for video-frame analysis or long-document processing, though throughput remains bandwidth-limited.
The Mac Studio M3 Ultra offers up to 512GB of unified memory with ~800 GB/s bandwidth, superior for extremely large models (200B+). However, the MS-S1 Max costs less than half the price ($1,899 vs. $3,999+ for even the base 96GB M3 Ultra) and provides dual 10GbE, USB4 V2, and a rack-mount option. If you need cluster deployment or Windows-native tooling (DirectML, ONNX Runtime, OpenVINO), the MS-S1 Max is the better fit. For a pure macOS ecosystem and maximum single-node capacity, the Mac Studio wins.
An RTX 4090 system (~$2,500–$3,000) offers higher raw compute (82 TFLOPS FP16) and faster memory bandwidth (1,008 GB/s). But it’s limited to 24GB VRAM—you can’t load a 70B model at any reasonable quantization. The MS-S1 Max trades some speed for 4x the VRAM. If your workload fits in 24GB, the 4090 is faster. If you need larger models, the MS-S1 Max is the only option under $2,000.
Two used RTX 3090s (48GB total) cost roughly the same as the MS-S1 Max but require a full tower, 700W+ power supply, and complex multi-GPU inference setup. The unified memory on Strix Halo eliminates PCIe transfer overhead and simplifies model loading. For 70B models at Q5, the MS-S1 Max is more practical and energy-efficient.
Bottom line: The MINISFORUM MS-S1 Max is the most cost-effective way to run 70B–120B parameter models locally in a compact, cluster-friendly form factor. If your AI workloads demand large model sizes and you value deployment simplicity over raw FLOPS, this is the hardware to buy in 2026.
Per-model throughput and memory footprint on this hardware:

| Model | Publisher | Parameters | Rating | Speed | Memory |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | AA | 38.3 tok/s | 5.4 GB |
| | | 8B | AA | 36.4 tok/s | 5.7 GB |
| Llama 2 7B Chat | Meta | 7B | AA | 43.0 tok/s | 4.8 GB |
| | | 9B | AA | 34.3 tok/s | 6.0 GB |
| Gemma 4 E2B IT | Google | 2B | AA | 55.6 tok/s | 3.7 GB |
| Qwen3.6 35B-A3B | Alibaba Cloud | 35B (3B active) | AA | 24.2 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | AA | 24.2 tok/s | 8.5 GB |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 32.2 tok/s | 6.4 GB |
| Llama 2 13B Chat | Meta | 13B | AA | 24.3 tok/s | 8.5 GB |
| Gemma 4 E4B IT | Google | 4B | BB | 29.8 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | BB | 29.8 tok/s | 6.9 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | BB | 18.1 tok/s | 11.4 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | BB | 18.7 tok/s | 11.0 GB |
| GLM-4.5 | Z.ai | 355B (32B active) | BB | 4.0 tok/s | 51.8 GB |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | BB | 4.0 tok/s | 51.8 GB |
| | | 70B | BB | 4.5 tok/s | 45.7 GB |
| GLM-4.7 | Z.ai | 358B (32B active) | BB | 3.9 tok/s | 52.6 GB |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | BB | 4.5 tok/s | 46.0 GB |
| Qwen 3.5 Omni | Alibaba Cloud | 397B (17B active) | BB | 4.6 tok/s | 45.2 GB |
| Llama 2 70B Chat | Meta | 70B | BB | 4.7 tok/s | 43.4 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | BB | 4.7 tok/s | 43.6 GB |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-R1 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-V3.1 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-V3.2 | DeepSeek | 685B (37B active) | BB | 3.4 tok/s | 59.8 GB |