Ultra-compact 9.7-inch Strix Halo desktop with 128GB LPDDR5X-8000 8-channel memory and Radeon 8060S iGPU. Triple-fan, 5 heat-pipe cooling sustains 120W for local 70B model inference.
Sized for local serving of quantized 70B–120B class models. Overkill for a homelab; the right call when the workload pays for itself in token volume.
The NIMO Mini PC (Ryzen AI Max+ 395 128GB) is a compact desktop built around AMD’s Strix Halo APU—the Ryzen AI Max+ 395. It combines a 16-core Zen 5 CPU, a Radeon 8060S iGPU with 40 RDNA 3.5 compute units, and a dedicated XDNA 2 NPU rated at 50 TOPS. The headline feature is 128 GB of unified LPDDR5X-8000 memory, configurable as up to 96 GB of VRAM for GPU workloads. At $2,299, this is a prosumer-grade machine that targets a specific niche: local inference of large language models (70B–120B parameters) on a desktop that fits in a 9.7 × 7.4 × 3.8-inch chassis.
It competes with systems like the Framework Desktop (with AMD Strix Halo) and high-end Mini PCs using discrete GPUs (e.g., Intel NUC 13 Extreme with RTX 4060), but the NIMO’s advantage is unified memory that sidesteps the PCIe bottleneck between CPU and GPU. For AI engineers and hobbyists who need to run quantized 70B models at usable speeds without a tower-sized GPU rig, this is a clean, efficient alternative.
What matters for AI inference on this machine comes down to memory: capacity determines which models load at all, and bandwidth determines decode speed.
Compared to alternatives: A Mac Mini M4 Pro with 48 GB unified memory costs about the same but tops out at 48 GB for models, albeit with slightly higher memory bandwidth (273 GB/s on the M4 Pro vs. 256 GB/s here). The NIMO offers double the memory headroom for model size. Against a NUC 13 Extreme with an RTX 4060 (8 GB VRAM), it's no contest: the NIMO can run models the NUC can't touch.
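The bandwidth figure follows directly from the memory configuration; a quick sanity check (assuming Strix Halo's 8 x 32-bit LPDDR5X channels, i.e. a 256-bit effective bus):

```python
# Peak memory bandwidth = transfer rate x bus width.
# Assumption: 8 LPDDR5X channels x 32 bits = 256-bit effective bus.
transfers_per_s = 8000e6           # LPDDR5X-8000: 8000 MT/s
bus_width_bytes = 256 // 8         # 256-bit bus -> 32 bytes per transfer
bandwidth_gbs = transfers_per_s * bus_width_bytes / 1e9
print(f"{bandwidth_gbs:.0f} GB/s")  # matches the 256 GB/s cited here
```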
This machine’s primary strength is running local LLMs that require 32–96 GB of VRAM. Here’s the breakdown:
Sweet spot: 70B models at Q5_K_M (medium quantization). You get near-original quality at roughly 4–5 tokens/second, consistent with the throughput figures listed below; adequate for interactive use. For agents or batch generation, raising the batch size to 2 (if the context fits) can increase aggregate throughput.
VRAM allocation: Use AMD’s Adrenalin software or BIOS to reserve 96 GB for the iGPU. The remaining 32 GB is for system tasks. That’s the recommended split for heavy inference.
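Both points above reduce to memory arithmetic: decode speed for a dense model is bounded by how fast the weights stream from RAM, and the 96 GB reservation must hold weights plus KV cache. A minimal sketch; the 70% bandwidth-efficiency factor and 2 GB runtime overhead are assumptions, and the KV formula assumes an FP16 cache with Llama-2-70B-like geometry (grouped-query attention):

```python
# Decode-speed estimate: bandwidth-bound generation streams all weights per token.
bandwidth_gbs = 256      # theoretical peak, LPDDR5X-8000 on a 256-bit bus
efficiency = 0.7         # assumed achievable fraction of peak
model_gb = 46.5          # ~70B params at Q5_K_M (~5.3 bits/param)
tokens_per_s = bandwidth_gbs * efficiency / model_gb

# KV-cache size: K and V vectors per layer, per token (FP16 = 2 bytes/elem).
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_len=8192)
fits = model_gb + kv + 2 <= 96   # 2 GB assumed runtime/activation overhead

print(f"~{tokens_per_s:.1f} tok/s, KV cache {kv:.1f} GB, fits in 96 GB: {fits}")
```

This lands at roughly 4 tok/s with a couple of GB of KV cache at an 8K context, comfortably inside the 96 GB reservation.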
This Mini PC is not for everyone. It's built for people who need large models running locally and will trade peak token speed for memory capacity. How it stacks up:
vs. Apple Mac Mini M4 Pro (48 GB, $2,199)
vs. Framework Desktop (configured with same Strix Halo, ~$2,500)
vs. Intel NUC 13 Extreme with RTX 4060 (8 GB VRAM, ~$1,800)
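One way to frame the comparisons above is price per GB of GPU-addressable memory, using the list prices quoted here (the NUC figure counts only discrete VRAM; the rounded dollar amounts are just this arithmetic, not quoted specs):

```python
# Price per GB of GPU-addressable memory, from the list prices above.
options = {
    "NIMO (96 GB VRAM reservation)": (2299, 96),
    "Mac Mini M4 Pro (48 GB unified)": (2199, 48),
    "NUC 13 Extreme + RTX 4060 (8 GB)": (1800, 8),
}
for name, (price, gb) in options.items():
    print(f"{name}: ${price / gb:.0f}/GB")
```

The NIMO works out to roughly $24/GB, against roughly $46/GB for the Mac Mini and $225/GB for the discrete-GPU NUC.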
Bottom line: If your AI work requires running 70B–120B parameter models locally, this is one of the most cost-effective and space-efficient options available at $2,299. If you can compromise on model size, other hardware may offer faster token speeds for smaller models.
Throughput by model on this machine, as listed:

| Model | Vendor | Parameters | Rating | Speed | Memory |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | AA | 38.3 tok/s | 5.4 GB |
| | | 8B | AA | 36.4 tok/s | 5.7 GB |
| Llama 2 7B Chat | Meta | 7B | AA | 43.0 tok/s | 4.8 GB |
| | | 9B | AA | 34.3 tok/s | 6.0 GB |
| Gemma 4 E2B IT | Google | 2B | AA | 55.6 tok/s | 3.7 GB |
| Qwen3.6 35B-A3B | Alibaba Cloud | 35B (3B active) | AA | 24.2 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | AA | 24.2 tok/s | 8.5 GB |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 32.2 tok/s | 6.4 GB |
| Llama 2 13B Chat | Meta | 13B | AA | 24.3 tok/s | 8.5 GB |
| Gemma 4 E4B IT | Google | 4B | BB | 29.8 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | BB | 29.8 tok/s | 6.9 GB |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | BB | 18.1 tok/s | 11.4 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | BB | 18.7 tok/s | 11.0 GB |
| GLM-4.5 | Z.ai | 355B (32B active) | BB | 4.0 tok/s | 51.8 GB |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | BB | 4.0 tok/s | 51.8 GB |
| | | 70B | BB | 4.5 tok/s | 45.7 GB |
| GLM-4.7 | Z.ai | 358B (32B active) | BB | 3.9 tok/s | 52.6 GB |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | BB | 4.5 tok/s | 46.0 GB |
| Qwen 3.5 Omni | Alibaba Cloud | 397B (17B active) | BB | 4.6 tok/s | 45.2 GB |
| Llama 2 70B Chat | Meta | 70B | BB | 4.7 tok/s | 43.4 GB |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | BB | 4.7 tok/s | 43.6 GB |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-R1 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-V3.1 | DeepSeek | 671B (37B active) | BB | 3.4 tok/s | 59.8 GB |
| DeepSeek-V3.2 | DeepSeek | 685B (37B active) | BB | 3.4 tok/s | 59.8 GB |