Estimate which AI models your hardware can run, how fast, and at what quality. Runs entirely in your browser.
Configure your hardware and hit Calculate to see which AI models you can run.
The AI hardware compatibility calculator is a free in-browser tool that estimates which AI models you can run locally on a given device, how fast they will generate tokens, and at what quality grade. It computes VRAM requirements from model weights, KV cache, and runtime overhead, then projects tokens-per-second from your GPU memory bandwidth, all without sending any data to a server.
Pick a hardware product from the directory or detect yours automatically in the browser. The calculator then iterates every tracked open-source model, picks the highest-quality quantization that fits your VRAM (FP16 down to Q2_K), and scores each combination from S to F across quality, speed, fit, and context length, weighted by your chosen use case.
Results are deterministic and private. Hardware detection uses the WebGPU adapter and GPU class heuristics. All scoring math runs in a single client-side bundle. No telemetry, no API calls, no data leaves your browser, which means you can run it offline once the page is cached.

Once You Know What Fits
Knowing a model fits is step one. The decision tool compares the total cost of buying that GPU, renting it in the cloud, and paying per token to a frontier API at your usage.

Compare buying this hardware against paying a cloud API at your usage volume.

Don’t have hardware yet? Compare a budget-matched parts list and a pre-built system side by side.

Every model the calculator scores, ranked by composite benchmark score.

If your hardware falls short, see the cheapest cloud GPU rental that fits.
The VRAM estimate sums three components: model weights, KV cache, and a flat 0.5 GB runtime overhead. Weights are the parameter count multiplied by bytes per parameter, which varies by quantization (Q4_K_M uses 0.58, FP16 uses 2.0). KV cache scales with parameter count and context length: 0.000008 × params (B) × context length. The math is detailed in the "How we calculate" panel.
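As a rough illustration of the VRAM sum described above, here is a minimal Python sketch. The function and constant names are hypothetical, not the calculator's actual code. It assumes the parameter count is given in billions and that the KV cache term comes out in GB like the other two components. Only the two bytes-per-parameter values quoted in the text are included.

```python
# Bytes per parameter for the two formats quoted in the text.
# Other quantization formats are intentionally left out rather than guessed.
BYTES_PER_PARAM = {"Q4_K_M": 0.58, "FP16": 2.0}

RUNTIME_OVERHEAD_GB = 0.5   # flat overhead from the text
KV_CACHE_COEFF = 0.000008   # KV cache coefficient from the text


def estimate_vram_gb(params_b: float, quant: str, context_len: int) -> float:
    """Estimate VRAM in GB (assumed unit) for a model with params_b billion parameters."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    kv_cache_gb = KV_CACHE_COEFF * params_b * context_len
    return weights_gb + kv_cache_gb + RUNTIME_OVERHEAD_GB


# A 7B model at Q4_K_M with an 8192-token context:
# 4.06 (weights) + 0.46 (KV cache) + 0.50 (overhead)
print(f"{estimate_vram_gb(7, 'Q4_K_M', 8192):.2f} GB")  # prints "5.02 GB"
```

The same 7B model at FP16 comes to roughly 14.96 GB under these assumptions, which shows how much of the total is driven by the weights term.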
Tokens per second is approximated from GPU memory bandwidth divided by model size, then multiplied by an efficiency factor (0.65 for Apple Silicon unified memory, 0.70 for discrete NVIDIA GPUs) and a quantization speed multiplier. When bandwidth is unknown for a card, we fall back to hardware-class constants (220 for NVIDIA, 160 for Apple Silicon). The estimate is conservative: real-world speeds with optimized backends (vLLM, llama.cpp) are usually higher.
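The speed estimate above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculator's implementation: the names are hypothetical, bandwidth is assumed to be in GB/s and model size in GB, the hardware-class constants are assumed to stand in for the missing bandwidth figure, and the quantization speed multiplier defaults to 1.0 because the text does not list its values.

```python
from typing import Optional

# Efficiency factors quoted in the text.
EFFICIENCY = {"nvidia": 0.70, "apple": 0.65}

# Hardware-class fallback constants quoted in the text.
# Assumption: they substitute for bandwidth when the card's figure is unknown.
FALLBACK_CONSTANT = {"nvidia": 220.0, "apple": 160.0}


def estimate_tokens_per_second(
    model_size_gb: float,
    hardware_class: str,
    bandwidth: Optional[float] = None,
    quant_speed_multiplier: float = 1.0,  # placeholder; real values are not given in the text
) -> float:
    """Approximate tokens/sec as bandwidth / model size, scaled by efficiency and quantization."""
    if bandwidth is None:
        bandwidth = FALLBACK_CONSTANT[hardware_class]
    return bandwidth / model_size_gb * EFFICIENCY[hardware_class] * quant_speed_multiplier


# A 5 GB model on a discrete NVIDIA card with 1000 GB/s of bandwidth: about 140 tokens/sec.
print(round(estimate_tokens_per_second(5.0, "nvidia", bandwidth=1000.0), 1))

# A 4 GB model on Apple Silicon with unknown bandwidth: about 26 tokens/sec.
print(round(estimate_tokens_per_second(4.0, "apple"), 1))
```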
Each model gets a composite score from 0 to 100 based on four dimensions weighted by your use case: quality (parameter tier minus quantization penalty), speed (tokens/sec relative to a use-case target), fit (the sweet spot is 50–80% VRAM utilization), and context (whether the context window meets the use-case target). Score thresholds: S ≥ 85, A ≥ 70, B ≥ 55, C ≥ 40, D ≥ 20, F < 20. Models that exceed your VRAM always score F regardless of other factors.
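The grade thresholds and the VRAM rule map directly onto a small lookup. The Python sketch below is illustrative only: the function names are hypothetical, and the per-use-case weights are not stated in the text, so the example uses equal weights purely as a placeholder.

```python
from typing import Dict

# Minimum composite score for each grade, from the thresholds in the text.
GRADE_THRESHOLDS = [(85, "S"), (70, "A"), (55, "B"), (40, "C"), (20, "D")]


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each 0 to 100)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


def grade(score: float, fits_vram: bool = True) -> str:
    """Map a 0-100 composite score to a letter grade; models that exceed VRAM always get F."""
    if not fits_vram:
        return "F"
    for minimum, letter in GRADE_THRESHOLDS:
        if score >= minimum:
            return letter
    return "F"


# Hypothetical equal weights; the real weights depend on the chosen use case.
weights = {"quality": 1.0, "speed": 1.0, "fit": 1.0, "context": 1.0}
scores = {"quality": 80.0, "speed": 60.0, "fit": 100.0, "context": 40.0}

total = composite_score(scores, weights)
print(total, grade(total))           # 70.0 A
print(grade(95, fits_vram=False))    # F
```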
For most users running consumer hardware, Q4_K_M is the right default. It is the sweet spot for 7B–70B models: about 3.4× smaller than FP16 (0.58 vs. 2.0 bytes per parameter) with minimal quality loss on most benchmarks. The calculator auto-recommends the highest-quality format that fits, so if you have headroom you will see Q5, Q6, or Q8 picked instead. Drop to Q2 or Q3 only when you have to squeeze a much bigger model into limited VRAM.
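The "highest-quality format that fits" rule can be pictured as a walk down a quality-ordered list. The Python sketch below is a hedged illustration with hypothetical names. It assumes "fits" means the estimated VRAM is at most the available VRAM, and it lists only the two formats whose bytes-per-parameter values appear on this page. The intermediate steps (Q8, Q6, Q5, Q3, Q2_K) would slot into the same list once their values are known.

```python
from typing import List, Optional, Tuple

# Ordered from highest to lowest quality. Only FP16 (2.0) and Q4_K_M (0.58)
# are quoted on this page; other formats are omitted rather than guessed.
QUANT_LADDER: List[Tuple[str, float]] = [("FP16", 2.0), ("Q4_K_M", 0.58)]


def pick_quantization(
    params_b: float,
    context_len: int,
    vram_gb: float,
    ladder: List[Tuple[str, float]] = QUANT_LADDER,
) -> Optional[str]:
    """Return the first (highest-quality) format whose estimated VRAM fits, or None."""
    # KV cache and flat overhead do not depend on the quantization format.
    kv_and_overhead_gb = 0.000008 * params_b * context_len + 0.5
    for name, bytes_per_param in ladder:
        if params_b * bytes_per_param + kv_and_overhead_gb <= vram_gb:
            return name
    return None


print(pick_quantization(7, 8192, vram_gb=24))   # FP16
print(pick_quantization(7, 8192, vram_gb=8))    # Q4_K_M
print(pick_quantization(70, 8192, vram_gb=24))  # None
```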
Browser GPU detection is best-effort. Some integrated GPUs and laptop discrete GPUs share renderer strings that the WebGPU adapter cannot disambiguate. If the detected match is wrong, switch from "Auto-detect" to manual selection and pick the exact card from the directory.
Apple Silicon is supported. Apple M1, M2, M3, and M4 chips use a single unified memory pool that the OS dynamically partitions between system and GPU. The calculator models the effective GPU memory pool (typically 70–75% of system memory for inference) and uses a lower efficiency constant (0.65) to reflect unified-memory overhead compared to discrete VRAM.
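To make the 70–75% figure concrete, this small Python sketch computes the range of memory that would be treated as usable for inference on a unified-memory machine. The function name is hypothetical, and the exact fraction the calculator applies to a given chip is not stated here, so the sketch returns both ends of the range.

```python
from typing import Tuple

# Typical share of system memory available to the GPU for inference, from the text.
MIN_GPU_FRACTION = 0.70
MAX_GPU_FRACTION = 0.75


def effective_gpu_memory_gb(system_memory_gb: float) -> Tuple[float, float]:
    """Return the (low, high) estimate of GPU-usable memory on a unified-memory system."""
    return (system_memory_gb * MIN_GPU_FRACTION, system_memory_gb * MAX_GPU_FRACTION)


low, high = effective_gpu_memory_gb(32)
print(f"{low:.1f} to {high:.1f} GB")  # prints "22.4 to 24.0 GB"
```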