Every notable engine for running LLMs locally and in production, with live GitHub stars, PyPI downloads, language, license, and the trade-offs nobody puts on the homepage.
| Maintainer | Language | License | Compare | ||||
|---|---|---|---|---|---|---|---|
| Ollama Inc. | Go | MIT | 175.0K | — | Yesterday | ||
| Hugging Face | Python | Apache 2.0 | 162.0K | 157.9M | Today | ||
| ggml.org | C++ | MIT | 118.4K | — | Today | ||
| vLLM Project | Python | Apache 2.0 | 84.6K | 5.6M | Today | ||
| hiyouga | Python | Apache 2.0 | 72.6K | 56.4K | Yesterday | ||
| Unsloth AI | Python | Apache 2.0 | 67.5K | 2.4M | Today | ||
| LocalAI | Go | MIT | 47.2K | — | Today | ||
| Stanford NLP | Python | MIT | 35.5K | 6.1M | 1 wk ago | ||
| SGLang Project | Python | Apache 2.0 | 29.7K | 486.5M | Today | ||
| Apple | Python | MIT | 27.3K | 1.6M | Today | ||
| Guidance AI | Python | MIT | 21.5K | 26.9K | 1 mo ago | ||
| Hugging Face | Python | Apache 2.0 | 18.7K | 3.0M | Yesterday | ||
| dottxt | Python | Apache 2.0 | 14.2K | 1.9M | 1 wk ago | ||
| NVIDIA | Mixed | Apache 2.0 | 14.0K | 11.1K | Today | ||
| 567 Labs | Python | MIT | 13.2K | 15.0M | 1 wk ago |
An inference engine is the software that runs a language model on your hardware, loading the weights, managing GPU or CPU memory, and serving tokens, usually behind an API.
Picking one shapes your hardware bill, your latency, and how many requests you can serve at once. The simplest engine that covers your real requirements almost always beats the most powerful one.
Start with the hardware you have, then with what you are trying to do: run a model on your laptop, stand up an OpenAI-compatible API, or push the most throughput out of a GPU cluster. Pin two or three to the comparator and read the trade-offs side by side before you commit.

Side-by-Side
Pick any two or three and see stars, downloads, capabilities, and the case for each one in a single matrix you can share.
An inference engine is the software that actually runs a language model. It loads the model weights, manages memory on your GPU or CPU, batches incoming requests, and turns prompts into tokens, usually behind an API. Without one, a downloaded model is just a file on disk. The engine is what makes it serve answers.
For a laptop or workstation, Ollama and LM Studio are the easiest starting points: one-line install, a friendly interface, and they run on NVIDIA, AMD, Apple Silicon, or CPU. On a Mac, MLX is the native option. llama.cpp is the lightweight engine many of these are built on. If you want production throughput on an NVIDIA GPU, look at vLLM or SGLang instead.
Ollama is built for running models locally with the least friction, ideal for a laptop, a demo, or a single-user app. vLLM is built for production serving on GPUs, with continuous batching and multi-GPU support that let one machine serve many users at once. Ollama optimizes for ease; vLLM optimizes for throughput. The comparator on this page lays them side by side.
GitHub stars, forks, contributors, last commit, and PyPI monthly downloads come from a scheduled sync against the source registries. Capability flags like OpenAI-compatible API, GPU support, quantization, and continuous batching are editorial, set by reading the docs and the source code. Trend arrows show change since the last sync.
No. Engines like llama.cpp and Ollama run on CPU, and quantization shrinks a model enough to fit in regular RAM. The trade-off is speed: CPU inference is much slower than a GPU, and large models can feel sluggish. For anything interactive at scale, an NVIDIA GPU or Apple Silicon makes a big difference. Use the hardware filter to see what fits your setup.
For production serving on NVIDIA GPUs, vLLM and SGLang lead on throughput thanks to continuous batching and efficient memory use. They keep the GPU busy across many concurrent requests instead of one at a time, which is the difference between serving a handful of users and serving thousands on the same card. Filter by continuous batching and multi-GPU to find them.
Stars, downloads, and commit timestamps refresh on a daily cron from GitHub and PyPI. Capability flags are reviewed monthly. When an engine ships a major release or gets archived, this directory updates within a day.