ggml.org

llama.cpp

Run models almost anywhere, from a laptop CPU to a server GPU.

Running models on almost any hardware

Visit Site View on GitHub Read the Docs

GitHub Stars

118.4K

Contributors

1.8K

PyPI / Month

—

Maintained by: ggml.org
First released: Mar 2023
Last commit: Today
Language: C++
License: MIT

Strengths

Runs on CPUs, NVIDIA and AMD GPUs, and Apple Silicon from one codebase.
Tiny footprint and the foundation many other tools build on.
The GGUF quantization format keeps memory use low.

Trade-offs

More hands-on setup than a packaged desktop app.
Peak GPU throughput trails dedicated engines like vLLM.

Key Features

What the engine gives you out of the box, in plain language.

OpenAI-Compatible API
NVIDIA GPU
AMD GPU
Apple Silicon
CPU Inference
Quantization
Continuous Batching
Multi-GPU
Desktop GUI
One-Line Install
Structured Output
Streaming

Runs everywhere
One engine for CPU, CUDA, ROCm, Vulkan, and Apple Metal backends.
GGUF quantization
Compact model files that fit large models into modest memory.
llama-server
A built-in server with an OpenAI-compatible API and grammar-constrained output.

Where It Shines

The jobs this engine is best suited for.

CPU-only inference
Run a model on a machine with no GPU at all.
Edge and embedded
Fit models onto small or constrained devices with quantization.
A base for other tools
Build a custom runner on the same engine that powers many local apps.

Side-by-Side

Compare llama.cpp With Another Engine

Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.

Open the Comparator

Frequently Asked Questions

What Is an Inference Engine?

An inference engine is the software that runs a language model and turns your prompt into tokens. It loads the model weights, manages memory on your GPU or CPU, and serves the output, usually behind an API.

Is llama.cpp open source?

llama.cpp ships under the MIT license. The source code lives on GitHub, so you can read it, fork it, and run it on your own hardware if your team prefers self-hosting.

Which language is llama.cpp built in?

llama.cpp is primarily a C++ project. The implementation language matters less than the hardware it supports and the throughput it delivers, but it does affect how easily your team can extend or debug it.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

ggml.org

llama.cpp

Run models almost anywhere, from a laptop CPU to a server GPU.

Running models on almost any hardware

Visit Site View on GitHub Read the Docs

GitHub Stars

118.4K

Contributors

1.8K

PyPI / Month

—

Maintained by: ggml.org
First released: Mar 2023
Last commit: Today
Language: C++
License: MIT

Strengths

Runs on CPUs, NVIDIA and AMD GPUs, and Apple Silicon from one codebase.
Tiny footprint and the foundation many other tools build on.
The GGUF quantization format keeps memory use low.

Trade-offs

More hands-on setup than a packaged desktop app.
Peak GPU throughput trails dedicated engines like vLLM.

Key Features

What the engine gives you out of the box, in plain language.

OpenAI-Compatible API
NVIDIA GPU
AMD GPU
Apple Silicon
CPU Inference
Quantization
Continuous Batching
Multi-GPU
Desktop GUI
One-Line Install
Structured Output
Streaming

Runs everywhere
One engine for CPU, CUDA, ROCm, Vulkan, and Apple Metal backends.
GGUF quantization
Compact model files that fit large models into modest memory.
llama-server
A built-in server with an OpenAI-compatible API and grammar-constrained output.

Where It Shines

The jobs this engine is best suited for.

CPU-only inference
Run a model on a machine with no GPU at all.
Edge and embedded
Fit models onto small or constrained devices with quantization.
A base for other tools
Build a custom runner on the same engine that powers many local apps.

Side-by-Side

Compare llama.cpp With Another Engine

Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.

Open the Comparator

Frequently Asked Questions

What Is an Inference Engine?

Is llama.cpp open source?

llama.cpp ships under the MIT license. The source code lives on GitHub, so you can read it, fork it, and run it on your own hardware if your team prefers self-hosting.

Which language is llama.cpp built in?

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

llama.cpp

Strengths

Trade-offs

Key Features

Runs everywhere

GGUF quantization

llama-server

Where It Shines

CPU-only inference

Edge and embedded

A base for other tools

Compare llama.cpp With Another Engine

Frequently Asked Questions

What Is an Inference Engine?

Is llama.cpp open source?

Which language is llama.cpp built in?

The AI Build Report

llama.cpp

Strengths

Trade-offs

Key Features

Runs everywhere

GGUF quantization

llama-server

Where It Shines

CPU-only inference

Edge and embedded

A base for other tools

Compare llama.cpp With Another Engine

Frequently Asked Questions

What Is an Inference Engine?

Is llama.cpp open source?

Which language is llama.cpp built in?

The AI Build Report