vLLM Project
High-throughput GPU serving with an OpenAI-compatible API out of the box.
GitHub Stars
84.6K
Contributors
2.9K
PyPI / Month
5.6M
What the engine gives you out of the box, in plain language.
A smarter way to manage GPU memory that fits more concurrent requests on the same card.
New requests join the batch as soon as a slot frees up, keeping the GPU busy.
Expose a local endpoint that mirrors the OpenAI API for chat and completions.
The jobs this engine is best suited for.
Serve an open model to a real app with high request volume and low cost per token.
Point an existing OpenAI client at your own GPU box and keep data in house.
Split a large model across several GPUs when it will not fit on one.

Side-by-Side
Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.