InternLM

LMDeploy

Compress, deploy, and serve open models with high throughput.

High-throughput serving with built-in quantization

Visit Site View on GitHub Read the Docs Join Discord

GitHub Stars

7.9K

Contributors

140

PyPI / Month

52.8K

Maintained by: InternLM
First released: Jun 2023
Last commit: Yesterday
Language: Mixed
License: Apache 2.0

Strengths

Strong throughput from the TurboMind engine.
Built-in quantization fits larger models on the same GPU.
Ships an OpenAI-compatible server out of the box.

Trade-offs

Focused on NVIDIA GPUs.
Smaller English-language community than vLLM.

Key Features

What the engine gives you out of the box, in plain language.

OpenAI-Compatible API
NVIDIA GPU
AMD GPU
Apple Silicon
CPU Inference
Quantization
Continuous Batching
Multi-GPU
Desktop GUI
One-Line Install
Structured Output
Streaming

TurboMind engine
A fast inference backend tuned for throughput on NVIDIA GPUs.
Built-in quantization
Shrink models with weight and KV cache quantization to save memory.
OpenAI-compatible server
Serves a familiar API so existing clients connect without changes.

Where It Shines

The jobs this engine is best suited for.

Production model serving
Serve an open model to a busy app with low cost per token.
Memory-constrained GPUs
Use quantization to run a bigger model on a smaller card.
Self-hosted OpenAI replacement
Point existing OpenAI clients at your own GPU box.

Side-by-Side

Compare LMDeploy With Another Engine

Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.

Open the Comparator

Frequently Asked Questions

What Is an Inference Engine?

An inference engine is the software that runs a language model and turns your prompt into tokens. It loads the model weights, manages memory on your GPU or CPU, and serves the output, usually behind an API.

Is LMDeploy open source?

LMDeploy ships under the Apache 2.0 license. The source code lives on GitHub, so you can read it, fork it, and run it on your own hardware if your team prefers self-hosting.

Which language is LMDeploy built in?

LMDeploy is primarily a Mixed project. The implementation language matters less than the hardware it supports and the throughput it delivers, but it does affect how easily your team can extend or debug it.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

InternLM

LMDeploy

Compress, deploy, and serve open models with high throughput.

High-throughput serving with built-in quantization

Visit Site View on GitHub Read the Docs Join Discord

GitHub Stars

7.9K

Contributors

140

PyPI / Month

52.8K

Maintained by: InternLM
First released: Jun 2023
Last commit: Yesterday
Language: Mixed
License: Apache 2.0

Strengths

Strong throughput from the TurboMind engine.
Built-in quantization fits larger models on the same GPU.
Ships an OpenAI-compatible server out of the box.

Trade-offs

Focused on NVIDIA GPUs.
Smaller English-language community than vLLM.

Key Features

What the engine gives you out of the box, in plain language.

OpenAI-Compatible API
NVIDIA GPU
AMD GPU
Apple Silicon
CPU Inference
Quantization
Continuous Batching
Multi-GPU
Desktop GUI
One-Line Install
Structured Output
Streaming

TurboMind engine
A fast inference backend tuned for throughput on NVIDIA GPUs.
Built-in quantization
Shrink models with weight and KV cache quantization to save memory.
OpenAI-compatible server
Serves a familiar API so existing clients connect without changes.

Where It Shines

The jobs this engine is best suited for.

Production model serving
Serve an open model to a busy app with low cost per token.
Memory-constrained GPUs
Use quantization to run a bigger model on a smaller card.
Self-hosted OpenAI replacement
Point existing OpenAI clients at your own GPU box.

Side-by-Side

Compare LMDeploy With Another Engine

Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.

Open the Comparator

Frequently Asked Questions

What Is an Inference Engine?

Is LMDeploy open source?

LMDeploy ships under the Apache 2.0 license. The source code lives on GitHub, so you can read it, fork it, and run it on your own hardware if your team prefers self-hosting.

Which language is LMDeploy built in?

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

LMDeploy

Strengths

Trade-offs

Key Features

TurboMind engine

Built-in quantization

OpenAI-compatible server

Where It Shines

Production model serving

Memory-constrained GPUs

Self-hosted OpenAI replacement

Compare LMDeploy With Another Engine

Frequently Asked Questions

What Is an Inference Engine?

Is LMDeploy open source?

Which language is LMDeploy built in?

The AI Build Report

LMDeploy

Strengths

Trade-offs

Key Features

TurboMind engine

Built-in quantization

OpenAI-compatible server

Where It Shines

Production model serving

Memory-constrained GPUs

Self-hosted OpenAI replacement

Compare LMDeploy With Another Engine

Frequently Asked Questions

What Is an Inference Engine?

Is LMDeploy open source?

Which language is LMDeploy built in?

The AI Build Report