NVIDIA

TensorRT-LLM

NVIDIA's engine for the fastest inference on NVIDIA GPUs.

Maximum performance on NVIDIA GPUs

Visit Site View on GitHub Read the Docs

GitHub Stars

14.0K

Contributors

411

PyPI / Month

11.1K

Maintained by: NVIDIA
First released: Aug 2023
Last commit: Today
Language: Mixed
License: Apache 2.0

Strengths

Among the fastest options when you run on NVIDIA GPUs.
In-flight batching and quantization squeeze more out of each card.
Scales across multiple GPUs for large models.

Trade-offs

NVIDIA only, with no CPU, AMD, or Apple Silicon support.
Building optimized engines adds a setup step before you serve.

Key Features

What the engine gives you out of the box, in plain language.

OpenAI-Compatible API
NVIDIA GPU
AMD GPU
Apple Silicon
CPU Inference
Quantization
Continuous Batching
Multi-GPU
Desktop GUI
One-Line Install
Structured Output
Streaming

Compiled inference engines
Turns a model into a hardware-tuned engine for faster runs on NVIDIA GPUs.
In-flight batching
Adds and removes requests from the batch on the fly to keep the GPU busy.
OpenAI-compatible server
trtllm-serve exposes a local endpoint that mirrors the OpenAI API.

Where It Shines

The jobs this engine is best suited for.

Latency-sensitive serving
Serve a model where every millisecond of response time matters.
High-volume NVIDIA deployments
Get the most throughput per card on NVIDIA hardware.
Large multi-GPU models
Split a big model across several NVIDIA GPUs.

Side-by-Side

Compare TensorRT-LLM With Another Engine

Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.

Open the Comparator

Frequently Asked Questions

What Is an Inference Engine?

An inference engine is the software that runs a language model and turns your prompt into tokens. It loads the model weights, manages memory on your GPU or CPU, and serves the output, usually behind an API.

Is TensorRT-LLM open source?

TensorRT-LLM ships under the Apache 2.0 license. The source code lives on GitHub, so you can read it, fork it, and run it on your own hardware if your team prefers self-hosting.

Which language is TensorRT-LLM built in?

TensorRT-LLM is primarily a Mixed project. The implementation language matters less than the hardware it supports and the throughput it delivers, but it does affect how easily your team can extend or debug it.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

NVIDIA

TensorRT-LLM

NVIDIA's engine for the fastest inference on NVIDIA GPUs.

Maximum performance on NVIDIA GPUs

Visit Site View on GitHub Read the Docs

GitHub Stars

14.0K

Contributors

411

PyPI / Month

11.1K

Maintained by: NVIDIA
First released: Aug 2023
Last commit: Today
Language: Mixed
License: Apache 2.0

Strengths

Among the fastest options when you run on NVIDIA GPUs.
In-flight batching and quantization squeeze more out of each card.
Scales across multiple GPUs for large models.

Trade-offs

NVIDIA only, with no CPU, AMD, or Apple Silicon support.
Building optimized engines adds a setup step before you serve.

Key Features

What the engine gives you out of the box, in plain language.

OpenAI-Compatible API
NVIDIA GPU
AMD GPU
Apple Silicon
CPU Inference
Quantization
Continuous Batching
Multi-GPU
Desktop GUI
One-Line Install
Structured Output
Streaming

Compiled inference engines
Turns a model into a hardware-tuned engine for faster runs on NVIDIA GPUs.
In-flight batching
Adds and removes requests from the batch on the fly to keep the GPU busy.
OpenAI-compatible server
trtllm-serve exposes a local endpoint that mirrors the OpenAI API.

Where It Shines

The jobs this engine is best suited for.

Latency-sensitive serving
Serve a model where every millisecond of response time matters.
High-volume NVIDIA deployments
Get the most throughput per card on NVIDIA hardware.
Large multi-GPU models
Split a big model across several NVIDIA GPUs.

Side-by-Side

Compare TensorRT-LLM With Another Engine

Add a second or third engine and see stars, downloads, and capabilities lined up next to each other.

Open the Comparator

Frequently Asked Questions

What Is an Inference Engine?

Is TensorRT-LLM open source?

TensorRT-LLM ships under the Apache 2.0 license. The source code lives on GitHub, so you can read it, fork it, and run it on your own hardware if your team prefers self-hosting.

Which language is TensorRT-LLM built in?

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

TensorRT-LLM

Strengths

Trade-offs

Key Features

Compiled inference engines

In-flight batching

OpenAI-compatible server

Where It Shines

Latency-sensitive serving

High-volume NVIDIA deployments

Large multi-GPU models

Compare TensorRT-LLM With Another Engine

Frequently Asked Questions

What Is an Inference Engine?

Is TensorRT-LLM open source?

Which language is TensorRT-LLM built in?

The AI Build Report

TensorRT-LLM

Strengths

Trade-offs

Key Features

Compiled inference engines

In-flight batching

OpenAI-compatible server

Where It Shines

Latency-sensitive serving

High-volume NVIDIA deployments

Large multi-GPU models

Compare TensorRT-LLM With Another Engine

Frequently Asked Questions

What Is an Inference Engine?

Is TensorRT-LLM open source?

Which language is TensorRT-LLM built in?

The AI Build Report