
Ultra-efficient Ada Lovelace inference GPU with 24GB GDDR6 in a low-profile 72W form factor. The most power-efficient NVIDIA data center GPU, ideal for dense inference deployments and video processing.
The NVIDIA L4 Tensor Core GPU is the efficiency leader of the Ada Lovelace data center lineup. Designed specifically to replace the aging T4, the L4 is a low-profile, single-slot accelerator that packs 24GB of GDDR6 VRAM into a 72W TDP. While high-end H100s and B200s dominate training headlines, the L4 is the workhorse for dense inference deployments, video processing, and local AI agent orchestration where power constraints and thermal management are primary concerns.
For engineers building agentic workflows or deploying local LLMs, the L4 represents a specialized middle ground between consumer RTX cards and high-end enterprise silicon. It lacks the active cooling of a 4090 but offers enterprise-grade reliability, ECC memory, and vGPU support. It is the definitive choice for high-density server environments and edge deployments where maximizing "performance per watt" is more critical than raw peak TFLOPS.
The NVIDIA L4 is built on the AD104 die (TSMC 4N process) and optimized for high-throughput inference rather than heavy training. The standout metric for this card is its 24GB of GDDR6 memory. In the context of 2025 AI workloads, 24GB is the "Goldilocks" zone for local LLMs, allowing practitioners to run high-quality 7B to 14B parameter models entirely in VRAM without the performance penalty of offloading to system RAM.
The L4 introduces support for the FP8 (8-bit floating point) data format, which modern inference engines like vLLM, TensorRT-LLM, and LMDeploy exploit to halve weight memory relative to FP16 and raise throughput.
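As a concrete illustration, here is a minimal vLLM sketch that loads an 8B model with on-the-fly FP8 weight quantization. The model ID, memory utilization, and context cap are illustrative choices for a 24GB card, not vendor-recommended settings:

```python
# Minimal vLLM sketch: run an 8B model with FP8 weight quantization.
# Assumes a recent vLLM with FP8 support; settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",          # on-the-fly FP8 weight quantization (Ada and newer)
    gpu_memory_utilization=0.90, # leave headroom for CUDA graphs and activations
    max_model_len=8192,          # cap context so the KV cache fits in 24GB
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```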
While the 300 GB/s memory bandwidth is lower than that of the A10 or the consumer-grade RTX 4090, the L4 compensates with 4th Generation Tensor Cores that handle sparsity and low-precision arithmetic with extreme efficiency. In a production environment, this translates to stable, low-latency performance for real-time applications.
The 72W TDP is the L4’s primary competitive advantage. It requires no external power connectors, drawing all its power directly from the PCIe slot. This makes it the best AI chip for local deployment in existing server chassis or workstations that cannot support the 450W+ requirements of high-end GPUs. Its low-profile, single-slot design allows for maximum density—often fitting 8 or more units in a 2U server.
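One way to sanity-check that power envelope in a live deployment is to poll NVML. A minimal sketch using the `pynvml` bindings (pip package `nvidia-ml-py`; GPU index 0 is assumed):

```python
# Poll GPU power draw to confirm the L4 stays inside its 72W slot budget.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000  # mW -> W

for _ in range(10):
    draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000       # mW -> W
    print(f"power draw: {draw_w:5.1f} W / limit {limit_w:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```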
When evaluating the L4's 24GB of VRAM for large language models, capacity dictates the operational ceiling. For a standard Llama 3.1 8B model served at FP8 precision via TensorRT-LLM, the weights occupy roughly 8GB, leaving the balance of VRAM for the KV cache and batching headroom; the sketch below walks through that arithmetic.
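A back-of-the-envelope sizing sketch, assuming FP8 weights (1 byte per parameter), an FP16 KV cache, and a bandwidth-bound decode phase. The Llama 3.1 8B shape parameters are public, but the resulting numbers are estimates, not benchmarks:

```python
# Back-of-the-envelope sizing for an LLM on the L4 (24GB VRAM, 300 GB/s).
# Assumptions: FP8 weights (1 byte/param), FP16 KV cache (2 bytes/elem),
# and a memory-bandwidth-bound decode phase.

def weights_gb(params_b: float, bytes_per_param: float = 1.0) -> float:
    """Model weight footprint in GB (FP8 = 1 byte per parameter)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache footprint: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128.
w = weights_gb(8.0)                                   # ~8.0 GB at FP8
kv = kv_cache_gb(32, 8, 128, seq_len=8192, batch=4)   # ~4.3 GB
print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB, total ~{w + kv:.1f} / 24 GB")

# Decode is roughly bandwidth-bound: each token reads all weights once,
# so 300 GB/s over ~8 GB of weights caps single-stream speed near ~38 tok/s.
bandwidth_gb_s = 300
print(f"upper bound ~{bandwidth_gb_s / w:.0f} tok/s per sequence")
```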
The L4 includes three NVENC and three NVDEC engines with full AV1 support. This makes it the premier choice for AI agents that process video feeds in real-time—such as automated video editing, surveillance analysis, or real-time transcription/translation services.
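For example, a transcoding job can keep both decode and encode on the card. The sketch below shells out to ffmpeg and assumes a build compiled with NVENC AV1 support; the file names and bitrate are placeholders:

```python
# Transcode a feed to AV1 on the L4's NVDEC/NVENC engines via ffmpeg.
import subprocess

subprocess.run([
    "ffmpeg",
    "-hwaccel", "cuda",                # decode on NVDEC
    "-hwaccel_output_format", "cuda",  # keep frames in GPU memory
    "-i", "camera_feed.mp4",
    "-c:v", "av1_nvenc",               # encode on NVENC (AV1, Ada and newer)
    "-preset", "p5",                   # quality/speed middle ground
    "-b:v", "2M",
    "output_av1.mp4",
], check=True)
```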
The L4 is designed for production. If you are a developer building an API-backed application and want to move away from expensive cloud providers like OpenAI or Anthropic, a cluster of L4s provides a predictable, low-latency environment. Its vGPU support allows teams to partition a single L4 into multiple smaller virtual GPUs for lighter workloads, such as embedding models (BERT, BGE-M3).
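A minimal embedding-serving sketch along those lines, using sentence-transformers with the BGE-M3 model mentioned above (the batch size is an illustrative choice):

```python
# Serve embeddings (BGE-M3) on an L4, or on a vGPU slice of one.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3", device="cuda")

docs = [
    "The L4 draws 72W directly from the PCIe slot.",
    "FP8 halves weight memory versus FP16.",
]
embeddings = model.encode(docs, batch_size=32, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024) -- BGE-M3 produces 1024-dim dense vectors
```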
For those building "Agentic" workflows—where an LLM must call tools, search the web, and execute code—the L4 provides the stability needed for 24/7 operation. Its low power draw means it can run in a home lab or office closet without specialized cooling or high electricity costs.
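A skeleton of such a loop, assuming a vLLM server on the same machine exposing the OpenAI-compatible API; the endpoint, model name, and single `get_time` tool are illustrative assumptions:

```python
# Minimal tool-calling agent step against a local OpenAI-compatible server
# (e.g. started with `vllm serve ...` on this machine).
import datetime
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current UTC time.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What time is it in UTC?"}],
    tools=tools,
)

# Execute the tool locally if the model requested it.
calls = resp.choices[0].message.tool_calls
if calls and calls[0].function.name == "get_time":
    print(datetime.datetime.now(datetime.timezone.utc).isoformat())
```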
The L4 is the best hardware for local AI agents in 2025 for edge scenarios. Whether it's a smart factory or an on-premise medical imaging server, the L4’s thermal profile and enterprise support lifecycle make it superior to consumer hardware.
The RTX 4090 is significantly faster in terms of raw compute and memory bandwidth (roughly 1 TB/s vs 300 GB/s). However, the 4090 is a 450W triple-slot card that is difficult to stack in servers. The L4 counters with a 72W slot-powered draw, a low-profile single-slot footprint, ECC memory, and vGPU support, trading peak throughput for rack density.
The L4 is the direct successor to the T4. It offers up to 2.5x the performance in AI inference and significantly better video encoding capabilities. If you are currently running T4 instances in the cloud (like AWS G4dn), moving to L4 (G6 instances) provides a massive uplift in tokens per second and allows for the use of FP8 precision.
While AMD’s MI300 series is competitive at the high end, NVIDIA remains the standard for local and mid-tier inference. The L4 benefits from the mature CUDA ecosystem and TensorRT, which generally offer better out-of-the-box optimization for new models like DeepSeek-R1 or Llama 3.1 compared to AMD's ROCm. For practitioners who want "plug and play" compatibility with the widest range of GitHub repositories and model architectures, the L4 is the safer bet.
The table below lists estimated single-L4 throughput and memory footprints by model; tiers grade practical suitability, and footprints above the card's 24GB imply system-RAM offloading. Some model names were missing from the source data and are left blank.

| Model | Developer | Parameters | Tier | Throughput (tok/s) | Est. VRAM (GB) |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | S | 44.8 | 5.4 |
| | | 8B | S | 42.6 | 5.7 |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | A | 28.3 | 8.5 |
| Llama 2 7B Chat | Meta | 7B | A | 50.4 | 4.8 |
| Mistral 7B Instruct | Mistral AI | 7B | A | 37.8 | 6.4 |
| Llama 2 13B Chat | Meta | 13B | A | 28.5 | 8.5 |
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | A | 21.3 | 11.4 |
| Gemma 4 E4B IT | Google | 4B | A | 34.9 | 6.9 |
| Gemma 3 4B IT | Google | 4B | A | 34.9 | 6.9 |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | A | 21.9 | 11.0 |
| Gemma 4 E2B IT | Google | 2B | A | 65.1 | 3.7 |
| | | 8B | A | 18.1 | 13.3 |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | D | 9.9 | 24.4 |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | D | 9.8 | 24.6 |
| Mistral Small 3 24B | Mistral AI | 24B | F | 6.2 | 39.0 |
| Gemma 3 27B IT | Google | 27B | F | 5.5 | 43.8 |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | F | 3.3 | 72.8 |
| Gemma 4 31B IT | Google | 31B | F | 2.9 | 82.0 |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | F | 4.5 | 53.9 |
| LLaMA 65B | Meta | 65B | F | 6.2 | 39.3 |
| Llama 2 70B Chat | Meta | 70B | F | 5.6 | 43.4 |
| | | 70B | F | 5.3 | 45.7 |
| | | 70B | F | 2.1 | 112.8 |
| Llama 4 Scout | Meta | 109B (17B active) | F | 0.2 | 1370.4 |

