
Ada Lovelace data center GPU optimized for inference, graphics, and media workloads. 48GB GDDR6 with ECC and no NVLink, positioned for versatile enterprise deployment.
The NVIDIA L40S is a high-performance data center GPU built on the Ada Lovelace architecture, specifically engineered to bridge the gap between pure graphics rendering and large-scale AI inference. While the H100 remains the flagship for massive foundation model training, the L40S has emerged as the pragmatic "workhorse" for enterprise AI deployment. It is a PCIe-based card designed for universal compatibility with standard server racks, making it one of the most accessible 48GB GPUs for AI development and production-grade inference.
Positioned as the successor to the A40 and a more versatile alternative to the A100, the L40S is optimized for the current shift toward agentic workflows and fine-tuning. Unlike the consumer-grade RTX 4090, which shares the same AD102 silicon, the L40S features enterprise-grade ECC memory, a passive cooling design for server environments, and significantly higher FP16 compute performance. It competes directly with the NVIDIA RTX 6000 Ada in the professional workstation space and offers an alternative to AMD's Instinct MI210 for specialized inference tasks.
For AI engineers, the most critical metric for the NVIDIA L40S is its 48GB of GDDR6 memory. This capacity allows for the local deployment of substantial models that would otherwise require multi-GPU setups. While it lacks NVLink support (meaning you cannot pool memory across cards with the efficiency of an H100 cluster), its high per-card throughput makes it a premier choice for AI inference in single-node configurations.
The L40S delivers 362.1 TFLOPS of FP16 performance. In practical terms, this translates to massive throughput for batch inference. The inclusion of 4th Generation Tensor Cores allows it to hit 724.2 TOPS of INT8 performance, which is vital for running highly quantized models at extreme speeds.
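To see what running a quantized model looks like in practice, here is a minimal sketch using Hugging Face Transformers with bitsandbytes 8-bit loading; the model ID is a placeholder, and exact flags may differ across library versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder checkpoint; any HF-format model that fits in 48GB works.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 8-bit weight quantization roughly halves VRAM versus FP16 and routes
# matmuls through the INT8 Tensor Core path where supported.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place the whole model on the L40S if it fits
)

inputs = tokenizer("The L40S is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```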
With a memory bandwidth of 864 GB/s on a 384-bit bus, the L40S is significantly faster than the previous generation A40 (696 GB/s). In LLM terms, memory bandwidth is the primary bottleneck for "tokens per second." The L40S provides enough headroom to ensure that even large models don't feel sluggish during interactive chat sessions or real-time agentic reasoning.
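A back-of-the-envelope way to reason about this: during decode, every weight must be streamed from VRAM once per generated token, so bandwidth divided by model size bounds tokens per second. A sketch, assuming a flat efficiency factor and ignoring KV-cache traffic:

```python
def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float = 864.0,
                          efficiency: float = 0.7) -> float:
    """Bandwidth-bound decode estimate: all weights stream once per token.

    `efficiency` is an assumed fraction of peak bandwidth actually achieved.
    """
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9 * efficiency) / bytes_per_token

# A 70B model with 4-bit weights (~0.5 bytes/param) on the L40S:
print(f"{decode_tokens_per_sec(70, 0.5):.1f} tok/s")  # ~17 tok/s
# An 8B model at FP16 (2 bytes/param):
print(f"{decode_tokens_per_sec(8, 2.0):.1f} tok/s")   # ~38 tok/s
```

At 4-bit precision, a 70B model reads roughly 35 GB per token, which is why 70B-class results land in the mid-teens of tokens per second on this card.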
The card has a 350W TDP. While high, it is manageable within standard enterprise power envelopes. For teams building "local AI agents 2025," the L40S offers a superior performance-per-watt ratio compared to older Ampere-based cards, especially when utilizing Transformer Engine acceleration to optimize precision levels dynamically.
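As an illustration of dynamic precision, NVIDIA's Transformer Engine library exposes an FP8 autocast context on Ada and Hopper GPUs. A minimal sketch; the layer sizes and recipe settings are arbitrary, and the exact API surface varies by Transformer Engine version:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# One FP8-capable linear layer; Transformer Engine swaps in FP8 GEMM kernels.
layer = te.Linear(4096, 4096, bias=True,
                  params_dtype=torch.bfloat16, device="cuda")
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Delayed scaling tracks recent amax values to choose per-tensor FP8 scales.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)  # torch.Size([8, 4096])
```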
The 48GB VRAM capacity is the "sweet spot" for modern open-source weights. When evaluating "NVIDIA L40S VRAM for large language models," practitioners can expect the compatibility summarized in the table at the end of this article.
The L40S is "Production Ready." It is designed for 24/7 operation in data centers. Teams deploying internal RAG (Retrieval-Augmented Generation) pipelines or AI-powered customer service agents find the L40S ideal because it can be easily added to existing PCIe-based servers without requiring specialized HGX baseboards.
For developers building "local AI agents," the L40S provides the necessary VRAM to keep multiple models resident in memory. An agentic workflow might require a primary LLM (Llama 3 70B) and a secondary embedding model or a small "judge" model (Phi-3) running simultaneously. The 48GB buffer allows for this multi-model residency without the latency of swapping weights from system RAM.
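A rough budget check (the sizes below are illustrative assumptions, not measurements) shows why this pattern fits in 48GB:

```python
# Rough VRAM budget for multi-model residency on one 48GB card.
# All figures are illustrative assumptions, not measurements.
BUDGET_GB = 48.0

resident_models = {
    "Llama 3 70B, 4-bit weights": 35.0,  # ~70e9 params x 0.5 bytes
    "Phi-3-mini judge, 4-bit":     2.0,
    "embedding model, FP16":       0.7,
}
kv_cache_gb = 6.0  # assumed reservation for KV cache across active sessions

used = sum(resident_models.values()) + kv_cache_gb
print(f"used {used:.1f} GB of {BUDGET_GB:.0f} GB "
      f"({BUDGET_GB - used:.1f} GB headroom)")
# used 43.7 GB of 48 GB (4.3 GB headroom)
```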
While not intended for training GPT-5, the L40S is an excellent "AI GPU for agent training" and LoRA (Low-Rank Adaptation) fine-tuning platform. Practitioners can fine-tune 8B and 30B models locally using frameworks like Unsloth or Axolotl, benefiting from the 4th Generation Tensor Cores, which support FP8 training for faster convergence and lower memory overhead.
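A minimal LoRA configuration with Hugging Face PEFT might look like the following; the base model, rank, and target modules are illustrative defaults rather than tuned values:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder 8B-class base model; 4-bit loading would shrink it further.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                    # adapter rank; 8-64 is a common range
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # a small fraction of the base model
```

Because only the adapter weights train (typically well under 1% of total parameters), optimizer state stays small and the whole job fits comfortably in 48GB.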
The L40S and the RTX 6000 Ada share the same core architecture and 48GB VRAM. However, the L40S is a passively cooled server card with a higher power limit (350W vs 300W on the 6000 Ada), leading to slightly better sustained performance in data center environments. Choose the L40S for rack servers; choose the 6000 Ada for desktop workstations.
The A100 has nearly double the VRAM and significantly higher memory bandwidth (HBM2e), making it superior for massive batch processing and training. However, the L40S is built on the newer Ada architecture, which adds the Transformer Engine and ray-tracing cores (the A100 has no RT cores at all). For single-stream inference and graphics-heavy AI (like 3D Gaussian Splatting), the L40S often outperforms the older A100 at a lower price point.
While AMD’s MI300X offers more VRAM (192GB), the NVIDIA software ecosystem (CUDA, TensorRT, Triton) remains the industry standard for "best nvidia gpus for running AI models locally." The L40S benefits from day-one support for every major inference framework, from vLLM and TGI to LM Studio and Ollama, ensuring that practitioners spend their time building agents rather than debugging drivers.
| Model | Developer | Parameters | Grade | Throughput (tok/s) | VRAM Used (GB) |
|---|---|---|---|---|---|
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | S | 61.2 | 11.4 |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | S | 63.2 | 11.0 |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | S | 81.5 | 8.5 |
| | | 8B | S | 52.2 | 13.3 |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | S | 129.1 | 5.4 |
| Llama 2 13B Chat | Meta | 13B | S | 82.2 | 8.5 |
| Qwen3.5-122B-A10B | Alibaba Cloud (Qwen) | 122B (10B active) | A | 25.5 | 27.3 |
| | | 8B | A | 122.8 | 5.7 |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | A | 28.6 | 24.4 |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | A | 28.3 | 24.6 |
| Gemma 4 E4B IT | Google | 4B | A | 100.6 | 6.9 |
| Gemma 3 4B IT | Google | 4B | A | 100.6 | 6.9 |
| Mistral 7B Instruct | Mistral AI | 7B | A | 108.8 | 6.4 |
| Qwen3-235B-A22B | Alibaba Cloud (Qwen) | 235B (22B active) | A | 19.1 | 36.3 |
| Llama 2 7B Chat | Meta | 7B | A | 145.2 | 4.8 |
| Gemma 4 E2B IT | Google | 2B | A | 187.6 | 3.7 |
| Mistral Small 3 24B | Mistral AI | 24B | B | 17.8 | 39.0 |
| LLaMA 65B | Meta | 65B | B | 17.7 | 39.3 |
| Llama 2 70B Chat | Meta | 70B | B | 16.0 | 43.4 |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | B | 16.0 | 43.6 |
| | | 70B | B | 15.2 | 45.7 |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | B | 15.1 | 46.0 |
| Gemma 3 27B IT | Google | 27B | B | 15.9 | 43.8 |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | C | 13.4 | 51.8 |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | F | 9.6 | 72.8 |

