
Ada Lovelace data center GPU optimized for inference, graphics, and media workloads. 48GB GDDR6 with ECC and no NVLink, positioned for versatile enterprise deployment.
The NVIDIA L40S is a high-performance data center GPU built on the Ada Lovelace architecture, specifically engineered to bridge the gap between pure graphics rendering and large-scale AI inference. While the H100 remains the flagship for massive foundation model training, the L40S has emerged as the pragmatic "workhorse" for enterprise AI deployment. It is a PCIe-based card designed for universal compatibility with standard server racks, making it one of the most accessible 48GB GPUs for AI development and production-grade inference.
Positioned as the successor to the A40 and a more versatile alternative to the A100, the L40S is optimized for the current shift toward agentic workflows and fine-tuning. Unlike the consumer-grade RTX 4090, which shares the same AD102 silicon, the L40S features enterprise-grade ECC memory, a passive cooling design for server environments, and significantly higher FP16 compute performance. It competes directly with the NVIDIA RTX 6000 Ada in the professional workstation space and offers an alternative to AMD's Instinct MI210 for specialized inference tasks.
For AI engineers, the most critical metric for the NVIDIA L40S is its 48GB of GDDR6 memory. This capacity allows for the local deployment of substantial models that would otherwise require multi-GPU setups. While it lacks NVLink support (meaning you cannot pool memory across cards with the efficiency of an H100 cluster), its high per-card throughput makes it a premier choice for AI inference in single-node configurations.
The L40S delivers 362.1 TFLOPS of FP16 performance. In practical terms, this translates to massive throughput for batch inference. The inclusion of 4th Generation Tensor Cores allows it to hit 724.2 TOPS of INT8 performance, which is vital for running highly quantized models at extreme speeds.
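To see what running a quantized model looks like in practice, here is a minimal sketch using Hugging Face Transformers with bitsandbytes 8-bit loading; the model ID is a placeholder, and exact flags may differ across library versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder checkpoint; any HF-format model that fits in 48GB works.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 8-bit weight quantization roughly halves VRAM versus FP16 and routes
# matmuls through the INT8 Tensor Core path where supported.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place the whole model on the L40S if it fits
)

inputs = tokenizer("The L40S is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```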
With a memory bandwidth of 864 GB/s on a 384-bit bus, the L40S is significantly faster than the previous generation A40 (696 GB/s). In LLM terms, memory bandwidth is the primary bottleneck for "tokens per second." The L40S provides enough headroom to ensure that even large models don't feel sluggish during interactive chat sessions or real-time agentic reasoning.
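A back-of-the-envelope way to reason about this: during decode, every weight must be streamed from VRAM once per generated token, so bandwidth divided by model size bounds tokens per second. A sketch, assuming a flat efficiency factor and ignoring KV-cache traffic:

```python
def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float = 864.0,
                          efficiency: float = 0.7) -> float:
    """Bandwidth-bound decode estimate: all weights stream once per token.

    `efficiency` is an assumed fraction of peak bandwidth actually achieved.
    """
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9 * efficiency) / bytes_per_token

# A 70B model with 4-bit weights (~0.5 bytes/param) on the L40S:
print(f"{decode_tokens_per_sec(70, 0.5):.1f} tok/s")  # ~17 tok/s
# An 8B model at FP16 (2 bytes/param):
print(f"{decode_tokens_per_sec(8, 2.0):.1f} tok/s")   # ~38 tok/s
```

At 4-bit precision, a 70B model reads roughly 35 GB per token, which is why 70B-class results land in the mid-teens of tokens per second on this card.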
The card has a 350W TDP. While high, it is manageable within standard enterprise power envelopes. For teams building "local AI agents 2025," the L40S offers a superior performance-per-watt ratio compared to older Ampere-based cards, especially when utilizing Transformer Engine acceleration to optimize precision levels dynamically.
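As an illustration of dynamic precision, NVIDIA's Transformer Engine library exposes an FP8 autocast context on Ada and Hopper GPUs. A minimal sketch; the layer sizes and recipe settings are arbitrary, and the exact API surface varies by Transformer Engine version:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# One FP8-capable linear layer; Transformer Engine swaps in FP8 GEMM kernels.
layer = te.Linear(4096, 4096, bias=True,
                  params_dtype=torch.bfloat16, device="cuda")
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Delayed scaling tracks recent amax values to choose per-tensor FP8 scales.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)  # torch.Size([8, 4096])
```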
The 48GB VRAM capacity is the "sweet spot" for modern open-source weights. When evaluating "NVIDIA L40S VRAM for large language models," practitioners can expect the compatibility summarized in the table at the end of this article.
The L40S is "Production Ready." It is designed for 24/7 operation in data centers. Teams deploying internal RAG (Retrieval-Augmented Generation) pipelines or AI-powered customer service agents find the L40S ideal because it can be easily added to existing PCIe-based servers without requiring specialized HGX baseboards.
For developers building "local AI agents," the L40S provides the necessary VRAM to keep multiple models resident in memory. An agentic workflow might require a primary LLM (Llama 3 70B) and a secondary embedding model or a small "judge" model (Phi-3) running simultaneously. The 48GB buffer allows for this multi-model residency without the latency of swapping weights from system RAM.
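A rough budget check (the sizes below are illustrative assumptions, not measurements) shows why this pattern fits in 48GB:

```python
# Rough VRAM budget for multi-model residency on one 48GB card.
# All figures are illustrative assumptions, not measurements.
BUDGET_GB = 48.0

resident_models = {
    "Llama 3 70B, 4-bit weights": 35.0,  # ~70e9 params x 0.5 bytes
    "Phi-3-mini judge, 4-bit":     2.0,
    "embedding model, FP16":       0.7,
}
kv_cache_gb = 6.0  # assumed reservation for KV cache across active sessions

used = sum(resident_models.values()) + kv_cache_gb
print(f"used {used:.1f} GB of {BUDGET_GB:.0f} GB "
      f"({BUDGET_GB - used:.1f} GB headroom)")
# used 43.7 GB of 48 GB (4.3 GB headroom)
```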
While not intended for training GPT-5, the L40S is an excellent "AI GPU for agent training" and LoRA (Low-Rank Adaptation) fine-tuning platform. Practitioners can fine-tune 8B and 30B models locally using frameworks like Unsloth or Axolotl, benefiting from the 4th Generation Tensor Cores, which support FP8 training for faster convergence and lower memory overhead.
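A minimal LoRA configuration with Hugging Face PEFT might look like the following; the base model, rank, and target modules are illustrative defaults rather than tuned values:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder 8B-class base model; 4-bit loading would shrink it further.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                    # adapter rank; 8-64 is a common range
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # a small fraction of the base model
```

Because only the adapter weights train (typically well under 1% of total parameters), optimizer state stays small and the whole job fits comfortably in 48GB.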
The L40S and the RTX 6000 Ada share the same core architecture and 48GB VRAM. However, the L40S is a passively cooled server card with a higher power limit (350W vs 300W on the 6000 Ada), leading to slightly better sustained performance in data center environments. Choose the L40S for rack servers; choose the 6000 Ada for desktop workstations.
The A100 has nearly double the VRAM and significantly higher memory bandwidth (HBM2e), making it superior for massive batch processing and training. However, the L40S is built on the newer Ada architecture, which adds the Transformer Engine and ray-tracing cores (the A100 has no RT cores at all). For single-stream inference and graphics-heavy AI (like 3D Gaussian Splatting), the L40S often outperforms the older A100 at a lower price point.
While AMD’s MI300X offers more VRAM (192GB), the NVIDIA software ecosystem (CUDA, TensorRT, Triton) remains the industry standard for "best nvidia gpus for running AI models locally." The L40S benefits from day-one support for every major inference framework, from vLLM and TGI to LM Studio and Ollama, ensuring that practitioners spend their time building agents rather than debugging drivers.
| Model | Developer | Parameters | Grade | Throughput (tok/s) | VRAM Used (GB) |
|---|---|---|---|---|---|
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | S | 61.2 | 11.4 |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | S | 63.2 | 11.0 |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | S | 81.5 | 8.5 |
| | | 8B | S | 52.2 | 13.3 |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | S | 129.1 | 5.4 |
| Llama 2 13B Chat | Meta | 13B | S | 82.2 | 8.5 |
| Qwen3.5-122B-A10B | Alibaba Cloud (Qwen) | 122B (10B active) | A | 25.5 | 27.3 |
| | | 8B | A | 122.8 | 5.7 |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | A | 28.6 | 24.4 |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | A | 28.3 | 24.6 |
| Gemma 4 E4B IT | Google | 4B | A | 100.6 | 6.9 |
| Gemma 3 4B IT | Google | 4B | A | 100.6 | 6.9 |
| Mistral 7B Instruct | Mistral AI | 7B | A | 108.8 | 6.4 |
| Qwen3-235B-A22B | Alibaba Cloud (Qwen) | 235B (22B active) | A | 19.1 | 36.3 |
| Llama 2 7B Chat | Meta | 7B | A | 145.2 | 4.8 |
| Gemma 4 E2B IT | Google | 2B | A | 187.6 | 3.7 |
| Mistral Small 3 24B | Mistral AI | 24B | B | 17.8 | 39.0 |
| LLaMA 65B | Meta | 65B | B | 17.7 | 39.3 |
| Llama 2 70B Chat | Meta | 70B | B | 16.0 | 43.4 |
| Mixtral 8x22B Instruct | Mistral AI | 141B (39B active) | B | 16.0 | 43.6 |
| | | 70B | B | 15.2 | 45.7 |
| Qwen3.5-397B-A17B | Alibaba Cloud (Qwen) | 397B (17B active) | B | 15.1 | 46.0 |
| Gemma 3 27B IT | Google | 27B | B | 15.9 | 43.8 |
| Kimi K2 Instruct | Moonshot AI | 1000B (32B active) | C | 13.4 | 51.8 |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | F | 9.6 | 72.8 |

