
NVIDIA's flagship Ampere consumer GPU (GA102, Samsung 8nm, 28.3B transistors). Features 10,496 CUDA cores across 82 SMs, 24 GB GDDR6X at 936 GB/s on a 384-bit bus, 328 3rd-gen Tensor Cores, and 82 2nd-gen RT Cores. Supports 2-way NVLink for 48 GB combined VRAM. Positioned as the spiritual successor to the Titan series, targeting 8K HDR gaming and professional AI/creative workloads.
The NVIDIA GeForce RTX 3090 is a prosumer GPU that, since its launch, has become a de facto standard for running large language models (LLMs) locally. Built on the Ampere architecture (GA102 die) and fabricated on Samsung’s 8nm process, it packs 28.3 billion transistors into a 628 mm² die. Its 24 GB of GDDR6X VRAM gives it the memory capacity to load models that most consumer GPUs cannot touch, which is why it remains a top choice for AI engineers, ML researchers, and hobbyists running local inference in 2026.
Priced at an MSRP of $1,499, the RTX 3090 occupies a unique niche: it delivers datacenter-like VRAM capacity at a fraction of the cost of an A100 or H100. It’s the spiritual successor to NVIDIA’s Titan line, and while it was marketed for 8K gaming and creative workflows, its real value for the AI community lies in its 24 GB of VRAM, 936 GB/s of memory bandwidth, and 3rd-gen Tensor Cores. For practitioners who need to run 13B-parameter models or quantized 30B–70B models on a single machine, the RTX 3090 remains a compelling, readily available option.
This page covers the RTX 3090’s AI-specific specs, real-world model compatibility, expected tokens-per-second, and how it stacks up against alternatives like the RTX 4090 and used datacenter cards.
For AI inference, especially transformer-based models, three metrics matter most: VRAM capacity, memory bandwidth, and compute throughput.
VRAM: 24 GB GDDR6X — This is the RTX 3090’s killer feature. 24 GB comfortably holds 13B-class models with little or no quantization (a 13B model needs roughly 26 GB of weights at FP16, so 8-bit is the practical ceiling; Llama 3.1 8B at Q8 fits with headroom, and Mistral 7B Q4_K_M fits easily). With 4-bit quantization, you can squeeze in 30B–34B models like Qwen 2.5 32B, DeepSeek-R1 33B, or Yi-34B. Two RTX 3090s connected via NVLink give you 48 GB combined, enabling models like Llama 3.1 70B in 4-bit with tensor parallelism.
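As a rule of thumb, weight memory is roughly parameters times bits-per-weight divided by 8, plus overhead for the KV cache, activations, and the CUDA context. The sketch below is a minimal illustration of that arithmetic; the 20% overhead factor and the example model list are assumptions for illustration, not measured values.

```python
# Rough VRAM estimate for an LLM at a given weight precision.
# The 1.2 overhead factor (KV cache, activations, CUDA context) is an
# assumed rule of thumb, not a measured value.

def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # billions of params * bits / 8 ≈ GB
    return weight_gb * overhead

RTX_3090_VRAM_GB = 24

# Illustrative entries matching the examples above.
candidates = [
    ("Llama 3.1 8B @ Q8", 8, 8),
    ("Yi-34B @ 4-bit", 34, 4),
    ("Llama 3.1 70B @ 4-bit", 70, 4),
]

for name, params, bits in candidates:
    need = estimate_vram_gb(params, bits)
    verdict = "fits on one 3090" if need <= RTX_3090_VRAM_GB else "needs offload or a second GPU"
    print(f"{name}: ~{need:.1f} GB -> {verdict}")
```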
Memory Bandwidth: 936 GB/s — On a 384-bit bus with 19.5 Gbps GDDR6X, the RTX 3090 delivers high bandwidth that directly impacts token generation speed. For a 13B model in 4-bit, expect 35–70 tokens/sec on a single card — enough for interactive chatbot use. Bandwidth is the primary bottleneck for autoregressive generation; the RTX 3090’s 936 GB/s trails the RTX 4090’s 1,008 GB/s by only a small margin and comfortably exceeds mid-range workstation cards like the RTX A4000 (448 GB/s) and A5000 (768 GB/s).
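Because each generated token has to stream essentially the full set of weights from VRAM, a quick upper bound on single-stream decode speed is bandwidth divided by the model’s weight footprint. A minimal back-of-the-envelope sketch follows; the 40% efficiency factor is an assumed figure, since real backends reach only a fraction of the theoretical ceiling.

```python
# Upper bound on autoregressive decode speed: each token reads roughly
# the entire weight footprint from VRAM once.
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: int) -> float:
    model_gb = params_billion * bits_per_weight / 8  # weight footprint in GB
    return bandwidth_gb_s / model_gb

RTX_3090_BW_GB_S = 936

ceiling = decode_ceiling_tok_s(RTX_3090_BW_GB_S, 13, 4)  # 13B model, 4-bit weights
print(f"theoretical ceiling: {ceiling:.0f} tok/s")                 # ~144 tok/s
print(f"at an assumed 40% efficiency: {ceiling * 0.4:.0f} tok/s")  # ~58 tok/s
```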
Compute Performance: 35.58 TFLOPS FP32, 142 TFLOPS FP16 Tensor — The 328 3rd-gen Tensor Cores support mixed-precision operations (TF32, BF16, FP16, INT8). In practice, inference uses FP16 or quantized INT8, so the 142 TFLOPS (dense) for FP16 tensor math is the relevant figure. That’s enough to keep the memory pipeline fed for most LLM inference workloads. For training, the RTX 3090 falls short of the RTX 4090 (330 TFLOPS FP16) or RTX 6000 Ada, but for fine-tuning (QLoRA, LoRA) it works well for models up to 13B.
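For reference, a QLoRA-style setup with Hugging Face transformers, peft, and bitsandbytes looks roughly like the sketch below. The model name, LoRA rank, and target modules are illustrative assumptions rather than tuned recommendations.

```python
# Minimal QLoRA-style fine-tuning setup (sketch). The model, rank, and
# target_modules are illustrative assumptions; adjust for your own run.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 base weights keep a 13B model inside 24 GB
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Ampere supports BF16 natively
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # example model choice
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```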
Power and Cooling: 350W TDP — The RTX 3090 is power-hungry. Plan on a 750W PSU and two 8-pin PCIe power connectors as the practical minimum. The Founders Edition is a 3-slot card, 313 x 138 mm, weighing 2.1 kg. For multi-GPU setups, ensure adequate airflow — the card can hit 93°C under sustained load. Still, for a card that can run a 13B model at 50+ tokens/sec, the power draw is justifiable vs. a $15,000 A100.
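In multi-3090 rigs a common mitigation is to cap each card’s power limit (for inference, throughput usually drops far less than the watts saved) and keep an eye on temperatures. The sketch below uses the NVML Python bindings for monitoring; the 280 W cap mentioned in the comment is an assumed example value.

```python
# Quick power/thermal check via NVML (pip install nvidia-ml-py).
# Power caps are usually set separately, e.g. `sudo nvidia-smi -pl 280` (example value).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports milliwatts
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"GPU {i} ({name}): {power_w:.0f} W, {temp_c} C")
pynvml.nvmlShutdown()
```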
NVLink (3rd Gen, 2-way) — A rare feature at this price point. Two RTX 3090s can be linked with NVLink (not SLI) for 112.5 GB/s bidirectional bandwidth, enabling efficient model parallelism. In practice, 2x 3090 with NVLink yields 40–60% throughput improvement over 2x PCIe-only for LLM inference, particularly for models that don’t fit in a single card.
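With a framework like vLLM, spanning a model across two 3090s is mostly a one-flag change. A minimal sketch is below; the checkpoint name and quantization choice are illustrative assumptions, not the only workable combination.

```python
# Minimal vLLM sketch: shard a 4-bit 70B checkpoint across two RTX 3090s.
# The model name is an example; use any quantized checkpoint that fits in 2 x 24 GB.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # example 4-bit (AWQ) checkpoint
    quantization="awq",
    tensor_parallel_size=2,                 # split the model across both 3090s
    dtype="half",                           # Ampere has no native FP8; FP16 activations
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain in one paragraph why NVLink helps tensor parallelism."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```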
This is where the RTX 3090 earns its keep. Here’s a realistic breakdown by model size and quantization.
Models that fit comfortably in 24 GB (single card):
- 7B–8B models at FP16 or Q8 (Llama 3.1 8B, Mistral 7B), with plenty of headroom for context
- 13B models at 8-bit or 4-bit (e.g., Llama 2 13B)
- 30B–34B models at 4-bit (Qwen 2.5 32B, DeepSeek-R1 33B, Yi-34B)
Models requiring 48 GB (two RTX 3090s with NVLink):
- 70B models at 4-bit with tensor parallelism (Llama 3.1 70B, Llama 2 70B)
- 30B–34B models at 8-bit, or at 4-bit with very long contexts
Models that are challenging or impossible:
- 70B models at FP16 or 8-bit (roughly 140 GB and 70 GB of weights respectively, beyond even a 48 GB NVLink pair)
- Anything much larger than 70B, which requires aggressive quantization plus CPU/disk offloading at a steep speed penalty
Real-world tokens-per-second benchmarks (from community reports and web research) are collected in the table at the end of this page.
These numbers depend on the backend (llama.cpp, vLLM, ExLlamaV2), context length, batch size, and GPU clock/power limits. The RTX 3090’s maximum supported model size is often listed as 13B at FP16, but with quantization you can push well beyond that.
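To reproduce these numbers on your own card, the simplest route is to time generation with a backend such as llama-cpp-python with full GPU offload. A minimal sketch follows; the GGUF path is a placeholder and the prompt is arbitrary.

```python
# Rough single-stream tokens/sec measurement with llama-cpp-python.
# The model path is a placeholder; any GGUF that fits in 24 GB will do.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,                          # offload every layer to the 3090
    n_ctx=4096,
)

prompt = "Write a short paragraph about GPU memory bandwidth."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} tok/s")
```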
The RTX 3090 serves a broad range of AI practitioners, from hobbyists running local chatbots to ML researchers and engineers fine-tuning mid-size models.
Training vs. inference: The RTX 3090 is a strong inference card but only a decent training card. Its FP16 tensor throughput (142 TFLOPS) is roughly half that of the RTX 4090 (330 TFLOPS), and workstation parts like the RTX 6000 Ada pair much higher compute with 48 GB of VRAM. If your primary workload is training 7B+ models from scratch, look at the RTX 4090 or datacenter cards. For inference and fine-tuning, the 3090 remains the better value per GB of VRAM.
The RTX 3090’s closest competitors are the RTX 4090 and used RTX A6000 (48 GB). Here’s a factual breakdown:
| Feature | RTX 3090 (24 GB) | RTX 4090 (24 GB) | RTX A6000 (48 GB) |
|---------|------------------|------------------|------------------|
| VRAM | 24 GB | 24 GB | 48 GB |
| Memory Bandwidth | 936 GB/s | 1,008 GB/s | 768 GB/s |
| FP16 Tensor | 142 TFLOPS | 330 TFLOPS | 155 TFLOPS (approx) |
| TDP | 350 W | 450 W | 300 W |
| NVLink | Yes (2-way) | No | Yes (2-way) |
| Used Price (2026) | $800–$1,200 | $1,600–$2,000 | $4,000–$6,000 |
RTX 4090 – Faster compute, slightly higher memory bandwidth, but same VRAM. For inference on models that fit in 24 GB (7B–13B), the RTX 4090 is 30–50% faster. However, it lacks NVLink, so you cannot combine two cards into a 48 GB memory pool. The RTX 3090 is the better choice if you need to scale beyond 24 GB or if you are on a tighter budget.
RTX A6000 – 48 GB of VRAM is a clear advantage for 70B models in 4-bit on a single card. But the A6000’s memory bandwidth (768 GB/s) is lower, leading to 10–20% slower token generation than the RTX 3090. It is also significantly more expensive, even on the used market. Both cards support 2-way NVLink, but a pair of RTX 3090s reaches the same 48 GB pool for a fraction of the A6000’s price.
AMD Radeon RX 7900 XTX – For AI inference, NVIDIA’s CUDA ecosystem and Tensor Core support make the RTX 3090 the clear choice. AMD’s ROCm software stack is improving, but for LLM inference PyTorch, vLLM, and Ollama are all better supported on NVIDIA hardware. The 7900 XTX matches the RTX 3090’s 24 GB of VRAM but generally trails it on transformer workloads.
When to pick the RTX 3090:
- You want the most VRAM per dollar for local LLM inference (13B-class models comfortably, 30B–34B at 4-bit)
- You plan to scale to 48 GB later by adding a second card over NVLink
- Your workload is inference or LoRA/QLoRA fine-tuning rather than training from scratch
- Your budget is in the used-card range ($800–$1,200)
When to skip the RTX 3090:
- Training is your primary workload and the RTX 4090’s roughly 2x tensor throughput is worth the premium
- You need 48 GB on a single card (RTX A6000) or datacenter-grade hardware (A100/H100)
- Power and cooling are tight: 350 W per card adds up quickly in multi-GPU rigs
For developers and teams evaluating hardware for local AI agents and LLM inference in 2026, the NVIDIA GeForce RTX 3090 remains one of the most practical, well-supported, and cost-effective GPUs available. Its 24 GB VRAM, NVLink support, and strong community software support make it a reliable workhorse for running a wide range of open-source models locally.
RTX 3090 model benchmark table (grade, generation speed, and VRAM required per model):

| Model | Developer | Parameters | Grade | Speed | VRAM Required |
|-------|-----------|------------|-------|-------|---------------|
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | SS | 66.3 tok/s | 11.4 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | SS | 68.4 tok/s | 11.0 GB |
| | | 8B | SS | 56.5 tok/s | 13.3 GB |
| Qwen3.6 35B-A3B | Alibaba Cloud | 35B (3B active) | SS | 88.3 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | SS | 88.3 tok/s | 8.5 GB |
| Llama 2 13B Chat | Meta | 13B | SS | 89.0 tok/s | 8.5 GB |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | SS | 139.9 tok/s | 5.4 GB |
| | | 9B | SS | 125.3 tok/s | 6.0 GB |
| | | 8B | SS | 133.0 tok/s | 5.7 GB |
| Gemma 4 E4B IT | Google | 4B | AA | 108.9 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | AA | 108.9 tok/s | 6.9 GB |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 117.8 tok/s | 6.4 GB |
| Llama 2 7B Chat | Meta | 7B | AA | 157.3 tok/s | 4.8 GB |
| Gemma 4 E2B IT | Google | 2B | AA | 203.2 tok/s | 3.7 GB |
| minimax-m2.5 | MiniMax | 230B (10B active) | AA | 33.2 tok/s | 22.7 GB |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | BB | 30.9 tok/s | 24.4 GB |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | CC | 30.6 tok/s | 24.6 GB |
| Mistral Small 3 24B | Mistral AI | 24B | FF | 19.3 tok/s | 39.0 GB |
| Qwen3.6-27B | Alibaba Cloud | 27B | FF | 10.4 tok/s | 72.8 GB |
| Gemma 3 27B IT | Google | 27B | FF | 17.2 tok/s | 43.8 GB |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | FF | 10.4 tok/s | 72.8 GB |
| Gemma 4 31B IT | Google | 31B | FF | 9.2 tok/s | 82.0 GB |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | FF | 14.0 tok/s | 53.9 GB |
| LLaMA 65B | Meta | 65B | FF | 19.2 tok/s | 39.3 GB |
| Llama 2 70B Chat | Meta | 70B | FF | 17.4 tok/s | 43.4 GB |

