
NVIDIA's flagship Ampere consumer GPU (GA102, Samsung 8nm, 28.3B transistors). Features 10,496 CUDA cores across 82 SMs, 24 GB GDDR6X at 936 GB/s on a 384-bit bus, 328 3rd-gen Tensor Cores, and 82 2nd-gen RT Cores. Supports 2-way NVLink for 48 GB combined VRAM. Positioned as the spiritual successor to the Titan series, targeting 8K HDR gaming and professional AI/creative workloads.
The NVIDIA GeForce RTX 3090 is a prosumer GPU that, since its launch, has become a de facto standard for running large language models (LLMs) locally. Built on the Ampere architecture (GA102 die) and fabricated on Samsung’s 8nm process, it packs 28.3 billion transistors into a 628 mm² die. Its 24 GB of GDDR6X VRAM gives it the memory capacity to load models that most consumer GPUs cannot touch, which is why it remains a top choice for AI engineers, ML researchers, and hobbyists running local inference in 2026.
Priced at an MSRP of $1,499, the RTX 3090 occupies a unique niche: it delivers datacenter-like VRAM capacity at a fraction of the cost of an A100 or H100. It’s the spiritual successor to NVIDIA’s Titan line, and while it was marketed for 8K gaming and creative workflows, its real value for the AI community lies in its 24 GB of VRAM, 936 GB/s of memory bandwidth, and 3rd-gen Tensor Cores. For practitioners who need to run 13B-parameter models or quantized 30B–70B models on a single machine, the RTX 3090 remains a compelling, readily available option.
This page covers the RTX 3090’s AI-specific specs, real-world model compatibility, expected tokens-per-second, and how it stacks up against alternatives like the RTX 4090 and used datacenter cards.
For AI inference, especially transformer-based models, three metrics matter most: VRAM capacity, memory bandwidth, and compute throughput.
VRAM: 24 GB GDDR6X — This is the RTX 3090’s killer feature. 24 GB comfortably holds 13B-class models with little or no quantization (a 13B model needs roughly 26 GB of weights at FP16, so 8-bit is the practical ceiling; Llama 3.1 8B at Q8 fits with headroom, and Mistral 7B Q4_K_M fits easily). With 4-bit quantization, you can squeeze in 30B–34B models like Qwen 2.5 32B, DeepSeek-R1 33B, or Yi-34B. Two RTX 3090s connected via NVLink give you 48 GB combined, enabling models like Llama 3.1 70B in 4-bit with tensor parallelism.
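As a rule of thumb, weight memory is roughly parameters times bits-per-weight divided by 8, plus overhead for the KV cache, activations, and the CUDA context. The sketch below is a minimal illustration of that arithmetic; the 20% overhead factor and the example model list are assumptions for illustration, not measured values.

```python
# Rough VRAM estimate for an LLM at a given weight precision.
# The 1.2 overhead factor (KV cache, activations, CUDA context) is an
# assumed rule of thumb, not a measured value.

def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # billions of params * bits / 8 ≈ GB
    return weight_gb * overhead

RTX_3090_VRAM_GB = 24

# Illustrative entries matching the examples above.
candidates = [
    ("Llama 3.1 8B @ Q8", 8, 8),
    ("Yi-34B @ 4-bit", 34, 4),
    ("Llama 3.1 70B @ 4-bit", 70, 4),
]

for name, params, bits in candidates:
    need = estimate_vram_gb(params, bits)
    verdict = "fits on one 3090" if need <= RTX_3090_VRAM_GB else "needs offload or a second GPU"
    print(f"{name}: ~{need:.1f} GB -> {verdict}")
```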
Memory Bandwidth: 936 GB/s — On a 384-bit bus with 19.5 Gbps GDDR6X, the RTX 3090 delivers high bandwidth that directly impacts token generation speed. For a 13B model in 4-bit, expect 35–70 tokens/sec on a single card — enough for interactive chatbot use. Bandwidth is the primary bottleneck for autoregressive generation; the RTX 3090’s 936 GB/s trails the RTX 4090’s 1,008 GB/s by only a small margin and comfortably exceeds mid-range workstation cards like the RTX A4000 (448 GB/s) and A5000 (768 GB/s).
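Because each generated token has to stream essentially the full set of weights from VRAM, a quick upper bound on single-stream decode speed is bandwidth divided by the model’s weight footprint. A minimal back-of-the-envelope sketch follows; the 40% efficiency factor is an assumed figure, since real backends reach only a fraction of the theoretical ceiling.

```python
# Upper bound on autoregressive decode speed: each token reads roughly
# the entire weight footprint from VRAM once.
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: int) -> float:
    model_gb = params_billion * bits_per_weight / 8  # weight footprint in GB
    return bandwidth_gb_s / model_gb

RTX_3090_BW_GB_S = 936

ceiling = decode_ceiling_tok_s(RTX_3090_BW_GB_S, 13, 4)  # 13B model, 4-bit weights
print(f"theoretical ceiling: {ceiling:.0f} tok/s")                 # ~144 tok/s
print(f"at an assumed 40% efficiency: {ceiling * 0.4:.0f} tok/s")  # ~58 tok/s
```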
Compute Performance: 35.58 TFLOPS FP32, 142 TFLOPS FP16 Tensor — The 328 3rd-gen Tensor Cores support mixed-precision operations (TF32, BF16, FP16, INT8). In practice, inference uses FP16 or quantized INT8, so the 142 TFLOPS (dense) for FP16 tensor math is the relevant figure. That’s enough to keep the memory pipeline fed for most LLM inference workloads. For training, the RTX 3090 falls short of the RTX 4090 (330 TFLOPS FP16) or RTX 6000 Ada, but for fine-tuning (QLoRA, LoRA) it works well for models up to 13B.
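For reference, a QLoRA-style setup with Hugging Face transformers, peft, and bitsandbytes looks roughly like the sketch below. The model name, LoRA rank, and target modules are illustrative assumptions rather than tuned recommendations.

```python
# Minimal QLoRA-style fine-tuning setup (sketch). The model, rank, and
# target_modules are illustrative assumptions; adjust for your own run.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 base weights keep a 13B model inside 24 GB
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Ampere supports BF16 natively
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # example model choice
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```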
Power and Cooling: 350W TDP — The RTX 3090 is power-hungry. Plan on a 750W PSU and two 8-pin PCIe power connectors as the practical minimum. The Founders Edition is a 3-slot card, 313 x 138 mm, weighing 2.1 kg. For multi-GPU setups, ensure adequate airflow — the card can hit 93°C under sustained load. Still, for a card that can run a 13B model at 50+ tokens/sec, the power draw is justifiable vs. a $15,000 A100.
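In multi-3090 rigs a common mitigation is to cap each card’s power limit (for inference, throughput usually drops far less than the watts saved) and keep an eye on temperatures. The sketch below uses the NVML Python bindings for monitoring; the 280 W cap mentioned in the comment is an assumed example value.

```python
# Quick power/thermal check via NVML (pip install nvidia-ml-py).
# Power caps are usually set separately, e.g. `sudo nvidia-smi -pl 280` (example value).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports milliwatts
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"GPU {i} ({name}): {power_w:.0f} W, {temp_c} C")
pynvml.nvmlShutdown()
```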
NVLink (3rd Gen, 2-way) — A rare feature at this price point. Two RTX 3090s can be linked with NVLink (not SLI) for 112.5 GB/s bidirectional bandwidth, enabling efficient model parallelism. In practice, 2x 3090 with NVLink yields 40–60% throughput improvement over 2x PCIe-only for LLM inference, particularly for models that don’t fit in a single card.
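With a framework like vLLM, spanning a model across two 3090s is mostly a one-flag change. A minimal sketch is below; the checkpoint name and quantization choice are illustrative assumptions, not the only workable combination.

```python
# Minimal vLLM sketch: shard a 4-bit 70B checkpoint across two RTX 3090s.
# The model name is an example; use any quantized checkpoint that fits in 2 x 24 GB.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # example 4-bit (AWQ) checkpoint
    quantization="awq",
    tensor_parallel_size=2,                 # split the model across both 3090s
    dtype="half",                           # Ampere has no native FP8; FP16 activations
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain in one paragraph why NVLink helps tensor parallelism."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```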
This is where the RTX 3090 earns its keep. Here’s a realistic breakdown by model size and quantization.
Models that fit comfortably in 24 GB (single card):
- 7B–8B models at FP16 or Q8 (Llama 3.1 8B, Mistral 7B), with plenty of headroom for context
- 13B models at 8-bit or 4-bit (e.g., Llama 2 13B)
- 30B–34B models at 4-bit (Qwen 2.5 32B, DeepSeek-R1 33B, Yi-34B)
Models requiring 48 GB (two RTX 3090s with NVLink):
- 70B models at 4-bit with tensor parallelism (Llama 3.1 70B, Llama 2 70B)
- 30B–34B models at 8-bit, or at 4-bit with very long contexts
Models that are challenging or impossible:
- 70B models at FP16 or 8-bit (roughly 140 GB and 70 GB of weights respectively, beyond even a 48 GB NVLink pair)
- Anything much larger than 70B, which requires aggressive quantization plus CPU/disk offloading at a steep speed penalty
Real-world tokens-per-second benchmarks (from community reports and web research) are collected in the table at the end of this page.
These numbers depend on the backend (llama.cpp, vLLM, ExLlamaV2), context length, batch size, and GPU clock/power limits. The RTX 3090’s maximum supported model size is often listed as 13B at FP16, but with quantization you can push well beyond that.
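To reproduce these numbers on your own card, the simplest route is to time generation with a backend such as llama-cpp-python with full GPU offload. A minimal sketch follows; the GGUF path is a placeholder and the prompt is arbitrary.

```python
# Rough single-stream tokens/sec measurement with llama-cpp-python.
# The model path is a placeholder; any GGUF that fits in 24 GB will do.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,                          # offload every layer to the 3090
    n_ctx=4096,
)

prompt = "Write a short paragraph about GPU memory bandwidth."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} tok/s")
```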
The RTX 3090 serves a broad range of AI practitioners, from hobbyists running local chatbots to ML researchers and engineers fine-tuning mid-size models.
Training vs. inference: The RTX 3090 is a strong inference card but only a decent training card. Its FP16 tensor throughput (142 TFLOPS) is roughly half that of the RTX 4090 (330 TFLOPS), and workstation parts like the RTX 6000 Ada pair much higher compute with 48 GB of VRAM. If your primary workload is training 7B+ models from scratch, look at the RTX 4090 or datacenter cards. For inference and fine-tuning, the 3090 remains the better value per GB of VRAM.
The RTX 3090’s closest competitors are the RTX 4090 and used RTX A6000 (48 GB). Here’s a factual breakdown:
| Feature | RTX 3090 (24 GB) | RTX 4090 (24 GB) | RTX A6000 (48 GB) |
|---------|------------------|------------------|------------------|
| VRAM | 24 GB | 24 GB | 48 GB |
| Memory Bandwidth | 936 GB/s | 1,008 GB/s | 768 GB/s |
| FP16 Tensor | 142 TFLOPS | 330 TFLOPS | 155 TFLOPS (approx) |
| TDP | 350 W | 450 W | 300 W |
| NVLink | Yes (2-way) | No | Yes (2-way) |
| Used Price (2026) | $800–$1,200 | $1,600–$2,000 | $4,000–$6,000 |
RTX 4090 – Faster compute, slightly higher memory bandwidth, but same VRAM. For inference on models that fit in 24 GB (7B–13B), the RTX 4090 is 30–50% faster. However, it lacks NVLink, so you cannot combine two cards into a 48 GB memory pool. The RTX 3090 is the better choice if you need to scale beyond 24 GB or if you are on a tighter budget.
RTX A6000 – 48 GB of VRAM is a clear advantage for 70B models in 4-bit on a single card. But the A6000’s memory bandwidth (768 GB/s) is lower, leading to 10–20% slower token generation than the RTX 3090. It is also significantly more expensive, even on the used market. Both cards support 2-way NVLink, but a pair of RTX 3090s reaches the same 48 GB pool for a fraction of the A6000’s price.
AMD Radeon RX 7900 XTX – For AI inference, NVIDIA’s CUDA ecosystem and Tensor Core support make the RTX 3090 the clear choice. AMD’s ROCm software stack is improving, but for LLM inference PyTorch, vLLM, and Ollama are all better supported on NVIDIA hardware. The 7900 XTX matches the RTX 3090’s 24 GB of VRAM but generally trails it on transformer workloads.
When to pick the RTX 3090:
- You want the most VRAM per dollar for local LLM inference (13B-class models comfortably, 30B–34B at 4-bit)
- You plan to scale to 48 GB later by adding a second card over NVLink
- Your workload is inference or LoRA/QLoRA fine-tuning rather than training from scratch
- Your budget is in the used-card range ($800–$1,200)
When to skip the RTX 3090:
- Training is your primary workload and the RTX 4090’s roughly 2x tensor throughput is worth the premium
- You need 48 GB on a single card (RTX A6000) or datacenter-grade hardware (A100/H100)
- Power and cooling are tight: 350 W per card adds up quickly in multi-GPU rigs
For developers and teams evaluating hardware for local AI agents and LLM inference in 2026, the NVIDIA GeForce RTX 3090 remains one of the most practical, well-supported, and cost-effective GPUs available. Its 24 GB VRAM, NVLink support, and strong community software support make it a reliable workhorse for running a wide range of open-source models locally.
RTX 3090 model benchmark table (grade, generation speed, and VRAM required per model):

| Model | Developer | Parameters | Grade | Speed | VRAM Required |
|-------|-----------|------------|-------|-------|---------------|
| Mixtral 8x7B Instruct | Mistral AI | 46.7B (12.9B active) | SS | 66.3 tok/s | 11.4 GB |
| Gemma 4 26B-A4B IT | Google | 26B (4B active) | SS | 68.4 tok/s | 11.0 GB |
| | | 8B | SS | 56.5 tok/s | 13.3 GB |
| Qwen3.6 35B-A3B | Alibaba Cloud | 35B (3B active) | SS | 88.3 tok/s | 8.5 GB |
| Qwen3.5-35B-A3B | Alibaba Cloud (Qwen) | 35B (3B active) | SS | 88.3 tok/s | 8.5 GB |
| Llama 2 13B Chat | Meta | 13B | SS | 89.0 tok/s | 8.5 GB |
| Qwen3-30B-A3B | Alibaba Cloud (Qwen) | 30B (3B active) | SS | 139.9 tok/s | 5.4 GB |
| | | 9B | SS | 125.3 tok/s | 6.0 GB |
| | | 8B | SS | 133.0 tok/s | 5.7 GB |
| Gemma 4 E4B IT | Google | 4B | AA | 108.9 tok/s | 6.9 GB |
| Gemma 3 4B IT | Google | 4B | AA | 108.9 tok/s | 6.9 GB |
| Mistral 7B Instruct | Mistral AI | 7B | AA | 117.8 tok/s | 6.4 GB |
| Llama 2 7B Chat | Meta | 7B | AA | 157.3 tok/s | 4.8 GB |
| Gemma 4 E2B IT | Google | 2B | AA | 203.2 tok/s | 3.7 GB |
| minimax-m2.5 | MiniMax | 230B (10B active) | AA | 33.2 tok/s | 22.7 GB |
| Falcon 40B Instruct | Technology Innovation Institute | 40B | BB | 30.9 tok/s | 24.4 GB |
| Qwen3.5-9B | Alibaba Cloud (Qwen) | 9B | CC | 30.6 tok/s | 24.6 GB |
| Mistral Small 3 24B | Mistral AI | 24B | FF | 19.3 tok/s | 39.0 GB |
| Qwen3.6-27B | Alibaba Cloud | 27B | FF | 10.4 tok/s | 72.8 GB |
| Gemma 3 27B IT | Google | 27B | FF | 17.2 tok/s | 43.8 GB |
| Qwen3.5-27B | Alibaba Cloud (Qwen) | 27B | FF | 10.4 tok/s | 72.8 GB |
| Gemma 4 31B IT | Google | 31B | FF | 9.2 tok/s | 82.0 GB |
| Qwen3-32B | Alibaba Cloud (Qwen) | 32.8B | FF | 14.0 tok/s | 53.9 GB |
| LLaMA 65B | Meta | 65B | FF | 19.2 tok/s | 39.3 GB |
| Llama 2 70B Chat | Meta | 70B | FF | 17.4 tok/s | 43.4 GB |

