
NVIDIA's Llama-3.1-8B-based bidirectional multilingual embedder; ranked #1 on the MMTEB Borda leaderboard at its October 2025 release.
A solid 7.5B-parameter dense embedding model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA’s llama-embed-nemotron-8b is a 7.5B parameter dense text embedding model that holds the top position on the Multilingual MTEB (MMTEB) leaderboard as of October 2025. Built on the Llama-3.1-8B backbone with a bidirectional training objective, it is designed for retrieval, reranking, semantic similarity, and classification across 100+ languages. This is not a general-purpose chat or generation model — it is a specialist embedder optimized for turning text into dense vector representations that preserve cross-lingual meaning.
The model was trained on a novel data mix of 16.1 million query-document pairs, combining 7.7M public samples with 8.4M synthetically generated examples from open-weight LLMs. NVIDIA has released the full training recipe, dataset, and code via NeMo AutoModel, making this one of the most open embedding models at this size. The license, however, is strictly non-commercial (NVIDIA Customized NSCLv1), so this is targeted at researchers, hobbyists, and internal prototyping — not production deployments in commercial products.
At 7.5B parameters, it sits in a competitive slot: large enough to capture nuanced multilingual semantics, yet small enough to run on consumer-grade hardware with quantization. For developers building local RAG pipelines or multilingual search systems, this is currently the highest-performing open-weights embedder in its class.
llama-embed-nemotron-8b is a dense transformer with 7.5B parameters, fine-tuned from meta-llama/Llama-3.1-8B. Unlike the original Llama, which uses a causal attention mask, NVIDIA applied a bidirectional attention mechanism during embedding training — meaning each token can attend to all other tokens in the input. This is standard for embedding models and critical for tasks like sentence similarity and classification where full context matters.
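The difference between the two masking schemes can be illustrated with a small sketch. This is a toy illustration in plain Python, not NVIDIA's implementation; the function names are hypothetical and the 4-token sequence is arbitrary.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j.

    Under a causal mask a token sees only itself and earlier tokens (j <= i).
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every other token in the input."""
    return [[True] * n for _ in range(n)]


def visible_counts(mask):
    """Number of positions each token can attend to."""
    return [sum(row) for row in mask]


print(visible_counts(causal_mask(4)))         # [1, 2, 3, 4]
print(visible_counts(bidirectional_mask(4)))  # [4, 4, 4, 4]
```

Under the causal mask the first token sees only itself, so its hidden state carries no information about the rest of the sentence. With the bidirectional mask every position has full context, which matters when all positions are later pooled into one vector.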
The model uses a mean pooling strategy over the last hidden states to produce fixed-size embeddings (4096 dimensions). It is instruction-aware: you can prepend a task-specific instruction (e.g., “Represent this document for retrieval:”) to steer the embedding space toward a particular use case. This flexibility is a direct advantage over older embedders that treat all text uniformly.
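As a rough sketch of what mean pooling does, the snippet below averages per-token vectors while skipping padding positions. It uses tiny 3-dimensional toy vectors instead of the real 4096-dimensional hidden states, and the helper names are hypothetical. The instruction string is the example quoted above; the exact prompt format the model expects is an assumption here and should be checked against the model card.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def with_instruction(instruction, text):
    """Prepend a task instruction to the text before it is tokenized."""
    return f"{instruction} {text}"


# Three token vectors; the last one is padding and must not affect the result.
states = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0, 4.0]
print(with_instruction("Represent this document for retrieval:", "Solar panels convert light."))
```

Whatever the input length, the pooled output has one value per hidden dimension, which is why the embedding size is fixed.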
Context length is not formally specified, but as a derivative of Llama-3.1-8B with 128K context capability, the practical limit depends on the inference framework and GPU memory. For most embedding workloads (single sentences or short paragraphs), context is not a bottleneck.
Because this is a dense model, every parameter is active during inference — unlike a mixture-of-experts (MoE) model where only a fraction of parameters fire per token. This means VRAM consumption scales linearly with parameter count. At 16-bit precision, the model occupies roughly 15 GB of GPU memory. At 4-bit quantization (e.g., Q4_K_M), that drops to ~4.5 GB, making it viable on mid-range consumer GPUs.
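The memory figures above follow from simple arithmetic, sketched below. The 4.8 bits-per-weight value for Q4_K_M is an approximation assumed for illustration (the format mixes block types, so the effective rate varies by model), and the estimate covers weights only, not activations or framework overhead.

```python
def weight_memory_gb(params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) for a dense model."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 7.5e9            # parameter count stated for the model
Q4_K_M_BITS = 4.8         # assumed effective bits per weight for Q4_K_M

fp16_gb = weight_memory_gb(PARAMS, 16)
q4_gb = weight_memory_gb(PARAMS, Q4_K_M_BITS)

print(f"FP16:   {fp16_gb:.1f} GB")   # FP16:   15.0 GB
print(f"Q4_K_M: {q4_gb:.1f} GB")     # Q4_K_M: 4.5 GB
```

Because every parameter of a dense model is loaded, halving the bits per weight halves the weight footprint; long inputs and large batches add activation memory on top of these numbers.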
llama-embed-nemotron-8b is a text embedding model, not a text generation model. Its output is a vector — use it to compute semantic similarity, retrieve relevant documents, rank candidates, or train classifiers. The model is specifically strong in the following areas:
Concrete use cases: a developer building a local RAG system for a multilingual knowledge base; a researcher evaluating cross-lingual information retrieval for Swahili or Tamil; a hobbyist running a personal semantic search engine over a collection of documents in multiple languages.
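Once texts are embedded, retrieval reduces to comparing vectors. The sketch below ranks documents by cosine similarity to a query; the 3-dimensional vectors and document ids are made up for illustration and stand in for real 4096-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy embeddings: in practice these come from the embedding model.
query = [1.0, 0.0, 0.0]
docs = {
    "doc_sw": [0.9, 0.1, 0.0],   # close to the query
    "doc_ta": [0.5, 0.5, 0.0],   # partially related
    "doc_en": [0.0, 1.0, 0.0],   # unrelated
}
print(rank(query, docs))  # ['doc_sw', 'doc_ta', 'doc_en']
```

For a large collection the same comparison would normally be delegated to a vector index rather than a Python loop, but the scoring logic is the same.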
This is where the model shines for practitioners who want to run AI models on their own hardware without cloud dependencies.
llama.cpp or Ollama with a Q4_K_M quant. Ollama provides the quickest path: ollama pull llama-embed-nemotron-8b (once the model is added to the registry) and then embed text via the API.

On a desktop with RTX 4090 (FP16): ~50–80 tokens/second for single-batch embeddings. With batching (e.g., 32 texts at once), throughput scales to several hundred texts per second.

On a laptop with RTX 4060 (Q4_K_M): ~30–50 tokens/second.

Token/s numbers vary by sequence length and framework (llama.cpp, transformers, ONNX). For typical sentences (50–100 tokens), embedding latency is under 20 ms.
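A minimal sketch of calling a local Ollama server for batched embeddings is shown below, using only the standard library. It assumes Ollama is running on its default port and that the model is available under the tag llama-embed-nemotron-8b; as noted above, that tag is hypothetical until the model is in the registry. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # Ollama's default local endpoint
MODEL_TAG = "llama-embed-nemotron-8b"            # hypothetical tag, see note above


def build_request(texts, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a POST request that embeds a batch of texts in one call."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts):
    """Send the batch to the server and return one vector per input text."""
    with urllib.request.urlopen(build_request(texts), timeout=60) as resp:
        return json.load(resp)["embeddings"]
```

With the server running, embed(["¿Dónde está la biblioteca?", "Where is the library?"]) would return two vectors that can be compared directly. Sending many texts in one request is what lets the backend batch them.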
Hardware requirements are well within reach of any recent gaming GPU. If you want to run the model locally for a private RAG pipeline, a machine with 8 GB of VRAM is enough for the 4-bit quantization (~4.5 GB); FP16 at roughly 15 GB needs a larger card such as the RTX 4090.
Two realistic alternatives with the same embedding focus: BGE-M3 (BAAI, ~568M parameters), which is far smaller, and E5-mistral-7b (Microsoft, 7B parameters), which is close in size.
BGE-M3 scores below llama-embed-nemotron-8b, especially on cross-lingual and low-resource languages. BGE-M3 is also open-weight with a permissive license (MIT), making it a better fit for commercial use. If you need commercial deployment, choose BGE-M3 despite the lower accuracy.

Against E5-mistral-7b, llama-embed-nemotron-8b is the clear winner.

The main tradeoff with llama-embed-nemotron-8b is its non-commercial license. Against other open-weights multilingual embedders (e.g., intfloat/multilingual-e5-large), it leads on the MMTEB benchmark by a wide margin. For researchers and enthusiasts who are not constrained by licensing, this is currently the best local multilingual embedding model available.
