
Flagship 8B multilingual embedding model from Qwen, ranked #1 on MTEB Multilingual at release.
A strong 7.6B-parameter dense embedding model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.
ollama run qwen3-embedding:8bAccess model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Qwen3-Embedding-8B is a dense, 7.6B-parameter text embedding model developed by Alibaba’s Qwen team. It ranks #1 on the MTEB Multilingual leaderboard (score 70.58 as of June 2025), making it the top-performing publicly available embedding model for multilingual retrieval, classification, clustering, and bitext mining. Released under the Apache 2.0 license, it is designed for practitioners who need a single model that handles over 100 languages—including programming languages—without sacrificing precision.
The model is the largest in the Qwen3 Embedding series, which also includes 0.6B and 4B variants. Unlike Mixture-of-Experts (MoE) architectures, this is a dense model—all 7.6B parameters are active during inference. That means predictable VRAM usage and consistent throughput, but also a higher memory floor than an MoE model with equivalent total parameters. It competes directly with other 7B-class embedding models such as gte-Qwen2-7B-instruct (from the same lineage) and intfloat-multilingual-e5-large (around 560M params, but much smaller), but outperforms both on multilingual benchmarks.
Qwen3-Embedding-8B is built on the dense Qwen3-8B-Base language model. It retains the same transformer backbone: 32 layers, hidden size 4096, 32 attention heads, and a vocabulary of 152,064 tokens. The embedding model converts text into fixed-length vectors (default 4096 dimensions, but configurable down to 256 via a flexible dimension head). The model supports user-defined instructions—you can prefix queries with task-specific prompts (e.g., “Represent this document for retrieval:”) to boost performance on particular tasks or languages.
Because it’s dense, every forward pass uses the full parameter set. In FP16, the model requires approximately 15.5 GB of VRAM (7.6B params × 2 bytes). The context length is not officially specified, but the underlying Qwen3-8B-Base supports up to 32K tokens. The embedding model truncates inputs to fit within its maximum length—practically, you can embed documents up to several thousand tokens without issues, but very long documents will be truncated. For longer inputs, consider chunking or using the retriever-only mode.
Qwen3-Embedding-8B excels in multilingual and cross-lingual retrieval. It supports over 100 natural languages and dozens of programming languages. On the MTEB Multilingual leaderboard (which includes tasks like XNLI, AmazonReviews, BUCC, etc.), it scored 70.58, beating all other embedding models at any size at the time of release. Here are concrete use cases:
The model also supports reranking when paired with the Qwen3-Reranker series (separate model), but as a standalone embedder it delivers competitive results on its own, especially for asymmetric retrieval (query vs. document).
| Setup | VRAM | Quantization | Notes |
|---|---|---|---|
| RTX 4090 (24 GB) | Native FP16 | FP16 (full precision) | Best quality, fastest inference |
| RTX 3090 (24 GB) | Q8_0 or Q4_K_M | 8-bit or 4-bit | Q8_0 possible with careful memory management |
| RTX 4070 Ti (12 GB) | Q4_K_M | 4-bit | Works well; batch size 1–4 |
| M4 Max (64 GB) | Q4_K_M or Q8_0 | 4-bit/8-bit | Full precision possible if using unified memory |
| Any 8 GB GPU | Q4_K_M? | 4-bit | Tight—batch size 1, may run into memory issues. Consider the 4B variant instead. |
Measured on an RTX 4090 with Ollama and Q4_K_M quantization:
On an M4 Max (48 GB unified memory) with Q4_K_M, expect ~150–250 tokens/sec per request. For large-scale indexing, batch processing is crucial—use a GPU with at least 24 GB to maximize throughput.
Ollama provides the fastest path to running this model locally:
1ollama pull qwen3-embedding:8b
Then embed text:
1import ollama2response = ollama.embed(3 model='qwen3-embedding:8b',4 input='Your text here'5)6print(response.embeddings) # List of vectors
For API usage or custom inference, the model is available on HuggingFace (Qwen/Qwen3-Embedding-8B) and supports the sentence-transformers inference API.
Against intfloat-multilingual-e5-large (560M params, dense): Qwen3-Embedding-8B is significantly larger and outperforms e5 on almost every multilingual MTEB task—especially on cross-lingual retrieval and bitext mining. The tradeoff is VRAM: e5 runs easily on 8 GB GPUs, while Qwen3-Embedding-8B requires at least 12 GB (quantized). If you’re resource-constrained or only need English, e5 is still a solid choice. For multilingual or code-heavy workloads, Qwen3 is clearly superior.
Against gte-Qwen2-7B-instruct (7.6B, also from Alibaba): Qwen3-Embedding-8B is the direct successor. It uses an improved training pipeline with multi-stage training and synthetic data from Qwen3 LLMs. On MTEB Multilingual, Qwen3 beats gte-Qwen2 by ~1.5–2 points. Both models use the same architecture; Qwen3 is simply better optimized. If you already have gte-Qwen2 deployed, upgrading to Qwen3 is straightforward, but the gains may not justify a full migration unless you need maximum accuracy.
Against BGE-M3 (1.2B, dense): BGE-M3 is smaller and supports less languages, but runs on modest hardware. Qwen3 beats it on multilingual benchmarks, but BGE-M3’s lower VRAM requirement makes it a practical choice for edge devices. Choose Qwen3 for server-grade RAG pipelines; choose BGE-M3 for low-resource deployments.
Bottom line: Qwen3-Embedding-8B is the current gold standard for multilingual embedding at the 7B scale. The main constraint is VRAM—quantize to Q4_K_M for consumer GPUs, or use the 4B variant if you need to fit under 8 GB. Its Apache 2.0 license and Ollama support make it immediately usable for local RAG, search, and classification workloads.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Alibaba model we track.

Explore the Family
The full Qwen family leaderboard with sizes, benchmark scores, and a release timeline.