
Mid-size 4B Qwen3 embedding model balancing quality and efficiency.
A strong 4B-parameter dense embedding model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.
ollama run qwen3-embedding:4b
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
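One rough way to fold restarts into a price comparison is to inflate the job's runtime by an overhead fraction. The sketch below is only illustrative: the 25% rework overhead and the $0.30 on-demand rate are made-up placeholders, not figures from the table above.

```python
def effective_cost(rate_per_hour: float, job_hours: float,
                   restart_overhead: float = 0.0) -> float:
    """Total job cost when interruptions add `restart_overhead`
    (a fraction of extra runtime, e.g. 0.25 for 25% rework)."""
    if rate_per_hour < 0 or job_hours < 0 or restart_overhead < 0:
        raise ValueError("inputs must be non-negative")
    return rate_per_hour * job_hours * (1.0 + restart_overhead)


# Spot rate taken from the table; the overhead is an assumption.
spot = effective_cost(0.13, job_hours=10, restart_overhead=0.25)
# Hypothetical on-demand rate, with no interruption overhead.
on_demand = effective_cost(0.30, job_hours=10)

print(f"spot: ${spot:.3f}  on-demand: ${on_demand:.3f}")
```

If the spot figure stays below the on-demand figure even under a pessimistic overhead, the interruptible tier is likely still the cheaper choice.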
Alibaba’s Qwen3-Embedding-4B is a dense text embedding model with 4 billion parameters, built on the Qwen3-4B-Base foundation. It’s the mid-size option in a lineup that runs from 0.6B to 8B, tailored for developers who need high-quality vector representations without the overhead of larger models. This is not a chat or generation model—it’s designed for semantic similarity, retrieval, clustering, classification, and cross-lingual tasks.
The Qwen3 Embedding series achieved the top spot on the MTEB multilingual leaderboard (score 70.58 for the 8B variant as of June 5, 2025). While the 4B version doesn’t match the 8B’s peak scores, it delivers a strong efficiency-to-quality ratio. It’s a practical choice for teams deploying RAG pipelines, semantic search, or code retrieval where VRAM and latency matter more than the absolute SOTA.
Licensed under Apache 2.0, it’s free for commercial use. It competes with smaller models like bge-base-en-v1.5 (around 0.1B) and all-MiniLM-L6-v2 (around 0.02B), and with larger ones like intfloat/e5-mistral-7b-instruct. The 4B parameter count places it in a sweet spot: light enough to run on consumer GPUs, heavy enough to beat most sub-2B embedders on benchmarks.
Qwen3-Embedding-4B is a dense transformer, meaning all 4B parameters are active during inference. Unlike mixture-of-experts (MoE) models, there is no token-level routing; every input uses the full parameter set. This makes inference straightforward and consistent in memory usage—no variable active parameter counts to manage.
The model supports a context length of 32K tokens (based on third-party verification; official specs do not specify). This is sufficient for encoding long documents, reports, or code files in a single pass. However, embedding models typically use a shorter effective max sequence length due to pooling operations. For practical RAG use, chunking documents into segments of 512–8192 tokens is still recommended.
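The chunking recommendation above can be sketched as a sliding window over tokens. This is a minimal illustration, not the model's own preprocessing: whitespace splitting stands in for the real tokenizer, and the 64-token overlap is an assumption added here so that sentences cut at a boundary appear in both neighbouring chunks.

```python
from typing import Iterator, List


def chunk_tokens(tokens: List[str], max_tokens: int = 512,
                 overlap: int = 64) -> Iterator[List[str]]:
    """Yield windows of at most `max_tokens` tokens, each sharing
    `overlap` tokens with the previous window."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + max_tokens]
        # Stop once a window has reached the end of the document.
        if start + max_tokens >= len(tokens):
            break


# Whitespace tokens are a stand-in for the model's tokenizer.
text = "word " * 1200
chunks = list(chunk_tokens(text.split(), max_tokens=512, overlap=64))
print([len(c) for c in chunks])
```

In a real pipeline you would count tokens with the model's tokenizer, since whitespace words and model tokens do not line up one to one.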
Key architectural points:
The model inherits Qwen3’s multilingual backbone, covering over 100 languages including natural languages and programming languages. For code retrieval, it handles Python, JavaScript, C++, and others—useful for embedding code documentation or code snippets.
The Qwen3-Embedding-4B excels in tasks where you need robust, multilingual embeddings from a single model. Its specific strengths:
Because it’s not a generative model, it won’t produce text or reasoning traces. That’s by design—embedding models are specialized for transforming text into fixed-size vectors. If you need both generation and embedding, you’d pair this with a separate LLM (e.g., Qwen3-8B) for inference.
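To make "fixed-size vectors" concrete, the sketch below ranks two documents against a query by cosine similarity, the usual comparison for embedding vectors. The three-dimensional vectors are toy values invented for illustration; real embeddings from this model are far longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for model output (hypothetical values).
query = [0.1, 0.3, 0.5]
docs = {"doc_a": [0.1, 0.29, 0.52], "doc_b": [-0.4, 0.2, 0.0]}

# Rank document names from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
print(ranked)
```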
The 4B size is ideal when you need better quality than 300-400M parameter embedders but can’t fit an 8B+ model on your hardware or latency budget. It’s also a good fit for batch processing where throughput matters more than single-query latency.
At full precision (FP32), the model requires roughly 16 GB of VRAM (4B parameters × 4 bytes). Realistically, you’ll use half-precision or quantized versions:
| Quantization | VRAM (approx) | Notes |
|---|---|---|
| FP16 / BF16 | 8 GB | Half precision, minimal quality loss. |
| Q4_K_M | 4.5–5 GB | Recommended balance for consumer GPUs. |
| Q4_K_S | 4–4.5 GB | Slightly less memory, negligible quality drop. |
| Q3_K_M | 3.5–4 GB | Usable but quality may degrade for cross-lingual tasks. |
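The FP32 and FP16 figures follow from parameters × bytes per weight. The sketch below reproduces that arithmetic as a weights-only lower bound. The "nominal 4-bit" row assumes exactly 4 bits per weight, which is a simplification: K-quant formats average somewhat more than that, and the runtime adds buffers on top, so the table's quantized figures are higher than this bound.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weights-only memory in GB (10^9 bytes).
    Excludes activations, buffers, and other runtime overhead."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 4e9  # 4B parameters, as stated in the text

# Bits per weight; the 4-bit row is a simplifying assumption.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("nominal 4-bit", 4)]:
    print(f"{name:>14}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```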
Minimum VRAM: 4 GB (using Q4_K_S or IQ4_XS) allows running on cards like GTX 1660 Ti, RTX 3050, or integrated NPUs with shared memory—but expect very low throughput.
Recommended VRAM: 8 GB or more. This opens up Q4_K_M quantization, which offers near-lossless quality for most embedding tasks. The official Ollama 4-bit quantized model (Q4_K_M) is 2.5 GB disk size and runs comfortably on:
For batch inference (e.g., encoding 1000 documents), allocate more VRAM. Running a batch size of 8–16 at Q4_K_M may need ~6–8 GB. Memory use also grows roughly linearly with sequence length, up to the 32K context limit, so use shorter chunks (512–2048 tokens) in production.
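A fixed batch size keeps peak memory predictable. The helper below is a generic sketch of splitting a corpus into batches of 16; the `documents` list is placeholder data, and each batch would then be handed to whichever embedding call you use.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Placeholder corpus of 1000 documents.
documents = [f"document {i}" for i in range(1000)]
batches = list(batched(documents, batch_size=16))
print(len(batches), len(batches[-1]))
```

If memory runs out, lowering `batch_size` or shortening the chunks are the two simplest levers.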
Exact throughput depends on hardware and quantization. Estimates on an RTX 4090 (24 GB) at Q4_K_M with typical sequence length (512 tokens):
On an Apple M4 Max (40-core GPU), expect approximately 500–800 tokens/s per sequence (Metal backend).
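A throughput range translates into a wall-clock estimate by simple division. The sketch below applies the 500–800 tokens/s range quoted above to a hypothetical corpus of 1000 chunks of 512 tokens, processed one sequence at a time; batching and parallelism are ignored, so treat the result as a rough upper bound on time.

```python
def embedding_time_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to embed `total_tokens` at a steady throughput."""
    if total_tokens < 0 or tokens_per_second <= 0:
        raise ValueError("need total_tokens >= 0 and tokens_per_second > 0")
    return total_tokens / tokens_per_second


# Hypothetical corpus: 1000 chunks of 512 tokens each.
corpus_tokens = 1000 * 512

slow = embedding_time_seconds(corpus_tokens, 500)  # low end of the range
fast = embedding_time_seconds(corpus_tokens, 800)  # high end of the range
print(f"{fast / 60:.1f} to {slow / 60:.1f} minutes")
```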
To get started quickly, use Ollama:
ollama pull qwen3-embedding:4b
Then embed via API or Python:
import ollama
response = ollama.embed(model='qwen3-embedding:4b', input='Your text here')
print(response['embeddings'])
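For the API route, a minimal standard-library sketch might look like the following. It assumes an Ollama server on the default local address (http://localhost:11434) and uses the `/api/embed` endpoint; `build_embed_request` and `embed` are helper names introduced here for illustration. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request
from typing import List

# Default local Ollama address; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/embed"


def build_embed_request(texts: List[str], model: str = "qwen3-embedding:4b",
                        url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the embed endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: List[str]) -> List[List[float]]:
    """Send the request; needs a running server with the model pulled."""
    with urllib.request.urlopen(build_embed_request(texts)) as resp:
        return json.loads(resp.read())["embeddings"]


request = build_embed_request(["Your text here"])
print(request.full_url, request.get_method())
```

Calling `embed(["Your text here"])` would return one vector per input string once the server is up.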
Ollama’s default tag applies the recommended Q4_K_M quantization. To use a different quantization, pull a tag that names it (e.g., qwen3-embedding:4b-q3_K_M), provided the library publishes that tag; ollama pull itself has no --quantize flag.
For production deployments, consider using the sentence-transformers library (which builds on transformers) to control batch size, device placement, and pooling. The model is available on Hugging Face as Qwen/Qwen3-Embedding-4B.
In summary: Qwen3-Embedding-4B is the best option when you need a balance of quality, multilingual capability, and VRAM efficiency, especially for developers targeting 8–12 GB consumer GPUs. For English-only tasks on tight budgets, smaller models like BGE can suffice. If you want maximum raw English performance and have spare VRAM, e5-mistral-7b or the 8B Qwen3 variant is the better choice.
