Alibaba

Qwen3-Embedding-4B

Mid-size 4B Qwen3 embedding model balancing quality and efficiency.

4B paramsDense

View on Hugging Face

Run with Ollama Source Code Official Page

Our Take

Best for: Open-source embedding text workloads

A strong 4B-parameter dense embedding model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters4B

Active Params3.6B

ArchitectureDense

ProviderAlibaba

Download Size36.2 GB

Community

Monthly Downloads2.4M

Likes291

Last Updated1 years ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run qwen3-embedding:4b

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

MTEB Overall

69.5

Retrieval

69.6

Classification

72.3

Clustering

57.1

STS

80.9

MBA Open Score

71.7AA

Benchmark60%

69.9

Popularity25%

85.6

Efficiency15%

55.6

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	2.7 GB
Acer Veriton GN100 AI MiniAcer	SS	2.7 GB
AMD Instinct MI300XAMD	SS	2.7 GB
AMD Instinct MI325XAMD	SS	2.7 GB
AMD Instinct MI355XAMD	SS	2.7 GB
AMD Radeon RX 7600 8GBAMD	SS	2.7 GB
AMD Radeon RX 7700 XTAMD	SS	2.7 GB
AMD Radeon RX 7800 XTAMD	SS	2.7 GB
AMD Radeon RX 7900 XTAMD	SS	2.7 GB
AMD Radeon RX 7900 XTXAMD	SS	2.7 GB
AMD Radeon RX 9070AMD	SS	2.7 GB
AMD Radeon RX 9070 XTAMD	SS	2.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	2.7 GB
Apple M4Apple	SS	2.7 GB
Apple M4 Max (40-core GPU)Apple	SS	2.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	2.7 GB
Apple M5Apple	SS	2.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	2.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	2.7 GB
Apple Mac Mini (M1, 2020)Apple	SS	2.7 GB
Apple Mac Mini (M2, 2023)Apple	SS	2.7 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	2.7 GB
Apple Mac Mini (M4, 2024)Apple	SS	2.7 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	2.7 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	2.7 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Alibaba’s Qwen3-Embedding-4B is a dense text embedding model with 4 billion parameters, built on the Qwen3-4B-Base foundation. It’s the mid-size option in a lineup that runs from 0.6B to 8B, tailored for developers who need high-quality vector representations without the overhead of larger models. This is not a chat or generation model—it’s designed for semantic similarity, retrieval, clustering, classification, and cross-lingual tasks.

The Qwen3 Embedding series achieved the top spot on the MTEB multilingual leaderboard (score 70.58 for the 8B variant as of June 5, 2025). While the 4B version doesn’t match the 8B’s peak scores, it delivers a strong efficiency-to-quality ratio. It’s a practical choice for teams deploying RAG pipelines, semantic search, or code retrieval where VRAM and latency matter more than the absolute SOTA.

Licensed under Apache 2.0, it’s free for commercial use. It competes with models like bge-base-en-v1.5 (around 0.1B), all-MiniLM-L6-v2 (0.07B), and larger ones like intfloat/e5-mistral-7b-instruct. The 4B parameter count places it in a sweet spot—light enough to run on consumer GPUs, heavy enough to beat most sub-2B embedders on benchmarks.

Architecture & Technical Details

Qwen3-Embedding-4B is a dense transformer, meaning all 4B parameters are active during inference. Unlike mixture-of-experts (MoE) models, there is no token-level routing; every input uses the full parameter set. This makes inference straightforward and consistent in memory usage—no variable active parameter counts to manage.

The model supports a context length of 32K tokens (based on third-party verification; official specs do not specify). This is sufficient for encoding long documents, reports, or code files in a single pass. However, embedding models typically use a shorter effective max sequence length due to pooling operations. For practical RAG use, chunking documents into segments of 512–8192 tokens is still recommended.

Key architectural points:

Dense architecture – no sparsity overhead; predictable VRAM.
Pooling – outputs a single vector per input (likely CLS or mean pooling, typical for embedding models).
Instruction-tuned – supports user-defined prefixes to adapt embeddings for specific domains or languages (e.g., “Represent this query for retrieval:” or “Classify this review:”).
Flexible output dimensions – can be set to any size (e.g., 256, 512, 768, 2048) via dimensionality reduction during inference, trading off speed/storage vs. precision.

The model inherits Qwen3’s multilingual backbone, covering over 100 languages including natural languages and programming languages. For code retrieval, it handles Python, JavaScript, C++, and others—useful for embedding code documentation or code snippets.

Capabilities & Use Cases

The Qwen3-Embedding-4B excels in tasks where you need robust, multilingual embeddings from a single model. Its specific strengths:

Semantic search / RAG: Generate embeddings for a document corpus, then perform cosine similarity queries. Works well with chunked documents up to 32K tokens.
Code retrieval: Embed code functions, comments, or API docs. Top-tier performance on code search benchmarks compared to similarly sized models.
Text classification: Use as a feature extractor for supervised classifiers—freeze the embedding layer, train a lightweight head.
Clustering: Group news articles, customer feedback, or scientific papers by topic using embedding vectors.
Cross-lingual retrieval: Query in English, retrieve documents in Chinese, German, or other languages. The model’s multilingual training makes it effective here.
Bitext mining: Align parallel sentences across languages for machine translation training data.

Because it’s not a generative model, it won’t produce text or reasoning traces. That’s by design—embedding models are specialized for transforming text into fixed-size vectors. If you need both generation and embedding, you’d pair this with a separate LLM (e.g., Qwen3-8B) for inference.

The 4B size is ideal when you need better quality than 300-400M parameter embedders but can’t fit an 8B+ model on your hardware or latency budget. It’s also a good fit for batch processing where throughput matters more than single-query latency.

Running Qwen3-Embedding-4B Locally

VRAM Requirements

At full precision (FP32), the model requires roughly 16 GB of VRAM (4B parameters × 4 bytes). Realistically, you’ll use quantized versions:

Quantization	VRAM (approx)	Notes
FP16 / BF16	8 GB	Full precision, minimal quality loss.
Q4_K_M	4.5–5 GB	Recommended balance for consumer GPUs.
Q4_K_S	4–4.5 GB	Slightly less memory, negligible quality drop.
Q3_K_M	3.5–4 GB	Usable but quality may degrade for cross-lingual tasks.

Minimum VRAM: 4 GB (using Q4_K_S or IQ4_XS) allows running on cards like GTX 1660 Ti, RTX 3050, or integrated NPUs with shared memory—but expect very low throughput.

Recommended VRAM: 8 GB or more. This opens up Q4_K_M quantization, which offers near-lossless quality for most embedding tasks. The official Ollama 4-bit quantized model (Q4_K_M) is 2.5 GB disk size and runs comfortably on:

NVIDIA RTX 3060 (12 GB) / RTX 4060 (8 GB) / RTX 4090 (24 GB)
AMD Radeon RX 6700 XT (12 GB)
Apple M4 Max or M3 Pro with unified memory (16 GB+)
Intel Arc A770 (16 GB)

For batch inference (e.g., encoding 1000 documents), allocate more VRAM. Running a batch size of 8–16 at Q4_K_M may need ~6–8 GB. The 32K context length will increase memory linearly with sequence length—use shorter chunks (512–2048 tokens) for production.

Performance (Tokens per Second)

Exact throughput depends on hardware and quantization. Estimates on an RTX 4090 (24 GB) at Q4_K_M with typical sequence length (512 tokens):

Single sequence: ~1500–2000 tokens/s
Batch of 16: ~7000–10000 tokens/s aggregate

On an Apple M4 Max (40-core GPU), expect approximately 500–800 tokens/s per sequence (Metal backend).

To get started quickly, use Ollama:

1ollama pull qwen3-embedding:4b

Then embed via API or Python:

1import ollama
2response = ollama.embed(model='qwen3-embedding:4b', input='Your text here')
3print(response['embeddings'])

Ollama automatically applies the recommended Q4_K_M quantization. If you want a different quantization, use the --quantize flag when pulling (e.g., qwen3-embedding:4b-q3_K_M).

For production deployments, consider using the sentence-transformers library directly with transformers to control batch size, device placement, and pooling. The model is available on Hugging Face as Qwen/Qwen3-Embedding-4B.

How It Compares

Qwen3-Embedding-4B vs. BGE-Large-EN-v1.5 (1.3B)

Size: Qwen3 is 3x larger, so it requires more VRAM but captures more nuance.
Multilingual: Qwen3 supports 100+ languages; BGE-Large is primarily English.
Performance: Qwen3 beats BGE-Large on MTEB English benchmarks by ~2-3 points on average. On multilingual tasks, the gap widens.
Context length: Qwen3 supports 32K vs. BGE-Large's 512 tokens. This is critical for long-document retrieval.
When to choose Qwen3: You need multilingual or cross-lingual capability, or you process documents longer than 512 tokens.
When to choose BGE-Large: You are English-only and have limited VRAM (e.g., 4 GB cards). BGE-Large at Q4 quant fits in ~2 GB.

Qwen3-Embedding-4B vs. intfloat/e5-mistral-7b-instruct

Size: e5-mistral-7b has ~7B parameters (dense), Qwen3-4B is smaller.
Performance: e5-mistral-7b scores higher on MTEB English (by ~1-2 points) and supports longer context (32K as well).
VRAM: e5-mistral-7b at Q4_K_M needs ~7–8 GB; Qwen3-4B needs ~4.5 GB.
When to choose Qwen3: You need to fit on an 8 GB card while keeping quality high; or you require stronger multilingual support (e5-mistral is mostly English + some code).
When to choose e5-mistral-7b: You have a 24 GB card and want the best English retrieval performance at the cost of higher memory.

In summary: Qwen3-Embedding-4B is the best option when you need a balance of quality, multilingual capability, and VRAM efficiency—especially for developers targeting 8–12 GB consumer GPUs. For English-only tasks on tight budgets, smaller models like BGE can suffice. For maximum raw English performance and you have spare VRAM, e5-mistral-7b or the 8B Qwen3 variant are better.

Related Models

Alibaba

Explore the Provider

See all Alibaba models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Alibaba model we track.

Open Alibaba

Explore the Family

See every Qwen release

The full Qwen family leaderboard with sizes, benchmark scores, and a release timeline.

Open Qwen

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

Alibaba

Qwen3-Embedding-4B

Mid-size 4B Qwen3 embedding model balancing quality and efficiency.

4B paramsDense

View on Hugging Face

Run with Ollama Source Code Official Page

Our Take

Best for: Open-source embedding text workloads

A strong 4B-parameter dense embedding model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters4B

Active Params3.6B

ArchitectureDense

ProviderAlibaba

Download Size36.2 GB

Community

Monthly Downloads2.4M

Likes291

Last Updated1 years ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run qwen3-embedding:4b

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

MTEB Overall

69.5

Retrieval

69.6

Classification

72.3

Clustering

57.1

STS

80.9

MBA Open Score

71.7AA

Benchmark60%

69.9

Popularity25%

85.6

Efficiency15%

55.6

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	2.7 GB
Acer Veriton GN100 AI MiniAcer	SS	2.7 GB
AMD Instinct MI300XAMD	SS	2.7 GB
AMD Instinct MI325XAMD	SS	2.7 GB
AMD Instinct MI355XAMD	SS	2.7 GB
AMD Radeon RX 7600 8GBAMD	SS	2.7 GB
AMD Radeon RX 7700 XTAMD	SS	2.7 GB
AMD Radeon RX 7800 XTAMD	SS	2.7 GB
AMD Radeon RX 7900 XTAMD	SS	2.7 GB
AMD Radeon RX 7900 XTXAMD	SS	2.7 GB
AMD Radeon RX 9070AMD	SS	2.7 GB
AMD Radeon RX 9070 XTAMD	SS	2.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	2.7 GB
Apple M4Apple	SS	2.7 GB
Apple M4 Max (40-core GPU)Apple	SS	2.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	2.7 GB
Apple M5Apple	SS	2.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	2.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	2.7 GB
Apple Mac Mini (M1, 2020)Apple	SS	2.7 GB
Apple Mac Mini (M2, 2023)Apple	SS	2.7 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	2.7 GB
Apple Mac Mini (M4, 2024)Apple	SS	2.7 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	2.7 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	2.7 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

Key architectural points:

Dense architecture – no sparsity overhead; predictable VRAM.
Pooling – outputs a single vector per input (likely CLS or mean pooling, typical for embedding models).
Instruction-tuned – supports user-defined prefixes to adapt embeddings for specific domains or languages (e.g., “Represent this query for retrieval:” or “Classify this review:”).
Flexible output dimensions – can be set to any size (e.g., 256, 512, 768, 2048) via dimensionality reduction during inference, trading off speed/storage vs. precision.

Capabilities & Use Cases

The Qwen3-Embedding-4B excels in tasks where you need robust, multilingual embeddings from a single model. Its specific strengths:

Semantic search / RAG: Generate embeddings for a document corpus, then perform cosine similarity queries. Works well with chunked documents up to 32K tokens.
Code retrieval: Embed code functions, comments, or API docs. Top-tier performance on code search benchmarks compared to similarly sized models.
Text classification: Use as a feature extractor for supervised classifiers—freeze the embedding layer, train a lightweight head.
Clustering: Group news articles, customer feedback, or scientific papers by topic using embedding vectors.
Cross-lingual retrieval: Query in English, retrieve documents in Chinese, German, or other languages. The model’s multilingual training makes it effective here.
Bitext mining: Align parallel sentences across languages for machine translation training data.

Running Qwen3-Embedding-4B Locally

VRAM Requirements

At full precision (FP32), the model requires roughly 16 GB of VRAM (4B parameters × 4 bytes). Realistically, you’ll use quantized versions:

Quantization	VRAM (approx)	Notes
FP16 / BF16	8 GB	Full precision, minimal quality loss.
Q4_K_M	4.5–5 GB	Recommended balance for consumer GPUs.
Q4_K_S	4–4.5 GB	Slightly less memory, negligible quality drop.
Q3_K_M	3.5–4 GB	Usable but quality may degrade for cross-lingual tasks.

Minimum VRAM: 4 GB (using Q4_K_S or IQ4_XS) allows running on cards like GTX 1660 Ti, RTX 3050, or integrated NPUs with shared memory—but expect very low throughput.

NVIDIA RTX 3060 (12 GB) / RTX 4060 (8 GB) / RTX 4090 (24 GB)
AMD Radeon RX 6700 XT (12 GB)
Apple M4 Max or M3 Pro with unified memory (16 GB+)
Intel Arc A770 (16 GB)

Performance (Tokens per Second)

Exact throughput depends on hardware and quantization. Estimates on an RTX 4090 (24 GB) at Q4_K_M with typical sequence length (512 tokens):

Single sequence: ~1500–2000 tokens/s
Batch of 16: ~7000–10000 tokens/s aggregate

On an Apple M4 Max (40-core GPU), expect approximately 500–800 tokens/s per sequence (Metal backend).

To get started quickly, use Ollama:

1ollama pull qwen3-embedding:4b

Then embed via API or Python:

1import ollama
2response = ollama.embed(model='qwen3-embedding:4b', input='Your text here')
3print(response['embeddings'])

Ollama automatically applies the recommended Q4_K_M quantization. If you want a different quantization, use the --quantize flag when pulling (e.g., qwen3-embedding:4b-q3_K_M).

How It Compares

Qwen3-Embedding-4B vs. BGE-Large-EN-v1.5 (1.3B)

Size: Qwen3 is 3x larger, so it requires more VRAM but captures more nuance.
Multilingual: Qwen3 supports 100+ languages; BGE-Large is primarily English.
Performance: Qwen3 beats BGE-Large on MTEB English benchmarks by ~2-3 points on average. On multilingual tasks, the gap widens.
Context length: Qwen3 supports 32K vs. BGE-Large's 512 tokens. This is critical for long-document retrieval.
When to choose Qwen3: You need multilingual or cross-lingual capability, or you process documents longer than 512 tokens.
When to choose BGE-Large: You are English-only and have limited VRAM (e.g., 4 GB cards). BGE-Large at Q4 quant fits in ~2 GB.

Qwen3-Embedding-4B vs. intfloat/e5-mistral-7b-instruct

Size: e5-mistral-7b has ~7B parameters (dense), Qwen3-4B is smaller.
Performance: e5-mistral-7b scores higher on MTEB English (by ~1-2 points) and supports longer context (32K as well).
VRAM: e5-mistral-7b at Q4_K_M needs ~7–8 GB; Qwen3-4B needs ~4.5 GB.
When to choose Qwen3: You need to fit on an 8 GB card while keeping quality high; or you require stronger multilingual support (e5-mistral is mostly English + some code).
When to choose e5-mistral-7b: You have a 24 GB card and want the best English retrieval performance at the cost of higher memory.