The retrieval slice of MTEB. The most important sub-score if you are building RAG.
Retrieval is the task most teams care about when picking an embedding model. The model embeds a query and a corpus, and the retrieval score measures how often the right document appears at the top. This sub-score is the single best predictor of how the model will feel in a RAG pipeline.
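The embed-and-rank loop can be sketched in a few lines. The vectors below are toy placeholders standing in for real model embeddings; the ranking logic (normalize, dot product, sort) is what an actual dense-retrieval pipeline does over a corpus.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                   # cosine similarity per document
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy 4-dimensional "embeddings" standing in for real model output.
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],  # doc 0: near the query direction
    [0.0, 1.0, 0.0, 0.0],  # doc 1: orthogonal to the query
    [0.1, 0.0, 0.9, 0.1],  # doc 2: weak overlap
])
query = np.array([1.0, 0.0, 0.1, 0.0])

ranking, scores = top_k(query, docs, k=2)
print(ranking)  # doc 0 ranks first
```

A production pipeline swaps the toy arrays for model-generated embeddings and a vector index, but the scoring step is the same.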
Scoring uses NDCG@10 — normalized discounted cumulative gain at rank 10, i.e. the discounted gain of the model's top-10 ranking divided by that of the ideal ordering. Per-dataset NDCG@10 values are averaged across the 15 retrieval datasets.
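The metric itself is simple to compute. Below is a minimal linear-gain sketch: relevant items found early count fully, items found lower are discounted by log rank, and the total is normalized by the best achievable ordering. (MTEB's actual scoring also handles graded relevance and averages over all queries in a dataset.)

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranked list, normalized by the ideal DCG."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank+1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Relevant document retrieved at rank 1 -> perfect score.
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
# Same document at rank 3 -> discounted: 1/log2(4) = 0.5
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5
```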
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | harrier-oss-v1-27b | Microsoft | Open | 78.3 |
| 02 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 75.7 |
| 03 | Octen-Embedding-8B | Octen AI | Open | 71.6 |
| 04 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 70.9 |
| 05 | harrier-oss-v1-0.6b | Microsoft | Open | 70.8 |
| 06 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 69.6 |
| 07 | llama-embed-nemotron-8b | NVIDIA | Open | 68.7 |
| 08 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 66.5 |
| 09 | harrier-oss-v1-270m | Microsoft | Open | 66.4 |
| 10 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 66.2 |
| 11 | jina-embeddings-v5-text-small | Jina AI | Open | 64.9 |
| 12 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 64.8 |
| 13 | Qwen3-Embedding-0.6B | Qwen/Alibaba | Open | 64.7 |
| 14 | jina-embeddings-v5-text-nano | Jina AI | Open | 63.3 |
| 15 | BOOM_4B_v1 | ICT-CAS TIME / Querit | Open | 62.2 |

A retrieval score above 55 typically feels usable in production. Above 60 is competitive with commercial APIs. Below 45, recall drops enough that users notice. (Based on score correlations across our database.)