The retrieval slice of MTEB. The most important sub-score if you are building RAG.
Retrieval is the task most teams care about when picking an embedding model. The model embeds a query and a corpus, and the retrieval score measures how often the right document appears at the top. This sub-score is the single best predictor of how the model will feel in a RAG pipeline.
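The embed-and-rank loop can be sketched in a few lines. The vectors below are toy placeholders standing in for real model embeddings; the ranking logic (normalize, dot product, sort) is what an actual dense-retrieval pipeline does over a corpus.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                   # cosine similarity per document
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy 4-dimensional "embeddings" standing in for real model output.
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],  # doc 0: near the query direction
    [0.0, 1.0, 0.0, 0.0],  # doc 1: orthogonal to the query
    [0.1, 0.0, 0.9, 0.1],  # doc 2: weak overlap
])
query = np.array([1.0, 0.0, 0.1, 0.0])

ranking, scores = top_k(query, docs, k=2)
print(ranking)  # doc 0 ranks first
```

A production pipeline swaps the toy arrays for model-generated embeddings and a vector index, but the scoring step is the same.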
Scoring uses NDCG@10 — normalized discounted cumulative gain at rank 10, i.e. the discounted gain of the model's top-10 ranking divided by that of the ideal ordering. Per-dataset NDCG@10 values are averaged across the 15 retrieval datasets.
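The metric itself is simple to compute. Below is a minimal linear-gain sketch: relevant items found early count fully, items found lower are discounted by log rank, and the total is normalized by the best achievable ordering. (MTEB's actual scoring also handles graded relevance and averages over all queries in a dataset.)

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranked list, normalized by the ideal DCG."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank+1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Relevant document retrieved at rank 1 -> perfect score.
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
# Same document at rank 3 -> discounted: 1/log2(4) = 0.5
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5
```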
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | harrier-oss-v1-27b | Microsoft | Open | 78.3 |
| 02 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 75.7 |
| 03 | Octen-Embedding-8B | Octen AI | Open | 71.6 |
| 04 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 70.9 |
| 05 | harrier-oss-v1-0.6b | Microsoft | Open | 70.8 |
| 06 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 69.6 |
| 07 | llama-embed-nemotron-8b | NVIDIA | Open | 68.7 |
| 08 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 66.5 |
| 09 | harrier-oss-v1-270m | Microsoft | Open | 66.4 |
| 10 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 66.2 |
| 11 | jina-embeddings-v5-text-small | Jina AI | Open | 64.9 |
| 12 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 64.8 |
| 13 | Qwen3-Embedding-0.6B | Qwen/Alibaba | Open | 64.7 |
| 14 | jina-embeddings-v5-text-nano | Jina AI | Open | 63.3 |
| 15 | BOOM_4B_v1 | ICT-CAS TIME / Querit | Open | 62.2 |

A retrieval score above 55 typically feels usable in production. Above 60 is competitive with commercial APIs. Below 45, recall drops enough that users notice. (Based on score correlations across our database.)