The single number most teams quote when comparing embedding models. Aggregates 56 datasets across 8 task types.
MTEB is the de facto evaluation harness for text embedding models. It runs the same model across dozens of public datasets and reports per-task and overall scores. "MTEB Overall" averages across all included datasets and is the headline number on most embedding model cards.
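To make that concrete, here is a minimal sketch of evaluating a model on a single MTEB task with the open-source `mteb` package. The model and task choices are illustrative only, and the exact API surface varies a little between package versions.

```python
# Minimal MTEB run: evaluate one task and print its scores.
# Assumes `pip install mteb sentence-transformers`; model/task are illustrative.
from sentence_transformers import SentenceTransformer
import mteb

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pick a single task; a full "MTEB Overall" run covers dozens of datasets.
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Results are also written as JSON under the output folder, one file per task.
results = evaluation.run(model, output_folder="results/minilm-l6")
for res in results:
    print(res.task_name, res.scores)
```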
Each task type has its own metric (NDCG@10 for retrieval, accuracy for classification, V-Measure for clustering, Spearman correlation for STS, and so on). The Overall score is the mean of per-task averages. The metrics differ, but each is bounded and reported on a common 0–100 scale, so they can be averaged directly.
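As a sketch of that aggregation, the snippet below averages per-dataset scores within each task type, then takes the unweighted mean across task types. The task list is abbreviated and the numbers are made up for illustration.

```python
# Mean-of-per-task-averages aggregation (illustrative numbers, 0-100 scale).
from statistics import mean

per_task_scores = {
    "Retrieval": [55.0, 62.5, 48.0],  # NDCG@10 per dataset
    "Classification": [78.0, 81.5],   # accuracy per dataset
    "Clustering": [46.0, 51.0],       # V-Measure per dataset
    "STS": [84.0, 79.5],              # Spearman per dataset
}

# Average within each task type first, then across task types.
task_means = {task: mean(scores) for task, scores in per_task_scores.items()}
overall = mean(task_means.values())
print(f"Overall: {overall:.1f}")
```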
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | harrier-oss-v1-27b | Microsoft | Open | 74.3 |
| 02 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 72.3 |
| 03 | jina-embeddings-v5-text-small | Jina AI | Open | 71.7 |
| 04 | jina-embeddings-v5-text-nano | Jina AI | Open | 71.0 |
| 05 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 70.6 |
| 06 | gte-Qwen2-7B-instruct | Alibaba | Open | 70.2 |
Three models with undisclosed parameter counts are not shown; most closed-source labs do not publish model sizes.
Strong general-purpose embedding models score 65–72 overall. Frontier models break 75. Anything under 60 is likely to lag in retrieval-heavy workloads.
These bands are based on score correlations across our database.
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 07 | llama-embed-nemotron-8b | NVIDIA | Open | 69.5 |
| 08 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 69.5 |
| 09 | harrier-oss-v1-0.6b | Microsoft | Open | 69.0 |
| 10 | Linq-Embed-Mistral | Linq AI Research | Open | 68.2 |
| 11 | SFR-Embedding-Mistral | Salesforce | Open | 67.6 |
| 12 | GritLM-7B | GritLM (Contextual AI) | Open | 66.8 |
| 13 | e5-mistral-7b-instruct | intfloat (Microsoft Research) | Open | 66.6 |
| 14 | harrier-oss-v1-270m | Microsoft | Open | 66.5 |
| 15 | Cohere Embed v4.0 | Cohere | Closed | 65.2 |