Tests whether the embedding captures enough semantic structure for downstream classifiers to work.
For each dataset, the model's embeddings are frozen and a simple logistic regression is trained on top to predict the label; accuracy is reported per task and averaged. The score therefore reflects how linearly separable the embedding space is for each task, without any fine-tuning. High classification scores mean the embeddings carry good linear structure for downstream supervised models.
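The protocol above can be sketched as a linear probe in scikit-learn. This is a minimal illustration, not the benchmark's actual harness: the toy clusters stand in for one task's precomputed embeddings, and `linear_probe_accuracy` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Train a logistic regression on frozen embeddings; return test accuracy.

    The embedding model itself is never updated -- only the linear probe is fit.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)
    return accuracy_score(test_labels, clf.predict(test_emb))

# Toy example: two well-separated Gaussian clusters stand in for the
# embeddings of one task's positive and negative examples.
rng = np.random.default_rng(0)
pos = rng.normal(loc=+2.0, size=(50, 16))
neg = rng.normal(loc=-2.0, size=(50, 16))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# Even/odd split into train and test; the benchmark score would be the
# mean of such per-task accuracies.
acc = linear_probe_accuracy(X[::2], y[::2], X[1::2], y[1::2])
print(round(acc, 3))
```

Because the probe is linear, the accuracy directly measures how linearly separable the frozen embedding space is for that task, which is exactly what the table below ranks.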
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | harrier-oss-v1-27b | Microsoft | Open | 80.0 |
| 02 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 77.9 |
| 03 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 74.0 |
| 04 | harrier-oss-v1-0.6b | Microsoft | Open | 73.9 |
| 05 | llama-embed-nemotron-8b | NVIDIA | Open | 73.2 |
| 06 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 73.0 |
| 07 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 72.3 |
| 08 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 71.9 |
| 09 | jina-embeddings-v5-text-small | Jina AI | Open | 71.3 |
| 10 | harrier-oss-v1-270m | Microsoft | Open | 70.8 |
| 11 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 70.7 |
| 12 | jina-embeddings-v5-text-nano | Jina AI | Open | 69.2 |
| 13 | F2LLM-v2-1.7B | CodeFuse-AI (Ant Group) | Open | 67.7 |
| 14 | BOOM_4B_v1 | ICT-CAS TIME / Querit | Open | 66.9 |
| 15 | Qwen3-Embedding-0.6B | Qwen/Alibaba | Open | 66.8 |

If you are building intent classifiers, sentiment models, or topic taggers on top of embeddings, classification matters more. For RAG and semantic search, retrieval is the better signal.