Benchmarks · 2022

MTEB Clustering: MTEB Clustering Task Group

Name: MTEB Clustering: MTEB Clustering Task Group
Creator: Hugging Face
Published: 2022
Keywords: MTEB Clustering, AI benchmark, embedding model evaluation, Hugging Face

Tests whether semantically similar items end up close together in the embedding space.

Open Dataset

Models Tested: 27

Top Score: 60.9
Published: 2022
Source: Hugging Face

How It Works

Clustering scores measure whether documents from the same topic end up grouped together when the model's embeddings are clustered. The metric (V-Measure) does not care about specific cluster labels, only that documents about the same thing land in the same bucket.

Embeddings are run through k-means with the known number of clusters, then scored against the ground-truth grouping using V-Measure. Per-dataset scores are averaged.
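The procedure described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the benchmark's actual harness: the two toy datasets, the 2-D "embeddings", the function names, and the first-k-points centroid seeding are all made up for the example, and the final scaling assumes scores are reported on a 0-100 scale as in the tables on this page.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_truth, h_pred = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, pred) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (toy choice)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Hypothetical toy data: (embeddings, ground-truth topic labels) per dataset.
datasets = {
    "toy_topics": (
        [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)],
        ["a", "b", "a", "b", "a", "b"],
    ),
    "toy_line": (
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
        ["x", "x", "x", "y"],
    ),
}

per_dataset = {}
for name, (points, truth) in datasets.items():
    k = len(set(truth))  # the known number of clusters
    per_dataset[name] = v_measure(truth, kmeans(points, k))

# Average the per-dataset scores and scale to 0-100.
mean_score = 100 * sum(per_dataset.values()) / len(per_dataset)
print(per_dataset, round(mean_score, 1))
```

Because V-Measure is built from conditional entropies, renaming the clusters leaves the score unchanged: `v_measure(["a", "a", "b", "b"], [1, 1, 0, 0])` is a perfect score even though no predicted label matches a ground-truth label.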

Dataset size: 11 clustering datasets including arXiv abstracts, news, biomedical, and StackExchange.
Mean score: 54.2
Median score: 52.8
Open / Closed: 27 / 0

Top Scorers

#	Model	Lab	Source	Score
01	F2LLM-v2-14B	CodeFuse-AI (Ant Group)	Open	60.9
02	F2LLM-v2-8B	CodeFuse-AI (Ant Group)	Open	60.6
03	F2LLM-v2-4B	CodeFuse-AI (Ant Group)	Open	59.5
04	harrier-oss-v1-27b	Microsoft	Open	58.9
05	F2LLM-v2-1.7B	CodeFuse-AI (Ant Group)	Open	58.8
06	Qwen3-Embedding-8B	Alibaba	Open	57.6
07	Qwen3-Embedding-4B	Alibaba	Open	57.1
08	F2LLM-v2-0.6B	CodeFuse-AI (Ant Group)	Open	56.6
09	KaLM-Embedding-Gemma3-12B-2511	Tencent	Open	55.8
10	Octen-Embedding-8B	Octen AI	Open	55.7
11	llama-embed-nemotron-8b	NVIDIA	Open	54.4
12	harrier-oss-v1-0.6b	Microsoft	Open	54.0
13	jina-embeddings-v5-text-small	Jina AI	Open	53.4
14	BOOM_4B_v1	Institute of Computing Technology	Open	52.8
15	gte-Qwen2-7B-instruct	Alibaba	Open	52.8

Score Distribution

Open vs Closed Source

Top Open-Source Models

1	F2LLM-v2-14B	60.9
2	F2LLM-v2-8B	60.6
3	F2LLM-v2-4B	59.5

Top Closed-Source Models

No models in this category.

Score vs Parameter Count

Average Score by Lab

Lab	Average Score	n
CodeFuse-AI (Ant Group)	59.3	5
Alibaba	55.0	4
Microsoft	53.4	5
Jina AI	53.1	2
BidirLM	51.1	3
Contextual AI	50.0	2

Most Correlated Benchmarks

Benchmark	Pearson r	n
MTEB Overall	+0.70	15
Classification	+0.67	27
Retrieval	+0.63	27
STS	+0.51	27
Pearson r ranges from −1 to +1. A positive value means models that score higher on one benchmark tend to score higher on the other; a negative value means the opposite.
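For readers who want to see how such a coefficient is obtained, here is a minimal sketch of Pearson r over paired model scores. The score lists are invented for illustration and are not taken from the leaderboard; the function name is likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up scores for three models on two benchmarks (illustration only).
clustering = [50.0, 55.0, 60.0]
other_same_trend = [60.0, 65.0, 70.0]      # rises with the clustering score
other_opposite_trend = [70.0, 65.0, 60.0]  # falls as the clustering score rises

r_same = pearson_r(clustering, other_same_trend)
r_opposite = pearson_r(clustering, other_opposite_trend)
print(r_same, r_opposite)
```

Pearson r measures linear association between the scores themselves, so it only approximates agreement in ranking; the n next to each coefficient is the number of models with a score on both benchmarks.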

What It Captures Well

Captures whether the embedding has useful global structure, not just local similarity.
Strong predictor of topic discovery, deduplication, and exploration workflows.
Cheap to compute.

Where It Falls Short

V-Measure is sensitive to the chosen number of clusters.
Less directly useful for ranked retrieval.
Some datasets are saturated.
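The first point can be made concrete with a small worked example. Assume a toy dataset (not one of the benchmark's) of eight documents in two topics of four each, write C for the ground-truth classes and K for the predicted clusters, and use the standard V-Measure definitions. With k = 2 the clustering recovers the topics exactly; with k = 4 each topic is split evenly into two pure clusters, so every cluster is still homogeneous but completeness drops.

```latex
\begin{aligned}
h &= 1 - \frac{H(C \mid K)}{H(C)}, \qquad
c = 1 - \frac{H(K \mid C)}{H(K)}, \qquad
V = \frac{2hc}{h + c} \\[4pt]
k = 2:\quad & h = 1,\quad c = 1 \quad\Rightarrow\quad V = 1 \\[4pt]
k = 4:\quad & h = 1,\quad c = 1 - \frac{\ln 2}{\ln 4} = \tfrac{1}{2}
\quad\Rightarrow\quad V = \frac{2 \cdot 1 \cdot \tfrac{1}{2}}{1 + \tfrac{1}{2}} = \tfrac{2}{3}
\end{aligned}
```

The embeddings are identical in both cases; only the requested number of clusters changed, and the score fell from 1 to about 0.67.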

Related Benchmarks

Based on score correlations across our database.

Pearson r +0.70


