Tests whether semantically similar items end up close together in the embedding space.
Clustering scores measure whether documents on the same topic are grouped together when the model's embeddings are clustered. The metric (V-Measure) ignores the specific cluster labels: it only checks that documents about the same thing land in the same bucket and that each bucket holds a single topic.
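To illustrate the label-independence of V-Measure, here is a minimal sketch using scikit-learn's `v_measure_score`. The library choice and the toy label lists are assumptions made for illustration; this page does not state which implementation produces the scores.

```python
from sklearn.metrics import v_measure_score

# Hypothetical ground-truth topics for six documents.
truth = [0, 0, 1, 1, 2, 2]

# Same grouping as `truth`, but with different cluster ids.
relabeled = [2, 2, 0, 0, 1, 1]

# A coarser grouping that merges two topics into one bucket.
merged = [0, 0, 0, 0, 1, 1]

# Renaming clusters does not change the score.
same_grouping = v_measure_score(truth, relabeled)

# Mixing topics in one bucket lowers the score.
coarser_grouping = v_measure_score(truth, merged)

print(same_grouping, coarser_grouping)
```

Only the partition of documents matters: the relabeled grouping gets a perfect score, while the merged grouping is penalized.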
Embeddings are clustered with k-means, with the number of clusters set to the known number of topics, and the result is scored against the ground-truth grouping using V-Measure. The reported score is the average of the per-dataset scores.
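The procedure above can be sketched as follows. This is an illustrative approximation, not the benchmark's actual harness: it assumes scikit-learn's `KMeans` and `v_measure_score`, and the helper names (`clustering_score`, `average_score`), the seed, `n_init`, and the synthetic data are all hypothetical. scikit-learn returns values in [0, 1]; reading the table's scores as that value times 100 is also an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score


def clustering_score(embeddings, labels, seed=0):
    """Cluster embeddings with k-means and score them against the true labels."""
    # Use the known number of ground-truth groups as k.
    n_clusters = len(set(labels))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    predicted = kmeans.fit_predict(embeddings)
    return v_measure_score(labels, predicted)


def average_score(datasets):
    """Average the per-dataset V-Measure over (embeddings, labels) pairs."""
    return float(np.mean([clustering_score(X, y) for X, y in datasets]))


# Synthetic stand-in for model embeddings: three well-separated topics in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
labels = np.repeat(np.arange(3), 20)
embeddings = centers[labels] + rng.normal(scale=0.5, size=(60, 2))

toy_score = clustering_score(embeddings, labels)
toy_average = average_score([(embeddings, labels)])
print(toy_score, toy_average)
```

In practice the embeddings would come from the model under evaluation, one `(embeddings, labels)` pair per dataset, and `average_score` would produce the single number shown per model.
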
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 60.9 |
| 02 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 60.6 |
| 03 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 59.5 |
| 04 | harrier-oss-v1-27b | Microsoft | Open | 58.9 |
| 05 | F2LLM-v2-1.7B | CodeFuse-AI (Ant Group) | Open | 58.8 |
| 06 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 57.6 |
| 07 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 57.1 |
| 08 | F2LLM-v2-0.6B | CodeFuse-AI (Ant Group) | Open | 56.6 |
| 09 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 55.8 |
| 10 | Octen-Embedding-8B | Octen AI | Open | 55.7 |
| 11 | llama-embed-nemotron-8b | NVIDIA | Open | 54.4 |
| 12 | harrier-oss-v1-0.6b | Microsoft | Open | 54.0 |
| 13 | jina-embeddings-v5-text-small | Jina AI | Open | 53.4 |
| 14 | BOOM_4B_v1 | ICT-CAS TIME / Querit | Open | 52.8 |
| 15 | gte-Qwen2-7B-instruct | Alibaba | Open | 52.8 |