Tests whether the embedding captures enough semantic structure for downstream classifiers to work.
For each dataset, the model's embeddings are frozen and a simple logistic regression is trained on top to predict the label; accuracy is reported per task and averaged. The score therefore reflects how linearly separable the embedding space is for each task, without any fine-tuning. High classification scores mean the embeddings carry good linear structure for downstream supervised models.
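The protocol above can be sketched as a linear probe in scikit-learn. This is a minimal illustration, not the benchmark's actual harness: the toy clusters stand in for one task's precomputed embeddings, and `linear_probe_accuracy` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Train a logistic regression on frozen embeddings; return test accuracy.

    The embedding model itself is never updated -- only the linear probe is fit.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)
    return accuracy_score(test_labels, clf.predict(test_emb))

# Toy example: two well-separated Gaussian clusters stand in for the
# embeddings of one task's positive and negative examples.
rng = np.random.default_rng(0)
pos = rng.normal(loc=+2.0, size=(50, 16))
neg = rng.normal(loc=-2.0, size=(50, 16))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# Even/odd split into train and test; the benchmark score would be the
# mean of such per-task accuracies.
acc = linear_probe_accuracy(X[::2], y[::2], X[1::2], y[1::2])
print(round(acc, 3))
```

Because the probe is linear, the accuracy directly measures how linearly separable the frozen embedding space is for that task, which is exactly what the table below ranks.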
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | harrier-oss-v1-27b | Microsoft | Open | 80.0 |
| 02 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 77.9 |
| 03 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 74.0 |
| 04 | harrier-oss-v1-0.6b | Microsoft | Open | 73.9 |
| 05 | llama-embed-nemotron-8b | NVIDIA | Open | 73.2 |
| 06 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 73.0 |
| 07 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 72.3 |
| 08 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 71.9 |
| 09 | jina-embeddings-v5-text-small | Jina AI | Open | 71.3 |
| 10 | harrier-oss-v1-270m | Microsoft | Open | 70.8 |
| 11 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 70.7 |
| 12 | jina-embeddings-v5-text-nano | Jina AI | Open | 69.2 |
| 13 | F2LLM-v2-1.7B | CodeFuse-AI (Ant Group) | Open | 67.7 |
| 14 | BOOM_4B_v1 | ICT-CAS TIME / Querit | Open | 66.9 |
| 15 | Qwen3-Embedding-0.6B | Qwen/Alibaba | Open | 66.8 |

If you are building intent classifiers, sentiment models, or topic taggers on top of embeddings, classification matters more. For RAG and semantic search, retrieval is the better signal.