The single number most teams quote when comparing embedding models. Aggregates 56 datasets across 8 task types.
MTEB is the de facto evaluation harness for text embedding models. It runs the same model across dozens of public datasets and reports per-task and overall scores. "MTEB Overall" averages across all included datasets and is the headline number on most embedding model cards.
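To make that concrete, here is a minimal sketch of evaluating a model on a single MTEB task with the open-source `mteb` package. The model and task choices are illustrative only, and the exact API surface varies a little between package versions.

```python
# Minimal MTEB run: evaluate one task and print its scores.
# Assumes `pip install mteb sentence-transformers`; model/task are illustrative.
from sentence_transformers import SentenceTransformer
import mteb

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pick a single task; a full "MTEB Overall" run covers dozens of datasets.
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Results are also written as JSON under the output folder, one file per task.
results = evaluation.run(model, output_folder="results/minilm-l6")
for res in results:
    print(res.task_name, res.scores)
```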
Each task type has its own metric (NDCG@10 for retrieval, accuracy for classification, V-Measure for clustering, Spearman correlation for STS, and so on). The Overall score is the mean of per-task averages. The metrics differ, but each is bounded and reported on a common 0–100 scale, so they can be averaged directly.
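As a sketch of that aggregation, the snippet below averages per-dataset scores within each task type, then takes the unweighted mean across task types. The task list is abbreviated and the numbers are made up for illustration.

```python
# Mean-of-per-task-averages aggregation (illustrative numbers, 0-100 scale).
from statistics import mean

per_task_scores = {
    "Retrieval": [55.0, 62.5, 48.0],  # NDCG@10 per dataset
    "Classification": [78.0, 81.5],   # accuracy per dataset
    "Clustering": [46.0, 51.0],       # V-Measure per dataset
    "STS": [84.0, 79.5],              # Spearman per dataset
}

# Average within each task type first, then across task types.
task_means = {task: mean(scores) for task, scores in per_task_scores.items()}
overall = mean(task_means.values())
print(f"Overall: {overall:.1f}")
```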
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | harrier-oss-v1-27b | Microsoft | Open | 74.3 |
| 02 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 72.3 |
| 03 | jina-embeddings-v5-text-small | Jina AI | Open | 71.7 |
| 04 | jina-embeddings-v5-text-nano | Jina AI | Open | 71.0 |
| 05 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 70.6 |
| 06 | gte-Qwen2-7B-instruct | Alibaba | Open | 70.2 |
Three models with undisclosed parameter counts are not shown; most closed-source labs do not publish model sizes.
Strong general-purpose embedding models score 65–72 overall. Frontier models break 75. Anything under 60 is likely to lag in retrieval-heavy workloads.
These bands are based on score correlations across our database.
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 07 | llama-embed-nemotron-8b | NVIDIA | Open | 69.5 |
| 08 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 69.5 |
| 09 | harrier-oss-v1-0.6b | Microsoft | Open | 69.0 |
| 10 | Linq-Embed-Mistral | Linq AI Research | Open | 68.2 |
| 11 | SFR-Embedding-Mistral | Salesforce | Open | 67.6 |
| 12 | GritLM-7B | GritLM (Contextual AI) | Open | 66.8 |
| 13 | e5-mistral-7b-instruct | intfloat (Microsoft Research) | Open | 66.6 |
| 14 | harrier-oss-v1-270m | Microsoft | Open | 66.5 |
| 15 | Cohere Embed v4.0 | Cohere | Closed | 65.2 |