Measures whether the embedding distance between two sentences matches human similarity judgments.
Pairs of sentences are scored by humans on a 0–5 similarity scale. The embedding model produces a cosine similarity for each pair, and the metric (Spearman correlation) measures how well the model's ranking matches the human ranking. High STS scores mean the embedding distance behaves the way humans expect.
For each dataset, Spearman correlation between model cosine similarity and human similarity ratings is computed, then averaged across the 10 datasets.
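The scoring procedure above can be sketched as follows. The embeddings and human ratings here are toy values for illustration only, not drawn from any real STS dataset:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for 4 sentence pairs (in practice, produced by the model).
pairs = [
    (np.array([1.0, 0.0, 0.2]), np.array([0.9, 0.1, 0.3])),  # near-duplicates
    (np.array([0.8, 0.5, 0.0]), np.array([0.6, 0.7, 0.1])),  # related
    (np.array([0.1, 1.0, 0.0]), np.array([0.9, 0.2, 0.4])),  # loosely related
    (np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])),  # unrelated
]
human_scores = [4.8, 3.5, 1.9, 0.2]  # human ratings on the 0-5 scale

model_scores = [cosine_similarity(a, b) for a, b in pairs]

# Spearman correlation compares the *rankings*, not the raw values:
# it is 1.0 when the model orders the pairs exactly as humans do.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")  # → 1.00 for this toy data
```

A benchmark score is this correlation computed per dataset, then averaged across datasets.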
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Octen-Embedding-8B | Octen AI | Open | 81.3 |
| 02 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 81.1 |
| 03 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 80.9 |
| 04 | harrier-oss-v1-27b | Microsoft | Open | 80.0 |
| 05 | llama-embed-nemotron-8b | NVIDIA | Open | 79.4 |
| 06 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 79.0 |
| 07 | jina-embeddings-v5-text-small | Jina AI | Open | 78.8 |
| 08 | jina-embeddings-v5-text-nano | Jina AI | Open | 78.2 |
| 09 | harrier-oss-v1-0.6b | Microsoft | Open | 77.1 |
| 10 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 77.0 |
| 11 | multilingual-e5-large-instruct | intfloat (Microsoft Research) | Open | 76.8 |
| 12 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 76.5 |
| 13 | Qwen3-Embedding-0.6B | Qwen/Alibaba | Open | 76.2 |
| 14 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 75.9 |
| 15 | F2LLM-v2-1.7B | CodeFuse-AI (Ant Group) | Open | 75.8 |