Measures whether the embedding distance between two sentences matches human similarity judgments.
Pairs of sentences are scored by humans on a 0–5 similarity scale. The embedding model produces a cosine similarity for each pair, and the metric (Spearman correlation) measures how well the model's ranking matches the human ranking. High STS scores mean the embedding distance behaves the way humans expect.
For each dataset, Spearman correlation between model cosine similarity and human similarity ratings is computed, then averaged across the 10 datasets.
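The scoring procedure above can be sketched as follows. The embeddings and human ratings here are toy values for illustration only, not drawn from any real STS dataset:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for 4 sentence pairs (in practice, produced by the model).
pairs = [
    (np.array([1.0, 0.0, 0.2]), np.array([0.9, 0.1, 0.3])),  # near-duplicates
    (np.array([0.8, 0.5, 0.0]), np.array([0.6, 0.7, 0.1])),  # related
    (np.array([0.1, 1.0, 0.0]), np.array([0.9, 0.2, 0.4])),  # loosely related
    (np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])),  # unrelated
]
human_scores = [4.8, 3.5, 1.9, 0.2]  # human ratings on the 0-5 scale

model_scores = [cosine_similarity(a, b) for a, b in pairs]

# Spearman correlation compares the *rankings*, not the raw values:
# it is 1.0 when the model orders the pairs exactly as humans do.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")  # → 1.00 for this toy data
```

A benchmark score is this correlation computed per dataset, then averaged across datasets.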
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Octen-Embedding-8B | Octen AI | Open | 81.3 |
| 02 | Qwen3-Embedding-8B | Qwen/Alibaba | Open | 81.1 |
| 03 | Qwen3-Embedding-4B | Qwen/Alibaba | Open | 80.9 |
| 04 | harrier-oss-v1-27b | Microsoft | Open | 80.0 |
| 05 | llama-embed-nemotron-8b | NVIDIA | Open | 79.4 |
| 06 | KaLM-Embedding-Gemma3-12B-2511 | Tencent | Open | 79.0 |
| 07 | jina-embeddings-v5-text-small | Jina AI | Open | 78.8 |
| 08 | jina-embeddings-v5-text-nano | Jina AI | Open | 78.2 |
| 09 | harrier-oss-v1-0.6b | Microsoft | Open | 77.1 |
| 10 | F2LLM-v2-14B | CodeFuse-AI (Ant Group) | Open | 77.0 |
| 11 | multilingual-e5-large-instruct | intfloat (Microsoft Research) | Open | 76.8 |
| 12 | F2LLM-v2-8B | CodeFuse-AI (Ant Group) | Open | 76.5 |
| 13 | Qwen3-Embedding-0.6B | Qwen/Alibaba | Open | 76.2 |
| 14 | F2LLM-v2-4B | CodeFuse-AI (Ant Group) | Open | 75.9 |
| 15 | F2LLM-v2-1.7B | CodeFuse-AI (Ant Group) | Open | 75.8 |