
Salesforce Research's 7B contrastive embedding that first topped MTEB in February 2024 (research-only).
A solid 7.1B-parameter dense embedding model from Salesforce. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing. Currently trending on Hugging Face — community interest is climbing.
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
SFR-Embedding-Mistral is a 7.1 billion parameter text embedding model from Salesforce Research. It first reached the top of the Massive Text Embedding Benchmark (MTEB) leaderboard in February 2024, with an average score of 67.6 across 56 datasets. It is a dense, contrastive embedding model built on top of E5-mistral-7b-instruct and Mistral-7B-v0.1.
What sets it apart is its retrieval performance: on the MTEB retrieval subset it scores 59.0, up from 56.9 for E5-mistral-7b-instruct, and it gains +1.4 on clustering tasks. If you need a local embedding model for semantic search, RAG pipelines, or document clustering, this is a strong candidate.
The model is released under the CC-BY-NC-4.0 license, which means it’s free for research and non-commercial use. For commercial applications, you’ll need to check with Salesforce or consider alternatives.
SFR-Embedding-Mistral is a dense transformer with 7.1B parameters — the same architecture as Mistral-7B, but fine-tuned for embedding tasks. It uses last-token pooling to generate embeddings from the final hidden state.
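Last-token pooling can be illustrated with a small sketch. It uses toy nested lists in place of real hidden states and assumes right-padded sequences; the function name `last_token_pool` and the data are illustrative, not taken from the model's code.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token per sequence.

    hidden_states: one list of per-token vectors for each sequence.
    attention_mask: matching lists of 1 (real token) or 0 (padding).
    Assumes right padding, so the last real token is at index sum(mask) - 1.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1
        pooled.append(token_vectors[last_index])
    return pooled


# Toy batch: two sequences, three positions, 2-dimensional hidden states.
hidden_states = [
    [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]],  # no padding
    [[0.0, 5.0], [6.0, 8.0], [0.0, 0.0]],  # last position is padding
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

pooled = last_token_pool(hidden_states, attention_mask)
print(pooled)
```

The second sequence shows why the attention mask matters: taking the final position blindly would return the padding vector rather than the last real token.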
Key training and inference choices: the model is trained with a contrastive loss using 7 hard negatives per query-document pair, and training took about 15 hours on 8 A100 GPUs. At inference, each input text yields a single dense vector, normalized for cosine similarity.
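A contrastive loss with hard negatives can be sketched as follows. This is one common formulation (InfoNCE over cosine similarities); the exact loss used for this model, the `temperature` value, and the function names are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(query, positive, hard_negatives, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive candidate.

    temperature=0.05 is an assumed value, not one reported for this model.
    """
    candidates = [positive] + list(hard_negatives)
    logits = [cosine(query, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidate logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]

# Seven negatives that are easy to tell apart from the positive.
easy_loss = contrastive_loss(query, positive, [[0.0, 1.0]] * 7)
# Seven negatives indistinguishable from the positive (the hardest case).
confusable_loss = contrastive_loss(query, positive, [[0.9, 0.1]] * 7)
print(easy_loss, confusable_loss)
```

Harder negatives raise the loss, which is what pushes the model to separate near-duplicate passages from the true match.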
SFR-Embedding-Mistral is not a general-purpose language model; it produces text embeddings. Its strength lies in retrieval and clustering, which makes it a fit for semantic search, RAG pipelines, and document clustering.
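For semantic search, the usual flow is to embed the query and the documents, then rank documents by cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real embeddings. The `Instruct:`/`Query:` template in `build_query` follows the convention of E5-mistral-style models and is an assumption here; check the model card before relying on it. All function names are illustrative.

```python
import math


def build_query(task_description, query):
    # Assumed instruction template (E5-mistral style); verify against the model card.
    return f"Instruct: {task_description}\nQuery: {query}"


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in document_vectors.items()
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Toy vectors standing in for embeddings produced by the model.
document_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query_vector = [1.0, 0.0, 0.0]

prompt = build_query("Retrieve passages that answer the question", "what is last-token pooling?")
ranking = rank_documents(query_vector, document_vectors, top_k=2)
print(prompt)
print(ranking)
```

In a real pipeline the document vectors would be computed once and stored, and only the query would be embedded at search time.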
This is a 7.1B dense model. Quantized builds run on consumer GPUs with 8GB or more of VRAM; memory use and embedding quality both depend on the quantization level.
| Quantization | VRAM (approx) | Notes |
|---|---|---|
| FP16 (unquantized) | ~14 GB | Needs more than 14 GB of VRAM; a 24GB GPU like the RTX 4090 leaves headroom |
| Q4_K_M | ~5.5 GB | Fits on 8GB GPUs (RTX 3070, RTX 4060) |
| Q5_K_M | ~6.5 GB | Better quality, still fits 8GB |
| Q3_K_M | ~4.5 GB | Lower quality but runs on 6GB GPUs |
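The figures in the table can be sanity-checked with a weights-only estimate: parameters times bits per weight, divided by 8. The bits-per-weight values for the K-quants below are approximate and assumed for illustration; actual GGUF files vary slightly.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 7.1e9

# Approximate effective bits per weight (assumed values for the K-quants).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}

estimates = {
    name: weight_memory_gb(PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()
}
for name, gigabytes in estimates.items():
    print(f"{name}: ~{gigabytes:.1f} GB of weights")
```

These weights-only estimates come out below the table's quantized figures. The gap is plausibly runtime overhead such as activations and buffers, which this sketch does not model.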
Embedding generation is fast because it’s a single forward pass per text. On an RTX 4090 with FP16, you can expect ~200–400 tokens per second (batch size 1). On an RTX 3070 with Q4_K_M, expect ~80–150 tokens per second. Throughput scales with batch size, so batching inputs is the main lever when embedding large document sets.
- `ollama run nomic-embed-text` is a smaller alternative, but for SFR-Embedding-Mistral you can use llama.cpp directly.
- With a GGUF build of the model (for example `SFR-Embedding-Mistral-Q4_K_M.gguf`), run `./embedding -m SFR-Embedding-Mistral-Q4_K_M.gguf -p "Your text here"`. Newer llama.cpp builds name this binary `llama-embedding`.
- The sentence-transformers library works with CUDA, but loads the unquantized FP16 weights (~14GB VRAM).

For quantization, Q4_K_M is the sweet spot for most users: minimal quality loss at well under half the VRAM of FP16. If you have 24GB, Q5_K_M or FP16 gives slightly better retrieval accuracy.
| Model | Parameters | MTEB Score | License | Notes |
|---|---|---|---|---|
| SFR-Embedding-Mistral | 7.1B | 67.6 | CC-BY-NC-4.0 | Top retrieval, research-only |
| E5-mistral-7b-instruct | 7.1B | ~66.6 | MIT | Commercial, slightly lower retrieval |
| BGE-M3 | 567M | ~64 | MIT | Much smaller, multilingual, commercial |
| GTE-Qwen2-7B | 7B | ~68 | Apache-2.0 | Newer, commercial, slightly better on some tasks |
SFR-Embedding-Mistral vs E5-mistral-7b-instruct: Both are based on Mistral-7B. SFR has better retrieval (+2.1 points) and clustering (+1.4) thanks to multi-task training and hard negatives. E5 is MIT-licensed, so it’s the choice for commercial use.
SFR-Embedding-Mistral vs GTE-Qwen2-7B: GTE-Qwen2-7B (by Alibaba) scores slightly higher on MTEB overall (~68) and is Apache-2.0-licensed. SFR still leads on retrieval (59.0 vs ~58.5). If you need commercial use, choose GTE. If you want the best retrieval for research, SFR remains competitive.
When to pick SFR-Embedding-Mistral: You’re doing research or non-commercial RAG, need top-tier retrieval accuracy, and have the VRAM to run a 7B model. If you need a commercially usable license, look at E5-mistral-7b-instruct or BGE-M3; if you also need lower VRAM, BGE-M3 (567M, fits 4GB GPUs) is the better fit.