
Foundational 7B LLM-based embedder fine-tuned with GPT-4-synthesized instruction data.
A solid 7.1B-parameter dense embedding model from Microsoft. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Microsoft’s e5-mistral-7b-instruct is a 7.1 billion parameter dense embedding model built on the Mistral‑7B‑v0.1 architecture. It’s designed for one purpose: producing high‑quality text embeddings that can be customised with natural‑language instructions. This puts it in the instructable‑embedding category — models that let you specify the exact kind of similarity, retrieval, or classification task you want them to optimise for, rather than relying on a fixed similarity metric.
The model was fine‑tuned from the base Mistral‑7B using synthetic instruction data generated by GPT‑4, then trained with supervised contrastive learning on a mix of multilingual datasets. The result is a strong performer on English‑language semantic search, passage ranking, and embedding‑based classification, with a BEIR score of 56.9 and an average MTEB score of 60.3 — placing it among the top‑performing 7‑billion‑class embedding models.
The model is published on Hugging Face under the canonical identifier intfloat/e5-mistral-7b-instruct. Its closest instruction‑tuned relative is the much smaller multilingual‑e5‑large‑instruct (24 layers). For practitioners who need a single model that can switch between retrieval, clustering, and classification without swapping weights, this is a practical choice in the 7‑billion range.
Licensed under MIT: commercial use, modification, and redistribution are all permitted, provided the copyright and license notice are retained.
e5-mistral-7b-instruct is a dense transformer with 7.1B total parameters. It uses 32 decoder layers, a hidden (embedding) dimension of 4096, and a feed‑forward (intermediate) dimension of 14336, which is the standard Mistral‑7B configuration. The context length is 4096 tokens, which is sufficient for most retrieval and classification use cases.
Because it is a dense model (not Mixture‑of‑Experts), every inference step uses all 7.1B parameters. This means predictable memory consumption and no dynamic routing overhead, but also a fixed compute cost per token. For embedding tasks, the model is used in an encoder‑style fashion: you feed in a text + optional instruction, and extract the final layer’s <eos> token representation as your embedding vector.
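The pooling step described above (taking the final layer's representation at the last real token) can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `last_token_pool`, the nested-list inputs, and the assumption that sequences are right-padded are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def last_token_pool(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    attention_mask: Sequence[Sequence[int]],
) -> List[List[float]]:
    """Return one vector per sequence: the hidden state of its last real token.

    hidden_states: [batch][seq_len][dim] final-layer outputs.
    attention_mask: [batch][seq_len] with 1 for real tokens, 0 for padding.
    Assumes right padding, so the last real token sits at index sum(mask) - 1.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        num_real_tokens = sum(mask)
        if num_real_tokens == 0:
            raise ValueError("sequence contains no real tokens")
        pooled.append(list(token_vectors[num_real_tokens - 1]))
    return pooled


# Toy batch: two sequences of length 3 with 2-dimensional hidden states.
toy_hidden = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last position is padding
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],  # no padding
]
toy_mask = [[1, 1, 0], [1, 1, 1]]
toy_embeddings = last_token_pool(toy_hidden, toy_mask)
```

With left-padded batches the last real token is simply the final position, so a production version would need to check which padding side the tokenizer uses.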
The model is published on Hugging Face as intfloat/e5-mistral-7b-instruct and is compatible with the sentence-transformers library, making it straightforward to load and run with minimal boilerplate.
This model is not a general‑purpose chatbot; it is an embedding model. Its primary output is a dense vector that captures the semantic meaning of an input text, conditioned on an optional instruction. The instruction mechanism gives it unique flexibility.
- "Retrieve relevant documents:" steers the embedding to focus on topical relevance. The BEIR benchmark score of 56.9 reflects strong zero‑shot retrieval across 18 datasets.
- "Represent the text for clustering:" yields vectors that cluster well by semantic topic. MTEB clustering tasks show competitive performance.
- "Classify the sentiment:" can bias the embedding toward sentiment‑relevant dimensions, improving downstream accuracy.
- "Rerank passages for relevance:" embeds the query and the candidate passages so that they can be ranked by cosine similarity.

Limitations: Optimised for English. Multilingual support exists (the model was fine‑tuned on a multilingual mix), but for robust non‑English performance the dedicated multilingual-e5-large-instruct is recommended.
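Ranking by cosine similarity, as mentioned for reranking, can be sketched without any model at all. The example below uses toy 2‑dimensional vectors in place of real embeddings; the helper names `cosine_similarity` and `rank_passages` are illustrative assumptions, not part of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs, highest similarity first."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(passage_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings of one query and three passages.
toy_query = [1.0, 0.0]
toy_passages = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_passages(toy_query, toy_passages)
```

In practice the query vector would come from the instruction-prefixed query and the passage vectors from the passages themselves; if the embeddings are already L2-normalised, the dot product alone gives the same ordering.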
This is a 7.1B dense model — it fits on modern consumer hardware with proper quantisation. Below are practical guidelines for running it on local machines.
| Quantisation | VRAM (approx.) | Notes |
|---|---|---|
| FP16 (unquantised, half precision) | ~14 GB | Requires a GPU with 16+ GB VRAM. |
| Q5_K_M | ~8 GB | Good balance of quality and memory. |
| Q4_K_M (recommended) | ~6.5 GB | Best quality‑to‑memory ratio for most users. |
| Q3_K_M | ~5.5 GB | Usable but noticeable quality degradation on retrieval tasks. |
| Q2_K | ~4.5 GB | Only for extremely constrained setups; expect significant drop. |
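The FP16 row follows from simple arithmetic: 7.1B parameters at 2 bytes each is about 14.2 GB of weights. A rough sketch of that weights-only estimate is shown below. The function name and the `overhead_gb` parameter are hypothetical, and the calculation ignores activations and runtime buffers. Quantised formats also store scaling metadata alongside the weights, so real files (and the table figures above) come out higher than a naive bits-per-weight calculation suggests.

```python
def estimate_weight_memory_gb(
    num_params: float,
    bits_per_weight: float,
    overhead_gb: float = 0.0,
) -> float:
    """Weights-only memory estimate in GB (1 GB = 1e9 bytes).

    overhead_gb is a user-supplied allowance for runtime buffers;
    it is an assumption of this sketch, not a measured value.
    """
    weight_bytes = num_params * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb


# 7.1B parameters stored as 16-bit floats.
fp16_gb = estimate_weight_memory_gb(7.1e9, 16)
print(round(fp16_gb, 1))  # 14.2
```

Treat the result as a lower bound when deciding whether a given GPU is large enough.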
CPU-only inference is possible with llama.cpp or ONNX Runtime, but expect fewer than 5 tokens/second. It is not recommended for production workloads.
Performance depends on hardware, quantisation, and inference engine. Rough estimates for batch size 1 (single embedding):
| Hardware | Quantisation | Tokens/sec |
|---|---|---|
| RTX 4090 | Q4_K_M | 80–120 |
| RTX 3090 | Q4_K_M | 50–70 |
| RTX 4060 12 GB | Q4_K_M | 25–35 |
| M4 Max (64 GB) | Q4_K_M | 60–90 |
| M4 Pro 24 GB | Q4_K_M | 30–50 |
| M2 16 GB | Q4_K_M | 15–25 |
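For embedding tasks, you typically process many texts at once. With batch size 32, throughput scales almost linearly (with some overhead). Taking the single-stream RTX 4090 figure above at face value, that is roughly 2,500–3,800 tokens/sec, so embedding 1000 documents of 256 tokens each (256,000 tokens in total) would take on the order of 70 to 100 seconds rather than under a second.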
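Ollama provides the quickest path to local inference:

```shell
# Pull the model. Note that `ollama pull mxbai-embed-large` fetches a different
# embedding model; for e5-mistral-7b-instruct use the tag below
# (check that it is available in your Ollama registry first).
ollama pull intfloat/e5-mistral-7b-instruct
```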
Then use the Python client or HTTP API to generate embeddings. The model is distributed in the standard Hugging Face format, so you can also load it directly with sentence-transformers:
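```python
from sentence_transformers import SentenceTransformer

# Downloads the weights from Hugging Face on first use.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

query = "What is the capital of France?"
instruction = "Retrieve relevant documents:"

# Prepend the instruction to the query; encode() returns one vector per input.
emb = model.encode([instruction + " " + query])
```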
For quantised versions, use llama.cpp or llama-cpp-python with a GGUF file (e.g., from TheBloke on Hugging Face).
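For the HTTP route mentioned above, a request to a locally running Ollama server might look like the following sketch. It assumes the default local address and the `/api/embed` endpoint with a `model`/`input` JSON body and an `embeddings` field in the response; confirm these against the API documentation for your Ollama version. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List, Sequence

# Assumption: Ollama is running locally on its default port.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def build_embed_request(model: str, texts: Sequence[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for embeddings."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        OLLAMA_EMBED_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model: str, texts: Sequence[str], timeout: float = 60.0) -> List[List[float]]:
    """Send the request and return one embedding vector per input text."""
    request = build_embed_request(model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the response carries the vectors under "embeddings".
    return body["embeddings"]
```

A call such as `embed("intfloat/e5-mistral-7b-instruct", ["Retrieve relevant documents: What is the capital of France?"])` would then return a list containing a single vector, provided the server is running and the model tag has been pulled.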
The main alternatives worth considering are all smaller than the 7‑billion‑parameter class:
| Model | Params | BEIR | MTEB | Notes |
|---|---|---|---|---|
| e5-mistral-7b-instruct | 7.1B | 56.9 | 60.3 | Instruction‑tuned, strong generalist. |
| e5-large (non‑instruct) | 335M | 54.2 | 58.7 | Smaller, faster, but no instruction control. |
| multilingual-e5-large-instruct | 1.3B | 52.5 | N/A | Better for non‑English, and much smaller than 7B. |
| gte-large-en-v1.5 | 1.3B | 60.1 | 63.5 | Smaller, higher MTEB, but no instruction capability. |
When to choose e5‑mistral‑7b‑instruct: You need instruction‑aware embeddings — tasks where the same model must handle different similarity criteria (e.g., a search engine that sometimes wants topical relevance, sometimes temporal relevance). The 7B size gives you more representational capacity than smaller models, at the cost of higher VRAM. If your use case is purely English retrieval or classification without instruction conditioning, gte‑large or e5‑large may offer better performance per parameter.
When to skip it: Your hardware is constrained to ≤5 GB VRAM. In that case a Q4_K_M version of a smaller model (e.g., e5‑base‑v2 at 0.1B) will run faster and still deliver competitive results. Also skip it if you need balanced coverage across many languages: multilingual‑e5‑large‑instruct is a better fit despite being smaller (1.3B), because raw size does not close the multilingual gap.
For practitioners who want best‑in‑class retrieval with instruction flexibility and have the GPU memory to spare (8 GB+), e5‑mistral‑7b‑instruct is the recommended default in this parameter range.
