
Microsoft's 270M edge-ready multilingual embedder with 32K context.
A strong 0.268B-parameter dense embedding model from Microsoft. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Microsoft’s harrier-oss-v1-270m is a dense multilingual text embedding model that packs a 32,768-token context window into just 0.268B parameters. It’s the smallest member of the Harrier family, designed for edge deployment and local inference where low latency and small memory footprint are non‑negotiable. Licensed under MIT, it produces 640‑dimensional vectors using a decoder‑only architecture with last‑token pooling and L2 normalization.
What makes this model worth a look: it achieves a 66.5 MTEB v2 score (multilingual leaderboard) at a size that fits comfortably on consumer hardware. That score is competitive with much larger encoders, and the 32k context means you can embed entire documents without chunking – a practical win for retrieval‑augmented generation (RAG), cross‑lingual search, and any pipeline that currently spends engineering effort on sliding windows.
Harrier-oss-v1-270m is a dense, decoder‑only transformer – the same causal architecture used by modern LLMs, adapted for embedding tasks. The model uses:
The 32,768‑token (32k) context is the standout feature. Most embedding models in this parameter range cap at 512 or 1,024 tokens. With 32k you can embed technical documentation, legal contracts, or code repositories without splitting into overlapping chunks – reducing complexity and preserving cross‑sectional context.
VRAM requirements at full precision (fp32) are roughly 1.1 GB. With common quantization:
| Quantization | VRAM (approx) | Use Case |
|---|---|---|
| fp16 (default) | ~540 MB | Highest accuracy, most environments |
| q8_0 (8‑bit) | ~280 MB | Minimal accuracy loss, half the memory |
| q4_0 / q4_K_M | ~150 MB | Aggressive compression for edge devices |
A Q4_K_M quant fits in less than 200 MB – easily co‑resident with an LLM or other services on a single GPU.
Harrier-oss-v1-270m excels at multilingual retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. It supports over 100 languages (including low‑resource ones like Amharic, Somali, and Yoruba) – the language tag list in the model card spans from Afrikaans to Zulu.
Concrete use cases:
The model is instruction‑aware: you can prepend task‑specific prompts (e.g., “Instruct: Retrieve semantically similar text\nQuery: …”) to steer the embedding space for better retrieval precision. The official release includes pre‑configured prompts for web_search_query, sts_query, bitext_query, and others.
This is where the small parameter count shines. You don’t need a datacenter GPU.
On a single RTX 4090, encoding a batch of 32 documents of 512 tokens each reaches roughly 8,000–10,000 tokens per second (fp16). On an M4 Max, expect 5,000–7,000 tokens/s. For single‑query inference (common in search), latency is under 10 ms.
Best quantization for most users: Q4_K_M – you lose less than 1% MTEB score while cutting memory in half. Use q8_0 if you’re deploying on server hardware with spare VRAM.
The quickest path is via sentence-transformers:
1from sentence_transformers import SentenceTransformer2model = SentenceTransformer("microsoft/harrier-oss-v1-270m", model_kwargs={"dtype": "auto"})
Or directly with transformers + torch (see the Hugging Face model card for the full pooling code). If you use Ollama, check whether the model is available in their library; if not, you can build a custom Modelfile from the HF weights.
| Model | Parameters | Dims | Context | MTEB v2 (multilingual) | VRAM (fp16) |
|---|---|---|---|---|---|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 | ~540 MB |
| intfloat/e5-small-v2 | 118M | 384 | 512 | ~62 (English MTEB) | ~240 MB |
| BAAI/bge-small-en-v1.5 | 102M | 384 | 512 | ~58 (English MTEB) | ~200 MB |
| jina-embeddings-v2-small-en | 33M | 512 | 8,192 | ~55 (English MTEB) | ~70 MB |
When to choose harrier-oss-v1-270m: You need multilingual support (not just English) and a 32k context window without moving to a larger model. It outperforms equivalently sized encoders like e5‑small and BGE‑small on multilingual and cross‑lingual tasks, especially when documents exceed 512 tokens.
When to choose an alternative: If you only need English retrieval and your documents are short, models like all-MiniLM-L6-v2 (80M, 384 dim) are faster and even smaller. For extreme speed on CPU, jina-embeddings-v2-small-en (33M) with 8k context is a good trade‑off – but it’s English‑only and scores lower on benchmarks. For maximum accuracy, the larger harrier (0.6B or 27B) variants offer higher MTEB scores at the cost of more VRAM.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Microsoft model we track.