
Microsoft's compact 0.6B multilingual embedder distilled for mid-tier and CPU deployments.
A strong 0.596B-parameter dense embedding model from Microsoft. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Microsoft’s harrier-oss-v1-0.6b is a compact multilingual text embedding model built for mid-tier hardware and CPU-based inference. At just 0.596B parameters, it is an embedder rather than a generative LLM, optimized for producing dense vector representations from text. It is the middle sibling in Microsoft’s open-source Harrier family, which also includes a 270M and a 27B variant. What sets this model apart is its coverage of 94+ languages, its 32,768-token context window, and an MTEB v2 score of 69.0, state of the art for its size class.
This model targets practitioners who need semantic search, retrieval-augmented generation (RAG), clustering, classification, or reranking – but cannot justify the VRAM or latency of larger embedders like the 27B Harrier or proprietary alternatives. With an MIT license and support for both Sentence Transformers and raw Transformers, it is a drop-in replacement for many existing embedding pipelines, especially in multilingual environments.
Harrier-oss-v1-0.6b uses a dense decoder-only architecture with last-token pooling and L2 normalization to produce 1,024-dimensional embeddings. Unlike mixture-of-experts (MoE) models, all 0.596B parameters are active during inference. This means a single forward pass consumes the full compute, but the model is small enough to fit comfortably on most consumer GPUs and even run on CPU with acceptable throughput.
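The pooling step described above can be illustrated with a minimal sketch. It uses plain Python lists and toy 2-dimensional hidden states rather than the model's real 1,024-dimensional outputs, and the function names `last_token_pool` and `l2_normalize` are illustrative, not taken from the model's code.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last real (non-padding) token.

    hidden_states: list of per-token vectors for one sequence.
    attention_mask: list of 1 (real token) / 0 (padding), same length.
    """
    # Highest index whose mask is 1; works for left or right padding.
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Toy sequence: 4 positions, 2-dimensional states, final position is padding.
hidden = [[0.1, 0.2], [0.5, 0.5], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden, mask))
print(embedding)  # [0.6, 0.8]
```

Because the output has unit length, a dot product between two such embeddings equals their cosine similarity.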
Key specs:
Libraries: sentence-transformers, transformers
The context window of 32K tokens is unusually large for a model of this size. Most sub-1B embedders cap out at 512 or 8,192. This enables processing long documents, code files, or multi-turn conversation histories without chunking, a significant advantage for RAG pipelines that index long-form content.
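As a rough illustration of what the 32,768-token window means for an indexing pipeline, the sketch below splits a document only when it exceeds the window. The helper `split_for_embedding` and its `overlap` parameter are hypothetical, and the token IDs are stand-in integers, not output from the model's tokenizer.

```python
MAX_TOKENS = 32_768  # context window stated above

def split_for_embedding(token_ids, max_tokens=MAX_TOKENS, overlap=0):
    """Return one window if the document fits, otherwise several windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    if len(token_ids) <= max_tokens:
        return [token_ids]
    step = max_tokens - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_tokens])
        # Stop once the current window reaches the end of the document.
        if start + max_tokens >= len(token_ids):
            break
        start += step
    return windows

short_doc = list(range(20_000))  # fits in a single pass
long_doc = list(range(70_000))   # exceeds the window

print(len(split_for_embedding(short_doc)))              # 1
print([len(w) for w in split_for_embedding(long_doc)])  # [32768, 32768, 4464]
```

With a 512-token embedder, the same 20,000-token document would need about 40 chunks, each embedded and stored separately.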
The model is a text-only embedder suited for any task that requires converting text into a fixed-length vector. The official model card lists retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. Here are concrete use cases relevant to local deployments:
The model is not designed for generative tasks (no chat completion, no code generation). It outputs only embeddings.
This is where the model shines for practitioners on limited hardware.
At full precision (FP32), the model weights occupy approximately 2.4 GB (0.596B × 4 bytes). With typical overhead for tokenizers and intermediate activations, expect peak usage around 3–4 GB for a single batch.
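The weight-size arithmetic above extends to other precisions. This sketch covers weights only; activations and runtime overhead are excluded. The "4-bit (nominal)" row assumes exactly 0.5 bytes per parameter, which real formats such as Q4_K_M exceed somewhat.

```python
PARAMS = 0.596e9  # parameter count stated above

# Bytes per parameter for common precisions (4-bit is a nominal figure).
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "INT8": 1.0,
    "4-bit (nominal)": 0.5,
}

def weight_footprint_gb(params, bytes_per_param):
    """Approximate size of the weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9

for name, width in BYTES_PER_PARAM.items():
    print(f"{name:>16}: {weight_footprint_gb(PARAMS, width):.2f} GB")
```

The FP32 row reproduces the roughly 2.4 GB figure quoted above; halving the precision halves the weight footprint.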
llama.cpp-based quantized variants. The 0.6B size makes it one of the few large-context embedders that still feel interactive on CPU.
llama.cpp or the mlx framework. Expect 100–200 tokens/second on M4 Max.
The quickest path is via Ollama. The Harrier family is not yet in the official Ollama library, but Ollama can pull GGUF repositories from Hugging Face through the `hf.co/` prefix. This requires a repository that hosts GGUF files; substitute one for the placeholder below:
```shell
ollama pull hf.co/<user>/<harrier-gguf-repo>
```
Or use Sentence Transformers directly:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("microsoft/harrier-oss-v1-0.6b", model_kwargs={"dtype": "auto"})
```
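Once the model is loaded, a typical next step is ranking documents against a query. The sketch below is one way to do that, not the model card's recommended recipe: `rank_documents` and `demo` are illustrative names, the sample texts are made up, and any query prompt the model card may recommend is omitted. It relies only on `SentenceTransformer.encode` with `normalize_embeddings=True`.

```python
def rank_documents(model, query, documents):
    """Sort documents by cosine similarity to the query, best first."""
    # normalize_embeddings=True yields unit-length vectors, so the dot
    # product below equals cosine similarity.
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    scores = [
        float(sum(q * d for q, d in zip(query_vec, vec)))
        for vec in doc_vecs
    ]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

def demo():
    """Run against the real model (installs and downloads required)."""
    # Requires `pip install sentence-transformers`; weights download on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "microsoft/harrier-oss-v1-0.6b", model_kwargs={"dtype": "auto"}
    )
    docs = [
        "The library opens at nine in the morning.",
        "La biblioteca abre a las nueve de la mañana.",
        "GPU prices dropped again this quarter.",
    ]
    for doc, score in rank_documents(model, "When does the library open?", docs):
        print(f"{score:.3f}  {doc}")
```

Calling `demo()` runs the ranking end to end. `rank_documents` accepts any object with a compatible `encode` method, which makes it easy to swap in another embedder for comparison.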
For maximum performance on CPU, convert to GGUF format (e.g., Q4_K_M) and run via llama.cpp:
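```shell
# Download the weights (requires git-lfs).
git clone https://huggingface.co/microsoft/harrier-oss-v1-0.6b
# Run from a llama.cpp checkout; assumes the converter supports this architecture.
python convert_hf_to_gguf.py harrier-oss-v1-0.6b --outfile harrier-oss-v1-0.6b.gguf --outtype f16
./llama-quantize harrier-oss-v1-0.6b.gguf harrier-oss-v1-0.6b-Q4_K_M.gguf Q4_K_M
```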
Then load with llama-embedding or the Python binding.
| Hardware | Quantization | Tokens/sec (seq len 1K) |
|---|---|---|
| RTX 4090 | FP16 | 800–1,100 |
| RTX 3060 | Q4_K_M | 300–500 |
| M4 Max (64 GB) | Q4_K_M | 150–250 |
| Intel i7-13700 (CPU) | Q4_K_M | 40–70 |
For purely CPU deployments, the 0.6B parameter count of Harrier makes it a practical choice – most 7B-class embedders are unusable on CPU for real-time retrieval.
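To put the throughput table in practical terms, the sketch below converts tokens/sec into indexing time. The throughput ranges are copied from the table above; the corpus size is a hypothetical example, and batching and I/O overhead are ignored.

```python
# Tokens/sec ranges copied from the table above (sequence length 1K).
THROUGHPUT = {
    "RTX 4090 (FP16)": (800, 1100),
    "RTX 3060 (Q4_K_M)": (300, 500),
    "M4 Max (Q4_K_M)": (150, 250),
    "Intel i7-13700 CPU (Q4_K_M)": (40, 70),
}

def indexing_hours(total_tokens, tokens_per_sec):
    """Hours needed to embed total_tokens at a steady throughput."""
    return total_tokens / tokens_per_sec / 3600

# Hypothetical corpus: 10,000 documents averaging 1,000 tokens each.
corpus_tokens = 10_000 * 1_000

for hardware, (low, high) in THROUGHPUT.items():
    fastest = indexing_hours(corpus_tokens, high)
    slowest = indexing_hours(corpus_tokens, low)
    print(f"{hardware:<30} {fastest:5.1f} to {slowest:5.1f} hours")
```

Query-time latency is a separate question: a single short query involves far fewer tokens than a bulk indexing job, so the CPU figures matter most for initial ingestion.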
