
Microsoft's MIT-licensed 27B decoder-only multilingual embedding model, #1 on Multilingual MTEB v2 at release.
A solid 27B-parameter dense embedding model from Microsoft. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 16 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | On-Demand | 24 GB | $0.13 |
| NVIDIA GeForce RTX 3090 | Vast.ai | On-Demand | 24 GB | $0.13 |
| NVIDIA GeForce RTX 3090 | RunPod | Community | 24 GB | $0.22 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Microsoft’s harrier-oss-v1-27b is a 27-billion-parameter decoder-only multilingual embedding model that topped the Multilingual MTEB v2 leaderboard at release with a score of 74.3. It is the largest variant in the open-source Harrier family, built from fine-tuned Gemma and Qwen architectures. Unlike most embedding models that rely on encoder-only designs (e.g., BERT-based), Harrier uses a causal decoder—the same backbone as many modern LLMs—adapted with last-token pooling and L2 normalization to produce dense text vectors.
This model fills a specific gap: high-quality, open-weight multilingual embeddings that can process extremely long inputs (up to 32,768 tokens) without aggressive chunking. It competes directly with proprietary offerings like zembed-1 and voyage-4, as well as with other open multilingual embedding models such as BGE-M3 and the multilingual E5 variants, which are considerably smaller. For teams building retrieval-augmented generation (RAG) pipelines, cross-lingual search, or enterprise document classification, harrier-oss-v1-27b offers a state-of-the-art baseline under the MIT license, which permits commercial use and deployment as long as the copyright and license notice are retained.
Harrier-oss-v1-27b is a dense decoder-only model with no mixture-of-experts (MoE) routing. Every forward pass activates all 27 billion parameters, so it needs more compute per token than an MoE model with a similar total parameter count. In exchange, a dense model has no expert routing to account for, which tends to make its quantization and latency behavior easier to predict, and that matters for production reliability.
Key specifications:
The 32k context length is a standout feature. Many embedding models cap input at 512 or 8,192 tokens. With 32,768 tokens, you can embed a full legal contract, research paper, or large source file in a single pass, which avoids chunking (and the retrieval fragility that comes with it) for any document that fits within the limit. Longer inputs still need to be split. The long context also increases memory usage, because activation memory grows with sequence length, so plan for that tradeoff during local deployment.
The model uses last_token_pooling rather than mean pooling. Because the decoder is causal, the last token’s hidden state encapsulates the full sequence representation. This design simplifies inference and is well-supported by the sentence-transformers library and Transformers pipelines.
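As a rough illustration of last-token pooling followed by L2 normalization, here is a minimal sketch in plain Python. The function names `last_token_pool` and `l2_normalize`, and the toy hidden states, are illustrative assumptions and are not taken from the model's own code; a real pipeline would do this on tensors inside sentence-transformers or Transformers.

```python
import math
from typing import List, Sequence


def last_token_pool(hidden_states: Sequence[Sequence[float]],
                    attention_mask: Sequence[int]) -> List[float]:
    """Return the hidden state of the last non-padding token.

    hidden_states: one vector per token position, shape [seq_len][dim].
    attention_mask: 1 for real tokens, 0 for padding.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("attention_mask contains no real tokens")
    # Taking the last masked-in index works for left- or right-padded input.
    return list(hidden_states[real_positions[-1]])


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy example: 4 positions, 2 dimensions, last position is padding.
toy_hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]
embedding = l2_normalize(last_token_pool(toy_hidden, toy_mask))
print(embedding)  # [0.6, 0.8]
```

Because the output is unit-length, a plain dot product between two such embeddings equals their cosine similarity.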
Harrier-oss-v1-27b is a multilingual text embedding engine optimized for retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. It supports over 100 languages—from English and Chinese to Amharic, Mongolian, and Yoruba. This makes it a strong candidate for global enterprise search, cross-lingual RAG, and document deduplication across language boundaries.
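For the retrieval case, ranking usually reduces to scoring every document vector against the query vector and keeping the top results. The sketch below assumes the embeddings are already L2-normalized, so the dot product is the cosine similarity. The helper `rank_by_similarity` and the two-dimensional toy vectors are hypothetical stand-ins and are not real model output.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec: Sequence[float],
                       doc_vecs: Sequence[Sequence[float]],
                       top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, highest score first."""
    scored = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy unit vectors standing in for a query and three documents,
# which could be written in different languages.
query = [1.0, 0.0]
docs = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
top = rank_by_similarity(query, docs, top_k=2)
print(top)  # [(1, 1.0), (0, 0.6)]
```

At production scale the same scoring is normally delegated to a vector index rather than a Python loop.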
Concrete use cases:
In head-to-head evaluations on 28 graded datasets (using continuous relevance scores), harrier-oss-v1-27b narrowly outperforms voyage-4 on average NDCG@10 (0.706 vs 0.702) but trails zembed-1 (0.715). Among tasks that require fine-grained relevance ranking, it performs best on datasets with long-form queries and documents.
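For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 over graded (continuous) relevance scores. It uses the linear-gain form of DCG. Whether the evaluation cited here used linear or exponential gain is not stated, so treat that choice, along with the function names, as assumptions.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Linear-gain DCG: sum of rel_i / log2(i + 1) over ranks i = 1..k."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked: Sequence[float],
              judged: Optional[Sequence[float]] = None,
              k: int = 10) -> float:
    """NDCG@k for one query.

    ranked: relevance of each returned document, in the system's order.
    judged: relevance of every judged document for the query; if omitted,
            the ideal ranking is built from `ranked` alone.
    """
    pool = ranked if judged is None else judged
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant documents, so the score is defined as 0
    return dcg_at_k(ranked, k) / ideal


perfect = ndcg_at_k([1.0, 0.5, 0.0])   # already in ideal order
reversed_ = ndcg_at_k([0.0, 0.5, 1.0])  # worst order for the same labels
print(perfect, round(reversed_, 3))
```

A benchmark average such as the 0.706 quoted above would be the mean of this per-query value across queries and datasets.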
This is a large model. Running it locally requires serious hardware, but it is feasible with modern consumer GPUs using quantization.
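A back-of-the-envelope estimate shows why quantization matters at this size. The sketch below computes weight memory as parameters times bits per weight. The bits-per-weight values for Q4_K_M and Q3_K_M are approximate figures assumed for illustration, the 2 GB overhead default is a placeholder, and the estimate ignores activations and context-dependent buffers, so actual usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_vram(weights_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check; overhead_gb is an assumed allowance for runtime buffers."""
    return weights_gb + overhead_gb <= vram_gb


PARAMS = 27e9  # 27 billion parameters, all active (dense model)

# Approximate average bits per weight; assumed values, not measured here.
APPROX_BITS = {"FP16": 16.0, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

for name, bits in APPROX_BITS.items():
    gb = weight_memory_gb(PARAMS, bits)
    print(f"{name}: ~{gb:.1f} GB weights, fits in 24 GB: {fits_in_vram(gb, 24.0)}")
```

Under these assumptions FP16 weights alone come to about 54 GB, well beyond a single 24 GB card, while a 4-bit quantization leaves some headroom on it.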
With Q4_K_M on an RTX 4090, expect:
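Run it with llama.cpp, or with sentence-transformers using model_kwargs={"dtype": "auto"} (FP16 is supported if you have the VRAM).
Q4_K_M offers a good balance of quality (near FP16) and memory savings. If you need a smaller footprint, try Q3_K_M. Note that a 27B model at that level still needs roughly 13 GB for weights alone, so a 12 GB GPU will likely require offloading some layers to the CPU.
To run with Ollama: ollama run microsoft/harrier-oss-v1-27b:q4_K_M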
For most practitioners building local RAG systems with a 24 GB GPU, harrier-oss-v1-27b in Q4_K_M is the strongest open embedding model available today.
