
MoE Mixtral-8x7B unified embedding+generation model, best-in-class open generation, competitive on MTEB.
A workable 57.9B-parameter MoE embedding model from Contextual AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 8 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA L4Vast.ai · Spot · 24 GB VRAM | $0.03 |
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM | $0.03 |
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
GritLM-8x7B is a Mixture-of-Experts (MoE) model developed by Contextual AI that pulls double duty: it’s both a state-of-the-art text embedder and a strong open-source language generator. The model uses the Generative Representational Instruction Tuning (GRIT) method, which trains a single set of weights to handle embedding and generation through instruction‑based routing. This means you don’t need separate models for retrieval and response — one model does both.
With 57.9B total parameters but only 13B active per forward pass, GritLM-8x7B targets users who want best-in-class open generation without paying the full compute cost of a dense 57B model. On the Massive Text Embedding Benchmark (MTEB) it is competitive with the best dedicated embedding models, while on generative tasks it outperforms every open model up to its size that we tested. It’s released under Apache 2.0, making it free to use, modify, and deploy in production.
The model is a text‑only system. It excels at tasks requiring both retrieval (via embeddings) and generation — most notably retrieval‑augmented generation (RAG) — because the unified architecture can cut pipeline latency by over 60% for long documents, compared to using separate retrieval and generation models.
GritLM-8x7B is based on an 8‑expert MoE transformer with 57.9B total parameters. Only 13B parameters are active per token, which is the effective computational cost for inference. This design balances the quality of a large model with the throughput of a much smaller one:
The active‑parameters count is the key number for hardware planning. A dense 57B model would require around 110 GB of VRAM in FP16, while GritLM-8x7B’s MoE sparsity brings the memory footprint closer to that of a 13B–15B dense model (plus overhead for expert weights). In practice, you can run the full FP16 model on a single high‑VRAM GPU — but only at aggressive quantization.
Context length is not officially specified, but the model inherits the 32K context length typical of its Mixtral‑8x7B base (confirm with your own testing for long‑context tasks). The tokenizer is the same as Mistral‑7B’s (v1).
GritLM-8x7B is trained for two modes: generate and embed. You choose the mode via a special instruction prefix. This makes it uniquely suited for:
text-embedding-ada-002 or BGE model required. The pipeline becomes simpler, and for long documents the end‑to‑end latency drops by more than 60%.Concrete use cases include: building a local RAG system for internal documentation, creating a semantic search engine over your codebase, or powering a chatbot that also indexes chat history for retrieval.
Running a 57.9B MoE model on consumer hardware is feasible if you quantize well. The MoE architecture means you must load all 57.9B parameters into memory (even though only 13B are computed per token). Here’s the realistic breakdown:
| Quantization | VRAM (approx.) | Performance / quality |
|---|---|---|
| FP16 (full) | ~116 GB | Best quality, requires A100 (80 GB) × 2 or similar |
| Q8_0 | ~58 GB | Near‑lossless, possible on RTX 6000 Ada (48 GB) with offloading or dual GPU |
| Q4_K_M | ~33 GB | Good quality, fits single RTX 4090 (24 GB) only partially – requires CPU offload or dual GPU |
| Q3_K_M | ~26 GB | Lower quality, fits RTX 4090 with heavy offloading |
Recommended for most practitioners: Q4_K_M on a dual‑GPU setup or Q3_K_M on a single 24 GB card with aggressive offloading. If you have an Apple M4 Max with 128 GB unified memory, you can run Q8_0 comfortably.
On a single RTX 4090 with Q3_K_M and CPU offloading, expect 2–4 tokens/second for generation. On two RTX 4090s with Q4_K_M, you’ll see 8–12 tokens/second. For embedding (batch size 1), throughput is higher because the forward pass is lighter (no autoregressive generation). Use llama.cpp or Ollama to get started quickly — Ollama supports GritLM-8x7B via gritlm/gritlm-8x7b and handles quantization automatically.
1ollama run gritlm/gritlm-8x7b
This downloads the Q4_K_M quantized model (about 18 GB download) and runs it with sensible defaults. For embedding, call the API with the appropriate instruction prefix.
GritLM-8x7B lives in the awkward zone between a 13B dense model and a 70B dense model. Here’s how it stacks up against two realistic alternatives:
tulu2 data.When to choose GritLM-8x7B: You need both retrieval and generation in a single OSS model, and you have the hardware to run a 33+ GB quantized model. The primary advantage is pipeline simplification and latency reduction for local RAG.
When to avoid it: If your use case is purely generative and you can’t spare 33 GB of VRAM, a 7B or 13B dense model will be faster and easier to deploy. If you only need embeddings, a dedicated 384‑parameter embedder is cheaper and faster.