
RTEB #1 domain-tuned 8B retrieval embedder from Octen AI, a LoRA fine-tune of Qwen3-Embedding-8B.
A solid 7.6B-parameter dense embedding model from Octen AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Octen-Embedding-8B is a domain-tuned text embedding model developed by Octen AI. As of January 2026 it ranks #1 on the RTEB Leaderboard with a Mean (Task) score of 0.8045. It is a dense 7.6B-parameter model built as a LoRA fine-tune of Qwen3-Embedding-8B, targeting high-precision retrieval for industry workloads. It competes directly with closed-source embedding APIs such as Voyage-3-large and Cohere Embed v4, with the advantage of being fully open-source under Apache 2.0: you can run it on your own hardware with no per-query costs and without sending data to a third party.
The embedding space is 4096 dimensions, and the model supports a 32,768 token context length, making it suitable for long-document retrieval in legal, medical, and financial settings. If you need a local alternative to API-based embedders that matches or beats top commercial offerings on public and private benchmarks, this is the model to evaluate.
Octen-Embedding-8B is a dense transformer with 7.6B parameters, not MoE. Every inference call uses the full parameter count, so VRAM consumption is predictable: approximately 15–16 GB in FP16, 8–9 GB in INT8, and around 5–6 GB with 4-bit quantization (e.g., Q4_K_M). The architecture inherits Qwen3’s dual-encoder design for symmetric (query-document) and asymmetric retrieval, with a LoRA adapter trained on domain-specific data.
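As a rough check on the VRAM figures above, weight memory can be approximated as parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: it ignores activations and runtime overhead, and the 4.85 bits-per-weight figure for Q4_K_M is an assumed average, not a value from Octen AI.

```python
# Weight-only memory estimate: parameters x bits per weight.
# Excludes activations, buffers, and framework overhead, so real usage
# lands somewhat above these numbers.

PARAMS = 7.6e9  # parameter count stated in the text

# Bits per weight. The Q4_K_M value is an assumed average, because that
# format mixes block types; treat it as approximate.
BITS_PER_WEIGHT = {"FP16": 16.0, "INT8": 8.0, "Q4_K_M": 4.85}


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

The FP16 weights alone come to roughly 15.2 GB, which is consistent with the 15–16 GB range once overhead is included. The quantized estimates likewise sit a little below the quoted ranges.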
Quantized variants include an official INT8 build (Octen/Octen-Embedding-8B-INT8), plus community Q4_K_M and Q5_K_M through GGUF.
The dense architecture means you get full performance on every request, with no routing overhead or active-parameter variance. For local deployment, the trade-off is higher VRAM than a comparable MoE model, but the retrieval quality is state-of-the-art.
Octen-Embedding-8B excels in domain-specific retrieval where generic embeddings fall short. Octen AI tuned it on vertical datasets spanning four key areas:
The model supports 100+ natural languages and several programming languages, with strong cross-lingual and multilingual retrieval performance. It achieves a Public dataset score of 0.7953 and Private dataset score of 0.8157 on RTEB, indicating minimal overfitting. Use it for:
This is where Octen-Embedding-8B differentiates itself from API-only models. You can deploy it on consumer GPUs, and the open license means no rate limits or data privacy risks.
| Quantization | VRAM (approx.) | Recommended Hardware |
|---|---|---|
| FP16 (full) | 15–16 GB | RTX 4080 Super (16GB), RTX 4090 (24GB), M4 Max (64GB unified) |
| INT8 | 8–9 GB | RTX 4060 Ti 16GB, RTX 3080 10GB (with swap) |
| Q4_K_M | 5–6 GB | RTX 3060 12GB, Apple Silicon 18GB+ unified memory |
For most users, Q4_K_M is the sweet spot: it preserves retrieval quality (within 1–2% of FP16 on MTEB benchmarks) while fitting on widely available cards like the RTX 3060 12GB. If you have a 24GB card like the RTX 4090, INT8 or even FP16 is feasible and maximizes precision for edge-case queries.
Embedding throughput is measured in input tokens processed per second, not in generation speed. On a single RTX 4090:
Batch size affects throughput substantially. With a batch size of 16 and Q4_K_M, you can embed ~25,000 tokens per second. On an M4 Max (64GB unified), expect ~600–800 tokens/second in FP16, which is sufficient for real-time retrieval in most RAG setups.
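To put these throughput figures in context, the following sketch estimates how long a one-off corpus indexing job might take. The corpus size and average document length are hypothetical, and the calculation assumes sustained throughput at the quoted rates, which real workloads may not achieve.

```python
def embedding_time_hours(num_docs: int, avg_tokens_per_doc: int,
                         tokens_per_second: float) -> float:
    """Estimate wall-clock hours to embed a corpus at a steady token rate."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / tokens_per_second / 3600


# Hypothetical corpus: 1,000,000 documents averaging 500 tokens each.
corpus = {"num_docs": 1_000_000, "avg_tokens_per_doc": 500}

# Throughput figures quoted in the text (tokens per second).
scenarios = [
    ("Q4_K_M, batch size 16", 25_000),
    ("M4 Max, FP16 (low end of range)", 600),
]

for label, tps in scenarios:
    hours = embedding_time_hours(tokens_per_second=tps, **corpus)
    print(f"{label}: {hours:.1f} h")
```

Under these assumptions the batched Q4_K_M setup finishes in a few hours, while the M4 Max figure is better suited to query-time embedding than to bulk indexing.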
The fastest way to run Octen-Embedding-8B locally is via Ollama (once a GGUF variant is available). Alternatively, use the official INT8 model from Hugging Face with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

# Load the INT8 variant from the Hugging Face Hub.
model = SentenceTransformer("Octen/Octen-Embedding-8B-INT8")
# encode() takes a list of strings and returns one vector per input.
embeddings = model.encode(["Your query text here"])
```
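Once you have embeddings, retrieval typically comes down to ranking documents by similarity to the query vector. The sketch below shows cosine-similarity ranking in plain Python, using toy 3-dimensional vectors as stand-ins. In practice the vectors would come from `model.encode(...)` and have 4096 dimensions, and `rank_documents` is a hypothetical helper, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real embeddings.
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

for index, score in rank_documents(query, docs):
    print(f"doc {index}: {score:.3f}")
```

For large corpora, a vector index or a batched matrix multiply would replace the Python loop, but the ranking logic is the same.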
For production deployments, consider serving the model through vLLM's embedding (pooling) support or a custom Triton Inference Server setup.
If you need to run a 7.6B model on a consumer GPU like the RTX 3060, use Q4_K_M and keep the batch size under 8. For latency-sensitive applications, the INT8 variant on an RTX 4080 Super is a solid middle ground.
Octen-Embedding-8B competes primarily with two models at similar scale:
vs Voyage-3-large (API-only, 1024 dim)
Voyage-3-large scores 0.7812 Mean (Task) on RTEB vs Octen’s 0.8045. Octen also offers 4× the embedding dimensions (4096 vs 1024), which can improve recall on fine-grained retrieval. The trade-off: Octen requires local hardware and is larger (7.6B vs Voyage’s ~1.5B). If you need zero-maintenance cloud retrieval, Voyage is simpler; if you want control and better scores, Octen wins.
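The 4× difference in dimensions also has a storage cost, which is worth weighing alongside the recall benefit. The sketch below estimates raw vector storage, assuming uncompressed float32 values and a hypothetical index of 10 million vectors. Neither assumption comes from the source, and index structures or quantization would change the totals.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes (float32 by default)."""
    return num_vectors * dims * bytes_per_value / 1e9


NUM_VECTORS = 10_000_000  # hypothetical corpus size

for label, dims in [("Octen-Embedding-8B", 4096), ("Voyage-3-large", 1024)]:
    print(f"{label} ({dims} dims): {index_size_gb(NUM_VECTORS, dims):.2f} GB")
```

At this scale the 4096-dimension index needs about 164 GB of raw vector storage versus about 41 GB for 1024 dimensions, so the larger embedding space quadruples the memory or disk budget for the index.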
vs Qwen3-Embedding-8B (base model)
Octen-Embedding-8B is a refined version of Qwen3-Embedding-8B, which scores 0.7547. Octen’s domain tuning adds ~5 points on Mean (Task) and significantly improves performance on legal, finance, and medical tasks. If you’re already using Qwen3-Embedding-8B, upgrading to Octen costs nothing (license compatibility) and gives you a measurable quality lift without changing your inference stack.
When to choose Octen-Embedding-8B:
When to choose an alternative: