Octen AI

Octen-Embedding-8B

RTEB #1 domain-tuned 8B retrieval embedder from Octen AI, a LoRA fine-tune of Qwen3-Embedding-8B.

7.6B params · Dense

Our Take

Best for: Open-source text embedding workloads

A solid 7.6B-parameter dense embedding model from Octen AI. Treat the per-task benchmarks on this page as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 7.6B
Active Params: 6.9B
Architecture: Dense
Provider: Octen AI
Download Size: 30.3 GB

Community

Monthly Downloads: 6.4K
Likes: 173
Last Updated: 4 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

Task	Score
Retrieval	71.6
Classification	66.7
Clustering	55.7
STS	81.3

MBA Open Score

Overall: 55.2 (BB)

Component	Weight	Score
Benchmark	60%	68.8
Popularity	25%	31.1
Efficiency	15%	40.7
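The three component scores and their weights reproduce the overall MBA Open Score if the composite is a plain weighted sum. That formula is an assumption inferred from the numbers on this page, not a published definition. A minimal sketch (the function name `composite_score` is illustrative):

```python
# Component scores and weights as listed on this page.
components = {
    "benchmark": (68.8, 0.60),
    "popularity": (31.1, 0.25),
    "efficiency": (40.7, 0.15),
}


def composite_score(parts):
    """Weighted sum of (score, weight) pairs; weights must sum to 1."""
    total_weight = sum(weight for _, weight in parts.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * weight for score, weight in parts.values())


# 41.28 + 7.775 + 6.105 = 55.16, which rounds to the listed 55.2.
print(round(composite_score(components), 1))
```

Under this assumption, the popularity and efficiency components pull the overall score well below the benchmark component alone.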

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices (first 25 listed)


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	4.7 GB
Acer Veriton GN100 AI Mini	Acer	SS	4.7 GB
AMD Instinct MI300X	AMD	SS	4.7 GB
AMD Instinct MI325X	AMD	SS	4.7 GB
AMD Instinct MI355X	AMD	SS	4.7 GB
AMD Radeon RX 7600 8GB	AMD	SS	4.7 GB
AMD Radeon RX 7700 XT	AMD	SS	4.7 GB
AMD Radeon RX 7800 XT	AMD	SS	4.7 GB
AMD Radeon RX 7900 XT	AMD	SS	4.7 GB
AMD Radeon RX 7900 XTX	AMD	SS	4.7 GB
AMD Radeon RX 9070	AMD	SS	4.7 GB
AMD Radeon RX 9070 XT	AMD	SS	4.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	4.7 GB
Apple M4	Apple	SS	4.7 GB
Apple M4 Max (40-core GPU)	Apple	SS	4.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	4.7 GB
Apple M5	Apple	SS	4.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	4.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	4.7 GB
Apple Mac Mini (M1, 2020)	Apple	SS	4.7 GB
Apple Mac Mini (M2, 2023)	Apple	SS	4.7 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	4.7 GB
Apple Mac Mini (M4, 2024)	Apple	SS	4.7 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	4.7 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	4.7 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11
NVIDIA GeForce RTX 3070	RunPod	Community	8 GB	$0.13
NVIDIA GeForce RTX 3070	RunPod	Spot	8 GB	$0.13
NVIDIA GeForce RTX 5090	Vast.ai	Spot	32 GB	$0.13
NVIDIA GeForce RTX 4090	Vast.ai	Spot	24 GB	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.


About This Model

Overview

Octen-Embedding-8B is a domain-tuned text embedding model developed by Octen AI, currently ranking #1 on the RTEB Leaderboard (as of January 2026) with a Mean (Task) score of 0.8045. It is a dense 7.6B parameter model built as a LoRA fine-tune of Qwen3-Embedding-8B, targeting high-precision retrieval for industry workloads. This model competes directly with closed-source embedding APIs like Voyage-3-large and Cohere Embed v4, but with the advantage of being fully open-source under Apache 2.0 — meaning you can run it on your own hardware without per-query costs or data leakage.

The embedding space is 4096 dimensions, and the model supports a 32,768 token context length, making it suitable for long-document retrieval in legal, medical, and financial settings. If you need a local alternative to API-based embedders that matches or beats top commercial offerings on public and private benchmarks, this is the model to evaluate.

Architecture & Technical Details

Octen-Embedding-8B is a dense transformer with 7.6B parameters, not a mixture-of-experts (MoE) model. Every inference call uses the full parameter count, so VRAM consumption is predictable: approximately 15–16 GB in FP16, 8–9 GB in INT8, and around 5–6 GB with 4-bit quantization (e.g., Q4_K_M). Like its Qwen3-Embedding-8B base, it is used as a dual encoder: queries and documents are embedded independently and compared by vector similarity, which covers both asymmetric (query-to-document) and symmetric (text-to-text) retrieval. The domain adaptation comes from a LoRA adapter trained on domain-specific data.
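The VRAM figures above follow roughly from parameter count times bytes per weight. The sketch below covers the weights only; activations and runtime overhead come on top, which is why the quoted ranges sit slightly higher. The 4.85 bits per weight used for Q4_K_M is an assumed average for that mixed-precision format, and `weight_memory_gb` is an illustrative helper name.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7.6e9  # parameter count stated on this page

# 16 and 8 bits are exact for FP16 and INT8. The Q4_K_M value is an
# assumption: the format mixes block types, so its average is above 4 bits.
for name, bits in [("FP16", 16), ("INT8", 8), ("Q4_K_M", 4.85)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, bits):.1f} GB of weights")
```

With these inputs the weights come to about 15.2 GB, 7.6 GB, and 4.6 GB respectively, in line with the ranges quoted above once overhead is added.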

Embedding dimension: 4096
Max tokens: 32,768
Base model: Qwen3-Embedding-8B
Fine-tuning method: LoRA (low-rank adaptation)
Supported quantization: INT8 (available via HuggingFace Octen/Octen-Embedding-8B-INT8), plus community Q4_K_M and Q5_K_M through GGUF.

The dense architecture means you get full performance on every request — no routing overhead or active-parameter variance. For local deployment, the trade-off is higher VRAM than a comparable MoE model, but the retrieval quality is state-of-the-art.

Capabilities & Use Cases

Octen-Embedding-8B excels in domain-specific retrieval where generic embeddings fall short. Octen AI tuned it on vertical datasets spanning four key areas:

Legal: Contract clauses, case law, regulatory documents
Finance: Earnings reports, Q&A datasets, personal finance content
Healthcare: Medical Q&A, clinical dialogues, health consultation logs
Code: Programming problems, code search, SQL queries

The model supports 100+ natural languages and several programming languages, with strong cross-lingual and multilingual retrieval performance. It achieves a Public dataset score of 0.7953 and Private dataset score of 0.8157 on RTEB, indicating minimal overfitting. Use it for:

Semantic search over long internal documents (32K token context)
RAG pipelines where retrieval quality directly impacts LLM output
Code search in multi-repo environments
Multilingual document matching without language-specific pipelines
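For the semantic search and RAG use cases above, retrieval comes down to ranking document vectors by similarity to a query vector. The sketch below uses cosine similarity in pure Python with toy 3-dimensional vectors standing in for the model's 4096-dimensional embeddings. The helper names `cosine` and `top_k` are illustrative, and a real deployment would normally use a vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors; in practice these would come from the embedding model.
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

for idx, score in top_k(query, docs, k=2):
    print(f"doc {idx}: {score:.3f}")
```

Documents 0 and 1 rank first here because they point in nearly the same direction as the query, while document 2 is orthogonal to it.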

Running Octen-Embedding-8B Locally

This is where Octen-Embedding-8B differentiates itself from API-only models. You can deploy it on consumer GPUs, and the open license means no rate limits or data privacy risks.

VRAM Requirements by Quantization

Quantization	VRAM (approx.)	Recommended Hardware
FP16 (full)	15–16 GB	RTX 4080 Super (16GB), RTX 4090 (24GB), M4 Max (64GB unified)
INT8	8–9 GB	RTX 4060 Ti 16GB, RTX 3080 10GB (with swap)
Q4_K_M	5–6 GB	RTX 3060 12GB, Apple Silicon 18GB+ unified memory

For most users, Q4_K_M is the sweet spot: it preserves retrieval quality (within 1–2% of FP16 on MTEB benchmarks) while fitting on widely available cards like the RTX 3060 12GB. If you have a 24GB card like the RTX 4090, INT8 or even FP16 is feasible and maximizes precision for edge-case queries.

Expected Performance (Embedding Throughput)

Embedding throughput is measured in input tokens processed per second, not in generated tokens per second as with chat models. On a single RTX 4090:

FP16: ~800–1000 tokens/second
INT8: ~1100–1400 tokens/second
Q4_K_M: ~1300–1600 tokens/second

Batch size affects throughput substantially. With a batch size of 16 and Q4_K_M, you can embed ~25,000 tokens per second. On an M4 Max (64GB unified), expect ~600–800 tokens/second in FP16 — sufficient for real-time retrieval in most RAG setups.
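These throughput figures translate directly into wall-clock time and rental cost for a bulk embedding job. The sketch below assumes a hypothetical corpus of 10,000 documents at 2,000 tokens each, the ~25,000 tokens/second batched figure quoted above, and the $0.13 per GPU-hour RTX 4090 spot rate listed on this page. Real jobs add model load time, tokenization, and possible spot interruptions.

```python
def embedding_job_estimate(total_tokens, tokens_per_second, usd_per_gpu_hour):
    """Return (hours, cost_usd) for embedding total_tokens at a steady rate."""
    hours = total_tokens / tokens_per_second / 3600
    return hours, hours * usd_per_gpu_hour


# Hypothetical corpus size, chosen only for illustration.
corpus_tokens = 10_000 * 2_000

hours, cost = embedding_job_estimate(
    corpus_tokens,
    tokens_per_second=25_000,  # batched Q4_K_M figure quoted above
    usd_per_gpu_hour=0.13,     # RTX 4090 spot rate listed on this page
)
print(f"{hours * 60:.1f} minutes, ${cost:.3f}")
```

Under these assumptions the job takes about 13 minutes and costs roughly three cents, so for a corpus of this size the rental cost is small next to the engineering time.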

Quick Start with Ollama or sentence-transformers

The fastest way to run Octen-Embedding-8B locally is via Ollama, provided a GGUF build of the model is available to it. Alternatively, load the official INT8 model from Hugging Face with sentence-transformers:

from sentence_transformers import SentenceTransformer

# Downloads the INT8 checkpoint from Hugging Face on first use.
model = SentenceTransformer("Octen/Octen-Embedding-8B-INT8")
# encode() takes a list of strings and returns one embedding per input.
embeddings = model.encode(["Your query text here"])

For production deployments, consider serving the model with vLLM, which can serve embedding models, or with a custom Triton Inference Server setup.

Hardware Considerations

Best GPU for Octen-Embedding-8B: RTX 4090 (24GB) — fits FP16 with headroom for batch processing.
Budget option: RTX 3060 12GB + Q4_K_M quantization.
Apple Silicon: M4 Max 64GB unified memory can run FP16 comfortably. M2 Ultra also works.

If you need to run a 7.6B model on a consumer GPU like the RTX 3060, use Q4_K_M and keep the batch size under 8. For latency-sensitive applications, the INT8 variant on an RTX 4080 Super is a solid middle ground.
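Keeping the batch size small on a memory-constrained card is easy to enforce in code. The sketch below splits inputs into fixed-size batches and passes each to an `encode` callable. Both helper names are illustrative, and `fake_encode` is a stand-in so the example runs without a GPU. With sentence-transformers, `model.encode` also accepts a `batch_size` argument that does the same thing internally.

```python
from typing import Callable, Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of texts with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def embed_all(texts: Sequence[str],
              encode: Callable[[List[str]], List[List[float]]],
              batch_size: int = 4) -> List[List[float]]:
    """Embed texts batch by batch so peak memory stays bounded."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(batch))
    return vectors


# Stand-in encoder: one-dimensional "embeddings" based on text length.
def fake_encode(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


texts = [f"document {i}" for i in range(10)]
vectors = embed_all(texts, fake_encode, batch_size=4)
print(len(vectors), [len(b) for b in batched(texts, 4)])
```

The default of 4 here is simply one value under the limit of 8 suggested above; the right setting depends on sequence length and available VRAM.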

How It Compares

Octen-Embedding-8B is most directly compared with two models:

vs Voyage-3-large (API-only, 1024 dim)

Voyage-3-large scores 0.7812 Mean (Task) on RTEB vs Octen’s 0.8045. Octen also offers 4× the embedding dimensions (4096 vs 1024), which can improve recall on fine-grained retrieval. The trade-off: Octen requires local hardware and is larger (7.6B vs Voyage’s ~1.5B). If you need zero-maintenance cloud retrieval, Voyage is simpler; if you want control and better scores, Octen wins.

vs Qwen3-Embedding-8B (base model)

Octen-Embedding-8B is a refined version of Qwen3-Embedding-8B, which scores 0.7547. Octen’s domain tuning adds ~5 points on Mean (Task) and significantly improves performance on legal, finance, and medical tasks. If you’re already using Qwen3-Embedding-8B, upgrading to Octen costs nothing (license compatibility) and gives you a measurable quality lift without changing your inference stack.
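The gaps quoted in these comparisons can be checked directly from the RTEB Mean (Task) scores given in this section. The sketch below only restates those three numbers; the helper name `gap_in_points` is illustrative, and a "point" here means 0.01 on the 0 to 1 scale.

```python
# RTEB Mean (Task) scores as quoted in this section.
rteb_mean_task = {
    "Octen-Embedding-8B": 0.8045,
    "Voyage-3-large": 0.7812,
    "Qwen3-Embedding-8B": 0.7547,
}


def gap_in_points(scores, model, baseline):
    """Difference between two scores, in points (hundredths of the scale)."""
    return round((scores[model] - scores[baseline]) * 100, 2)


for baseline in ("Voyage-3-large", "Qwen3-Embedding-8B"):
    gap = gap_in_points(rteb_mean_task, "Octen-Embedding-8B", baseline)
    print(f"vs {baseline}: +{gap} points")
```

This gives about +2.33 points over Voyage-3-large and +4.98 points over the base model, matching the "~5 points" attributed to domain tuning.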

When to choose Octen-Embedding-8B:

You need top-tier retrieval quality on proprietary or domain-specific documents.
You want to avoid API costs and data egress.
You have at least 6 GB of VRAM available, enough for the Q4_K_M quantization.

When to choose an alternative:

You have extremely limited hardware (under 4 GB VRAM) — consider Octen-Embedding-0.6B.
You need 128K+ context (Cohere Embed v4) — Octen caps at 32K.

