
A Qwen3-4B text embedder trained with bagging-based model merging for OOD-robust retrieval.
A workable 4B-parameter dense embedding model from the Institute of Computing Technology. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
BOOM_4B_v1 is a 4-billion-parameter dense text embedding model developed by the Institute of Computing Technology (ICT), Chinese Academy of Sciences. It is built on the Qwen3-4B architecture and fine-tuned for general-purpose text representation, with a specific focus on out-of-domain (OOD) robustness. The model uses a bagging-based model merging technique (BOOM) that trains multiple embedding models on sampled subsets of the training data, then merges them into a single model. This approach improves both in-domain and OOD retrieval performance while keeping inference identical to that of a single model.
The model occupies a specific niche: large-scale text embedders. Most embedding models run under 1B parameters (e.g., BGE-M3, E5, GTE). BOOM_4B_v1 pushes that ceiling to 4B, targeting applications that demand higher representational capacity, particularly retrieval-augmented generation (RAG), enterprise search, and semantic similarity tasks where robustness across unseen domains matters. It competes with other large embedders like intfloat/e5-mistral-7b-instruct (7B) and Alibaba’s gte-Qwen2-1.5B (1.5B), but offers a middle ground in size and a distinct training methodology.
BOOM_4B_v1 is a dense transformer, not a mixture-of-experts (MoE) model, so all 4B parameters are active during inference. The model is initialized from Qwen3-4B, then trained for sentence embedding using last-token pooling. The base model supports a native context length of 32,768 tokens (per the Qwen3-4B specification), enabling encoding of long documents in a single pass.
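Last-token pooling can be illustrated with a minimal pure-Python sketch on toy data. This is not the model's actual implementation: the function name `last_token_pool` and the nested-list inputs are assumptions made for the example, standing in for the hidden states and attention mask a real transformer would produce.

```python
from typing import List, Sequence


def last_token_pool(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    attention_mask: Sequence[Sequence[int]],
) -> List[List[float]]:
    """Return, for each sequence, the hidden vector of its last real token.

    hidden_states: [batch][seq_len][hidden_dim] toy values (hypothetical input).
    attention_mask: [batch][seq_len], 1 for real tokens and 0 for padding.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Position of the last non-padding token in this sequence.
        last_index = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(list(states[last_index]))
    return pooled


# Toy batch: the first sequence is right-padded, the second is left-padded.
toy_states = [
    [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]],
    [[0.0, 0.0], [3.0, 3.0], [4.0, 4.0]],
]
toy_mask = [
    [1, 1, 0],
    [0, 1, 1],
]
toy_pooled = last_token_pool(toy_states, toy_mask)
```

With left padding, the last real token always sits at the final position, which is one reason embedding pipelines built on last-token pooling often set the tokenizer's padding side to the left.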
The key innovation is in training, not in model design. Five embedding models are trained on different random samples (20%, 40%, 60%, 80%, 100%) of the full 2.8M multi-task corpus. These are then merged using Multi-SLERP (a multi-model form of spherical linear interpolation) with weighted coefficients (0.2, 0.4, 0.6, 0.8, 1.0). The resulting model retains the inference cost of a single dense 4B network while capturing the variance-reducing benefits of bagging. This technique avoids the OOD generalization limitations of standard batch-level shuffling and supports incremental updates without full retraining.
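To make the merging idea concrete, here is a rough sketch of spherical interpolation between parameter vectors, folded over several models with weights. It is an illustration under stated assumptions, not the authors' Multi-SLERP procedure: the iterative pairwise folding in `merge_weighted`, the function names, and the flat-list representation of weights are all hypothetical choices for the example.

```python
import math
from typing import List, Sequence


def slerp(a: Sequence[float], b: Sequence[float], t: float) -> List[float]:
    """Spherical linear interpolation from a (t=0) to b (t=1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero vector; use linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # (Anti)parallel vectors: the SLERP formula is unstable, so fall back.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def merge_weighted(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Fold several vectors into one by repeated pairwise SLERP (an assumption).

    Each new vector is blended in with fraction w / (accumulated weight + w).
    """
    merged = list(vectors[0])
    total = weights[0]
    for vec, w in zip(vectors[1:], weights[1:]):
        merged = slerp(merged, vec, w / (total + w))
        total += w
    return merged


# Toy stand-ins for five flattened checkpoints, using the coefficients above.
toy_checkpoints = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
toy_merged = merge_weighted(toy_checkpoints, [0.2, 0.4, 0.6, 0.8, 1.0])
```

In practice, merging operates tensor by tensor on full checkpoints, typically through a dedicated merging tool rather than hand-written loops.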
The model is distributed in float32 precision. Quantization (e.g., to 8-bit or 4-bit) is supported via common inference engines.
BOOM_4B_v1 is a text-only embedding model designed for:
Training data covered retrieval (MS MARCO, NQ, HotpotQA, FEVER, FiQA, etc.), reranking (StackOverflowDupQuestions), classification (Amazon Reviews, Banking77, IMDB), clustering (Arxiv, Reddit), STS (STS12-22), and code (Cornstack – JavaScript, Java, Python, PHP, Ruby). The model handles English-dominant text but can encode other languages present in the training mix (e.g., MIRACL, Mr. TyDi). It does not generate text; it produces fixed-length vectors.
Concrete use cases:
At 4B parameters, BOOM_4B_v1 is a mid-size model that fits on consumer GPUs with reasonable quantization.
| Quantization | VRAM (approximate; long contexts need additional memory) | Recommended Hardware |
|---|---|---|
| FP16 | ~8 GB | RTX 3090 / 4090, M4 Max (64GB unified) |
| Q8_0 | ~5 GB | RTX 3080 12GB, RTX 4060 Ti 16GB |
| Q4_K_M | ~3 GB | RTX 3060 12GB, M4 Pro, Apple M2 Ultra |
| Q3_K_S | ~2.5 GB | RTX 2060 6GB (tight) |
Realistic GPU: An RTX 4090 24GB can run FP16 with headroom for batching or longer contexts. An RTX 3060 12GB handles Q4_K_M well. Apple Silicon users with unified memory >16GB can run Q4_K_M comfortably.
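As a back-of-the-envelope check on these figures, the weight footprint can be estimated from parameter count and bits per weight. The sketch below covers weights only; activations and runtime buffers come on top. The bits-per-weight values for the GGUF formats are approximate assumptions (Q8_0 stores 8.5 bits per weight; the K-quant numbers vary slightly by model), and the helper name `weights_gb` is hypothetical.

```python
# Approximate bits per weight for each format (assumed values for illustration).
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q3_K_S": 3.5,
}


def weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 4e9  # nominal 4B parameters

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weights_gb(NUM_PARAMS, bits):.2f} GB of weights")
```

The estimates land slightly below the table values, which is consistent with the table including some runtime overhead on top of the raw weights.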
Inference performance (estimated, Q4_K_M, batch size 1 on RTX 4090): ~200-300 tokens per second for encoding. Throughput scales with batch size – a batch of 32 documents of 512 tokens each should process at ~5000 tokens/sec.
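The throughput estimate translates into wall-clock time with simple arithmetic. The sketch below uses the estimated ~5000 tokens/sec figure from above; the helper name `encode_seconds` and the 100,000-document corpus are hypothetical, and real throughput depends on hardware, sequence lengths, and the inference engine.

```python
def encode_seconds(num_docs: int, tokens_per_doc: int, tokens_per_sec: float) -> float:
    """Estimate encoding time in seconds from a token throughput figure."""
    return num_docs * tokens_per_doc / tokens_per_sec


# One batch of 32 documents, 512 tokens each, at the estimated ~5000 tokens/sec.
batch_seconds = encode_seconds(32, 512, 5000.0)

# A hypothetical 100,000-document corpus at the same document length and rate.
corpus_hours = encode_seconds(100_000, 512, 5000.0) / 3600

print(f"one batch: ~{batch_seconds:.1f} s, corpus: ~{corpus_hours:.1f} h")
```

At that rate a single batch takes a little over three seconds, so indexing large corpora is an offline job measured in hours rather than an interactive one.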
Ollama can be the fastest path if the model is available in the Ollama library: install Ollama and pull it, or use a custom Modelfile pointing to the Hugging Face repo. Alternatively, use Sentence Transformers:
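```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("ICT-TIME-and-Querit/BOOM_4B_v1")
# Encode a list of texts into embedding vectors
embeddings = model.encode(["Your text here"])
```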
For speed, enable Flash Attention 2:
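```python
from sentence_transformers import SentenceTransformer

# Flash Attention 2 requires the flash-attn package and a supported GPU
model = SentenceTransformer(
    "ICT-TIME-and-Querit/BOOM_4B_v1",
    model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
    tokenizer_kwargs={"padding_side": "left"},
)
```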
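Once texts are encoded, retrieval reduces to ranking document vectors by similarity to a query vector. The following is a small dependency-free sketch using cosine similarity; the toy vectors stand in for real embeddings, and the function names `cosine` and `rank_by_cosine` are hypothetical. A production setup would more likely use a vector index or batched matrix operations.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_by_cosine(query: Sequence[float], docs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy 2-D vectors in place of real embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
toy_ranking = rank_by_cosine(toy_query, toy_docs)
```

Here the document identical to the query ranks first, the partially aligned one second, and the orthogonal one last.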
Best quantization for most users: Q4_K_M offers the best tradeoff between quality retention (roughly 98% of FP16) and VRAM (~3 GB). Avoid Q3_K_S for production retrieval; it may degrade OOD performance due to cumulative precision loss. Where VRAM allows, Q8_0 (~5 GB) is a safer choice than lower-bit formats.
Bottom line: BOOM_4B_v1 hits a sweet spot, pairing near-top performance with a VRAM footprint that fits most modern consumer GPUs when quantized. Its bagging-based training is a practical alternative to increasing model size for robustness. For local RAG deployments where hardware is a constraint and OOD quality matters, this is a strong candidate.