
Mid-range 4B multilingual embedding workhorse: the quality-versus-cost sweet spot of the F2LLM family.
A solid 4B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.
| GPU | Provider · Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai · Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod · Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod · Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai · Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai · Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
F2LLM-v2-4B is a 4-billion-parameter dense embedding model from CodeFuse-AI, a research group within Ant Group. It is part of the F2LLM-v2 family—a lineup of eight multilingual embedding models ranging from 80M to 14B parameters, trained on 60 million curated public samples covering over 200 languages.
Where this model lands in the landscape: it occupies the same niche as other mid-size embedding models (e.g., BGE-Large, E5-mistral-7b) but with a specific focus on broad language coverage, including dozens of low- and mid-resource languages that most competitors handle poorly or ignore. If you need an embedding pipeline that works reliably across multiple languages on a single consumer GPU, F2LLM-v2-4B is a strong candidate.
The model is released under Apache 2.0, and the full training data, code, and intermediate checkpoints are open. This makes it a genuinely transparent option for teams that need to audit or fine-tune.
---
F2LLM-v2-4B uses a standard dense transformer architecture. No mixture-of-experts—every forward pass activates all 4 billion parameters. This means inference latency is predictable and VRAM consumption scales linearly with batch size.
Dense 4B models hit a practical sweet spot: they are large enough to capture nuanced semantics across many languages, yet small enough to run on mid-range hardware with quantization. The tradeoff vs. MoE models is that every token pays the compute cost of all 4 billion parameters, however short the sequence.
The model supports Matryoshka Representation Learning (MRL), meaning you can truncate the output embedding to smaller dimensions (e.g., 512, 256) and still get surprisingly good retrieval performance. This is a practical feature if you are indexing large datasets and want to reduce storage or search latency.
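The truncation step itself is small: keep the leading components and rescale to unit length so that dot products remain cosine similarities. The following is a minimal sketch under stated assumptions: `truncate_embedding` is a hypothetical helper, and the 8-dimensional toy vector stands in for a real model output, whose actual width is not given here.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # An all-zero prefix cannot be normalized; return it unchanged.
    return [x / norm for x in head] if norm > 0 else head


# Toy 8-dimensional vector standing in for a real model output (assumption).
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.1, -0.1, 0.1]
short = truncate_embedding(full, 4)
print(short)  # [0.5, -0.5, 0.5, -0.5]
```

Re-normalizing after truncation matters because a prefix of a unit vector is generally shorter than unit length. Queries and documents should be truncated to the same dimension before they are compared.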
---
F2LLM-v2-4B is an embedding model (pipeline tag: feature-extraction). It produces dense vector representations of text—not free-form generation. It is designed for semantic search, clustering, retrieval-augmented generation (RAG), and cross-lingual information retrieval.
Key capabilities:
Concrete use cases:
Because it is an embedding model, it cannot act as a chat agent. You will pair it with a retriever (e.g., FAISS, Chroma, or a simple KNN) and a separate generative model for RAG.
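The "simple KNN" option can be sketched in a few lines. This is an illustrative brute-force search, not a recommended production retriever: `top_k` is a hypothetical helper, and the 3-dimensional unit vectors are toy stand-ins for normalized embeddings produced by the model.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k(
    query: Sequence[float], docs: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the k highest dot products.

    With unit-length vectors the dot product equals cosine similarity.
    """
    scores = [
        (i, sum(q * d for q, d in zip(query, doc))) for i, doc in enumerate(docs)
    ]
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy unit vectors standing in for normalized model embeddings (assumption).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query_vec = [0.8, 0.6, 0.0]

hits = top_k(query_vec, doc_vecs, k=2)
print([i for i, _ in hits])  # [2, 0]
```

In a RAG pipeline, the texts behind the returned indices would be passed as context to the separate generative model. Brute-force scoring is linear in corpus size, which is the usual reason to move to an index such as FAISS or Chroma as the collection grows.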
---
Minimum (4-bit quantized):
Recommended (FP16 or Q8):
Apple Silicon: M2 Pro / M3 Max / M4 Max with 36 GB+ unified memory can run FP16 easily, but quantization (Q4_K_M) is recommended to leave RAM for the retriever and generative model.
| Quantization | VRAM (approx.) | Quality vs FP16 | Use when |
|---|---|---|---|
| FP16 | 8.0 GB | Reference | Need maximum recall, have GPU with ≥10 GB VRAM |
| Q8_0 | 4.5 GB | Near lossless | Good GPU, but memory-constrained |
| Q4_K_M | 2.5 GB | Slight degradation (<2%) | Consumer GPUs (RTX 3060, 4060) – best balance |
| Q4_0 | 2.3 GB | Noticeable drop | Last resort, very limited VRAM |
The best quantization for most practitioners is Q4_K_M. It retains semantic quality within about 2% of FP16 on standard MTEB benchmarks, cuts VRAM from roughly 8.0 GB to 2.5 GB, and speeds up inference on memory-bandwidth-limited cards.
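The VRAM column in the table above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below assumes a nominal 4e9 parameters and approximate average bits-per-weight values for common GGUF-style formats. Both are assumptions for illustration, not measurements of this model.

```python
# Approximate average bits per weight. These are assumed values for common
# GGUF-style formats, not measurements of this model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
}


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 4e9  # nominal parameter count taken from the model name (assumption)

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.2f} GB")
```

The estimate covers weights only. Activations, the tokenizer, and runtime buffers come on top and grow with batch size and sequence length, so some headroom beyond these figures is advisable.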
Measured on a single RTX 4090 (FP16, batch_size=1, sequence length 128 tokens):
These numbers assume single-query embedding. For batch embedding (e.g., indexing 10k documents), throughput scales near-linearly with batch size until you hit VRAM limits.
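For indexing jobs, the corpus is typically fed to the encoder in fixed-size chunks so that peak memory stays bounded. The sketch below shows that loop with a stub encoder so it runs without a GPU. `batched`, `embed_corpus`, and `fake_encode` are hypothetical names; in real use, `encode_batch` would wrap the model's own encode call.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 64,
) -> List[Vector]:
    """Encode `texts` chunk by chunk and collect the vectors in input order."""
    vectors: List[Vector] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(encode_batch(chunk))
    return vectors


def fake_encode(chunk: Sequence[str]) -> List[Vector]:
    """Stub encoder (assumption): maps each text to a 1-d vector of its length."""
    return [[float(len(text))] for text in chunk]


docs = [f"document {i}" for i in range(10)]
vectors = embed_corpus(docs, fake_encode, batch_size=4)
print(len(vectors))  # 10
```

The batch size is the knob to tune: raise it until throughput stops improving or memory runs out, then back off. Sentence Transformers' `encode` also accepts a `batch_size` argument directly, so an explicit loop like this is mainly useful when vectors are streamed into an index rather than held in memory.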
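Quickest start:

    from sentence_transformers import SentenceTransformer

    # Load the model from the Hugging Face Hub.
    model = SentenceTransformer("codefuse-ai/F2LLM-v2-4B")
    # Encode a list of texts into unit-length embedding vectors.
    embeddings = model.encode(["Your text here"], normalize_embeddings=True)
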
To quantize: use bitsandbytes 4-bit loading via transformers, or a quantized version on Hugging Face (currently the official repo provides only FP16; community quantized versions may appear).
For Ollama: while Ollama primarily supports generative models, you can load embedding models via its embed API if you create a Modelfile pointing to the HF model. However, native transformers is simpler and more predictable.
---
BGE-Large is smaller and faster, but English-only. If your data is exclusively English, BGE-Large with Q4 may be more efficient. F2LLM-v2-4B wins on multilingual coverage and matryoshka representations.
E5-base-multilingual covers 100 languages but is a far smaller model. F2LLM-v2-4B’s 14x larger parameter count yields better recall on low-resource languages and complex queries. Tradeoff: higher VRAM and slower inference.
BGE-M3 is a strong multilingual dense-retrieval model but does not support matryoshka embeddings. F2LLM-v2-4B provides flexibility to reduce embedding dimensions without retraining. On the other hand, BGE-M3 is more widely adopted and has more community quantizations.
When to choose F2LLM-v2-4B:
When to choose an alternative:
F2LLM-v2-4B hits the quality-cost sweet spot for teams that need broad language support on a single consumer GPU without compromising on performance.