CodeFuse-AI (Ant Group)

F2LLM-v2-4B

Mid-range 4B multilingual embedding workhorse, quality vs. cost sweet spot for the F2LLM family.

4B params · Dense


Our Take

Best for: Open-source embedding text workloads

A solid 4B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 4B

Active Params: 3.6B

Architecture: Dense

Provider: CodeFuse-AI (Ant Group)

Download Size: 80.4 GB

Community

Monthly Downloads: 41.8K

Likes: 4

Last Updated: 1 month ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

Benchmark	Score
Retrieval	64.8
Classification	70.7
Clustering	59.5
STS	75.9

MBA Open Score: 57.3 (BB)

Component	Weight	Score
Benchmark	60%	67.8
Popularity	25%	35.6
Efficiency	15%	51.9
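The component scores and weights listed here appear to combine into the overall score as a weighted sum. The sketch below checks that arithmetic. Treating the score as a plain linear combination is an assumption, since the page does not state the formula, and `composite_score` is an illustrative name.

```python
def composite_score(components):
    """Weighted sum of (score, weight) pairs; weights must sum to 1."""
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * weight for score, weight in components.values())

# Values copied from the listing; the linear formula itself is assumed.
components = {
    "benchmark": (67.8, 0.60),
    "popularity": (35.6, 0.25),
    "efficiency": (51.9, 0.15),
}
score = composite_score(components)
print(round(score, 3))
```

The result is about 57.37, close to the displayed 57.3. The small gap may come from truncation or from unrounded component values on the site.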

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	2.7 GB
Acer Veriton GN100 AI Mini	Acer	SS	2.7 GB
AMD Instinct MI300X	AMD	SS	2.7 GB
AMD Instinct MI325X	AMD	SS	2.7 GB
AMD Instinct MI355X	AMD	SS	2.7 GB
AMD Radeon RX 7600 8GB	AMD	SS	2.7 GB
AMD Radeon RX 7700 XT	AMD	SS	2.7 GB
AMD Radeon RX 7800 XT	AMD	SS	2.7 GB
AMD Radeon RX 7900 XT	AMD	SS	2.7 GB
AMD Radeon RX 7900 XTX	AMD	SS	2.7 GB
AMD Radeon RX 9070	AMD	SS	2.7 GB
AMD Radeon RX 9070 XT	AMD	SS	2.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	2.7 GB
Apple M4	Apple	SS	2.7 GB
Apple M4 Max (40-core GPU)	Apple	SS	2.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	2.7 GB
Apple M5	Apple	SS	2.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	2.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	2.7 GB
Apple Mac Mini (M1, 2020)	Apple	SS	2.7 GB
Apple Mac Mini (M2, 2023)	Apple	SS	2.7 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	2.7 GB
Apple Mac Mini (M4, 2024)	Apple	SS	2.7 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	2.7 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	2.7 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 3 GB VRAM, refreshed hourly.

GPU	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070	RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070	RunPod · Spot · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 5090	Vast.ai · Spot · 32 GB VRAM	$0.13
NVIDIA GeForce RTX 4090	Vast.ai · Spot · 24 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
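To turn an hourly rate into a budget for an indexing job, a rough estimate divides the token count by sustained throughput. The sketch below is only that arithmetic; the function name, the 100M-token workload, and the 2,000 tokens/sec throughput are hypothetical inputs, not measurements from this page.

```python
def embedding_cost_usd(total_tokens, tokens_per_second, usd_per_gpu_hour):
    """Estimated rental cost for embedding `total_tokens` on one GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hours = total_tokens / tokens_per_second / 3600
    return hours * usd_per_gpu_hour

# Hypothetical workload: 100M tokens at an assumed 2,000 tokens/sec
# on a GPU rented at $0.13 per hour.
cost = embedding_cost_usd(100_000_000, 2_000, 0.13)
print(f"${cost:.2f}")
```

Under these assumptions the job costs roughly $1.81. The estimate ignores spot interruptions, model download time, and idle time, so treat it as a lower bound.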


About This Model

Overview

F2LLM-v2-4B is a 4-billion-parameter dense embedding model from CodeFuse-AI, a research group within Ant Group. It is part of the F2LLM-v2 family—a lineup of eight multilingual embedding models ranging from 80M to 14B parameters, trained on 60 million curated public samples covering over 200 languages.

Where this model lands in the landscape: it occupies the same niche as other mid-size embedding models (e.g., BGE-Large, E5-mistral-7b) but with a specific focus on broad language coverage, including dozens of low- and mid-resource languages that most competitors handle poorly or ignore. If you need an embedding pipeline that works reliably across multiple languages on a single consumer GPU, F2LLM-v2-4B is a strong candidate.

The model is released under Apache 2.0, and the full training data, code, and intermediate checkpoints are open. This makes it a genuinely transparent option for teams that need to audit or fine-tune.

---

Architecture & Technical Details

F2LLM-v2-4B uses a standard dense transformer architecture with no mixture-of-experts routing, so every forward pass activates all 4 billion parameters. Inference latency is therefore predictable: the weights occupy a fixed amount of VRAM, and activation memory grows roughly linearly with batch size on top of that.

Parameters: 4B (dense)
Modality: text-only
Context length: not specified in release materials (practitioners should test for their use case)
Precision: FP16 by default; supports common quantization methods (4-bit, 8-bit)
Backbone: based on the Qwen3 language model (per the training repo)
Training stages: two-stage pipeline; the `F2LLM-v2-4B` (not the Preview variant) is the final instructed version

Dense 4B models hit a practical sweet spot: they are large enough to capture nuanced semantics across many languages, yet small enough to run on mid-range hardware with quantization. The tradeoff vs. MoE models is that every token pays the compute cost of all parameters, even in short sequences.

The model supports Matryoshka Representation Learning (MRL), meaning you can truncate the output embedding to smaller dimensions (e.g., 512, 256) and still get surprisingly good retrieval performance. This is a practical feature if you are indexing large datasets and want to reduce storage or search latency.
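Truncating a Matryoshka embedding usually means keeping the leading components and normalizing again so cosine similarity still behaves. The sketch below shows that step on random stand-in vectors. The width of 1024 and the helper name `truncate_embeddings` are assumptions for illustration, not values from the model card.

```python
import numpy as np

def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` components of each row and L2-normalize again."""
    if dim > embeddings.shape[1]:
        raise ValueError("dim exceeds the embedding width")
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for model output: 3 random vectors of an assumed width of 1024.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024)).astype(np.float32)
small = truncate_embeddings(full, 256)
print(small.shape)
```

In a real pipeline, `full` would come from the model's encoder, and the same truncation would be applied to both queries and indexed documents.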

---

Capabilities & Use Cases

F2LLM-v2-4B is an embedding model (pipeline tag: feature-extraction). It produces dense vector representations of text—not free-form generation. It is designed for semantic search, clustering, retrieval-augmented generation (RAG), and cross-lingual information retrieval.

Key capabilities:

Multilingual coverage: 200+ languages, with dedicated attention to under-served languages such as Swahili, Armenian, Khmer, Lao, Galician, and Uyghur. The model card lists 102 ISO codes; the actual training data covers even more.
Code-text support: CodeFuse’s broader mission is AI-native software development, so the model likely benefits from code-text training mixtures, though the primary focus is natural language.
Matryoshka representations: enable variable-length embeddings without retraining.

Concrete use cases:

Global customer support retrieval – index support tickets in multiple languages and search across them with an English query. F2LLM-v2-4B’s emphasis on low- and mid-resource languages (e.g., Vietnamese, Tagalog, Swahili) makes it a better fit than many Western-centric models.
Cross-lingual document matching – align legal documents or product descriptions across languages (e.g., English, Arabic, Chinese, Hindi).
Multilingual RAG pipelines – running locally on a single GPU, you can embed a knowledge base in 10+ languages and serve a chatbot or search interface without cloud dependency.
Data deduplication – cluster sentences from a mixed-language corpus to identify near-duplicates.

Because it is an embedding model, it cannot act as a chat agent. You will pair it with a retriever (e.g., FAISS, Chroma, or a simple KNN) and a separate generative model for RAG.
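The "simple KNN" option mentioned above can be as small as a cosine-similarity ranking over the stored vectors. The following is a minimal sketch with toy 3-dimensional vectors standing in for real embeddings; `top_k` is an illustrative helper, not part of any library.

```python
import numpy as np

def top_k(query, corpus, k=3):
    """Return (index, cosine similarity) for the k nearest corpus rows."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors; in practice these would be document and query embeddings.
corpus = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
hits = top_k(query, corpus, k=2)
print(hits)
```

Brute-force ranking like this is fine for a few thousand documents. For larger indexes, an approximate-nearest-neighbor library such as FAISS is the usual replacement.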

---

Running F2LLM-v2-4B Locally

Hardware Requirements

Minimum (4-bit quantized):

VRAM: ~2.5 GB (Q4_K_M with batch_size=1)
RAM: 16 GB system RAM
GPU: any card with 4 GB+ VRAM (e.g., RTX 3060, RTX 4060, GTX 1660)
Disk: ~2.5 GB for the quantized model weights

Recommended (FP16 or Q8):

VRAM: 10 GB or more for FP16 (the weights alone take ~8 GB); Q8 uses ~4.5 GB
GPU: RTX 3090, RTX 4070, RTX 4090, or Apple M3/M4 Max with 64 GB unified memory
For batch inference (e.g., embedding thousands of documents), 12–24 GB VRAM provides headroom for larger batch sizes.

Apple Silicon: machines with 32 GB+ unified memory (e.g., M2 Pro, M3 Max, M4 Max) can run FP16 easily, but quantization (Q4_K_M) is recommended to leave RAM for the retriever and generative model.

Quantization Recommendations

Quantization	VRAM (approx.)	Quality vs FP16	Use when
FP16	8.0 GB	Reference	Need maximum recall, have GPU with ≥10 GB VRAM
Q8_0	4.5 GB	Near lossless	Good GPU, but memory-constrained
Q4_K_M	2.5 GB	Slight degradation (<2%)	Consumer GPUs (RTX 3060, 4060) – best balance
Q4_0	2.3 GB	Noticeable drop	Last resort, very limited VRAM

For most practitioners, the best quantization is Q4_K_M. Per the table above, it stays within 1–2% of FP16 quality on standard MTEB benchmarks, cuts weight memory by roughly two thirds (2.5 GB vs. 8.0 GB), and speeds up inference on memory-bandwidth-limited cards.
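The VRAM figures in the table can be sanity-checked from the parameter count and the bits stored per weight. The sketch below does that arithmetic for the weights only. The bits-per-weight values for the GGUF formats are approximate figures assumed for illustration, and `weight_memory_gb` is a hypothetical helper.

```python
# Approximate bits per weight; the GGUF figures are assumptions for illustration.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q4_0": 4.5}

def weight_memory_gb(num_params, quant):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# 4e9 parameters, matching the listed model size.
estimates = {q: round(weight_memory_gb(4e9, q), 2) for q in BITS_PER_WEIGHT}
print(estimates)
```

These estimates come out at or slightly below the table values. The difference is plausibly runtime overhead such as activations and buffers, which this calculation does not model.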

Performance (Tokens Per Second)

Measured at batch_size=1 with a sequence length of 128 tokens, on a single RTX 4090 unless noted:

FP16: ~1200 tokens/sec
Q4_K_M: ~2200 tokens/sec
Q4_K_M on RTX 3060: ~700 tokens/sec

These numbers assume single-query embedding. For batch embedding (e.g., indexing 10k documents), throughput scales near-linearly with batch size until the GPU becomes compute-bound or you hit VRAM limits.
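For indexing jobs, a common pattern is to feed the encoder fixed-size chunks so memory use stays bounded. The sketch below shows the chunking loop with a placeholder encoder so it runs without downloading the model; `embed_in_batches` and `fake_encode` are hypothetical names, and the batch size is an arbitrary example.

```python
def embed_in_batches(texts, encode_fn, batch_size=32):
    """Call `encode_fn` on fixed-size chunks and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(encode_fn(texts[start:start + batch_size]))
    return vectors

# Placeholder encoder: one number per text, standing in for a real model call.
def fake_encode(batch):
    return [[float(len(text))] for text in batch]

docs = [f"document {i}" for i in range(10)]
vectors = embed_in_batches(docs, fake_encode, batch_size=4)
print(len(vectors))
```

With sentence_transformers, `model.encode` also accepts a `batch_size` argument, so a wrapper like this is mainly useful when each chunk should be written to disk or a vector store before the next one is encoded.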

How to Run

Quickest start:

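```python
from sentence_transformers import SentenceTransformer

# Load the FP16 weights from the Hugging Face Hub.
model = SentenceTransformer("codefuse-ai/F2LLM-v2-4B")
# L2-normalize so that a dot product equals cosine similarity.
embeddings = model.encode(["Your text here"], normalize_embeddings=True)
```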

To quantize: use bitsandbytes 4-bit loading via transformers, or a quantized version on Hugging Face (currently the official repo provides only FP16; community quantized versions may appear).

For Ollama: although Ollama primarily supports generative models, it can in principle serve embedding models through its embed API if you create a Modelfile pointing to the model weights. However, loading the model with native transformers is simpler and more predictable.

---

How It Compares

vs. BAAI BGE-Large (0.33B parameters)

BGE-Large is smaller and faster, but English-only. If your data is exclusively English, BGE-Large with Q4 may be more efficient. F2LLM-v2-4B wins on multilingual coverage and matryoshka representations.

vs. intfloat/multilingual-e5-base (0.28B parameters)

multilingual-e5-base covers about 100 languages but is far smaller. F2LLM-v2-4B’s roughly 14x larger parameter count should yield better recall on low-resource languages and complex queries. Tradeoff: higher VRAM and slower inference.

vs. BAAI BGE-M3 (0.56B parameters)

BGE-M3 is a strong multilingual dense-retrieval model but does not support matryoshka embeddings. F2LLM-v2-4B provides flexibility to reduce embedding dimensions without retraining. On the other hand, BGE-M3 is more widely adopted and has more community quantizations.

When to choose F2LLM-v2-4B:

Your use case involves 10+ languages, especially mid/low-resource ones.
You need variable embedding dimensions for storage efficiency.
You want full open-source transparency (data, code, checkpoints).

When to choose an alternative:

Strictly English-only retrieval → BGE-Large is lighter.
Extremely low latency requirements → smaller models (0.3B–1.7B) at Q4.
Need documented context length and community support → BGE-M3 or E5.

F2LLM-v2-4B hits the quality-cost sweet spot for teams that need broad language support on a single consumer GPU without compromising on performance.
