CodeFuse-AI (Ant Group)

F2LLM-v2-0.6B

Sub-1B multilingual embedder; teacher for the F2LLM-v2 80M/160M/330M distilled siblings.

0.596B params · Dense


Our Take

Best for: Open-source text embedding workloads

A solid 0.596B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the per-task benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.596B

Active Params: 0.441B

Architecture: Dense

Provider: CodeFuse-AI (Ant Group)

Download Size: 45.3 GB

Community

Monthly Downloads: 11.7K

Likes: 6

Last Updated: 1 month ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

Task	Score
Retrieval	59.3
Classification	64.1
Clustering	56.6
STS	74.2

MBA Open Score: 55.9 (BB)

Component	Weight	Score
Benchmark	60%	63.5
Popularity	25%	26.7
Efficiency	15%	74.1
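The component weights and scores shown above are consistent with a simple weighted average. The sketch below checks that arithmetic. The formula is an assumption inferred from the displayed weights, not a documented scoring method, and the names are illustrative.

```python
# Assumption: the composite is a plain weighted average of the three
# component scores, using the weights displayed on this page.
COMPONENTS = {
    "benchmark": (63.5, 0.60),
    "popularity": (26.7, 0.25),
    "efficiency": (74.1, 0.15),
}


def composite_score(components):
    """Return the weighted sum of (score, weight) pairs."""
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(f"{score:.1f}")  # 55.9
```

Because popularity carries a quarter of the weight, the composite sits well below the benchmark component on its own.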

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	0.8 GB
Acer Veriton GN100 AI Mini	Acer	SS	0.8 GB
AMD Instinct MI300X	AMD	SS	0.8 GB
AMD Instinct MI325X	AMD	SS	0.8 GB
AMD Instinct MI355X	AMD	SS	0.8 GB
AMD Radeon RX 7600 8GB	AMD	SS	0.8 GB
AMD Radeon RX 7700 XT	AMD	SS	0.8 GB
AMD Radeon RX 7800 XT	AMD	SS	0.8 GB
AMD Radeon RX 7900 XT	AMD	SS	0.8 GB
AMD Radeon RX 7900 XTX	AMD	SS	0.8 GB
AMD Radeon RX 9070	AMD	SS	0.8 GB
AMD Radeon RX 9070 XT	AMD	SS	0.8 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	0.8 GB
Apple M4	Apple	SS	0.8 GB
Apple M4 Max (40-core GPU)	Apple	SS	0.8 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	0.8 GB
Apple M5	Apple	SS	0.8 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	0.8 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	0.8 GB
Apple Mac Mini (M1, 2020)	Apple	SS	0.8 GB
Apple Mac Mini (M2, 2023)	Apple	SS	0.8 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	0.8 GB
Apple Mac Mini (M4, 2024)	Apple	SS	0.8 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	0.8 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	0.8 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4 (Vast.ai · Spot · 24 GB VRAM)	$0.03
NVIDIA L4 (Vast.ai · On-Demand · 24 GB VRAM)	$0.04
NVIDIA GeForce RTX 5060 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.09
NVIDIA GeForce RTX 5060 Ti (Vast.ai · On-Demand · 16 GB VRAM)	$0.10
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.


About This Model

Overview

F2LLM-v2-0.6B is a general-purpose, multilingual text embedding model developed by CodeFuse-AI (Ant Group) together with Shanghai Jiao Tong University. At 0.596 billion parameters, it occupies a specific niche: a compact but high-performing embedder that serves both as a standalone feature extractor and as the teacher model for a family of smaller distilled siblings (80M, 160M, 330M). It is licensed under Apache 2.0 and fully open: weights, training data, and intermediate checkpoints are all released.

The model is part of the F2LLM-v2 series, which spans eight sizes from 80M to 14B parameters and was trained on a curated composite of 60 million publicly available high-quality text pairs. The 0.6B version is the smallest of the “base” models (the others being 1.7B, 4B, 8B, and 14B) and is the one you’d reach for when you need strong multilingual embedding quality on a tight memory or inference budget.

Practitioners should care because this model breaks the English-centric barrier: it supports more than 200 languages, with explicit emphasis on mid- and low-resource languages that are often poorly served by other embedding models. For retrieval-augmented generation (RAG), semantic search, or clustering pipelines that need to handle multilingual content on local hardware, F2LLM-v2-0.6B is a strong candidate.

---

Architecture & Technical Details

F2LLM-v2-0.6B is a dense transformer with 0.596 billion parameters. Per the training code it builds on a Qwen3 backbone, which is a decoder-only architecture rather than an encoder-only one; the embedding is pooled from the final hidden states instead of coming from a classification head. The model is tagged for feature extraction (pipeline_tag: feature-extraction on Hugging Face) and is intended for use with sentence-transformers.

Key architectural points:

Matryoshka Representation Learning (MRL): The V2 training supports MRL, meaning you can truncate the output embedding (e.g., from 1024 to 512 or 256 dimensions) without retraining. This gives you a direct trade-off between embedding size and retrieval accuracy, which is useful when you need to minimize storage or memory.
Knowledge Distillation (KD): The 0.6B model served as the teacher for distilling the smaller 80M, 160M, and 330M versions, so those students are trained to approximate its embeddings.
Context Length: Not officially specified. Given the Qwen3 base and typical embedding model training, you can expect support for at least 512 tokens (likely up to 2048 or more); check the model's configured maximum sequence length before relying on longer inputs. For practical purposes, most sentence- and paragraph-length inputs will fit comfortably.
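To make the MRL storage trade-off concrete, the sketch below estimates the raw size of a flat float32 vector index at each truncated dimension. The corpus size and function name are hypothetical, and real vector stores add their own overhead on top.

```python
def index_size_mb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of a flat float32 index in megabytes (10**6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e6


NUM_DOCS = 1_000_000  # hypothetical corpus size, chosen for illustration

for dim in (1024, 512, 256):
    print(f"{dim:>4} dims: {index_size_mb(NUM_DOCS, dim):,.0f} MB")
```

Storage scales linearly with dimension, so halving the embedding size halves the index.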

Dense architecture means all 0.6B parameters are active during every forward pass. This is straightforward to run — no expert routing overhead, no variable memory depending on input. VRAM usage is predictable: roughly 1.2 GB at FP16 for the model weights, plus a small overhead for activations (typically <0.5 GB for batch size 1). That puts it well within reach of any consumer GPU with 4 GB VRAM or more.

---

Capabilities & Use Cases

F2LLM-v2-0.6B is a text-only embedding model. It does not generate text; it maps input text to a dense vector that captures semantic meaning. Its primary capabilities are:

Multilingual semantic search: Supports over 200 languages, with strong performance on low-resource languages like Swahili, Armenian, Khmer, Lao, and many others. In MTEB evaluations, the F2LLM-v2 family claimed first place in 11 language-specific and domain-specific leaderboards (German, French, Japanese, code retrieval, etc.).
Cross-lingual retrieval: You can query in English and retrieve documents in, say, Vietnamese or Arabic. The model handles this without manual translation.
Code retrieval: Given the team’s CodeFuse background, the model has been trained on code-document pairs and performs well on code-to-code and code-to-text retrieval tasks.
Dimensionality flexibility: With MRL support, you can output embeddings at dimensions 256, 512, 768, or 1024 — trading off storage/bandwidth against accuracy.
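Truncating an MRL embedding is mechanically simple: keep the leading components and rescale to unit length so cosine similarity still behaves. The following is a minimal pure-Python sketch using a toy 4-dimensional vector in place of a real 1024-dimensional one; the function name is illustrative and not part of any library.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.0]  # toy stand-in for a full-size embedding
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

If you load the model through sentence-transformers, recent versions accept a `truncate_dim` argument on the `SentenceTransformer` constructor that performs the truncation for you.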

Concrete Use Cases

Local RAG pipeline: Index a multilingual knowledge base (e.g., technical documentation in English, Japanese, and German) on a laptop or edge device. F2LLM-v2-0.6B fits in <2 GB VRAM, so you can run it alongside a small LLM.
Cross-lingual document clustering: For an organization with documents in dozens of languages, this model produces language-agnostic embeddings that group similar content regardless of the source language.
Code search in IDE plugins: Embed code snippets and natural language queries in the same space; the 0.6B size makes it viable for real-time suggestions on a desktop GPU.
Low-resource language NLP: If you work with languages that lack embedding support in older models (e.g., Basque, Galician, Uyghur), this model fills that gap.
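All of these use cases reduce to the same operation: embed the query and the documents, then rank documents by cosine similarity. The sketch below shows that ranking step with toy 3-dimensional vectors standing in for real embeddings; the document ids and values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document ids sorted by descending cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy vectors: in practice these come from the embedding model.
docs = {
    "doc_en": [0.9, 0.1, 0.0],
    "doc_ja": [0.8, 0.2, 0.1],
    "doc_de": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
print(rank(query, docs))  # ['doc_en', 'doc_ja', 'doc_de']
```

In a cross-lingual setup the query and documents can be in different languages; the ranking code does not change because the comparison happens entirely in vector space.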

---

Running F2LLM-v2-0.6B Locally

This is where the model shines: you can run it on consumer hardware without breaking a sweat.

VRAM Requirements

Precision	Model Weights	Example GPU
FP32	~2.4 GB	RTX 3060 (12 GB), GTX 1080 Ti
FP16 / BF16	~1.2 GB	RTX 2060 (6 GB), M1 Macs
Q8_0 (int8)	~0.6 GB	Any GPU with 4 GB VRAM, or CPU with 8 GB RAM
Q4_K_M (int4)	~0.35 GB	Integrated GPUs, low-power devices
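The weight sizes in the table follow from parameter count times bits per weight. The sketch below reproduces that arithmetic with nominal bit widths. It is an approximation: quantized GGUF formats such as Q8_0 and Q4_K_M also store per-block scale data, which is an assumption about why the table's Q4 figure is a little above the nominal 4-bit value.

```python
PARAMS = 0.596e9  # parameter count from the model specifications

# Nominal bits per weight. Real quantized files carry extra scale data,
# so treat the int8 and int4 results as lower bounds.
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "int8": 8, "int4": 4}


def weight_gb(params: float, bits: int) -> float:
    """Weight memory in gigabytes (10**9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:<10} {weight_gb(PARAMS, bits):.2f} GB")
```

Activations and framework overhead come on top of these figures, so leave headroom when picking a batch size.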

Recommended setup: For most users, run at FP16 on a GTX 1660 Super (6 GB) or better. That leaves ample VRAM for batch processing or running a small LLM alongside. If you're on a laptop with an RTX 3050 (4 GB), use Q8_0 quantization — the performance hit on embeddings is usually negligible (<1% on MTEB tasks).

Hardware Compatibility

RTX 4090 / RTX 4080: Overkill but works. You can batch large numbers of documents (e.g., 512 at once) and still have VRAM to spare.
RTX 3060 (12 GB): Excellent. Run at FP16 with batch size 256-512.
RTX 2060 (6 GB): Good. Batch size 64-128 at FP16.
M4 Max / M3 Pro (Apple Silicon): Runs at FP16 with mps backend. Expect comparable throughput to an RTX 3060.
CPU only: Can work, but embedding generation will be slower. With 16 GB RAM, you can run the Q4_K_M model and get around 50-100 tokens/sec.

Expected Performance (Tokens per Second)

Testing on a single RTX 3090 (FP16, batch size 1):

Input Length	TPS (approx)
128 tokens	2000+
512 tokens	1500+
1024 tokens	900+

These rates are high because an embedding model needs only one forward pass per input, with no token-by-token decoding. Even on a laptop RTX 3050 at Q8, expect at least 300-500 tokens/sec for typical sentence-length inputs.
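A throughput figure is easiest to judge when converted into indexing time for a whole corpus. The sketch below does that conversion; the corpus size and chunk length are hypothetical, and the rate is the approximate 512-token figure quoted above.

```python
def indexing_seconds(num_docs: int, avg_tokens: int, tokens_per_sec: float) -> float:
    """Estimated wall-clock seconds to embed a corpus at a steady token rate."""
    return num_docs * avg_tokens / tokens_per_sec


# Hypothetical corpus: 100,000 chunks of 512 tokens at roughly 1500 tokens/sec.
secs = indexing_seconds(100_000, 512, 1500)
print(f"{secs / 3600:.1f} hours")  # 9.5 hours
```

The quoted rates are for batch size 1, so batched encoding should finish sooner; measure on your own hardware before planning around these numbers.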

Getting Started with Ollama

As of early 2025, Ollama supports embedding models natively. Check if codefuse-ai/f2llm-v2-0.6b is available, or import the model from Hugging Face.

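# Pull the model (confirm this tag exists in the Ollama library first)
ollama pull codefuse-ai/f2llm-v2-0.6b
# Generate an embedding through the local REST API; the CLI has no embed subcommand
curl http://localhost:11434/api/embed -d '{"model": "codefuse-ai/f2llm-v2-0.6b", "input": "Your text here"}'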
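Ollama serves embeddings over a local REST endpoint, /api/embed, so the same call can be made from Python with only the standard library. This is a minimal sketch that assumes a default local Ollama server and that the model tag from the text above is actually available; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # default local Ollama address
MODEL = "codefuse-ai/f2llm-v2-0.6b"  # assumption: tag taken from the text above


def build_embed_request(texts, model=MODEL, url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts):
    """Send the request and return one vector per input text."""
    with urllib.request.urlopen(build_embed_request(texts)) as resp:
        return json.load(resp)["embeddings"]


request = build_embed_request(["Your text here"])
print(request.get_method(), request.full_url)
```

Calling `embed(["Your text here"])` requires a running Ollama server with the model pulled; building the request is kept separate so the payload can be inspected without a server.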

Alternatively, use sentence-transformers directly:

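from sentence_transformers import SentenceTransformer

# Downloads the weights from Hugging Face on first use
model = SentenceTransformer('codefuse-ai/F2LLM-v2-0.6B')
# Unit-length vectors, so a dot product equals cosine similarity
embeddings = model.encode(["Your text"], normalize_embeddings=True)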

Best Quantization for F2LLM-v2-0.6B

Q4_K_M: Best balance of speed and quality on low-memory hardware, with a slight accuracy loss (<1% on typical MTEB subtasks). Recommended for RTX 3050, 4050, or Apple Silicon unified memory.
Q8_0: Mid-range option. Quality stays close to FP16 while weight memory is halved. Use if you have 4-6 GB VRAM.
FP16: Unquantized half precision. Use if you have 2+ GB VRAM and want no quantization loss.

---

How It Compares

vs. `intfloat/multilingual-e5-small` (0.118B)

Size: F2LLM-v2-0.6B is about 5x larger (596M vs 118M parameters), which generally buys more accurate embeddings.
Languages: E5-small covers roughly 100 languages, while F2LLM-v2-0.6B supports more than 200 and explicitly targets mid- and low-resource ones. On Tibetan, Quechua, or Javanese, it outperforms E5-small significantly.
Speed: E5-small is faster in terms of tokens/sec (due to smaller size). But for most batch workloads, F2LLM-v2-0.6B’s throughput is still more than adequate.
When to choose: Pick F2LLM-v2-0.6B if you need higher quality for low-resource languages or code retrieval. Use E5-small if you are constrained to <1 GB VRAM and need maximum speed.

vs. `BAAI/bge-m3` (0.567B)

Size: Nearly identical parameter count (567M vs 596M).
Languages: BGE-M3 supports 100+ languages but has less depth on low-resource ones, which is where F2LLM-v2 excels.
MRL support: BGE-M3 does not have Matryoshka representation; you must use its fixed 1024-dim dense output. F2LLM-v2-0.6B lets you truncate to 256 or 512 dims.
MTEB scores: The F2LLM-v2 family has shown higher scores on several language-specific tracks (e.g., German, French). However, BGE-M3 is more battle-tested in English-dominant RAG pipelines.
When to choose: If you need flexible embedding dimensions or work extensively with low-resource languages, go with F2LLM-v2-0.6B. If your use case is primarily English or high-resource multilingual, BGE-M3 remains a strong choice.

Verdict

F2LLM-v2-0.6B fills a specific gap: a small, locally runnable embedding model with world-class support for underrepresented languages. If you’re building a multilingual RAG system that must handle, say, Bengali, Ukrainian, and Yoruba alongside English, this is the model to test first. Its open license and MRL flexibility make it a practical choice for real-world deployments where storage and compute are constrained but quality cannot be sacrificed.
