CodeFuse-AI (Ant Group)

F2LLM-v2-1.7B

Compact 1.7B multilingual embedder for resource-constrained deployments.

1.7B params · Dense

Our Take

Best for: Open-source text embedding workloads

A workable 1.7B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 1.7B

Active Params: 1.4B

Architecture: Dense

Provider: CodeFuse-AI (Ant Group)

Download Size: 130.8 GB

Community

Monthly Downloads: 3.3K

Likes: 6

Last Updated: 1 month ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Apache 2.0

Performance & Scoring

Benchmarks

Retrieval

62.0

Classification

67.7

Clustering

58.8

STS

75.8

MBA Open Score

53.5 CC

Benchmark (60% weight)

66.0

Popularity (25% weight)

15.6

Efficiency (15% weight)

66.7
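
The component scores and weights listed above are consistent with a simple weighted sum. The sketch below checks that arithmetic; the weighted-sum formula itself is an assumption inferred from the numbers, not a documented scoring method, and the names `components` and `composite_score` are illustrative.

```python
# Component scores and weights copied from the listing above.
# Assumption: the overall score is a plain weighted sum of the three parts.
components = {
    "benchmark": (66.0, 0.60),
    "popularity": (15.6, 0.25),
    "efficiency": (66.7, 0.15),
}


def composite_score(parts):
    """Return the weighted sum of (score, weight) pairs."""
    return sum(score * weight for score, weight in parts.values())


total = composite_score(components)
print(f"{total:.1f}")  # matches the 53.5 shown on the page
```

Under this assumption the low popularity figure, not benchmark quality, is what pulls the overall score down.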

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.4 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.4 GB
AMD Instinct MI300X	AMD	SS	1.4 GB
AMD Instinct MI325X	AMD	SS	1.4 GB
AMD Instinct MI355X	AMD	SS	1.4 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.4 GB
AMD Radeon RX 7700 XT	AMD	SS	1.4 GB
AMD Radeon RX 7800 XT	AMD	SS	1.4 GB
AMD Radeon RX 7900 XT	AMD	SS	1.4 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.4 GB
AMD Radeon RX 9070	AMD	SS	1.4 GB
AMD Radeon RX 9070 XT	AMD	SS	1.4 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.4 GB
Apple M4	Apple	SS	1.4 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.4 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.4 GB
Apple M5	Apple	SS	1.4 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.4 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.4 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.4 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.4 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.4 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.4 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.4 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.4 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai	Spot	24 GB	$0.03
NVIDIA L4	Vast.ai	On-Demand	24 GB	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai	Spot	16 GB	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai	On-Demand	16 GB	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

About This Model

Overview

F2LLM-v2-1.7B is a general-purpose, multilingual embedding model developed by CodeFuse-AI (Ant Group). It’s part of the F2LLM-v2 family, which spans eight sizes from 80M to 14B parameters, all trained on a curated composite of 60 million publicly available high-quality text samples. The 1.7B variant is the sweet spot for practitioners who need strong multilingual retrieval and semantic understanding without the VRAM footprint of larger models.

This is not a chatbot or generative model; it is a dense embedding model optimized for feature extraction and sentence embeddings. Its primary value is producing high-quality vector representations for search, clustering, classification, and retrieval-augmented generation (RAG) pipelines, especially in environments where you cannot offload to cloud APIs. The model supports over 200 languages, with particular attention to mid- and low-resource languages that are often underserved by mainstream embedders.

F2LLM-v2-1.7B competes with smaller multilingual embedders such as multilingual-e5-small (118M parameters) and bge-m3 (567M parameters). What sets it apart is its open-source transparency: Ant Group released the full training recipe, intermediate checkpoints, and data, making it a reproducible, auditable choice for production systems.

Architecture & Technical Details

F2LLM-v2-1.7B is a dense transformer embedding model with 1.7 billion parameters. It is built on a decoder-only backbone (Qwen3) and fine-tuned exclusively for embedding tasks via a two-stage pipeline:

Stage 1: Contrastive pretraining on the 60M multilingual dataset, learning to align semantically similar texts across languages.
Stage 2: Instruction tuning with Matryoshka Representation Learning (MRL) and knowledge distillation. MRL lets the model produce variable-dimension embeddings (e.g., 256, 512, 1024) from a single forward pass, giving you control over storage vs. accuracy tradeoffs without retraining.
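
MRL-trained embeddings are usually shortened by keeping the leading components of the full vector and re-normalizing. The sketch below shows that operation on a toy vector in plain Python. Which dimensions F2LLM-v2-1.7B actually supports is not stated here, so the sizes used are placeholders, and `truncate_and_normalize` is an illustrative helper rather than part of any library.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a real model output.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]
short = truncate_and_normalize(full, 4)
print(short)
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable for the shortened vectors.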

The model is designed for the transformers library and integrates directly with Sentence Transformers. It is a dense model with no mixture-of-experts (MoE) routing, so inference memory and latency are straightforward to predict.

Context length is not officially specified. Given the Qwen3 backbone and typical embedding model defaults, a limit of at least 8192 tokens is plausible, but confirm it against the model configuration before relying on it. For most embedding use cases (sentences, paragraphs, documents under 8K tokens), this is sufficient.

Capabilities & Use Cases

F2LLM-v2-1.7B is aimed at multilingual semantic search and cross-lingual retrieval. The F2LLM-v2 family reports state-of-the-art results on 11 language-specific MTEB leaderboards, including European, Scandinavian, Indic, and East Asian languages. Key strengths:

Cross-lingual retrieval: Query in English, retrieve relevant documents in Hindi, Vietnamese, or Persian with near-native accuracy.
Low-resource language support: Covers languages like Swahili, Burmese, Khmer, and Lao that most embedders ignore.
Matryoshka embeddings: Reduce storage costs by using 256‑dim vectors for approximate search, then re-rank with 1024‑dim for precision—all from the same model.
RAG pipelines: Use it as the embedding backbone for local RAG systems where the knowledge base spans multiple languages.
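
The shortlist-then-re-rank pattern described under Matryoshka embeddings can be sketched as follows. This is a minimal brute-force illustration on toy 4-dimensional vectors, assuming cosine similarity as the metric; a production system would use a vector index for the first stage, and `two_stage_search` and its parameters are hypothetical names.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, docs, coarse_dim, shortlist_size, top_k):
    """Shortlist with truncated vectors, then re-rank the shortlist with full vectors."""
    shortlist = sorted(
        range(len(docs)),
        key=lambda i: cosine(query[:coarse_dim], docs[i][:coarse_dim]),
        reverse=True,
    )[:shortlist_size]
    reranked = sorted(shortlist, key=lambda i: cosine(query, docs[i]), reverse=True)
    return reranked[:top_k]


query = [1.0, 0.0, 1.0, 0.0]
docs = [
    [1.0, 0.0, 0.0, 1.0],  # index 0: matches the query on the leading dims only
    [0.9, 0.1, 1.0, 0.0],  # index 1: close to the query on all dims
    [0.0, 1.0, 0.0, 1.0],  # index 2: unrelated
]
result = two_stage_search(query, docs, coarse_dim=2, shortlist_size=2, top_k=1)
print(result)
```

In this toy data the truncated vectors rank document 0 first, while the full-dimension re-rank promotes document 1, which is the correction the second stage exists to make.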

Concrete use cases:

A multilingual FAQ system for a global customer base, where user queries come in Spanish, Arabic, or Thai.
Document clustering for a research team working with papers in Japanese, German, and Dutch.
Local-first search over a company’s internal knowledge base in 20+ languages, running on a single consumer GPU.

Running F2LLM-v2-1.7B Locally

This model is designed for resource-constrained deployments. Here’s what you need to run it on your own hardware.

VRAM Requirements

Quantization	VRAM (approx.)	Notes
FP16 (half precision, unquantized)	~3.5 GB	Best accuracy, but overkill for most retrieval tasks
Q8_0	~2.0 GB	Near-lossless compression, recommended for high-precision use
Q4_K_M	~1.2 GB	Sweet spot for most users: good accuracy, minimal memory
Q4_0	~1.0 GB	Slightly lower quality, for the tightest memory budgets
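
The figures in the table can be sanity-checked from the parameter count and the bits stored per weight. The sketch below estimates weight memory only; runtime overhead such as activations and buffers is ignored, which is why the table values sit slightly higher. The Q8_0 and Q4_0 values follow from their GGUF block layouts, while the Q4_K_M value is an approximate average assumed for illustration.

```python
# Bits stored per weight. FP16 is 16 by definition; Q8_0 (8.5) and Q4_0 (4.5)
# follow from 32-weight blocks with a 16-bit scale. Q4_K_M mixes block types,
# so 4.85 is an assumed approximate average, not an exact figure.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q4_0": 4.5}

N_PARAMS = 1.7e9  # parameter count from the model specifications


def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


estimates = {
    name: weight_memory_gb(N_PARAMS, bpw) for name, bpw in BITS_PER_WEIGHT.items()
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.2f} GB of weights")
```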

Recommended Hardware

Minimum: Any GPU with at least 2 GB VRAM can run Q4_K_M comfortably; cards such as the GTX 1060 6GB or RTX 3050 leave ample headroom.
Recommended: RTX 3060 12GB or RTX 4060—allows FP16 inference with headroom for batching.
Apple Silicon: M1/M2/M3/M4 with 8 GB RAM can run Q4_K_M via llama.cpp or MLX. Expect 50–100 tokens/second on an M4 Max.
CPU-only: Possible with Q4_K_M and 8 GB system RAM, but expect 10–20 tokens/second.

Performance Expectations

On an RTX 4090 with Q4_K_M, you can process 1000+ tokens/second (batch size 1). At that rate, embedding a sentence (~32 tokens) takes roughly 32 ms or less. Throughput generally increases with batch size until the GPU's compute or memory limit is reached.
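
To compare the hardware options in this section, per-sentence latency can be derived from the quoted throughput figures. The sketch below takes those figures at face value and assumes a constant processing rate with no fixed per-request overhead, so treat the results as rough estimates; `latency_ms` is an illustrative helper.

```python
def latency_ms(n_tokens, tokens_per_second):
    """Time to process n_tokens at a constant rate, in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second * 1000.0


SENTENCE_TOKENS = 32  # sentence length used in the text above

# Throughput figures quoted in this section, in tokens/second.
rates = {
    "RTX 4090, Q4_K_M": 1000,
    "M4 Max (upper figure)": 100,
    "CPU-only (upper figure)": 20,
}
latencies = {name: latency_ms(SENTENCE_TOKENS, rate) for name, rate in rates.items()}
for name, ms in latencies.items():
    print(f"{name}: ~{ms:.0f} ms per sentence")
```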

Quick Start with Ollama

The fastest way to get started:

ollama run codefuse-ai/f2llm-v2-1.7b

This pulls the Q4_K_M quantized model and provides a simple API for embedding. For more control, use the transformers library with sentence-transformers:

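from sentence_transformers import SentenceTransformer

# Downloads the weights from Hugging Face on first use.
model = SentenceTransformer('codefuse-ai/F2LLM-v2-1.7B')
# encode() returns one embedding vector per input string.
embeddings = model.encode(["Your text here"])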

How It Compares

Model	Parameters	Languages	Strengths	Tradeoffs
F2LLM-v2-1.7B	1.7B	200+	Strong low-resource language support, open-source training data, MRL	Nearly 3× larger than 0.6B alternatives, no generative capability
multilingual-e5-small	118M	100+	Very small, fast	Weaker on low-resource languages, lower accuracy overall
bge-m3	567M	100+	Good general multilingual performance	Larger than e5-small, less transparent training data

Choose F2LLM-v2-1.7B when multilingual retrieval accuracy matters more than footprint, especially if your use case includes languages like Hindi, Arabic, or Vietnamese. If you're strictly English-only or need a smaller model, the 0.6B variant of F2LLM-v2 may be a better fit. For pure speed and minimal VRAM, multilingual-e5-small is still a solid option, but you'll sacrifice accuracy on mid- and low-resource languages.

Related Models

Model	Parameters	Architecture	Provider
F2LLM-v2-14B	14B	Dense	CodeFuse-AI (Ant Group)
F2LLM-v2-8B	7.6B	Dense	CodeFuse-AI (Ant Group)
F2LLM-v2-4B	4B	Dense	CodeFuse-AI (Ant Group)
F2LLM-v2-0.6B	0.596B	Dense	CodeFuse-AI (Ant Group)
