Alibaba

gte-Qwen2-7B-instruct

Alibaba's Qwen2-7B-based GTE that topped MTEB English and Chinese in mid-2024.

7.1B params · Dense


Our Take

Best for: Open-source text embedding workloads

A solid 7.1B-parameter dense embedding model from Alibaba. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 7.1B
Active Params: 6.5B
Architecture: Dense
Provider: Alibaba
Download Size: 58.7 GB

Community

Monthly Downloads: 92.1K
Likes: 482
Last Updated: 1 year ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

MTEB Overall: 70.2
Retrieval: 60.1
Classification: 61.5
Clustering: 52.8
STS: 74.0

MBA Open Score: 60.7 (BB)

Benchmark (60% weight): 63.7
Popularity (25% weight): 72.2
Efficiency (15% weight): 29.6
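The three component scores and their percentages are consistent with a plain weighted average. The sketch below reproduces the composite under that assumption; the formula is inferred from the numbers shown here, not taken from a documented scoring method, and the names are illustrative.

```python
# Component scores and weights as shown on this page.
# Assumption: the composite is a simple weighted average of the three parts.
COMPONENTS = {
    "benchmark": (63.7, 0.60),
    "popularity": (72.2, 0.25),
    "efficiency": (29.6, 0.15),
}


def composite_score(components):
    """Return the weighted average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(round(score, 1))  # 60.7, matching the listed composite
```

The low efficiency component (29.6) is what pulls the composite below the benchmark component, which is expected for a 7B-class embedding model.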

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	4.5 GB
Acer Veriton GN100 AI Mini	Acer	SS	4.5 GB
AMD Instinct MI300X	AMD	SS	4.5 GB
AMD Instinct MI325X	AMD	SS	4.5 GB
AMD Instinct MI355X	AMD	SS	4.5 GB
AMD Radeon RX 7600 8GB	AMD	SS	4.5 GB
AMD Radeon RX 7700 XT	AMD	SS	4.5 GB
AMD Radeon RX 7800 XT	AMD	SS	4.5 GB
AMD Radeon RX 7900 XT	AMD	SS	4.5 GB
AMD Radeon RX 7900 XTX	AMD	SS	4.5 GB
AMD Radeon RX 9070	AMD	SS	4.5 GB
AMD Radeon RX 9070 XT	AMD	SS	4.5 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	4.5 GB
Apple M4	Apple	SS	4.5 GB
Apple M4 Max (40-core GPU)	Apple	SS	4.5 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	4.5 GB
Apple M5	Apple	SS	4.5 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	4.5 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	4.5 GB
Apple Mac Mini (M1, 2020)	Apple	SS	4.5 GB
Apple Mac Mini (M2, 2023)	Apple	SS	4.5 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	4.5 GB
Apple Mac Mini (M4, 2024)	Apple	SS	4.5 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	4.5 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	4.5 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 4 GB VRAM, refreshed hourly.

GPU	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4	Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.


About This Model

Overview

Alibaba’s gte-Qwen2-7B-instruct is a dense, 7.1B-parameter text embedding model that topped both the English and Chinese Massive Text Embedding Benchmark (MTEB) leaderboards in mid-2024. It belongs to the General Text Embedding (GTE) family and is built on the Qwen2 architecture. Unlike generative LLMs, it is designed to produce vector representations of text for tasks like semantic search, clustering, classification, and retrieval-augmented generation (RAG). It competes with other open embedding models such as the much smaller BGE-M3 (BAAI, 567M parameters) and the 7B-class intfloat/e5-mistral-7b-instruct. It is released under the Apache 2.0 license, which permits both research and commercial use as long as the license and attribution notices are preserved.

For practitioners running AI on their own hardware, gte-Qwen2-7B-instruct offers top-ranked embedding quality without requiring cloud API calls. Because the architecture is dense, the weight footprint is fixed for a given precision. Activation memory still grows with batch size and sequence length, so total memory use becomes predictable once you cap those two.

Architecture & Technical Details

The model uses a dense transformer architecture with 7.1B parameters. Unlike mixture-of-experts (MoE) models that activate only a subset of parameters per token, a dense model uses all of its parameters for every forward pass. That costs more compute per token than an MoE with the same total parameter count, but it avoids expert routing, so the cost per token is uniform. For embedding models, this is often preferable: throughput is stable and inputs batch efficiently.

Key architectural specs:

Parameters: 7.1B (dense)
Embedding dimension: 3584 (per the Hugging Face model card; verify against your own config if running from source)
Context length: 32k tokens maximum input, per the Hugging Face model card. Test with your own data if long-document embedding is critical.
Attention: Bidirectional (typical for embedding models), allowing full contextualization of each token relative to all others in the input.
Instruction tuning: Queries are prefixed with a one-line task instruction in the form "Instruct: {task}\nQuery: {query}", while documents are embedded without a prefix. Changing the task description steers the embedding toward retrieval, classification, or similarity tasks.
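As a minimal sketch of the instruction format, the helper below builds a query string in the "Instruct: ... / Query: ..." layout used on the model card. The function name and the sample task text are illustrative assumptions, not part of any library; documents are passed through unchanged.

```python
def format_query(task: str, query: str) -> str:
    """Prefix a query with a one-line task instruction.

    Assumes the "Instruct: {task}\\nQuery: {query}" layout; documents
    are embedded as-is and do not go through this function.
    """
    return f"Instruct: {task}\nQuery: {query}"


# Hypothetical task description for a retrieval workload.
task = "Given a web search query, retrieve relevant passages that answer the query"

queries = [format_query(task, "What is the capital of France?")]
documents = ["Paris is the capital and largest city of France."]  # no prefix

print(queries[0])
```

Keeping the instruction on the query side only means a document index does not have to be rebuilt when the task wording changes.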

The 7.1B size places it in the “heavy” tier for embedding models. Most production embedding models are under 1B parameters, but the extra capacity buys significantly better performance on complex semantic tasks, especially cross-lingual and fine-grained classification.

Capabilities & Use Cases

gte-Qwen2-7B-instruct excels at tasks evaluated on MTEB and C-MTEB (Chinese MTEB). Based on published benchmark results, its strengths include:

Semantic textual similarity: Measuring how similar two pieces of text are (e.g., duplicate detection, paraphrase matching).
Retrieval: Ranking documents by relevance to a query—ideal for RAG pipelines.
Classification: Both binary and multi-class text classification, with strong accuracy on Amazon reviews, sentiment, and counterfactual detection.
Clustering: Grouping documents by topic or intent without supervised labels.
Multilingual: Strong performance in English and Chinese; reasonable capability in French, Polish, and other languages (included in MTEB multilingual tasks).

Concrete use cases:

Enterprise search: Embed internal documents and query vectors to retrieve relevant policies, manuals, or knowledge base articles.
RAG for local LLMs: Pair with a local 7B–13B generative model (e.g., Llama 3, Mistral) to build a completely offline question-answering system.
Content deduplication: For large text corpora, compute embeddings and cluster near-duplicates.
Cross-lingual retrieval: Search English queries against Chinese document collections and vice versa.

The model is not a generative or conversational AI—it outputs fixed-size embeddings (vectors), not text. It is best used as a component in a larger pipeline.
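Once texts are embedded, retrieval and deduplication both reduce to comparing vectors. The sketch below ranks documents against a query by cosine similarity using only the standard library; the three-dimensional vectors are toy stand-ins for real 3584-dimensional embeddings, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real embeddings would come from the model.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

ranking = rank_documents(query_vec, doc_vecs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

For corpora beyond a few thousand documents, a vector index would normally replace this linear scan; the scoring logic stays the same.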

Running gte-Qwen2-7B-instruct Locally

Running a 7.1B dense embedding model locally is feasible on modern consumer hardware, but VRAM is the main constraint. Here’s what to expect:

VRAM Requirements by Quantization

Quantization	Approx. VRAM	Notes
FP16 (unquantized)	~14 GB	Half-precision weights; highest quality but highest VRAM.
Q8_0	~8 GB	Good quality, fits many 10GB+ GPUs.
Q4_K_M	~5–6 GB	Recommended balance: quality close to FP16, fits most 8GB GPUs.
Q4_0	~4.5 GB	Lower quality but usable for retrieval if benchmarked.
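The figures in the table follow roughly from parameter count times bits per weight. The sketch below estimates weight memory only; the bits-per-weight value for Q4_K_M is an approximation (it is a mixed-precision format), and runtime overhead such as activations and buffers is not included, which is why the table's numbers run somewhat higher.

```python
# Approximate storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 32 int8 weights plus one fp16 scale per block
    "Q4_K_M": 4.85,  # assumption: rough average for this mixed format
    "Q4_0": 4.5,     # 32 4-bit weights plus one fp16 scale per block
}


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7.1e9  # parameter count listed for this model

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gb(N_PARAMS, bits):.1f} GB")
```

Leave headroom on top of these estimates for activations, which grow with batch size and sequence length.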

Hardware Recommendations

Minimum: An RTX 3060 12GB or RTX 4060 Ti 16GB can run Q4_K_M or Q8_0 comfortably. Batch size may need to be limited to 1–4.
Recommended: RTX 4090 24GB or RTX 3090 24GB allows FP16 with batch sizes of 8–16, maximizing throughput.
Apple Silicon: M2 Pro/Max or M4 Max with 32–64GB unified memory can run Q8_0 or FP16 via MLX or llama.cpp.
CPU-only: Not practical for interactive use; embedding a single short text may take 5–10 seconds. Use only for offline batch processing with RAM >32GB.

Expected Performance

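Throughput depends heavily on quantization, batch size, sequence length, and GPU memory bandwidth. The ranges below are rough estimates for a single GPU, not measured results, so benchmark on your own hardware before sizing a deployment: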

RTX 4090: 150–300 tokens/second (FP16, batch=1); 400–700 tokens/second (Q4_K_M, batch=8).
RTX 3060 12GB: 40–80 tokens/second (Q4_K_M, batch=1).
M4 Max (64GB): 80–150 tokens/second (Q8_0, batch=1).

For retrieval pipelines, you typically embed documents once (offline) and queries online. The key online metric is latency per query, which at Q4_K_M on a 4090 should stay under about 50 ms for a short query.
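To turn a throughput estimate into an offline indexing budget, divide total tokens by tokens per second. The sketch below does that arithmetic; the corpus size and average document length are hypothetical, and the 500 tokens/second rate is simply a point inside the Q4_K_M range quoted above for an RTX 4090.

```python
def embedding_hours(n_docs: int, avg_tokens_per_doc: int, tokens_per_second: float) -> float:
    """Estimated wall-clock hours to embed a corpus at a steady throughput."""
    total_tokens = n_docs * avg_tokens_per_doc
    return total_tokens / tokens_per_second / 3600


# Hypothetical corpus: 100,000 documents averaging 200 tokens each.
hours = embedding_hours(n_docs=100_000, avg_tokens_per_doc=200, tokens_per_second=500)
print(f"{hours:.1f} hours")  # 11.1 hours
```

Because the estimate scales linearly, doubling throughput through larger batches halves the indexing time.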

Quickstart with Ollama

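Ollama is one of the quickest ways to run embedding models locally. After installing Ollama, pull the model (the tag below is the one listed here; confirm it exists in the Ollama library, or substitute a GGUF build you trust):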

ollama pull alibaba-nlp/gte-qwen2-7b-instruct

Then use the embedding API:

curl http://localhost:11434/api/embeddings -d '{
  "model": "alibaba-nlp/gte-qwen2-7b-instruct",
  "prompt": "What is the capital of France?"
}'

Ollama downloads whichever quantization the tag points to; it does not choose one based on your GPU. For more control, pull a tag built with the quantization you want (e.g., Q4_K_M), or import a GGUF file through a Modelfile.
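The same request can be made from Python with only the standard library. This is a minimal sketch, assuming an Ollama server on the default local port and the model tag shown above; the helper names are illustrative, and the call to `embed` is left commented out because it needs a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "alibaba-nlp/gte-qwen2-7b-instruct"  # tag as listed above; may differ on your install


def build_embedding_request(text, model=MODEL, url=OLLAMA_URL):
    """Build a POST request carrying the model name and the text to embed."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def embed(text):
    """Send the request and return the embedding as a list of floats."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embedding"]


# Requires a running Ollama server with the model pulled:
# vector = embed("What is the capital of France?")
# print(len(vector))
```

Splitting request construction from the network call keeps the payload easy to inspect and test without a server.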

How It Compares

Model	Parameters	Architecture	MTEB (avg)	Strengths
gte-Qwen2-7B-instruct	7.1B	Dense	~70.2 (en)	Top on English & Chinese; strong instruction tuning.
BGE-M3 (BAAI)	567M	Dense	~69.5 (en)	Much smaller VRAM (~1.5GB FP16); good for low-resource hardware, but lower quality on complex retrieval.
intfloat/e5-mistral-7b-instruct	7.1B	Dense	~69.8 (en)	Also strong; based on Mistral; slightly worse on Chinese.

When to choose gte-Qwen2-7B-instruct: You need the best possible embedding quality for multilingual (EN/ZH) retrieval or classification, and you have at least 8–12GB VRAM. In the table above it leads BGE-M3 on the English MTEB average by ~0.7 points (70.2 vs. 69.5), and its larger context window (32k reported) gives it more room for long documents.

When to choose BGE-M3: Your hardware is limited (e.g., RTX 3060 8GB, M1 Mac with 16GB), or you need faster inference with minimal resource usage. BGE-M3 also supports dense + sparse hybrid retrieval, which can improve recall in domain-specific cases.

When to choose e5-mistral-7b-instruct: You are working primarily with English and prefer the Mistral-based architecture for reasons of community support or ecosystem compatibility. Its MTEB scores are very close, but gte-Qwen2 edges ahead on Chinese benchmarks.
