
Alibaba's Qwen2-7B-based GTE that topped MTEB English and Chinese in mid-2024.
A solid 7.1B-parameter dense embedding model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 4 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Alibaba’s gte-Qwen2-7B-instruct is a dense, 7.1B-parameter text embedding model that topped both the English and Chinese Massive Text Embedding Benchmark (MTEB) leaderboards in mid-2024. It belongs to the General Text Embedding (GTE) family and is built on the Qwen2 architecture. Unlike generative LLMs, it is designed to produce vector representations of text for tasks such as semantic search, clustering, classification, and retrieval-augmented generation (RAG). It competes with other widely used embedding models, including the much smaller BGE-M3 (BAAI, 567M parameters) and the 7B-class intfloat/e5-mistral-7b-instruct. It is released under the Apache 2.0 license, which permits both research and commercial use.
For practitioners running AI on their own hardware, gte-Qwen2-7B-instruct offers state-of-the-art embedding quality without requiring cloud API calls. Because the architecture is dense, weight memory is fixed and known in advance; only activation memory grows with batch size and sequence length, which makes it a predictable choice for local deployment.
The model uses a dense transformer architecture with 7.1B parameters. Unlike mixture-of-experts (MoE) models that activate only a subset of parameters per token, a dense model uses all parameters for every forward pass. This costs more compute per token than an MoE model of the same total size, but it avoids the routing-dependent latency variation that MoE designs can show. For embedding models, this is often preferable: you get stable throughput and can batch inputs efficiently.
Key architectural specs:
- Instruction prefixes (e.g., "query: " or "document: ") guide the embedding space for retrieval vs. classification tasks.

The 7.1B size places it in the “heavy” tier for embedding models. Most production embedding models are under 1B parameters, but the extra capacity buys significantly better performance on complex semantic tasks, especially cross-lingual and fine-grained classification.
gte-Qwen2-7B-instruct excels at tasks evaluated on MTEB and C-MTEB (Chinese MTEB). Based on published benchmark results, its strengths include:
Concrete use cases:
The model is not a generative or conversational AI: it outputs fixed-size embeddings (vectors), not text. It is best used as a component in a larger pipeline.
Running a 7.1B dense embedding model locally is feasible on modern consumer hardware, but VRAM is the main constraint. Here’s what to expect:
| Quantization | Approx. VRAM | Notes |
|---|---|---|
| FP16 (unquantized) | ~14 GB | Half-precision weights with no quantization; highest quality, highest VRAM. |
| Q8_0 | ~8 GB | Good quality, fits many 10GB+ GPUs. |
| Q4_K_M | ~5–6 GB | Recommended balance: quality close to FP16, fits most 8GB GPUs. |
| Q4_0 | ~4.5 GB | Lower quality but usable for retrieval if benchmarked. |
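The weight-only portion of these figures can be approximated from the parameter count and the bits stored per weight. The sketch below is a rough estimate, not a measurement: the bits-per-weight values for the GGUF formats are assumptions (Q4_K_M in particular mixes block types, so its average varies by model), and the function name `weight_memory_gb` is hypothetical. Runtime buffers, activations, and framework overhead come on top, which is why the table values are somewhat higher for the 4-bit rows.

```python
# Approximate bits stored per weight. The GGUF values are assumptions for
# illustration; Q4_K_M mixes block types, so its real average differs by model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
}


def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes).

    Covers weights only; activations and runtime overhead are not included.
    """
    bits = BITS_PER_WEIGHT[quant]
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits / 8.0


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:7s} ~{weight_memory_gb(7.1, quant):.1f} GB of weights")
```

Leaving a margin of 1 to 2 GB above the weight estimate is a reasonable starting point when deciding whether a given GPU will fit the model, though the real margin depends on batch size and sequence length.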
Throughput depends heavily on quantization, batch size, and GPU memory bandwidth. Realistic ranges for a single GPU:
For retrieval pipelines, you typically embed documents once (offline) and queries online. The key metric is latency per query, which at Q4_K_M on a 4090 is under 50ms for a short query.
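The offline/online split described above can be sketched in a few lines. This is a minimal illustration using only the standard library: the three-dimensional vectors are toy placeholders standing in for real embeddings, and the names `build_index` and `search` are hypothetical, not part of any library. A production setup would normally use a vector database or an approximate nearest-neighbor index instead of a linear scan.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(doc_vectors):
    """Offline step: normalize each document embedding once and store it."""
    return {doc_id: normalize(vec) for doc_id, vec in doc_vectors.items()}


def search(index, query_vector, top_k=3):
    """Online step: rank stored documents by cosine similarity to the query."""
    q = normalize(query_vector)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


if __name__ == "__main__":
    # Toy vectors standing in for real model output.
    toy_docs = {
        "paris": [0.9, 0.1, 0.0],
        "berlin": [0.1, 0.9, 0.0],
        "cooking": [0.0, 0.1, 0.9],
    }
    index = build_index(toy_docs)
    print(search(index, [0.8, 0.2, 0.0], top_k=2))
```

Normalizing at index time means the per-query work is one normalization plus one dot product per document, so query latency is dominated by the embedding call itself for small collections.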
Ollama is the fastest way to run this model locally. After installing Ollama, run:

    ollama pull alibaba-nlp/gte-qwen2-7b-instruct

Then use the embedding API:

    curl http://localhost:11434/api/embeddings -d '{
      "model": "alibaba-nlp/gte-qwen2-7b-instruct",
      "prompt": "What is the capital of France?"
    }'

Ollama downloads whichever quantization the pulled model tag was published with; it does not pick one based on your GPU. For more control, you can import a specific quantized file (e.g., Q4_K_M) through Ollama’s import mechanism.
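The same embedding request can be made from Python with only the standard library. This is a sketch under assumptions: it reuses the endpoint and model tag from the curl example, assumes the server is running locally on the default port, and assumes the response JSON carries the vector under an `embedding` key. The helper names `build_request`, `parse_embedding`, and `embed` are hypothetical.

```python
import json
import urllib.request

# Endpoint and model tag copied from the curl example; adjust for your setup.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL_NAME = "alibaba-nlp/gte-qwen2-7b-instruct"


def build_request(text, model=MODEL_NAME, url=OLLAMA_URL):
    """Build the POST request carrying the model name and the text to embed."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(raw_body):
    """Extract the vector, assuming the response has an 'embedding' list."""
    data = json.loads(raw_body)
    embedding = data.get("embedding")
    if not isinstance(embedding, list) or not embedding:
        raise ValueError("response did not contain an 'embedding' list")
    return [float(x) for x in embedding]


def embed(text, timeout=60.0):
    """Send one text to the local server and return its embedding vector."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_embedding(resp.read())
```

With the server running, `embed("What is the capital of France?")` returns the vector as a list of floats. Keeping request construction and response parsing in separate functions makes both testable without a live server.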
| Model | Parameters | Architecture | MTEB (avg) | Strengths |
|---|---|---|---|---|
| gte-Qwen2-7B-instruct | 7.1B | Dense | ~70.2 (en) | Top on English & Chinese; strong instruction tuning. |
| BGE-M3 (BAAI) | 567M | Dense | ~69.5 (en) | Much smaller VRAM (~1.5GB FP16); good for low-resource hardware, but lower quality on complex retrieval. |
| intfloat/e5-mistral-7b-instruct | 7.1B | Dense | ~69.8 (en) | Also strong; based on Mistral; slightly worse on Chinese. |
When to choose gte-Qwen2-7B-instruct: You need the best possible embedding quality for multilingual (EN/ZH) retrieval or classification, and you have at least 8–12GB VRAM. It outranks BGE-M3 on MTEB by ~0.7 points and handles long documents better due to larger context (32k reported).
When to choose BGE-M3: Your hardware is limited (e.g., RTX 3060 8GB, M1 Mac with 16GB), or you need faster inference with minimal resource usage. BGE-M3 also supports dense + sparse hybrid retrieval, which can improve recall in domain-specific cases.
When to choose e5-mistral-7b-instruct: You are working primarily with English and prefer the Mistral-based architecture for reasons of community support or ecosystem compatibility. Its MTEB scores are very close, but gte-Qwen2 edges ahead on Chinese benchmarks.
