Tests whether a model obeys precise formatting, length, and constraint rules.
IFBench measures whether a model actually follows precise instructions. Each prompt carries verifiable rules, such as: respond in exactly four sentences, include the word "elephant" twice, or output JSON with a given set of exact keys. The benchmark rewards obedience rather than creativity, which makes it a strong signal for production reliability.
Each prompt has one or more programmatic verifiers. The model’s output is checked rule-by-rule. The score is the fraction of rules satisfied, averaged across prompts.
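
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not IFBench's actual verifier code: the function names (`exactly_n_sentences`, `word_appears`, `json_with_exact_keys`, `prompt_score`, `benchmark_score`) are hypothetical, the sentence splitter is deliberately naive, and treating "twice" as "exactly twice" is an assumption.

```python
import json
import re
from typing import Callable, List, Tuple

# A verifier takes the model output and returns True if the rule is satisfied.
Verifier = Callable[[str], bool]


def exactly_n_sentences(n: int) -> Verifier:
    """Rule: the output has exactly n sentences (naive split on . ! ?)."""
    def check(text: str) -> bool:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        return len(sentences) == n
    return check


def word_appears(word: str, times: int) -> Verifier:
    """Rule: the whole word appears exactly `times` times (assumed reading)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    def check(text: str) -> bool:
        return len(pattern.findall(text)) == times
    return check


def json_with_exact_keys(keys: List[str]) -> Verifier:
    """Rule: the output parses as a JSON object with exactly these keys."""
    def check(text: str) -> bool:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and set(obj) == set(keys)
    return check


def prompt_score(output: str, verifiers: List[Verifier]) -> float:
    """Fraction of this prompt's rules that the output satisfies."""
    return sum(v(output) for v in verifiers) / len(verifiers)


def benchmark_score(items: List[Tuple[str, List[Verifier]]]) -> float:
    """Average of the per-prompt fractions across all prompts."""
    return sum(prompt_score(out, vs) for out, vs in items) / len(items)


examples = [
    # Four sentences, but "elephant" as a whole word appears only once:
    # one of two rules passes, so this prompt scores 0.5.
    ("Elephants are large. An elephant never forgets. "
     "They live in herds. I like them.",
     [exactly_n_sentences(4), word_appears("elephant", 2)]),
    # Valid JSON with exactly the required keys: this prompt scores 1.0.
    ('{"name": "Ada", "age": 36}',
     [json_with_exact_keys(["name", "age"])]),
]

print(benchmark_score(examples))  # 0.75
```

The first example shows why per-rule scoring matters: an output can satisfy the length rule and still lose credit on the word-count rule. Multiplying the result by 100 would give a value on the same scale as the leaderboard scores, assuming those are percentages.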

| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Grok 4.3 | xAI | Closed | 81.3 |
| 02 | MiMo-V2.5-Pro | Xiaomi | Closed | 79.9 |
| 03 | DeepSeek-V4-Flash | DeepSeek | Open | 79.2 |
| 04 | Qwen3.5-397B-A17B | Alibaba | Open | 78.8 |
| 05 | Gemini 3.1 Flash Lite Preview | Google | Closed | 77.2 |
| 06 | Gemini 3.1 Pro Preview | Google | Closed | 77.1 |
| 07 | Qwen3.6 Max Preview | Alibaba | Closed | 76.6 |
| 08 | DeepSeek-V4-Pro | DeepSeek | Open | 76.5 |
| 09 | Gemini 3.5 Flash | Google | Closed | 76.3 |
| 10 | GLM-5.1 | Z.ai | Open | 76.3 |
| 11 | Kimi K2.6 | Moonshot AI | Open | 76.0 |
| 12 | GPT-5.5 | OpenAI | Closed | 75.8 |
| 13 | MiniMax M2.7 | MiniMax | Closed | 75.7 |
| 14 | Qwen3.5-122B-A10B | Alibaba | Open | 75.7 |
| 15 | Qwen3.5-27B | Alibaba | Open | 75.6 |

30 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Real applications need predictable output: JSON with specific keys, summaries with specific lengths, responses in specific languages. A high IFBench score means the model will hold up when the prompt has hard constraints.