Benchmarks · 2023

IFBench: Instruction Following Benchmark

Name: IFBench: Instruction Following Benchmark
Creator: Google Research
Published: 2023
Keywords: IFBench, AI benchmark, text model evaluation, Google Research

Tests whether a model obeys precise formatting, length, and constraint rules.

Models Tested: 130
Top Score: 82.9
Published: 2023
Source: Google Research

How It Works

IFBench measures whether a model actually follows precise instructions. Each prompt has verifiable rules: respond in exactly four sentences, include the word "elephant" twice, output JSON with these exact keys. The benchmark cares about obedience, not creativity, which makes it a strong signal for production reliability.

Each prompt has one or more programmatic verifiers. The model’s output is checked rule-by-rule. The score is the fraction of rules satisfied, averaged across prompts.
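The rule-by-rule scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the verifier factories `exactly_n_sentences` and `word_appears_n_times`, the naive sentence splitter, and the helpers `prompt_score` and `benchmark_score` are all hypothetical names introduced here, and "twice" is assumed to mean exactly twice.

```python
import re
from typing import Callable, List, Tuple

# A verifier takes the model output and returns True if the rule is satisfied.
Verifier = Callable[[str], bool]

def exactly_n_sentences(n: int) -> Verifier:
    """Hypothetical rule: the response has exactly n sentences (naive split on . ! ?)."""
    def check(text: str) -> bool:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        return len(sentences) == n
    return check

def word_appears_n_times(word: str, n: int) -> Verifier:
    """Hypothetical rule: the word appears exactly n times, case-insensitively."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    def check(text: str) -> bool:
        return len(pattern.findall(text)) == n
    return check

def prompt_score(output: str, rules: List[Verifier]) -> float:
    """Fraction of this prompt's rules that the output satisfies."""
    return sum(rule(output) for rule in rules) / len(rules)

def benchmark_score(results: List[Tuple[str, List[Verifier]]]) -> float:
    """Average of the per-prompt fractions across all prompts."""
    per_prompt = [prompt_score(output, rules) for output, rules in results]
    return sum(per_prompt) / len(per_prompt)

# Two made-up prompts with made-up model outputs.
output_1 = "The elephant walked. It was big. Another elephant came. They left."
rules_1 = [exactly_n_sentences(4), word_appears_n_times("elephant", 2)]

output_2 = "An elephant appeared. It left."
rules_2 = [exactly_n_sentences(4), word_appears_n_times("elephant", 1)]

overall = benchmark_score([(output_1, rules_1), (output_2, rules_2)])
```

In this toy run the first output satisfies both of its rules and the second satisfies one of two, so the per-prompt fractions are 1.0 and 0.5 and the overall score is their average.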

Dataset size: 500+ prompts with verifiable instructions like word counts, JSON formatting, language constraints, and section structure.
Mean score: 56.3
Median score: 55.3
Open / Closed: 57 / 73

Top Scorers

#	Model	Lab	Source	Score
01	Grok 4.20 Beta 0309 Reasoning	xAI	Closed	82.9
02	MiniMax M3	MiniMax	Open	82.9
03	Grok 4.3	xAI	Closed	81.3
04	Grok 4.3 beta	xAI	Closed	81.3
05	Qwen 3.7 Max	Alibaba	Closed	80.5
06	MiMo-V2.5-Pro	Xiaomi	Closed	79.9
07	DeepSeek-V4-Flash	DeepSeek	Open	79.2
08	Qwen3.5-397B-A17B	Alibaba	Open	78.8
09	Gemini 3 Flash (Thinking Minimal)	Google	Closed	78.0
10	Gemini 3.1 Flash Lite Preview	Google	Closed	77.2
11	Gemini 3.1 Pro Preview	Google	Closed	77.1
12	Qwen3.6 Max Preview	Alibaba	Closed	76.6
13	DeepSeek-V4-Pro	DeepSeek	Open	76.5
14	Gemini 3.5 Flash	Google	Closed	76.3
15	GLM-5.1	Z.ai	Open	76.3

Open vs Closed Source

Gap on IFBench: +0.1 pts (closed leads)

Top Open-Source Models

1	MiniMax M3	82.9
2	DeepSeek-V4-Flash	79.2
3	Qwen3.5-397B-A17B	78.8

Top Closed-Source Models

1	Grok 4.20 Beta 0309 Reasoning	82.9
2	Grok 4.3	81.3
3	Grok 4.3 beta	81.3

Score vs Parameter Count

71 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average Score	n
MiniMax	76.7	3
Xiaomi	71.9	3
NVIDIA	67.3	2
OpenAI	65.3	21
Z.ai	61.8	6
Alibaba	61.7	15
Moonshot AI	60.1	6
xAI	59.2	9
Google	53.2	20
DeepSeek	52.8	6
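A per-lab average of this kind is a plain group-by-and-mean. The sketch below assumes the page averages every tested model from a lab with equal weight; the helper `average_by_lab` is a hypothetical name, and the sample rows are only the xAI, MiniMax, and DeepSeek entries from the Top Scorers table, so the results are deliberately not comparable to the full-database averages listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def average_by_lab(rows: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, int]]:
    """Return {lab: (mean score, number of models)} for (lab, score) rows."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for lab, score in rows:
        totals[lab] += score
        counts[lab] += 1
    return {lab: (totals[lab] / counts[lab], counts[lab]) for lab in totals}

# Subset of the Top Scorers table only (not the full set of 130 models).
sample_rows = [
    ("xAI", 82.9),
    ("MiniMax", 82.9),
    ("xAI", 81.3),
    ("xAI", 81.3),
    ("DeepSeek", 79.2),
    ("DeepSeek", 76.5),
]

lab_averages = average_by_lab(sample_rows)
for lab, (mean, n) in sorted(lab_averages.items(), key=lambda kv: -kv[1][0]):
    print(f"{lab}\t{mean:.1f}\tn = {n}")
```

Because only top-ranked models are included here, each lab's mean in this toy run is far higher than its figure in the table, which also reflects that lab's weaker models.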

Most Correlated Benchmarks

Benchmark	Pearson r	n
SWE-Pro	+0.94	14
LiveCodeBench	+0.84	75
HLE	+0.79	130
GPQA	+0.78	130
AA Intelligence Index	+0.76	130
AA LCR	+0.76	130
Terminal Bench Hard	+0.71	125
Terminal Bench	+0.70	17
SciCode	+0.67	130
MMLU-PRO	+0.64	86
MATH-500	+0.64	41
Arena Score	+0.54	105
HMMT 2026	+0.26	9
SWE-Verified	+0.24	19
AIME 2026	-0.21	10
Pearson r ranges from −1 to +1. Positive means models that score higher on IFBench tend to score higher on the other benchmark; negative means the opposite.
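For readers who want to reproduce a figure like these, here is a minimal sketch of Pearson r over the models that two benchmarks have in common. It assumes that n counts the models with a score on both benchmarks, which would explain why n varies from row to row; the function name `pearson_r` and all model names and scores below are made up for illustration.

```python
import math
from typing import Dict, Tuple

def pearson_r(a: Dict[str, float], b: Dict[str, float]) -> Tuple[float, int]:
    """Pearson correlation over models present in both score dicts.

    Returns (r, n), where n is the number of shared models.
    """
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / math.sqrt(var_x * var_y), n

# Made-up scores: model-d has no score on the second benchmark, so it is skipped.
ifbench = {"model-a": 80.0, "model-b": 60.0, "model-c": 40.0, "model-d": 70.0}
other = {"model-a": 50.0, "model-b": 40.0, "model-c": 30.0}

r, n = pearson_r(ifbench, other)
```

With very few shared models, as in the rows with n around 10, a single outlier can move r substantially, so those rows deserve more caution than the n = 130 rows.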

What It Captures Well

Programmatic scoring means no judge model is needed.
Strong predictor of production usefulness for structured output.
Catches models that are clever but disobedient.

Where It Falls Short

Does not measure response quality, only rule adherence.
A model can pass every rule and still give a bad answer.
Some rules are subjective (tone, formality).

Frequently Asked Questions

Why does IFBench matter for production?

Real applications need predictable output: JSON with specific keys, summaries with specific lengths, responses in specific languages. A high IFBench score suggests the model is more likely to hold up when the prompt has hard constraints.
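As an illustration of the kind of hard constraint meant here, an application might gate a model response on a check like the one below before passing it downstream. This is a sketch under assumptions, not part of IFBench: `has_exact_keys` and `within_word_limit` are hypothetical helper names, and the key names and limits are invented for the example.

```python
import json
from typing import Iterable

def has_exact_keys(raw: str, required: Iterable[str]) -> bool:
    """True if raw parses as a JSON object whose keys are exactly the required ones."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == set(required)

def within_word_limit(text: str, max_words: int) -> bool:
    """True if the text has at most max_words whitespace-separated words."""
    return len(text.split()) <= max_words

good = '{"title": "Quarterly update", "summary": "Revenue rose."}'
extra_key = '{"title": "Quarterly update", "summary": "Revenue rose.", "notes": ""}'
not_json = "Sure! Here is the JSON you asked for."

checks = [
    has_exact_keys(good, ["title", "summary"]),
    has_exact_keys(extra_key, ["title", "summary"]),
    has_exact_keys(not_json, ["title", "summary"]),
]
```

A model that follows instructions reliably should fail such a gate less often, which means fewer retries or fallbacks in the calling code.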

Related Benchmarks

Based on score correlations across our database.

SWE-Pro: Pearson r +0.94
