Benchmarks · 2024

Global MMLU Lite

Name: Global MMLU Lite
Creator: Hugging Face
Published: 2024
Keywords: Global MMLU Lite, AI benchmark, text model evaluation, Hugging Face

A lighter, multilingual variant of MMLU that covers 14 languages and keeps the original subject mix.

Open Dataset
Models Tested: none yet
Top Score: none yet
Published: 2024
Source: Hugging Face

How It Works

Global MMLU Lite tests broad academic knowledge across languages, not just English. Questions are translated from the original MMLU set into 14 languages, with cultural adjustments where a literal translation would not work. It gives a direct signal of how well a model holds up outside English.

Multiple-choice questions are presented in each target language. Models are evaluated zero-shot, and the score is percent correct, averaged across languages or reported per-language.
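As a rough illustration of the scoring described above, the sketch below computes per-language accuracy and the unweighted average across languages. The record layout (`lang`, `gold`, `pred`) and the function name `score_by_language` are assumptions made for this example, not part of the benchmark's own tooling, and the toy records are invented.

```python
from collections import defaultdict


def score_by_language(records):
    """Return (per_language_accuracy, average_across_languages).

    Each record is a dict with hypothetical keys:
      lang - language code
      gold - correct answer letter
      pred - answer letter produced by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["lang"]] += 1
        # Compare letters case-insensitively, ignoring stray whitespace.
        if r["pred"].strip().upper() == r["gold"].strip().upper():
            correct[r["lang"]] += 1
    per_lang = {lang: correct[lang] / total[lang] for lang in total}
    # Unweighted mean: every language counts equally.
    average = sum(per_lang.values()) / len(per_lang) if per_lang else 0.0
    return per_lang, average


# Toy data, invented for illustration only.
records = [
    {"lang": "en", "gold": "A", "pred": "A"},
    {"lang": "en", "gold": "B", "pred": "C"},
    {"lang": "sw", "gold": "D", "pred": "d"},
    {"lang": "sw", "gold": "A", "pred": "A"},
]
per_lang, average = score_by_language(records)
```

Averaging the per-language accuracies weights each language equally. Pooling all questions into one accuracy figure would instead weight languages by their question counts, so the two numbers can differ when counts are uneven.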

Dataset size: 4,400 multiple-choice questions across 14 languages and the original MMLU subject mix.
Mean score: 0.0
Median score: 0.0
Open / Closed: 0 / 0

Top Scorers

No scores yet for this benchmark.

Score Distribution

Not enough scored models yet.

Most Correlated Benchmarks

Not enough scored models yet.

What It Captures Well

Measures knowledge and reasoning directly in each target language, rather than translation ability.
Wide language coverage, including lower-resource languages.
Same subject mix as the original MMLU, so scores can be read against original MMLU results.

Where It Falls Short

The multiple-choice format masks reasoning quality: a correct letter says nothing about how the model reached it.
Translation quality varies across languages.
Sample sizes are small for some languages.
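To make the sample-size concern concrete, the sketch below computes an approximate 95% Wilson score interval for an observed accuracy. The sample sizes and the 70% accuracy are hypothetical values chosen for illustration; they are not the benchmark's actual per-language counts or any model's score.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical sample sizes, for illustration only.
for n in (100, 400, 4400):
    low, high = wilson_interval(round(0.7 * n), n)
    print(f"n={n}: observed 70%, interval width {high - low:.3f}")
```

The interval narrows as the number of questions grows, so a per-language score built on a few hundred questions carries noticeably more uncertainty than the pooled score.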

Related Benchmarks

Based on score correlations (Pearson r) across our database. None are available yet because no models have been scored.
