
PhyArena

Benchmarking the physics reasoning of LLMs & MLLMs — from Olympiads to problem sets
🏆
First Physics-Olympiad leaderboard — 13 physics Olympiads from 2024–2025 with official medal cutoffs.
🤖
Direct human-model comparison — HiPhO benchmarks (M)LLM exam scores against human contestants.
🎓
Rich difficulty spectrum — PHYSICS spans high school, Olympiad, undergraduate, and graduate levels.

HiPhO

Physics Olympiad

PHYSICS

Mixed Difficulty

HiPhO (High School Physics Olympiad Benchmark)





Metric: exam score.
Legend: Closed-source MLLM, Open-source MLLM, Open-source LLM
1. Cells are color-coded based on official medal thresholds. Models are ranked by Gold ↓, then Silver ↓, then Bronze ↓, with ties broken by IPhO 2025 score ↓.
2. Medal cutoffs are derived from the theoretical exam scores of human medalists.
3. Only the theoretical components of each exam are evaluated; experimental and diagram-drawing problems are excluded, so Full Mark (Model) ≤ Full Mark (Human).
4. Each model was run 8 times. Per-problem scores were averaged over the 8 runs, then summed across problems to compute the final exam score.
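
The scoring and ranking rules in the notes above can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's actual code: the function names, the data layout (one dict of problem scores per run), and the sample numbers are all hypothetical. It also assumes that a medal is awarded when the exam score is at or above the cutoff, and that "Gold ↓" means the number of gold-level exams in descending order.

```python
from statistics import mean


def exam_score(runs):
    """Average each problem's score over all runs, then sum over problems.

    runs: list of dicts mapping problem id -> score, one dict per run
    (hypothetical layout; every run is assumed to cover the same problems).
    """
    problems = runs[0].keys()
    return sum(mean(run[p] for run in runs) for p in problems)


def medal(score, cutoffs):
    """Return 'gold', 'silver', 'bronze', or None for one exam.

    cutoffs: dict with minimum scores for 'gold', 'silver', 'bronze'.
    Assumption: a score equal to the cutoff earns the medal.
    """
    for name in ("gold", "silver", "bronze"):
        if score >= cutoffs[name]:
            return name
    return None


def rank_models(results, cutoffs, tiebreak_exam="IPhO 2025"):
    """Order models by gold count, silver count, bronze count, then tiebreak score.

    results: model name -> {exam name -> exam score}
    cutoffs: exam name -> medal cutoffs for that exam
    """
    def key(model):
        medals = [medal(s, cutoffs[e]) for e, s in results[model].items()]
        # Negate every field so that larger values sort first.
        return (
            -medals.count("gold"),
            -medals.count("silver"),
            -medals.count("bronze"),
            -results[model].get(tiebreak_exam, 0.0),
        )
    return sorted(results, key=key)


# Hypothetical data: two runs of a two-problem exam.
demo_runs = [{"Q1": 4.0, "Q2": 6.0}, {"Q1": 6.0, "Q2": 8.0}]
demo_total = exam_score(demo_runs)

# Hypothetical cutoffs and scores for three made-up models.
demo_cutoffs = {
    "IPhO 2025": {"gold": 20, "silver": 15, "bronze": 10},
    "APhO 2025": {"gold": 20, "silver": 15, "bronze": 10},
}
demo_results = {
    "model_a": {"IPhO 2025": 21, "APhO 2025": 12},
    "model_b": {"IPhO 2025": 22, "APhO 2025": 16},
    "model_c": {"IPhO 2025": 25, "APhO 2025": 12},
}
demo_ranking = rank_models(demo_results, demo_cutoffs)
```

In this toy data all three models earn one gold. `model_b` ranks first because it adds a silver, and `model_c` is placed ahead of `model_a` by the IPhO 2025 tiebreak.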

What is PhyArena?

PhyArena is an open leaderboard and evaluation suite for measuring physics reasoning in LLMs and MLLMs. It combines recent Physics Olympiad exams with curated problem sets that range from high-school to graduate level.

We provide two complementary benchmarks. Click the links below to explore datasets, projects, and papers.

HiPhO — Physics Olympiad (multimodal)

HiPhO compiles the 13 most recent physics Olympiads from 2024–2025 (IPhO, APhO, EuPhO, NBPhO, PanPhO, PanMechanics, F=MA). It spans both international and regional competitions and covers mixed modalities, with problems ranging from text-only to diagram-based.

Hugging Face: dataset

GitHub: project

Paper: pdf

PHYSICS — Mixed Difficulty (text-only)

PHYSICS spans high-school, Olympiad, undergraduate, and graduate levels, offering text-only problems that test scaling and generalization beyond Olympiad exams. Its exercises are drawn from over 100 textbooks and curated through a quality-control pipeline.

Hugging Face: dataset

GitHub: project

Paper: pdf