1. Cells are color-coded based on official medal thresholds. Models are ranked by Gold ↓, then Silver ↓, then Bronze ↓, with ties broken by IPhO 2025 score ↓.
2. Medal cutoffs are derived from the theoretical exam scores of human medalists.
3. Only the theoretical components of each exam are evaluated; experimental and diagram-drawing problems are excluded, so Full Mark (Model) ≤ Full Mark (Human).
4. Each model was run 8 times. Problem scores were averaged and summed to compute the final exam score.
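The scoring and ranking rules in notes 1 and 4 can be sketched in Python as follows. This is a minimal illustration, not the leaderboard's actual code: the function names, the input layout, and the assumption that a score at or above a cutoff earns the medal are all hypothetical.

```python
from statistics import mean


def exam_score(runs):
    """Average each problem's score across runs, then sum the averages.

    `runs` is a list of runs (8 on the leaderboard); each run is a list of
    per-problem scores in the same problem order. Layout is an assumption.
    """
    per_problem_avg = [mean(scores) for scores in zip(*runs)]
    return sum(per_problem_avg)


def medal(score, cutoffs):
    """Return the medal tier for a score, or None if below bronze.

    `cutoffs` maps "gold"/"silver"/"bronze" to minimum scores. Treating the
    cutoff as inclusive (score >= cutoff) is an assumption.
    """
    for tier in ("gold", "silver", "bronze"):
        if score >= cutoffs[tier]:
            return tier
    return None


def rank_models(results):
    """Rank by Gold, then Silver, then Bronze, then IPhO 2025 score, all descending.

    `results` maps a model name to (gold, silver, bronze, ipho_2025_score).
    """
    return sorted(results, key=lambda name: tuple(-v for v in results[name]))
```

For example, with hypothetical values, `rank_models({"a": (1, 2, 0, 20.0), "b": (2, 0, 0, 10.0), "c": (1, 2, 0, 25.0)})` places `b` first on gold count, then `c` ahead of `a` on the IPhO 2025 tie-break.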
Metric: accuracy (%). Tip: click any column header to sort.
① Hybrid Acc = accuracy under combined rule-based and model-based grading; Rule Acc = accuracy under rule-based grading only.
② Subjects = Mech. (Mechanics), Electromag. (Electromagnetism), Thermo. (Thermodynamics), Optics, Mod. (Modern Physics).
③ Difficulty Levels = 1: High School & below, 2: Olympiad-level, 3: Undergraduate (non-physics), 4: Undergraduate/Graduate (physics).
Legend: Closed-source reasoning model, Open-source reasoning model, Closed-source chat model, Open-source chat model
1. Models are sorted by Hybrid Acc ↓ by default.
2. You can filter by domain (Accuracy / Subject / Difficulty Level / Language) and select specific sub-categories.
3. Values are rounded to 1 decimal place, consistent with the PHYSICS dataset paper.
What is PhyArena?
PhyArena is an open leaderboard and evaluation suite to measure physics reasoning in LLMs and MLLMs.
It features recent Physics Olympiad exams and curated problem sets, offering a comprehensive evaluation.
We provide two complementary benchmarks. Click the links below to explore datasets, projects, and papers.
HiPhO — Physics Olympiad (multimodal)
HiPhO compiles the 13 latest physics Olympiads from 2024–2025 (IPhO, APhO, EuPhO, NBPhO, PanPhO, PanMechanics, F=MA). It spans both international and regional competitions and covers mixed modalities, with problems ranging from text-only to diagram-based.
PHYSICS spans high-school, Olympiad, undergraduate, and graduate levels, offering text-only problems that test scaling and generalization beyond Olympiad exams. It is curated from exercises in over 100 textbooks through a carefully designed quality-control pipeline.