PRL-Bench: LLMs on Frontier Physics Research

TL;DR

PRL-Bench evaluates LLMs on end-to-end physics research tasks drawn from Physical Review Letters (100 curated papers, published Aug 2025 onward). Every frontier model scores below 50 overall. The benchmark spans five subfields (astrophysics, condensed matter physics, high-energy physics, quantum information, statistical physics) and tests not just knowledge but the procedural complexity of actual research: exploration-oriented formulation, long-horizon workflows, and verifiable outcomes.

Key Findings

What's being measured: Real research tasks extracted from PRL papers. The tasks replicate the core properties of authentic research: rather than asking "what is the formula for X", they ask "given this setup, derive the result using standard techniques for this subfield." The target is the end-to-end workflow, not knowledge retrieval.

Result: Even the strongest frontier models score below 50 overall. The domain knowledge gap in advanced theory is substantial, which suggests that scaling alone may not close it.

Five subfields: astrophysics, condensed matter physics, high-energy physics, quantum information, statistical physics. This breadth makes PRL-Bench a more realistic test of physics research capability than narrow benchmarks.

Validated by experts: Each task is validated by domain physicists. This matters because many LLM benchmarks contain systematic annotation errors that reward pattern matching over reasoning.

Connection to Scientific AI Benchmarks

PRL-Bench fits a pattern of increasingly hard scientific benchmarks designed to expose real capability gaps:

InfiniteScienceGym (04-16): procedurally generated scientific analysis (infinite supply)
DR3-Eval (04-18): deep research with static corpus sandbox
PRL-Bench (04-20): physics research from real PRL papers

The trend: benchmarks are getting harder to fake through memorization (PRL-Bench uses papers from Aug 2025 onward) and harder to game through prompt engineering (workflow tasks, not factual recall).
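
The date-based guard against memorization can be sketched as a simple filter. This is a minimal illustration, not PRL-Bench's actual curation code: the Paper record, the after_cutoff helper, the sample entries, and the exact cutoff day (1 Aug 2025) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical record; the benchmark's real metadata schema is not given here."""
    title: str
    subfield: str
    published: date


def after_cutoff(papers, cutoff):
    """Keep only papers published on or after the cutoff date."""
    return [p for p in papers if p.published >= cutoff]


# Made-up entries for illustration only, not real benchmark items.
papers = [
    Paper("Example A", "condensed matter physics", date(2025, 6, 12)),
    Paper("Example B", "quantum information", date(2025, 8, 20)),
    Paper("Example C", "astrophysics", date(2025, 11, 3)),
]

# Assumed cutoff matching the "Aug 2025" window described above.
CUTOFF = date(2025, 8, 1)
eligible = after_cutoff(papers, CUTOFF)
print([p.title for p in eligible])  # ['Example B', 'Example C']
```

A filter like this only protects against models whose training data ends before the cutoff. It says nothing about task quality, which is the job of the expert validation step.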

Relations to Prior Wiki Pages

InfiniteScienceGym (04-16): InfiniteScienceGym creates synthetic scientific tasks; PRL-Bench uses real ones. The difference: real research tasks carry implicit knowledge (conventions, notation, method selection) that synthetic tasks may miss.
DR3-Eval (04-18): DR3-Eval is a deep research benchmark (multi-hop information gathering); PRL-Bench tests research execution (given what is known, derive the result). The two measure different capabilities.

Raw Source

→ raw/huggingface/2026-04-20-prl-bench-a-comprehensive-benchmark-evaluating-llms-capabili.md