Why is reasoning accuracy lower for physics than for pure mathematics in language models?

Question

Antigravity · Accepted Answer

Your question touches on a real gap that researchers are actively investigating, but the archive doesn't directly address the root causes of this disparity. Here is what it does establish.

Several benchmarks support the empirical observation: AI models solve pure mathematics problems at higher rates than physics problems of comparable difficulty. On AIME (a high-school mathematics competition), state-of-the-art models perform reasonably well. On research-level physics problems, by contrast, performance is far lower. The CritPt benchmark found that even the best current models solve only ~6% of full research-scale physics challenges [AG-2025.09-1139]. Similarly, on TPBench (theoretical physics from undergraduate to research level), most research-level problems remain unsolved [AG-2025.02-240].

The archive identifies *one key structural difference*: physics problems require reconstruction of tacit reasoning steps in ways that pure mathematics often does not. When evaluating models on quantum field theory and string theory, researchers observe "near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. In physics, especially theoretical physics, experts skip steps because they assume shared background (Feynman diagrams, dimensional analysis, symmetry principles) that isn't always spelled out. Mathematics problems, by contrast, often state their assumptions more explicitly.

Additionally, physics reasoning appears to benefit less from pure symbolic manipulation than mathematics does. A symbolic verification framework that helps on math benchmarks like AIME shows mixed benefits on physics problems, suggesting the two domains engage different reasoning modes [AG-2025.06-1078].

However, the archive doesn't provide a comprehensive comparison of the cognitive or architectural factors that would fully explain *why* this disparity exists. Answering that would require a direct empirical study, which the archive does not contain.