Antigravity Q&A
Why is reasoning accuracy lower for physics than for pure mathematics in language models?
Answered by claude-haiku-4-5 · 3 May 2026
Answer
Your question touches on a real effect that researchers are actively investigating, but the archive doesn't directly address the root causes of this disparity. Here is what it does show.
Several benchmarks confirm the empirical observation: AI models solve pure mathematics problems at higher rates than physics problems of comparable difficulty. On AIME (a high-school mathematics competition), state-of-the-art models achieve reasonable performance. On research-level physics problems, by contrast, results are far weaker. The CritPt benchmark found that even the best current models solve only ~6% of full research-scale physics challenges [AG-2025.09-1139]. Similarly, on TPBench (theoretical physics from undergraduate to research level), most research-level problems remain unsolved [AG-2025.02-240].
The archive identifies *one key structural difference*: physics problems require reconstruction of tacit reasoning steps in ways that pure mathematics often does not. When evaluating models on quantum field theory and string theory, researchers observe "near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. In physics, especially theoretical physics, experts skip steps because they assume shared background (Feynman diagrams, dimensional analysis, symmetry principles) that isn't always spelled out. Mathematics problems, by contrast, often state their assumptions more explicitly.
Additionally, physics reasoning appears to benefit less from pure symbolic manipulation than mathematics does. A symbolic verification framework that helps on math benchmarks such as AIME shows mixed benefits on physics problems, which suggests the two domains engage different reasoning modes [AG-2025.06-1078].
However, the archive doesn't provide a comprehensive comparison of the cognitive or architectural factors that would fully explain *why* this disparity exists — that would require a direct empirical study we don't have here.
Sources · 8
- 68% · physics.comp-ph · Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs · AG-2026.04-565
- 64% · cs.AI · Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark · AG-2025.09-1139
- 63% · cs.LG · Fine-Tuning Small Reasoning Models for Quantum Field Theory · AG-2026.04-892
- 60% · cs.LG · Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset · AG-2025.06-1078
- 60% · cs.LG · Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics · AG-2025.02-240
- 58% · physics.data-an · Towards a Large Physics Benchmark · AG-2025.07-1634
- 58% · physics.data-an · Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models · AG-2025.01-1147
- 57% · cs.LG · Robust Reasoning as a Symmetry-Protected Topological Phase · AG-2026.01-735
Keep exploring
- How much does explicit step-by-step formatting in problem statements improve physics accuracy?
- Do physics problems require fundamentally different training data than mathematics to close this gap?
- Would models trained on physics derivations with every step spelled out perform as well on physics as they do on mathematics?
This is a research aid — not a peer review. Verify sources before citing.