Antigravity Q&A
What specific physics domains show the largest gaps in LLM reasoning?
Answered by claude-haiku-4-5 · 29 May 2026
Answer
The preprints you've provided show clear gaps in LLM reasoning across theoretical physics. Note that they measure where AI capabilities fall short, not open problems *within* physics itself.
The clearest pattern emerges from research-level problem solving. Current LLMs solve only about 6% of full research-scale physics challenges across 12 fields [AG-2025.09-1139]. When broken down by problem type, the failures cluster in specific reasoning patterns:
Tacit and structural reasoning fails hardest. In quantum field theory and string theory specifically, LLMs show "near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. Translation: models excel at step-by-step textbook problems but falter when they must infer what a physicist left unsaid or hold multiple conceptual reorganizations in mind simultaneously.
Research-level problems in high-energy theory resist current approaches. Evaluations on the Theoretical Physics Benchmark (TPBench) found that "research-level difficulty problems are mostly unsolved" even by state-of-the-art models like o1 and o3-mini, though these models perform impressively on undergraduate and early-graduate material [AG-2025.02-240].
Mathematical derivation and complex problem-solving are bottlenecks across domains. The benchmark data distinguishes three problem categories: conceptual multiple-choice (where models do better), analytical derivations, and open-ended tasks. Performance degrades sharply in the last two [AG-2025.07-1634].
Importantly, test-time scaling techniques (like having models verify their own steps symbolically) can improve performance substantially [AG-2025.06-1078], suggesting the issue is not fundamental incapacity but rather brittleness in extended reasoning chains.
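To make the idea of verification at test time concrete, here is a minimal sketch of "sample several candidates, keep the first one that passes an independent check". It is an illustration only and not the method of [AG-2025.06-1078]: the task (finding an antiderivative), the function names `passes_check` and `select_verified`, and the hand-written candidate list are all assumptions, and a numerical spot check stands in for the symbolic verification the answer mentions.

```python
import math
from typing import Callable, Optional, Sequence

Func = Callable[[float], float]


def passes_check(candidate: Func, integrand: Func, points: Sequence[float],
                 tol: float = 1e-4, h: float = 1e-5) -> bool:
    """Return True if d/dx candidate(x) matches integrand(x) at every test point.

    Uses a central finite difference, so this is a numerical stand-in for a
    symbolic check, not a proof of correctness.
    """
    for x in points:
        derivative = (candidate(x + h) - candidate(x - h)) / (2 * h)
        if abs(derivative - integrand(x)) > tol:
            return False
    return True


def select_verified(candidates: Sequence[Func], integrand: Func,
                    points: Sequence[float]) -> Optional[int]:
    """Return the index of the first candidate that passes the check, else None."""
    for index, candidate in enumerate(candidates):
        if passes_check(candidate, integrand, points):
            return index
    return None


# Hypothetical task: find an antiderivative of x * cos(x).
integrand = lambda x: x * math.cos(x)

# Hypothetical model outputs; in practice these would be sampled answers.
candidates = [
    lambda x: x * math.sin(x),                # derivative has an extra sin(x)
    lambda x: x * math.sin(x) - math.cos(x),  # sign error on cos(x)
    lambda x: x * math.sin(x) + math.cos(x),  # derivative is x * cos(x)
]

points = [0.3, 1.1, 2.7]
chosen = select_verified(candidates, integrand, points)
print(chosen)
```

The point of the design is that the check is independent of whatever produced the candidate: a wrong step in a long derivation is caught by the verifier instead of propagating to the final answer.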
Sources · 8
- 69% · physics.comp-ph · Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs · AG-2026.04-565
- 65% · cs.AI · Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark · AG-2025.09-1139
- 64% · cs.LG · Fine-Tuning Small Reasoning Models for Quantum Field Theory · AG-2026.04-892
- 63% · cs.LG · Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics · AG-2025.02-240
- 62% · physics.data-an · Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models · AG-2025.01-1147
- 59% · physics.data-an · Towards a Large Physics Benchmark · AG-2025.07-1634
- 59% · cs.LG · Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset · AG-2025.06-1078
- 58% · physics.soc-ph · Why we care (about quantum machine learning) · AG-2024.01-1602
This is a research aid — not a peer review. Verify sources before citing.