Antigravity Q&A
Why do LLMs trained on public data systematically fail at certain physics reasoning steps?
Answered by claude-haiku-4-5 · 2 May 2026
Answer
Your question is about LLM training data and reasoning, not frontier physics itself. This archive covers frontier science; try asking about physics, mathematics, or propulsion.
That said, the preprints here do touch on *why* LLMs fail at physics reasoning, so I can briefly address that:
LLMs trained on public data fail at certain physics reasoning steps because those steps are often tacit and omitted from written sources [AG-2026.04-565]. Physics papers and textbooks skip intermediate reasoning that experts take for granted: they do not spell out every conceptual connection or global consistency constraint. A model trained only on the surface text has little opportunity to learn the logical scaffolding underneath.
More specifically, models show "near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. In other words, they can follow a written-out chain of algebra, but they stumble when asked to fill in the gaps or reorganize their understanding across a problem's entire structure, which is the kind of work physicists do implicitly.
When models are fine-tuned on *physics-specific* synthetic and curated data, their performance improves [AG-2026.04-892]. This suggests that the gap is not fundamental to language models but stems from how sparse and implicit public training material is in advanced physics.
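As a rough illustration of the explicit-versus-tacit comparison described above, the sketch below computes accuracy separately for the two task types and reports the gap between them. The `records` list, the `"explicit"` and `"tacit"` labels, and the `accuracy_by_type` helper are all hypothetical names with placeholder values; none of this is taken from [AG-2026.04-565] or reflects its actual grading scheme.

```python
from collections import defaultdict

# Hypothetical graded records: (task_type, is_correct).
# Placeholder values for illustration only, NOT results from any cited preprint.
records = [
    ("explicit", True),
    ("explicit", True),
    ("explicit", True),
    ("explicit", False),
    ("tacit", True),
    ("tacit", False),
    ("tacit", False),
    ("tacit", False),
]


def accuracy_by_type(records):
    """Return {task_type: fraction of records graded correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task_type, is_correct in records:
        totals[task_type] += 1
        correct[task_type] += int(is_correct)
    return {task_type: correct[task_type] / totals[task_type] for task_type in totals}


scores = accuracy_by_type(records)

# A positive gap means the model does better on written-out derivations
# than on tasks that require reconstructing omitted steps.
gap = scores["explicit"] - scores["tacit"]
print(scores, gap)
```

Splitting the score by task type, rather than reporting one overall accuracy, is what would make the degradation visible: a single averaged number could hide a large difference between the two categories.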
Sources · 8
- 72% · physics.comp-ph · Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs · AG-2026.04-565
- 68% · cs.LG · Fine-Tuning Small Reasoning Models for Quantum Field Theory · AG-2026.04-892
- 66% · cs.AI · Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark · AG-2025.09-1139
- 64% · physics.data-an · Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models · AG-2025.01-1147
- 63% · cs.LG · Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics · AG-2025.02-240
- 63% · cs.LG · Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset · AG-2025.06-1078
- 60% · hep-ph · QCD in Language Models: What do they really know about QCD? · AG-2025.11-1583
- 60% · astro-ph.CO · Large Language Models -- the Future of Fundamental Physics? · AG-2025.06-1375
Keep exploring
- How do synthetic physics datasets differ from textbooks in explicitly encoding those hidden reasoning steps?
- Why is reorganizing representations under global constraints harder for models than following written derivations?
- Could intermediate reasoning be recovered by training on annotated expert walkthroughs rather than raw papers?
This is a research aid — not a peer review. Verify sources before citing.