Why do LLMs trained on public data systematically fail at certain physics reasoning steps?

Question

Antigravity · Accepted Answer

Your question concerns the limitations of general-purpose language models in physics, which is not directly within this archive's focus on frontier propulsion, relativity, and mathematical physics.

That said, the preprints here do shed light on *why* LLMs struggle with physics reasoning:

LLMs trained on public text fail at certain physics steps primarily because they cannot reconstruct tacit reasoning [AG-2026.04-565]. Expert papers routinely omit intermediate steps in a derivation, and general models cannot reliably fill those gaps. More broadly, they struggle when a task requires reorganizing concepts under global consistency constraints [AG-2026.04-565], such as respecting symmetries or conservation laws across a multi-step argument.

The gap is severe at research level: even state-of-the-art models solve only about 6% of full research-level physics challenges [AG-2025.09-1139], and most fail on problems beyond undergraduate difficulty [AG-2025.02-240]. A key failure mode is the inability to verify symbolic correctness step by step, a capability that is critical in theoretical physics [AG-2025.06-1078].

One remedy emerging from this work is domain-specific fine-tuning: when models are trained on synthetic and curated physics problems in a narrow field such as quantum field theory, their reasoning chains improve substantially [AG-2026.04-892]. Similarly, symbolic weak-verifier frameworks, tools that check each step against physics rules, significantly boost performance on research problems [AG-2025.06-1078].
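To make the idea of checking each step concrete, here is a minimal sketch of a step-by-step checker. It is not the framework described in [AG-2025.06-1078]; it is a numerical stand-in, assuming each derivation step can be written as a Python function of the same variables. The function name `first_bad_step`, the variable names, and the projectile example are all hypothetical choices for illustration. The check is "weak" in the sense that agreement at random sample points cannot prove two expressions are equal; it can only catch a step that is wrong.

```python
import math
import random

VARIABLES = ("m", "g", "v0", "t")  # hypothetical symbol names for this example


def first_bad_step(steps, variables, trials=20, rel_tol=1e-9, seed=0):
    """Return the index of the first step that disagrees with the step
    before it at some sampled point, or None if every step agrees."""
    rng = random.Random(seed)
    # Sample positive values so every expression below is well defined.
    points = [
        {name: rng.uniform(0.5, 2.0) for name in variables}
        for _ in range(trials)
    ]
    for i in range(1, len(steps)):
        for point in points:
            previous = steps[i - 1](**point)
            current = steps[i](**point)
            if not math.isclose(previous, current, rel_tol=rel_tol):
                return i
    return None


# Total energy of a projectile, which should reduce to the initial
# kinetic energy (an illustrative conservation-law derivation).
correct_chain = [
    lambda m, g, v0, t: 0.5 * m * (v0 - g * t) ** 2
    + m * g * (v0 * t - 0.5 * g * t**2),
    lambda m, g, v0, t: 0.5 * m * (v0**2 - 2 * v0 * g * t + g**2 * t**2)
    + m * g * v0 * t
    - 0.5 * m * g**2 * t**2,
    lambda m, g, v0, t: 0.5 * m * v0**2,
]

# Same derivation with a deliberate sign error in the expanded square.
faulty_chain = list(correct_chain)
faulty_chain[1] = (
    lambda m, g, v0, t: 0.5 * m * (v0**2 + 2 * v0 * g * t + g**2 * t**2)
    + m * g * v0 * t
    - 0.5 * m * g**2 * t**2
)

print(first_bad_step(correct_chain, VARIABLES))  # None
print(first_bad_step(faulty_chain, VARIABLES))   # 1
```

Comparing each step only with its predecessor localizes the error to a single line of the derivation, which is the property a generator needs in order to retry just the faulty step. A real symbolic verifier would replace the random sampling with algebraic simplification.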

In short: public-data models lack the implicit structural knowledge that working physicists internalize, and that gap is fixable through targeted training and verification tools rather than scaling alone.