Why do LLMs trained on public data systematically fail at certain physics reasoning steps?

Question

Antigravity · Accepted Answer

Your question is about machine learning rather than frontier physics, but the archive contains relevant research, so I can answer it directly.

LLMs trained on public data fail at certain physics reasoning steps primarily because they struggle with tacit, omitted steps and global consistency constraints that aren't explicitly written out in their training data.

When researchers curated a compact dataset of 12 quantum field theory and string theory problems with a fine-grained grading rubric, they found that models performed near-ceiling on explicit derivations within stable conceptual frames, but showed "systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. In plain terms: when a text skips steps (as textbooks routinely do for brevity), or when a solution requires reorganizing the whole representation of the problem to satisfy some global rule, the model's accuracy drops.

This gap exists because public physics texts are written for human readers who fill in gaps from intuition and experience. An LLM sees only the surface tokens, not the underlying structure.

The most direct fix is domain-specific fine-tuning on synthetic and curated data. A study on 7-billion-parameter models found that generating over 2,500 synthetic quantum field theory problems and fine-tuning on them (via both supervised and reinforcement learning) improved reasoning chains and reduced error modes [AG-2026.04-892]. The key insight: you need training data that explicitly scaffolds the omitted reasoning steps.
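To make "data that explicitly scaffolds the omitted reasoning steps" concrete, here is a minimal sketch of a synthetic-data generator. It is not the pipeline from the cited study: the toy problem (the mass-shell relation with c = 1), the record fields (`prompt`, `steps`, `answer`), and the function names are all assumptions chosen for illustration. The point is only that every intermediate step a textbook would skip is written into the training record.

```python
import json
import random

# Pythagorean triples keep the toy arithmetic exact: E^2 = p^2 + m^2 with c = 1.
# Each tuple is (m, p, E). The problem family itself is a stand-in, not real QFT.
TRIPLES = [(3, 4, 5), (5, 12, 13), (8, 15, 17), (7, 24, 25)]


def make_example(rng):
    """Build one hypothetical training record with every step written out."""
    m, p, e = rng.choice(TRIPLES)
    k = rng.randint(1, 9)  # scale factor so problems are not all identical
    m, p, e = k * m, k * p, k * e
    steps = [
        "Start from the mass-shell relation E^2 = p^2 + m^2 (units with c = 1).",
        f"Square the momentum: p^2 = {p}^2 = {p * p}.",
        f"Square the mass: m^2 = {m}^2 = {m * m}.",
        f"Add them: E^2 = {p * p} + {m * m} = {p * p + m * m}.",
        f"Take the positive root: E = {e}.",
    ]
    prompt = f"A particle has mass m = {m} and momentum p = {p} (c = 1). Find its energy E."
    return {"prompt": prompt, "steps": steps, "answer": e, "m": m, "p": p}


def make_dataset(n, seed=0):
    """Generate n records reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


def to_jsonl(records):
    """Serialize records one JSON object per line, a common fine-tuning input format."""
    return "\n".join(json.dumps(r) for r in records)
```

Calling `to_jsonl(make_dataset(2500))` would yield a file-sized string of scaffolded examples. A real pipeline would need far harder problems and an independent check on each solution, which this sketch gets for free only because the triples are exact.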

Larger, newer reasoning models that use test-time compute (like o1 and o3-mini) have also made headway on theoretical physics problems, though "research-level difficulty problems are mostly unsolved" [AG-2025.02-240]. Test-time scaling with symbolic verification, which checks intermediate symbolic steps rather than just the final answer, outperforms naive brute-force retrying [AG-2025.06-1078].
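The difference between checking intermediate steps and only comparing final answers can be sketched in a few lines. This is an illustration under stated assumptions, not the method of the cited work: each sampled solution is assumed to be a list of expressions in `x` that are claimed equal, and a numerical spot check at random points stands in for a real symbolic engine. All names (`steps_agree`, `chain_is_valid`, `verified_pick`, `EXAMPLE_CHAINS`) are hypothetical.

```python
import random
from collections import Counter


def evaluate(expr, x):
    """Evaluate an arithmetic expression in x.

    Uses eval on trusted toy strings only; never do this with untrusted input.
    """
    return eval(expr, {"__builtins__": {}}, {"x": x})


def steps_agree(lhs, rhs, trials=20, tol=1e-9, seed=0):
    """Spot-check that two expressions match at random points (a stand-in for a CAS)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-3, 3)
        a, b = evaluate(lhs, x), evaluate(rhs, x)
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            return False
    return True


def chain_is_valid(chain):
    """A chain passes only if every consecutive pair of steps agrees."""
    return all(steps_agree(a, b) for a, b in zip(chain, chain[1:]))


def majority_vote(chains):
    """Naive baseline: pick the most common final answer, ignoring the steps."""
    return Counter(c[-1] for c in chains).most_common(1)[0][0]


def verified_pick(chains):
    """Vote only among chains whose intermediate steps all check out."""
    valid = [c for c in chains if chain_is_valid(c)]
    return majority_vote(valid) if valid else None


# Three sampled "solutions" to simplifying (x + 1)**2 - (x - 1)**2.
# Two make a sign error in the second step and agree on a wrong answer.
EXAMPLE_CHAINS = [
    ["(x + 1)**2 - (x - 1)**2", "(x**2 + 2*x + 1) - (x**2 - 2*x + 1)", "4*x"],
    ["(x + 1)**2 - (x - 1)**2", "(x**2 + 2*x + 1) - (x**2 + 2*x + 1)", "0"],
    ["(x + 1)**2 - (x - 1)**2", "x**2 + 1 - (x**2 + 1)", "0"],
]
```

On `EXAMPLE_CHAINS`, `majority_vote` returns `"0"` because two of the three samples share the same mistake, while `verified_pick` returns `"4*x"` because only the first chain survives the step check. Retrying more often would not fix the baseline if the error is systematic; filtering on the steps does.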

In short: public data alone isn't enough because it hides the structure. You need either (1) fine-tuning on data that reveals hidden steps, or (2) models trained to reason step-by-step with verification at test time.