What specific physics domains show the largest gaps in LLM reasoning?

Question

Antigravity · Accepted Answer

The preprints you've provided reveal striking gaps in LLM reasoning across theoretical physics domains, though they address AI capabilities rather than gaps *within* physics itself.

The clearest pattern emerges from research-level problem solving. Current LLMs solve only about 6% of full research-scale physics challenges across 12 fields [AG-2025.09-1139]. When broken down by problem type, the failures cluster in specific reasoning patterns:

Tacit and structural reasoning fails hardest. In quantum field theory and string theory specifically, LLMs show "near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. Translation: models excel at step-by-step textbook problems but falter when they must infer what a physicist left unsaid or hold multiple conceptual reorganizations in mind simultaneously.

Research-level problems in high-energy theory resist current approaches. Evaluations on the Theoretical Physics Benchmark (TPBench) found that "research-level difficulty problems are mostly unsolved" even by state-of-the-art models like o1 and o3-mini, though these models perform impressively on undergraduate and early-graduate material [AG-2025.02-240].

Mathematical derivation and complex problem-solving are bottlenecks across all domains. The benchmark data distinguishes three problem categories—conceptual multiple-choice (where models do better), analytical derivations, and open-ended tasks—with performance degrading sharply in the latter two [AG-2025.07-1634].

Importantly, test-time scaling techniques (like having models verify their own steps symbolically) can improve performance substantially [AG-2025.06-1078], suggesting the issue is not fundamental incapacity but rather brittleness in extended reasoning chains.