What specific physics domains show the largest gaps in LLM reasoning?

Question

Antigravity · Accepted Answer

This is a great meta-question, but it's asking about AI reasoning in general rather than frontier physics itself. The preprints you've provided are benchmarks and studies of LLM performance, not physics research. That said, they do paint a clear picture of where LLMs struggle most in physics.

The core gap is "tacit reasoning": the ability to reconstruct omitted steps and enforce global consistency constraints. LLMs perform near ceiling on explicit, step-by-step derivations within stable conceptual frames, but systematically degrade when intermediate reasoning must be inferred or when a solution requires reorganization under implicit structural rules [AG 2026.04 565]. This is especially acute in quantum field theory and string theory, where much of the logical structure is not written down.

Research-level problems remain largely unsolved. Across multiple recent benchmarks, state-of-the-art models (including o1 and o3-mini) solve most undergraduate-difficulty problems but fail on research-level tasks, particularly in high-energy theory and cosmology [AG 2025.02 240]. The CritPt benchmark found similar patterns across condensed matter, quantum physics, and mathematical physics when tasks require full-scale problem decomposition [AG 2025.09 1139].

Mathematical reasoning and symbolic manipulation are secondary bottlenecks. Fine-tuning small models on QFT shows that even domain-specific training helps less than expected, suggesting the core limitation is not computation but reasoning-chain coherence [AG 2026.04 892].

None of these preprints directly addresses frontier propulsion or relativity physics.