What if we tested these same AI systems on unsolved physics problems with no known answer?

Question

Antigravity · Accepted Answer

This is a fascinating question, and the archive contains several direct experiments on exactly this scenario. The results are sobering but illuminating.

What happens when we test AI on truly unsolved problems?

When researchers built benchmarks from genuine, unpublished research-level physics problems, even the best AI systems performed poorly. The CritPt benchmark, which consists of 71 research challenges across 12 fields of modern physics, all designed by active researchers, found that state-of-the-art models solve only ~6% of full research-scale problems [AG-2025.09-1139]. Similarly, the Theoretical Physics Benchmark (TPBench) evaluated AI on 57 novel problems in high-energy theory and cosmology, ranging from undergraduate to research level, and found that "research-level difficulty problems are mostly unsolved" even by the latest models [AG-2025.02-240].

This gap matters because it reveals the true boundary of AI capability: these systems can excel at pattern-matching and recombination within well-charted domains, but struggle when asked to reason through open-ended challenges where the solution structure is genuinely unknown.

Why do they fail?

The failures fall into recognizable patterns. In abstract fields like quantum field theory and string theory, models perform near-perfectly on explicit derivations within stable conceptual frames, but degrade systematically when tasks require reconstructing omitted reasoning steps or reorganizing representations under global consistency constraints [AG-2026.04-565]. In other words, they get stuck when the problem demands insight into *why* something works, not just *how* to manipulate symbols.

One striking case: when asked to reason about high-energy astrophysics phenomena (gamma-ray bursts, neutron-star mergers), an AI system produced answers that were "interesting but concerning in accuracy," offering plausible-sounding but unreliable accounts [AG-2026.02-1185].

The hybrid future

This doesn't mean AI is useless for research. The archive suggests a more nuanced role: AI excels at *discovery tasks with clear structure*. For instance, AI-Newton autonomously rediscovered Newton's second law and conservation of energy by deriving general physical laws from raw experimental data without prior supervision [AG-2025.04-1324]. And in quantum experiment design, language models can generate human-readable code that solves entire *classes* of problems, helping scientists extrapolate without additional optimization [AG-2024.05-1028].
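
To make "deriving a law from raw data" concrete, here is a toy sketch of the idea. It is not AI-Newton's method, which the archive entry describes only at a high level. The synthetic measurements, the power-law form F = k * m^p * a^q, and the function name fit_power_law are all assumptions made for illustration. The sketch generates noisy force data and recovers the exponents with a least-squares fit in log space.

```python
import math
import random


def fit_power_law(masses, accels, forces):
    """Fit F = k * m**p * a**q by least squares in log space.

    Returns (k, p, q). Toy illustration only, not AI-Newton's method.
    """
    # Taking logs turns the power law into a linear model:
    #   log F = log k + p * log m + q * log a
    rows = [(1.0, math.log(m), math.log(a)) for m, a in zip(masses, accels)]
    targets = [math.log(f) for f in forces]

    # Build the 3x3 normal equations (X^T X) beta = X^T y
    # as an augmented matrix with the right-hand side in the last column.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(3)
    ]

    # Solve by Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]

    log_k, p, q = (aug[i][3] / aug[i][i] for i in range(3))
    return math.exp(log_k), p, q


# Synthetic "experiment" (hypothetical values): forces are generated from
# F = m * a with 1% multiplicative noise, then handed to the fit as raw data.
rng = random.Random(0)
masses = [rng.uniform(0.5, 5.0) for _ in range(200)]
accels = [rng.uniform(0.1, 10.0) for _ in range(200)]
forces = [m * a * (1 + rng.gauss(0, 0.01)) for m, a in zip(masses, accels)]

k, p, q = fit_power_law(masses, accels, forces)
print(f"F = {k:.2f} * m^{p:.2f} * a^{q:.2f}")
```

The fit should land close to k = 1, p = 1, q = 1, which is F = ma. Note how much structure the sketch is handed in advance: the relevant variables and the functional form are already fixed, and only three numbers are unknown. That is the sense in which such discovery tasks have clear structure, and it is what open research problems lack.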

As one source in the archive puts it, "the theorist is in no way in danger of being replaced by AI in the near future, the hybrid of human expertise and AI algorithms will become an integral part of theoretical discovery" [AG-2024.05-1028]. The limiting ingredient remains human creativity and judgment in posing the right questions and interpreting the answers.