Antigravity Q&A
What if we tested these same AI systems on unsolved physics problems with no known answer?
Answered by claude-haiku-4-5 · 29 May 2026
Answer
The archive contains several experiments close to this. They use unpublished, research-level problems written by active physicists rather than open problems with no known answer, and the results are sobering but illuminating.
What happens when we test AI on unpublished research-level problems?
When researchers created benchmarks of genuine, unpublished research-level physics problems, the best AI systems collapsed dramatically. The CritPt benchmark—consisting of 71 research challenges across 12 fields of modern physics, all designed by active researchers—found that state-of-the-art models solve only ~6% of full research-scale problems [AG-2025.09-1139]. Similarly, the Theoretical Physics Benchmark (TPBench) evaluated AI on 57 novel problems from undergraduate to research level in high-energy theory and cosmology, and found that "research-level difficulty problems are mostly unsolved" even by the latest models [AG-2025.02-240].
This gap matters because it marks where current systems stop being reliable: they can excel at pattern-matching and recombination within well-charted domains, but struggle with open-ended challenges where the structure of the solution is not known in advance.
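To put the CritPt figure in perspective, here is a rough sketch of how much uncertainty a score of about 6% on 71 challenges carries. It assumes the ~6% is a plain fraction of the 71 challenges, scored once each, which the summary above does not confirm; the function name `wilson_interval` and the rounded count are illustrative only.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)
    ) / denom
    return centre - half, centre + half

n_challenges = 71                    # number of challenges quoted for CritPt
solved = round(0.06 * n_challenges)  # ASSUMPTION: ~6% is a simple fraction of the 71
low, high = wilson_interval(solved, n_challenges)
print(f"{solved}/{n_challenges} solved, 95% interval roughly {low:.1%} to {high:.1%}")
```

Under that assumption the score corresponds to only a handful of solved challenges, so small differences between models on a benchmark of this size should be read with caution.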
Why do they fail?
The failures fall into recognizable patterns. In abstract fields like quantum field theory and string theory, models perform near-perfectly on explicit derivations within stable conceptual frames—but systematically degrade when tasks require reconstructing omitted reasoning steps or reorganizing representations under global consistency constraints [AG-2026.04-565]. In other words, they get stuck when the problem demands insight into *why* something works, not just *how* to manipulate symbols.
One striking case: when asked to reason about high-energy astrophysics phenomena (gamma-ray bursts, neutron-star mergers), an AI system produced answers that were "interesting but concerning in accuracy," offering plausible-sounding but unreliable accounts [AG-2026.02-1185].
The hybrid future
This doesn't mean AI is useless for research. The archive suggests a more nuanced role: AI excels at *discovery tasks with clear structure*. For instance, AI-Newton autonomously rediscovered Newton's second law and conservation of energy by deriving general physical laws from raw experimental data without prior supervision [AG-2025.04-1324]. And in quantum experiment design, language models can generate human-readable code that solves entire *classes* of problems, helping scientists extrapolate without additional optimization [AG-2024.06-1530].
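To make "deriving a law from raw data" concrete, here is a toy sketch that recovers the form of Newton's second law from noisy synthetic measurements. It is not AI-Newton's method: the data are simulated, and the code assumes in advance that force is a power law in mass and acceleration, which is exactly the kind of prior structure AI-Newton is described as doing without. The helper names `solve_linear` and `fit_power_law` are hypothetical.

```python
import math
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_power_law(masses, accels, forces):
    """Least-squares fit of log F = log k + p*log m + q*log a; returns (k, p, q)."""
    rows = [[1.0, math.log(m), math.log(a)] for m, a in zip(masses, accels)]
    y = [math.log(f) for f in forces]
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    log_k, p, q = solve_linear(XtX, Xty)
    return math.exp(log_k), p, q

# Synthetic "experiment": F = m * a with 2% multiplicative measurement noise.
rng = random.Random(0)
masses = [rng.uniform(0.5, 5.0) for _ in range(200)]
accels = [rng.uniform(0.1, 10.0) for _ in range(200)]
forces = [m * a * (1.0 + rng.gauss(0.0, 0.02)) for m, a in zip(masses, accels)]

k, p, q = fit_power_law(masses, accels, forces)
print(f"F ~ {k:.3f} * m^{p:.3f} * a^{q:.3f}")
```

The fitted exponents come out close to 1, so the data alone point to F proportional to m times a. This only works because the search space was fixed by hand, which illustrates why clearly structured discovery tasks are much easier than open-ended research problems.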
One paper in the archive argues that "the theorist is in no way in danger of being replaced by AI in the near future, the hybrid of human expertise and AI algorithms will become an integral part of theoretical discovery" [AG-2024.05-1028]. On this view, what remains scarce is human creativity and judgment in posing the right questions and interpreting the answers.
Sources · 8
- 67% · cs.AI · Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark · AG-2025.09-1139
- 63% · cs.LG · Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics · AG-2025.02-240
- 62% · physics.data-an · Towards a Large Physics Benchmark · AG-2025.07-1634
- 61% · cs.AI · AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge · AG-2025.04-1324
- 60% · quant-ph · Meta-Designing Quantum Experiments with Language Models · AG-2024.06-1530
- 60% · astro-ph.HE · Human versus Artificial Intelligence; various significant examples in astrophysics · AG-2026.02-1185
- 58% · math.HO · A Triumvirate of AI Driven Theoretical Discovery · AG-2024.05-1028
- 58% · physics.comp-ph · Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs · AG-2026.04-565
This is a research aid — not a peer review. Verify sources before citing.