Antigravity Q&A
Could these benchmarks predict how LLMs perform on entirely new physics domains?
Answered by claude-haiku-4-5 · 28 April 2026
Answer
# Do Physics Benchmarks Predict LLM Performance on New Domains?
The honest answer: probably not very well yet, and the papers you've provided highlight exactly why.
What the benchmarks show
The existing physics benchmarks (TPBench, CritPt, and the Large Physics Benchmark) measure LLM performance on carefully curated research-level problems [AG-2025.02-240, AG-2025.09-1139, AG-2025.07-1634]. They reveal something important: even state-of-the-art models like o1 perform poorly on most research-level problems, despite excelling at undergraduate material [AG-2025.02-240]. But these are *snapshots* of performance on *specific* problem sets, not predictors of future capability on unseen domains.
The generalization problem
Test-time scaling methods (like best-of-N sampling or chain-of-thought verification) transfer *somewhat* between math competition benchmarks (AIME) and physics problems, but the transfer is imperfect [AG-2025.06-1078]. More telling: when researchers fine-tuned models on Quantum Field Theory and then tested generalization to other physics domains, the gains didn't transfer cleanly [AG-2026.04-892]. A model trained to solve QFT problems still struggles with condensed matter or astrophysics reasoning at the research level.
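To make the best-of-N idea mentioned above concrete, here is a minimal sketch, not the method used in the cited paper. It assumes two hypothetical callables: `sample`, which stands in for one model call, and `score`, which stands in for a verifier or reward model. The toy sampler and scorer are placeholders for illustration only.

```python
from typing import Callable, List


def best_of_n(sample: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Draw n candidate answers and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample() for _ in range(n)]
    # max() returns the first candidate with the highest score, so ties
    # resolve to the earliest sample.
    return max(candidates, key=score)


# Toy stand-ins (assumptions): a "model" that replays fixed answers and a
# "verifier" that prefers answers closer to a known reference value.
_replay = iter(["3.10", "3.14", "2.72"])


def toy_sample() -> str:
    return next(_replay)


def toy_score(answer: str) -> float:
    return -abs(float(answer) - 3.14159)


chosen = best_of_n(toy_sample, toy_score, n=3)
```

The selected answer is only as good as the scorer. A verifier that ranks candidates well in one domain may rank them badly in another, which is one way to picture why such methods transfer imperfectly.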
Why prediction fails
The deeper issue is that "physics reasoning" isn't monolithic. Tacit reasoning (the implicit conceptual moves physicists make without writing them down) shows systematic failure modes that vary by subfield [AG-2026.04-565]. A benchmark in high-energy physics won't tell you how an LLM handles the different kinds of approximation reasoning needed in statistical mechanics or fluid dynamics [AG-2025.09-1139].
What might help
The path forward involves "living benchmarks," to which physicists continuously contribute new problems [AG-2025.07-1634], and the development of physics-specific foundation models rather than reliance on general LLMs [AG-2025.01-1147]. Only through iterative, domain-rich evaluation can we understand whether an LLM trained on one frontier truly transfers to another.
In sum: current benchmarks are excellent diagnostic tools that tell us where models fail, but they don't yet predict performance on truly novel physics domains. Making such predictions requires a more granular understanding of what kinds of reasoning each domain demands.
Sources · 8
- 69% · cs.LG · Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset · AG-2025.06-1078
- 68% · physics.data-an · Towards a Large Physics Benchmark · AG-2025.07-1634
- 66% · cs.LG · Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics · AG-2025.02-240
- 65% · physics.comp-ph · Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs · AG-2026.04-565
- 65% · cs.AI · Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark · AG-2025.09-1139
- 64% · physics.data-an · Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models · AG-2025.01-1147
- 64% · astro-ph.CO · Large Language Models -- the Future of Fundamental Physics? · AG-2025.06-1375
- 64% · cs.LG · Fine-Tuning Small Reasoning Models for Quantum Field Theory · AG-2026.04-892
Keep exploring
- Do tacit reasoning failures in high-energy physics predict similar failures in condensed matter systems?
- How much does fine-tuning on one physics domain actively harm transfer to structurally different fields?
- Could living benchmarks identify which LLM reasoning gaps are universal versus domain-specific?
This is a research aid — not a peer review. Verify sources before citing.