Could these benchmarks predict how LLMs perform on entirely new physics domains?

Question

Antigravity · Accepted Answer

Do Physics Benchmarks Predict LLM Performance on New Domains?

The honest answer: probably not very well yet, and the papers you've provided highlight exactly why.

What the benchmarks show

The existing physics benchmarks (TPBench, CritPt, and the Large Physics Benchmark) measure LLM performance on carefully curated research-level problems [AG 2025.02 240, AG 2025.09 1139, AG 2025.07 1634]. They reveal something important: even state-of-the-art models like o1 perform poorly on most research-difficulty problems, despite excelling at undergraduate material [AG 2025.02 240]. But these are snapshots of performance on specific problem sets, not predictors of future capability on unseen domains.

The generalization problem

Test-time scaling methods (like best-of-N sampling or chain-of-thought verification) transfer somewhat between math competition benchmarks (AIME) and physics problems, but the transfer is imperfect [AG 2025.06 1078]. More telling: when researchers fine-tuned models on Quantum Field Theory and then tested generalization to other physics domains, the gains didn't transfer cleanly [AG 2026.04 892]. A model trained to solve QFT problems still struggles with condensed matter or astrophysics reasoning at the research level.

Why prediction fails

The deeper issue is that "physics reasoning" isn't monolithic. Tacit reasoning, the implicit conceptual moves physicists make without writing them down, shows systematic failure modes that vary by subfield [AG 2026.04 565]. A benchmark in high-energy physics won't tell you how an LLM handles the different kind of approximation reasoning needed in statistical mechanics or fluid dynamics [AG 2025.09 1139].

What might help

The path forward involves "living benchmarks" where physicists continuously contribute new problems [AG 2025.07 1634], and the development of physics-specific foundation models rather than reliance on general LLMs [AG 2025.01 1147]. Only through iterative, domain-rich evaluation can we understand whether an LLM trained on one frontier truly transfers to another.

In sum: current benchmarks are excellent diagnostic tools that tell us where models fail, but they don't yet predict performance on truly novel physics domains. That requires a more granular understanding of what kinds of reasoning each domain demands.
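
For readers unfamiliar with the test-time scaling methods mentioned above, here is a minimal sketch of best-of-N sampling: draw N candidate answers, score each one, and keep the highest-scoring candidate. Everything named here is an assumption for illustration. `best_of_n`, `toy_sample`, and `toy_score` are hypothetical stand-ins, not code from any of the cited papers, and the toy "model" and "verifier" only exist so the sketch runs without an LLM.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    sample: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidate answers and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # In a real setup, `sample` would call a model and `score` a verifier.
    candidates: List[str] = [sample(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, score(best)


# Hypothetical stand-ins (assumptions, not a real model or verifier):
# the "model" guesses an integer, the "verifier" rewards closeness to 42.
def toy_sample(rng: random.Random) -> str:
    return str(rng.randint(0, 100))


def toy_score(answer: str) -> float:
    return -abs(int(answer) - 42)


if __name__ == "__main__":
    for n in (1, 4, 16):
        answer, s = best_of_n(toy_sample, toy_score, n)
        print(f"N={n}: best answer {answer}, score {s}")
```

With a fixed seed, a larger N can only match or improve the selected score, because the larger candidate pool contains the smaller one. The transfer question in the answer is about something else: a scorer tuned on one domain (say, AIME-style math) may rank candidates poorly on another, so the selection step helps less even when N is large.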