Could these benchmarks predict how LLMs perform on entirely new physics domains?

Question

Antigravity · Accepted Answer

Can Physics Benchmarks Predict LLM Performance on New Domains?

The short answer: not yet reliably, and the archive shows why this remains an open problem.

What the Benchmarks Reveal

Current physics benchmarks like TPBench [AG 2025.02 240], the Large Physics Benchmark [AG 2025.07 1634], and CritPt [AG 2025.09 1139] are excellent diagnostic tools: they measure what models can and cannot do, but they don't yet allow us to extrapolate confidently to unseen domains. Here's the pattern the research shows:

Narrow generalization works within related subdomains. Test-time scaling methods developed on mathematical reasoning (AIME) do transfer to theoretical physics problems [AG 2025.06 1078], suggesting that improvements in reasoning technique have some cross-domain robustness. When researchers fine-tuned models on Quantum Field Theory using synthetic training data, the models showed measurable generalization to "other physics domains" [AG 2026.04 892], though the paper doesn't specify how far that generalization extends or how performance degrades with distance from QFT.

But fundamental reasoning gaps don't transfer predictably. Models show "near ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG 2026.04 565]. This means a benchmark result on, say, high-energy physics tells you something about a model's explicit derivation ability, but not necessarily how it will handle the tacit, layered reasoning needed in a different field like condensed matter physics.

The Extrapolation Problem

The real blocker: physics domains differ in the type of reasoning they demand, not just in complexity. CritPt, which spans 12 fields from condensed matter to biophysics [AG 2025.09 1139], found that even state-of-the-art models solve only 6% of research-level problems. But that single number masks field-specific failures that say little about a new domain: a model might fail on a nonlinear dynamics problem for reasons completely different from why it fails on an astrophysics problem.

What Would Actually Predict Transfer

The archive hints at what's needed: the Large Physics Models roadmap [AG 2025.01 1147] calls for "integrated frameworks" combining "symbolic reasoning modules for mathematical manipulations" with domain-specific tools. This suggests that raw benchmark scores alone are less predictive than understanding the architectural and training choices that enable cross-domain transfer.

Bottom line: The provided preprints don't include a study that uses one benchmark to predict performance on a genuinely novel physics domain the models haven't seen during development or evaluation.
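
To make the point about a single aggregate score masking field-specific failures more concrete, here is a minimal sketch of a leave-one-field-out check. It is only an illustration: the field names and per-field accuracies are hypothetical numbers chosen for the example (they are not results from CritPt or any cited preprint), and `leave_one_field_out` is a helper invented here, not a method from the archive. The idea is to predict each field's score from the mean of the remaining fields and see how far off that prediction is.

```python
from statistics import mean

# HYPOTHETICAL per-field accuracies for one model. These numbers are
# made up for illustration and are NOT taken from any benchmark.
field_accuracy = {
    "condensed_matter": 0.10,
    "astrophysics": 0.02,
    "nonlinear_dynamics": 0.00,
    "biophysics": 0.12,
}


def leave_one_field_out(scores):
    """Predict each field's score as the mean of all other fields.

    Returns a dict mapping field name -> absolute prediction error.
    """
    errors = {}
    for held_out, actual in scores.items():
        others = [v for name, v in scores.items() if name != held_out]
        predicted = mean(others)
        errors[held_out] = abs(predicted - actual)
    return errors


# The single headline number a benchmark report might quote.
aggregate = mean(field_accuracy.values())

# How badly the other fields predict each held-out field.
errors = leave_one_field_out(field_accuracy)

print(f"aggregate accuracy: {aggregate:.3f}")
for field, err in errors.items():
    print(f"{field}: held-out prediction error {err:.3f}")
```

With these made-up inputs, the held-out errors are comparable in size to the aggregate score itself, which is the sense in which one headline number can say little about an unseen field. A real version of this check would need per-field results from an actual benchmark, and even then it would only test transfer among fields the benchmark already covers, not a genuinely new one.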