Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset

Zhiqi Gao; Tianyi Li; Yurii Kvasiuk; Sai Chaitanya Tadepalli; Maja Rudolph; Daniel J. H. Chung; Frederic Sala; Moritz Münchmeyer

doi:10.48550/arXiv.2506.20729

AG-2025.06-1078 · cs.LG · cross-listed: astro-ph.CO, cs.AI, hep-ph, hep-th

Authors

Zhiqi Gao
Tianyi Li
Yurii Kvasiuk
Sai Chaitanya Tadepalli
Maja Rudolph
Daniel J. H. Chung
Frederic Sala
Moritz Münchmeyer

Abstract

Large language models (LLMs) have shown strong capabilities in complex reasoning, and test-time scaling techniques can enhance their performance at comparatively low cost. Many of these methods have been developed and evaluated on mathematical reasoning benchmarks such as AIME. This paper investigates whether the lessons learned from these benchmarks generalize to the domain of advanced theoretical physics. We evaluate a range of common test-time scaling methods on the TPBench physics dataset and compare their effectiveness with results on AIME. To better leverage the structure of physics problems, we develop a novel symbolic weak-verifier framework to improve parallel scaling results. Our empirical results demonstrate that this method significantly outperforms existing test-time scaling approaches on TPBench. We also evaluate our method on AIME, confirming its effectiveness in solving advanced mathematical problems. Our findings highlight the power of step-wise symbolic verification for tackling complex scientific problems.
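To make the idea of parallel test-time scaling with a weak verifier concrete, the following is a minimal sketch and not the authors' implementation. It compares plain majority voting over sampled answers with voting restricted to candidates that a weak verifier accepts. The function names, the sample answers, and the toy verifier are all hypothetical; the verifier in the paper is symbolic and step-wise, whereas the stand-in here is a simple string check used only to show the control flow.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer, or None when no answers were sampled."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def verifier_filtered_vote(
    answers: Sequence[str],
    weak_verifier: Callable[[str], bool],
) -> Optional[str]:
    """Vote only among answers the verifier accepts.

    If the verifier rejects every candidate, fall back to a plain
    majority vote (an assumed design choice for this sketch).
    """
    accepted = [a for a in answers if weak_verifier(a)]
    return majority_vote(accepted if accepted else answers)


# Hypothetical final answers from five independent samples of one problem.
samples = ["2*x", "x**2", "2*x", "x**2", "2*x"]


def toy_verifier(answer: str) -> bool:
    """Toy stand-in for a verifier: accept only answers containing a power."""
    return "**" in answer


plain_choice = majority_vote(samples)
filtered_choice = verifier_filtered_vote(samples, toy_verifier)
print(plain_choice, filtered_choice)
```

In this toy case the plain vote selects the most common answer, while the filtered vote can select a minority answer if it is the only one the verifier accepts. A verifier that is only weakly correlated with correctness can therefore still change the outcome of parallel sampling.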

Submitted

25 June 2025

Version

v1

License

CC-BY-4.0

DOI

10.48550/arXiv.2506.20729
