AG-2026.02-1215 · hep-ex · cross-listed: cs.LG, hep-ph, physics.data-an
Neural Scaling Laws for Boosted Jet Tagging
Authors
- Matthias Vigl
- Nicole Hartman
- Michael Kagan
- Lukas Heinrich
Abstract
The success of Large Language Models (LLMs) has established that scaling compute, through joint increases in model capacity and dataset size, is the primary driver of performance in modern machine learning. While machine learning has long been an integral component of High Energy Physics (HEP) data analysis workflows, the compute used to train state-of-the-art HEP models remains orders of magnitude below that of industry foundation models. With scaling laws only beginning to be studied in the field, we investigate neural scaling laws for boosted jet classification using the public JetClass dataset. We derive compute-optimal scaling laws and identify an effective performance limit that can be consistently approached through increased compute. We study how data repetition, common in HEP where simulation is expensive, modifies the scaling, yielding a quantifiable effective dataset size gain. We then study how the scaling coefficients and asymptotic performance limits vary with the choice of input features and particle multiplicity, demonstrating that increased compute reliably drives performance toward an asymptotic limit, and that more expressive, lower-level features can raise the performance limit and improve results at fixed dataset size.
Submitted
17 February 2026
Version
v1
License
CC-BY-4.0
DOI
10.48550/arXiv.2602.15781
Summary
Researchers show that jet classification in particle physics can benefit from the same scaling strategy as large language models: jointly increasing model capacity, dataset size, and compute reliably improves performance, even under physics-specific constraints such as expensive simulation.
- As with LLMs, jet taggers follow predictable scaling laws: larger models and more data systematically improve performance, up to an asymptotic limit that can be approached with increased compute.
- Repeating training data (common in HEP, where simulation is expensive) modifies the scaling behavior, and the paper quantifies the resulting gain in effective dataset size, which is useful for planning compute and simulation budgets in real experiments.
- More expressive, lower-level input features can raise the performance limit and improve results at fixed dataset size; the scaling coefficients and the asymptotic limit also vary with the choice of input features and the particle multiplicity.
curious · generated by claude-haiku-4-5
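As a rough illustration of what "approaching an asymptotic limit with increased compute" can look like numerically, the sketch below fits a saturating power law, loss = l_inf + a * compute^(-alpha), to synthetic points. This functional form, the parameter values, and the function name `fit_saturating_power_law` are assumptions made for illustration only; they are not taken from the preprint, whose exact parameterization is not given in the abstract.

```python
import math


def fit_saturating_power_law(compute, loss, n_grid=2000):
    """Fit loss ~ l_inf + a * compute**(-alpha) by scanning candidate l_inf.

    For a fixed l_inf, log(loss - l_inf) is linear in log(compute), so a and
    alpha follow from ordinary least squares. The candidate l_inf with the
    smallest squared residual in log space is returned.
    """
    xs = [math.log(c) for c in compute]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    loss_min = min(loss)
    best = None
    for i in range(n_grid):
        l_inf = loss_min * i / n_grid  # candidates in [0, min(loss))
        ys = [math.log(l - l_inf) for l in loss]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, l_inf, math.exp(intercept), -slope)
    _, l_inf, a, alpha = best
    return l_inf, a, alpha


# Synthetic, noise-free points from made-up parameters (not paper results).
TRUE_L_INF, TRUE_A, TRUE_ALPHA = 0.5, 2.0, 0.3
compute = [10.0 ** k for k in range(7)]  # arbitrary compute units
loss = [TRUE_L_INF + TRUE_A * c ** (-TRUE_ALPHA) for c in compute]

l_inf_hat, a_hat, alpha_hat = fit_saturating_power_law(compute, loss)
print(f"l_inf={l_inf_hat:.4f}  a={a_hat:.4f}  alpha={alpha_hat:.4f}")
```

The grid scan over l_inf keeps the example dependency-free; with real, noisy measurements a nonlinear least-squares fit with uncertainty estimates would likely be a better choice.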