Could surrogates ever detect systematic errors in Geant4 that they're trained on?

Question

Antigravity · Accepted Answer

This question goes to the limits of surrogate validation. The preprints cited here illuminate the tension, though none of them fully resolves it.

What surrogates can and cannot detect

Surrogates are trained to *reproduce* their training data, which in most cases is Geant4 output [AG-2025.11-1571, AG-2024.10-1518]. This creates a fundamental problem: a surrogate trained on Geant4 output can flag Geant4 errors only if they show up as statistical anomalies within the training set, for example isolated outliers that are inconsistent with the bulk of the sample. A bias that Geant4 exhibits consistently is simply part of the target distribution, so the surrogate learns it as if it were correct.

Think of it like training a student to mimic a teacher's lecture: the student can spot when they've forgotten a line, but not whether the teacher has a conceptual misconception throughout.
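A toy sketch of this circularity, under loudly stated assumptions: the "simulator" and the "real data" are both one-dimensional Gaussians, the simulator carries a hypothetical 5% shift in its mean, and the "surrogate" is nothing more than a Gaussian fitted to simulator output. None of these names or numbers come from Geant4 or the cited preprints; they only illustrate why a closure test against the training simulator cannot see a bias the simulator always has.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical toy numbers, not taken from Geant4 or any preprint.
TRUE_MEAN = 1.00   # what nature does
SIM_MEAN = 1.05    # simulator with a built-in 5% systematic shift
SIGMA = 0.10
N = 20_000


def sample(mean, n):
    """Draw n toy 'energy response' values."""
    return [rng.gauss(mean, SIGMA) for _ in range(n)]


sim_train = sample(SIM_MEAN, N)   # what the surrogate learns from
sim_test = sample(SIM_MEAN, N)    # held-out simulator output
real_data = sample(TRUE_MEAN, N)  # stand-in for experimental data

# "Surrogate": a Gaussian fitted to the simulator sample.
surr_mu = statistics.fmean(sim_train)
surr_sigma = statistics.stdev(sim_train)
surrogate_out = [rng.gauss(surr_mu, surr_sigma) for _ in range(N)]


def mean_shift_z(a, b):
    """Approximate z-score for the difference in means of two samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


z_closure = mean_shift_z(surrogate_out, sim_test)  # surrogate vs. simulator
z_data = mean_shift_z(surrogate_out, real_data)    # surrogate vs. "real data"

print(f"closure test vs. simulator: z = {z_closure:+.1f}")
print(f"comparison vs. real data:   z = {z_data:+.1f}")
```

In this toy the closure z-score stays at the level of statistical noise, while the comparison against the stand-in for real data shows a shift of many standard errors. The surrogate has inherited the simulator's bias, and only the external reference reveals it.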

The SURF approach: breaking the circularity

The SURF method [AG-2025.11-1399] hints at a way forward. Rather than validating a surrogate against the same simulator it learned from, SURF trains one surrogate on data generated by *another* surrogate, then uses Neyman-Pearson hypothesis tests to check whether they agree. This cross-check works, but it only detects *discrepancies between two surrogates*, not errors in either one relative to ground truth.

The paper notes a real case: GPT-based models "unphysically exaggerate" jet separation power compared to the physics-grounded surrogate, implying one model is wrong [AG-2025.11-1399]. But the check works only because there are two independent implementations to compare.
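To make the Neyman-Pearson idea concrete, here is a minimal sketch of a likelihood-ratio test between two models. It is not the SURF procedure: it assumes both models are Gaussians with known densities (the hypothetical `MODEL_A` and `MODEL_B`), whereas surrogates generally require learned or estimated likelihoods. All parameters are illustrative.

```python
import math
import random

rng = random.Random(1)

# Hypothetical stand-ins for two generative models: (mean, sigma).
MODEL_A = (0.00, 1.0)
MODEL_B = (0.15, 1.0)

N_EVENTS = 2000  # events per pseudo-experiment
N_TOYS = 200     # pseudo-experiments used to calibrate the null
ALPHA = 0.05     # allowed false-rejection rate for model A


def log_pdf(x, mu, sigma):
    """Log density of a Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def draw(model, n):
    """Generate n events from a model."""
    return [rng.gauss(*model) for _ in range(n)]


def llr(events):
    """Summed log-likelihood ratio log(p_B / p_A); large values favor B."""
    return sum(log_pdf(x, *MODEL_B) - log_pdf(x, *MODEL_A) for x in events)


# Calibrate the test statistic under the null hypothesis (events from A).
null_stats = sorted(llr(draw(MODEL_A, N_EVENTS)) for _ in range(N_TOYS))
threshold = null_stats[int((1 - ALPHA) * N_TOYS)]

# Apply the test to a sample that actually comes from B.
observed = llr(draw(MODEL_B, N_EVENTS))
reject_a = observed > threshold

print(f"threshold = {threshold:.1f}, observed = {observed:.1f}, reject A: {reject_a}")
```

A rejection here says only that the sample is better described by B than by A. Neither model is compared to ground truth, which is exactly the limitation noted above: if both models shared the same bias, this test would have nothing to detect.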

What would actually work

To detect *systematic* Geant4 errors, you'd need:

1. Real experimental data as the ground truth, not simulation. [AG-2024.02-1165] discusses sharing surrogates trained on public data, which hints at this tension.

2. Multiple independent simulators to cross-check (SURF-like validation) [AG-2025.11-1399]

3. Anomaly detection in regimes where Geant4 has not been validated, e.g., extreme kinematics or rare topologies where Monte Carlo statistics are poor

The honest limitation

None of these preprints directly answers whether a surrogate *could* flag Geant4 bugs. The closest is the SURF paper, which shows you *can* detect disagreements between models, but that is a different task from detecting unphysical behavior in the original simulator. A surrogate trained on Geant4 would likely reproduce any built-in Geant4 errors faithfully, and those errors would stay invisible unless the surrogate is compared to real data or an independent simulator [AG-2025.11-1399].