Why can't adversarial training detect biases baked into the simulation's initialization or boundary conditions?

Question

Antigravity · Accepted Answer

This is a sharp question, but the provided preprints don't directly address it. Let me explain why, and what the archive does cover instead.

Your question assumes a distinction between:

- Biases in initialization or boundary conditions (baked into the problem setup before training starts)
- Learned model biases (acquired during training, which adversarial training might catch)

The preprints focus on adversarial *robustness* — how to make models resistant to input perturbations — rather than on detecting or correcting systematic biases inherited from simulation design.

That said, one paper hints at the underlying issue: [AG-2024.11-1226] notes that supervised models trained on Monte Carlo simulations "may depend excessively on artifacts and approximations from Monte Carlo simulations, potentially limiting their ability to generalize well to real data." This is close to your concern. Adversarial training acts on local-minima sharpness and generalization within the simulated distribution, but it does not check whether the *simulator itself* carries a hidden bias in how it initializes or bounds the problem space.
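To make the limitation concrete, here is the usual min-max form of adversarial training. The notation is mine, not taken from the preprints: $p_{\text{sim}}$ is the distribution of simulated samples, $f_\theta$ the model, $\ell$ the loss, and $\epsilon$ the perturbation budget.

$$
\min_{\theta}\; \mathbb{E}_{(x,y)\sim p_{\text{sim}}}\left[\,\max_{\|\delta\|\le\epsilon}\; \ell\big(f_\theta(x+\delta),\,y\big)\right]
$$

Both the outer expectation and the labels $y$ are defined by $p_{\text{sim}}$, and the inner maximization only moves $x$ inside an $\epsilon$-ball. No term compares $p_{\text{sim}}$ with the real-data distribution, which is where an initialization or boundary-condition bias would show up.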

The intuition: adversarial training explores small perturbations *around the training samples*, and every one of those samples came from the simulator. If biased initial or boundary conditions shaped that whole distribution, the perturbed inputs inherit the same bias, and the labels they are checked against come from the same simulator, so nothing in the training loop supplies a contrasting reference. It's like training a model to be robust to weather variations in photographs: useful, but it won't reveal a systematic color shift that the camera has had since the factory.
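A toy sketch of this intuition, not drawn from any of the preprints. It assumes a one-dimensional "simulator" whose observed feature carries a constant `offset` (standing in for a biased initial or boundary condition), a logistic-regression model, and FGSM-style perturbations. All names here (`simulate`, `fgsm`, `train_adversarial`, `EPS`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 0.1  # hypothetical perturbation budget


def simulate(n, offset):
    """Toy 'simulator': the label depends on the signal only,
    but the observed feature is shifted by a constant offset."""
    signal = rng.normal(size=n)
    x = signal + offset            # offset = 0.0 would be an unbiased setup
    y = np.where(signal > 0, 1.0, -1.0)
    return x, y


def fgsm(x, y, w, eps):
    """Worst-case bounded perturbation for a 1-D linear model:
    move each x by eps in the direction that hurts its own label."""
    return x - eps * y * np.sign(w)


def train_adversarial(x, y, eps, lr=0.5, steps=5000):
    """Logistic regression fitted by gradient descent on perturbed inputs."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        x_adv = fgsm(x, y, w, eps)
        margin = y * (w * x_adv + b)
        s = 0.5 * (1.0 - np.tanh(0.5 * margin))  # equals sigmoid(-margin)
        w -= lr * np.mean(-y * x_adv * s)
        b -= lr * np.mean(-y * s)
    return w, b


def accuracy(x, y, w, b):
    return float(np.mean(np.sign(w * x + b) == y))


# Training data comes from the biased simulator; "real" data has no offset.
x_sim, y_sim = simulate(2000, offset=1.0)
x_real, y_real = simulate(2000, offset=0.0)

w, b = train_adversarial(x_sim, y_sim, EPS)

# Robust accuracy is measured against the same simulator that produced the bias.
sim_robust_acc = accuracy(fgsm(x_sim, y_sim, w, EPS), y_sim, w, b)
real_acc = accuracy(x_real, y_real, w, b)

print(f"learned threshold -b/w:         {-b / w:.2f}")
print(f"robust accuracy on simulation:  {sim_robust_acc:.2f}")
print(f"accuracy on unbiased data:      {real_acc:.2f}")
```

In this setup the learned threshold should settle near the simulator's offset rather than near zero, so the model looks robust when judged on simulated data while misclassifying a sizeable share of the unbiased samples. The gap only becomes visible because the script compares against an independent reference set, which the adversarial loop itself never sees.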

The archive does not contain a preprint directly explaining why initialization or boundary condition biases escape adversarial detection.