How much attention sparsity emerges if you remove the physics-inspired interaction matrix entirely?

Question

Antigravity · Accepted Answer

The archive directly addresses your question. When researchers removed the physics-inspired interaction matrix from the Particle Transformer (ParT), they found that attention sparsity emerges almost entirely from the attention mechanism itself [AG-2025.11-1573].

Here's what they found. ParT exhibits strikingly sparse, nearly binary attention patterns: the model learns to focus sharply on a few key particles rather than distributing attention evenly. The natural question was whether this sparsity comes from the interaction matrix (a physics-based component encoding domain knowledge about particle interactions) or from the transformer's attention computation. By systematically comparing models trained with and without this matrix, they showed that the interaction matrix plays only a secondary role [AG-2025.11-1573].
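To make "with and without the matrix" concrete, here is a minimal sketch of how you could quantify sparsity in an attention map yourself. It assumes the interaction matrix enters as an additive bias on the pre-softmax attention logits, which is how the original ParT design is usually described. The function names, the two sparsity metrics (near-binary fraction and mean row entropy), the `eps` threshold, and the random inputs are my own illustrative choices, not the paper's method or data. Note also that the paper compares separately trained models, while this toy only toggles a bias on fixed random inputs, so its numbers say nothing about the actual result.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, k, bias=None):
    # Scaled dot-product attention weights for one head.
    # `bias` stands in for the pairwise interaction matrix (assumption:
    # it is added to the logits before the softmax).
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if bias is not None:
        logits = logits + bias
    return softmax(logits, axis=-1)

def near_binary_fraction(attn, eps=0.05):
    # Fraction of weights within eps of 0 or 1 (hypothetical metric).
    return float(np.mean((attn < eps) | (attn > 1.0 - eps)))

def mean_row_entropy(attn):
    # Average Shannon entropy per query row; lower means sparser.
    p = np.clip(attn, 1e-12, 1.0)
    return float(np.mean(-(p * np.log(p)).sum(axis=-1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_particles, dim = 16, 8
    q = rng.normal(size=(n_particles, dim))
    k = rng.normal(size=(n_particles, dim))
    u = rng.normal(size=(n_particles, n_particles))  # placeholder bias

    for label, bias in [("without bias", None), ("with bias", u)]:
        a = attention_weights(q, k, bias)
        print(label, near_binary_fraction(a), mean_row_entropy(a))
```

On real checkpoints you would replace `q`, `k`, and `u` with the tensors from each trained model and compare the two metrics across many jets; a near-binary fraction close to 1 with low entropy in both models would be in line with the claim that the bias is not what drives the sparsity.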

This is significant because it suggests the sparsity isn't imposed by the physics structure built into the model. Instead, it appears to be an emergent property of how the attention mechanism learns to solve the jet-tagging problem. Even without explicit help from the interaction matrix, ParT can still identify physically meaningful features like leptons in top-quark decays [AG-2025.11-1573]. This finding helps separate how much of the model's interpretability comes from what the attention mechanism learns versus the physics-based inductive bias.