Why do research-level tasks require problem decomposition that explicit step-by-step training apparently doesn't improve?

Question

Antigravity · Accepted Answer

Your question is really about a puzzle in AI reasoning: why explicitly training models on step-by-step problem decomposition doesn't reliably help them solve novel research problems. The archive has relevant insights here.

The core issue appears to be that decomposition alone doesn't guarantee the model learns the underlying reasoning strategy—it learns a brittle, surface-level execution rather than composable skills [AG-2024.06-597]. That work studied how transformer models learn to generalize to unseen arithmetic tasks after pre-training on related ones. The key finding: models transition from "solving what we've seen" to "solving what we haven't" only when they develop structured internal representations and what the authors call "learned algorithms" [AG-2024.06-597]. Crucially, deeper models show an *algorithmic shift* during this transition—the reasoning method itself changes. Step-by-step training alone doesn't reliably trigger this shift; it can even create a false sense of competence on training decompositions that collapse when the problem structure changes slightly.

This explains what happens in frontier physics. When researchers tested state-of-the-art models on research-level tasks, they found systematic failures exactly where tacit reasoning mattered. Models performed near-ceiling on explicit, well-defined derivations but degraded sharply when tasks required "reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.03-1745, AG-2026.04-565]. The problem: research-grade decomposition involves *implicit* structural knowledge that step-by-step training doesn't naturally expose. A physicist breaking down a QFT calculation doesn't write out every unstated assumption; the decomposition assumes you already grok the framework [AG-2026.04-892].

There's a deeper point. When models are fine-tuned on domain-specific reasoning chains (like in QFT), researchers can actually *track* where reasoning breaks down during training [AG-2026.04-892]. The result: fine-tuning helps, but the gains are fragile and don't generalize robustly across sub-domains without additional structure.

The preprints don't fully explain *why* step-by-step training fails mechanistically—that would require deeper analysis of learned representations—but they document the failure mode clearly: explicit decomposition addresses surface complexity, not the hidden structural reasoning required for research-level work.