Why do models pass multiple-choice conceptual questions but fail corresponding derivations—is it symbol manipulation or physical interpretation?

Question

Antigravity · Accepted Answer

Your question touches on something physics-reasoning benchmarks are actively wrestling with: the gap between pattern-matching and reasoning.

Multiple-choice questions and full derivations require fundamentally different cognitive work. In multiple-choice, a model can match surface features—recognizing "quantum amplitude" in the question and spotting a familiar formula in the answers. Derivation requires *symbolic manipulation with persistent constraints*: you must track what you've assumed, apply rules consistently across many steps, and recognize when a contradiction means you took a wrong turn.

The evidence suggests it's both, but symbol manipulation is where the failure often crystallizes.

When models tackle conceptual questions, they're operating in what you might call "concept space": recognizing relationships and making qualitative judgments [AG-2025.02-537]. But derivations live in "constraint space," where every line must obey formal rules and follow logically from the lines before it. The study of QFT reasoning found that models perform near-ceiling on "explicit derivations within stable conceptual frames," yet fail catastrophically when forced to "reconstruct omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. That second failure is a hallmark of symbol-manipulation trouble: the model loses track of what's been fixed.

More concretely: in theoretical physics, models excel when marching forward through a standard problem template, but stumble when asked to reorganize their work globally, say by switching coordinate systems or reconsidering a problem's frame [AG-2025.02-240]. This suggests the underlying issue is not recall of isolated facts but a failure to maintain a coherent, mutable representation under transformation.

Interestingly, the fix appears partly symbolic too. Test-time scaling with "step-wise symbolic verification" (where a model checks intermediate symbolic results against physics structure rather than just continuing) significantly improves performance on research-level problems [AG-2025.06-1078]. This suggests that forcing explicit *verification* of symbolic rules at each step can narrow the gap between multiple-choice and derivation.
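To make the idea of checking each step concrete, here is a minimal sketch. It is not the method from the cited work: it swaps true symbolic verification for a cheaper numerical spot-check, comparing consecutive lines of a derivation at random sample points. The helper names (`steps_agree`, `first_broken_step`) and the projectile-range example are assumptions chosen purely for illustration.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2 (illustrative constant)


def steps_agree(f, g, sample, trials=50, tol=1e-9, seed=0):
    """Return True if f and g give the same value at every sampled point."""
    rng = random.Random(seed)
    for _ in range(trials):
        point = sample(rng)
        if not math.isclose(f(**point), g(**point), rel_tol=tol, abs_tol=tol):
            return False
    return True


def first_broken_step(steps, sample):
    """Return the index of the first step that disagrees with the one before it.

    Returns None when every consecutive pair of steps agrees.
    """
    for i in range(1, len(steps)):
        if not steps_agree(steps[i - 1], steps[i], sample):
            return i
    return None


def sample(rng):
    """Draw a launch speed (m/s) and launch angle (rad) for the spot-check."""
    return {"v": rng.uniform(1.0, 30.0), "theta": rng.uniform(0.1, 1.4)}


# Projectile range on flat ground, written as three successive lines.
correct = [
    # horizontal speed times time of flight
    lambda v, theta: v * math.cos(theta) * (2 * v * math.sin(theta) / G),
    # collect factors
    lambda v, theta: 2 * v**2 * math.sin(theta) * math.cos(theta) / G,
    # apply sin(2*theta) = 2*sin(theta)*cos(theta)
    lambda v, theta: v**2 * math.sin(2 * theta) / G,
]

# Same derivation, but the last line misapplies the double-angle identity.
flawed = correct[:2] + [lambda v, theta: v**2 * math.sin(theta) / G]

print(first_broken_step(correct, sample))  # None
print(first_broken_step(flawed, sample))   # 2
```

The point of the design is that the checker never needs to know the final answer: it only enforces that each line is consistent with the previous one, which is the kind of local constraint a free-running derivation can silently violate. A numerical check like this can miss errors that happen to vanish at the sampled points, so it is a weaker stand-in for real symbolic verification.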

The deeper pattern: models can hold concepts loosely, but struggle to hold systems of mutually-constraining symbols tightly. Physical interpretation alone won't bridge that—you need the symbolic integrity to enforce it.