Antigravity Q&A
Why do models pass multiple-choice conceptual questions but fail corresponding derivations—is it symbol manipulation or physical interpretation?
Answered by claude-haiku-4-5 · 29 April 2026
Answer
Your question touches on something physics-reasoning benchmarks are actively wrestling with: the gap between pattern-matching and reasoning.
Multiple-choice questions and full derivations require fundamentally different cognitive work. In multiple-choice, a model can match surface features—recognizing "quantum amplitude" in the question and spotting a familiar formula in the answers. Derivation requires *symbolic manipulation with persistent constraints*: you must track what you've assumed, apply rules consistently across many steps, and recognize when a contradiction means you took a wrong turn.
The evidence suggests it's both, but symbol manipulation is where the failure often crystallizes.
When models tackle conceptual questions, they're operating in what you might call "concept space": recognizing relationships and making qualitative judgments [AG-2025.02-537]. But derivations live in "constraint space," where every line must obey formal rules and connect logically to previous lines. The study of QFT reasoning found that models perform near-ceiling on "explicit derivations within stable conceptual frames," yet fail catastrophically when forced to "reconstruct omitted reasoning steps or reorganization of representations under global consistency constraints" [AG-2026.04-565]. That second failure is a hallmark of symbol-manipulation trouble: the model loses track of what has already been fixed.
More concretely: in theoretical physics, models excel when marching forward through a standard problem template, but stumble when asked to reorganize their work globally—say, switching coordinate systems or reconsidering a problem's frame [AG-2025.02-240]. This suggests the underlying issue is not raw pattern-matching on isolated facts, but failure to maintain a coherent, mutable representation under transformation.
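To make "coherent under transformation" concrete, here is a minimal sketch of a frame-switch consistency check. It is a toy example chosen for illustration, not a task from the cited benchmark: the kinetic energy of a particle in a plane is computed once in Cartesian and once in polar coordinates, and the two results must agree. All function names and numerical values are hypothetical.

```python
import math

def kinetic_cartesian(m, xdot, ydot):
    """T = (1/2) m (xdot^2 + ydot^2)."""
    return 0.5 * m * (xdot ** 2 + ydot ** 2)

def kinetic_polar(m, r, rdot, thetadot):
    """T = (1/2) m (rdot^2 + r^2 thetadot^2)."""
    return 0.5 * m * (rdot ** 2 + (r * thetadot) ** 2)

def polar_to_cartesian_velocity(r, theta, rdot, thetadot):
    """Differentiate x = r cos(theta), y = r sin(theta) with respect to time."""
    xdot = rdot * math.cos(theta) - r * math.sin(theta) * thetadot
    ydot = rdot * math.sin(theta) + r * math.cos(theta) * thetadot
    return xdot, ydot

# Arbitrary illustrative state (assumed values, not from any source).
m, r, theta, rdot, thetadot = 2.0, 1.5, 0.7, 0.3, -1.2

xdot, ydot = polar_to_cartesian_velocity(r, theta, rdot, thetadot)
t_cart = kinetic_cartesian(m, xdot, ydot)
t_polar = kinetic_polar(m, r, rdot, thetadot)

# The same physical quantity must come out the same in both frames.
frames_agree = math.isclose(t_cart, t_polar, rel_tol=1e-9)

# A typical slip when switching frames: dropping the r^2 factor.
t_slip = 0.5 * m * (rdot ** 2 + thetadot ** 2)
slip_detected = not math.isclose(t_cart, t_slip, rel_tol=1e-9)

print(frames_agree, slip_detected)
```

The point of the check is that the slip is invisible within the polar frame alone; it only shows up when the result is compared against the other representation.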
Interestingly, the fix appears to be partly symbolic too. Test-time scaling with "step-wise symbolic verification," in which a model checks intermediate symbolic results against physics structure rather than just continuing, significantly improves performance on research-level problems [AG-2025.06-1078]. This suggests that forcing explicit *verification* of symbolic rules at each step can narrow the gap between multiple-choice and derivation performance.
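As a rough illustration of the idea of checking each step instead of only the final answer, the sketch below compares consecutive lines of a derivation numerically at random sample points. This is an assumption-laden stand-in, not the method used in the cited study: it uses numerical spot checks rather than true symbolic verification, and the helper names `steps_agree` and `first_bad_step` are hypothetical.

```python
import math
import random

def steps_agree(f, g, trials=200, tol=1e-9, seed=0):
    """Return True if f and g agree at every random sample point."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-3.0, 3.0)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True

def first_bad_step(steps):
    """Return the index of the first step that disagrees with its predecessor, or None."""
    for i in range(1, len(steps)):
        if not steps_agree(steps[i - 1], steps[i]):
            return i
    return None

# Toy derivation: simplify (x + 1)^2 - (x - 1)^2 down to 4x.
good = [
    lambda x: (x + 1) ** 2 - (x - 1) ** 2,
    lambda x: (x * x + 2 * x + 1) - (x * x - 2 * x + 1),
    lambda x: 4 * x,
]

# Same derivation with a sign slip in the expansion of (x - 1)^2.
bad = [
    lambda x: (x + 1) ** 2 - (x - 1) ** 2,
    lambda x: (x * x + 2 * x + 1) - (x * x + 2 * x + 1),  # wrong sign on 2x
    lambda x: 0 * x,  # follows correctly from the wrong line above
]

print(first_bad_step(good))
print(first_bad_step(bad))
```

In the faulty derivation the last line follows correctly from the line before it, so a check that only looks at local plausibility of the final step would pass. Comparing each step against its predecessor localizes the error to the line where the sign slipped.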
The deeper pattern: models can hold concepts loosely, but struggle to hold systems of mutually-constraining symbols tightly. Physical interpretation alone won't bridge that—you need the symbolic integrity to enforce it.
Sources · 8
- 60% · physics.comp-ph · Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs · AG-2026.04-565
- 56% · physics.ed-ph · The Role of Conceptual Problem Solving in Learning Physics: A Study in a General Relativity University Course · AG-2025.02-537
- 55% · cs.AI · Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark · AG-2025.09-1139
- 51% · physics.data-an · Towards a Large Physics Benchmark · AG-2025.07-1634
- 51% · cs.LG · Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics · AG-2025.02-240
- 50% · cs.LG · Fine-Tuning Small Reasoning Models for Quantum Field Theory · AG-2026.04-892
- 49% · cs.LG · Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset · AG-2025.06-1078
- 49% · cs.LG · Robust Reasoning as a Symmetry-Protected Topological Phase · AG-2026.01-735
Keep exploring
- How does performance change when models must switch reference frames midway through a single derivation?
- Does step-wise verification work equally well across different physics domains, or does it depend on symbol density?
- Can models catch their own symbolic errors if forced to derive the same result using two independent methods?
This is a research aid — not a peer review. Verify sources before citing.