Could living benchmarks identify which LLM reasoning gaps are universal versus domain-specific?

Question

Antigravity · Accepted Answer

Your question asks whether living benchmarks (evolving repositories curated by active practitioners) could reveal whether LLM reasoning failures are intrinsic to the models or specific to particular domains. It is ultimately a question about scientific methodology.

The archive contains exactly one explicit proposal for a living benchmark: [AG-2025.07-1634]. The authors propose a framework where physicists contribute new questions "alongside new publications," with each question scored for correctness, difficulty, and surprise. The motivating idea is clear: a static benchmark becomes stale as research advances and models improve; a living one lets the community steer development toward real problems. This would let you watch whether a model's weakness on, say, quantum field theory reasoning persists as new question variants are added, or whether the same weakness also shows up in other areas.

Several other benchmarks provide the *necessary infrastructure* for such comparisons. [AG-2025.09-1139] decomposed 71 research challenges into 190 checkpoint tasks to enable fine-grained failure analysis across condensed matter, quantum physics, astrophysics, and other domains. [AG-2026.04-565] introduced a five-level grading rubric distinguishing explicit derivations from tacit reasoning steps—crucial for spotting whether failures occur at the same logical level across domains or shift with context. And [AG-2025.06-1078] showed that test-time scaling methods developed on math (AIME) don't transfer equally to physics, hinting that some gaps may be domain-specific.

Taken together, these works suggest the answer is yes, with conditions: you could use living benchmarks to separate universal from domain-specific gaps, but only if you keep the benchmark growing, as [AG-2025.07-1634] proposes, and apply consistent grading rubrics (like those in [AG-2026.04-565]) so that results are comparable across fields. The harder part, which none of these papers fully solves, is automating that curation and grading at scale without losing expert judgment.
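To make the universal-versus-domain-specific distinction concrete, here is a minimal sketch of how graded results could be sorted once a shared rubric is in place. It is not taken from any of the cited papers: the function name `classify_gaps`, the `(domain, rubric_level, passed)` record shape, the integer rubric levels, and the 0.5 failure-rate threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def classify_gaps(results, threshold=0.5):
    """Label each rubric level as a universal or domain-specific gap.

    results: iterable of (domain, rubric_level, passed) tuples (hypothetical shape).
    threshold: failure rate at or above which a domain counts as "failing"
               (an arbitrary illustrative cutoff, not from the cited papers).
    """
    # tally[level][domain] = [failures, total attempts]
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for domain, level, passed in results:
        cell = tally[level][domain]
        cell[0] += 0 if passed else 1
        cell[1] += 1

    labels = {}
    for level, by_domain in tally.items():
        rates = {d: fails / total for d, (fails, total) in by_domain.items()}
        failing = [d for d, rate in rates.items() if rate >= threshold]
        if len(rates) < 2:
            # One domain alone cannot distinguish the two cases.
            labels[level] = "undetermined"
        elif len(failing) == len(rates):
            labels[level] = "universal"
        elif failing:
            labels[level] = "domain-specific"
        else:
            labels[level] = "no gap"
    return labels


if __name__ == "__main__":
    # Toy data, invented purely to exercise the function.
    toy = [
        ("condensed matter", 1, False), ("astrophysics", 1, False),
        ("condensed matter", 2, True), ("astrophysics", 2, False),
        ("condensed matter", 3, True), ("astrophysics", 3, True),
    ]
    print(classify_gaps(toy))
```

The comparison only means something if every domain is graded against the same rubric levels, which is why consistent grading matters more here than the particular threshold. In a living benchmark, the same function could be rerun as new questions arrive to see whether a label is stable over time.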