Why might evaluation metrics designed for scientific text differ from metrics needed for code documentation?

Question

Antigravity · Accepted Answer

This question isn't about frontier science in the physics, propulsion, or mathematical physics domains that this archive covers; it concerns machine learning evaluation methodology in general. The preprints in the collection do touch on exactly this tension, though, so I can answer from what they contain.

The core difference: executability vs. reasoning depth

Code evaluation has a sharp pass/fail criterion: either the code runs and produces the correct output, or it doesn't [AG-2024.06-2131]. Scientific text evaluation, by contrast, admits many degrees of partial correctness—you can have the right conceptual idea but miss a tacit step, or derive a formula correctly within one framework but fail to recognize when it breaks down [AG-2026.04-565].

Why this matters: A quantum computing coding task can be graded objectively by running test cases [AG-2024.06-2131], but evaluating whether an LLM correctly reconstructs a reasoning step in quantum field theory requires human experts and a rubric that separates statement correctness from "tacit step reconstruction" [AG-2026.04-565]. The first yields a binary verdict; the second yields a score on a five-level scale.
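To make the contrast concrete, here is a minimal Python sketch under stated assumptions. The function names `passes_tests` and `rubric_score`, the toy test cases, and the encoding of the five levels as integers 0 to 4 are all hypothetical illustrations; neither preprint's actual harness or rubric is reproduced here.

```python
from typing import Callable, Sequence, Tuple


def passes_tests(candidate: Callable[[int], int],
                 cases: Sequence[Tuple[int, int]]) -> bool:
    """Binary grade: True only if the candidate passes every test case."""
    for arg, expected in cases:
        try:
            if candidate(arg) != expected:
                return False
        except Exception:
            # A crash counts as a failure, same as a wrong answer.
            return False
    return True


def rubric_score(levels: Sequence[int], max_level: int = 4) -> float:
    """Graded score: mean of expert-assigned ordinal levels, scaled to [0, 1].

    Assumes (hypothetically) that the five rubric levels are coded 0..4.
    """
    if not levels:
        raise ValueError("at least one rated step is required")
    if any(not 0 <= lv <= max_level for lv in levels):
        raise ValueError("levels must lie in 0..max_level")
    return sum(levels) / (max_level * len(levels))


# Toy test suite for a "square the input" task.
cases = [(2, 4), (3, 9)]
print(passes_tests(lambda x: x * x, cases))  # True
print(passes_tests(lambda x: x + x, cases))  # False: 3 + 3 != 9

# Three reasoning steps rated by an expert at levels 4, 2, and 3.
print(rubric_score([4, 2, 3]))  # 0.75
```

The second candidate gets the first test case right and still receives the same verdict as a program that gets nothing right, whereas the rubric score keeps partial credit visible. That is the asymmetry the two kinds of metric are built around.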

A concrete example across the archive

The Qiskit HumanEval benchmark [AG-2024.06-2131] grades code on whether generated quantum programs execute and pass test suites—a crisp metric. By contrast, the "Grading the Unspoken" framework [AG-2026.04-565] introduces a five-level rubric specifically because "correctness in these domains is layered, tacit, and fundamentally non-binary." Intermediate conceptual steps and implicit structural constraints can't be captured by answer-matching alone.

Even physics text evaluation [AG-2025.07-1634] requires scoring each question for correctness *and* difficulty *and* surprise—categories that don't apply to code execution metrics. A code solution either passes its test case or it doesn't; a physics derivation can be "correct but uninspired" or "correct in a surprisingly elegant way."
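The same point can be sketched as a data structure. This is an illustration only: the `PhysicsScore` record, the 0 to 4 scales, and the `collapse_to_pass_fail` helper are hypothetical names and choices, not the scoring scheme used in [AG-2025.07-1634].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsScore:
    """One graded item, scored on three separate axes (hypothetical 0..4 scales)."""
    correctness: int
    difficulty: int
    surprise: int


def collapse_to_pass_fail(score: PhysicsScore, threshold: int = 4) -> bool:
    """Reduce a multi-axis score to a code-style binary verdict."""
    return score.correctness >= threshold


# Two derivations that are equally correct but differ in how surprising they are.
uninspired = PhysicsScore(correctness=4, difficulty=2, surprise=0)
elegant = PhysicsScore(correctness=4, difficulty=2, surprise=4)

# The binary verdict cannot tell them apart; the full record can.
print(collapse_to_pass_fail(uninspired) == collapse_to_pass_fail(elegant))  # True
print(uninspired == elegant)  # False
```

Collapsing to pass/fail discards exactly the axes that separate "correct but uninspired" from "correct in a surprisingly elegant way", which is why a metric borrowed from code execution would under-describe these answers.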