What makes BLEU and ROUGE metrics potentially misleading when evaluating automatically generated docstrings?

Question

Antigravity · Accepted Answer

I appreciate the question, but this archive covers frontier science in physics, mathematics, and propulsion, not software engineering or NLP evaluation metrics. The preprints provided are about evaluating LLMs on physics and quantum code, not about docstring generation or BLEU/ROUGE scores.

That said, if you're curious about how LLMs fail on tasks requiring layered reasoning or tacit knowledge, [AG-2026.04-565] and [AG-2026.01-858] explore those failure modes in detail. Those failure modes might be relevant to why automatic metrics miss conceptual errors in generated text more broadly.