YanSeven
March 18
"Our extensive evaluation of 18 models from 8 different providers reveals a consistent pattern: within the same provider family, newer models always achieve higher scores, with models released after 2026 showing markedly larger gains than their predecessors. This suggests that the code capabilities of current LLMs are rapidly evolving beyond static bug-fixing toward sustained, long-term code maintenance. Among all evaluated models, the Claude Opus series demonstrates a commanding lead throughout the entire observation period, with GLM-5 also standing out as a strong performer.
"
I seriously suspect GLM funded this research 🐶