On Hacker News

FrontierCode

225 points · 45 comments · 1 notable voice

The 5-second version

  • FrontierCode is a new benchmark measuring code mergeability and quality rather than just correctness, using maintainer-defined rubrics for real open-source repos.
  • Top models struggle significantly: Claude Opus 4.8 leads with only 13.4% on Diamond, while GPT-5.5 achieves 6.3% with 4x fewer tokens, and open-source Kimi K2.6 reaches just 3.8%.
  • The benchmark achieves 81% fewer false positives than SWE-Bench Pro through rigorous QC including adversarial testing, calibration, and manual review by researchers.
  • Tasks are hand-crafted by 20+ world-class maintainers spending 40+ hours each, featuring shorter prompts, multi-PR chains, and triple the language diversity of prior benchmarks.
  • Difficulty scales through quality rubrics rather than patch size, with three nested subsets—Diamond (50 tasks), Main (100), and Extended (150)—where even Extended remains unsaturated at 51.8% for the best model.
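The nested-subset scoring in the last bullet can be illustrated with a minimal sketch. Only the subset names and sizes come from the summary above. The rest is assumed for illustration: that the subsets nest as prefixes of one ordered task list, that a task counts as solved only when every rubric item passes, and the `TaskResult` and `pass_rates` names themselves. None of this is confirmed as FrontierCode's actual scoring method.

```python
from dataclasses import dataclass

# Subset sizes are from the summary; nesting by list prefix is an assumption.
SUBSET_SIZES = {"Diamond": 50, "Main": 100, "Extended": 150}


@dataclass
class TaskResult:
    """Hypothetical record of one model attempt at one task."""
    task_id: int
    rubric_checks: list[bool]  # assumed: one entry per maintainer rubric item

    @property
    def solved(self) -> bool:
        # Assumed rule: a task is solved only if every rubric item passes.
        return bool(self.rubric_checks) and all(self.rubric_checks)


def pass_rates(results: list[TaskResult]) -> dict[str, float]:
    """Return the percentage of solved tasks for each nested subset."""
    if not results:
        raise ValueError("no results to score")
    ordered = sorted(results, key=lambda r: r.task_id)
    rates = {}
    for name, size in SUBSET_SIZES.items():
        subset = ordered[:size]  # smaller subsets are prefixes of larger ones
        rates[name] = 100.0 * sum(r.solved for r in subset) / len(subset)
    return rates
```

Under these assumptions, a model can score far higher on Extended than on Diamond, because every Diamond task is also counted in Main and Extended alongside the additional tasks those subsets add.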

Top voices

Verbatim comments from the thread's most notable / highest-karma participants.

swyx · notable · 23.9k karma · 13 comments
:wave: i was on the team! AMA. some headlines - 3000 rubrics on code quality. First benchmark to measure: "would this code get actually merged?" - 20+ expert open-source maintainer created tasks on their own repos to capture their opinion & taste. - total 1000+ hours of real life software maintainer work captured in dataset. ON TOP of that, 40+ hours of real human work to turn that real life work into well validated and structured tasks with rubrics (even more work to turn tasks/prompts from…
epolanski · 12.9k karma
I wish there was a new kind of benchmark that...wasn't focused on prompt-to-complete-task completion, rather on how well a model can act an assistant. At my day job, despite all the harnessing and providing extensive documentation and user stories via E2Es, I cannot trust models to deliver quality output. They are unable to, and reviewing 18 files of changes is the kind of work that increases my load and effort. And yes, we have already split and optimized our documentation to not overwhelm th…
vessenes · 12.7k karma · 2 comments
This looks great. Well reasoned, tons of work put into eval, thanks for building it. It strikes me as kind of wild that good evals can drive tens to hundreds of millions of dollars of compute deployment in the wild — there’s something new and collaborative and competitive about the eval / frontier model race that’s quite interesting.. In this case “shorter actually mergable patches that open source maintainers would accept” feels like a great thing to deliver to the world. I didn’t deep dive…
fouc · 6.8k karma
I'm a bit disappointed that Opus 4.6 wasn't in this because the tokenizer changed quite a bit from 4.7 onward. I was so annoyed by 4.7 that I've been forcing 4.6 ever since. I've been annoyed by 4.8 a bit too, so I haven't felt the urge to move on.