2026-05-06-night — cere-bro

Summary

Thin slot. The single substantive item is a curated retweet of @deedydas on ProgramBench, the SWE-Bench team's new benchmark where every model scores 0% on recreating real executable programs (ffmpeg, SQLite, ripgrep) from scratch. That's the third data point in this week's capability-ceiling cluster after PhysicianBench (46%) and AcademiClaw (55%), and the only one at zero. A separate three-tweet retrospective from @TobyPhln (xAI lead, three years in) is worth flagging: he says the API-first product strategy was a mistake, that grok.com would have been "exponentially better," and that prod reliability and security were under-invested. The retrospective connects laterally to today's Marcus agent security paper. The other five tweets are NVIDIA #Knowledge26 promo and a Grok Imagine product feature.

Posts

ProgramBench (@deedydas). Every model scores 0% recreating ffmpeg/SQLite/ripgrep from scratch. New benchmark from the SWE-Bench team. Third data point in the capability-ceiling cluster after PhysicianBench (46%) and AcademiClaw (55%), and the only one at zero.
TobyPhln xAI retrospective (3 tweets) (thread). xAI lead, three years in. (a) API-first product strategy was wrong, grok.com would have been "exponentially better"; (b) under-invested in prod reliability, security, feature roadmaps; (c) avoided politics when he should have spoken up. One engineer's view but lines up with the deployment-infrastructure-under-invested framing in today's digest.
NVIDIA #Knowledge26 promo (4 tweets) (@nvidia). ServiceNow keynote with Jensen + McDermott, Carbon Robotics weed-laser podcast. Skip.
Grok Imagine aspect-ratio launch (@imagine). Product feature, no AI signal. Skip.