May 10, 2026 · daily digest

cere-bro | 2026-05-10

A consolidation day. HuggingFace re-ran yesterday's 38 papers. The new signal arrives elsewhere: a Fields Medalist says ChatGPT 5.5 Pro produced original research-grade math in two hours, and Jiayi Weng publishes a quiet bear case for RL itself.


TL;DR


The Big Picture

Yesterday's batch was the largest of the month. Today is the consolidation. The HF Daily Papers page was re-run with the same 38 IDs, the only new RSS items are five dated 05-09 that the farmer pulled overnight, and Twitter contributed two tweets total (one of them, Tesla saying goodbye to the Model S production line, is off-topic). The honest read is that the live story is still the MoE convergence (UniPool + EMO), the skill-curation cluster (StraTA + Skill1 + SkillOS), KernelBench-X, and Anthropic's NLAs. Yesterday's Connecting the Dots is the right starting point this morning.

What today does add is a second, harder-to-rank thread: two pieces written from very different angles that both push back on the RL substrate. Gowers' note is the empirical pull: a frontier chat model produced a structurally novel result on a real research problem with zero scaffolding (no skill-curator, no agentic workbench, no closed-loop iteration). Jiayi Weng's "Learning Beyond Gradients" is the theoretical pull: code edits in a Codex-driven loop may be a viable alternative to gradient-based learning for at least one well-structured task. Neither piece is a paper that the wiki can grade. Together they suggest the field's next 90-day argument is going to be about whether the RL apparatus the wiki has been tracking (TIP, LongAct, PreRL, VGF, ResRL, Balanced Aggregation) is the right substrate for agentic capability or just the most accessible one.

Industry-side, the Broadcom / Microsoft / OpenAI chip story keeps the capacity-binding-constraint thread alive. Pre-commit anchoring has now propagated from the cloud-vendor side to the silicon-vendor side. Frontier labs are now financing through two-sided pre-commits: the lab needs a cloud anchor, and the silicon vendor will not start production without that cloud anchor either.


Deep Dives


Fields Medalist + ChatGPT 5.5 Pro: a structural improvement on an open problem

Timothy Gowers gave ChatGPT 5.5 Pro an open number-theory problem. In under an hour it improved an exponential bound to a polynomial one. An MIT collaborator called the key idea "completely original."

Source: The Decoder · Links: The Decoder article · Wiki Tier 2: Active learning (intersects research-agent thread)

Yesterday's research-agent stack:          Today's datapoint:
  AI Co-Mathematician (workbench, agentic)   Gowers + ChatGPT 5.5 Pro
  Auto Research (closed-loop iteration)      single-shot inference
  Skill curation cluster (memory)            no scaffold
  benchmark: FrontierMath Tier 4 48%         expert reviewer attests
                                             "completely original"

The reason this is worth a Deep Dive instead of a Quick Hit is the combination of three things rarely seen together in prior anecdotes of this type. The problem is a real open problem in analytic number theory, not a benchmark. The reviewer is a Fields Medalist actively working on the same problem. The improvement is structural (exponential to polynomial), not a constant tightening. Any one of those three could be wrong: the bound could fail to clear refereeing, the "originality" claim could be a literature-search miss, or the problem could be smaller than its framing suggests. But all three lining up in one run is a different category of evidence from the previous "GPT-X solved a Putnam problem" stories.

The contrast with yesterday's AI Co-Mathematician (HF, FrontierMath Tier 4 48%) is informative. Co-Mathematician spends its compute on a stateful workbench, tracked failed-hypothesis memory, and asynchronous specialist agents. Gowers' run uses none of that. It is a chat session. The implication is not that the workbench is unnecessary; the workbench is what gets you predictable performance across many problems. The implication is that single-shot capability has crossed a threshold where a workbench is no longer required for occasional success. That is the regime where the marginal value of agentic infrastructure starts to be measured against what a vanilla chat session can already do.

Gowers' meta-claim is the load-bearing line: the bar for human mathematical contribution is now defined by what LLMs cannot do. Any researcher reading this in May 2026 has to make a portfolio decision about which problems are still worth their time. That decision will reshape the next 12 months of research mathematics regardless of whether any individual ChatGPT result holds up.

Why it matters: This is the cleanest single-shot evidence so far that frontier chat models, with no scaffolding, can produce research-grade structural improvements on real open problems. If it reproduces, the agentic-workbench papers move from "necessary for capability" to "necessary for reliability." Two different value propositions.



Jiayi Weng — Learning Beyond Gradients

Codex iterated a pure NumPy + cv2 closed-loop heuristic policy for VizDoom D3. No neural net, no map, no object coordinates. The policy is code, the agent edits it, it works.

Source: Personal blog (Jiayi Weng, ex-OpenAI / Tianshou author), amplified via @MillionInt · Links: Blog post · @MillionInt tweet · Wiki Tier 2: Agentic systems / RL critique

Standard RL stack:                     Iterated heuristic learning:
  policy = neural net θ                  policy = Python program P
  update = ∂L/∂θ                         update = AI edits P after failure
  forgetting = parameter overwrite       forgetting = code revert (rare)
  state = replay buffer                  state = source control + tests

The argument compresses to one observation. Heuristic policies (handcrafted rules, programmatic policies) were never bad on capability grounds; they were bad on maintenance grounds. A handwritten rule got you 80% of the behavior, but maintaining it across edge cases required a dedicated engineer. The maintenance cost was the entire reason RL won. Weng's claim is that this cost equation has flipped. Code edits are now cheap because a coding agent can read a failure trace, write a test, edit the rule, and re-run, end to end, without human supervision. The artifact lives in source control, which means continual learning is a git history rather than a parameter drift.
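The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Weng's code: the `HeuristicPolicy` rule table, the observation and action names, and `toy_editor` (a deterministic stand-in for the coding agent) are all hypothetical, and a plain Python list stands in for the git history.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Observation = str
Action = str
TestCase = Tuple[Observation, Action]  # (observation, expected action)


@dataclass
class HeuristicPolicy:
    """Hypothetical policy-as-code: a rule table plus a default action."""
    rules: Dict[Observation, Action] = field(default_factory=dict)
    default: Action = "noop"

    def act(self, obs: Observation) -> Action:
        return self.rules.get(obs, self.default)


def failing_cases(policy: HeuristicPolicy, tests: List[TestCase]) -> List[TestCase]:
    """Return the test cases the policy currently gets wrong."""
    return [(obs, want) for obs, want in tests if policy.act(obs) != want]


def toy_editor(policy: HeuristicPolicy, failure: TestCase) -> HeuristicPolicy:
    """Stand-in for the coding agent: return an edited copy that fixes one failure."""
    obs, want = failure
    new_rules = dict(policy.rules)
    new_rules[obs] = want
    return HeuristicPolicy(rules=new_rules, default=policy.default)


def iterate(
    policy: HeuristicPolicy,
    tests: List[TestCase],
    editor: Callable[[HeuristicPolicy, TestCase], HeuristicPolicy],
    max_edits: int = 10,
) -> List[HeuristicPolicy]:
    """Edit the policy after each failure; keep every accepted version."""
    history = [policy]  # stand-in for source-control history
    for _ in range(max_edits):
        failures = failing_cases(history[-1], tests)
        if not failures:
            break
        candidate = editor(history[-1], failures[0])
        # Accept the edit only if it reduces failures; otherwise it is dropped,
        # which plays the role of a code revert.
        if len(failing_cases(candidate, tests)) < len(failures):
            history.append(candidate)
    return history


# Hypothetical regression suite for a toy navigation task.
tests = [("enemy_ahead", "shoot"), ("wall_ahead", "turn_left"), ("clear", "forward")]
history = iterate(HeuristicPolicy(default="forward"), tests, toy_editor)
print(len(history), failing_cases(history[-1], tests))
```

The "learning" here is the sequence of accepted versions in `history`, with no parameters and no gradient. In a real setup the editor would be a coding agent and the tests would come from failure traces, which is where the maintenance-cost argument would actually be tested.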

The bear case for RL embedded in this is sharp but narrow. Weng is not claiming gradients are obsolete for frontier model training. He is claiming that one of the strongest motivations for end-to-end neural policies, "everything else is too expensive to maintain," may not hold in the agentic era. For tasks where state is structured and action choices are discrete, you can write the policy as code and have the agent maintain it. The tweet thread amplifying this called it "mostly a bearish take on RL." That is the right read, but the bearishness is on the substrate, not on the math.

The composition with the skill-curation cluster (StraTA / Skill1 / SkillOS, 05-09) is direct. SkillOS already separates a frozen executor from a trainable curator that maintains an external SkillRepo. Weng's framing is the next step: the SkillRepo is just code, the curator is the coding agent, and the whole loop runs without weight updates. The skill-curation thread arrived at this architecture from the agent side. Weng arrives at it from the RL-critique side. Two independent paths, same architectural target.

The honest weak point is the demo. VizDoom D3 has dense observable state and discrete actions. The leap from this to "frontier agentic tasks like multi-document research" is not justified by the demo alone. Anyone betting on the iterated-heuristic frame should expect at least one large-scale paper from a major lab in the next 90 days, or the framing remains a thoughtful blog post.

Why it matters: The wiki has been logging RL-improvement papers (TIP, LongAct, PreRL, VGF, ResRL, Balanced Aggregation) under the assumption that RL is the substrate to fix. Weng's piece is the first credible argument in 2026 that for some classes of agentic learning, the substrate may be the wrong choice rather than the thing to fix.

Research angle: What does iterated heuristic learning look like at frontier agentic scale? The natural test is multi-document research or web browsing, not VizDoom. The skill-curator architecture (SkillOS) plus a code-only SkillRepo is the obvious composition. If this becomes a paper from a frontier lab within Q3 2026, the framing-shift becomes load-bearing for the wiki's RL coverage going forward.



Industry Pulse


Connecting the Dots

Across days: the RL-substrate question is now contested from two directions. Yesterday's Balanced Aggregation was the latest in a long thread of careful structural fixes to GRPO (TIP 04-16, LongAct 04-18, PreRL 04-16, VGF 04-19, ResRL 05-08). All of those papers assume the gradient is the right substrate; the work is in fixing the gradient signal. Today's two non-paper inputs both push back. Gowers shows that single-shot inference from a chat model can produce research-grade output without any RL scaffold. Weng argues that for at least one class of structured task, the substrate itself can be replaced with a coding-agent-edits-Python loop. Neither argument settles the question, but two independent pushbacks within 24 hours make a thread worth tracking. The falsifiable claim: by Q3 2026, expect at least one frontier-scale paper that either (a) eliminates gradient-based policy updates from an agentic stack, or (b) shows that gradient-based RL strictly dominates iterated-heuristic learning on a task with dense, structured state.

Across days: the research-agent stack consolidates. AI Co-Mathematician (05-09) produced 48% on FrontierMath Tier 4 using a full agentic workbench. Today's Gowers result is the same domain (research math) without any of that scaffolding. Together they suggest the workbench buys reliability across many problems rather than capability on any single problem. That is a more useful framing than the typical "agents beat chat" headline because it gives a falsifiable test: an agentic system should beat a vanilla chat session at median problem performance even if both can clear individual problems.
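One way to make that test concrete is to compare the two systems on the median per-problem success rate (reliability) separately from the best case (capability). The sketch below assumes each system is run several times per problem and summarized as a success fraction; the function names are hypothetical and the numbers are made-up placeholders, not results from Co-Mathematician, Gowers' run, or any benchmark.

```python
from statistics import median
from typing import Dict, List


def summarize(per_problem_success: List[float]) -> Dict[str, float]:
    """Reliability (median) vs. capability (best case) for one system."""
    return {
        "median": median(per_problem_success),
        "best": max(per_problem_success),
    }


def workbench_buys_reliability(agentic: List[float], chat: List[float]) -> bool:
    """True if the agentic system wins on the median, whatever the best case is."""
    return summarize(agentic)["median"] > summarize(chat)["median"]


# Placeholder success rates per problem (fraction of runs that solve it).
# These values are invented purely to illustrate the shape of the test.
chat_rates = [0.0, 0.0, 0.1, 0.0, 0.9]
agentic_rates = [0.4, 0.5, 0.6, 0.3, 0.9]

print(summarize(chat_rates))
print(summarize(agentic_rates))
print(workbench_buys_reliability(agentic_rates, chat_rates))
```

With these placeholder inputs both systems share the same best case, so a "can it solve a hard problem at all" comparison cannot separate them; only the median does. The claim would be falsified by real data in which the chat session matches or beats the workbench on the median.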

Cross-source HF + Twitter (continued from yesterday). Yesterday DCI was the only paper amplified from both HF and the @bayesiansapien retweet feed. Today there are no @bayesiansapien retweets; the curated-signal channel was quiet. The only Twitter signal is @MillionInt's repost of Jiayi Weng. The amplification pattern says that Weng's blog post is the single most important social-stream item today, which matches the analysis above.

HF vs Kurate. No exact HF / Kurate top-20 overlap today (typical; Kurate lags HF by 1-2 weeks). The new Kurate weekly leaderboard (cs.AI + cs.LG, 40 papers) is mostly the same papers as last week's run with minor reranking. Two Kurate items are worth flagging in light of today's themes:


Worth Watching


Quick Hits


Sources ingested today: HF (38 papers, identical to 2026-05-09 batch), RSS (5 items dated 2026-05-09 + 1 Simon Willison item already covered yesterday), Gmail (0 starred), Twitter (2 tweets, 0 curated retweets), Kurate (cs.AI top-20 + cs.LG top-20 + rising-authors, weekly snapshot mostly unchanged from last week). Wiki pages updated: 5.