AI Co-Mathematician

TL;DR

A workbench-style agent system for open-ended mathematical research. Asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts. SOTA on hard benchmarks: 48% on FrontierMath Tier 4, a new high among AI systems. Frames the mathematical workflow as collaborative rather than solver-style: ideation, literature search, computational exploration, theorem proving, theory building.

Why this matters

FrontierMath Tier 4 has been the benchmark most resistant to headline results in math reasoning. 48% is a real jump. The interesting design choice is the asynchronous, stateful workspace: the agent maintains its own working state across long sessions, including explicitly tracked failed hypotheses. This is the same pattern Anthropic shipped as "Dreaming" and "Outcomes" in Claude Managed Agents (this week). Research follows product follows research.
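To make "explicitly tracked failed hypotheses" concrete, here is a minimal sketch of what hypothesis-level state could look like. This is not the paper's implementation: `Workspace`, `Hypothesis`, `Status`, and every method name are hypothetical, chosen only to illustrate the pattern of persisting dead ends so a long session does not retry them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    PROVED = "proved"
    FAILED = "failed"


@dataclass
class Hypothesis:
    statement: str
    status: Status = Status.OPEN
    # Free-text record of why the hypothesis failed (counterexamples, etc.).
    notes: list[str] = field(default_factory=list)


@dataclass
class Workspace:
    """Hypothetical session state that persists across agent turns."""

    hypotheses: dict[str, Hypothesis] = field(default_factory=dict)

    def propose(self, key: str, statement: str) -> bool:
        """Register a hypothesis; refuse to reopen one already marked failed."""
        existing = self.hypotheses.get(key)
        if existing is not None and existing.status is Status.FAILED:
            return False
        self.hypotheses.setdefault(key, Hypothesis(statement))
        return True

    def mark_failed(self, key: str, reason: str) -> None:
        """Record a dead end together with the reason it failed."""
        hypothesis = self.hypotheses[key]
        hypothesis.status = Status.FAILED
        hypothesis.notes.append(reason)

    def failed(self) -> list[Hypothesis]:
        """Return every hypothesis currently marked as failed."""
        return [h for h in self.hypotheses.values() if h.status is Status.FAILED]


ws = Workspace()
ws.propose("h1", "Every even n > 2 in the sample is a sum of two primes")
ws.propose("h2", "n^2 + n + 41 is prime for all n >= 0")
ws.mark_failed("h2", "counterexample: n = 40 gives 41^2")

# A later turn tries the same idea again; the workspace blocks the retry.
retry_allowed = ws.propose("h2", "n^2 + n + 41 is prime for all n >= 0")
```

The point of the sketch is that failure is first-class state with a reason attached, rather than something lost when the context window rolls over.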

Connections to prior wiki

Same architectural pattern as the Skill curation cluster (today): persistent state, per-trajectory abstraction, distillation of useful artifacts.
Builds on Auto Research with Specialist Agents (2605.05724, also today): same closed empirical loop framing, applied to a different domain.
The FrontierMath Tier 4 number connects to ARA (05-01), which framed agent-native research as a workflow primitive.

Research angle

The "tracks failed hypotheses" claim is the most interesting. How does the system represent hypothesis-level state, and does that representation generalize to non-mathematical domains?
48% on FrontierMath Tier 4 is impressive, but the cost-per-solve number is missing. If the per-solve cost is on the order of GPU-hours, the result is a research demo, not a deployment story.
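The cost question can be framed as simple arithmetic. If compute is spent on every attempted problem but only a fraction are solved, the cost per solve is the per-problem cost divided by the solve rate. The sketch below uses the 48% figure from the note; the per-problem GPU-hour value is a placeholder assumption, not a reported number, and `gpu_hours_per_solve` is a hypothetical helper.

```python
def gpu_hours_per_solve(gpu_hours_per_problem: float, solve_rate: float) -> float:
    """Average GPU-hours spent per solved problem.

    Every attempted problem consumes compute, but only a fraction
    `solve_rate` of attempts end in a solve, so the cost is amortized
    over the solved ones.
    """
    if not 0.0 < solve_rate <= 1.0:
        raise ValueError("solve_rate must be in (0, 1]")
    return gpu_hours_per_problem / solve_rate


SOLVE_RATE = 0.48  # FrontierMath Tier 4 result quoted in this note
ASSUMED_GPU_HOURS_PER_PROBLEM = 10.0  # placeholder assumption, not from the paper

hours = gpu_hours_per_solve(ASSUMED_GPU_HOURS_PER_PROBLEM, SOLVE_RATE)
print(f"{hours:.1f} GPU-hours per solve under the assumed per-problem cost")
```

At a solve rate near one half, the per-solve cost is roughly double the per-problem cost, so whatever figure eventually appears should be read with that factor in mind.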

Source

Paper: https://arxiv.org/abs/2605.06651
HuggingFace: https://huggingface.co/papers/2605.06651