FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Source: HuggingFace Daily Papers · arXiv 2605.14445
Raw: farmer file
Tier: 2 (synthetic training data, open-ended coding, long-horizon agents)

TL;DR

LLM coding progress has concentrated on well-defined tasks (feature implementation, bug fixes, competitive programming). Open-ended coding, where problems have no known optimal solution, remains weak because training data for it is scarce and expensive. FrontierSmith automates the conversion of closed-ended coding tasks (competitive programming seeds) into open-ended variants by mutating goals, restricting outputs, and generalizing inputs. A quantitative idea-divergence metric prunes the candidates down to those that elicit genuinely diverse approaches across different solvers. Agents then generate test cases and verifiers for the survivors. After training on the synthesized data, Qwen3.5-9B improves by +8.82 on FrontierCS and +306.36 Elo on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12 on the same two benchmarks. Synthesized problems also drive more agent turns and tokens, similar to human-curated problems.

Why it matters

This is the third paper in three days where the model constructs its own training substrate instead of consuming a fixed corpus. EvoEnv (2026-05-15) constructs verifiable environments with solve-verify asymmetry. EvolveMem (2026-05-15) self-evolves retrieval configuration from failure logs. FrontierSmith evolves training problems from a closed-ended seed corpus. All three share one diagnosis: the bottleneck is not the model but the substrate the model trains on, and substrates can be auto-generated if you have a quantitative discriminator (the asymmetry condition for EvoEnv, the divergence metric for FrontierSmith, failure logs for EvolveMem).

The idea-divergence metric is the operationally interesting part. A common failure of synthetic-data pipelines that mutate seeds is mode collapse: the mutated problems look superficially different but elicit the same solution strategy. FrontierSmith's divergence metric is a filter that keeps only the mutations that genuinely change the solution space. This has the same shape as the closed-loop self-evolution that worked for EvolveMem: the quantitative filter is doing more work than the generation step.

Connections to prior wiki state

EvoEnv (2026-05-15): solve-verify asymmetry as invariant for environment synthesis. FrontierSmith's idea-divergence-as-invariant is the problem-synthesis analogue. Both papers identify a structural quantity that lets you auto-generate at scale without quality collapse. The composition (FrontierSmith feeds EvoEnv with seed problems for environment construction) is the natural follow-up.
EvolveMem (2026-05-15): auto-discovers retrieval configurations via diagnosis on failure logs. FrontierSmith auto-discovers training problems via divergence on solver outputs. Both use AutoResearch-style closed-loop self-evolution.
Orchard 67.5% on SWE-bench Verified (2026-05-15): the agentic post-training infrastructure. FrontierSmith is one source of the data Orchard needs. Open-ended coding is where SWE-bench-style closed problems hit their ceiling.
FrontierCS / ALE-bench improvements: Elo gains in the +300 range on ALE-bench, alongside the FrontierCS score gains, are large enough that a follow-up on a frontier model (rather than Qwen3.5) is the obvious next experiment.
WildClawBench native runtime (2026-05-15): WildClawBench tasks average 8 minutes of wall-clock work. The fact that FrontierSmith's synthesized problems drive more agent turns and tokens, similar to human-curated problems, suggests the synthesis pipeline is producing problems with realistic length, not artificially shortened ones.

How it works

The pipeline has three stages.

Mutation. Starting from a competitive programming problem, FrontierSmith generates open-ended variants by (a) changing the goal (instead of "find the optimal X" ask "design a system that achieves good X under varying conditions"), (b) restricting outputs (limit what the solver can return), (c) generalizing inputs (extend to a broader input distribution).
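
As a rough illustration of the three mutation operators, the sketch below builds one rewrite prompt per operator for a single seed problem. The names `Mutation`, `INSTRUCTIONS`, `MutationRequest`, and `build_requests`, and the instruction wording itself, are hypothetical; the paper's actual prompts are not given in this note.

```python
from dataclasses import dataclass
from enum import Enum


class Mutation(Enum):
    CHANGE_GOAL = "change_goal"
    RESTRICT_OUTPUT = "restrict_output"
    GENERALIZE_INPUT = "generalize_input"


# Hypothetical instruction text for each operator (an assumption, not the paper's prompts).
INSTRUCTIONS = {
    Mutation.CHANGE_GOAL: "Replace the exact-optimum goal with a score to maximize under varying conditions.",
    Mutation.RESTRICT_OUTPUT: "Limit what the solver is allowed to return.",
    Mutation.GENERALIZE_INPUT: "Extend the task to a broader input distribution.",
}


@dataclass(frozen=True)
class MutationRequest:
    seed_statement: str
    mutation: Mutation
    prompt: str


def build_requests(seed_statement: str) -> list[MutationRequest]:
    """Build one mutation prompt per operator for a single seed problem."""
    requests = []
    for mutation in Mutation:
        prompt = (
            f"Seed problem:\n{seed_statement}\n\n"
            f"Rewrite it as an open-ended variant. {INSTRUCTIONS[mutation]}"
        )
        requests.append(MutationRequest(seed_statement, mutation, prompt))
    return requests
```

Each request would then be sent to a generator model; keeping the operator on the request makes it possible to track later which mutation type survives filtering.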

Idea divergence filter. Multiple solver agents tackle each candidate problem. An idea-divergence metric quantifies how different their approaches are. Low-divergence problems are pruned (they're essentially the same problem in disguise). High-divergence problems survive.
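
This note does not give the paper's formula for idea divergence, so the following is only a stand-in to make the filtering step concrete. It assumes each solver's approach has already been summarized as a set of technique tags, and it scores a problem by the mean pairwise Jaccard distance between those sets. The tag representation, the function names, and the threshold are all assumptions.

```python
from itertools import combinations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """One minus intersection size over union size; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def idea_divergence(approach_tags: list[set[str]]) -> float:
    """Mean pairwise Jaccard distance across the solvers' approach tags."""
    pairs = list(combinations(approach_tags, 2))
    if not pairs:
        return 0.0  # fewer than two solvers: no divergence can be measured
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def keep_divergent(candidates: dict[str, list[set[str]]], threshold: float) -> list[str]:
    """Return the ids of candidate problems whose divergence meets the threshold."""
    return [pid for pid, tags in candidates.items() if idea_divergence(tags) >= threshold]
```

With three solvers that all tag their approach as `{"dp"}`, the score is 0.0 and the candidate is pruned; with three solvers using disjoint tags, the score is 1.0 and the candidate survives.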

Test and verifier generation. Agents synthesize test cases and verifiers for the surviving problems. Without verifiers, open-ended problems are not RL-trainable. This stage gates whether a candidate can be used downstream.
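
One way to picture the gating role of this stage is a sanity check on each generated verifier before the problem is admitted to training. The sketch below assumes a verifier returns a numeric score for a feasible output and `None` for an invalid one, and it admits a problem only if the verifier scores a trivial baseline on every test and rejects an obviously invalid output. The interface, the toy task, and the gate criteria are illustrative assumptions, not the paper's procedure.

```python
from typing import Callable, Optional

# Assumed interface: (test_input, solver_output) -> score, or None if the output is invalid.
Verifier = Callable[[str, str], Optional[float]]


def toy_verifier(test_input: str, output: str) -> Optional[float]:
    """Toy task: reorder the input integers; smaller adjacent gaps score higher."""
    try:
        given = [int(x) for x in test_input.split()]
        answer = [int(x) for x in output.split()]
    except ValueError:
        return None  # output is not a list of integers
    if sorted(given) != sorted(answer):
        return None  # output is not a permutation of the input
    return -float(sum(abs(a - b) for a, b in zip(answer, answer[1:])))


def passes_gate(
    verifier: Verifier,
    tests: list[str],
    baseline: Callable[[str], str],
    invalid_output: str = "",
) -> bool:
    """Admit a problem only if its verifier scores a baseline and rejects invalid output."""
    for test_input in tests:
        if verifier(test_input, baseline(test_input)) is None:
            return False  # verifier rejects a feasible baseline answer
        if verifier(test_input, invalid_output) is not None:
            return False  # verifier accepts an invalid answer
    return bool(tests)  # a problem with no tests cannot be admitted
```

Here `passes_gate(toy_verifier, ["3 1 2", "5 5 1"], lambda s: s)` admits the toy problem, while a verifier that returns a score for any output fails the gate.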

Open problems / Research angle

FrontierSmith + EvoEnv composition. FrontierSmith generates open-ended problems; EvoEnv requires verifiable environments. If FrontierSmith's test-and-verifier generation is what gives EvoEnv its solve-verify asymmetry, the two pipelines are partial-duals. The unified pipeline (problem synthesis with built-in solve-verify check) has not been written.
Idea-divergence beyond coding. The divergence-metric mechanism is domain-general: it relies only on a comparable solution space. Whether it transfers to math (where solution space is well-defined), agentic workflows (where it is harder to compare), or scientific discovery (where divergence is the goal) is open.
Mode collapse over many generations. If FrontierSmith is run iteratively (mutated problems become seeds for further mutation), does idea-divergence saturate? Falsifiable: a 30-day follow-up measuring divergence retention over five generations.
Frontier-model training. Qwen3.5-9B and Qwen3.5-27B both benefit. Whether a frontier-tier model (Opus, GPT-5.5) benefits at the same magnitude is the load-bearing scaling question, since synthetic-data techniques often saturate above a certain model size.
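
The multi-generation question above (does idea divergence saturate when mutated problems become seeds?) could be tracked with a simple retention measure. The sketch below assumes one mean divergence value is recorded per generation and reports each generation's ratio to generation 0. The function names and the 0.5 floor are hypothetical choices, not something proposed in the paper.

```python
from typing import Optional


def divergence_retention(per_generation: list[float]) -> list[float]:
    """Ratio of each generation's mean divergence to that of generation 0."""
    if not per_generation or per_generation[0] <= 0:
        raise ValueError("need a positive generation-0 divergence")
    base = per_generation[0]
    return [d / base for d in per_generation]


def first_collapsed_generation(
    per_generation: list[float], floor: float = 0.5
) -> Optional[int]:
    """Index of the first generation whose retention falls below `floor`, else None."""
    for generation, retention in enumerate(divergence_retention(per_generation)):
        if retention < floor:
            return generation
    return None
```

A five-generation run would pass six values (generation 0 plus five rounds of mutation); a `None` result means divergence never dropped below the chosen floor.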

Concept tags

synthetic-training-data · open-ended-problems · idea-divergence · closed-loop-synthesis · agentic-rl-data