GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
TL;DR: GameWorld is a benchmark of 34 browser games and 170 tasks for evaluating multimodal large language models (MLLMs) as generalist game agents. It offers two agent interfaces, computer-use (keyboard/mouse) and a semantic action space, and scores outcomes with state-verifiable metrics. Even the best agent is far from human-level on video games.
Key Findings
- Two agent interfaces tested: computer-use agents that emit keyboard and mouse controls directly, and generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing.
- 34 diverse browser games, 170 tasks, all with state-verifiable metrics for outcome-based evaluation.
- 18 model-interface pairs evaluated; the best performers remain far below human capabilities.
- Repeated full-benchmark reruns demonstrate the robustness of the benchmark.
- Key challenges exposed: real-time interaction, context-memory sensitivity, action validity.
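
To make the semantic-action interface and the state-verifiable metric concrete, here is a minimal Python sketch. It is not GameWorld's actual API: the action table, the `KeyPress` type, the function names, and the `score` field in the game state are all hypothetical, chosen only to illustrate the idea of a deterministic text-to-input mapping and an outcome check that reads game state rather than trusting the agent's own report.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical lookup table: semantic action name -> low-level key codes.
# A real benchmark would define one such table per game.
SEMANTIC_ACTIONS = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}


@dataclass(frozen=True)
class KeyPress:
    """A single low-level keyboard event (assumed representation)."""
    key: str


def parse_semantic_action(text: str) -> list[KeyPress]:
    """Deterministically map a model's text output to key presses.

    The same input always yields the same events. An unknown action
    yields an empty list, which a harness could count as invalid.
    """
    name = text.strip().lower()
    return [KeyPress(k) for k in SEMANTIC_ACTIONS.get(name, [])]


def task_succeeded(state: dict, target_score: int) -> bool:
    """State-verifiable outcome check (assumed 'score' field).

    Success is decided from the game's internal state, not from
    screenshots or from what the agent claims it achieved.
    """
    return state.get("score", 0) >= target_score
```

In this sketch an unrecognized action such as "fly" parses to no events at all, which is one simple way an action-validity failure could surface in an evaluation log.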
Related Pages
Raw source: ../../raw/huggingface/2026-04-16-gameworld-towards-standardized-and-verifiable-evaluation-of.md