vision-audio-video · 2026-04-16 · Tier 3

GameWorld: Standardized and Verifiable Evaluation of Multimodal Game Agents

TL;DR: GameWorld is a benchmark of 34 browser games and 170 tasks for evaluating multimodal large language models (MLLMs) as generalist game agents. It supports two interfaces, computer-use (keyboard/mouse) and a semantic action space, and scores agents with state-verifiable outcome metrics. Even the best agent remains far from human-level performance on video games.

Key Findings

  • Two agent interfaces tested: computer-use agents that issue direct keyboard/mouse input, and generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing.
  • 34 diverse browser games, 170 tasks, all with state-verifiable metrics for outcome-based evaluation.
  • 18 model-interface pairs evaluated; the best performers remain far below human capability.
  • Repeated full-benchmark reruns indicate that the benchmark's results are robust.
  • Key challenges exposed: real-time interaction, context-memory sensitivity, action validity.
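
To make the semantic-action interface and state-verifiable scoring concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not GameWorld's implementation: the action names, the key mapping, the GameState fields, and both function names are hypothetical. The parser is a fixed lookup, so the same model output always yields the same key events, and success is computed from game state rather than from anything the agent reports.

```python
from dataclasses import dataclass

# Hypothetical mapping from semantic action names to low-level key events.
# These names are illustrative and are not taken from GameWorld.
SEMANTIC_TO_KEYS = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}


def parse_semantic_action(text: str) -> list[str]:
    """Deterministically map a model's text output to key events.

    Unknown actions raise ValueError, so invalid outputs are surfaced
    instead of being silently ignored.
    """
    name = text.strip().lower()
    if name not in SEMANTIC_TO_KEYS:
        raise ValueError(f"invalid action: {text!r}")
    return list(SEMANTIC_TO_KEYS[name])


@dataclass
class GameState:
    """Toy game state (assumed fields; a real game exposes its own)."""
    score: int = 0
    level: int = 1


def task_succeeded(state: GameState, target_score: int) -> bool:
    """Outcome check computed from game state, not from model claims."""
    return state.score >= target_score
```

In this sketch, rejecting unknown actions explicitly is one way to measure the action-validity challenge noted above: each ValueError can be counted as an invalid action rather than dropped.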

Related Pages

Raw source: ../../raw/huggingface/2026-04-16-gameworld-towards-standardized-and-verifiable-evaluation-of.md