GameWorld: Standardized and Verifiable Evaluation of Multimodal Game Agents

TL;DR: GameWorld is a benchmark of 34 browser games and 170 tasks for evaluating multimodal large language models (MLLMs) as generalist game agents. It supports two agent interfaces, computer-use (keyboard/mouse) and a semantic action space, and scores every task with state-verifiable outcome metrics. Even the best agent remains far from human-level performance on video games.

Key Findings

Two agent interfaces tested: computer-use agents that issue keyboard/mouse controls directly vs. generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing.
34 diverse browser games, 170 tasks, all with state-verifiable metrics for outcome-based evaluation.
18 model-interface pairs evaluated; even the best performers remain far below human capabilities.
Repeated full-benchmark reruns demonstrate the robustness of the benchmark's results.
Key challenges exposed: real-time interaction, context-memory sensitivity, action validity.
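
To make the ideas of deterministic Semantic Action Parsing, action validity, and state-verifiable outcomes more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not GameWorld's actual implementation: the action vocabulary, the key names, the `GameState` fields, and all function names are hypothetical and do not come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical semantic-action vocabulary for a single game (assumption:
# the real benchmark defines its own per-game action spaces).
SEMANTIC_TO_KEYS: dict[str, list[str]] = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
    "noop": [],
}


def is_valid_action(model_output: str) -> bool:
    """Return True if the model's text names an action in the vocabulary."""
    return model_output.strip().lower() in SEMANTIC_TO_KEYS


def parse_semantic_action(model_output: str) -> list[str]:
    """Map a model's text output to key presses with a fixed lookup table.

    The mapping is deterministic: the same text always yields the same
    keys. Unknown actions fall back to a no-op (an assumption made here).
    """
    action = model_output.strip().lower()
    return SEMANTIC_TO_KEYS.get(action, SEMANTIC_TO_KEYS["noop"])


def valid_action_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that name a valid action."""
    if not outputs:
        return 0.0
    return sum(is_valid_action(o) for o in outputs) / len(outputs)


@dataclass
class GameState:
    """Hypothetical snapshot of the game's internal state."""
    score: int
    level: int
    game_over: bool


def task_succeeded(state: GameState, target_score: int) -> bool:
    """Outcome check computed from game state (hypothetical task rule)."""
    return state.score >= target_score and not state.game_over
```

For example, `parse_semantic_action(" Jump ")` yields `["Space"]`, while an out-of-vocabulary output such as `"fly"` yields an empty key list and counts against `valid_action_rate`. Checking success against the state snapshot means the same final state always produces the same verdict.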

Related Pages

Agent Evaluation & Benchmarks
MERRIN: Multimodal Evidence Retrieval

Raw source: ../../raw/huggingface/2026-04-16-gameworld-towards-standardized-and-verifiable-evaluation-of.md