GUI Agents

Agents that interact with graphical user interfaces — clicking, typing, navigating apps — using vision-language models to perceive the screen and take actions.

Current State (as of 2026-04-16)

GUI agents are one of the fastest-moving areas of applied agentic AI. Multimodal large language models (MLLMs) now power agents that can operate desktop and web UIs, but long-horizon tasks remain the core challenge: over extended sessions, memory degrades, arithmetic goes wrong, and progress tracking breaks down.

Key Papers

UI-Copilot (2026-04-16) — Splits the agent into an executor (policy) and a copilot (memory+compute). Introduces memory decoupling and TIPO (Tool-Integrated Policy Optimization). 17.1% improvement on AndroidWorld at 7B scale. → summary

UI-Zoomer (2026-04-16) — Uncertainty-driven zoom-in for GUI grounding. Triggers zoom only when the model is uncertain, using a two-axis confidence gate. Gains of up to +13.4% on ScreenSpot-Pro with no training. → see raw: ../../raw/huggingface/2026-04-16-ui-zoomer-uncertainty-driven-adaptive-zoom-in-for-gui-ground.md

GameWorld (2026-04-16) — Benchmark of 34 browser games and 170 tasks for MLLM game agents. Best agents still far below human performance. → summary
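
To make the UI-Zoomer idea of "zoom only when uncertain" concrete, here is a minimal sketch of a confidence gate. It is not the paper's implementation. The two signals used (spatial spread of sampled click predictions and their mean confidence), the `Prediction` type, the thresholds, and the function names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Prediction:
    """One sampled grounding prediction (hypothetical structure)."""
    x: float           # normalized click x in [0, 1]
    y: float           # normalized click y in [0, 1]
    confidence: float  # assumed per-sample confidence in [0, 1]


def should_zoom(preds, max_spread=0.05, min_confidence=0.6):
    """Return True when samples disagree spatially or are low-confidence.

    Both thresholds are illustrative defaults, not values from the paper.
    """
    if not preds:
        raise ValueError("need at least one prediction")
    xs = [p.x for p in preds]
    ys = [p.y for p in preds]
    spread = max(pstdev(xs), pstdev(ys))          # axis 1: spatial disagreement
    avg_conf = mean(p.confidence for p in preds)  # axis 2: model confidence
    return spread > max_spread or avg_conf < min_confidence


def zoom_box(preds, margin=0.1):
    """Return a (left, top, right, bottom) crop covering all samples plus a margin."""
    xs = [p.x for p in preds]
    ys = [p.y for p in preds]
    return (
        max(0.0, min(xs) - margin),
        max(0.0, min(ys) - margin),
        min(1.0, max(xs) + margin),
        min(1.0, max(ys) + margin),
    )
```

In this sketch an agent would call `should_zoom` on a handful of sampled predictions and re-run grounding on the `zoom_box` crop only when the gate fires, so confident predictions skip the extra pass.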

Open Problems

Long-horizon memory management
Math and numerical reasoning during execution
Action validity and irreversibility in real environments
Evaluation standardization across different UI paradigms
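
One common way to approach the numerical-reasoning problem listed above is to have the agent hand arithmetic to a deterministic tool instead of computing it in-model, which is broadly in the spirit of pairing an executor with a compute helper. The sketch below is an assumption-laden illustration, not any paper's method: `calculate` is a hypothetical tool name, and it supports only basic arithmetic.

```python
import ast
import operator

# Operators the hypothetical calculator tool is allowed to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""

    def evaluate(node):
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric literals are allowed")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        # Names, calls, attribute access, etc. are rejected.
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval"))
```

An agent that reads three prices off the screen could then emit a tool call such as `calculate("3 * 19.99 + 4.5")` and type the returned value, rather than producing the total token by token.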
