agentic-systems · Tier 2

GUI Agents

Agents that interact with graphical user interfaces — clicking, typing, navigating apps — using vision-language models to perceive the screen and take actions.

Current State (as of 2026-04-16)

GUI agents are one of the fastest-moving areas of applied agentic AI. Multimodal large language models (MLLMs) now power agents that can operate desktop and web UIs, but long-horizon tasks remain the core challenge: over extended sessions, memory degrades, numerical reasoning becomes unreliable, and progress tracking breaks down.

Key Papers

UI-Copilot (2026-04-16) — Splits the agent into an executor (policy) and a copilot (memory+compute). Introduces memory decoupling and TIPO (Tool-Integrated Policy Optimization). 17.1% improvement on AndroidWorld at 7B scale. → summary
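
The executor/copilot split can be illustrated with a small sketch. This is not UI-Copilot's implementation and does not cover TIPO. It only shows the general idea of a policy that delegates memory and arithmetic to a separate helper. Every name here (`Copilot`, `Executor`, `remember`, `recall`, `calculate`, the observation keys) is hypothetical, and a hand-written rule stands in for the model call.

```python
import ast
import operator

# Arithmetic operators the hypothetical copilot is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a small arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


class Copilot:
    """Hypothetical helper that owns memory and computation."""

    def __init__(self):
        self.notes = {}

    def remember(self, key, value):
        self.notes[key] = value

    def recall(self, key):
        return self.notes.get(key)

    def calculate(self, expression):
        return _eval(ast.parse(expression, mode="eval").body)


class Executor:
    """Hypothetical policy stub: chooses UI actions, delegating memory and math."""

    def __init__(self, copilot):
        self.copilot = copilot

    def step(self, observation):
        # A real executor would call a model; a simple rule stands in for it here.
        if "price" in observation:
            self.copilot.remember("price", observation["price"])
            return {"action": "click", "target": "add_to_cart"}
        price = self.copilot.recall("price")
        total = self.copilot.calculate(f"{price} * {observation['quantity']}")
        return {"action": "type", "target": "total_field", "text": str(total)}
```

In this sketch the executor never stores the price itself or multiplies anything: both jobs go through the copilot, so a long session does not depend on the policy's own context for facts seen many steps earlier.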

UI-Zoomer (2026-04-16) — Uncertainty-driven zoom-in for GUI grounding. Triggers zoom only when the model is uncertain, using a two-axis confidence gate. Gains of up to +13.4% on ScreenSpot-Pro with no training. → see raw: ../../raw/huggingface/2026-04-16-ui-zoomer-uncertainty-driven-adaptive-zoom-in-for-gui-ground.md
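
A minimal sketch of an uncertainty-gated zoom, not UI-Zoomer's actual method. The entry above does not say what the two axes of the confidence gate are, so this sketch assumes they are (1) the spatial spread of several sampled click predictions and (2) their mean model confidence. All names and thresholds (`Prediction`, `should_zoom`, `zoom_box`, `max_spread`, `min_conf`, `size`) are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Prediction:
    x: float     # predicted click x, normalized to [0, 1]
    y: float     # predicted click y, normalized to [0, 1]
    conf: float  # model confidence in [0, 1] (assumed to be available)


def should_zoom(preds, max_spread=0.05, min_conf=0.8):
    """Hypothetical gate: zoom if samples disagree spatially or confidence is low."""
    spread = max(pstdev([p.x for p in preds]), pstdev([p.y for p in preds]))
    avg_conf = mean([p.conf for p in preds])
    return spread > max_spread or avg_conf < min_conf


def zoom_box(preds, size=0.25):
    """Return (left, top, right, bottom) of a crop centered on the mean prediction.

    The box is shifted as needed so it stays inside the unit square.
    """
    cx = mean([p.x for p in preds])
    cy = mean([p.y for p in preds])
    left = min(max(cx - size / 2, 0.0), 1.0 - size)
    top = min(max(cy - size / 2, 0.0), 1.0 - size)
    return (left, top, left + size, top + size)
```

When the gate stays closed, the agent would act on the first-pass prediction and pay no extra cost. Only uncertain cases would trigger a second grounding pass on the cropped region returned by `zoom_box`.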

GameWorld (2026-04-16) — Benchmark of 34 browser games and 170 tasks for MLLM game agents. Best agents still far below human performance. → summary

Open Problems

  • Long-horizon memory management
  • Math and numerical reasoning during execution
  • Action validity and irreversibility in real environments
  • Evaluation standardization across different UI paradigms

Related Pages