GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

Overview of the GUI-Rise agent framework. The framework splits each step into three subtasks: structured reasoning, action prediction, and history summarization. At each step, the agent performs structured reasoning (progress estimation and decision analysis), predicts the next GUI action, and updates a compact history summary that is carried into the next iteration.
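
The per-step loop described in this caption can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `StepOutput`, `run_episode`, and `toy_policy` are hypothetical, screenshots are stood in for by plain strings, and the toy policy replaces the actual MLLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepOutput:
    """Hypothetical container for the three subtask outputs of one step."""
    progress: str   # structured reasoning: progress estimation
    decision: str   # structured reasoning: decision analysis
    action: str     # predicted next GUI action
    summary: str    # updated compact history summary


# Assumed policy signature: (task, screenshot, previous summary) -> StepOutput
Policy = Callable[[str, str, str], StepOutput]


def run_episode(policy: Policy, task: str, screenshots: List[str],
                initial_summary: str = "") -> List[StepOutput]:
    """Run the policy over a sequence of observations, carrying the summary forward."""
    summary = initial_summary
    trace: List[StepOutput] = []
    for shot in screenshots:
        out = policy(task, shot, summary)
        trace.append(out)
        # Only the compact summary, not the full history, reaches the next step.
        summary = out.summary
    return trace


def toy_policy(task: str, screenshot: str, summary: str) -> StepOutput:
    """Stand-in for the model: builds deterministic strings for demonstration."""
    progress = f"seen so far: {summary or 'nothing'}"
    decision = f"act on {screenshot}"
    action = f"CLICK({screenshot})"
    new_summary = f"{summary}; {action}" if summary else action
    return StepOutput(progress, decision, action, new_summary)


trace = run_episode(toy_policy, "open settings", ["home", "menu"])
```

In this sketch the second step sees only the summary produced by the first, which mirrors the idea of replacing a raw action or screenshot history with a compact textual state.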

Abstract

Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, but current methods lack cross-domain generalization and effective history utilization. We propose GUI-Rise, a reasoning-enhanced framework integrating structured reasoning, action prediction, and history summarization. GUI-Rise is trained via supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO), using history-aware rewards to link summary quality with action performance. Evaluations show state-of-the-art results on standard benchmarks and strong out-of-domain generalization, indicating that its reasoning is robust across diverse GUI navigation tasks.
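
To make the reinforcement learning step more concrete, the sketch below shows the group-relative advantage normalization that GRPO is generally built on: several responses are sampled for the same input, and each reward is standardized against the group's mean and standard deviation. The function `history_aware_reward` is purely hypothetical; the abstract does not specify the reward formula, so the form used here (action correctness plus a weighted bonus when the summary supports a correct next action) and the `weight` parameter are assumptions made for illustration only.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def history_aware_reward(action_correct: bool, next_action_correct: bool,
                         weight: float = 0.5) -> float:
    """Hypothetical reward (assumption, not from the paper): credit the current
    action, plus a bonus if the generated summary led to a correct next action."""
    return float(action_correct) + weight * float(next_action_correct)


# Four sampled responses for the same step, scored with the hypothetical reward.
rewards = [
    history_aware_reward(True, True),    # 1.5
    history_aware_reward(True, False),   # 1.0
    history_aware_reward(False, True),   # 0.5
    history_aware_reward(False, False),  # 0.0
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, responses that score above the group average are reinforced and those below it are discouraged, without requiring a separately learned value function.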

Publication
In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2025