Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
Abstract
Autonomous end-to-end agents are increasingly required to operate in environments where actions are not issued directly as the environment's raw primitives but are instead selected from higher-level action spaces; these actions are then mapped to the corresponding low-level interactions with the environment through controllers. In existing research, the action space is typically predefined. In practice, however, the optimal action space is context-dependent and difficult to determine in advance. For example, in complex domains such as Minecraft, relying solely on low-level raw actions or high-level planning actions is insufficient to handle the wide range of open-ended tasks, which vary in complexity and time horizon; the effective granularity of control inevitably varies with the situation. To address this challenge, we propose CrossAgent, a novel adaptive action-space selection framework. CrossAgent is trained through two stages of reinforcement-learning fine-tuning: cold-start single-step reinforcement learning followed by multi-step reinforcement learning. Within Minecraft, we define three complementary action spaces (motion, grounding, and raw action), each with distinct advantages and limitations. Our framework enables the agent to dynamically switch among these spaces while balancing task rewards against reasoning costs. Experiments on over 30 diverse Minecraft tasks demonstrate that CrossAgent exhibits strong long-horizon planning, precise execution, generalization, and efficiency, significantly outperforming fixed-action-space baselines. These results highlight the critical role of dynamic action-space adaptation in developing generalist agents capable of tackling open-ended environments.
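To make the core idea concrete, the following is a minimal, hypothetical sketch of adaptive action-space selection: a policy picks one of the three action spaces per step, and a per-space controller maps the chosen high-level action to raw environment actions while charging a reasoning cost. All names here (the controllers, costs, and selection rule) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of cross-level action selection: a toy policy chooses
# an action space, and a per-space controller maps the chosen high-level
# action to raw environment actions. Not the paper's real system.
from dataclasses import dataclass

@dataclass
class Decision:
    space: str    # "motion", "grounding", or "raw"
    action: str   # high-level action chosen in that space
    cost: float   # reasoning cost charged for using this space

# Illustrative controllers: each maps a high-level action to raw actions.
CONTROLLERS = {
    "motion":    lambda a: [f"move:{a}"],              # coarse locomotion
    "grounding": lambda a: [f"look:{a}", f"use:{a}"],  # object interaction
    "raw":       lambda a: [a],                        # raw-action passthrough
}

# Toy per-space reasoning costs: higher-level spaces cost more to reason over.
COSTS = {"motion": 0.3, "grounding": 0.2, "raw": 0.1}

def select_space(task_horizon: int) -> str:
    """Toy selection rule: long-horizon tasks favor higher-level spaces,
    short-horizon tasks favor fine-grained raw control."""
    if task_horizon > 100:
        return "motion"
    if task_horizon > 10:
        return "grounding"
    return "raw"

def step(task_horizon: int, high_level_action: str) -> tuple[Decision, list[str]]:
    """One decision step: choose a space, then let its controller
    ground the high-level action into raw environment actions."""
    space = select_space(task_horizon)
    raw_actions = CONTROLLERS[space](high_level_action)
    return Decision(space, high_level_action, COSTS[space]), raw_actions

decision, raw = step(task_horizon=200, high_level_action="go_to_forest")
print(decision.space, raw)  # motion ['move:go_to_forest']
```

In CrossAgent the selection rule is learned with reinforcement learning rather than hand-coded as above, trading task reward against the reasoning cost of each space.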