RAAS: LLM Agentic System Architecture Search with GRPO
Abstract
Large Language Model (LLM) agentic systems solve complex tasks through coordinated workflows, but designing these systems remains labor-intensive. The \textbf{Agentic Supernet} paradigm automates this design by optimizing over a probabilistic architecture space, yet it suffers from critical evaluation instabilities: absolute performance scores entangle architectural merit with query difficulty, while single-execution protocols capture execution randomness rather than true capability. These instabilities produce unreliable search dynamics in which simple queries inflate the apparent merit of weak designs and challenging queries suppress strong ones. We introduce \textbf{RAAS} (Robust Architecture Adaptive Search), which establishes stable, fair evaluation through two synergistic mechanisms. \textbf{Contextual Architecture Orchestration (CAO)} disentangles architectural quality from task difficulty by evaluating cohorts of candidate architectures on identical queries and deriving context-aware merit signals through peer-group comparison. \textbf{Multi-Trial Assessment Synthesis (MTAS)} suppresses execution variance by aggregating performance across multiple independent trials, producing statistically robust capability estimates. Together, these mechanisms isolate genuine architectural superiority and guide reliable architecture discovery. Extensive experiments across six benchmarks show that RAAS significantly outperforms state-of-the-art methods, improving HumanEval pass@1 from 92.23\% to 96.31\% and MATH accuracy from 52.08\% to 60.87\% while maintaining practical efficiency, demonstrating the effectiveness of robust evaluation for agentic architecture search.
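As a minimal, illustrative sketch of how the two mechanisms might compose (assuming CAO follows the GRPO-style group-relative normalization suggested by the title; the symbols $K$, $T$, $r_{i,t}$, and $\epsilon$ are introduced here for illustration and are not stated in the abstract): suppose a cohort of $K$ candidate architectures $\{a_1,\dots,a_K\}$ is evaluated on the same query, with architecture $a_i$ executed for $T$ independent trials yielding scores $r_{i,1},\dots,r_{i,T}$. MTAS would aggregate the trials into a stable capability estimate, and CAO would then normalize that estimate against the peer group:
\begin{equation*}
\bar{r}_i = \frac{1}{T}\sum_{t=1}^{T} r_{i,t},
\qquad
A_i = \frac{\bar{r}_i - \frac{1}{K}\sum_{j=1}^{K}\bar{r}_j}{\operatorname{std}\!\big(\bar{r}_1,\dots,\bar{r}_K\big) + \epsilon},
\end{equation*}
where $\epsilon$ is a small constant for numerical stability. Because the merit signal $A_i$ depends only on an architecture's standing relative to its cohort on the identical query, the query's intrinsic difficulty cancels out, while trial averaging damps execution randomness.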