FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising
Abstract
Humans naturally spend more time deliberating before acting when handling complex tasks in the physical world. This paradigm has recently yielded remarkable gains for Large Language Models (LLMs) solving complex tasks in digital domains. However, the potential of test-time computing remains largely unexplored for robotic foundation models that interact with the physical world. In this work, we propose \textbf{\ours}: a test-time computing framework that augments flow-based Vision-Language-Action (VLA) generalist policies with value-guided sampling and cascaded action denoising, enabling higher control performance and real-time action rates for dexterous robot manipulation. \ours first incorporates a flow-based intermediate verifier that estimates state–action values for candidate actions. At test time, the policy iteratively samples multiple noisy action proposals and retains the one with the highest predicted value, yielding value-aligned, high-quality actions without retraining. To satisfy the stringent frequency demands of robot control, \ours further introduces cascaded action denoising, which decouples expensive value-guided sampling from fast action refinement. A lightweight flow denoiser asynchronously takes the selected high-value noisy action and rapidly denoises it to produce the final control signal, enabling fluid, high-rate execution. During deployment, the intermediate verifier operates at a low frequency to provide value-guided sampling, while the lite-flow denoiser continually processes the selected candidates to maintain real-time control. Extensive experiments demonstrate that \ours scales flow-based VLA models effectively at test time and achieves state-of-the-art performance across diverse simulation benchmarks and real-world dexterous robotic tasks.
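The two stages described above (value-guided selection among noisy proposals, then fast denoising of the selected proposal) can be illustrated with a toy sketch. This is not the implementation of \ours: the function names, the hand-written value function, and the closed-form velocity field are hypothetical stand-ins for the learned verifier and the learned lite-flow denoiser, and the two stages run sequentially here rather than asynchronously at different frequencies.

```python
import random


def sample_noisy_actions(num_candidates, dim, rng):
    """Draw candidate noisy action proposals from a standard Gaussian."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_candidates)]


def toy_value(state, action):
    """Hypothetical verifier: scores an action higher the closer it lies
    to a state-dependent target (here, simply the state vector itself)."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def select_best(state, candidates, value_fn):
    """Value-guided sampling: keep the proposal with the highest score."""
    return max(candidates, key=lambda action: value_fn(state, action))


def lite_denoise(state, action, num_steps=4):
    """Hypothetical lightweight denoiser: a few Euler steps along a toy
    velocity field v(x, t) = (target - x) / (1 - t) that pulls the noisy
    action toward the stand-in target."""
    x = list(action)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = [xi + dt * (si - xi) / (1.0 - t) for xi, si in zip(x, state)]
    return x


rng = random.Random(0)
state = [0.5, -0.2]                       # toy observation / target
candidates = sample_noisy_actions(8, len(state), rng)
best = select_best(state, candidates, toy_value)   # slow, low-rate stage
final_action = lite_denoise(state, best)           # fast, high-rate stage
```

In a real system the selection stage would be the expensive step run at low frequency, while the denoising stage would be called repeatedly on the most recently selected proposal to keep the control loop at the required rate.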