Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models
Abstract
Vision-Language-Action models (VLAs) achieve strong performance in sequential decision-making but remain fragile to subtle environment shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to spurious cues and replicate memorized actions. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates spurious correlations through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting high-confidence errors. Experiments on LIBERO (+7.4\% success rate) and Atari (+10.3 human-normalized score) demonstrate consistent gains in task success over strong VLA, test-time adaptation, and even fine-tuned baselines with minimal overhead, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents.
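The loop the abstract describes (augment the observation, vote over actions, then retrospectively adjust action logits from delayed feedback) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy linear policy, Gaussian observation noise, vote count, and sign-based bias update are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 4

def policy_logits(obs):
    """Toy stand-in for a frozen VLA action head (hypothetical):
    a fixed linear map from a 3-dim observation to 4 action logits."""
    W = np.array([[ 1.0, -0.5,  0.2, 0.0],
                  [ 0.3,  1.2, -0.4, 0.1],
                  [-0.2,  0.1,  0.9, 0.5]])
    return obs @ W

def vote_action(obs, perturb_bias, k=8, noise=0.05):
    """Augmentation + action voting: run the frozen policy on k noisy
    copies of the observation, add the learned perturbation bias to the
    logits, and return the majority-argmax action."""
    votes = np.zeros(NUM_ACTIONS)
    for _ in range(k):
        aug = obs + rng.normal(0.0, noise, size=obs.shape)
        logits = policy_logits(aug) + perturb_bias
        votes[int(np.argmax(logits))] += 1
    return int(np.argmax(votes))

def delayed_update(perturb_bias, taken_action, reward, lr=0.5):
    """Delayed-feedback correction (sketch): once the episode outcome
    arrives, raise the logit bias of rewarded actions and lower it for
    actions taken in failed episodes."""
    grad = np.zeros(NUM_ACTIONS)
    grad[taken_action] = 1.0 if reward > 0 else -1.0
    return perturb_bias + lr * grad

# One decision step followed by a delayed-feedback update.
bias = np.zeros(NUM_ACTIONS)
obs = np.array([0.2, -0.1, 0.4])
a = vote_action(obs, bias)
bias = delayed_update(bias, a, reward=0.0)  # failed episode: suppress action a
```

Because the perturbation bias acts only on the logits, the base model's weights are never touched, matching the abstract's "without fine-tuning" constraint.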