LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models
Abstract
Visual-Language-Action (VLA) models report impressive success rates exceeding 95\% on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. Current simulation-based robustness evaluations suffer from narrow perturbation coverage, manual design constraints, and coarse-grained analysis that fails to reveal when and how models fail. To address this gap, we propose LIBERO-Plus, a comprehensive, automatic, and fine-grained evaluation framework with controlled perturbations across seven dimensions: object layouts, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. Our systematic analysis of ten state-of-the-art models reveals consistent brittleness beneath apparent competence, with success rates dropping from 95\% to below 30\% under modest perturbations. These findings challenge the assumption that high benchmark scores equate to genuine competence and highlight the need for evaluation practices that assess reliability under realistic variation.