META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
Abstract
Long-video understanding remains challenging due to extreme temporal redundancy, sparse yet decisive events, and the instability of long-horizon reasoning in vision–language models (VLMs). Existing agent-based methods invoke external micro-tools but remain static: they repeatedly rebuild long chains of fine-grained operations for each task without acquiring reusable multi-step perceptual skills. We propose META, the first training-free agent capable of self-evolving its tool-augmented reasoning. META operates through dual Solving and Evolving loops: it analyzes its own tool trajectories, abstracts recurring multi-step patterns into reusable macro-tools, and distills failed executions into structured failure priors that refine subsequent tool usage. Through symbolic consolidation and pruning, META progressively shortens reasoning paths and acquires more general perceptual and temporal abilities, all without any parameter updates. META achieves state-of-the-art performance on long-video benchmarks, demonstrating a scalable, model-agnostic paradigm for long-video understanding that can continually evolve without additional training.
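The dual Solving/Evolving loops can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: the `Agent` class, its `solve`/`evolve` methods, and the tool names (`sample_frames`, `caption`, `localize`, `ocr`) are all invented for exposition, and "consolidation" is reduced here to frequency counting over successful trajectories.

```python
from collections import Counter

class Agent:
    """Toy sketch of META's dual loops (names are illustrative only)."""

    def __init__(self):
        self.macro_tools = {}      # name -> tuple of micro-tool steps
        self.failure_priors = []   # structured records distilled from failures
        self.trajectories = []     # (steps, success) pairs from the Solving loop

    def solve(self, steps, success):
        """Solving loop: execute a task and record its tool trajectory."""
        self.trajectories.append((tuple(steps), success))

    def evolve(self, min_count=2):
        """Evolving loop: abstract recurring multi-step patterns into
        macro-tools, distill failures into priors, and prune the log."""
        patterns = Counter(s for s, ok in self.trajectories if ok)
        for steps, n in patterns.items():
            if n >= min_count and len(steps) > 1:
                self.macro_tools["macro_" + "_".join(steps)] = steps
        self.failure_priors = [s for s, ok in self.trajectories if not ok]
        # Pruning: consolidated trajectories are dropped, so future
        # reasoning paths can invoke one macro-tool instead of a chain.
        self.trajectories = []


agent = Agent()
agent.solve(["sample_frames", "caption", "localize"], success=True)
agent.solve(["sample_frames", "caption", "localize"], success=True)
agent.solve(["sample_frames", "ocr"], success=False)
agent.evolve()
print(sorted(agent.macro_tools))  # → ['macro_sample_frames_caption_localize']
print(len(agent.failure_priors))  # → 1
```

The key point the sketch captures is that evolution is purely symbolic (dictionary updates over trajectory records), consistent with the abstract's claim that no parameter updates are involved.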