Beyond Sequential Tools: A Unified VLM-Guided One-Shot System for Photographic Post-Processing via Dynamic Multi-Expert Fusion
Abstract
Real-world photographic post-processing is a formidable challenge due to the frequent co-occurrence of multiple, coupled image degradations. Current paradigms face complementary limitations: monolithic "all-in-one" models hit generalization bottlenecks, while recent agent-based systems suffer from slow, sequential tool invocation and suboptimal coordination of isolated, single-task tools. To overcome these limitations, we propose an efficient vision-language agent system for universal photographic post-processing. Our system employs a powerful Vision-Language Model (VLM) as an orchestrator agent to perform nuanced user-intent understanding and in-depth degradation analysis. Based on its assessment, the VLM generates a structured plan that dynamically allocates weights to a suite of specialized expert LoRA modules. These experts, which adapt only the Key (K) and Value (V) projection matrices to enhance composability, are merged simultaneously into a pretrained diffusion backbone to execute a tailored restoration. To ensure perceptually optimal weights, we introduce a lightweight allocation branch, trained on the VLM's features with Direct Preference Optimization (DPO) from human feedback. This dynamic fusion paradigm enables synergistic, context-aware restoration in a single, efficient forward pass. Our method achieves state-of-the-art performance across a wide range of synthetic and real-world datasets with diverse degradations and, crucially, exhibits strong zero-shot generalization to real-world data unseen during training. Our code and weights will be made publicly available.
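The abstract does not specify the fusion mechanics; a minimal sketch of the core idea, weighted merging of K/V-only LoRA experts into frozen attention projections, might look as follows (all names and shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def fuse_kv_loras(W_K, W_V, experts, weights):
    """Merge VLM-allocated, weighted LoRA deltas into the frozen K and V
    projection matrices of an attention layer. Because each expert adapts
    only K and V (Q and output projections are untouched), the low-rank
    deltas from different experts compose additively in one merge step."""
    W_K_merged, W_V_merged = W_K.copy(), W_V.copy()
    for (A_K, B_K, A_V, B_V), w in zip(experts, weights):
        W_K_merged += w * (B_K @ A_K)  # rank-r update, scaled by expert weight
        W_V_merged += w * (B_V @ A_V)
    return W_K_merged, W_V_merged

# Toy example: model dim d=8, LoRA rank r=2, two hypothetical experts
# (e.g. a "denoise" and a "deblur" expert) with weights from the allocator.
rng = np.random.default_rng(0)
d, r = 8, 2
W_K = rng.standard_normal((d, d))
W_V = rng.standard_normal((d, d))
experts = [
    tuple(rng.standard_normal(s) for s in [(r, d), (d, r), (r, d), (d, r)])
    for _ in range(2)
]
W_K_new, W_V_new = fuse_kv_loras(W_K, W_V, experts, weights=[0.7, 0.3])
```

Since the fused weights are produced once per image, the diffusion backbone then runs a single forward pass with the merged projections, rather than invoking each restoration tool sequentially.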