FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
Abstract
Human recognition requires integrating multiple biometric traits, such as face, gait, and body shape, through specialized models to achieve robustness in unconstrained scenarios. However, existing score-fusion strategies typically adopt a static design, combining all models for every test sample regardless of sample quality. This not only incurs unnecessary computation but can also degrade performance by incorporating noisy or unreliable modalities. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address score misalignment and embedding heterogeneity across models, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms state-of-the-art (SoTA) methods, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. The proposed framework is scalable and adaptable to a wide range of multi-modal, multi-model tasks, such as vision-language retrieval, suggesting its applicability to broader scenarios. The code and model will be publicly released upon publication.