NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction
Abstract
Scene Graph Generation (SGG) aims to structurally represent visual scenes by detecting objects and their pairwise relationships. Despite significant progress, current models, owing to their purely neural, pipeline-based nature, struggle with ambiguous visual context and fail to capture implicit relations that require logical inference. This limitation underscores the need to advance beyond identifying \emph{what} relations exist to explaining \emph{why} they exist and how they can be compositionally reasoned about through logical rule chaining. To address these challenges, we introduce NeuroRule, the first \textbf{Neurally-Guided Rule Induction Network}, which integrates Mask2Former's pixel-precise visual understanding with a differentiable rule induction engine. Our method learns compositional logical rules automatically and directly from visual data while providing transparent explanations for its relational predictions. NeuroRule introduces three key innovations: (1) a neural-symbolic bridge that maps visual features to probabilistic symbolic representations; (2) a differentiable rule-learning mechanism that automatically discovers interpretable first-order logic rules without manual engineering; and (3) a compositional chain-rule system that enables complex inference while propagating confidence scores through an end-to-end trainable pipeline. Extensive experiments on benchmark datasets, including Visual Genome (VG), Panoptic Scene Graph (PSG), and OpenPSG, demonstrate that NeuroRule achieves state-of-the-art performance. Our method significantly improves few-shot relation extraction while maintaining full interpretability through its rule-based explanations. To ensure reproducibility, we will release the code after publication.
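To make the abstract's core mechanism concrete, the following is a minimal, hypothetical sketch of how a differentiable rule with a learnable confidence weight might score a chained relation. The predicates, probabilities, rule structure, and the use of a product t-norm for soft conjunction are illustrative assumptions, not the paper's actual implementation.

```python
import math

def sigmoid(x: float) -> float:
    """Squash a real-valued rule weight into a (0, 1) confidence."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_rule(body_probs: list[float], rule_weight: float) -> float:
    """Score the head of a rule such as above(X,Z) <- on(X,Y) AND on(Y,Z).

    Soft conjunction of the body atoms uses a product t-norm (an
    assumption for this sketch); the learnable rule_weight gates how
    much this rule contributes, keeping the whole score differentiable.
    """
    conjunction = 1.0
    for p in body_probs:   # soft AND over the rule body
        conjunction *= p
    return sigmoid(rule_weight) * conjunction

# Hypothetical predicate probabilities produced by a neural-symbolic
# bridge: P(on(cup, table)) = 0.9, P(on(table, floor)) = 0.8.
score = apply_rule([0.9, 0.8], rule_weight=2.0)
```

Because every operation is smooth, gradients can flow from a relation-prediction loss back into both the rule weight and the upstream predicate probabilities, which is the essence of end-to-end differentiable rule induction.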