Can We Build Scene Graphs Rather Than Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Abstract
Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting bounding boxes and subject–predicate–object triples. Yet most pipelines treat SGG as one-shot, deterministic classification rather than a genuinely progressive, generative task. We propose \textbf{FlowSG}, which recasts SGG as continuous-time transport on a hybrid discrete–continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph's continuous visual features into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors over categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on Visual Genome (VG) and Panoptic Scene Graph (PSG) benchmarks under closed- and open-vocabulary protocols show consistent gains in predicate recall (R@K), mean recall (mR@K), and graph-level metrics, validating the mixed discrete–continuous generative formulation over one-shot classification baselines, e.g., an average improvement of about 3 points over the state-of-the-art USG-Par.
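For concreteness, the following is a minimal sketch of the kind of hybrid objective the abstract describes, assuming a standard linear-path conditional flow-matching loss for box geometry and a cross-entropy-style discrete-flow loss for tokens; the notation is ours and not necessarily the paper's: $\mathcal{I}$ denotes the conditioning image, $b_1$ the ground-truth boxes, $b_0$ Gaussian noise, $z_t$ the noised discrete token state, $z_1$ the ground-truth tokens, and $\lambda$ a weighting hyperparameter.
\[
\mathcal{L}_{\mathrm{geo}}
= \mathbb{E}_{t \sim \mathcal{U}[0,1],\, b_0 \sim \mathcal{N}(0,I),\, b_1}
\Big\| v_\theta\big(b_t, t \mid \mathcal{I}\big) - (b_1 - b_0) \Big\|_2^2,
\qquad
b_t = (1-t)\, b_0 + t\, b_1,
\]
\[
\mathcal{L}_{\mathrm{tok}}
= \mathbb{E}_{t,\, z_t}\big[ -\log p_\theta(z_1 \mid z_t, t, \mathcal{I}) \big],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{geo}} + \lambda\, \mathcal{L}_{\mathrm{tok}}.
\]
Under these assumptions, few-step inference would integrate $v_\theta$ with a handful of Euler steps on the box coordinates while iteratively resampling tokens from the predicted posteriors $p_\theta$.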