Can We Build Scene Graphs Rather Than Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Abstract
Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting bounding boxes and subject–predicate–object triples. Yet most pipelines treat SGG as one-shot, deterministic classification rather than a genuinely progressive, generative task. We propose \textbf{FlowSG}, which recasts SGG as continuous-time transport on a hybrid discrete–continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph's continuous visual features into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors over categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on Visual Genome (VG) and Panoptic Scene Graph (PSG) benchmarks under closed- and open-vocabulary protocols show consistent gains in predicate recall (R@K), mean recall (mR@K), and graph-level metrics, validating the mixed discrete–continuous generative formulation over one-shot classification baselines, e.g., an average improvement of about 3 points over the state-of-the-art USG-Par.
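For concreteness, the following is a minimal sketch of the kind of hybrid objective the abstract describes, assuming a standard linear-path conditional flow-matching loss for box geometry and a cross-entropy-style discrete-flow loss for tokens; the notation is ours and not necessarily the paper's: $\mathcal{I}$ denotes the conditioning image, $b_1$ the ground-truth boxes, $b_0$ Gaussian noise, $z_t$ the noised discrete token state, $z_1$ the ground-truth tokens, and $\lambda$ a weighting hyperparameter.
\[
\mathcal{L}_{\mathrm{geo}}
= \mathbb{E}_{t \sim \mathcal{U}[0,1],\, b_0 \sim \mathcal{N}(0,I),\, b_1}
\Big\| v_\theta\big(b_t, t \mid \mathcal{I}\big) - (b_1 - b_0) \Big\|_2^2,
\qquad
b_t = (1-t)\, b_0 + t\, b_1,
\]
\[
\mathcal{L}_{\mathrm{tok}}
= \mathbb{E}_{t,\, z_t}\big[ -\log p_\theta(z_1 \mid z_t, t, \mathcal{I}) \big],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{geo}} + \lambda\, \mathcal{L}_{\mathrm{tok}}.
\]
Under these assumptions, few-step inference would integrate $v_\theta$ with a handful of Euler steps on the box coordinates while iteratively resampling tokens from the predicted posteriors $p_\theta$.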