Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
Abstract
Recent unified models have made remarkable strides in generating high-quality images, yet they consistently fail on reasoning-intensive tasks, e.g., solving mazes or assembling tangrams. Intriguingly, we find that vision-language models (VLMs) and large language models (LLMs) can accurately solve these tasks but cannot render the corresponding images. This reveals that the core bottleneck is not reasoning capacity, but the absence of a structured interface that translates high-level reasoning into precise visual output. To bridge this gap, we propose code-structured visual hints: SVG/HTML overlays that explicitly encode reasoning steps directly on the image plane. Accordingly, we develop an automatic data construction pipeline that generates high-quality code-structured hints for existing datasets, and we train a unified model, Hint2Gen, built on FLUX.1 Kontext, that conditions its generation on such hints. Furthermore, to comprehensively evaluate the effectiveness of our approach, we introduce Reason2Gen, a benchmark comprising 4,000 samples spanning 20 categories across 7 core dimensions, including path connectivity and spatial assembly, among others. Extensive experiments demonstrate that simply providing such hints as extra inputs to existing models, without any retraining, already boosts their performance. Moreover, our model significantly outperforms all leading open-source and closed-source methods on reasoning-aware generation and editing across all dimensions.
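To make the notion of a code-structured hint concrete, the sketch below shows one plausible form such a hint could take: a maze-solution path serialized as an SVG polyline overlay that a generator can condition on. The helper name, grid-to-pixel mapping, and styling are illustrative assumptions, not the paper's actual pipeline.

```python
def path_to_svg_hint(points, cell=40, stroke="red"):
    """Hypothetical helper: encode a solution path, given as (col, row)
    grid cells, as an SVG polyline overlay on the image plane."""
    # Map each grid cell to the pixel coordinates of its center.
    pts = " ".join(
        f"{c * cell + cell // 2},{r * cell + cell // 2}" for c, r in points
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        f'<polyline points="{pts}" fill="none" '
        f'stroke="{stroke}" stroke-width="4"/>'
        "</svg>"
    )

# Example: a 5-step path through a maze with 40x40-pixel cells.
hint = path_to_svg_hint([(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)])
```

Because the hint is plain markup rather than pixels, a VLM/LLM solver can emit it directly, and it can be rasterized and composited onto the input image as an explicit, spatially grounded reasoning trace.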