Poster

HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation

Hermann Kumbong · Xian Liu · Tsung-Yi Lin · Ming-Yu Liu · Xihui Liu · Ziwei Liu · Daniel Y Fu · Christopher Re · David W. Romero


Abstract: Visual AutoRegressive modeling (VAR) shows promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths that scale superlinearly in image resolution; and requires retraining to change the resolution sampling schedule.

We introduce Hierarchical Masked AutoRegressive modeling (HMAR), a new image generation algorithm that alleviates these issues by combining next-scale prediction with masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, in which the prediction of each resolution scale is conditioned only on the tokens in its immediate predecessor rather than on the tokens in all preceding scales. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure that generates a subset of the tokens in each step.

On the ImageNet 256×256 and 512×512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve over 2.5× faster training and over 1.75× faster inference than VAR, as well as an over 3× lower inference memory footprint. Finally, the Markovian formulation of HMAR yields additional flexibility over VAR: its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
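The coarse-to-fine sampling loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_logits` is a hypothetical stand-in for the transformer, the codebook size and schedule are made up, and the reveal order is random. The point is the structure — each scale conditions only on its immediate predecessor (the Markov property), and within a scale, tokens are filled in over several masked-prediction steps.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # hypothetical codebook size for this sketch

def predict_logits(prev_scale, current, positions):
    # Stand-in for the transformer. In HMAR the model conditions only on
    # the immediately preceding scale (`prev_scale`) and the tokens of the
    # current scale revealed so far (`current`), not on all earlier scales.
    return rng.random((len(positions), VOCAB))

def generate_scale(prev_scale, size, num_steps):
    """Generate one size x size scale in `num_steps` masked steps."""
    current = np.full(size * size, -1, dtype=int)  # -1 marks masked tokens
    order = rng.permutation(size * size)           # reveal order (random here)
    for positions in np.array_split(order, num_steps):
        # One forward pass per step; only the chosen subset is revealed.
        logits = predict_logits(prev_scale, current, positions)
        current[positions] = logits.argmax(axis=-1)
    return current.reshape(size, size)

# Coarse-to-fine sampling: each scale depends only on its predecessor,
# so the schedule (scales, steps per scale) can be changed at inference
# time without retraining.
tokens = None
for size, steps in [(1, 1), (2, 1), (4, 2), (8, 4)]:
    tokens = generate_scale(tokens, size, steps)

print(tokens.shape)  # final scale of token indices: (8, 8)
```

Because the per-scale step count is just a loop bound here, trading sampling speed for quality (more or fewer masked steps per scale) requires no change to the model itself, which mirrors the schedule flexibility claimed in the abstract.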
