Poster

HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation

Hermann Kumbong · Xian Liu · Tsung-Yi Lin · Ming-Yu Liu · Xihui Liu · Ziwei Liu · Daniel Y Fu · Christopher Re · David W. Romero


Abstract: Visual AutoRegressive modeling (VAR) shows promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths that scale superlinearly in image resolution; and requires retraining to change the resolution sampling schedule.

We introduce Hierarchical Masked AutoRegressive modeling (HMAR), a new image generation algorithm that alleviates these issues by combining next-scale prediction with masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, in which the prediction of each resolution scale is conditioned only on the tokens in its immediate predecessor rather than on the tokens in all preceding scales. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure that generates a subset of the tokens in each step.

On the ImageNet 256×256 and 512×512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve over 2.5× faster training and over 1.75× faster inference than VAR, as well as an over 3× lower inference memory footprint. Finally, the Markovian formulation of HMAR yields additional flexibility over VAR: its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
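The coarse-to-fine sampling loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_logits` is a hypothetical stand-in for the transformer, the codebook size and schedule are made up, and the reveal order is random. The point is the structure — each scale conditions only on its immediate predecessor (the Markov property), and within a scale, tokens are filled in over several masked-prediction steps.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # hypothetical codebook size for this sketch

def predict_logits(prev_scale, current, positions):
    # Stand-in for the transformer. In HMAR the model conditions only on
    # the immediately preceding scale (`prev_scale`) and the tokens of the
    # current scale revealed so far (`current`), not on all earlier scales.
    return rng.random((len(positions), VOCAB))

def generate_scale(prev_scale, size, num_steps):
    """Generate one size x size scale in `num_steps` masked steps."""
    current = np.full(size * size, -1, dtype=int)  # -1 marks masked tokens
    order = rng.permutation(size * size)           # reveal order (random here)
    for positions in np.array_split(order, num_steps):
        # One forward pass per step; only the chosen subset is revealed.
        logits = predict_logits(prev_scale, current, positions)
        current[positions] = logits.argmax(axis=-1)
    return current.reshape(size, size)

# Coarse-to-fine sampling: each scale depends only on its predecessor,
# so the schedule (scales, steps per scale) can be changed at inference
# time without retraining.
tokens = None
for size, steps in [(1, 1), (2, 1), (4, 2), (8, 4)]:
    tokens = generate_scale(tokens, size, steps)

print(tokens.shape)  # final scale of token indices: (8, 8)
```

Because the per-scale step count is just a loop bound here, trading sampling speed for quality (more or fewer masked steps per scale) requires no change to the model itself, which mirrors the schedule flexibility claimed in the abstract.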
