OmniGen2: Towards Instruction-Aligned Multimodal Generation
Abstract
Multimodal generative models can process instructions across modalities and achieve strong performance on a wide range of image generation tasks. However, their robustness in complex real-world scenarios remains limited because their instruction alignment does not generalize well. We introduce \textbf{OmniGen2}, a unified multimodal generator designed to follow complex, fine-grained instructions. Our core contribution is a two-stage design that first builds a strong, world-knowledge-grounded foundation model and then aligns it through a progressive, multi-task instruction tuning strategy. The foundation model features a streamlined architecture with decoupled decoding for versatile multimodal generation and a novel positional encoding scheme that improves learning efficiency. We ground the model in real-world knowledge using large-scale data construction pipelines. Building on this foundation, we propose a progressive, reinforcement-based alignment process that carefully schedules training tasks and reward signals to foster cross-task knowledge transfer, significantly improving the model's instruction-following capabilities. Our models achieve competitive performance on standard benchmarks and on our dedicated in-context generation benchmark, \textbf{OmniContext}. We will release our models, code, benchmark, and training datasets to catalyze future research on more capable, instruction-aligned generative models.