Towards High-resolution and Disentangled Reference-based Sketch Colorization
Abstract
Sketch colorization models have been widely studied to automate and assist the creation of animation frames and digital illustrations. However, existing methods still fall short of industry-standard requirements for high-resolution synthesis and precise control over details. To improve both synthesis quality and controllability, we propose a reference-based sketch colorization method built on the powerful SDXL backbone, using sketches as spatial guidance and RGB images as color references. A split cross-attention mechanism, coupled with spatial masks, colorizes the foreground and background regions separately to avoid spatial entanglement. A tagger network trained on a large anime-style image dataset extracts attribute-level information from reference images and is integrated into the pipeline to provide precise control signals for synthesis. However, the increased resolution and larger number of attention layers in the SDXL backbone, together with the precise reference information from the tagger network, cause severe entanglement during colorization. We therefore combine a foreground encoder and a background encoder to achieve disentanglement and improve synthesis quality. Furthermore, we collect a high-quality, annotated, and paired sketch colorization dataset for fine-tuning. The proposed method is the first to achieve high-resolution, high-quality sketch colorization with precise control, and it clearly outperforms existing methods in quantitative and qualitative evaluations, as well as in user studies of both quality and controllability. An ablation study reveals the influence of each component. Code and dataset will be made publicly available upon paper acceptance.