AceTone: Bridging Words and Colors for Conditional Image Grading
Tianren Ma ⋅ Mingxiang Liao ⋅ Xijin Zhang ⋅ Qixiang Ye
Abstract
Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, and thus struggle to generalize across creative intents or to align with human aesthetic preferences. In this study, we propose **AceTone**, the first approach to support multimodal-conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, in which a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer that compresses a $3\times32^3$ LUT vector into 64 discrete tokens with $\Delta \text{E}<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetic preferences. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to **50%** over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetics-aligned color grading. The models and datasets will be made publicly available.
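The grading target here is a standard 3D-LUT: a $32^3$ lattice that maps each input RGB color to an output color, giving the $3\times32^3$ values the tokenizer compresses. As context for readers unfamiliar with the format, the following minimal NumPy sketch (our illustration, not the authors' code; all names are hypothetical) shows how such a LUT recolors an image via trilinear interpolation, the conventional way 3D-LUTs are applied.

```python
import numpy as np

def apply_3d_lut(image, lut):
    """Apply a 3D LUT of shape (N, N, N, 3) to an RGB image of shape
    (H, W, 3), both with float values in [0, 1], via trilinear interpolation."""
    n = lut.shape[0]
    # Map pixel values onto continuous lattice coordinates in [0, n-1].
    coords = image * (n - 1)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = coords - lo  # fractional offset inside each lattice cell

    out = np.zeros_like(image)
    # Accumulate weighted contributions from the 8 corners of the cell.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                r = np.where(dr, hi[..., 0], lo[..., 0])
                g = np.where(dg, hi[..., 1], lo[..., 1])
                b = np.where(db, hi[..., 2], lo[..., 2])
                wr = np.where(dr, frac[..., 0], 1 - frac[..., 0])
                wg = np.where(dg, frac[..., 1], 1 - frac[..., 1])
                wb = np.where(db, frac[..., 2], 1 - frac[..., 2])
                out += (wr * wg * wb)[..., None] * lut[r, g, b]
    return out

# Sanity check: a 32^3 identity LUT leaves the image unchanged.
axis = np.linspace(0.0, 1.0, 32)
r, g, b = np.meshgrid(axis, axis, axis, indexing="ij")
identity_lut = np.stack([r, g, b], axis=-1)  # shape (32, 32, 32, 3)
```

Because the LUT is a fixed-size lattice independent of image resolution, generating it (rather than recoloring pixels directly) is what lets a single predicted transformation apply efficiently to images of any size.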