OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
Abstract
Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. However, most existing food datasets focus on Western cuisines and offer limited coverage of Chinese dishes, which restricts the accuracy of nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, limiting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food scenes, each with detailed nutritional annotations and multi-view images. In addition, to further strengthen models' nutritional prediction capability, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. We also propose an end-to-end framework that predicts nutritional information from a single RGB image. First, we predict a depth map from the RGB input and refine it with our Scale-Shift Residual Adapter (SSRA), which enforces global scale consistency while preserving local structural details. Second, the Frequency-Aligned Fusion Module (FAFM) hierarchically fuses RGB and adapted depth features, aligning the multi-modal representations in the frequency domain across layers. Third, the Mask-based Prediction Head (MPH) emphasizes key ingredient regions via dynamic channel selection, improving prediction accuracy. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, providing a practical solution for daily dietary assessment.
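
The three-stage pipeline can be pictured as follows. The PyTorch sketch below shows only the data flow; the module internals, the placeholder backbones (depth_net, rgb_enc, dep_enc), the channel widths, and the specific amplitude/phase recombination used for frequency alignment are illustrative assumptions, not the exact implementation described in this paper.

import torch
import torch.nn as nn


class SSRA(nn.Module):
    """Scale-Shift Residual Adapter (sketch): a learned global scale/shift
    aligns the predicted depth, while a light residual branch preserves
    local structural details."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))
        self.residual = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, depth):
        return self.scale * depth + self.shift + self.residual(depth)


class FAFM(nn.Module):
    """Frequency-Aligned Fusion Module (sketch): one simple reading of
    frequency-domain alignment -- recombine the RGB amplitude spectrum with
    the depth phase spectrum, then fuse with a 1x1 conv. The full model
    applies this hierarchically across encoder layers; one layer is shown."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        rgb_fft = torch.fft.rfft2(rgb_feat)
        dep_fft = torch.fft.rfft2(depth_feat)
        aligned = torch.fft.irfft2(
            torch.polar(rgb_fft.abs(), dep_fft.angle()),
            s=depth_feat.shape[-2:])
        return self.fuse(torch.cat([rgb_feat, aligned], dim=1))


class MPH(nn.Module):
    """Mask-based Prediction Head (sketch): a spatial mask emphasizes key
    ingredient regions, and a sigmoid gate performs dynamic channel
    selection before regressing nutrient values."""
    def __init__(self, channels, num_nutrients=5):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.regressor = nn.Linear(channels, num_nutrients)

    def forward(self, feat):
        masked = feat * self.mask(feat)      # highlight ingredient regions
        pooled = masked.mean(dim=(-2, -1))   # global average pooling
        return self.regressor(pooled * self.gate(pooled))


# Usage sketch with placeholder backbones and illustrative shapes.
depth_net = nn.Conv2d(3, 1, 3, padding=1)   # stands in for a depth network
rgb_enc = nn.Conv2d(3, 64, 3, padding=1)    # stands in for an RGB encoder
dep_enc = nn.Conv2d(1, 64, 3, padding=1)    # stands in for a depth encoder
ssra, fafm, mph = SSRA(), FAFM(64), MPH(64, num_nutrients=5)

rgb = torch.randn(2, 3, 224, 224)           # batch of single RGB images
depth = ssra(depth_net(rgb))                # predict, then refine depth
fused = fafm(rgb_enc(rgb), dep_enc(depth))  # frequency-aligned fusion
nutrients = mph(fused)                      # e.g. calories + macronutrients
print(nutrients.shape)                      # torch.Size([2, 5])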