Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
Abstract
Reconstructing and understanding 3D scenes from sparse views in a feed-forward manner remains challenging. While recent approaches use per-pixel 3D Gaussian Splatting for reconstruction and 2D-to-3D feature lifting for scene understanding, they produce an excessive number of redundant Gaussians, causing high memory overhead and suboptimal multi-view feature aggregation. We propose a feed-forward framework that estimates compact Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns to efficiently lift 2D features onto the resulting Gaussians. Extensive experiments on 3D open-vocabulary segmentation and view-invariant feature generation demonstrate the effectiveness of our approach. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction, achieving superior memory efficiency and feature fidelity compared to existing methods. Our code will be made publicly available.
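To make the token-based aggregation and attention-reuse idea concrete, the following is a minimal sketch, not the paper's actual architecture: it uses a simplified cross-attention from learnable tokens to multi-view features (standing in for the self-attention aggregation described above), and reuses the same attention weights to lift per-view 2D features onto the per-token Gaussians. All module names, shapes, and the 14-dimensional Gaussian parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenGaussianHead(nn.Module):
    """Illustrative sketch: learnable tokens attend over multi-view features,
    decode Gaussian parameters, and reuse the attention map for feature lifting."""
    def __init__(self, num_tokens=1024, dim=256, num_heads=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)  # learnable tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed parameterization: 3 (mean) + 3 (scale) + 4 (rotation) + 1 (opacity) + 3 (color) = 14
        self.to_gaussian = nn.Linear(dim, 14)

    def forward(self, view_feats, view_sem_feats):
        # view_feats:     (B, V*H*W, dim) fused multi-view visual features
        # view_sem_feats: (B, V*H*W, dim) 2D features (e.g. from a 2D foundation model) to lift into 3D
        B = view_feats.shape[0]
        q = self.tokens.unsqueeze(0).expand(B, -1, -1)           # (B, N, dim)
        out, attn = self.attn(q, view_feats, view_feats,
                              need_weights=True, average_attn_weights=True)
        gaussians = self.to_gaussian(out)                         # (B, N, 14) Gaussian parameters
        # Reuse the same attention pattern to aggregate 2D features onto each Gaussian.
        lifted = torch.bmm(attn, view_sem_feats)                  # (B, N, dim) lifted 3D features
        return gaussians, lifted

# Example usage with dummy inputs (4 views of 32x32 feature maps).
head = TokenGaussianHead()
feats = torch.randn(2, 4 * 32 * 32, 256)
sem = torch.randn(2, 4 * 32 * 32, 256)
gaussians, lifted = head(feats, sem)
print(gaussians.shape, lifted.shape)  # torch.Size([2, 1024, 14]) torch.Size([2, 1024, 256])
```

Because the Gaussians and the lifted features share one attention pattern, each Gaussian's semantics come from exactly the views and pixels that produced its geometry, which is the intuition behind the attention-reuse step above.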