Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
Abstract
Videos are increasingly used as inputs to machine learning models, where repeated decoding and processing for diverse downstream tasks dominate computational cost. Existing video processing pipelines, however, remain inefficient: traditional video codecs (H.264, H.265) are optimized for human visual quality and require full pixel decoding before each inference; Compressed Domain Inference (CDI) is tightly coupled to specific codec structures and offers limited task flexibility; and Video Coding for Machines (VCM) demands separate representations and task-specific encoders without supporting human visualization. We propose the Neural Video Pipeline (NVP), a framework that leverages Implicit Neural Representations (INRs) to extract task-specific features directly from intermediate layers, eliminating pixel-reconstruction overhead. NVP employs lightweight Micro Adapters to bridge INR features directly into the feature space of downstream models, bypassing both the decoding and early feature-extraction stages. In comprehensive benchmarks across four representative tasks (image classification, object detection, action recognition, and segmentation), NVP reduces latency by up to 89.5\% and inference FLOPs by up to 29.9\%, while supporting multiple tasks with a single unified representation.
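The adapter mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, dimensions, and the choice of a single ReLU-activated linear projection are all assumptions made for clarity; the key point is that one shared INR representation feeds several small per-task bridges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: width of an intermediate INR layer and the
# feature width expected by a downstream model (illustrative values).
INR_DIM, TASK_DIM = 256, 512

class MicroAdapter:
    """Lightweight linear bridge from INR features to a task feature space."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(0.0, in_dim ** -0.5, (in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, feats):
        # feats: (batch, in_dim) intermediate INR activations.
        # Project into the downstream feature space; no pixels are decoded.
        return np.maximum(feats @ self.W + self.b, 0.0)  # ReLU projection

# One small adapter per downstream task, all sharing one INR representation.
adapters = {task: MicroAdapter(INR_DIM, TASK_DIM)
            for task in ("classification", "detection")}

inr_feats = rng.normal(size=(4, INR_DIM))   # stand-in for INR layer output
task_feats = adapters["detection"](inr_feats)
print(task_feats.shape)  # (4, 512)
```

Because each adapter is only a small projection on top of features the INR already computes, the per-task cost is far lower than decoding pixels and re-running a full backbone for every task.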