SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
Haojuan Li ⋅ Ruohan Tang ⋅ Dongzhou Cheng ⋅ Zongpu Zhang ⋅ Jian Li ⋅ Jiaqi Wang
Abstract
Video Large Language Models (VideoLLMs) face a fundamental performance bottleneck: the token explosion intrinsic to video inputs. The resulting $O(N^2)$ prefill cost makes conventional Transformer inference prohibitively expensive at scale. Existing approaches face a hard accuracy–latency dilemma: naive sparsification risks losing essential temporal–spatial context, whereas naive parallelization introduces substantial communication and memory overhead. To overcome this impasse, we argue that algorithm–system co-design is not optional but necessary, jointly optimizing what to compute (sparsification) and how to compute it (parallelism). We introduce SegMo, a unified framework that instantiates this co-design principle and enables efficient, accurate VideoLLM inference at scale. SegMo is driven by the empirical insight that VideoLLM attention exhibits Local Cohesion. Our system implements this insight via two integrated components: (1) Content-Aware Sparsification (CAS), a lightweight hierarchical algorithm that first employs Query Relevance for scene-level assessment and then uses Temporal Redundancy to prune static content within scenes, producing a precise, non-uniform computational load that preserves accuracy; and (2) Locally-Cohesive Segment Parallelism (LSP), a novel paradigm that exploits attention locality to partition the video at scene boundaries, using a lightweight Global Context Injection mechanism in place of the heavy communication and memory overheads of global attention. We validate SegMo on LVBench, LongVideoBench, and Video-MME. The CAS module alone improves accuracy by up to 12.00%; the full system (CAS + LSP) achieves a peak prefill speedup of 3.55x while maintaining an accuracy gain of up to 8.31%.
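To make the hierarchical CAS pipeline concrete, the following is a minimal Python sketch of the two stages described above: scene-level selection by query relevance, then intra-scene pruning of temporally redundant frames. All names and parameters here (`cas_prune`, `frame_emb`, `keep_ratio`, `sim_thresh`) are illustrative assumptions, not the paper's actual implementation or scoring functions.

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cas_prune(frame_emb, query_emb, scene_bounds, keep_ratio=0.5, sim_thresh=0.95):
    """Illustrative two-stage content-aware sparsification.

    Stage 1 (Query Relevance): score each scene by the cosine similarity
    between its mean frame embedding and the query embedding; keep the
    top-scoring scenes.
    Stage 2 (Temporal Redundancy): within each kept scene, drop frames
    whose embedding is nearly identical to the last kept frame.
    Returns the indices of retained frames (a non-uniform load per scene).
    """
    # Stage 1: scene-level relevance assessment.
    scores = [_cos(frame_emb[s:e].mean(axis=0), query_emb) for s, e in scene_bounds]
    n_keep = max(1, int(len(scene_bounds) * keep_ratio))
    kept_scenes = sorted(np.argsort(scores)[-n_keep:])

    # Stage 2: prune static (redundant) frames inside kept scenes.
    kept_frames = []
    for i in kept_scenes:
        s, e = scene_bounds[i]
        last = None
        for t in range(s, e):
            if last is None or _cos(frame_emb[t], frame_emb[last]) < sim_thresh:
                kept_frames.append(t)
                last = t
    return kept_frames

# Toy usage: 64 random frame embeddings split into three scenes.
rng = np.random.default_rng(0)
emb, q = rng.normal(size=(64, 16)), rng.normal(size=16)
print(cas_prune(emb, q, scene_bounds=[(0, 16), (16, 40), (40, 64)]))
```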
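Similarly, the sketch below illustrates the shape of LSP: each scene segment attends only to its own tokens plus a small set of shared global-context tokens, rather than to the full sequence. The mean-pooled per-segment summaries standing in for Global Context Injection are an assumption for illustration; the paper's actual mechanism, and the cross-device communication it replaces, are not reproduced here.

```python
import numpy as np

def _softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lsp_attention(q, k, v, scene_bounds):
    """Illustrative locally-cohesive segment attention.

    Each segment computes attention over its local keys/values plus one
    pooled summary token per segment (a stand-in for Global Context
    Injection). In a real deployment each segment would run on its own
    device, so only the small summary tokens need to be exchanged.
    """
    d = q.shape[-1]
    # Global context tokens: one mean-pooled summary per segment.
    gk = np.stack([k[s:e].mean(axis=0) for s, e in scene_bounds])
    gv = np.stack([v[s:e].mean(axis=0) for s, e in scene_bounds])

    out = np.empty_like(v)
    for s, e in scene_bounds:  # one loop iteration == one parallel worker
        k_loc = np.concatenate([k[s:e], gk])
        v_loc = np.concatenate([v[s:e], gv])
        attn = _softmax(q[s:e] @ k_loc.T / np.sqrt(d))
        out[s:e] = attn @ v_loc
    return out

# Toy usage: 64 tokens in three scene segments.
rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(64, 16)) for _ in range(3))
print(lsp_attention(q, k, v, scene_bounds=[(0, 16), (16, 40), (40, 64)]).shape)
```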