μVLM: A Vision Language Model for μNPUs
Zijie Chen ⋅ Guiyun Fan ⋅ Zhaoxing Yang ⋅ Rong Ding ⋅ Haiming Jin
Abstract
The proliferation of low-power intelligent processors with integrated Neural Processing Units (NPUs), known as $\mu$NPUs, has created new opportunities for on-device generative AI, benefiting end devices such as smart wearables and small robots. However, deploying Vision-Language Models (VLMs) on $\mu$NPUs is severely hindered by stringent memory constraints and limited operator support. To bridge this critical gap, we propose $\mu$VLM, the first lightweight VLM architecture designed for $\mu$NPUs, comprising our proposed OverMod encoder and AttSSM decoder. OverMod is a lightweight dynamic convolutional network inspired by biomimetic vision that incorporates our novel Global Spatial Modulation mechanism to enable adaptive, high-fidelity feature extraction using only NPU-friendly operators. AttSSM leverages a highly efficient State Space Model (SSM) core, augmented with multi-scale feature fusion and a Global Context Dynamic Modulation mechanism, to perform robust sequence modeling. Furthermore, we introduce a coordinated full-parameter quantization strategy that preserves precision across the encoder-decoder boundary, alongside hand-optimized operators for unsupported modules such as SSMs. $\mu$VLM achieves a competitive CIDEr score of 117.8 on the COCO Karpathy test split and, for the first time, demonstrates the feasibility of millisecond-level VLM inference on a $\mu$NPU platform.