ROSE: Rotate Your Large Language Model to See
Abstract
Recent advances in multimodal large language models (MLLMs) have led to impressive progress in integrating visual and linguistic understanding. However, most existing MLLMs inject visual information into the input space of large language models (LLMs), which substantially increases context length and computational overhead while often disrupting pretrained linguistic priors by forcing the LLM to optimize on vision-dominant multimodal sequences. In this work, we propose a rotation-based vision injection paradigm that aligns visual information with the parameter space of LLMs. Visual semantics are encoded as rotation matrices and applied directly to the pretrained parameters. This parameter-space injection eliminates the need for long input sequences, thus avoiding the quadratic computational overhead inherent in input-space injection. Moreover, it preserves the linguistic competence of the LLM by maintaining the intrinsic geometric structure of the pretrained parameters. Building upon this paradigm, we develop ROSE, a 7B MLLM that achieves fine-grained vision–language alignment with remarkable computational efficiency. Extensive experiments across 12 multimodal benchmarks show that ROSE delivers superior or competitive performance compared with leading models. At comparable accuracy, ROSE reduces FLOPs by 80.7% and inference latency by 56.4% relative to Qwen2.5-VL-7B, demonstrating its effectiveness and scalability. All training code, model weights, and data will be publicly released.
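The abstract does not specify how the rotation matrices are parameterized or where they are applied, so the following is only a minimal, hypothetical sketch of vision-conditioned parameter-space injection. It assumes a learned linear head that maps a pooled visual feature to a skew-symmetric generator, whose matrix exponential is a rotation matrix applied to a pretrained weight; all names (`RotationInjection`) and dimensions are illustrative, not ROSE's actual design.

```python
import torch
import torch.nn as nn

class RotationInjection(nn.Module):
    """Hypothetical sketch of vision-conditioned parameter rotation.

    A linear head (an assumption; the paper's parameterization is not
    given in the abstract) predicts a skew-symmetric generator A from a
    pooled visual feature; torch.matrix_exp(A) is then a rotation matrix.
    """

    def __init__(self, vis_dim: int, weight_dim: int):
        super().__init__()
        self.weight_dim = weight_dim
        # Predicts the weight_dim x weight_dim generator entries.
        self.head = nn.Linear(vis_dim, weight_dim * weight_dim)

    def forward(self, weight: torch.Tensor, v_feat: torch.Tensor) -> torch.Tensor:
        a = self.head(v_feat).view(self.weight_dim, self.weight_dim)
        A = a - a.T                  # skew-symmetric: A^T = -A
        R = torch.matrix_exp(A)      # orthogonal with det = +1, i.e. a rotation
        # (R @ W)^T (R @ W) = W^T W, so norms and pairwise angles of W's
        # columns are unchanged: the pretrained geometry stays intact.
        return R @ weight

# Usage with stand-in tensors: rotate one small pretrained weight matrix
# using a pooled image embedding (dimensions are illustrative).
inj = RotationInjection(vis_dim=768, weight_dim=64)
W = torch.randn(64, 64)      # frozen pretrained weight (stand-in)
v = torch.randn(768)         # pooled visual feature (stand-in)
W_prime = inj(W, v)          # same shape as W; no visual tokens added
```

Under these assumptions, the sketch illustrates both claims in the abstract: because the rotation is orthogonal, the weight matrix's intrinsic geometric structure is preserved, and because the image conditions the parameters rather than the token sequence, the input length (and hence the quadratic attention cost) is unaffected.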