E$^2$-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
Senyao Li ⋅ Haozhao Wang ⋅ Zhaobai Jiang ⋅ Zhanbo Jin ⋅ Hao Fan ⋅ Ruixuan Li
Abstract
In edge–cloud environments, the efficiency of speculative decoding is heavily constrained by uplink transmission and cloud-side verification. In this work, we identify a phenomenon we term credit inertia, where the acceptance rates of adjacent token windows exhibit strong temporal consistency. Tokens following recently well-performing windows are likely to pass verification, whereas tokens following poorly performing windows are likely to fail. Motivated by this observation, we propose E$^2$-SCI, an elastic edge–cloud speculative decoding framework that dynamically adjusts draft token verification thresholds based on recent historical performance. This adaptive mechanism allows the system to be more permissive for windows with strong historical performance and stricter for windows with weak performance, effectively leveraging temporal consistency to reduce overall latency. We further introduce Progressive Lookahead Concurrency (PLC), which pipelines draft generation and verification asynchronously to hide latency. Experiments across multiple benchmarks show that E$^2$-SCI achieves over $9.4$ tokens/s on DeepSeek-R1-Distill-Qwen (1.5B/32B), delivering an 88.5\% speed improvement over the FSD baseline while maintaining accuracy. Notably, E$^2$-SCI integrates seamlessly with existing frameworks (e.g., EAGLE-3), demonstrating broad applicability and superior efficiency–quality trade-offs.
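The threshold adjustment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `CreditTracker`, the parameters `base_threshold`, `swing`, and `history`, and the linear mapping from recent acceptance rate to threshold are all assumptions made for illustration. The abstract states only that the threshold becomes more permissive after well-performing windows and stricter after poorly performing ones.

```python
from collections import deque


class CreditTracker:
    """Hypothetical tracker mapping recent acceptance rates to a threshold."""

    def __init__(self, base_threshold=0.5, swing=0.3, history=4):
        self.base_threshold = base_threshold  # threshold used at neutral credit
        self.swing = swing                    # maximum deviation from the base
        self.rates = deque(maxlen=history)    # acceptance rates of recent windows

    def record(self, accepted, drafted):
        """Store the acceptance rate of one verified draft window."""
        if drafted <= 0:
            raise ValueError("drafted must be positive")
        self.rates.append(accepted / drafted)

    def credit(self):
        """Mean recent acceptance rate; 0.5 (neutral) when no history exists."""
        if not self.rates:
            return 0.5
        return sum(self.rates) / len(self.rates)

    def threshold(self):
        """High credit lowers the threshold (permissive); low credit raises it."""
        # Assumed linear rule: credit 1.0 -> base - swing, credit 0.0 -> base + swing.
        t = self.base_threshold + self.swing * (1.0 - 2.0 * self.credit())
        return min(1.0, max(0.0, t))
```

Under these assumed defaults, a tracker whose last window was fully accepted yields a threshold near 0.2, while one whose last window was fully rejected yields a threshold near 0.8. The sliding `history` window is one simple way to encode the temporal consistency between adjacent windows; other smoothing schemes would fit the same description.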