In order to tackle video semantic segmentation task at a lower cost, e.g., only one frame annotated per video, lots of efforts have been devoted to investigate the utilization of those unlabeled frames by either assigning pseudo labels or performing feature enhancement. In this work, we propose a novel feature enhancement network to simultaneously model short- and long-term temporal correlation. Compared with existing work that only leverage short-term correspondence, the long-term temporal correlation obtained from distant frames can effectively expand the temporal perception field and provide richer contextual prior. More importantly, modeling adjacent and distant frames together can alleviate the risk of over-fitting, hence produce high-quality feature representation for the distant unlabeled frames in training set and unseen videos in testing set. To this end, we term our method SSLTM, short for Simultaneously Short- and Long-Term Temporal Modeling. In the setting of only one frame annotated per video, SSLTM significantly outperforms the state-of-the-art methods by 2% ~ 3% mIoU on the challenging VSPW dataset. Furthermore, when working with a pseudo label based method such as MeanTeacher, our final model only exhibits 0.13% mIoU less than the ceiling performance (i.e., all frames are manually annotated).