Implicit neural representation demonstrates promising ability in representing images with arbitrary resolutions recently. In this paper, we present Local Implicit Transformer (LIT) that integrates attention mechanism and frequency encoding technique into local implicit image function. We design a cross-scale local attention block to effectively aggregate local features and a local frequency encoding block to combine positional encoding with Fourier domain information for constructing high-resolution (HR) images. To further improve representative power, we propose Cascaded LIT (CLIT) exploiting multi-scale features along with cumulative training strategy that gradually increase the upsampling factors for training. We have performed extensive experiments to validate the effectiveness of these components and analyze the variants of the training strategy. The qualitative and quantitative results demonstrated that LIT and CLIT achieve favorable results and outperform the previous works within arbitrary super-resolution tasks.