Generating a talking face video from the input audio sequence is a practical yet challenging task. Most existing methods either fail to capture fine facial details or need to train a specific model for each identity. We argue that a codebook pre-learned on high-quality face images can serve as a useful prior that facilitates high-fidelity and generalizable talking head synthesis. Thanks to the strong capability of the codebook in representing face textures, we simplify the talking face generation task as finding proper lip-codes to characterize the variation of lips during a portrait talking. To this end, we propose LipFormer, a transformer-based framework, to model the audio-visual coherence and predict the lip-codes sequence based on the input audio features. We further introduce an adaptive face warping module, which helps warp the reference face to the target pose in the feature space, to alleviate the difficulty of lip-code prediction under different poses. By this means, LipFormer can make better use of the pre-learned priors in images and is robust to posture change. Extensive experiments show that LipFormer can produce more realistic talking face videos compared to previous methods and faithfully generalize to unseen identities.