W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning
Abstract
Pedestrian trajectory prediction is crucial for applications such as autonomous driving and social robots. Recently, language model (LM)–based trajectory prediction has offered both competitive prediction accuracy and interpretability. However, the L2 loss commonly used in trajectory prediction cannot be applied directly to LM optimization, resulting in degraded prediction performance. Moreover, current LM-based trajectory prediction methods lack explicit expressions of social interactions, and their scene descriptions are overly simplistic, making it difficult to impose practical scene constraints. To address these issues, we propose Write-to-Walk (W2W). First, we construct a pedestrian trajectory dataset with explicit interaction semantics and generate parsable prompts from observed trajectories and interaction cues (companion/following/obstacle), alleviating the lack of interaction semantics in prompts. A T5-Small backbone is then trained in two stages: (1) full-parameter supervised fine-tuning with cross-entropy loss for language learning, endowing W2W with the capability for formatted question answering; (2) reinforcement learning (RL) to optimize W2W, where a reward function that integrates Average Displacement Error (ADE) and off-road penalties strengthens scene constraints, producing future trajectories consistent with the scene context and further improving prediction accuracy. Experiments on the benchmark datasets ETH/UCY and SDD demonstrate that W2W outperforms existing LM-based prediction methods on ADE/FDE metrics and achieves competitive results against state-of-the-art trajectory prediction methods. Meanwhile, the interpretability of LMs further enhances W2W’s prospects for deployment in safety-critical domains, such as autonomous driving.
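The RL reward described above combines a displacement term with an off-road penalty. A minimal sketch of one such reward is shown below; the exact weighting, map representation, and function names (`trajectory_reward`, `off_road_weight`, `drivable_mask`) are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def trajectory_reward(pred, gt, drivable_mask, cell_size=1.0, off_road_weight=0.5):
    """Hypothetical reward for an RL-fine-tuned trajectory predictor.

    pred, gt: (T, 2) arrays of predicted / ground-truth positions (metres).
    drivable_mask: 2D boolean grid, True where the scene is walkable.
    """
    # Average Displacement Error: mean L2 distance over all timesteps.
    ade = np.linalg.norm(pred - gt, axis=-1).mean()

    # Off-road penalty: fraction of predicted points on non-walkable cells.
    idx = np.clip((pred / cell_size).astype(int), 0,
                  np.array(drivable_mask.shape) - 1)
    off_road = (~drivable_mask[idx[:, 0], idx[:, 1]]).mean()

    # Higher reward = lower displacement error and fewer off-road points.
    return -(ade + off_road_weight * off_road)
```

A policy-gradient method would then maximize this reward over trajectories decoded from the LM's text output, tying the language objective to the geometric quality of the prediction.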