Dynamic Important Example Mining for Reinforcement Finetuning
Abstract
Reinforcement fine-tuning (RFT) is increasingly used to strengthen the reasoning abilities of large models, yet its effectiveness is bounded by how training data are selected and used. Most data-centric RFT methods rely on static or heuristic sample selection, implicitly assuming that a sample’s value is fixed over the course of training. This overlooks the non-stationary dynamics of policy learning and can lead to suboptimal updates. We propose Dynamic Important Example Mining (DIEM), a principled and fully automated framework that makes data utilization adaptive throughout RFT. DIEM integrates two components into each optimization step: (i) a gradient-alignment importance estimator that efficiently approximates each sample’s marginal contribution to policy improvement; and (ii) a constrained batch-reweighting scheme that maximizes aggregate utility while preserving the update’s gradient magnitude to stabilize optimization. This turns data selection from a one-time preprocessing heuristic into an intrinsic part of the learning algorithm, yielding a self-organizing, curriculum-like training trajectory driven by model dynamics rather than external scores. Across several multimodal reasoning benchmarks, DIEM consistently outperforms strong static and dynamic baselines, improving the base RFT algorithm by approximately 1% to 6% while adding only about 1.2% training overhead.
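The two components described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual method: it assumes per-sample policy gradients are available as rows of a matrix `G`, uses cosine alignment with the mean batch gradient as the importance proxy, and a softmax reweighting rescaled to preserve the gradient norm; the function name `diem_weights` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def diem_weights(G: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Sketch of dynamic importance reweighting for one optimization step.

    G: (n_samples, n_params) matrix of per-sample policy gradients.
    Returns sample weights whose weighted gradient keeps the same
    magnitude as the uniform-average gradient.
    """
    g_mean = G.mean(axis=0)  # current batch policy-improvement direction
    # (i) gradient-alignment importance: cosine similarity of each
    #     sample's gradient with the mean gradient, as a cheap proxy
    #     for its marginal contribution to policy improvement
    norms = np.linalg.norm(G, axis=1) * np.linalg.norm(g_mean) + 1e-12
    align = (G @ g_mean) / norms
    # (ii) constrained batch reweighting: softmax over alignment scores
    #     favors currently useful samples ...
    w = np.exp(align / temperature)
    w /= w.sum()
    # ... then rescale so the reweighted update has the same gradient
    # magnitude as the original, stabilizing the optimization step
    g_new = w @ G
    scale = np.linalg.norm(g_mean) / (np.linalg.norm(g_new) + 1e-12)
    return w * scale

# Toy usage: 8 samples, 16 parameters
rng = np.random.default_rng(0)
G = rng.normal(size=(8, 16))
w = diem_weights(G)
g_update = w @ G
# The reweighted update preserves the uniform-average gradient norm
print(np.isclose(np.linalg.norm(g_update), np.linalg.norm(G.mean(axis=0))))
```

Because selection happens inside every step from the model's own gradients, the weighting shifts as the policy evolves, which is what produces the curriculum-like trajectory the abstract describes.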