Skip to yearly menu bar Skip to main content


GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Tomas Soucek · Dima Damen · Michael Wray · Ivan Laptev · Josef Sivic

Arch 4A-E Poster #172
[ ] [ Project Page ]
Wed 19 Jun 5 p.m. PDT — 6:30 p.m. PDT


We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.

Live content is unavailable. Log in and register to view live content