Poster

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Yunseok Jang · Yeda Song · Sungryull Sohn · Lajanugen Logeswaran · Tiange Luo · Dong-Ki Kim · GyungHoon Bae · Honglak Lee


Abstract:

Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing agents capable of mobile operating system (mobile OS) navigation. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models trained on MONDAY demonstrate robust cross-platform generalization, consistently outperforming models trained on existing single-OS datasets while achieving 21.41%p better performance on previously unseen mobile OS configurations. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework combines robust OCR-based scene detection (95.04% F1-score), near-perfect UI component detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
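To give a flavor of the OCR-based scene detection the abstract mentions, here is a minimal sketch of one common approach: treat each video frame's OCR output as a token set and mark a scene boundary wherever the overlap with the previous frame drops sharply. The function names, the Jaccard-similarity choice, and the threshold are illustrative assumptions, not details taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (1.0 when both empty)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def detect_scene_boundaries(frame_tokens, threshold=0.5):
    """Return indices i where frame i starts a new scene, i.e. where the
    OCR token overlap with frame i-1 falls below `threshold`.
    `frame_tokens` is a list of per-frame OCR token lists (threshold is
    an illustrative assumption, not the paper's value)."""
    boundaries = []
    for i in range(1, len(frame_tokens)):
        if jaccard(frame_tokens[i - 1], frame_tokens[i]) < threshold:
            boundaries.append(i)
    return boundaries

# Toy example: two frames of a settings screen, then a home screen.
frames = [
    ["Settings", "Wi-Fi", "Bluetooth"],
    ["Settings", "Wi-Fi", "Bluetooth", "Display"],
    ["Phone", "Messages", "Camera"],
]
print(detect_scene_boundaries(frames))  # → [2]
```

In this toy run the first two frames share most of their on-screen text (similarity 0.75), so only the jump to the home screen (similarity 0.0) is flagged as a scene change.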
