WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
Abstract
We present WebGym, the largest open-source environment to date for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 1 million tasks with rubric-based evaluations across diverse real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) algorithm, REINFORCE, which learns from the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To speed up rollout sampling for RL training, we develop a high-throughput asynchronous rollout system designed specifically for web agents, which achieves a 4-5x rollout speedup over naive implementations and enables training at scale on a diverse set of tasks. With this setup, we fine-tune strong vision-language models, such as Qwen-3-VL-8B-Instruct, on the training tasks from WebGym, improving success rate on an out-of-distribution test set from 21.8% to 28.5%. This outperforms a proprietary GPT-4o-based agent and closes the gap to a GPT-5-Thinking agent, which achieves 31.8%. The improvement is significant because our test set consists only of tasks on websites never seen during training, demonstrating that the agents generalize. WebGym thus provides both the task breadth and the system throughput needed for large-scale RL on web agents.
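The REINFORCE setup described above can be sketched as follows. This is a minimal illustration of the standard REINFORCE surrogate objective, not code from the WebGym implementation; all names and values are hypothetical.

```python
# Minimal sketch of a REINFORCE-style objective: each rollout's action
# log-probabilities are weighted by the scalar task reward, so rollouts
# that succeed push up the probability of the actions they took.
# Illustrative only; not the WebGym codebase.

def reinforce_loss(logprobs, reward):
    """Surrogate loss for one rollout: -reward * sum_t log pi(a_t | s_t).

    Minimizing this (e.g. with gradient descent on the policy's
    parameters) increases the likelihood of action sequences that
    earned high task reward.
    """
    return -reward * sum(logprobs)

# Toy usage: two rollouts of a task, one successful (reward 1.0) and
# one failed (reward 0.0); only the successful trace contributes.
success_trace = [-0.1, -0.3, -0.2]   # per-step action log-probabilities
failure_trace = [-0.5, -0.9]
batch_loss = (reinforce_loss(success_trace, reward=1.0)
              + reinforce_loss(failure_trace, reward=0.0)) / 2
```

In practice the rubric-based evaluation supplies the scalar `reward` for each rollout, and rollouts are sampled asynchronously to keep the policy's GPUs busy.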