DynamicWebArena: Evaluating Browser Agents in Dynamic and Temporally Evolving Full-Stack Environments
Abstract
Existing browser-use benchmarks largely assume a classical, static web architecture. In contrast, today’s web relies heavily on dynamic, stateful, and temporal behaviors driven by asynchronous requests, persistent server communications, and real-time user interactions. Consequently, current agents struggle to handle the non-deterministic, partially observable nature of modern web applications. This exposes fundamental flaws not just in the models themselves, but in current agent architectures and prompting strategies.
To address these deficiencies, we introduce DynamicWebArena, a novel browser-use benchmark that challenges web agents with complex, full-stack environments. DynamicWebArena encompasses diverse, fully operational applications such as online bidding, stock trading, and algorithmically driven social media. While emulating the dynamic impact of external, time-sensitive events, the benchmark remains entirely self-hosted, deterministic, and reproducible. Our experiments demonstrate a significant performance gap, with a current agent, GPT-5.4, achieving only a 6.8% end-to-end task success rate.
Materials
