World of Bits: An Open-Domain Platform for Web-Based Agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, Percy Liang

MONDAY, JULY 11, 2022

NLP

Number of Passes: ✅

Why this paper?

Now that v1 of the WebShop project has been published, I thought I'd take some time to look back at some of the core inspiration for WebShop. WebShop was born of the motivation to design an environment for interactive language grounding and decision making that is scalable and faithful to a real-world task scenario. We believe that designing such an environment with scalability at the forefront is not only significant to the language grounding and language agent communities within NLP, but also puts us a step closer to language agents operating in the real world, either on behalf of or in cooperation with humans.

When thinking about a setting that checks the boxes for the aforementioned goal, we came around to the idea that the web is an incredible environment, chock full of data and transitions, that, when simulated purposefully, can capture a variety of interesting challenges for language agents, allowing trained models not only to be evaluated at scale, but also to be directly transferable to a real setting.

We were not the first ones to make this observation. This review discusses the World of Bits (WoB) benchmark, which was, according to my research, the most outstanding of the initial attempts at introducing web tasks as an interesting setting for training grounded language agents.

Context

At the time, a significant number of reinforcement learning benchmarks were founded on synthetic data, particularly game environments (e.g., Atari). While such rich and complex environments showcase the immense potential of RL agents, semantic information grounded in reality is noticeably missing from an artificial setting that differs significantly from real-world scenarios.

[A]gents in such [simulated] environments never experience the sheer breadth of experience of the real world, and thus they miss out on important semantic knowledge crucial for developing intelligence.

Furthermore, for areas with RL applications like robotics, collecting data is financially costly and time consuming.

Tasks found on the web have the potential to close the reality and scalability gap. The authors identify three attributes (open-domain, open-source, and easy data collection) that, in conjunction, make scalability across domains and data sources achievable via engineering and low-cost collection of human examples.

Contribution

World of Bits Environment

The general WoB environment is a model of the web with the following definitions:
• Observation: <Raw Screen Pixels, Text DOM, Scalar Reward Signal>
• State Space: Color image and query (NL for MiniWoB, <template, slots> for FormWoB and QAWoB)
• Action Space: KeyEvent (press a button) or PointerEvent (move mouse to pixel while pressing left mouse button)
• Agent Response: A list of actions
• Reward: The list of actions is processed in order, and a reward is determined per the task specific reward function

These tasks are entirely executed within the context of a single web page.
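To make the definitions above a bit more concrete, here is a minimal sketch of how the observation, the two action types, and the "process actions in order, then score" loop could fit together. This is my own illustration, not the authors' code: the class fields, the `step` function, and the toy `click_button_reward` task are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class KeyEvent:
    key: str  # hypothetical field: which key is pressed


@dataclass(frozen=True)
class PointerEvent:
    x: int  # pixel column the mouse moves to
    y: int  # pixel row the mouse moves to
    left_button_down: bool = True  # assumed flag for the left mouse button


Action = Union[KeyEvent, PointerEvent]


@dataclass
class Observation:
    pixels: List[List[int]]  # stand-in for the raw screen image
    dom_text: str            # stand-in for the text DOM
    reward: float            # scalar reward signal


def step(actions: List[Action],
         reward_fn: Callable[[List[Action]], float]) -> Observation:
    """Process the actions in order, then score them with a task-specific reward."""
    applied: List[Action] = []
    for action in actions:
        # A real environment would forward each action to the browser here.
        applied.append(action)
    return Observation(pixels=[[0]], dom_text="", reward=reward_fn(applied))


def click_button_reward(actions: List[Action]) -> float:
    """Toy task (assumed): succeed if a click lands in a 40x20 button at (100, 50)."""
    for action in actions:
        if (isinstance(action, PointerEvent) and action.left_button_down
                and 100 <= action.x < 140 and 50 <= action.y < 70):
            return 1.0
    return -1.0
```

For instance, `step([PointerEvent(120, 60)], click_button_reward)` scores a click inside the hypothetical button, while a response made up only of key presses fails the toy task.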

The authors develop three different techniques for creating web tasks: MiniWoB, FormWoB, and QAWoB.
• MiniWoB: 100 environments written in HTML/CSS/JavaScript. Each corresponds to a web task with a manually specified reward function. Across all 100, there are a variety of inputs (e.g., buttons, text fields, sliders, date pickers). Rewards range from -1 (failure) to 1 (success) and are weighted by time to completion.
• FormWoB: Convert websites into web tasks by recreating an offline approximation from HTTP requests + responses. The authors applied this approach to four real flight booking websites.
• QAWoB: Develop a web task, in the form of a query template, via crowdsourcing. This is done in a two-step process: a first worker proposes a website and a query template (an NL query with slots), and a second worker performs the task via an interface that records the selected DOM element as the answer.
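Two ideas from the list above are easy to sketch in a few lines: a time-weighted reward in the spirit of MiniWoB, and a query built from a template plus slots as in FormWoB and QAWoB. Both functions below are hypothetical. In particular, the linear decay is my assumption about what "based on time to completion" could look like, and the template string is a made-up example rather than one from the paper.

```python
from typing import Dict


def time_weighted_reward(success: bool, elapsed: float, time_limit: float) -> float:
    """Assumed linear decay: a success is worth 1.0 at t=0 and less as time passes.

    Failures and timeouts are scored -1.0.
    """
    if not success or elapsed >= time_limit:
        return -1.0
    return 1.0 - elapsed / time_limit


def fill_template(template: str, slots: Dict[str, str]) -> str:
    """Turn a <template, slots> pair into a natural language query."""
    return template.format(**slots)


# Made-up template and slot values, for illustration only.
query = fill_template(
    "Book a flight from {origin} to {destination}",
    {"origin": "SFO", "destination": "JFK"},
)
```

Under this assumed scheme, finishing a task a quarter of the way into the time limit would earn 0.75, and the same template can be reused with different slot values to generate many queries.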

Modeling

The authors put forth a model that works off of a joint representation of the image (processed by a CNN) and textual features (a text feature map based on matching between the query and the DOM). The authors then apply behavior cloning and reinforcement learning with promising results, but with a significant gap compared to human performance.
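As a rough intuition for what a text feature map based on query and DOM matching might look like, here is a small sketch. It is not the authors' featurization: the `DomElement` layout, the token-overlap score, and the coarse grid are all assumptions I chose to keep the example short. The idea is simply that regions of the screen covered by DOM elements whose text overlaps with the query get a higher value, giving the model a spatial signal aligned with the image.

```python
from typing import List, Tuple

# Hypothetical element layout: (text, left, top, width, height) in grid cells.
DomElement = Tuple[str, int, int, int, int]


def text_match_map(query: str, elements: List[DomElement],
                   rows: int, cols: int) -> List[List[float]]:
    """Build a rows x cols grid scoring how well each region's text matches the query."""
    query_tokens = set(query.lower().split())
    grid = [[0.0] * cols for _ in range(rows)]
    for text, left, top, width, height in elements:
        tokens = set(text.lower().split())
        if not tokens:
            continue
        # Fraction of the element's tokens that also appear in the query.
        score = len(tokens & query_tokens) / len(tokens)
        # Paint the score over the element's bounding box, clipped to the grid.
        for r in range(max(top, 0), min(top + height, rows)):
            for c in range(max(left, 0), min(left + width, cols)):
                grid[r][c] = max(grid[r][c], score)
    return grid


# Made-up page with two elements on a 2x4 grid.
page = [("Submit", 0, 0, 2, 1), ("Cancel order", 2, 1, 2, 1)]
submit_map = text_match_map("click submit", page, rows=2, cols=4)
```

In this toy page, the cells under the "Submit" element light up for the query "click submit", while the rest of the grid stays at zero. A map like this could then be stacked with CNN features, though how the paper actually combines them is beyond what this sketch covers.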

In particular, for flight booking, the model achieves 20%–30% of human level performance on training queries, and 16% on test queries.

Looking Forward

I realize I did not go over the modeling methodologies presented in this paper particularly thoroughly. This is not due to a lack of interest; rather, I'm interested in reading later papers that discuss the authors' original approaches and then propose new modeling techniques built on the MiniWoB benchmark. I'm hoping these papers will provide inspiration for closing the similar gap between human and model performance observed in WebShop.