Inferring Rewards From Language in Context

Jessy Lin, Daniel Fried, Dan Klein, and Anca Dragan

SATURDAY, SEPTEMBER 17, 2022 • NLP

Number of Passes: ✅✅

Thank you to Khanh Nguyen for going through this paper with me and helping me understand the inverse RL approach for estimating a reward function, and how this approach can be applied in the context of instruction following and dialogue.

Why this paper?

The WebShop task involves an agent searching for a product given an instruction that comprehensively specifies all the desired attributes and options up front. These preferences are not modified over the course of a single trajectory as the agent works toward completing the task. One core piece of feedback from reviewers of WebShop is that such a problem setting is unrealistic. When humans shop, preferences and the weight placed on each preference are usually not fully available at the beginning; rather, a human tends to surface them several turns into the problem, or may not even be aware that a preference exists until later. Furthermore, such preferences can change, and the very nature of how preferences are communicated and interpreted is a worthwhile question in and of itself.

Some of these concepts were incorporated in the data collection for the MultiWOZ dataset, where the constraints of the task are revealed to the user across multiple turns of the conversation. However, from a slot-filling perspective, the constraints found in MultiWOZ are discrete (i.e. binary or multi-class) in nature. During data collection for WebShop, one of the more apparent observations was that preferences can be much more numeric and continuous. A methodology for representing such preferences in a generalizable manner from this kind of feedback would be a big step toward the goal of designing human-like natural language interfaces.

Context

This paper falls in the instruction-following domain. Starting from the base problem of learning to map natural language commands to a sequence of actions, the authors observe that such language is also a rich source for understanding not just what actions are taken, but why. In practice, this "why" is framed as the manifestation of a preference. The authors also point out that such feedback is often communicated indirectly:

[W]hen people interact with systems they often primarily aim to achieve specific tasks, rather than literally describing their preferences in full. How do we infer general goals and preferences from utterances in these settings?

If we're able to do this, then the authors put forth the vision that

[B]eyond just selecting the right flight, such a system would be able to autonomously book flights on behalf of the user in other instances.

I should mention at this point that the dataset and task put forth in this paper are centered around estimating user preferences for flight booking. With that said, my educated guess is that the high-level point of this paper is to put forth a framework and methodology for preference estimation that is fairly agnostic to the underlying domain.

This generalizability is exciting to me, and I like that the authors tie it to preferences, which come closer to capturing constraints about the user, rather than language-to-action mappings, which seem to capture more about the task than about the person. This difference becomes more apparent when a single "slot", in the dialogue sense, has a large number of possible answers.

Contribution

The key idea of our work is that the way that a user refers to their desired actions with language also reveals important information about their reward... Intuitively, in settings with repeated interactions, utterances are optimized to communicate information that is generalizable — implicitly helping listeners make useful inferences for acting on a longer horizon.

The authors put this idea into practice via a dataset and model, built around the core idea that:

[S]peakers choose utterances that both elicit reward-maximizing actions in a particular context and faithfully describe the reward. Given an utterance, our model infers that the most likely rewards are the ones that would have made a speaker likely to choose that utterance.
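
Reading this as a pragmatics-style inference, the listener's job is roughly a Bayesian inversion: given the utterance and the flights on the table, score each candidate reward by how likely a speaker holding that reward would have been to say what they said. In notation (my paraphrase of the idea, not the paper's exact formulation):

```latex
% u: utterance, A: candidate flights in context, w: reward (preference) vector
% p_S is a speaker model that trades off faithfully describing w and
% eliciting the reward-maximizing action from A.
p(w \mid u, A) \;\propto\; p_S(u \mid w, A)\, p(w)
```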

The dataset is FlightPref, a multi-turn flight booking game in which natural language utterances from humans are grounded in real, underlying preferences. From these conversations, the authors aim to estimate the ground-truth preferences by learning a reward model trained via inverse reinforcement learning. The final deliverable, demonstrating that the authors' intuition holds, is:

Our full model obtains relative accuracy improvements of 12% when compared to models that only treat language as descriptions of actions.

Notes

Task Construction: The FlightPref Task

• The goal of the game is for the assistant to choose the desired flight correctly in as few turns as possible.
• A high score represents the most effective communication between speaker and listener.
• Both the user and the assistant are penalized if the assistant picks incorrectly, to encourage collaboration.

Dataset Collection
• Recruit two randomly paired people; one plays the assistant and the other plays the user.
• Each pair plays 6 games of 6 rounds each.
• For each game, a reward vector of size 8, representing preferences over 8 different flight attributes, is generated randomly (a sketch of this setup follows the list). The user can see this vector; the assistant cannot.
• The 91 highest-scoring games are used as the evaluation set.
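
Based on the description above, the reward for a given flight plausibly reduces to a dot product between the hidden preference vector and the flight's attribute features. The sketch below is my own illustration of that setup; the attribute names, feature scaling, and the `score_flight` helper are assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical attribute order; the paper uses 8 flight attributes,
# but these exact names are my assumption.
ATTRIBUTES = [
    "carrier", "departure_time", "arrival_time", "price",
    "num_stops", "duration", "class", "redeye",
]

def sample_reward_vector(rng: np.random.Generator, dim: int = 8) -> np.ndarray:
    """Randomly generate a hidden preference (reward weight) vector,
    one weight per attribute, as in the FlightPref game setup."""
    return rng.uniform(-1.0, 1.0, size=dim)

def score_flight(reward_weights: np.ndarray, flight_features: np.ndarray) -> float:
    """Assumed linear reward: dot product of preference weights and
    the flight's (normalized) attribute features."""
    return float(reward_weights @ flight_features)

rng = np.random.default_rng(0)
w = sample_reward_vector(rng)                  # only the user sees this
flights = rng.uniform(0.0, 1.0, size=(3, 8))   # 3 candidate flights in a round
best = int(np.argmax([score_flight(w, f) for f in flights]))
print(f"reward-maximizing flight: {best}")
```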

Model Design
A fair share of these ideas come from the robotics domain. Here, the engineering difference is that the inputs are natural language, which the authors convert into BERT-based encodings. The base models compute an inner product between a BERT-base-encoded utterance and a learned representation produced by an MLP encoder: the listener model's MLP encodes the available actions, while the speaker model's MLP encodes the underlying reward. Section 3 of the paper discusses the rational listener model design in more detail, along with how the estimated reward is then used for action selection and utterance generation.
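
To make the base-listener description concrete, here is a minimal sketch of what I understand the action-scoring step to look like: a BERT encoding of the utterance is compared, via an inner product, against MLP encodings of each candidate flight's features, and a softmax over those scores gives a distribution over actions. The module structure, dimensions, and pooling choice are my assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BaseListener(nn.Module):
    """Sketch of a base listener: score each candidate action (flight)
    by the inner product between a BERT utterance encoding and an
    MLP encoding of the flight's attribute features."""

    def __init__(self, n_features: int = 8, hidden: int = 256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        # MLP maps raw flight features into the BERT embedding space.
        self.action_encoder = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, self.bert.config.hidden_size),
        )

    def forward(self, utterance: str, action_features: torch.Tensor) -> torch.Tensor:
        # action_features: (num_actions, n_features)
        tokens = self.tokenizer(utterance, return_tensors="pt")
        # Use the [CLS] vector as the utterance representation (a simplification).
        utt_vec = self.bert(**tokens).last_hidden_state[:, 0, :]   # (1, d)
        act_vecs = self.action_encoder(action_features)            # (num_actions, d)
        scores = act_vecs @ utt_vec.squeeze(0)                     # (num_actions,)
        return torch.softmax(scores, dim=-1)                       # distribution over actions

listener = BaseListener()
flights = torch.rand(3, 8)  # 3 candidate flights, 8 attributes each
probs = listener("I'd prefer the cheapest nonstop flight", flights)
print(probs)
```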

Questions

I enjoyed reading this paper. A question that leapt to the forefront of my mind is how this task could present an opportunity for models to learn to express uncertainties and ask appropriate questions. During data collection, the listener, who is a human at this point, can choose between either (1) selecting one of the 3 flight options as the answer or (2) prompting the user for more utterances. The goal of the model is to estimate the underlying reward function from the conversation, which is a fairly passive role. Is it possible to change aspects of the task to make it possible to design an agent that estimates rewards actively? By active, I mean an agent that still aims to learn these estimates, but is also capable of asking questions to address its uncertainties.

Some aspects of the task that could be changed:
• The number of attributes is fixed at 8. What if it's a variable number of attributes? This would likely affect how a reward model is learned. Generating data that captures this variability seems interesting.
• The user has full knowledge of the underlying reward vector. What if this is not the case? Can an agent help discover preferences that the user doesn't realize they care about?
• The preferences do not change across the course of a single game.
• The model is inferring user preferences. What if the model can actively ask questions to address uncertainties, as opposed to just observing?
• The number of options that can be selected is fixed at three. What if this catalog is much larger? Realistically, there are hundreds, if not thousands, of candidate flights.