MultiWOZ: a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modeling

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, Milica Gašić

WEDNESDAY, AUGUST 3, 2022 • NLP

Number of Passes: ✅✅

Why this paper?

From working on WebShop, I've become more interested in work concerning human-machine interaction, especially through a natural language understanding interface. Simultaneously, after reading Prof. Manning's excerpt on human understanding, I've also been interested in the interplay of semantic and neural methods, and how this synergy, in the form of neurosymbolic systems, might be an appropriate approach for highly structured natural language domains, one that natural language methodologies and large language models could benefit from. Dialogue systems seem to be a natural intersection of these two interests, and I want a deeper understanding of the challenges in this field to see whether that potential holds any water. As discussed in the Neurosymbolic Programming primer:

Fundamentally, dialog state is an intermediate symbolic representation that depends on complex, high-dimensional semantic context, namely dialog history and the underlying knowledge base or API. Thereby, neurosymbolic programming is a natural choice for modeling dialog state, successfully applied in many domains.
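
To make "dialog state" concrete for future me: it's essentially a symbolic record of the user's constraints so far, typically a mapping from domain-slot pairs to values that gets updated every turn from the dialogue history. A minimal sketch (the domains and slots below are illustrative, not the actual MultiWOZ ontology):

```python
# Minimal sketch: a dialog (belief) state as a symbolic mapping from
# domain-slot pairs to values, updated as the conversation progresses.
# Domain and slot names are illustrative, not the exact MultiWOZ ontology.

belief_state = {}  # e.g. {("restaurant", "food"): "italian", ...}

def update_state(state, turn_slots):
    """Overwrite slots mentioned in the latest user turn."""
    new_state = dict(state)
    new_state.update(turn_slots)
    return new_state

# Turn 1: "I'd like an Italian restaurant in the centre."
belief_state = update_state(belief_state, {("restaurant", "food"): "italian",
                                           ("restaurant", "area"): "centre"})
# Turn 2: "Actually, make it cheap."
belief_state = update_state(belief_state, {("restaurant", "pricerange"): "cheap"})

print(belief_state)
```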

The MultiWOZ benchmark, released in 2018, has emerged as a milestone in identifying the state of affairs and key challenges in the dialogue systems community, reflected in its motivations and contributions. In addition, it seems to have become the de facto dataset for a fair share of subsequent work in this space. As a dialogue system typically consists of many sub-parts, I thought such a benchmark paper would be a good starting point for learning about the broad range of task settings and their corresponding methodologies across the dialogue system stack. Out of the myriad of challenges, I was most interested in the paper's discussion of state tracking and response generation.

Context

The paper identifies inherent limitations in the scale and interaction types of existing conversation datasets. For each area, the challenges are as follows:

Scale: Prior datasets are fairly limited in the number of dialogues, with no dataset exceeding 2500 total. While this is enough to construct modularized, individual components, the limited size inherently constrains the ability to train large, end-to-end systems.

MultiWOZ has around 10k dialogues, at least one order of magnitude larger than any structured corpus available at the time. The significant size of the corpus allows researchers to carry out end-to-end dialogue modelling experiments.

Interaction Types: The authors review three interaction categories: machine-to-machine, human-to-machine, and human-to-human. Machine-to-machine conversations can be generated synthetically and for free, but the inherently artificial nature of such programmatically generated conversations, along with the lack of grounding in real conversations, makes it hard for systems trained on such corpora to transfer to real-world settings. Human-to-machine pipelines do not have this problem, but bootstrapping such data to develop dialogue systems for new domains is not easy; even if such data can be transformed or reduced to the new dialogue system's desired format, irrelevant or harmful biases and noise can carry over to the new domain. Human-to-human conversations are richest in their capture of human behavior, but collecting such data can be quite costly. Furthermore, such conversations can be hard to evaluate, or outright unusable, when there is no explicit underlying goal or structure, particularly for task-oriented dialogue systems.

Contribution

MultiWOZ focuses on human-to-human data collection and adapts the Wizard-of-Oz framework to 1. ground human-to-human conversations in real tasks, which in turn makes labeling of semantics, state, and acts automatable, and 2. crowdsource such conversations across a large population (as opposed to a handful of experts acting as Oz), making for not only a cheaper collection scheme but also greater diversity and reduced bias.

A quick overview for future me - the Wizard-of-Oz framework is a test setup where a user thinks they are communicating with an intelligent system or machine, when in fact there is a human (Oz) on the other side. This setup has been used across a variety of fields, commonly to gather data for, or evaluate the usability/performance of, a user interface. The setup varies in 1. whether the user knows there is an Oz and 2. the distribution of responses coming from the system versus the human; these decisions about how Oz is perceived help encourage the natural behaviors being sought after. In the context of this paper, WOZ was chosen to collect high-quality natural language conversation data across multiple domains, not tied to any particular system, thus addressing the scalability, faithfulness to reality, and semantic richness that prior works came up short on.

Notes

Collection Methodologies

The high-level goal is to collect multi-domain dialogues. Generally, the authors follow the WOZ setup, with one key difference:

To overcome the need of relying the data collection to a small set of trusted workers, the collection set-up was designed to provide an easy-to-operate system interface for the Wizards and easy-to-follow goals for the users.

Using this setup, it's then possible to collect data at scale by crowdsourcing across a larger human population.

Tasks: Several dialogue tasks are generated via templates from a base ontology spanning multiple domains (total 7), with numerous (informable/requestable) slots (total 24) and act types (total 13) found across domains; a toy sketch of goal sampling follows the bullets below.
• This enables creation of both single + multi-domain dialogue scenarios
• Goals can change with non-zero probability to encourage realistic conversations
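
To keep the collection mechanics straight in my head, here's a rough sketch of what sampling a (possibly multi-domain) goal from an ontology of informable slots might look like. The ontology entries and the goal-change probability are invented for illustration; the paper only says goals can change with some probability.

```python
import random

# Toy ontology: domain -> informable slots and their possible values.
# These entries are illustrative, not the actual MultiWOZ ontology.
ONTOLOGY = {
    "restaurant": {"food": ["italian", "chinese"], "area": ["centre", "north"],
                   "pricerange": ["cheap", "expensive"]},
    "hotel": {"area": ["centre", "east"], "stars": ["3", "4"],
              "parking": ["yes", "no"]},
}

def sample_goal(num_domains=1, change_prob=0.2):
    """Sample a (possibly multi-domain) user goal from the ontology."""
    domains = random.sample(list(ONTOLOGY), num_domains)
    goal = {d: {s: random.choice(v) for s, v in ONTOLOGY[d].items()} for d in domains}
    # With some probability, mark one constraint to change mid-dialogue,
    # mimicking the "goals can change" behaviour described above.
    if random.random() < change_prob:
        d = random.choice(domains)
        s = random.choice(list(ONTOLOGY[d]))
        goal[d][s + "_changed_to"] = random.choice(ONTOLOGY[d][s])
    return goal

print(sample_goal(num_domains=2))
```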

Task Presentation + Oz Interface: The task template is mapped to natural language and introduced to the user via a set of heuristic rules. The Oz GUI allows the operator, at each turn, to either solicit information from a back-end database or provide the user with more information. Logging the operator's decision at each turn, along with a running record of the belief state, lets the dialogue be annotated with its state automatically.
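
A sketch of the kind of per-turn record the wizard-side interface could be logging, to convince myself that state annotation really does fall out for free. The field names here are my own guess, not the paper's schema:

```python
# Sketch of a per-turn record the wizard-side interface might log.
# Field names are invented; only the idea (decision + belief state per turn
# yields state annotations for free) comes from the paper.

turn_log = {
    "turn_id": 3,
    "user_utterance": "Something cheap in the centre, please.",
    "wizard_decision": "query_db",          # or "respond_directly"
    "belief_state": {                        # running record kept by the wizard
        "restaurant": {"area": "centre", "pricerange": "cheap"},
    },
    "db_results": 4,                         # number of matching entities
    "system_utterance": "There are 4 cheap restaurants in the centre.",
}

# Because the belief state is logged at every turn, the dialogue-state
# annotation is just the sequence of these records; no separate pass needed.
print(turn_log["wizard_decision"], turn_log["belief_state"])
```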

Dialogue Act Annotation is done in an entirely separate step, with a vetting process that 1. ensures dialogues are only annotated by top-tier crowd workers and 2. incorporates crowd workers' suggestions on expanding the set of dialogue acts to better cover what is reflected in the conversations.

Other Thoughts

Section 4 discusses dataset statistics and formatting (I found the GitHub repo much more helpful for a better understanding of how the data is organized + what the conversations and annotations look like).
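
For future reference, here's roughly how I'd poke at the released JSON. The file and field names ("data.json", "goal", "log", "text", "metadata") are from my memory of the repo and may differ between MultiWOZ versions, so treat this as a starting point rather than the exact format:

```python
import json

# Rough sketch of inspecting the released corpus; names are from memory of
# the repo and should be double-checked against the current release.
with open("data.json") as f:
    data = json.load(f)

dial_id, dialogue = next(iter(data.items()))
print("dialogue id:", dial_id)
print("goal keys:", list(dialogue["goal"].keys()))

for i, turn in enumerate(dialogue["log"][:4]):
    speaker = "USER" if i % 2 == 0 else "SYS "
    print(speaker, turn["text"])
    # In system turns, the annotated belief state should live under turn["metadata"].
```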

Section 5 re-runs a variety of existing tasks and SOTA methods on the new MultiWOZ dataset. The main point being demonstrated is that MultiWOZ is harder (for state tracking, context-to-text generation, and act-to-text generation) for a few reasons (see the metric sketch after this list for the state-tracking case):

  1. The conversations are semantically richer and longer
  2. Dialogues span across multiple domains, making for a more complex context that makes text generation harder
  3. 60+% of dialogue turns have multiple system acts
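
For state tracking specifically, the difficulty compounds because the standard joint goal accuracy metric only credits a turn if every slot in every active domain is predicted correctly. A toy sketch of that metric, using my own simplified state representation:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns where the full predicted belief state matches the gold
    state exactly (every domain-slot value correct). States here are dicts
    mapping (domain, slot) -> value; this representation is my simplification."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)

# Toy example: one of two turns has a single wrong slot, so accuracy is 0.5,
# even though most individual slot values were right.
gold = [{("restaurant", "area"): "centre"},
        {("restaurant", "area"): "centre", ("hotel", "stars"): "4"}]
pred = [{("restaurant", "area"): "centre"},
        {("restaurant", "area"): "centre", ("hotel", "stars"): "3"}]
print(joint_goal_accuracy(pred, gold))  # 0.5
```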

Questions

I thought the re-evaluation of existing tasks and models on the MultiWOZ dataset poked a lot of holes in prior work, particularly in a model's ability to scale with [semantic richness, length of conversation, number of acts / slots per turn / conversation, number of domains]. While I understand how MultiWOZ is a step forward from prior datasets, I don't know which of these facets is the most interesting and important for forthcoming dialogue systems work - I believe I'll get a better sense of this from seeing what other papers identify as the more worthwhile properties of MultiWOZ for their work.

Looking Forward

Read follow up work, mainly:

  1. Benchmarks / Datasets that identify some shortcoming of MultiWOZ and how they address it
  2. Advancements for models performing dialogue system tasks (e.g. response generation, state tracking) that use the MultiWOZ dataset for evaluation.