ABCD: A Corpus for Building More In-Depth Task-Oriented Dialogue Systems
Derek Chen, Howard Chen, Yi Yang, Alex Lin, Zhou Yu
Number of Passes: ✅✅
Check out the blog post by Derek Chen for the original unveiling of the ABCD dataset to the greater conversational AI research community.
Why this paper?
The decision to read this paper is following up on my personal directive to read more dialogue systems papers, particularly benchmarks that follow up on MultiWOZ. I elected to read the ABCD paper in particular because 1. I work closely with Howard Chen at Princeton, who not only authored this paper while at ASAPP, but is also a fantastic mentor and a great friend, and 2. I just completed a summer internship at ASAPP working as an ML engineer, where I had the opportunity to examine the methodology and engineering behind production task oriented dialogue systems up close.
The evolution of dialogue system datasets going forward, especially when seen from the lense of motivating practical usage, seems to be greater complexity that captures constraints and interaction dynamics beyond simply identifying a user's desired intent. To develop more realistic conversation scenarios requires tasks and corresponding data that is a better reflection of the logical depth that a human agent in a real task oriented dialogue setting must know to properly do their job.
The research team at ASAPP is at the forefront of such pragmatic dialogue, and the ABCD paper reflects their effort to design a benchmark that mirrors the task oriented dialogue system setting for customer service applications.
Context
The main observation the authors make from a real customer service setting that forms the foundation of their contributions is:
[S]electing actions in real life requires not only obeying user requests, but also following practical policy limitations which may be at odds with those requests.
The second point refers to the traditional slot filling paradigm. While traditional task-oriented dialogue systems involve a fairly straightforward mapping of intent to action, in reality...
[R]esolving customer issues often concerns multiple actions completed in succession with a specific order since prior steps may influence future decision states.
These two observations characterize the full suite of contributions, which include the ABCD dataset, collection methodology, and two new tasks (Action State Tracking and Cascading Dialogue Success). The upshot of this paper is to enable development of more practical dialogue system models that are capable of carrying out actions respecting both user intents and company policies, while also defining a new paradigm for multi-step actions called subflows that can be adapted to designing sets of actions for any domain.
Contribution
ABCD Dataset: As elaborated upon in the related work, the differences between ABCD and prior work are as follows.
Compared to traditional dialogue datasets, particularly task-oriented settings that involve 1. looking up actions from a single knowledge base and 2. scale to more domains...
Rather than expanding wider, ABCD instead focuses deeper by increasing the count and diversity of actions within a single domain.
Prior conversational datasset work has made efforts to be more realistic via different focuses (i.e. handling empathy, common-sense reasoning), multimodal augmentations, or introducing external data sources (i.e. personas, online reviews, large knowledge bases).
ABCD aims to make dialogue more realistic by considering distinct constraints from policies.
This task setting preserves the agent-human configuration, but deviates from the typical WOZ setup. The authors describe the Expert Live Chat setting as a setup with the following three constraints:
(1) Conversations are conducted continuously in real-time.
(2) Users involved are not interchangeable.
(3) Players are informed that all participants are human – no wizard behind the scenes.
This setting seems to be quite a bit trickier, mainly because during collection, the human acting as an agent has to have expertise of the policy in place, in addition to the core task of helping the human achieve their desired intent. Section 4.2 discusses the design considerations to satisfy the constraints of needing an expert human and how to make task completion feasible since dialogues cannot be completed asynchronously or randomly assigned to any MTurk workers. All in all, it seems that the task put forth is a close replica of a real customer service agent role.
I found it impressive that the authors were able to translate their observations into a dataset that is quantitatively and qualitatively more extensive than prior work, offering an unparalleled depth that is not otherwise achievable via simply expanding across domains or knowledge bases.
[W]e end up with 10,042 total conversations with an average of 22.1 turns – the highest turn count among all compared datasets. Unsurprisingly, ABCD includes more actions per dialogue than other datasets, by at least a factor of two... Since each subflow represents a unique customer intent, ABCD contains 55 user intents evenly distributed through the dataset. By interpreting buttons as domains, the dataset contains 30 domains and 231 associated slots, compared to 7 domains and 24 slots within MultiWOZ.
New Benchmarks:
The Action State Tracking (AST) task is a variant of traditional dialog state tracking (DST), with the one important twist that it accounts for constraints from a policy (i.e. Customer Service Guidelines). In practice, this means that in addition to the intent identification step, any additional steps that correspond to the mandates listed out in the policy must be executed in the correct order. The Cascading Dialogue Success task evalutes a model's ability to understand actions in context, which means predicting the type of turn (utterance, action, ending) that should happen next and identifying the relevant details.
Looking Forward
The paper was very well written, and I had an opportunity ask questions to Howard and folks at ASAPP, so I feel I have a pretty clear picture of this paper. As for next steps, I think there are certainly more benchmark papers worth reading (I'm now more interested in the multimodal papers that were mentioned into this paper. Primarily, I'm curious exactly why multimodal is useful and worth exploring in the dialogue systems space). However, I think it makes more sense at this point to take a look at some methodology papers that build on MultiWOZ and ABCD.