Task-Oriented Dialogue as Dataflow Synthesis
Number of Passes: ✅🎞️✅
Why this paper? + Context
I've mentioned my interest in the idea of uniting the best of symbolic/neural methods/representations before while reading Prof. Manning's journal entry on a brief history of natural language processing. During his Turing talk, Prof. Yoshua Bengio discusses the idea of designing intelligent systems that mimic both Type 1 and Type 2 thinking. Whereas Type 1 is based on instinct and intuitive pattern recognition, Type 2 refers to more slow, methodical, rule-driven thinking. If we are to move towards AGI (artificial general intelligence), Prof. Bengio argues that Type 2 thought must also be automated. I agree with this premise, so the question that naturally follows is "How?".
The emergence of neural networks during the early 2010s seems to represent the latest pendulum swing from symbolic oriented methods back to pure neural approaches, a familiar phenomenon that occurred multiple times as AI winters throughout the second half of the 20th century. In the 2010s, with machine translation as the primary motivation, numerous prior research directions focused on designing grammars and parsers that were fairly rigid and usually domain specific. Neural methods came back in favor for their ability to deal with ambiguity and uncertainty better than a rule-based approach would be able to (aside from the more outstanding reason of superior empirical performance). Neural methods also offer more flexibility for input modalities and relatively scale much better in the wild.
Today, the recent flurry of large language models and generative models have put forth incredibly impressive performance on increasingly complicated tasks from simple problems like part of speech recognition to more complicated tasks that an average human might not be able to do such as joke explanation. While these predominantly neural methods are powerful, they are not without their flaws. Such large architectures are not particularly interpretable or analyzable. There have been research initiatives dedicated to reverse engineering NN layers into explainable concepts, such as the Network Dissection paper for vision models. However, such approaches often only work for specific model architectures and require a variety of conditions (i.e. model must be trained, quality of interpretability depends on underlying dataset). This is not to say that such approaches are futile, just that it is difficult to wrangle such information out of neural methods; even when we can, it seems we are only able to quantify, but not qualify NN behaviors in an automated fashion. The opaqueness of neural methods makes it difficult to deploy in the wild or design natural language interfaces that are safe and reliable.
So we arrive at a decision to be made. The authors point out this contrast between neural and symbolic systems in the context of solving state and action tracking in dialogue systems:
Dialogue systems with fixed symbolic state representations (like slot filling systems) are easy to train but hard to extend (Pieraccini et al., 1992). Deep continuous state representations are flexible enough to represent arbitrary properties of the dialogue history, but so unconstrained that training a neural dialogue policy “end-to-end” fails to learn appropriate latent states (Bordes et al., 2016).
A potential alternative that I'd be interested in working on is merging the best of both the symbolic and neural worlds in the form of neurosymbolic systems. In practice, this field encompasses a variety of approaches with the general goal of searching over symbolic programs via a combination of neural and symbolic techniques (reminiscent of program synthesis). This concept is nothing new, with initial formulations going back to several decades ago. What makes now a good time to revisit this area is 1. The scale of data and compute and 2. The motivation to deploy AI systems that can work safely in the wild. The authors put it eloquently:
This paper introduces a new framework for dialogue modeling that aims to combine the strengths of both approaches: structured enough to enable efficient learning, yet flexible enough to support open-ended, compositional user goals that involve multiple tasks and domains.
So in one sentence: I'm interested in reading this paper because it is a practical example of addressing NLP tasks with a neurosymbolic system that was also deployed to a production, in-the-wild setting.
Contribution
There's a lot this paper offers - I'll do my best to focus on the changes that stood out to me. If you want a more in depth overview, the video I linked above is pretty fantastic.
► Overview
Generally, the paper remodels the way the dialogue system itself is structured. Traditional dialogue systems, both in research and in production, tend to be made up multiple components.
Our approach is significantly different from a conventional dialogue system pipeline, which has separate modules for language understanding, dialogue state tracking, and dialogue policy execution (Young et al., 2013).
This paper presents an alternative representation, the dataflow graph, that unifies action and state tracking, and that can be directly acted upon for natural language generation purposes. The dataflow graph itself serves as a ground truth for the shared context of a multi-turn dialogue. For each turn, the order of operations is, at a high level, as follows:
User creates an utterance
→ Natural Language Understanding: NLU to Dataflow Repr. model translates the query into updates to the dataflow graph
→ Natural Language Generation: Based on the graph, the model determines an appropriate response that results in 1. a natural language output communicated back to the user and 2. extensions to the dataflow graph.
The below image features a dataflow graph, where nodes represent either a function (that takes its parents in as inputs) or a terminal node of which there are two types -- a primitive value or a constructor. findEvent
is an example of a function while 'retreat'
, DateTimeSpec
, EventSpec
, 2021
, start
, and duration
are all examples of primitives.
The construction and extension of the dataflow graph is framed as a collaborative task between the human and the agent, where human utterances and the agent's response are all recorded as non-destructive extensions to the dataflow graph. Down the line, language understanding by an agent is cast as a dataflow prediction problem, using neural methods rather than traditional semantic parsing to convert utterances into extensible graph components.
The dataflow graph offers a representation that is superior to traditional slot filling paradigms because it allows for users to talk referentially rather than explicitly and make amendments to prior values or decisions. The Operators for NLU section talks more about the node types and meta-computations that allow for this.
► SMCalFlow Dataset
The groundwork for manifesting the vision of the dataflow model is to design a dataset with the following characteristics:
• Captures natural language queries presented by prior dialogue datasets
• Create richer, more realistic dialogue that includes new challenges oriented around referential language as detailed by the following observation:
Human speakers avoid repeating themselves in conversation by using anaphora, ellipsis, and bridging to build on shared context (Mitkov, 2014).
• Create language with additional challenges, including out-of-scope requests and references to values not present at the current turn in the dataflow graph, which would require falling back on an API or database
• Annotate each dialogue turn with a corresponding program + dataflow graph
► Operators for NLU
To model referential dependencies, allow for modification of prior values, and resolve missing information or ambiguities in the dataflow graph, the authors introduce a set of new, metacomputation operators. I feel that the original explanations are quite strong and I've linked them here:
• refer
operation (Video Discussion)
• revise
operation (Video Discussion)
• Exception Mechanism (Video Discussion)
► Response Generation
The goal here is to take the most recent modifications to the formal representation that is the dataflow model, and determine an appropriate natural language equivalent to communicate back to the user.
The generation model... conditions on a view of the graph rooted at the most recent return node, so generated responses can mention both the previously returned value and the computation that produced it.
The authors don't talk about much the model design specifics, but note that:
The dataset released with this paper includes output from a learned generation model that can describe the value computed at a previous turn, describe the structure of the computation that produced the value, and reference other nodes in the dataflow graph via referring expressions.
Notes
► On the responses to this paper
From the number of citations referring to this paper along with the SMCalFlow leaderboard, it seems that this work has been widely acknowledge, but not as extended upon. From reading a couple GitHub issues and poking around the codebase, it seems that there's quite a bit of barrier to entry to getting set up. To me, these seemed like the biggest needs:
- Open source version of the language generation model. While the dataset does contain input-output pairs that should allow for training, for practitioners that are more interested in either using SMCalFlow out of box with their own ontologies or would like a point of reference for how to design their own models.
- A simpler annotation schema. From the examples in the dataset, the annotations for the dataflow graph are quite dense. For those without backgrounds in programming languages or symbolic systems, this format might look much more complicated than those presented in the original paper. For instance, the
Wrapper
andYield
entities took me a bit of reading to understand, and even then I felt like they weren't entirely necessary. Drawing the focus more to the ontology rather than - There aren't any examples of executable functions that would appear in a dataflow graph, and I couldn't find an execution engine to show how such functions were executed. Although the traversal is explained well in the paper and likely not too convoluted to implement, it's another bump in the road for any parties interested in productionizing SMCalFlow or doing extended research.
- More documentation that describes defining a custom ontology, writing an execution engine, generating language from dataflow graphs, and general design decisions would be very helpful.
Note (11/16/2022): I came across this workshop paper and companion code by Joram Meron that argues for greater adoption of the paper. The author identifies some core road blocks and then puts forth contributions that help with simplifying the dataflow graph representation and its associated annotations.
► On the practicality of this paper
Aside from SMCalFlow, there of course have been plenty of research advancements across the stack of dialogue systems, from fairly solved problems such a speech-to-text and language understanding, to more current challenges including language generation and interfaces. However, aside from large tech companies with a war chest of funding or start-ups with an explicit focus on language agents, we haven't seen such contributions make their way into mainstream production system, despite a definite enthusiasm for chat bots and virtual assistants. They are still too complicated for a ML practitioner or software engineer to pick up a system and run with it. Why?
I'm sure there are a variety of reasons for this beyond my understanding, but within my scope of knowledge, I believe one bottleneck is the use of neural approaches. Let's say you're tasked with building a new task-oriented dialogue system. The order of events, if one is to employ neural approaches for a component of the dialogue system, might be as follows:
You don't have dialogue examples readily available → It's expensive to collect such examples → In this few/zero-shot setting, you could fine tune or prompt an LLM, let's say for generation → It's very likely that language responses will suffer from data leakage and hallucinate. → If it does generate something undesirable, how do you identify the reason or correct the model?
There's definitely work in the space of controlled / constrained generation that looks at how to subdue such phenomenon [1] with probabilistic sampling schemes or smart prompt design. I don't know enough about this space to praise or criticize in an educated manner, so this next opinion should definitely be taken with a large grain of salt, but from a general read through of the cited blog post, none of these approaches involve directly debiasing or denoising the LLM itself. As a result, none of these can guarantee that certain behaviors will not occur with 0% chance. Instead, it seems to be probabilistic work arounds.
I can't speak to forthcoming research directions that could very well achieve those guarantees due to my lack of domain knowledge. I will say that such approaches, in their current state, seem 1. quite difficult to implement generally and 2. again, don't provide hard guarantees that certain output will definitively not occur.
I think interpretability and controllable are absolutely critical to successful and safe deployment of language models in the wild. While the generalizability of LLMs is laudable, I feel that for real world use cases, what's more important is the ability to quickly develop and deploy a dialogue system that is reliable and safe due to its interpretability and repairability. For people to adopt such systems, I think these are the properties that should be guaranteed, and making good on those guarantees feels much more realistic with neurosymbolic systems, despite its inferior generalizability and greater amount of engineering.
► On the engineering behind the paper
Before becoming interested in academia, I was very invested in the goal of becoming a strong engineer during my undergraduate years. That mission has stuck with me as I venture into research. For this reason, I'm a big fan of this paper due to the amount of engineering work and care that went into not only the paper itself, but also the presentation. From an engineering stand point, I feel like this paper definitely put forth a challenging code base, as there are a lot of moving parts. However, the authors then provided multiple mediums for learning about their work, from the original paper to the SMLCalFlow leaderboard to a video presentation that incorporated the whole team. I found the video incredibly helpful and engaging for understanding the paper at a high level, which is why I added a little video icon between the checks on the number of passes. Reading through the second time after watching the video was really fruitful. It would be cool if authors putting out 5-10 minute videos (alongside the original paper) that gave a high level overview of the whole paper was a more normalized practice, beyond just for conferences.
Citations
[1] Weng, Lilian. (Jan 2021). Controllable neural text generation. Lil’Log. https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/.