GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation
Number of Passes: ✅
Why this paper?
I'm currently wrapping up the tail end of an internship at ASAPP and have greatly enjoyed my role thus far. After beginning to work on NLP earlier this year, ASAPP has been a great place to see many subdomains be applied to production settings. ASAPP's core B2C business involves selling tools to enhance the productivity of customer service representatives via a suite of software tools that serve a variety of purposes, including identifying a customer's intent, summarizing a conversation, generating in context responses in a dialogue setting, and transcribing recorded calls via speech-to-text.
In a production setting, one of the most significant and omnipresent considerations by teams across the engineering stack was dealing with out of distribution scenarios. In a dialogue setting where tolerance for mistakes is low and agents are under the gun to resolve issues quickly and correctly, identifying unseen cases is important for preventing cascading errors and refining the system to perform better at core tasks in the future.
From my side, I'm also interested in the general idea of how language agents can learn to express and settle their uncertainties, particularly by requesting and leveraging advice from interacting with humans. Deploying an agent in a production setting that is readily equipped to deal with anything and everything thrown its way seems like an enigma - in this case, to prevent catastrophic errors and design systems that are able to self improve from external feedback, this ability seems particularly important.
This is a very high level idea that likely varies greatly in methodology and implementation across domains, so within my current situation, thinking about how language agents might be equipped to do this in a task oriented dialogue setting feels like an appropriate starting point.
Context
Out of scope detection is a well researched problem. Out of scope typically refers to either out-of-distribution issues (a.k.a. situations not covered in training) or dialogue breakdowns (a.k.a. system fails to continue responding due to earlier ambiguities or misunderstandings in the conversation). To deal with such scenarios, prior work has subscribed to three schools of thought:
- Train a core model for OOS detection on a sufficient amount of labeled OOS data. -> Sufficient prerequisite is unrealistic, so it fails in open world settings.
- Building a model for verifying that new data is in scope -> Not exactly the same objective (i.e. opposite of determining if
x
is in-distribution ≠ determining ifx
is OOS). - Augmenting in-scope data to improve out-of-domain robustness -> Since in-distribution and OOS data reflect separate distributions, augmentation on in-scope for the purposes of OOS detection is still somewhat misaligned.
The Generating Out-of-scope Labels with Data paper fits nicely as a fourth school of thought - Augmenting out-of-scope data, addressing the issue attached to approach 1 to make OOS prediction more reliable. The main goal of the method put forth is to create pseudo-labeled, out of scope examples.
There has been some prior work in this space.
• Data augmentation for improving NLU and intent detection in a dialogue setting
• Augment in-scope samples for bolstering robustness against out-of-scope scenarios
• Using GANs to create out-of-domain examples from in-scope examples
In contrast, we operate directly on OOS samples and consciously generate data far away from anything seen during pre-training, a decision which our later analysis reveals to be quite important.
Contribution
Formulation of OOS prediction
• Direct Prediction: Model is treated as an OOS detector, which learns p(y|x)
. The input x
= {(S1, U1),...,(St, Ut)} specifies a dialogue, where S
and U
are system and user utterances respectively; y
is 0 or 1 (in scope or out of scope).
• Indirect Prediction: Since OOS examples are limited during training, the model is treated as an intent classifier. In this case, the model learns p(y|x)
, where x
refers to a dataset of in-scope dialogues and y
is a multi-class set of labels referring to known user intents. The supporting model does not encounter out-of-scope utterances during training.
GOLD is designed for indirect prediction methods. The authors briefly discuss baseline methods across three categories -- probability threshold, outlier distance, and Bayesian ensemble -- to establish how such baseline methods operate in the indirect prediction setting and use the OOS examples.
GOLD
Concretely, GOLD performs data augmentation on a small sample of labeled OOS examples to generate pseudo-OOS data. This weakly-labeled data is then combined with INS data for training a core OOS detector.
The authors assume the following conditions in this setting that's meant to reflect realistic conditions:
• Limit number of OOS samples to just 1% of number of in-scope training examples.
• Access to external pool of utterances that serves as source of data augmentations, denoted as source dataset S
.
GOLD is then carried out in three steps:
1. Match Extraction
• Purpose: Find utterances in source data that closely matches examples in original OOS seed data.
• How: Encode source + seed data into shared embedding space. Per seed utterance, extract d
nearest utterances (a "match") from source dataset S
measured by cosine distance.
2. Candidate Generation
• Purpose: Create new conversations with existing dialogue context, but that also incorporates the utterances similar to OOS data.
• How: Generate new candidate (for OOS example) by swapping a random user utterance in seed data with a match utterance from source data.
3. Target Election
• Purpose: Out of the new candidates, choose the ones that are most likely to be OOS, and therefore, used as target OOS data.
• How: Evaluate candidates against an ensemble of three baseline detectors (a.k.a. prior indirect prediction methods) and elect those that receive a majority vote.
As a last step, we aggregate the pseudo-labeled OOS examples, the small seed set of known OOS examples and the original INS examples to form the final training set for our model.
Experimentation
The authors apply GOLD's data augmentation and out-of-scope classification techniques to the STAR, SMCalFlow, and ROSTD datasets, using three evaluation metrics:
• AUROC: Probability that random OOS example has higher probability of being out-of-scope compared to a random in-scope example (very meta!)
• AUPR: Summarize performance across multiple thresholds, especially when there is imbalance (# of INS examples >> # of OOS examples)
• FPR@N: Probability that INS example raises a false alarm when N% of OOS examples are detected. This checks whether a model might decide to be overly cautious and in the process, misclassify in-scope values as OOS.
Higher is better for the first two stats while lower is better for FPR@N.
The results of the evaluation reflect that...
Models trained with augmented data from GOLD consistently outperform all other baselines across all metrics.
An ablation study that compares AUROC per dataset against d
, the number of examples generated per seed example, indicates that the inclusion of OOS data generated via augmentation does help. With that said, the authors note that while indirect prediction is better for more realistic settings, from a performance standpoint, a model trained with the core objective of OOS detection on a sufficient amount of OOS data still performs absolutely better, indicating that while such augmentation works, there still may be cases that are not captured by GOLD's data augmentation.
Looking Forward
I thought this paper was well written and enjoyed reading it. I particularly liked the authors' thoroughness when describing the GOLD methodology. They started off with a high level summary, then detailed the three steps more thoroughly. The diagrams were also very self-explanatory, and I liked that the ablations very purposefully tested plausible decisions they could've made at each step. I think research contributions like GOLD are very much what I would like to work on in the coming years.
As for next steps, I felt like this paper did not quite fall in line with my initial motivation. Modeling and acting on uncertainty will likely require reading more on model design papers in the dialogue systems and instruction following spaces. With that said, this paper was very helpful in addressing what out-of-scope formally means along with a quick practical overview of detection methodologies.