Clippers 3/28: Amad Hussain on Improving Training with Imbalanced Datasets

Tackling Training with Imbalanced Datasets: An Investigation of MixUp and Paraphrase Augmentation for Downstream Classification

Low-resource dialogue systems often contain a high degree of few-shot class labels, leading to challenges in utterance classification performance. A possible solution is data augmentation through paraphrase generation, but this method has the potential to introduce harmful data points in form of low-quality paraphrases. We explore this challenge as a case-study using a virtual patient dialogue system, which contains a long-tail distribution of few-shot labels. In previous work, we investigated the efficacy of paraphrase augmentation using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques using Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques were validated through training and evaluation of a downstream self-attentive RNN model with and without MixUp (embedding interpolation during training). The results were mixed and indicated a trade-off between reduction of misleading paraphrases and paraphrase diversity.

In this talk, I will go over potential training paradigms and paraphrase filtration mechanisms which expand on this previous work. Ideas range from example sampling techniques, variable-loss during MixUp, and paraphrase filtration using training loss. The hope is that one, or some combination, of these methods will improve model generalizability and class-imbalanced training. The obvious direction is not clear so feedback on these directions will be much appreciated!