Clippers 3/28: Amad Hussain on Improving Training with Imbalanced Datasets

Tackling Training with Imbalanced Datasets: An Investigation of MixUp and Paraphrase Augmentation for Downstream Classification

Low-resource dialogue systems often contain a large proportion of few-shot class labels, which hurts utterance classification performance. A possible solution is data augmentation through paraphrase generation, but this method has the potential to introduce harmful data points in the form of low-quality paraphrases. We explore this challenge as a case study using a virtual patient dialogue system, which has a long-tail distribution of few-shot labels. In previous work, we investigated the efficacy of paraphrase augmentation using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques based on Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques were validated by training and evaluating a downstream self-attentive RNN model with and without MixUp (embedding interpolation during training). The results were mixed and indicated a trade-off between the reduction of misleading paraphrases and paraphrase diversity.
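For readers unfamiliar with MixUp, the following is a minimal sketch of embedding-level interpolation in PyTorch. The function and variable names are illustrative only and are not taken from the system described above; the classifier and soft-target loss at the end are assumptions about a typical setup.

```python
import torch
import torch.nn.functional as F

def mixup_embeddings(emb, labels, num_classes, alpha=0.2):
    """Interpolate a batch of utterance embeddings and their one-hot labels.

    emb:    (batch, dim) utterance embeddings
    labels: (batch,) integer class labels
    Returns mixed embeddings and soft label targets.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(emb.size(0))            # pair each example with a random partner
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_emb = lam * emb + (1.0 - lam) * emb[perm]
    mixed_lab = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_emb, mixed_lab

# Training then uses a soft-target loss, e.g. for a hypothetical `classifier`:
# loss = -(mixed_lab * F.log_softmax(classifier(mixed_emb), dim=-1)).sum(-1).mean()
```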

In this talk, I will go over potential training paradigms and paraphrase filtration mechanisms that expand on this previous work. Ideas include example sampling techniques, variable loss weighting during MixUp, and paraphrase filtration using training loss. The hope is that one of these methods, or some combination of them, will improve model generalizability and class-imbalanced training. The best direction is not obvious, so feedback on these ideas will be much appreciated!
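One possible reading of "paraphrase filtration using training loss" is sketched below: drop the augmented paraphrases whose per-example loss is unusually high under the current model, on the assumption that a high loss signals a label-violating paraphrase. The model, the keep fraction, and the direction of filtering are all assumptions for illustration, not the method proposed in the talk.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_paraphrases_by_loss(model, emb, labels, keep_fraction=0.8):
    """Keep the paraphrases the current model finds least surprising.

    emb:    (n, dim) embeddings of augmented paraphrases
    labels: (n,) class labels inherited from the seed utterances
    """
    logits = model(emb)
    per_example_loss = F.cross_entropy(logits, labels, reduction="none")
    k = int(keep_fraction * emb.size(0))
    keep_idx = per_example_loss.argsort()[:k]     # retain the lowest-loss examples
    return emb[keep_idx], labels[keep_idx]
```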

Clippers 3/21: Vishal Sunder on end-to-end word-level disfluency detection and classification in children’s reading assessment

Title: End-to-end word-level disfluency detection and classification in children’s reading assessment.

Abstract: Disfluency detection and classification on children's speech have great potential for teaching reading skills. Word-level assessment of children's speech can help teachers effectively gauge their students' progress. Hence, we propose a novel attention-based model that performs word-level disfluency detection and classification in a fully end-to-end (E2E) manner, making it fast and easy to use. We develop a word-level disfluency annotation scheme, which we use to annotate a dataset of children's read speech, the Reading Races dataset (READR). We also annotate disfluencies in the existing CMU Kids corpus. The proposed model significantly outperforms traditional cascaded baselines, which rely on forced alignments, on both datasets. To deal with the inevitable class imbalance in the datasets, we propose a novel technique called HiDeC (Hierarchical Detection and Classification), which yields relative F1-score improvements of 23% and 16% for detection and 3.8% and 19.3% for classification on the READR and CMU Kids datasets, respectively.
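The abstract does not spell out HiDeC's architecture; the sketch below is only one plausible reading of a hierarchical detect-then-classify head, in which a binary disfluency detector is trained on every word while the type classifier is trained only on words labelled disfluent. It is a guess for illustration, not the HiDeC implementation from the talk.

```python
import torch
import torch.nn as nn

class DetectThenClassifyHead(nn.Module):
    """Illustrative two-stage head: detect whether a word is disfluent,
    then classify the disfluency type."""
    def __init__(self, hidden_dim, num_disfluency_types):
        super().__init__()
        self.detector = nn.Linear(hidden_dim, 2)                    # fluent vs. disfluent
        self.classifier = nn.Linear(hidden_dim, num_disfluency_types)

    def forward(self, word_states):                                 # (batch, words, hidden_dim)
        detect_logits = self.detector(word_states)
        type_logits = self.classifier(word_states)
        return detect_logits, type_logits

# The detection loss can be computed on every word, while the type loss is
# restricted to disfluent words, softening the impact of class imbalance.
```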