Clippers 9/5: Michael White on Bootstrapping a Conversational Guide for Colonoscopy Prep (Arya et al., SIGDIAL-23)

Pulkit Arya, Madeleine Bloomquist, Subhankar Chakraborty, Andrew Perrault, William Schuler, Eric Fosler-Lussier, and Michael White. 2023. Bootstrapping a Conversational Guide for Colonoscopy Prep. To appear in Proc. SIGDIAL-23.

Creating conversational systems for niche domains is a challenging task, further exacerbated by a lack of quality datasets. We explore the construction of safer conversational systems for guiding patients in preparing for colonoscopies. This has required a data generation pipeline to generate a minimum viable dataset to bootstrap a semantic parser, augmented by automatic paraphrasing. Our study suggests large language models (e.g., GPT-3.5 & GPT-4) are a viable alternative to crowdsourced paraphrasing, but conversational systems that rely upon language models’ ability to do temporal reasoning struggle to provide accurate responses. A neural-symbolic system that performs temporal reasoning on an intermediate representation of user queries shows promising results compared to an end-to-end dialogue system, improving the number of correct responses while vastly reducing the number of incorrect or misleading ones.

Clippers 8/29: Ash Lewis on Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide

Title: Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide

Authors: Ashley Lewis and Michael White


LLMs are known to be very powerful, exhibiting both great benefits and great risk. We seek to leverage the benefits, in particular the ability to be fluent, conversational dialogue agents, while minimizing the risks, such as hallucination and toxic content. In this work we use knowledge distillation to create a virtual museum tour guide dialogue agent, employing ChatGPT as a teacher model for a smaller student model, T5-large. We find the T5 model shows competitive performance, significantly reduces instances of hallucination, and shows promise for reducing toxic content.

Clippers 4/18: Jingyi Chen on Multi-Source Morphological Reinflection with Reinforcement Learning

Multi-Source Morphological Reinflection with Reinforcement Learning

This project uses reinforcement learning to guide multi-source morphological reinflection (MRI). MRI is the task of transforming words from one inflectional form to another. For example, when encountering a new inflected form of a word, humans may rely on their knowledge of the morphological rules of the language, as well as their experience with similar forms in the past, to infer the correct inflection. In their 2017 study, Kann and coauthors develop a multi-source MRI model, which receives a target tag and multiple pairs of source form and source tag for a lemma. Their model outperforms single-source reinflection models because different source forms can provide complementary information. Although Kann and coauthors do not provide specific details on how the multiple source form-tag pairs are chosen, selecting appropriate source form-tag pairs as reference words is key to modeling morphological reinflection. Our project uses reinforcement learning to select reference words during the morphological reinflection process. Specifically, an RL agent could learn to select the appropriate source form and tag pair based on the context of the lemma and the morphological features, as well as its experience with similar examples in the past. This is similar to the way humans select the appropriate inflected form based on context and their past experience with the language. Since this project is still ongoing, I would greatly appreciate any suggestions or feedback.

Clippers 4/11: Alyssa Allen on Line-by-Line Comment Generation for SQL

This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. Prior research in SQL comment-generation has focused on comments which summarize entire SQL queries and translations of SQL to templated English (Eleftherakis et al., 2021; Narechania et al., 2021). These approaches can be helpful in comprehending SQL but are limited in their ability to guide users through the query steps and connect formal notation with intuitive concepts. To address this limitation, the project aims to generate line-by-line comments that leverage language from user questions, connecting formal SQL notation with user-friendly concepts (e.g. “tallest” or “alphabetical order”).

Due to a lack of pre-existing training data, 100 SQL queries from the SPIDER dataset (Yu et al., 2018) have been manually annotated. These 100 examples will then be used as a base for generating a more robust training set through self-training and prompting. I have been experimenting with using ChatGPT to generate comments for more queries as well as fine-tuning BART for the task. This approach will allow us to scale the training set quickly and minimize time spent writing comments by hand. This presentation will discuss the annotation process and preliminary results for comment generation using the above methods.

Clippers 3/28: Amad Hussain on Improving Training with Imbalanced Datasets

Tackling Training with Imbalanced Datasets: An Investigation of MixUp and Paraphrase Augmentation for Downstream Classification

Low-resource dialogue systems often contain a high degree of few-shot class labels, leading to challenges in utterance classification performance. A possible solution is data augmentation through paraphrase generation, but this method has the potential to introduce harmful data points in the form of low-quality paraphrases. We explore this challenge as a case study using a virtual patient dialogue system, which contains a long-tail distribution of few-shot labels. In previous work, we investigated the efficacy of paraphrase augmentation using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques using Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques were validated through training and evaluation of a downstream self-attentive RNN model with and without MixUp (embedding interpolation during training). The results were mixed and indicated a trade-off between reduction of misleading paraphrases and paraphrase diversity.
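As a rough illustration of the embedding interpolation that MixUp refers to, here is a minimal sketch in plain Python. It is not the implementation used in this work: the function name `mixup`, the `alpha` value, and the use of one-hot label vectors are all assumptions made for the example, following the commonly described recipe of drawing a mixing weight from a Beta distribution.

```python
import random


def mixup(emb_a, label_a, emb_b, label_b, alpha=0.4, rng=random):
    """Interpolate two embedded examples and their one-hot labels.

    alpha is a hypothetical hyperparameter; the mixing weight is drawn
    from Beta(alpha, alpha), as in the usual MixUp formulation.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed_emb = [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_emb, mixed_label, lam


# Toy usage: mix an example of class 0 with an example of class 1.
rng = random.Random(0)
emb, label, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

The mixed label stays a valid distribution over classes, so a rare class can contribute partial supervision to many interpolated training points.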

In this talk, I will go over potential training paradigms and paraphrase filtration mechanisms which expand on this previous work. Ideas include example sampling techniques, variable loss during MixUp, and paraphrase filtration using training loss. The hope is that one, or some combination, of these methods will improve model generalizability and class-imbalanced training. No single direction is obviously best, so feedback on these ideas will be much appreciated!

Clippers 3/21: Vishal Sunder on end-to-end word-level disfluency detection and classification in children’s reading assessment

Title: End-to-end word-level disfluency detection and classification in children’s reading assessment.

Abstract: Disfluency detection and classification on children’s speech has great potential for teaching reading skills. Word-level assessment of children’s speech can help teachers to effectively gauge their students’ progress. Hence, we propose a novel attention-based model to perform word-level disfluency detection and classification in a fully end-to-end (E2E) manner, making it fast and easy to use. We develop a word-level disfluency annotation scheme and use it to annotate a dataset of children’s read speech, the reading races dataset (READR). We also annotate disfluencies in the existing CMU Kids corpus. The proposed model significantly outperforms traditional cascaded baselines, which use forced alignments, on both datasets. To deal with the inevitable class imbalance in the datasets, we propose a novel technique called HiDeC (Hierarchical Detection and Classification), which yields a detection improvement of 23% and 16% and a classification improvement of 3.8% and 19.3% relative F1-score on the READR and CMU Kids datasets, respectively.

Clippers 2/21: Ash Lewis on Improving Generated Responses of a COSI Language Pod Guide

In Clippers on Tuesday, I’m going to present on the beginning stages of a new project. I’m attempting to design a response generation model for the COSI museum avatar — a virtual question-answering guide at the Language Pod that can answer questions about the pod, linguistics, and other exhibits at COSI. Currently, the avatar, which is modeled after the Virtual Patient project, returns “canned” responses to questions, meaning that it has prescribed, static answers for a set of in-domain questions to which it tries to match user inputs. This can result in a fairly unnatural conversation; if the avatar interprets two utterances as the same question, it will repeat the exact same answer. The goal of my current project is to migrate to using a response generation model that will be more contextually aware and answer questions dynamically, but also adapt to constant changes in content as exhibits in the museum change. To do so, I’m attempting to leverage the capabilities of OpenAI’s ChatGPT to generate training data for a smaller model that will hopefully avoid the pitfalls of LLMs such as toxic behavior. The plan is to eventually train a document-grounded generation model that responds directly to user inputs rather than needing to first map them to prescribed questions. This project is in the early exploratory phases, so I’m hoping to get lots of feedback on design choices and suggestions for other avenues to explore.

Clippers 2/14: Shuaichen Chang on Selective Demonstration for Text-to-SQL

Abstract: Large language models (LLMs) have shown a strong generalization capability in the cross-domain text-to-SQL task without using in-domain examples. However, with a few in-domain annotations as demonstration examples, LLMs’ performance can be further improved. In this work, we first investigate the crucial elements of in-domain examples. Based on our findings, we propose to create demonstration examples with minimal in-domain annotation to improve the generalization ability of LLMs.

Clippers 2/7: Byung-Doh Oh on decomposing autoregressive LM hidden states

While there is much recent interest in studying why Transformer-based large language models make predictions the way they do, the complex computations performed within each layer have traditionally posed a strong bottleneck. To mitigate this shortcoming, this work presents a linear decomposition of final hidden states from autoregressive language models based on each initial input token, which is exact if the activation function is piecewise linear. This decomposition allows the definition of probability distributions that ablate the contribution of input tokens, which can be used to analyze their influence on model probabilities over a sequence of upcoming words with only one forward pass from the model. Using the change in next-word probabilities as a measure of importance, this work examines which context words make the biggest contribution to language model predictions. Regression experiments suggest that Transformer-based language models rely primarily on collocational associations, followed by linguistic factors such as syntactic dependencies and coreference relationships in making next-word predictions. Additionally, analyses using these measures to predict syntactic dependencies and coreferent mention spans show that collocational association and repetitions of the same token largely explain the language model’s predictions on the respective tasks.
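To give some intuition for why a piecewise-linear activation makes such a decomposition exact, here is a toy sketch in plain Python. It is not the method from the talk: it uses a single ReLU layer applied to a sum of token vectors (an assumption for illustration, with made-up weights), whereas the actual work handles full Transformer layers. Once the set of active ReLU units is fixed by the forward pass, the layer acts as a linear map, so its output splits into one term per input token plus a bias term.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def relu_layer(W, b, x):
    return [max(0.0, z) for z in add(matvec(W, x), b)]


def decompose(W, b, token_vecs):
    """Split relu(W @ sum(tokens) + b) into per-token terms plus a bias term."""
    total = [sum(col) for col in zip(*token_vecs)]
    pre = add(matvec(W, total), b)
    mask = [1.0 if z > 0 else 0.0 for z in pre]  # active ReLU units
    per_token = [[m * z for m, z in zip(mask, matvec(W, t))] for t in token_vecs]
    bias_term = [m * bi for m, bi in zip(mask, b)]
    return per_token, bias_term


# Hypothetical weights and token vectors, chosen only for illustration.
W = [[1.0, -2.0], [0.5, 1.0]]
b = [-0.5, -0.3]
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]]

per_token, bias_term = decompose(W, b, tokens)
output = relu_layer(W, b, [sum(col) for col in zip(*tokens)])
recon = [sum(col) for col in zip(*per_token, bias_term)]
```

Here `recon` matches `output` up to floating-point error, and dropping one entry of `per_token` before summing gives a simple picture of ablating that token's contribution.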

Clippers 1/31: David Palzer on N-Pathic Speaker Diarization

Title: N-Pathic Speaker Diarization
Abstract: Speaker diarization is mainly studied through clustering speaker embeddings. However, the clustering approach has two major limitations: it doesn’t minimize diarization errors and can’t handle speaker overlaps. To address these problems, End-to-End Neural Diarization (EEND) was introduced. The Encoder-Decoder-Attractor (EDA) was also proposed for recordings with unknown speaker count. In this paper, we present two improvements: (1) N-Pathic, a base model that uses chunked data to reduce attention mechanism length and memory usage, and (2) an improved EDA architecture with increased data efficiency through non-sequence-dependent modules. Our proposed method was evaluated on simulated mixtures, real telephone calls, and real dialogue recordings.
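
The memory argument for chunking can be illustrated with a back-of-the-envelope count. The sketch below is not taken from the paper: it assumes that self-attention is computed independently within non-overlapping chunks, and the function name `attention_score_entries` is hypothetical. How N-Pathic actually chunks its input is not described in the abstract.

```python
def attention_score_entries(num_frames, chunk_size=None):
    """Count pairwise attention scores for full vs. chunked self-attention.

    With no chunking, every frame attends to every frame (quadratic cost).
    With chunking, each chunk is assumed to attend only within itself.
    """
    if chunk_size is None:
        return num_frames * num_frames
    total = 0
    for start in range(0, num_frames, chunk_size):
        length = min(chunk_size, num_frames - start)  # last chunk may be short
        total += length * length
    return total


full = attention_score_entries(1000)
chunked = attention_score_entries(1000, chunk_size=100)
```

Under these assumptions, a 1000-frame recording needs 1,000,000 score entries with full attention but 100,000 with chunks of 100 frames, so the cost grows linearly rather than quadratically with recording length.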