Clippers 9/26: Christian Clark on categorial grammar induction

Toward Categorial Grammar Induction Using Predicate Co-occurrences from RoBERTa

Recent experiments with large language models (LLMs) have produced tantalizing evidence that innate knowledge is not needed to acquire language. Even so, LLMs do not directly reveal what categories and rules are learned, limiting their utility in explaining human language acquisition. Grammar induction models, in contrast, provide a more explicit means of exploring questions about learnability. Recent work has achieved advances in unsupervised induction of probabilistic context-free grammars (PCFGs). However, categorial grammar induction has received less recent attention, despite its appealing properties such as a transparent syntax–semantics interface. Motivated by this, I will present a set of experiments using a new model that induces a basic categorial grammar. I will also describe some first steps toward an extension to the model that will incorporate predicate co-occurrence information extracted from RoBERTa, as a means of leveraging world knowledge from an LLM within a model that learns explicit rules. I am especially interested in hearing the group’s suggestions for this ongoing work.

Clippers 9/19: Byung-Doh Oh on the bigger-is-worse effect of LLM surprisal

A feature attribution analysis of the bigger-is-worse effect of large language model surprisal

Byung-Doh Oh, William Schuler

Recent studies have consistently shown that surprisal estimates from ‘bigger’ large language model (LLM) variants with more parameters and lower perplexity are less predictive of comprehension difficulty that manifests in human reading times, which highlights a fundamental mismatch between the mechanistic processes underlying LLMs and human sentence processing. This work will present preliminary results from a feature attribution analysis that sheds light on such systematic divergence of LLMs by examining how different variants leverage identical context tokens, including observations that 1) perturbation-based feature attribution methods and 2) feature interactions over multiple tokens may be more appropriate for examining bigger LLM variants.
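For readers new to the term, surprisal is conventionally defined as the negative log probability a language model assigns to a word given its preceding context. The sketch below illustrates that standard definition only; it is not the analysis code from this work. The function names are hypothetical, and the choice of base-2 logarithms (bits) is an assumption.

```python
import math


def surprisal_bits(prob):
    """Surprisal of one token, in bits, given the model's probability for it."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def word_surprisal_bits(subword_probs):
    """Surprisal of a word split into subword tokens.

    Summing token surprisals is equivalent to taking the surprisal of the
    product of the conditional token probabilities (chain rule).
    """
    return sum(surprisal_bits(p) for p in subword_probs)
```

A token assigned probability 0.25 has a surprisal of 2 bits, so less expected words receive higher values.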

Clippers 9/5: Michael White on Bootstrapping a Conversational Guide for Colonoscopy Prep (Arya et al., SIGDIAL-23)

Pulkit Arya, Madeleine Bloomquist, Subhankar Chakraborty, Andrew Perrault, William Schuler, Eric Fosler-Lussier, and Michael White. 2023. Bootstrapping a Conversational Guide for Colonoscopy Prep. To appear in Proc. SIGDIAL-23.

Creating conversational systems for niche domains is a challenging task, further exacerbated by a lack of quality datasets. We explore the construction of safer conversational systems for guiding patients in preparing for colonoscopies. This has required a data generation pipeline to generate a minimum viable dataset to bootstrap a semantic parser, augmented by automatic paraphrasing. Our study suggests large language models (e.g., GPT-3.5 & GPT-4) are a viable alternative to crowd sourced paraphrasing, but conversational systems that rely upon language models’ ability to do temporal reasoning struggle to provide accurate responses. A neural-symbolic system that performs temporal reasoning on an intermediate representation of user queries shows promising results compared to an end-to-end dialogue system, improving the number of correct responses while vastly reducing the number of incorrect or misleading ones.

Clippers 8/29: Ash Lewis on Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide

Title: Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide

Authors: Ashley Lewis and Michael White


LLMs are known to be very powerful, exhibiting both great benefits and great risk. We seek to leverage the benefits, in particular the ability to be fluent, conversational dialogue agents, while minimizing the risks, such as hallucination and toxic content. In this work we use knowledge distillation to create a virtual museum tour guide dialogue agent, employing ChatGPT as a teacher model for a smaller student model, T5-large. We find the T5 model shows competitive performance, significantly reduces instances of hallucination, and shows promise for reducing toxic content.

Clippers 4/18: Jingyi Chen on Multi-Source Morphological Reinflection with Reinforcement Learning

Multi-Source Morphological Reinflection with Reinforcement Learning

This project uses reinforcement learning to guide multi-source morphological reinflection (MRI). MRI is the task of transforming words from one inflectional form to another. For example, when encountering a new inflected form of a word, humans may rely on their knowledge of the morphological rules of the language, as well as their experience with similar forms in the past, to infer the correct inflection. Kann and coauthors (2017) develop a multi-source MRI model, which receives a target tag and multiple pairs of source form and source tag for a lemma. Their model is found to outperform single-source reinflection models, as different source forms can provide complementary information. Although Kann and coauthors do not provide specific details on how the multiple source form-tag pairs are chosen, selecting appropriate source form-tag pairs as reference words is key to modeling morphological reinflection. Our project uses reinforcement learning to select reference words during the morphological reinflection process. Specifically, an RL agent could learn to select the appropriate source form and tag pair based on the context of the lemma and the morphological features, as well as its experience with similar examples in the past. This is similar to the way humans select the appropriate inflected form based on context and their past experience with the language. Since this project is still ongoing, I would greatly appreciate any suggestions or feedback.

Clippers 4/11: Alyssa Allen on Line-by-Line Comment Generation for SQL

This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. Prior research in SQL comment-generation has focused on comments which summarize entire SQL queries and translations of SQL to templated English (Eleftherakis et al., 2021; Narechania et al., 2021). These approaches can be helpful in comprehending SQL but are limited in their ability to guide users through the query steps and connect formal notation with intuitive concepts. To address this limitation, the project aims to generate line-by-line comments that leverage language from user questions, connecting formal SQL notation with user-friendly concepts (e.g. “tallest” or “alphabetical order”).

Due to a lack of pre-existing training data, 100 SQL queries from the SPIDER dataset (Yu et al., 2018) have been manually annotated. These 100 examples will then be used as a base for generating a more robust training set through self-training and prompting. I have been experimenting with using ChatGPT to generate comments for more queries as well as fine-tuning BART for the task. This approach will allow us to scale the training set quickly and minimize time spent writing comments by hand. This presentation will discuss the annotation process and preliminary results for comment generation using the above methods.

Clippers 3/28: Amad Hussain on Improving Training with Imbalanced Datasets

Tackling Training with Imbalanced Datasets: An Investigation of MixUp and Paraphrase Augmentation for Downstream Classification

Low-resource dialogue systems often contain a large number of few-shot class labels, leading to challenges in utterance classification performance. A possible solution is data augmentation through paraphrase generation, but this method has the potential to introduce harmful data points in the form of low-quality paraphrases. We explore this challenge as a case study using a virtual patient dialogue system, which contains a long-tail distribution of few-shot labels. In previous work, we investigated the efficacy of paraphrase augmentation using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques using Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques were validated through training and evaluation of a downstream self-attentive RNN model with and without MixUp (embedding interpolation during training). The results were mixed and indicated a trade-off between reduction of misleading paraphrases and paraphrase diversity.
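As a rough illustration of what "embedding interpolation during training" can look like, here is a minimal sketch of the standard MixUp recipe applied to two utterance embeddings and their one-hot labels. It is not the implementation used in this work: the function names, the plain-list vector representation, and the Beta(0.2, 0.2) sampling of the mixing weight are all assumptions made for illustration.

```python
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha); alpha=0.2 is an assumed default."""
    return rng.betavariate(alpha, alpha)


def mixup(x_a, y_a, x_b, y_b, lam):
    """Interpolate two embeddings and their one-hot labels with weight lam.

    Returns lam * a + (1 - lam) * b for both the embedding and the label.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired vectors must have the same length")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y
```

In training, each example in a batch would typically be paired with another randomly chosen example and a fresh weight from `sample_lambda()`, and the model would be trained against the resulting soft label.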

In this talk, I will go over potential training paradigms and paraphrase filtration mechanisms which expand on this previous work. Ideas include example sampling techniques, a variable loss during MixUp, and paraphrase filtration using training loss. The hope is that one, or some combination, of these methods will improve model generalizability and class-imbalanced training. The best direction is not yet clear, so feedback on these directions will be much appreciated!

Clippers 3/21: Vishal Sunder on end-to-end word-level disfluency detection and classification in children’s reading assessment

Title: End-to-end word-level disfluency detection and classification in children’s reading assessment.

Abstract: Disfluency detection and classification on children’s speech has great potential for teaching reading skills. Word-level assessment of children’s speech can help teachers to effectively gauge their students’ progress. Hence, we propose a novel attention-based model to perform word-level disfluency detection and classification in a fully end-to-end (E2E) manner, making it fast and easy to use. We develop a word-level disfluency annotation scheme and use it to annotate a dataset of children’s read speech, the reading races dataset (READR). We also annotate disfluencies in the existing CMU Kids corpus. The proposed model significantly outperforms traditional cascaded baselines, which use forced alignments, on both datasets. To deal with the inevitable class imbalance in the datasets, we propose a novel technique called HiDeC (Hierarchical Detection and Classification), which yields a detection improvement of 23% and 16% and a classification improvement of 3.8% and 19.3% relative F1-score on the READR and CMU Kids datasets, respectively.

Clippers 2/21: Ash Lewis on Improving Generated Responses of a COSI Language Pod Guide

In Clippers on Tuesday, I’m going to present on the beginning stages of a new project. I’m attempting to design a response generation model for the COSI museum avatar — a virtual question-answering guide at the Language Pod that can answer questions about the pod, linguistics, and other exhibits at COSI. Currently, the avatar, which is modeled after the Virtual Patient project, returns “canned” responses to questions, meaning that it has prescribed, static answers for a set of in-domain questions to which it tries to match user inputs. This can result in a fairly unnatural conversation; if the avatar interprets two utterances as the same question, it will repeat the exact same answer. The goal of my current project is to migrate to using a response generation model that will be more contextually aware and answer questions dynamically, but also adapt to constant changes in content as exhibits in the museum change. To do so, I’m attempting to leverage the capabilities of OpenAI’s ChatGPT to generate training data for a smaller model that will hopefully avoid the pitfalls of LLMs such as toxic behavior. The plan is to eventually train a document-grounded generation model that responds directly to user inputs rather than needing to first map them to prescribed questions. This project is in the early exploratory phases, so I’m hoping to get lots of feedback on design choices and suggestions for other avenues to explore.

Clippers 2/14: Shuaichen Chang on Selective Demonstration for Text-to-SQL

Abstract: Large language models (LLMs) have shown a strong generalization capability in the cross-domain text-to-SQL task without using in-domain examples. However, with a few in-domain annotations as demonstration examples, LLMs’ performance can be further improved. In this work, we first investigate the crucial elements of in-domain examples. Based on our findings, we propose to create demonstration examples with minimal in-domain annotation to improve the generalization ability of LLMs.