Clippers 10/31: Jingyi Chen on “Aligning Text-to-Image Models using Human Feedback”

On Halloween in Clippers, Jingyi Chen will present the paper Aligning Text-to-Image Models using Human Feedback (https://arxiv.org/abs/2302.12192).

Abstract: Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output alignment from a set of diverse text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, the text-to-image model is fine-tuned by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that careful investigations of such design choices are important in balancing the alignment-fidelity tradeoffs. Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
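
To make the third stage concrete, here is a minimal PyTorch sketch of reward-weighted likelihood fine-tuning. The conditional model and reward network are toy stand-ins rather than the paper's actual text-to-image architecture, and the Gaussian likelihood is an assumption made purely for illustration.

```python
# Minimal sketch of reward-weighted likelihood fine-tuning (stage 3 of the
# paper's pipeline). The models below are toy stand-ins, not the actual
# text-to-image architecture or learned reward function from the paper.
import torch
import torch.nn as nn

class ToyConditionalModel(nn.Module):
    """Stand-in for a text-to-image model: maps a prompt embedding to the
    parameters of a (Gaussian) distribution over flattened images."""
    def __init__(self, prompt_dim=16, image_dim=32):
        super().__init__()
        self.net = nn.Linear(prompt_dim, image_dim)

    def log_prob(self, prompts, images):
        # Gaussian log-likelihood of the image given the prompt (toy choice).
        mean = self.net(prompts)
        return -0.5 * ((images - mean) ** 2).sum(dim=-1)

class ToyRewardModel(nn.Module):
    """Stand-in for the learned reward function r(image, prompt)."""
    def __init__(self, prompt_dim=16, image_dim=32):
        super().__init__()
        self.net = nn.Linear(prompt_dim + image_dim, 1)

    def forward(self, prompts, images):
        return self.net(torch.cat([prompts, images], dim=-1)).squeeze(-1)

model = ToyConditionalModel()
reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch: prompt embeddings and model-generated images to be scored.
prompts = torch.randn(8, 16)
images = torch.randn(8, 32)

for step in range(100):
    with torch.no_grad():
        rewards = reward_model(prompts, images)  # r(x, z), used as fixed weights
    # Reward-weighted negative log-likelihood.
    loss = -(rewards * model.log_prob(prompts, images)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point is simply that the reward acts as a fixed, per-example weight on the negative log-likelihood, so well-aligned samples pull the model harder than misaligned ones.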

Clippers 10/24: Sara Court on software tools for low-resource morphological analysis and annotation

Micha and I will present our ongoing work developing software tools for low-resource morphological analysis and annotation. This is part of a larger project we presented last summer at ACL’s ComputEL workshop in collaboration with Maria Copot, Stephanie Antetomaso, and Noah Diewald.

We combine unsupervised methods for morphological paradigm discovery with a browser-based interface and a supervised learner implemented in tensorflow.js. We’re currently experimenting with various model designs and active learning selection heuristics, and we look forward to your feedback as we continue our work!
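
For anyone curious what an active learning selection heuristic looks like in practice, below is a small sketch of entropy-based uncertainty sampling. The actual tool's learner runs in tensorflow.js; this standalone numpy version (with invented form names) just illustrates how unlabeled candidates might be ranked for annotation.

```python
# A minimal sketch of one common active-learning selection heuristic:
# entropy-based uncertainty sampling over a classifier's predictions.
# Candidate names and probabilities are invented for illustration.
import numpy as np

def entropy(probs, eps=1e-12):
    """Predictive entropy of each row of class probabilities."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def select_for_annotation(probs, candidate_ids, k=5):
    """Return the k candidates the model is least certain about."""
    scores = entropy(probs)
    ranked = np.argsort(-scores)  # highest entropy first
    return [candidate_ids[i] for i in ranked[:k]]

# Toy example: the model's class probabilities for 4 unlabeled word forms.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.25],   # uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most uncertain
])
print(select_for_annotation(probs, ["formA", "formB", "formC", "formD"], k=2))
# -> ['formD', 'formB']
```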

Clippers 10/3: Alex Petrov on intelligence in LLMs

Is GPT-4 Intelligent? What Does This Mean and How Can We Tell?

Artificial intelligence (AI) capabilities are improving at an unprecedented and alarming rate. Existing large language models (LLMs) such as GPT-4 already demonstrate “sparks” of artificial general intelligence (AGI). At least, they do according to a controversial paper by Bubeck et al., which many ML researchers consider a disgrace to the profession and which other scientists (myself included) consider insightful and of pivotal importance.

These polarized opinions point to a methodological problem. The scientific community does not know how to evaluate opaque models with trillions of parameters. In my talk, I will try to shed some light on this question, drawing from philosophy, psychology, machine learning, theoretical computer science, hardware design, and linguistics. It is a remarkable fact that all these disparate disciplines provide valuable pieces of the puzzle.

Clippers 9/26: Christian Clark on categorial grammar induction

Toward Categorial Grammar Induction Using Predicate Co-occurrences from RoBERTa

Recent experiments with large language models (LLMs) have produced tantalizing evidence that innate knowledge is not needed to acquire language. Even so, LLMs do not directly reveal what categories and rules are learned, limiting their utility in explaining human language acquisition. Grammar induction models, in contrast, provide a more explicit means of exploring questions about learnability. Recent work has achieved advances in unsupervised induction of probabilistic context-free grammars (PCFGs). However, categorial grammar induction has received less recent attention, despite its appealing properties such as a transparent syntax–semantics interface. Motivated by this, I will present a set of experiments using a new model that induces a basic categorial grammar. I will also describe some first steps toward an extension to the model that will incorporate predicate co-occurrence information extracted from RoBERTa, as a means of leveraging world knowledge from an LLM within a model that learns explicit rules. I am especially interested in hearing the group’s suggestions for this ongoing work.
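
As one hedged illustration of how predicate co-occurrence information might be extracted from RoBERTa (the abstract leaves the exact procedure open), the sketch below scores candidate predicates with RoBERTa's masked-LM head. The template sentence and candidate list are invented, and scoring only the first subword of each candidate is a simplification.

```python
# Illustrative sketch: use RoBERTa's masked-LM head to score how plausible
# candidate predicates are in a given argument context. This is not the
# extraction procedure used in the talk, just one plausible approach.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def predicate_scores(template, candidates):
    """Probability RoBERTa assigns to each candidate predicate at <mask>."""
    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    scores = {}
    for word in candidates:
        # RoBERTa's BPE vocabulary marks word-initial tokens with a leading space.
        ids = tokenizer.encode(" " + word, add_special_tokens=False)
        scores[word] = probs[ids[0]].item()  # crude: score of the first subword only
    return scores

print(predicate_scores("The dog <mask> the bone.", ["ate", "buried", "sang"]))
```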

Clippers 9/19: Byung-Doh Oh on the bigger-is-worse effect of LLM surprisal

A feature attribution analysis of the bigger-is-worse effect of large language model surprisal

Byung-Doh Oh, William Schuler

Recent studies have consistently shown that surprisal estimates from ‘bigger’ large language model (LLM) variants with more parameters and lower perplexity are less predictive of the comprehension difficulty that manifests in human reading times, highlighting a fundamental mismatch between the mechanistic processes underlying LLMs and human sentence processing. This work will present preliminary results from a feature attribution analysis that sheds light on this systematic divergence by examining how different variants leverage identical context tokens, including observations that (1) perturbation-based feature attribution methods and (2) feature interactions over multiple tokens may be more appropriate for examining bigger LLM variants.
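
As a rough illustration of the kind of analysis described, the sketch below computes the surprisal of a target word under GPT-2 and attributes it to context tokens by leave-one-out perturbation (deleting one context word at a time). The example sentence and the choice of GPT-2 are illustrative only; this is not the authors' exact procedure or model set.

```python
# Sketch: per-word surprisal under an autoregressive LM, attributed to
# context tokens by leave-one-out perturbation. Illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal_of_next_word(context_words, target_word):
    """Surprisal (in nats) of target_word's first subword given the context."""
    text = " ".join(context_words)
    ctx_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    tgt_id = tokenizer(" " + target_word, add_special_tokens=False)["input_ids"][0]
    with torch.no_grad():
        logits = model(ctx_ids).logits[0, -1]
    return -logits.log_softmax(dim=-1)[tgt_id].item()

context = ["The", "old", "man", "the"]
target = "boats"
base = surprisal_of_next_word(context, target)

# Leave-one-out attribution: how much does removing each context word
# change the target's surprisal?
for i, word in enumerate(context):
    perturbed = context[:i] + context[i + 1:]
    delta = surprisal_of_next_word(perturbed, target) - base
    print(f"remove {word!r}: change in surprisal = {delta:+.3f} nats")
```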

Clippers 9/5: Michael White on Bootstrapping a Conversational Guide for Colonoscopy Prep (Arya et al., SIGDIAL-23)

Pulkit Arya, Madeleine Bloomquist, Subhankar Chakraborty, Andrew Perrault, William Schuler, Eric Fosler-Lussier, and Michael White. 2023. Bootstrapping a Conversational Guide for Colonoscopy Prep. To appear in Proc. SIGDIAL-23.

Creating conversational systems for niche domains is a challenging task, further exacerbated by a lack of quality datasets. We explore the construction of safer conversational systems for guiding patients in preparing for colonoscopies. This has required a data generation pipeline to produce a minimum viable dataset for bootstrapping a semantic parser, augmented by automatic paraphrasing. Our study suggests large language models (e.g., GPT-3.5 & GPT-4) are a viable alternative to crowdsourced paraphrasing, but conversational systems that rely upon language models’ ability to do temporal reasoning struggle to provide accurate responses. A neural-symbolic system that performs temporal reasoning on an intermediate representation of user queries shows promising results compared to an end-to-end dialogue system, improving the number of correct responses while vastly reducing the number of incorrect or misleading ones.
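
The neural-symbolic division of labor can be illustrated with a small sketch: a semantic parser (not shown) maps the user's question to a structured intermediate representation, and a symbolic component then does the temporal arithmetic against the scheduled procedure time. The intents, event names, and hour offsets below are placeholders invented for illustration, not actual colonoscopy-prep guidance.

```python
# Illustrative neural-symbolic pipeline: the neural side would produce the
# intermediate representation (IR); the symbolic side does temporal reasoning.
# Event names and offsets are invented placeholders, not medical guidance.
from datetime import datetime, timedelta

# Hypothetical protocol: hours before the procedure at which each prep event occurs.
PROTOCOL_OFFSETS_HOURS = {
    "stop_solid_food": 24,
    "start_clear_liquids": 24,
    "take_second_dose": 6,
}

def answer(ir, procedure_time):
    """Symbolic temporal reasoning over an intermediate representation (IR)."""
    if ir["intent"] == "ask_event_time":
        offset = PROTOCOL_OFFSETS_HOURS[ir["event"]]
        return procedure_time - timedelta(hours=offset)
    raise ValueError(f"unhandled intent: {ir['intent']}")

# In the real system a semantic parser would produce this IR from a question
# like "When do I need to stop eating solid food?"
ir = {"intent": "ask_event_time", "event": "stop_solid_food"}
procedure_time = datetime(2023, 9, 5, 9, 0)
print(answer(ir, procedure_time))  # -> 2023-09-04 09:00:00
```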

Clippers 8/29: Ash Lewis on Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide

Title: Mitigating Harms of LLMs via Knowledge Distillation for a Virtual Museum Tour Guide

Authors: Ashley Lewis and Michael White

Abstract:

LLMs are known to be very powerful, exhibiting both great benefits and great risk. We seek to leverage the benefits, in particular the ability to be fluent, conversational dialogue agents, while minimizing the risks, such as hallucination and toxic content. In this work we use knowledge distillation to create a virtual museum tour guide dialogue agent, employing ChatGPT as a teacher model for a smaller student model, T5-large. We find the T5 model shows competitive performance, significantly reduces instances of hallucination, and shows promise for reducing toxic content.
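
A minimal sketch of the distillation setup appears below: responses generated by the teacher become supervised targets for a smaller sequence-to-sequence student. The example question-answer pair is invented, and t5-small stands in for T5-large to keep the example light.

```python
# Sketch of knowledge distillation via supervised fine-tuning: teacher
# (e.g., ChatGPT) outputs, collected offline, serve as targets for a smaller
# student model. The example pair is invented; t5-small stands in for T5-large.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-5)

# Teacher-generated (question, answer) pairs collected offline.
pairs = [
    ("answer as a museum tour guide: Who painted this portrait?",
     "This portrait was painted by an artist from our 19th-century collection."),
]

student.train()
for question, teacher_answer in pairs:
    inputs = tokenizer(question, return_tensors="pt")
    labels = tokenizer(teacher_answer, return_tensors="pt")["input_ids"]
    loss = student(input_ids=inputs["input_ids"],
                   attention_mask=inputs["attention_mask"],
                   labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```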

Clippers 4/18: Jingyi Chen on Multi-Source Morphological Reinflection with Reinforcement Learning

Multi-Source Morphological Reinflection with Reinforcement Learning

This project develops an approach that uses reinforcement learning to guide multi-source morphological reinflection (MRI). MRI is the task of transforming words from one inflectional form to another. For example, when encountering a new inflected form of a word, humans may rely on their knowledge of the morphological rules of the language, as well as their experience with similar forms in the past, to infer the correct inflection. Kann et al. (2017) develop a multi-source MRI model, which receives a target tag and multiple pairs of source form and source tag for a lemma. Their model is found to outperform single-source reinflection models, as different source forms can provide complementary information. Although Kann et al. do not provide specific details on how the multiple source form-tag pairs are chosen, selecting appropriate source form-tag pairs as reference words is key to modeling morphological reinflection. Our project uses reinforcement learning to select reference words during the reinflection process: an RL agent learns to select an appropriate source form and tag pair based on the context of the lemma and its morphological features, as well as its experience with similar examples in the past, much as humans select the appropriate inflected form based on context and their past experience with the language. Since this project is still ongoing, I would greatly appreciate any suggestions or feedback.
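
To make the selection problem concrete, here is a toy sketch in which an epsilon-greedy agent learns which candidate source form-tag pair to feed a reinflection model, receiving reward 1 when the produced target form is correct. The reinflection model is a trivial lookup stand-in and the candidate pairs are invented; the actual project's agent, state representation, and reward are more involved.

```python
# Toy sketch of the RL idea: the agent chooses a source (form, tag) pair and
# is rewarded when the reinflection model produces the gold target form.
# The "reinflector" is a trivial lookup stand-in; everything is illustrative.
import random

def toy_reinflector(source_form, source_tag, target_tag):
    """Stand-in for a trained reinflection model (a lookup on one toy verb)."""
    table = {("walked", "PST", "V;PRS;3SG"): "walks",
             ("walking", "V.PTCP;PRS", "V;PRS;3SG"): "walks"}
    return table.get((source_form, source_tag, target_tag), "???")

# Candidate source pairs for the lemma, and the gold target form.
candidates = [("walked", "PST"), ("walking", "V.PTCP;PRS"), ("walk", "NFIN")]
target_tag, gold = "V;PRS;3SG", "walks"

# Epsilon-greedy value estimates over which candidate index to pick.
values, counts, eps = [0.0] * len(candidates), [0] * len(candidates), 0.2
for episode in range(200):
    if random.random() < eps:
        a = random.randrange(len(candidates))
    else:
        a = max(range(len(candidates)), key=lambda i: values[i])
    form, tag = candidates[a]
    reward = 1.0 if toy_reinflector(form, tag, target_tag) == gold else 0.0
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # incremental mean update

print({c: round(v, 2) for c, v in zip(candidates, values)})
```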

Clippers 4/11: Alyssa Allen on Line-by-Line Comment Generation for SQL

This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. Prior research on SQL comment generation has focused on comments that summarize entire SQL queries and on translations of SQL into templated English (Eleftherakis et al., 2021; Narechania et al., 2021). These approaches can be helpful in comprehending SQL but are limited in their ability to guide users through the query steps and to connect formal notation with intuitive concepts. To address this limitation, the project aims to generate line-by-line comments that leverage language from user questions, connecting formal SQL notation with user-friendly concepts (e.g., “tallest” or “alphabetical order”).

Due to a lack of pre-existing training data, 100 SQL queries from the SPIDER dataset (Yu et al., 2018) have been manually annotated. These 100 examples will then be used as a base for generating a more robust training set through self-training and prompting. I have been experimenting with using ChatGPT to generate comments for more queries as well as fine-tuning BART for the task. This approach will allow us to scale the training set quickly and minimize time spent writing comments by hand. This presentation will discuss the annotation process and preliminary results for comment generation using the above methods.
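
As a sketch of how line-by-line comment generation can be framed as sequence-to-sequence generation, the snippet below feeds each SQL line, with the user's question as context, to a BART model and prints the generated comment after the line. The checkpoint name, prompt format, and example query are placeholders; in practice the model would first be fine-tuned on the annotated SPIDER examples.

```python
# Sketch: line-by-line SQL comment generation framed as seq2seq. The prompt
# format and example are placeholders; the base checkpoint would be replaced
# by a model fine-tuned on the annotated data before use.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")  # fine-tuned checkpoint in practice

question = "Who is the tallest player on each team?"
sql_lines = [
    "SELECT team, name, MAX(height)",
    "FROM players",
    "GROUP BY team",
]

for line in sql_lines:
    source = f"question: {question} sql: {line}"
    ids = tokenizer(source, return_tensors="pt")["input_ids"]
    out = model.generate(ids, max_new_tokens=30)
    comment = tokenizer.decode(out[0], skip_special_tokens=True)
    print(f"{line}  -- {comment}")
```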

Clippers 3/28: Amad Hussain on Improving Training with Imbalanced Datasets

Tackling Training with Imbalanced Datasets: An Investigation of MixUp and Paraphrase Augmentation for Downstream Classification

Low-resource dialogue systems often contain a high proportion of few-shot class labels, leading to challenges in utterance classification performance. A possible solution is data augmentation through paraphrase generation, but this method has the potential to introduce harmful data points in the form of low-quality paraphrases. We explore this challenge as a case study using a virtual patient dialogue system, which contains a long-tail distribution of few-shot labels. In previous work, we investigated the efficacy of paraphrase augmentation using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques using Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques were validated by training and evaluating a downstream self-attentive RNN model with and without MixUp (embedding interpolation during training). The results were mixed and indicated a trade-off between reduction of misleading paraphrases and paraphrase diversity.
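
For readers unfamiliar with MixUp, the sketch below shows the embedding interpolation idea: pairs of examples are mixed by taking a convex combination of their utterance embeddings and of their one-hot labels, yielding soft training targets. The batch sizes, dimensions, and Beta parameter are illustrative only.

```python
# Minimal sketch of MixUp as embedding interpolation: mix utterance
# embeddings and one-hot labels with a Beta-distributed coefficient.
# Shapes and the alpha value are illustrative, not the system's settings.
import torch

def mixup(embeddings, one_hot_labels, alpha=0.4):
    """Return mixed embeddings and (soft) labels for one batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(embeddings.size(0))
    mixed_x = lam * embeddings + (1 - lam) * embeddings[perm]
    mixed_y = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y

# Toy batch: 4 utterance embeddings (dim 8) and 3 classes.
x = torch.randn(4, 8)
y = torch.nn.functional.one_hot(torch.tensor([0, 1, 2, 1]), num_classes=3).float()
mx, my = mixup(x, y)
print(mx.shape, my)  # mixed labels are now soft distributions over classes
```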

In this talk, I will go over potential training paradigms and paraphrase filtration mechanisms that expand on this previous work. Ideas include example sampling techniques, variable loss weighting during MixUp, and paraphrase filtration using training loss. The hope is that one, or some combination, of these methods will improve model generalizability and class-imbalanced training. The best direction is not obvious, so feedback on these ideas will be much appreciated!