Clippers 11/28: Vishal Sunder on End-to-End Real Time Tracking of Children’s Reading with Pointer Network

In this work, we explore how a real-time reading tracker can be built efficiently for children’s voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model, making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by using forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find that a neural attention-based model is at least as accurate as the Montreal Forced Aligner, yet surprisingly provides a better training signal for the pointer network. Our results are reported on one adult speech dataset (TIMIT) and two children’s speech datasets (CMU Kids and Reading Races). Our best model tracks adult speech with 87.8% accuracy and the much harder, disfluent children’s speech with 77.1% accuracy on CMU Kids and 65.3% accuracy on Reading Races.
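
As a rough illustration (not the actual model), the sketch below shows a pointer-network head that scores every position in the ground-truth text given a streaming speech encoder state, trained with cross-entropy against positions obtained from forced alignment; module names and dimensions are assumptions.

```python
# Minimal sketch of a pointer-network head over ground-truth text positions.
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    def __init__(self, speech_dim: int, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, speech_state, text_states):
        # speech_state: (batch, speech_dim) current streaming encoder state
        # text_states:  (batch, n_tokens, text_dim) embeddings of the text being read
        q = self.speech_proj(speech_state).unsqueeze(1)     # (batch, 1, hidden)
        k = self.text_proj(text_states)                     # (batch, n_tokens, hidden)
        logits = self.score(torch.tanh(q + k)).squeeze(-1)  # (batch, n_tokens)
        return logits                                       # softmax -> pointer distribution

# Training signal: one text position per speech frame from forced alignment,
# so the loss is ordinary cross-entropy over positions.
pointer = PointerHead(speech_dim=512, text_dim=300)
speech_state = torch.randn(2, 512)
text_states = torch.randn(2, 40, 300)
target_positions = torch.tensor([3, 17])  # hypothetical forced-alignment targets
loss = nn.functional.cross_entropy(pointer(speech_state, text_states), target_positions)
```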

Clippers 11/21: Lifeng Jin (Tencent) on safety and consistency in dialogue systems

Safety and Consistency in Dialogue Systems

Safety and consistency of generated utterances from dialogue systems have been important issues for dialogue system development. A good dialogue system should be safe all the time, even when provoked by users, and consistent with the context, even when the user is not. In this talk, I am going to present our attempts at addressing some of the issues related to safety and consistency with two new datasets, new tasks, and experiments. Different models, including large language models such as ChatGPT and GPT-4, are evaluated on tasks such as safe rewriting and inconsistency resolution to examine their ability to detect and amend problems in dialogues caused by unsafe or inconsistent responses. I will discuss how they behave and what the future directions for these problems are.
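
Purely as an illustration of the safe-rewriting task, one could frame it as a prompting problem; `generate` below is a hypothetical stand-in for whichever model is being evaluated, not a real API call.

```python
# Illustrative framing of safe rewriting as a prompting task (not the actual datasets or prompts).
def build_safe_rewrite_prompt(context: list[str], unsafe_response: str) -> str:
    # Alternate User/Bot turns for the dialogue history.
    dialogue = "\n".join(f"User: {u}" if i % 2 == 0 else f"Bot: {u}"
                         for i, u in enumerate(context))
    return (
        "The last bot response below is unsafe or inconsistent with the context.\n"
        "Rewrite it so that it is safe and consistent while staying on topic.\n\n"
        f"{dialogue}\nBot (unsafe): {unsafe_response}\nBot (rewritten):"
    )

prompt = build_safe_rewrite_prompt(
    ["You are useless."],          # a provoking user turn
    "And you are even worse.",     # the unsafe reply to be rewritten
)
# rewritten = generate(prompt)     # hypothetical call to the evaluated model
```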

Clippers 11/14: Ash Lewis and Amad Hussain on Creating an Automated Museum Assistant

Creating an Automated Museum Assistant: Building low-resource document-grounded conversational agents

This week in Clippers, Ash and I would like to discuss our work in constructing a conversational assistant for the COSI Science Museum. Where our previous system consisted of a non-conversational query classifier that responded with canned answers, we seek to create a pipeline that conditions a generative response on retrieved facts/documents and the conversational history, with minimal risk of toxic output. Our work is on two fronts: the construction of a retrieval system and the training of a generative LLM. For the retrieval system, we investigate how best to contextualize a query within a conversation and how best to represent documents so that retrieval is possible. For the generative LLM, we fine-tune T5 and Llama and evaluate their responses using automated metrics, including GPT-4, to see which metrics and which model are most effective. Both fronts come with an added low-resource challenge, as much of our data and annotations are synthetically generated.
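
A minimal sketch of the retrieve-then-generate pipeline, assuming a sentence-embedding retriever; the embedding model, documents, and prompt format below are illustrative stand-ins, not the actual COSI system.

```python
# Sketch: contextualize the query, retrieve supporting documents, build a grounded prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("all-MiniLM-L6-v2")   # one possible retriever choice
documents = [
    "The planetarium show runs every hour on weekends.",
    "The dinosaur gallery features a full-scale T. rex model.",
]
doc_vecs = retriever.encode(documents, normalize_embeddings=True)

def contextualize(history: list[str], query: str) -> str:
    # Simplest possible query contextualization: prepend the most recent turns.
    return " ".join(history[-2:] + [query])

def retrieve(history: list[str], query: str, k: int = 1) -> list[str]:
    q_vec = retriever.encode([contextualize(history, query)], normalize_embeddings=True)
    scores = doc_vecs @ q_vec.T                  # cosine similarity (vectors are normalized)
    top = np.argsort(-scores[:, 0])[:k]
    return [documents[i] for i in top]

facts = retrieve(["Tell me about the exhibits.", "We have several galleries."],
                 "Which one has dinosaurs?")
prompt = f"Facts: {' '.join(facts)}\nAnswer the visitor's question using only these facts."
# The prompt would then be passed to the fine-tuned T5 or Llama generator.
```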

Clippers 11/7: Alyssa Allen on Natural Language Comment Generation for SQL

Natural Language Comment Generation for SQL

This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. My portion of the project focuses on training a model that can generate line-by-line natural language comments that bridge the gap between SQL and the user’s higher-level question. Prior research in SQL explainability has largely focused on translating SQL to templated English or summarizing entire SQL queries with a single comment (Eleftherakis et al., 2021; Narechania et al., 2021). In our generation approach, the comments should faithfully describe the purpose of one or more SQL commands and leverage language from the user question, ultimately making SQL parse errors easier for novice users to identify.

Our methods include first building a hand-annotated set of examples, which are then used in few-shot prompting with ChatGPT to generate a relatively small set of seed training items. From there, we experiment with fine-tuning a model (e.g., Llama) that can generate natural language comments for any SQL query, using a knowledge distillation plus filtering and editing approach. Work presented in this talk is ongoing.
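
For illustration, here is a hypothetical few-shot prompt for the seed-generation step; the example question, SQL, and comments are invented, not items from the annotated set.

```python
# Sketch of few-shot prompting for line-by-line SQL comment generation.
FEW_SHOT_EXAMPLE = """\
Question: Which departments have more than 10 employees?
SQL with line-by-line comments:
SELECT dept_name            -- report the department name
FROM employees              -- look in the employee records
GROUP BY dept_name          -- group the employees by department
HAVING COUNT(*) > 10        -- keep only departments with more than 10 employees
"""

def build_prompt(question: str, sql: str) -> str:
    return (
        "Write a natural language comment for each line of the SQL query, "
        "using wording from the user's question.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f"Question: {question}\nSQL with line-by-line comments:\n{sql}"
    )

# Completions returned by ChatGPT become seed training items, which are then
# filtered, edited, and used to fine-tune a smaller model such as Llama.
print(build_prompt("How many orders were placed in 2023?",
                   "SELECT COUNT(*)\nFROM orders\nWHERE year = 2023"))
```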

Clippers 10/31: Jingyi Chen on “Aligning Text-to-Image Models using Human Feedback”

On Halloween in Clippers, Jingyi Chen will present the paper Aligning Text-to-Image Models using Human Feedback (https://arxiv.org/abs/2302.12192).

Abstract: Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output alignment from a set of diverse text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, the text-to-image model is fine-tuned by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that careful investigations on such design choices are important in balancing the alignment-fidelity tradeoffs. Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
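
For intuition about the third stage, here is a toy sketch of reward-weighted likelihood fine-tuning; the function and tensors below are placeholders chosen for illustration, not the paper's code.

```python
# Sketch: maximize reward-weighted likelihood == minimize reward-weighted negative log-likelihood.
import torch

def reward_weighted_nll(log_likelihoods: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # log_likelihoods: (batch,) log p_theta(image | prompt) for model samples
    # rewards:         (batch,) alignment scores predicted by the stage-2 reward function
    return -(rewards * log_likelihoods).mean()

log_liks = torch.randn(4, requires_grad=True)        # stand-in for model log-likelihoods
rewards = torch.tensor([0.9, 0.1, 0.7, 0.3])          # stand-in for reward-model outputs
loss = reward_weighted_nll(log_liks, rewards)
loss.backward()                                       # well-aligned samples get up-weighted
```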

Clippers 10/24: Sara Court on software tools for low-resource morphological analysis and annotation

Micha and I will present our ongoing work developing software tools for low-resource morphological analysis and annotation. This is part of a larger project we presented last summer at ACL’s ComputEL workshop in collaboration with Maria Copot, Stephanie Antetomaso, and Noah Diewald.

We combine unsupervised methods for morphological paradigm discovery with a browser-based interface and a supervised learner implemented in tensorflow.js. We’re currently experimenting with various model designs and active learning selection heuristics and look forward to your feedback as we continue our work!
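
As one concrete example of the kind of selection heuristic under consideration, here is a least-confidence sampler sketched in Python (the actual learner runs in tensorflow.js in the browser); the probability matrix is made up for illustration.

```python
# Sketch of uncertainty-based active learning: annotate the forms the learner is least sure about.
import numpy as np

def least_confident(probs: np.ndarray, k: int = 2) -> np.ndarray:
    # probs: (n_forms, n_analyses) predicted distribution over candidate analyses per form
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k]          # indices of the k least confident forms

probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10]])
print(least_confident(probs))                  # ask the annotator about these forms next
```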

Clippers 10/17: Sam Stevens on mixture-of-experts (MoE) language models

In Clippers next week I will present some early-stage planning for a mixture-of-experts (MoE) language model project I hope to pursue. It will consist of:

  1. A literature review of neural MoE models in NLP
  2. How MoE models changed my thinking around model parallelism, FLOPs and compute efficiency
  3. What this implies about GPT-4 (which is rumored to be a MoE model)
  4. Soft MoE: a recent paper that aims to solve many problems with MoE models, but only applies the approach to vision
  5. Ideas I have on how to apply soft MoE to language modeling

I hope that #1 and #2 will be valuable to everyone, because I think MoE models are very under-utilized in research, despite supposedly powering the best language model in the world (GPT-4).
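
As a concrete reference for items 4 and 5, below is a minimal, illustrative soft-MoE layer in PyTorch; the class name, shapes, and expert design are my own simplifications of the Soft MoE idea rather than the paper's implementation.

```python
# Sketch of soft-MoE routing: slots are soft mixes of tokens, experts process slots,
# and tokens softly read the expert outputs back.
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 4, slots_per_expert: int = 1):
        super().__init__()
        self.slots_per_expert = slots_per_expert
        self.n_slots = n_experts * slots_per_expert
        self.slot_embed = nn.Parameter(torch.randn(dim, self.n_slots))
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )

    def forward(self, x):                        # x: (batch, tokens, dim)
        logits = x @ self.slot_embed             # (batch, tokens, slots)
        dispatch = logits.softmax(dim=1)         # each slot is a convex mix of tokens
        combine = logits.softmax(dim=2)          # each token softly reads from all slots
        slots = dispatch.transpose(1, 2) @ x     # (batch, slots, dim)
        outs = []
        for i, expert in enumerate(self.experts):
            s = slots[:, i * self.slots_per_expert:(i + 1) * self.slots_per_expert]
            outs.append(expert(s))
        expert_out = torch.cat(outs, dim=1)      # (batch, slots, dim)
        return combine @ expert_out              # (batch, tokens, dim)

layer = SoftMoE(dim=64)
print(layer(torch.randn(2, 10, 64)).shape)       # torch.Size([2, 10, 64])
```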

Clippers 10/3: Alex Petrov on intelligence in LLMs

Is GPT-4 Intelligent? What Does This Mean and How Can We Tell?

Artificial intelligence (AI) capabilities are improving at an unprecedented and alarming rate. Existing large language models (LLMs) such as GPT-4 already demonstrate “sparks” of artificial general intelligence (AGI). That is, they do according to a controversial paper by Bubeck et al. that many ML researchers consider a disgrace to the profession, whereas other scientists (myself included) consider it insightful and of pivotal importance.

These polarized opinions point to a methodological problem. The scientific community does not know how to evaluate opaque models with trillions of parameters. In my talk, I will try to shed some light on this question, drawing from philosophy, psychology, machine learning, theoretical computer science, hardware design, and linguistics. It is a remarkable fact that all these disparate disciplines provide valuable pieces of the puzzle.

Clippers 9/26: Christian Clark on categorial grammar induction

Toward Categorial Grammar Induction Using Predicate Co-occurrences from RoBERTa

Recent experiments with large language models (LLMs) have produced tantalizing evidence that innate knowledge is not needed to acquire language. Even so, LLMs do not directly reveal what categories and rules are learned, limiting their utility in explaining human language acquisition. Grammar induction models, in contrast, provide a more explicit means of exploring questions about learnability. Recent work has achieved advances in unsupervised induction of probabilistic context-free grammars (PCFGs). However, categorial grammar induction has received less recent attention, despite its appealing properties such as a transparent syntax–semantics interface. Motivated by this, I will present a set of experiments using a new model that induces a basic categorial grammar. I will also describe some first steps toward an extension to the model that will incorporate predicate co-occurrence information extracted from RoBERTa, as a means of leveraging world knowledge from an LLM within a model that learns explicit rules. I am especially interested in hearing the group’s suggestions for this ongoing work.
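
As a rough illustration of how predicate co-occurrence information might be read off RoBERTa, the sketch below scores predicate–argument pairs with masked-LM probabilities; the template and scoring scheme are placeholder assumptions, not the induction model itself.

```python
# Sketch: use RoBERTa's masked-LM head to score how plausible an argument is for a predicate.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def cooccurrence_score(predicate: str, argument: str) -> float:
    # Probability of the (single-token) argument filling the object slot of the predicate.
    text = f"They {predicate} the <mask>."            # hypothetical template
    enc = tokenizer(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]     # (1, vocab)
    probs = logits.softmax(dim=-1)
    arg_id = tokenizer(" " + argument, add_special_tokens=False)["input_ids"][0]
    return probs[0, arg_id].item()

print(cooccurrence_score("kicked", "ball"), cooccurrence_score("kicked", "idea"))
```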

Clippers 9/19: Byung-Doh Oh on the bigger-is-worse effect of LLM surprisal

A feature attribution analysis of the bigger-is-worse effect of large language model surprisal

Byung-Doh Oh, William Schuler

Recent studies have consistently shown that surprisal estimates from ‘bigger’ large language model (LLM) variants with more parameters and lower perplexity are less predictive of the comprehension difficulty that manifests in human reading times, which highlights a fundamental mismatch between the mechanistic processes underlying LLMs and human sentence processing. This work will present preliminary results from a feature attribution analysis that sheds light on this systematic divergence of LLMs by examining how different variants leverage identical context tokens, including observations that 1) perturbation-based feature attribution methods and 2) feature interactions over multiple tokens may be more appropriate for examining bigger LLM variants.
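
To make the ingredients concrete, here is an illustrative sketch of per-word surprisal from a causal LM together with a simple leave-one-out (perturbation-based) attribution over context tokens; the choice of GPT-2 and the ablation scheme are assumptions for illustration, not the paper's pipeline.

```python
# Sketch: surprisal of a word given context, and leave-one-out attribution of context words.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal(context: str, word: str) -> float:
    # -log2 p(word | context), summed over the word's subword pieces
    ids = tokenizer(context, return_tensors="pt")["input_ids"]
    word_ids = tokenizer(" " + word, add_special_tokens=False)["input_ids"]
    total = 0.0
    for wid in word_ids:
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0, -1]
        total += -torch.log_softmax(logits, dim=-1)[wid].item() / math.log(2)
        ids = torch.cat([ids, torch.tensor([[wid]])], dim=1)
    return total

def leave_one_out(context_words: list[str], target: str) -> list[float]:
    # Change in the target's surprisal when each context word is deleted.
    base = surprisal(" ".join(context_words), target)
    return [surprisal(" ".join(context_words[:i] + context_words[i + 1:]), target) - base
            for i in range(len(context_words))]

print(leave_one_out(["The", "dog", "chased", "the"], "cat"))
```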