News & Announcements | CLLT | Computational Linguistics and Language Technology

Clippers 3/5: Alyssa Allen on SQL Query Explainability using Natural Language Generation

March 5, 2024 at 1:33am by Michael White

SQL Query Explainability using Natural Language Generation

This work is rooted in a larger project aimed at developing a dialogue system that helps increase transparency of database query outputs for non-expert SQL users. Previously, I’ve discussed processes for building a training set using few-shot prompting and a hand-annotated set of commented queries. Additionally, I’ve discussed test set results from LLMs (such as ChatGPT and Llama). This presentation will shift focus to the content of the natural language.

I’ll discuss the development of comment guidelines and the need for guidelines in standardizing the evaluation. Comment guidelines should ideally provide transparency in what constitutes a “good” comment. Comments should also 1) reflect certain properties of the relational database structure, 2) prioritize semantic fidelity to the query and 3) align with the user language wherever appropriate. The comment guidelines use these core elements to outline how generated natural language can increase explainability of database queries. Our methods will be compared to approaches that leverage templated or rule-based systems of explainability.

Clippers 2/20: Byung-Doh Oh, Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times

February 19, 2024 at 11:12pm by Michael White

Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times

https://arxiv.org/abs/2402.02255

Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. The current work presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. First, residual errors from four language model families on four corpora show that the inverse correlation between model size and fit to reading times is the strongest on the subset of least frequent words, which is driven by excessively accurate predictions of larger model variants. Additionally, training dynamics reveal that during later training steps, all model variants learn to predict rare words and that larger model variants do so more accurately, which explains the detrimental effect of both training data amount and model size on fit to reading times. Finally, a feature attribution analysis demonstrates that larger model variants are able to accurately predict rare words based on both an effectively longer context window size as well as stronger local associations compared to smaller model variants. Taken together, these results indicate that Transformer-based language models’ surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
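For readers new to the term, surprisal is the negative log probability a model assigns to a word, so a rare word that is predicted very accurately receives a low surprisal. The sketch below is an illustration only and is not taken from the paper: it assumes a context-free unigram model (the hypothetical `unigram_surprisals` helper) in place of a Transformer's conditional probabilities, just to show how frequency and surprisal relate.

```python
import math
from collections import Counter


def surprisal_bits(probability):
    """Surprisal of an event with the given probability, in bits."""
    return -math.log2(probability)


def unigram_surprisals(tokens):
    """Estimate each word's surprisal from its relative frequency.

    Assumption for illustration only: a context-free unigram model stands in
    for a language model's conditional probability P(word | context).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: surprisal_bits(count / total) for word, count in counts.items()}


# Toy corpus, invented for this example.
toy_corpus = "the cat sat on the mat the end".split()
surprisals = unigram_surprisals(toy_corpus)

# The frequent word "the" gets a lower surprisal than the rare word "cat".
print(surprisals["the"], surprisals["cat"])
```

Under this toy estimate, rarer words always carry higher surprisal; the analyses in the abstract concern how far larger models push that surprisal down for rare words once context is taken into account.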

Clippers 2/6: Ash Lewis on a user study of interactive KB querying

February 5, 2024 at 3:55pm by Michael White

In Clippers on Tuesday, February 6th, I will be presenting the results of a user study we (Lingbo Mo, Huan Sun, Mike White, and I) conducted in order to test the viability of an interactive semantic parsing system we built. The system was designed to help users query a knowledge base in natural language, offsetting the need to know the query language that the knowledge base uses and thus making the information more accessible to novice users. Our system decomposes the query into pieces and translates them into understandable natural language, so that users can see exactly how the system reached an answer and therefore be confident in it. Alternatively, if the parse is incorrect, the user can use a natural language interface to correct it.

This work was conducted in the “pre-LLM era” and thus much of the technical contribution is a bit outdated. However, the user study, in which we had crowdworkers test several versions of the system, has broad application to human evaluation of dialogue systems. As dialogue systems become increasingly ubiquitous, we believe our experience conducting this user study has important lessons to contribute to evaluation methodologies.

My goal for Clippers is to make clearer the “story” for a paper about evaluation – this project has spanned many years and there is a great deal of content to sift through. I hope to get fresh eyes on that content and get feedback on the most salient pieces.

Clippers 1/30: Chris Brew on building a summarizer module for Lexis+AI

February 5, 2024 at 3:53pm by Michael White

Building a summarizer module for Lexis+AI

With minimal prompting, commercial large language models can produce useful indicative summaries of many documents. Given informed and tolerant readers, the bar for usefulness is low, and current models easily achieve it. But these summaries do not meet the standards required of a professional information product. We show that, for legal documents, a “faceted” approach to summarization can smooth the path to acceptable professional quality. The Lexis+AI product currently covers about three and a half use cases, which I will explain and demonstrate.

In an applied AI setting, and especially for LLMs, evaluation is a key issue, and one which plays out differently for each use case, and also differently from what is normal in academic NLP. If time permits, I will try to give my impressions of how this really works in practice, and point at opportunities for high-impact work on evaluation.

In other words, we’ll finish up talking a little about what “acceptable professional quality” might mean. I am definitely speaking for myself on this, not representing a company position.

Clippers 1/23: Sara Court and Alyssa Allen, Project Workshopping/Brainstorming

January 21, 2024 at 1:47am (updated February 5, 2024) by Michael White

Sara will be workshopping developments for her QP2 on leveraging pedagogical materials with LLMs for low-resource machine translation.

Alyssa will be workshopping directions for a potential collaborative project related to human-machine interactions. The experiments will involve an embodied language-capable robot. Research questions will likely focus on how the robot can best align with human conversational preferences. Example linguistic/conversational features of interest include backchanneling, laughter, cooperative overlap, and rate of speech.

Clippers 11/28: Vishal Sunder on End-to-End Real Time Tracking of Children’s Reading with Pointer Network

November 25, 2023 at 6:28pm by Michael White

In this work, we explore how a real time reading tracker can be built efficiently for children’s voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model, making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by using forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find that a neural attention-based model is comparable in alignment accuracy to the Montreal Forced Aligner but, surprisingly, provides a better training signal for the pointer network. Our results are reported on one adult speech dataset (TIMIT) and two children’s speech datasets (CMU Kids and Reading Races). Our best model can accurately track adult speech with 87.8% accuracy and the much harder and disfluent children’s speech with 77.1% accuracy on the CMU Kids data and 65.3% accuracy on the Reading Races dataset.
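As a rough illustration of how forced-alignment output can be turned into a training signal for a pointer, the sketch below maps each audio frame to the index of the word being read at that moment. Everything here is an assumption made for the example: the `(start, end)` span format, the 0.1 second frame step, the rule that silent frames keep pointing at the most recently started word, and the frame-level accuracy measure. The abstract does not specify these details.

```python
def pointer_targets(word_spans, num_frames, frame_sec):
    """Map each frame to the index of the word being read at that time.

    word_spans: list of (start_sec, end_sec) per word, in reading order
                (hypothetical forced-aligner output).
    Frames that fall in a pause keep pointing at the last word that started.
    """
    targets = []
    current = 0
    for frame in range(num_frames):
        t = frame * frame_sec
        # Advance the pointer once the next word has started.
        while current + 1 < len(word_spans) and t >= word_spans[current + 1][0]:
            current += 1
        targets.append(current)
    return targets


def frame_accuracy(predicted, targets):
    """Fraction of frames where the predicted position matches the target."""
    return sum(p == t for p, t in zip(predicted, targets)) / len(targets)


# Toy alignment for a three-word sentence, invented for this example.
spans = [(0.0, 0.3), (0.35, 0.58), (0.65, 1.0)]
targets = pointer_targets(spans, num_frames=10, frame_sec=0.1)

# A hypothetical tracker that jumps to the second word one frame too early.
predicted = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
print(targets, frame_accuracy(predicted, targets))
```

A pointer network would be trained to emit these per-frame positions directly from streaming speech, which is what removes the need for a separate ASR stage.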

Clippers 11/21: Lifeng Jin (Tencent) on safety and consistency in dialogue systems

November 17, 2023 at 1:45am by Michael White

Safety and Consistency in dialogue systems

Safety and consistency of generated utterances from dialogue systems have been important issues for dialogue system development. A good dialogue system should be safe all the time, even when provoked by users, and consistent with the context, even when the user is not. In this talk, I am going to present our attempts at addressing some of the issues related to safety and consistency with two new datasets, new tasks and experiments. Different models, including large language models such as ChatGPT and GPT-4, are used in evaluation of tasks such as safe rewriting and inconsistency resolution to look at their ability to detect and amend unsafe or inconsistent responses in dialogues. I will discuss how they behave and what the future directions are for these problems.

Clippers 11/14: Ash Lewis and Amad Hussain on Creating an Automated Museum Assistant

November 12, 2023 at 5:59pm by Michael White

Creating an Automated Museum Assistant: Building low-resource document-grounded conversational agents

This week in Clippers, Ash and I would like to discuss our work in constructing a conversational assistant for the COSI Science Museum. Where our previous system consisted of a non-conversational query classifier which responded with canned answers, we seek to create a pipeline which conditions a generative response on retrieved facts/documents and conversational history with minimal risk of toxic output. Our work is on two fronts: the construction of a retrieval system and the training of a generative LLM. For our retrieval system, we investigate how to best contextualize a query within a conversation, and how to best represent documents such that retrieval is possible. For the generative LLM, we fine-tune T5 and Llama and evaluate their responses using automated metrics, including GPT-4, to see which metrics and model are most effective. These fronts have an added low-resource challenge as much of our data and annotations are synthetically generated.
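To make the idea of contextualizing a query within a conversation concrete, here is a minimal sketch that is not the system described above. It assumes the simplest possible choices: the last two conversation turns are prepended to the query, and documents are ranked by bag-of-words cosine similarity. The function names and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(history, query, documents, k=1):
    """Rank documents against the query plus recent conversation turns."""
    contextualized = " ".join(history[-2:] + [query])
    query_vec = bag_of_words(contextualized)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, bag_of_words(doc)), reverse=True
    )
    return ranked[:k]


# Toy documents and conversation, invented for this example.
documents = [
    "The planetarium shows run every hour",
    "The dinosaur exhibit features a large fossil skeleton",
]
history = ["Tell me about the dinosaur exhibit"]
top = retrieve(history, "how old is it", documents)
print(top)
```

On its own, the follow-up question "how old is it" shares no words with either document; adding the previous turn is what lets the dinosaur document rank first. A real system would likely use learned dense representations rather than word overlap.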

Clippers 11/7: Alyssa Allen on Natural Language Comment Generation for SQL

November 7, 2023 at 2:04pm by Michael White

Natural Language Comment Generation for SQL

This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. My portion of the project focuses on training a model that can generate line-by-line natural language comments which bridge the gap between SQL and the user’s higher-level question. Prior research in SQL explainability has largely focused on translating SQL to templated English or summarizing entire SQL queries with a comment (Eleftherakis et al., 2021; Narechania et al., 2021). In our generation approach, the comments should faithfully describe the purpose of one or multiple SQL commands and leverage language from the user question, ultimately making SQL parse errors easier for novice users to identify.

Our methods include first building a hand-annotated set of examples, which are then used in few-shot prompting with ChatGPT to generate a relatively small set of seed training items. From there, we experiment with fine-tuning a model (e.g. Llama) that can generate natural language comments for any SQL query, using a knowledge distillation plus filtering and editing approach. Work presented in this talk is ongoing.
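As a rough illustration of the few-shot prompting step (not the project's actual prompt or code), the sketch below assembles a prompt from hand-annotated examples and a new query. The instruction wording, the field names `question` and `commented_sql`, the helper `build_few_shot_prompt`, and the example query are all hypothetical.

```python
def build_few_shot_prompt(examples, question, sql):
    """Assemble a few-shot prompt from hand-annotated commented queries.

    examples: list of dicts with hypothetical keys "question" and
              "commented_sql" (SQL with a comment above each line).
    """
    parts = ["Add a natural language comment above each line of the SQL query."]
    for example in examples:
        parts.append(
            f"Question: {example['question']}\n"
            f"Commented SQL:\n{example['commented_sql']}"
        )
    # The new item ends with an open slot for the model to complete.
    parts.append(f"Question: {question}\nSQL:\n{sql}\nCommented SQL:")
    return "\n\n".join(parts)


# One hand-annotated example, invented for this illustration.
examples = [
    {
        "question": "How many singers are there?",
        "commented_sql": (
            "-- Count the number of singers\n"
            "SELECT COUNT(*)\n"
            "-- Look in the singer table\n"
            "FROM singer"
        ),
    }
]
prompt = build_few_shot_prompt(
    examples, "Which stadiums hold over 5000 people?",
    "SELECT name FROM stadium WHERE capacity > 5000",
)
print(prompt)
```

The model's completions for prompts like this would then be filtered and edited before being used as seed training items.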

Clippers 10/31: Jingyi Chen on “Aligning Text-to-Image Models using Human Feedback”

October 28, 2023 at 3:25pm by Michael White

On Halloween in Clippers, Jingyi Chen will present the paper Aligning Text-to-Image Models using Human Feedback (https://arxiv.org/abs/2302.12192).

Abstract: Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output alignment from a set of diverse text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, the text-to-image model is fine-tuned by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that careful investigations on such design choices are important in balancing the alignment-fidelity tradeoffs. Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
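To give a feel for the third stage, the sketch below computes a reward-weighted negative log-likelihood over a small batch: each sample's log-likelihood is scaled by its predicted reward, so well-aligned samples contribute more to the update. This is a simplified illustration in plain Python, assuming rewards in [0, 1] and precomputed log-likelihoods; the function name is hypothetical, and the paper's full objective may include additional terms not shown here.

```python
import math


def reward_weighted_nll(log_likelihoods, rewards):
    """Mean of -reward * log-likelihood over a batch of samples.

    Minimizing this value maximizes the reward-weighted likelihood.
    """
    if len(log_likelihoods) != len(rewards):
        raise ValueError("log_likelihoods and rewards must have the same length")
    weighted = [r * ll for ll, r in zip(log_likelihoods, rewards)]
    return -sum(weighted) / len(weighted)


# Toy batch, invented for this example: the first image matches its prompt
# (reward 1.0), the second does not (reward 0.0) and is therefore ignored.
log_likelihoods = [math.log(0.5), math.log(0.25)]
rewards = [1.0, 0.0]
loss = reward_weighted_nll(log_likelihoods, rewards)
print(loss)
```

In an actual fine-tuning loop, the log-likelihoods would come from the text-to-image model and the rewards from the trained reward function, with gradients flowing only through the log-likelihood term.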
