Clippers 9/17: Discussion of Self-Taught Evaluators paper

Michael White will lead a discussion of Meta’s Self-Taught Evaluators paper.


Self-Taught Evaluators

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

https://arxiv.org/abs/2408.02666

Model-based evaluation is at the heart of successful model development — as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
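The abstract reports a second score "with majority vote". As a minimal sketch of what majority voting over several sampled judgments could look like, assume each sampled judgment has already been reduced to a verdict string such as "A" or "B". The function name, the verdict labels, and the sample list are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most frequent verdict among the sampled judgments."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    # most_common(1) returns [(verdict, count)]; on a tie, the verdict
    # that was seen first wins.
    return counts.most_common(1)[0][0]


# Hypothetical verdicts from five sampled judgments of the same response pair.
samples = ["A", "B", "A", "A", "B"]
print(majority_vote(samples))
```

Sampling the judge several times and aggregating trades extra inference cost for a more stable final verdict.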

Clippers 9/10: Alyssa Allen on GEM Data-to-Text Shared Task

This week in Clippers, I’ll be workshopping a poster that I’m bringing to INLG later this month. Ash, Yi-Chien, Tomiris, Mike, and I participated in the GEM Data-to-Text shared task. We were tasked with generating text for triple sets where each triple was of the form Subject | Property | Object. This was done for factual, counterfactual, and fictional triple sets. We experimented with English, Spanish, Chinese, and Russian, but ultimately submitted outputs for English and Spanish. I appreciate all feedback on the content and layout of the poster, but (perhaps more importantly) I’d like to know what questions I’ll likely be asked at the conference based on our work.
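To make the input format concrete, here is a small sketch that parses a triple written as Subject | Property | Object. The example triple and the names `Triple` and `parse_triple` are made up for illustration; the shared task's actual data files may be formatted differently.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    obj: str


def parse_triple(line: str) -> Triple:
    """Split a 'Subject | Property | Object' string into its three fields."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return Triple(*parts)


# Hypothetical triple, not taken from the shared task data.
triple = parse_triple("Ohio_State_University | city | Columbus")
print(triple.subject, triple.prop, triple.obj)
```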

Clippers 9/3: Christian Clark on predicting human reading times using a Transformer model with a recency bias

Abstract:

Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown that surprisal values from Transformer models align with reading times better than those from alternative models such as RNNs. However, standard Transformers include a lossless representation of the entire previous linguistic context, a feature which makes them somewhat implausible as models of human cognition. To address this limitation, I test a Transformer variant which includes ALiBi, a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline.
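For readers unfamiliar with ALiBi, the recency bias can be sketched in a few lines of plain Python. Each attention head gets a fixed slope, and the score between a query at position i and an earlier key at position j is penalized in proportion to the distance i - j. The slope schedule below is the geometric sequence usually described for head counts that are powers of two; the function names are illustrative, and the configuration used in the talk is not stated in the abstract.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to attention scores before the softmax.

    bias[i][j] = -slope * (i - j) for keys at or before the query (j <= i);
    future keys (j > i) get -inf, which acts as the causal mask.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Heads with larger slopes attend mostly to nearby words, while heads with smaller slopes retain more of the distant context.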

Clippers 8/27: Amad Hussain on Synthetic Data for Social Needs Chatbot / Building KGQA for Social Determinants of Health and Sleep Behaviors

Title 1: Synthetic Data for Social Needs Chatbot

Abstract: In many cases, social needs resources (e.g. food pantries, financial assistance) go underutilized due to lack of accessibility. While certain websites, such as Findhelp.org, exist to improve accessibility through the aggregation and filtering of resources, a barrier still exists due to disparities in technical literacy and mismatches between patients’ descriptions of their experiences and the formal terminology. We seek to create a conversational agent which can bridge this accessibility barrier.

Due to patient data privacy concerns and server-side resource limitations, the patient-facing conversational system must be lightweight and must not rely on API calls. As such, we make use of knowledge transfer through synthetic conversation generation using LLMs for use in training a downstream model. To reflect different user experiences, we make use of patient profile schemas and categorical expansion.

Title 2: Building KGQA for Social Determinants of Health and Sleep Behaviors

Abstract: Social determinants of health (SDOH) are primarily encoded within free-text clinical notes rather than structured data fields, causing cohort identification to be relatively intractable. Likewise, sleep complaints, while occasionally leading to formal diagnoses, can be missed and solely embedded within free text descriptions. We intend to extract sleep characteristics and SDOH mentions within clinical notes to assist in cohort identification and correlation studies. The goal is to see how certain SDOH factors can relate to sleep concerns, especially in cases where underlying biases can lead to not having a diagnosis despite the presence of appropriate mentions.

While models exist for SDOH extraction, they largely work on public datasets and cannot necessarily be transferred to individual hospital systems. Likewise, sleep mentions are understudied and do not come with a large-scale dataset. To minimize the need for annotations, we leverage LLMs to extract these mentions using prompt-based, or lightly fine-tuned, methods. To then understand deeper relationships between these two factors, we seek to create a knowledge graph relating SDOH and sleep characteristics for a given cohort, allowing a physician to ask questions of these relations in a downstream KGQA system.

Clippers 4/16: Byung-Doh Oh on the bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

The bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

(Saarland University colloquium practice talk)

Surprisal estimates from Transformer-based large language models (LLMs) are often used to model expectation-based effects in human sentence processing, which are facilitations in processing driven by the predictability of each upcoming word. This talk presents a series of analyses showing that surprisal estimates from LLM variants that are bigger and are trained on more data are worse predictors of processing difficulty that manifests in human reading times. First, regression analyses show a strong inverse correlation between model size and fit to reading times across three LLM families on two separate datasets. An error analysis reveals a systematic deviation for the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words. Subsequently, LLM variants that vary in the amount of training data show that their surprisal estimates generally provide the best fit after seeing about two billion training tokens and begin to diverge with more training data. The adverse influence of model size also begins to emerge at this point and becomes stronger as training continues. Finally, based on recent findings on the scaling behavior of LLMs, word frequency is presented as a unified explanation for these two effects. The theoretical implications of these results will be discussed.
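As a reminder of the quantity being discussed, surprisal is the negative log probability of a word given its preceding context. The sketch below uses base-2 logarithms (bits); natural logarithms are also common. The probabilities are made up for illustration and do not come from any model in the talk.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical next-word probabilities assigned by a language model.
for word, prob in [("the", 0.5), ("cat", 0.25), ("Saarbrücken", 0.001)]:
    print(f"{word}: {surprisal(prob):.2f} bits")
```

A model that assigns a rare word (such as a named entity) a higher probability yields lower surprisal for it, which is the direction of the underprediction described in the abstract.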

Clippers 4/9: Christian Clark, Midwest Speech and Language Days practice talk

Grammar induction, the task of learning a set of syntactic rules from minimally annotated training data, can provide evidence about the mechanisms underlying children’s language acquisition. Recent work has achieved advances in the induction of probabilistic context-free grammars (PCFGs). However, less attention has been paid to inducing categorial grammars, despite their appealing properties such as a transparent syntax–semantics interface. Motivated by this, we introduce a new model for inducing a basic categorial grammar. The model attains comparable accuracy to state-of-the-art PCFG systems and learns from raw data without part-of-speech information, in contrast to earlier categorial grammar induction systems.
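As background, a basic categorial grammar combines categories with just two rules, forward and backward application. The sketch below illustrates those two rules only, using the result-first notation in which S\NP is a category that yields S after taking an NP on its left. It is not the induction model from the talk, and the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Functor:
    result: "Cat"
    slash: str  # "/" takes its argument on the right, "\\" on the left
    arg: "Cat"


Cat = Union[str, Functor]


def combine(left: Cat, right: Cat) -> Optional[Cat]:
    """Apply forward or backward application; return None if neither fits."""
    # Forward application: X/Y  Y  =>  X
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application: Y  X\Y  =>  X
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


VP = Functor("S", "\\", "NP")   # S\NP, e.g. an intransitive verb
TV = Functor(VP, "/", "NP")     # (S\NP)/NP, e.g. a transitive verb

verb_phrase = combine(TV, "NP")         # verb + object
sentence = combine("NP", verb_phrase)   # subject + verb phrase
print(sentence)
```

Because a word's category directly encodes what it combines with, the syntactic derivation lines up with function application in the semantics, which is the transparent syntax–semantics interface mentioned above.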

Clippers 4/2: Sara Court on Leveraging LLMs for Low-Resource Translation

This work investigates the in-context learning abilities of LLM foundation models when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. As case studies, I conduct a set of experiments using two language pairs, Inuktitut-English and Quechua-Spanish, and examine the informativity of various types of lexical and grammatical information retrieved from a constrained database of pedagogical materials (dictionaries and grammar lessons) as well as sentence-length examples retrieved from parallel corpora designed for traditional NLP tasks. Ablation studies that manipulate (1) context type (morpheme definitions, grammar lessons, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type (GPT-4, GPT 3.5 turbo, Llama2, and Gemini) suggest that even relatively small (7B) LLMs are capable of utilizing prompt context for zero-shot translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of database construction, retrieval method, model type, and linguistic structure highlight the limitations of even the best LLMs as standalone translation systems for the majority of the world’s 7,000+ languages and their speakers.

Clippers 3/26: Amad Hussain, A Review of RAPTOR: Can Tree-Organized Retrieval Improve a Virtual Museum Tour Guide

This week in Clippers (3/26) I will be presenting a review of the paper RAPTOR: Recursive Abstractive Processing For Tree-Organized Retrieval (https://arxiv.org/abs/2401.18059). This work seeks to semantically cluster passages within a corpus and hierarchically create summaries based upon these clusters. A retrieval system may then present the original passages or summaries to a downstream LLM for Retrieval-Augmented Generation (RAG). The authors present SOTA results on question-answering tasks, especially those requiring multi-step reasoning. In our talk, we will review RAPTOR and seek to explore how it, and other related retrieval solutions, can be applied to the existing Virtual Museum Tour Guide project in collaboration with COSI. This will basically be a brainstorming session following a paper review, so I am hoping for good discussion.

Clippers 3/19: Christian Clark on semantically aided categorial grammar induction

Studies of grammar induction are a source of evidence about the mechanisms underlying children’s language acquisition. Manipulating the prior knowledge and inductive biases of grammar inducers can yield insights about the learnability of syntactic structure under various assumptions about the learner. While early induction models often relied on annotated data, more recent models have made progress toward learning from raw data, working with both probabilistic context-free grammars and categorial grammars. Still, the accuracy of current systems falls well below that of human learners.

Incorporating world knowledge into grammar inducers is a potential path toward further improvement, one which is well motivated by psycholinguistic theory (e.g. semantic bootstrapping). Along these lines, I will present a categorial grammar inducer that incorporates semantic knowledge — implemented as association weights between predicate roles — into an existing syntax-only inducer. Associations can be distilled from large language models (LLMs), opening up possibilities not only for better grammar induction but also for exploration of the conceptual knowledge acquired by LLMs. This project is still a work in progress, but I will present some preliminary results on synthetic data and broad-coverage corpora.

Clippers 3/5: Alyssa Allen on SQL Query Explainability using Natural Language Generation

SQL Query Explainability using Natural Language Generation

This work is rooted in a larger project aimed at developing a dialogue system that helps increase transparency of database query outputs for non-expert SQL users. Previously, I’ve discussed processes for building a training set using few-shot prompting and a hand-annotated set of commented queries. Additionally, I’ve discussed test set results from LLMs (such as ChatGPT and Llama). This presentation will shift focus to the content of the natural language.

I’ll discuss the development of comment guidelines and the need for guidelines in standardizing the evaluation. Comment guidelines should ideally provide transparency in what constitutes a “good” comment. Comments should also 1) reflect certain properties of the relational database structure, 2) prioritize semantic fidelity to the query and 3) align with the user language wherever appropriate. The comment guidelines use these core elements to outline how generated natural language can increase explainability of database queries. Our methods will be compared to approaches that leverage templated or rule-based systems of explainability.