Posts

Clippers 3/4: Tomiris Kaumenova on language contact in language emergence studies

Language emergence studies have explored interaction among agents in a network, using game-theoretic approaches (e.g., Lewis signaling games) and reinforcement learning frameworks. Prior research has demonstrated that emergent languages exhibit compositionality (Chaabouni et al., 2020), linguistic conventions shaped by network structure (Lipowska & Lipowski, 2018), and population-driven changes such as improved generalization due to cultural transmission (Cogswell et al., 2019). However, these studies rely on simplified tasks and agents that are incapable of reproducing natural language interactions. Recent advances have expanded multi-agent modeling with large language models capable of producing natural language across a range of domains and tasks, including negotiation, consensus seeking, and problem solving (Guo et al., 2024; Sun et al., 2024). In the spirit of this work, I am brainstorming ideas for a project: I am curious to investigate language contact in a multi-agent setting in which the agents are language models that interact using natural language. I am interested in whether (1) agents develop hybrid languages similar to contact-induced language change among humans, (2) their communication strategies shift toward simplification or complexity over time, and (3) network topology influences linguistic change. This is a nascent idea, so all kinds of suggestions are welcome.
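To make the setup concrete, below is a minimal sketch of one possible simulation loop (Python); the query_llm wrapper, the ring topology, and the message format are illustrative placeholders rather than a committed design.

    import random

    def query_llm(agent_id, dialect_hint, partner_message):
        # Placeholder for an LLM call; a real implementation would condition a
        # language model on the agent's "native" dialect prompt and the
        # partner's last message.
        return f"[agent {agent_id}, dialect '{dialect_hint}'] replying to: {partner_message}"

    def ring_topology(n):
        """Each agent talks only to its immediate neighbors on a ring."""
        return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

    def simulate(n_agents=6, n_rounds=10, seed=0):
        random.seed(seed)
        dialects = ["A", "B"]  # two seed "languages" assigned alternately
        agents = {i: dialects[i % 2] for i in range(n_agents)}
        neighbors = ring_topology(n_agents)
        transcript = []
        for t in range(n_rounds):
            speaker = random.randrange(n_agents)
            listener = random.choice(neighbors[speaker])
            msg = query_llm(speaker, agents[speaker], f"turn {t}")
            reply = query_llm(listener, agents[listener], msg)
            transcript.append((t, speaker, listener, msg, reply))
        return transcript  # later: measure lexical mixing, simplification, etc.

    if __name__ == "__main__":
        for row in simulate()[:3]:
            print(row)

Swapping ring_topology for other graph structures (fully connected, small-world, core-periphery) would be the natural way to probe hypothesis (3).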

Clippers 2/25: Yi-Chien Lin on Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive ‘quality-power’ effect of (inverse) language model (LM) perplexity on the fit of psychometric predictors, which favors larger models. Recent experiments using these models’ negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models, but other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have shown improved fit to reading times as model perplexity decreases, favoring larger models again. However, the studies using model vectors introduce a potential confound: they simultaneously vary the number of predictors, which increases the number of degrees of freedom of the regression model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect.
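As a toy illustration of the degrees-of-freedom issue (separate from the methodology of the talk itself): regressing simulated reading times that are pure noise on a growing number of random predictors inflates in-sample fit on its own, which is why the number of predictors needs to be controlled or evaluated on held-out data.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500                              # number of simulated observations
    y = rng.normal(size=n)               # fake "reading times" (pure noise)

    def r_squared(X, y):
        """In-sample R^2 of ordinary least squares with an intercept."""
        X1 = np.column_stack([np.ones(len(X)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1 - resid.var() / y.var()

    for k in (2, 50, 200, 400):
        X = rng.normal(size=(n, k))      # random predictors carrying no signal
        print(f"{k:4d} random predictors -> in-sample R^2 = {r_squared(X, y):.3f}")
    # R^2 climbs toward 1 purely because the model gains degrees of freedom.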

Clippers 2/18: Ash Lewis on Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents

Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents

The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (the generation of false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in fine-tuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models’ outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias in the knowledge distillation setting and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized “I don’t know” responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.
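For readers unfamiliar with the two fine-tuning regimes being compared, the sketch below shows how the training pairs differ; the retrieval and answer functions are placeholders standing in for a real pipeline, not the system described above.

    # Minimal sketch of how the two fine-tuning datasets are constructed.
    def retrieve_passages(question):
        return ["<relevant manual passage>"]          # placeholder retriever

    def small_model_answer(question, passages):
        return "<small open model's own answer>"      # placeholder generation

    def strong_model_answer(question, passages):
        return "<stronger teacher model's answer>"    # placeholder generation

    def build_finetuning_data(questions, regime):
        data = []
        for q in questions:
            ctx = retrieve_passages(q)
            if regime == "self-training":
                target = small_model_answer(q, ctx)   # learn from own outputs
            elif regime == "knowledge-distillation":
                target = strong_model_answer(q, ctx)  # learn from teacher outputs
            else:
                raise ValueError(regime)
            data.append({"question": q, "context": ctx, "target": target})
        return data

    print(build_finetuning_data(["How do I reset the TV?"], "self-training")[0])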

Clippers 2/4: Sam Stevens on DSPy, compiling prompts, and similar work

As language models continue to evolve, the complexity of prompt engineering has grown in parallel. My talk examines the fundamental insights of DSPy through the lens of plib, a minimalist implementation that highlights key principles often overlooked in current LLM research. I argue that automated few-shot example selection can match or exceed carefully crafted zero-shot prompts, challenging the conventional wisdom of prompt engineering. The framework introduces a novel perspective on compute scaling in language models, suggesting “prompt compilation” as a fourth axis alongside pre-training, post-training, and inference-time computation. By treating prompt optimization as a reinforcement learning problem with verifiable rewards, plib offers a systematic approach to example selection. I argue that this style of thinking enables the decomposition of complex language tasks into modular sub-programs, a capability that proves challenging with traditional prompting methods. I will illustrate how many contemporary developments in LLM applications are natural extensions of principles already present in DSPy’s design, arguing for a renewed examination of these foundational ideas in the context of modern language model development.
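As a rough illustration of treating prompt compilation as example selection against a verifiable reward (not plib's or DSPy's actual API), the sketch below greedily picks few-shot demonstrations that maximize exact-match accuracy on a small development set; run_prompt is a placeholder for a real LLM call.

    def run_prompt(demos, question):
        """Placeholder for an LLM call given few-shot demos plus a question."""
        return "<model answer>"

    def reward(prediction, gold):
        """Verifiable reward: exact match against a known answer."""
        return float(prediction.strip() == gold.strip())

    def compile_prompt(candidate_pool, dev_set, k=4):
        """Greedily select up to k demonstrations that maximize dev-set reward."""
        selected = []
        for _ in range(k):
            best_demo, best_score = None, -1.0
            for demo in candidate_pool:
                if demo in selected:
                    continue
                trial = selected + [demo]
                score = sum(reward(run_prompt(trial, q), a)
                            for q, a in dev_set) / len(dev_set)
                if score > best_score:
                    best_demo, best_score = demo, score
            if best_demo is None:
                break
            selected.append(best_demo)
        return selected  # the "compiled" prompt: a fixed set of demonstrations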

Clippers 1/28: Cory Shain (Stanford) on language in the functional connectome of brains and models

Title: Language in the functional connectome of brains and models

Speaker: Cory Shain, Stanford Linguistics

Abstract: AI has turned into a complex systems science, much like neuroscience has always been. And increasingly, precision functional connectivity techniques in neuroscience are revealing that despite the daunting complexity of the human brain, there are natural “cuts” in the system, not just in terms of physiology, but in terms of cognitive function. In this talk, I will present recent work in the lab showing that one of those cuts is language. I will show evidence from an ongoing large-scale neuroimaging study (1200 participants) that an unsupervised technique for parcellating each participant’s brain into networks reliably discovers a frontotemporal network of interconnected regions that is highly selective for language in that individual. This network is both closely adjacent to multiple functionally distinct networks within individuals and “loosely tethered” (Vázquez-Rodríguez et al., 2019) to anatomy. I will further show that, within the network, three putatively distinct linguistic processes (lexical semantics, syntax, and combinatorial semantics) distribute broadly, rather than localizing to different hubs. Together with a growing body of other research, these results suggest that language is “nearly decomposable” (Simon, 1962) as an integrated network in the brain. I will sketch how the lab is now pursuing the implications of this insight for neuroscience, its possible translations to neurosurgery and neural engineering, and its potential relevance to AI theory and practice.

Clippers 1/21: Vishal Sunder on Advancing End-to-End Speech AI with Knowledge Transfer

Title: Advancing End-to-End Speech AI with Knowledge Transfer

Abstract:

My thesis explores end-to-end (E2E) approaches to improve speech AI by addressing limitations of cascaded systems, such as ASR error propagation and large, misaligned models. The thesis focuses on three key tasks: speech understanding, speech assessment, and joint speech recognition and synthesis, leveraging knowledge transfer (KT) from auxiliary sources like large language models (LLMs), dialog history, and related tasks.

For speech understanding, E2E models integrate semantic knowledge from LLMs for tasks like intent extraction and slot filling using tokenwise contrastive pretraining (TCP). This approach is extended to the RNN transducer (RNN-T) model to enhance ASR and spoken language understanding (SLU). Differentiable cascading of ASR and SLU incorporates intermediate non-autoregressive objectives, improving intent classification and slot filling across datasets. Additionally, dialog history is incorporated through hierarchical and conformer-based conversation models, enhancing dialog act classification.
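As a generic illustration of the tokenwise contrastive idea, and not the exact TCP formulation used in the thesis, the sketch below pairs each speech-side token embedding with the corresponding LLM token embedding under an InfoNCE-style loss.

    import torch
    import torch.nn.functional as F

    def tokenwise_contrastive_loss(speech_tok, text_tok, temperature=0.07):
        """InfoNCE-style loss pairing each speech token embedding with the LLM
        embedding of the corresponding text token (positives on the diagonal).
        speech_tok, text_tok: (num_tokens, dim) aligned token embeddings."""
        s = F.normalize(speech_tok, dim=-1)
        t = F.normalize(text_tok, dim=-1)
        logits = s @ t.T / temperature        # similarity of every speech/text pair
        targets = torch.arange(s.size(0))     # i-th speech token matches i-th text token
        return F.cross_entropy(logits, targets)

    # Toy usage with random embeddings standing in for model outputs.
    print(tokenwise_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())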

In speech assessment, two sub-problems are addressed: E2E disfluency detection/classification and real-time reading tracking for children. A hierarchical detection-classification (HiDeC) method mitigates class imbalance, while pointer-network models, trained on ASR alignment maps, track reading positions effectively.

For joint speech recognition and synthesis, a non-autoregressive multimodal framework processes speech and text inputs, independently or combined, and trains on unpaired datasets. Iterative refinement enhances performance, achieving competitive results in speech-to-text (STT) and text-to-speech (TTS) tasks.

These contributions advance robust E2E systems that are compact and resilient to ASR errors, bypassing cascaded approaches for efficient and effective speech AI.

Clippers 1/14: Christian Clark on Linear Recency Bias and Transformers’ Fit to Reading Times

Title:
Linear Recency Bias During Training Improves Transformers’ Fit to Reading Times

Abstract:
Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates from a Transformer that includes ALiBi during training and inference show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi’s mixture of slopes—which determine the rate of memory decay in each attention head—may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
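For reference, ALiBi adds a per-head linear penalty on query-key distance to the attention scores before the softmax; the sketch below (assuming a power-of-two number of heads, as in the common formulation) shows one standard way to construct the slopes and bias matrix.

    import numpy as np

    def alibi_slopes(n_heads):
        """Geometric sequence of per-head slopes (Press et al., 2022);
        assumes n_heads is a power of two."""
        start = 2.0 ** (-8.0 / n_heads)
        return np.array([start ** (i + 1) for i in range(n_heads)])

    def alibi_bias(n_heads, seq_len):
        """Bias added to attention scores: -slope * distance from query i to key j."""
        pos = np.arange(seq_len)
        distance = np.maximum(pos[:, None] - pos[None, :], 0)    # causal distances
        return -alibi_slopes(n_heads)[:, None, None] * distance  # (heads, query, key)

    bias = alibi_bias(n_heads=8, seq_len=5)
    print(bias[0])   # head 0 has the steepest slope, i.e., the fastest memory decay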

Clippers 11/26: Sara Court on the Limitations of LLMs for Low-Resource Translation

Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
https://arxiv.org/pdf/2406.15625

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world’s 7,000+ languages and their speakers.
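To make the prompting setup concrete, the sketch below assembles the three context types into a single translation prompt; the wording, field names, and example entries are illustrative placeholders rather than the prompts or retrieved materials used in the paper.

    def build_translation_prompt(source_sentence, morpheme_glosses,
                                 grammar_notes, parallel_examples):
        """Assemble retrieved linguistic context into one translation prompt."""
        parts = ["Translate the following Southern Quechua sentence into Spanish."]
        if morpheme_glosses:
            parts.append("Dictionary entries:\n" +
                         "\n".join(f"- {m}: {g}" for m, g in morpheme_glosses))
        if grammar_notes:
            parts.append("Grammar notes:\n" +
                         "\n".join(f"- {note}" for note in grammar_notes))
        if parallel_examples:
            parts.append("Examples:\n" +
                         "\n".join(f"- {src} -> {tgt}" for src, tgt in parallel_examples))
        parts.append(f"Sentence: {source_sentence}\nSpanish translation:")
        return "\n\n".join(parts)

    # Placeholder context entries, for illustration only.
    print(build_translation_prompt(
        "wasi-y-ta ri-ni",
        morpheme_glosses=[("wasi", "casa"), ("-y", "posesivo de primera persona"), ("ri-", "ir")],
        grammar_notes=["-ta marks the direct object (accusative)."],
        parallel_examples=[("wasi-ta riku-ni", "Veo la casa.")],
    ))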

Clippers 11/19: Yi-Chien Lin on Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Title: Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Description:

There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive ‘quality-power’ effect of (inverse) language model (LM) perplexity on the fit of psychometric predictors, which favors larger models. Recent experiments using these models’ negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models, but other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have shown improved fit to reading times as model perplexity decreases, favoring larger models again. However, the studies using model vectors introduce a potential confound: they simultaneously vary the number of predictors, which increases the number of degrees of freedom of the regression model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect. Work presented in this talk is ongoing.

Clippers 11/12: Tomiris Kaumenova on a synthetic dataset for developing a colonoscopy prep virtual assistant

In Clippers this week, I will do a dry run of my QP1 presentation. I will discuss our approach to constructing a synthetic dataset for developing a virtual assistant for colonoscopy preparation. The focus is on generating factually accurate but diverse dialogues between an AI Coach and a patient through prompt engineering with Llama 3.1 70B. In terms of factuality, I analyze errors in AI Coach responses across different prompting strategies: no few-shot examples, few-shot, and few-shot with chain-of-thought. For diversity, I compare theme-specific patient prompts with a “baseline” prompt using both automatic diversity metrics and manual evaluation. I would appreciate feedback on the structure and format of my presentation, as well as any questions that might help me prepare for a broader audience with backgrounds outside computational linguistics.
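As one concrete example of an automatic diversity measure that this kind of comparison can use (not necessarily the metrics reported in the QP itself), distinct-n computes the proportion of unique n-grams across the generated dialogues; duplicated or formulaic outputs push the score down.

    def distinct_n(texts, n=2):
        """Proportion of unique n-grams across a set of generated texts.
        Higher values indicate greater lexical diversity."""
        ngrams = []
        for text in texts:
            tokens = text.lower().split()
            ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

    dialogues = [
        "remember to drink only clear liquids today",
        "remember to drink only clear liquids today",   # a duplicate lowers the score
        "please avoid solid food the day before your procedure",
    ]
    print(f"distinct-2 = {distinct_n(dialogues, n=2):.2f}")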