Posts

Clippers 11/25: Micha Elsner on understanding public attitudes towards AI

This Tuesday, I will talk about some in-progress work on understanding how people feel about AI. Along with Sara Court, Emily Sagasser, and Galey Modan, I am analyzing the online discourse around the recent study “Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task” (https://arxiv.org/abs/2506.08872) by examining YouTube and Reddit reactions to it. We find that people online have a variety of pre-existing attitudes towards AI which shape their understanding of new information. Some people really hate AI, some are really excited about it, and many are in between, with various ways of discussing what makes for “good” or “bad” usage.

This talk will not involve much in the way of computational methods, but it may still be helpful for understanding how people out there in the world react to the research we do.

Clippers 11/18: Christian Clark on Improved Reading Time Predictions from Word-Level Contextual Entropy

Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, with evidence that the latter provides better predictions of human sentence processing difficulty. These results suggest a need for caution in using first-token approximations of contextual entropy.
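To make the contrast concrete, here is a minimal sketch (illustrative only, not the paper's implementation) of first-token entropy versus a Monte Carlo estimate of whole-word entropy; the model choice, sample counts, and the whitespace-based word-boundary heuristic are all assumptions for the example.

```python
# Minimal sketch contrasting first-token entropy with a Monte Carlo
# estimate of whole-word entropy (not the paper's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: any autoregressive LM with a BPE tokenizer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def first_token_entropy(context: str) -> float:
    """Entropy of the distribution over the next *subword token* only."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    log_p = torch.log_softmax(logits, dim=-1)
    return float(-(log_p.exp() * log_p).sum())

def mc_word_entropy(context: str, n_samples: int = 64, max_subwords: int = 8) -> float:
    """Monte Carlo estimate of entropy over whole next *words*:
    H(W) ~= -(1/N) * sum_i log p(w_i), where each sampled word w_i may
    span a variable number of subword tokens."""
    ids = tok(context, return_tensors="pt").input_ids
    total = 0.0
    for _ in range(n_samples):
        cur, log_pw = ids.clone(), 0.0
        for step in range(max_subwords):
            with torch.no_grad():
                logits = model(cur).logits[0, -1]
            log_p = torch.log_softmax(logits, dim=-1)
            nxt = torch.multinomial(log_p.exp(), 1)
            piece = tok.decode(nxt)
            # crude boundary heuristic for GPT-2-style BPE: a sampled piece
            # with a leading space starts a new word, so the current word ends
            if step > 0 and piece.startswith(" "):
                break
            log_pw += float(log_p[nxt])
            cur = torch.cat([cur, nxt.view(1, 1)], dim=1)
        total += -log_pw
    return total / n_samples

ctx = "The researchers measured reading times at the"
print(first_token_entropy(ctx), mc_word_entropy(ctx, n_samples=16))
```

The two estimates diverge most for words the tokenizer splits into several pieces, which is where the abstract argues the first-token approximation underestimates true word entropy.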

Clippers 11/4: Sara Court on Responsible Applications of LLMs for Low-Resource Machine Translation

Responsible Applications of LLMs for Low-Resource Machine Translation

This dissertation lays the groundwork for a machine learning (ML) framework intended to facilitate community-driven development of language technologies and NLP applications. Inspired by morphological theories of the mental lexicon, language learning pedagogy, and human translation methods, the framework implements an LLM-based text generation pipeline that combines Retrieval Augmented Generation (RAG) with multistep agentic loops for quality estimation and refinement. The dissertation will also describe a set of minimal specifications for the construction and maintenance of a digital database used to steer generation performance. Maltese-English machine translation (MT) serves as a case study to empirically assess a selection of post-training methods for adapting an LLM for use in a particular language or domain for which its parametric knowledge is insufficient. The proposed framework is designed with language-agnostic principles in mind, and the dissertation will specify how to configure the framework with respect to a specific community’s self-defined language and usage conventions. Ablation studies will analyze the relative contributions of the pipeline’s component parts and their effects on downstream generation performance. Human analysis of model outputs will also clarify and document the risks and limitations associated with the proposed methods. Ultimately, this dissertation aims to support infrastructure for collaborative and “bottom-up” methods of developing modern language technologies. As the use of LLM-based tools becomes increasingly normalized in many aspects of our daily lives, such initiatives are critical to ensure that participation in the data-driven future may be accessible to speakers of all of the world’s languages — not just those of a select few.
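As a rough illustration of the kind of pipeline described above, the sketch below combines retrieval with an agentic quality-estimation and refinement loop; the `retrieve` and `llm` interfaces, the prompts, and the threshold are placeholders rather than the dissertation's actual implementation.

```python
# Hypothetical sketch of a RAG + quality-estimation refinement loop for
# low-resource MT (Maltese -> English). `retrieve` and `llm` stand in for
# whatever retriever / LLM interface is actually used.
from typing import Callable, List

def translate_with_refinement(
    source: str,
    retrieve: Callable[[str, int], List[str]],  # returns curated dictionary / parallel examples
    llm: Callable[[str], str],                  # returns a completion for a prompt
    max_rounds: int = 3,
    quality_threshold: float = 0.8,
) -> str:
    examples = retrieve(source, 5)  # ground generation in the community-maintained database
    context = "\n".join(examples)
    draft = llm(f"Using these Maltese-English examples:\n{context}\n\nTranslate: {source}")

    for _ in range(max_rounds):
        # agentic quality-estimation step: score the draft and describe problems
        critique = llm(
            f"Source (Maltese): {source}\nDraft (English): {draft}\n"
            "Rate adequacy and fluency from 0 to 1 and list errors, as 'SCORE: x\\nERRORS: ...'"
        )
        # assumes the critique follows the requested 'SCORE: x' format
        score = float(critique.split("SCORE:")[1].split()[0])
        if score >= quality_threshold:
            break
        # refinement step conditioned on the critique and the retrieved examples
        draft = llm(
            f"Examples:\n{context}\n\nSource: {source}\nDraft: {draft}\n"
            f"Critique: {critique}\nRevise the translation accordingly:"
        )
    return draft
```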

Clippers 10/21: Hanane Moussa and Patrick Da Silva on Research Idea Evaluation Grounded in Literature

As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas. We introduce SCHOLAREVAL, a retrieval-augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness—the empirical validity of proposed methods based on existing literature, and contribution—the degree of advancement made by the idea across different dimensions relative to prior research. To evaluate SCHOLAREVAL, we introduce SCHOLARIDEAS, the first expert-annotated dataset of multi-domain research ideas and reviews, comprising 117 ideas across four disciplines: artificial intelligence, neuroscience, biochemistry, and ecology. Our evaluation shows that SCHOLAREVAL achieves significantly higher coverage of the points mentioned in the expert-annotated rubrics in SCHOLARIDEAS compared to all baselines. Furthermore, SCHOLAREVAL is consistently preferred over our strongest baseline, o4-mini-deep-research (a reasoning- and search-enabled agentic system by OpenAI), in terms of evaluation actionability, depth, and evidence support. Our large-scale user study also shows that SCHOLAREVAL significantly outperforms deep research in literature engagement, idea refinement, and usefulness. We openly release our code, dataset, and SCHOLAREVAL tool for the community to use and build on.
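A schematic sketch of retrieval-augmented idea evaluation along the two criteria above (not the released SCHOLAREVAL code); the `search_papers` and `llm` interfaces and the prompt wording are assumptions for illustration.

```python
# Schematic sketch of retrieval-augmented evaluation of a research idea
# along soundness and contribution (not the released SCHOLAREVAL code).
from typing import Callable, Dict, List

def evaluate_idea(
    idea: str,
    search_papers: Callable[[str, int], List[str]],  # returns abstracts/snippets of related work
    llm: Callable[[str], str],
) -> Dict[str, str]:
    literature = "\n\n".join(search_papers(idea, 10))
    soundness = llm(
        "Given the related work below, assess the SOUNDNESS of the proposed idea: "
        "is each proposed method empirically supported or contradicted by prior findings? "
        f"Cite specific papers.\n\nRelated work:\n{literature}\n\nIdea:\n{idea}"
    )
    contribution = llm(
        "Given the same related work, assess the CONTRIBUTION of the idea: along which "
        "dimensions (task, method, data, theory) does it advance beyond prior work?"
        f"\n\nRelated work:\n{literature}\n\nIdea:\n{idea}"
    )
    return {"soundness": soundness, "contribution": contribution}
```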

Clippers 10/14: Sara Court on COLM conference and Low-Resource MT

This week in Clippers I’ll share some of my experiences from attending the 2nd Conference on Language Modeling (COLM) in Montreal last week. I intend to cover some “professionalization” topics that I hope can be helpful to new grads in CL/NLP, as well as some personal impressions regarding which themes seemed to be most prevalent at the conference — i.e., my (extremely subjective) takeaways on what constitutes the current “cutting edge” in language modeling research, based on what I saw and heard last week.

I’ll also share a bit on what I’ve been working on for my dissertation, in which I’m exploring methods for adapting LLMs to low-resource domains using Maltese-English machine translation (MT) as a case study.

Clippers 10/7: Alex Petrov on AGI from a Humanist Perspective

AGI from a Humanist Perspective

AGI is no ordinary technology. One of the several things that make it extraordinary is its potential to destabilize cherished beliefs that humans have about themselves and their place in the grand scheme of things. For example, Rene Descartes famously proclaimed “Cogito ergo sum!” and defined himself as a thinking substance (“res cogitans”). He believed it was our God-given capacity to think that set us apart from the animals, which in his view were mere machines. In a similar vein, Blaise Pascal wrote: “Man is a thinking reed, the most feeble thing in nature; but he is a thinking reed. […] The dignity of man consists in thought.”

What happens, then, if a machine comes along that can outsmart us at our own game? One easy way out is to pretend it isn’t happening. Alan Turing (1950) called this “The Heads in the Sand Objection”: “The consequences of machines thinking would be too dreadful. Let us hope and believe that they cannot do so.” Turing didn’t think this argument was sufficiently substantial to require refutation. He wrote that consolation would be more appropriate and then added, enigmatically: “perhaps this should be sought in the transmigration of souls.”

These examples illustrate that there is a lot more to the value alignment problem than meets the eye. Alignment is a two-way street. Not only do we confront the engineering challenge of making AGIs harmless, honest, helpful, etc.; we also face the challenge of re-telling our cultural narratives to accommodate the presence of AGIs in our midst.

This talk is about the second challenge. I do not pretend to have an answer or even elements of an answer. Rather, I present an opinionated sample of some prominent ideas in the Western cultural tradition, and re-interpret them for the age of AGI.

Clippers 9/23: Ash Lewis on Detection and Mitigation of Hallucination in AI Dialogues

Hallucination in AI Dialogues: Detection and Mitigation

Large Language Models (LLMs) excel at generating fluent language but remain vulnerable to producing false or misleading outputs, commonly referred to as hallucinations. This presentation explores the nature of hallucinations in dialogue systems, why they emerge, and why they matter in high-stakes applications. I review current strategies for detecting hallucinations, including human evaluation, LLM-as-judge methods, uncertainty estimation, and fact-checking techniques such as FActScore. I also introduce VISTA Score, a new framework for sequential, turn-based verification that improves consistency and factuality in conversational settings. Building on these detection methods, I outline complementary approaches for mitigating hallucinations, from retrieval-augmented generation to evaluation pipelines that encourage abstention when confidence is low. Through examples from my virtual museum tour guide project, I demonstrate how combining detection and mitigation strategies can lead to more trustworthy and reliable dialogue systems.
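As a toy illustration of pairing detection with abstention-based mitigation in a retrieval-augmented dialogue turn (not VISTA Score or the museum-guide system itself), the sketch below uses placeholder `retrieve`, `generate`, and `verify` interfaces and an assumed support threshold.

```python
# Illustrative sketch: verify a draft answer against retrieved evidence and
# abstain when support is low (placeholder interfaces, assumed threshold).
from typing import Callable, List

def answer_turn(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    verify: Callable[[str, List[str]], float],  # support score in [0, 1], e.g. NLI or LLM judge
    abstain_threshold: float = 0.6,
) -> str:
    evidence = retrieve(question, 5)
    draft = generate(
        "Answer using ONLY the evidence below; say so if the evidence is insufficient.\n\n"
        + "\n".join(evidence) + f"\n\nQuestion: {question}"
    )
    # turn-level verification: how well are the draft's claims supported
    # by the retrieved evidence?
    support = verify(draft, evidence)
    if support < abstain_threshold:
        # mitigation by abstention: prefer admitting uncertainty over
        # asserting unsupported content
        return "I'm not confident I can answer that accurately from what I know."
    return draft
```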

Clippers 9/16: Yi-Chien Lin on Vectors from Larger Language Models Predict Human Reading Time and fMRI Data More Poorly when Dimensionality Expansion is Controlled

The impressive linguistic abilities of large language models (LLMs) have recommended them as models of human sentence processing, with some conjecturing a positive ‘quality-power’ relationship (Wilcox et al., 2023), in which language models’ (LMs’) fit to psychometric data continues to improve as their ability to predict words in context increases. This is important because it suggests that elements of LLM architecture, such as veridical attention to context and a unique objective of predicting upcoming words, reflect the architecture of the human sentence processing faculty, and that any inadequacies in predicting human reading time and brain imaging data may be attributed to insufficient model complexity, which recedes as larger models become available. Recent studies (Oh and Schuler, 2023) have shown this scaling inverts after a point, as LMs become excessively large and accurate, when word prediction probability (as information-theoretic surprisal) is used as a predictor. Other studies propose the use of entire vectors from differently sized LLMs, still showing positive scaling (Schrimpf et al., 2021), casting doubt on the value of surprisal as a predictor, but do not control for the larger number of predictors in vectors from larger LMs. This study evaluates LLM scaling using entire LLM vectors, while controlling for the larger number of predictors in vectors from larger LLMs. Results show that inverse scaling obtains, suggesting that inadequacies in predicting human reading time and brain imaging data may be due to substantial misalignment between LLMs and human sentence processing, which worsens as larger models are used.
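One way to picture the dimensionality control described above (a simplification, not the study's actual regression setup) is to project every model's vectors down to the same number of predictors before fitting, so that differences in fit cannot be driven by predictor count alone.

```python
# Schematic of controlling for predictor count when comparing LLMs of
# different sizes (a simplification, not the study's actual setup):
# reduce each model's hidden states to the same fixed dimensionality
# before regressing reading times.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def controlled_fit(hidden_states: np.ndarray, reading_times: np.ndarray, k: int = 50) -> float:
    """hidden_states: (n_words, d_model) vectors from one LLM;
    reading_times: (n_words,) psychometric measure.
    Returns mean cross-validated R^2 using exactly k predictors."""
    reduced = PCA(n_components=k).fit_transform(hidden_states)
    return cross_val_score(LinearRegression(), reduced, reading_times,
                           scoring="r2", cv=5).mean()

# Example: compare a small and a large model on equal footing (k predictors each)
# small_r2 = controlled_fit(small_model_vectors, rts, k=50)
# large_r2 = controlled_fit(large_model_vectors, rts, k=50)
```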

Clippers 9/9: Tomiris Kaumenova on a Self-trained Evaluator for Persona Adherence

Large Language Models (LLMs) are used to simulate role-specific interactions, such as doctor–patient dialogues, but they often drift away from their assigned personas in longer conversations. This raises issues for controllability, consistency, and safety, especially in the healthcare domain. As part of the JSALT 2025 workshop, I focused on building a self-trained evaluator for persona adherence in doctor–patient dialogues. Instead of relying on costly human annotation or large closed models, this approach iteratively trains smaller open-source models on contrastive synthetic data, created by generating matched and minimally altered (unmatched) personas. In this Clippers talk, I’ll walk through the approach, share some promising results, and outline directions for where this work is headed next.
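A hypothetical sketch of how such contrastive synthetic pairs might be constructed (the prompts and `llm` interface are placeholders, not the JSALT implementation): a dialogue paired with its true persona is a positive example, and the same dialogue paired with a minimally altered persona is a negative one.

```python
# Hypothetical sketch of building contrastive training pairs for a
# persona-adherence evaluator (placeholder prompts and interfaces).
from typing import Callable, Dict, List

def build_contrastive_pairs(
    personas: List[str],
    llm: Callable[[str], str],
    n_turns: int = 10,
) -> List[Dict[str, object]]:
    pairs = []
    for persona in personas:
        dialogue = llm(
            f"Simulate a {n_turns}-turn doctor-patient dialogue in which the patient "
            f"strictly follows this persona:\n{persona}"
        )
        # minimal perturbation: change exactly one attribute (e.g., age, symptom, habit)
        altered = llm(f"Change exactly one attribute of this persona:\n{persona}")
        pairs.append({"dialogue": dialogue, "persona": persona, "label": 1})  # matched
        pairs.append({"dialogue": dialogue, "persona": altered, "label": 0})  # unmatched
    return pairs

# These pairs can then be used to fine-tune a small open-source model as the
# adherence judge, and the improved judge can label or filter new synthetic
# data for the next self-training round.
```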

Clippers 9/2: Lara Downing (CRIS Ohio), “Emergencies don’t discriminate, and neither should your response system”: dialogue on the use of machine translation in high-stakes contexts

“Emergencies don’t discriminate, and neither should your response system”: dialogue on the use of machine translation in high-stakes contexts

Lara Downing
Program Manager, Victims of Crime Assistance Program
Community Refugee & Immigration Services (CRIS)

Abstract:

Machine translation has proliferated across many sectors of society, including high-stakes domains such as policing, healthcare, courts, and emergency communication centers. MT is adopted for its relatively low cost, ease of use, fast response time, and broad language options, but overreliance on it without human interpreters raises urgent questions about accuracy, accountability, informed consent, privacy, and language rights. Public and nonprofit sector decision makers often face funding cuts, mounting federal pressure, lack of technical expertise, and limited guidance that is accessible, data-driven, and from independent sources when incorporating AI products into their language access plans.

Using real-world use cases from her role as a social worker at a local immigrant services organization, Lara Downing will focus on the social impact of automated translation on marginalized communities. She will then invite attendees to share their perspectives. What role might academic researchers play in evaluating MT use in the wild? How might researchers contribute to a clearer understanding of MT’s potential and its limits among the public and key decision makers? What frameworks can guide its responsible application while mitigating the risks of critical miscommunications, erosion of due process rights, amplification of inequity, waste of public resources, and misuse of sensitive data? By bridging social work practice with computational linguistics, Lara aims to foster dialogue on safeguarding linguistic rights while shaping a more ethically grounded trajectory for translation technologies.