Clippers 9/23: Ash Lewis on Detection and Mitigation of Hallucination in AI Dialogues

Hallucination in AI Dialogues: Detection and Mitigation

Large Language Models (LLMs) excel at generating fluent language but remain vulnerable to producing false or misleading outputs, commonly referred to as hallucinations. This presentation explores the nature of hallucinations in dialogue systems, why they emerge, and why they matter in high-stakes applications. I review current strategies for detecting hallucinations, including human evaluation, LLM-as-judge methods, uncertainty estimation, and fact-checking techniques such as FActScore. I also introduce VISTA Score, a new framework for sequential, turn-based verification that improves consistency and factuality in conversational settings. Building on these detection methods, I outline complementary approaches for mitigating hallucinations, from retrieval-augmented generation to evaluation pipelines that encourage abstention when confidence is low. Through examples from my virtual museum tour guide project, I demonstrate how combining detection and mitigation strategies can lead to more trustworthy and reliable dialogue systems.
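
For concreteness, below is a minimal sketch of the kind of sequential, turn-based verification with abstention that the abstract describes, assuming user-supplied claim extraction and verification functions. This is only an illustration of the general idea, not the VISTA Score implementation; extract_claims, verify_claim, and the abstention threshold are hypothetical placeholders.

# Hypothetical sketch: after each assistant turn, extract atomic claims,
# check them against accumulated sources, and abstain when support is low.
from typing import Callable, Dict, List

def turn_based_factuality(
    assistant_turns: List[str],
    sources: List[str],
    extract_claims: Callable[[str], List[str]],
    verify_claim: Callable[[str, List[str]], float],
    abstain_threshold: float = 0.5,
) -> List[Dict]:
    """Score each assistant turn by the average support of its claims."""
    results = []
    for turn in assistant_turns:
        claims = extract_claims(turn)           # atomic claims in this turn
        if not claims:                          # nothing checkable: no score
            results.append({"turn": turn, "score": None, "abstain": False})
            continue
        support = [verify_claim(c, sources) for c in claims]  # each in [0, 1]
        score = sum(support) / len(support)
        results.append({
            "turn": turn,
            "score": score,
            "abstain": score < abstain_threshold,  # flag low-confidence turns
        })
    return results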

Clippers 9/16: Yi-Chien Lin on Vectors from Larger Language Models Predict Human Reading Time and fMRI Data More Poorly when Dimensionality Expansion is Controlled

The impressive linguistic abilities of large language models (LLMs) have recommended them as models of human sentence processing, with some conjecturing a positive ‘quality-power’ relationship (Wilcox et al., 2023), in which language models’ (LMs’) fit to psychometric data continues to improve as their ability to predict words in context increases. This is important because it suggests that elements of LLM architecture, such as veridical attention to context and a unique objective of predicting upcoming words, reflect the architecture of the human sentence processing faculty, and that any inadequacies in predicting human reading time and brain imaging data may be attributed to insufficient model complexity, which recedes as larger models become available. Recent studies (Oh and Schuler, 2023) have shown this scaling inverts after a point, as LMs become excessively large and accurate, when word prediction probability (as information-theoretic surprisal) is used as a predictor. Other studies propose the use of entire vectors from differently sized LLMs, still showing positive scaling (Schrimpf et al., 2021), casting doubt on the value of surprisal as a predictor, but do not control for the larger number of predictors in vectors from larger LMs. This study evaluates LLM scaling using entire LLM vectors, while controlling for the larger number of predictors in vectors from larger LLMs. Results show that inverse scaling obtains, suggesting that inadequacies in predicting human reading time and brain imaging data may be due to substantial misalignment between LLMs and human sentence processing, which worsens as larger models are used.
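
As an illustration of what controlling for predictor count can look like in practice, the sketch below projects every model’s word vectors down to the same number of components before regressing onto reading times. The function name, the choice of PCA, and n_components=64 are assumptions made for illustration, not the study’s actual procedure.

# Illustrative control for predictor count: reduce vectors from every LM
# (regardless of size) to the same dimensionality before regression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def fit_with_fixed_predictors(vectors: np.ndarray,
                              reading_times: np.ndarray,
                              n_components: int = 64) -> float:
    """Cross-validated R^2 for reading times from a fixed number of predictors."""
    reduced = PCA(n_components=n_components).fit_transform(vectors)
    return cross_val_score(LinearRegression(), reduced, reading_times,
                           scoring="r2", cv=5).mean()

# Usage (hypothetical): compare LMs of different sizes on the same data, e.g.
# scores = {name: fit_with_fixed_predictors(v, rts) for name, v in lm_vectors.items()}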

Clippers 9/9: Tomiris Kaumenova on a Self-trained Evaluator for Persona Adherence

Large Language Models (LLMs) are used to simulate role-specific interactions, such as doctor–patient dialogues, but they often drift away from their assigned personas in longer conversations. This raises issues for controllability, consistency, and safety, especially in the healthcare domain. As part of the JSALT 2025 workshop, I focused on building a self-trained evaluator for persona adherence in doctor–patient dialogues. Instead of relying on costly human annotation or large closed models, this approach iteratively trains smaller open-source models on contrastive synthetic data, created by generating matched and minimally altered (unmatched) personas. In this Clippers talk, I’ll walk through the approach, share some promising results, and outline where this work is headed next.
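
As a rough sketch of how such contrastive synthetic data might be constructed, the snippet below pairs each dialogue once with its true persona and once with a minimally altered one. The persona representation and the alteration rule are illustrative placeholders rather than the workshop’s actual pipeline.

# Hypothetical construction of contrastive training pairs for a
# persona-adherence evaluator: label 1 = matched, label 0 = unmatched.
import random
from typing import Dict, List, Tuple

def make_contrastive_pairs(
    dialogues: List[str],
    personas: List[Dict[str, str]],
    seed: int = 0,
) -> List[Tuple[str, Dict[str, str], int]]:
    rng = random.Random(seed)
    pairs = []
    for dialogue, persona in zip(dialogues, personas):
        pairs.append((dialogue, persona, 1))            # matched persona
        altered = dict(persona)
        field = rng.choice(list(altered))               # perturb one attribute
        altered[field] = altered[field] + " (altered)"  # small, local change
        pairs.append((dialogue, altered, 0))            # unmatched persona
    return pairs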

Clippers 9/2: Lara Downing (CRIS Ohio), “Emergencies don’t discriminate, and neither should your response system”: dialogue on the use of machine translation in high-stakes contexts

“Emergencies don’t discriminate, and neither should your response system”: dialogue on the use of machine translation in high-stakes contexts

Lara Downing
Program Manager, Victims of Crime Assistance Program
Community Refugee & Immigration Services (CRIS)

Abstract:

Machine translation has proliferated across many sectors of society, including high-stakes domains such as policing, healthcare, courts, and emergency communication centers. MT is adopted for its relatively low cost, ease of use, fast response times, and broad language coverage, but overreliance on it without human interpreters raises urgent questions about accuracy, accountability, informed consent, privacy, and language rights. When incorporating AI products into their language access plans, public and nonprofit sector decision makers often face funding cuts, mounting federal pressure, a lack of technical expertise, and little guidance that is accessible, data driven, and independent.

Using real-world use cases from her role as a social worker at a local immigrant services organization, Lara Downing will focus on the social impact of automated translation on marginalized communities. She will then invite attendees to share their perspectives. What role might academic researchers play in evaluating MT use in the wild? How might researchers contribute to a clearer understanding of MT’s potential and its limits, both among the public and among key decision makers? What frameworks can guide its responsible application while mitigating the risks of critical miscommunications, erosion of due process rights, amplification of inequity, waste of public resources, and misuse of sensitive data? By bridging social work practice with computational linguistics, Lara’s goal is to foster dialogue on safeguarding linguistic rights while shaping a more ethically grounded trajectory for translation technologies.