Clippers 11/26: Sara Court on the Limitations of LLMs for Low-Resource Translation

Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
https://arxiv.org/pdf/2406.15625

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world’s 7,000+ languages and their speakers.
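To make the pipeline described above concrete, here is a minimal sketch (not the authors' code) of how retrieved linguistic context of the three types mentioned — morpheme translations, grammar descriptions, and corpus examples — might be assembled into a single translation prompt. All data, names, and glosses below are illustrative placeholders, and the retrieval step itself is omitted.

```python
# Minimal sketch (not the paper's implementation) of assembling a
# retrieval-augmented translation prompt from dictionary, grammar, and
# parallel-corpus context. All example data are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class RetrievedContext:
    morpheme_glosses: list[str]                # dictionary / morpheme translations
    grammar_notes: list[str]                   # snippets from grammar lessons
    parallel_examples: list[tuple[str, str]]   # (Quechua, Spanish) sentence pairs


def build_prompt(source_sentence: str, ctx: RetrievedContext) -> str:
    """Combine retrieved linguistic context with the sentence to translate."""
    lines = ["Translate the following Southern Quechua sentence into Spanish.", ""]
    if ctx.morpheme_glosses:
        lines.append("Morpheme glosses:")
        lines += [f"- {g}" for g in ctx.morpheme_glosses]
    if ctx.grammar_notes:
        lines.append("Grammar notes:")
        lines += [f"- {n}" for n in ctx.grammar_notes]
    if ctx.parallel_examples:
        lines.append("Example translations:")
        lines += [f"- {q} -> {s}" for q, s in ctx.parallel_examples]
    lines += ["", f"Sentence: {source_sentence}", "Spanish translation:"]
    return "\n".join(lines)


# Toy usage; in the paper this context is retrieved (automatically or
# manually) from digitized pedagogical materials and parallel corpora.
ctx = RetrievedContext(
    morpheme_glosses=["wasi = casa", "-kuna = plural marker"],
    grammar_notes=["Verbs typically appear sentence-finally."],
    parallel_examples=[("Wasikuna hatun kanku.", "Las casas son grandes.")],
)
print(build_prompt("Wasikunata rikuni.", ctx))
```

The ablations in the paper vary which of these context types appear in the prompt, how they are retrieved (automatically vs. manually), and which model receives the prompt.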

Clippers 11/19: Yi-Chien Lin on Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Title: Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Description:

There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive 'quality-power' effect whereby lower language model (LM) perplexity yields better psychometric prediction, which favors larger models. Recent experiments using these models' negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models, but other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have shown improved fit to reading times as model perplexity decreases, favoring larger models again. However, the studies using model vectors introduce a potential confound: they simultaneously vary the number of predictors, which increases the number of degrees of freedom of the model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect. Work presented in this talk is ongoing.
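The degrees-of-freedom point can be illustrated with a small sketch (not the talk's actual analysis): fitting the same noisy "reading times" with a single surprisal predictor versus hundreds of LM vector dimensions. All data below are random placeholders; the point is only that a regression with many more predictors achieves a better in-sample fit even when the predictors carry no signal.

```python
# Minimal sketch (not the talk's analysis) of the number-of-predictors
# confound: one surprisal predictor vs. many LM vector dimensions.
# All data here are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_words = 500

reading_times = rng.normal(300, 50, size=n_words)   # fake psychometric data
surprisal = rng.normal(8, 2, size=n_words)          # -log P(word | context), fake
lm_vectors = rng.normal(size=(n_words, 256))        # fake hidden-state predictors


def fit_r2(X: np.ndarray, y: np.ndarray) -> float:
    """Ordinary least squares fit; returns in-sample R^2."""
    X = np.column_stack([np.ones(len(X)), X])        # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()


# One predictor vs. 256 predictors: the latter consumes far more degrees of
# freedom, so its in-sample fit improves even on pure noise, which is the
# confound at issue when comparing surprisal-based and vector-based studies.
print("surprisal predictor R^2:", round(fit_r2(surprisal[:, None], reading_times), 3))
print("vector predictors  R^2:", round(fit_r2(lm_vectors, reading_times), 3))
```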

Clippers 11/12: Tomiris Kaumenova on a synthetic dataset for developing a colonoscopy prep virtual assistant

In Clippers this week, I will do a dry run of my QP1 presentation. I will discuss our approach to constructing a synthetic dataset for developing a virtual assistant for colonoscopy preparation. The focus is on generating factually accurate yet diverse dialogues between an AI Coach and a patient through prompt engineering with Llama 3.1 70B. For factuality, I analyze errors in AI Coach responses across three prompt strategies: no few-shot examples, few-shot examples, and few-shot examples with chain-of-thought. For diversity, I compare theme-specific patient prompts with a “baseline” prompt using both diversity metrics and manual evaluation. I would appreciate feedback on the structure and format of my presentation, as well as any questions that might help me prepare for a broader audience with backgrounds outside CL.
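For readers unfamiliar with the three prompt strategies being compared, here is a minimal sketch of how they might differ in prompt construction. The system instruction, few-shot example, and chain-of-thought cue below are hypothetical and are not the study's actual prompts; only the message assembly is shown, not the call to the model.

```python
# Minimal sketch (not the study's actual prompts) of the three prompt
# strategies compared above. All instructions and examples are hypothetical.

SYSTEM = ("You are an AI Coach answering a patient's questions about "
          "colonoscopy preparation. Answer factually and concisely.")

# Hypothetical few-shot example pair (question, answer).
FEW_SHOT_EXAMPLES = [
    ("Can I drink coffee the day before?",
     "Black coffee without milk or creamer is usually allowed on the "
     "clear-liquid day, but follow your clinic's instructions."),
]


def build_messages(question: str, strategy: str) -> list[dict]:
    """Assemble a chat-style message list for a given prompt strategy."""
    messages = [{"role": "system", "content": SYSTEM}]
    if strategy in ("few_shot", "few_shot_cot"):
        for q, a in FEW_SHOT_EXAMPLES:
            messages.append({"role": "user", "content": q})
            messages.append({"role": "assistant", "content": a})
    user = question
    if strategy == "few_shot_cot":
        user += "\nThink through the prep instructions step by step before answering."
    messages.append({"role": "user", "content": user})
    return messages


# Usage: each message list would be sent to Llama 3.1 70B through whatever
# inference interface is available; only prompt construction is sketched here.
for s in ("no_few_shot", "few_shot", "few_shot_cot"):
    msgs = build_messages("What can I eat two days before the procedure?", s)
    print(s, "->", len(msgs), "messages")
```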