Clippers 11/26: Sara Court on the Limitations of LLMs for Low-Resource Translation

Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
https://arxiv.org/pdf/2406.15625

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world’s 7,000+ languages and their speakers.

Clippers 11/19: Yi-Chien Lin on Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Title: Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Description:

There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive ‘quality-power’ effect of (inverse) language model (LM) perplexity on psychometric predictors, which favors larger models. Recent experiments using these models’ negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models, but other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have shown improved fit to reading times as model perplexity decreases, favoring larger models again. However, these studies using model vectors introduce a potential confound in that they also simultaneously vary the number of predictors, which increases the number of degrees of freedom of the model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect. Work presented in this talk is ongoing.

Clippers 11/12: Tomiris Kaumenova on a synthetic dataset for developing a colonoscopy prep virtual assistant

In Clippers this week, I will dry run my QP1 presentation. I will discuss our approach to constructing a synthetic dataset for developing a virtual assistant for colonoscopy preparation. The focus is on generating factually accurate but diverse dialogues between an AI Coach and a patient through prompt engineering with Llama 3.1 70B. In terms of factuality, I analyze errors in AI Coach responses across different prompt strategies: no few-shot, few-shot, and few-shot with chain-of-thought. For diversity, I compare theme-specific patient prompts with a “baseline” prompt using both diversity metrics and manual evaluation. I would appreciate feedback on the structure and format of my presentation, as well as any questions that might help me prepare for a broader audience with backgrounds other than CL.

Clippers 10/29: Ash Lewis on the COSI Virtual Museum Tour Guide

This presentation outlines the development, challenges, and future plans for a virtual museum tour guide for the COSI Language Pod. Originally derived from the Virtual Patient project, the guide initially relied on a static question-answering system that required frequent retraining and could answer only a limited set of questions. The transition to a more dynamic, retrieval-augmented generation (RAG) model aims to increase responsiveness, robustness, and resource efficiency, with minimal dependency on costly, corporate AI systems. Key development phases include leveraging open-source, mid-sized LLMs and knowledge distillation techniques to balance robustness and control.  Key enhancements include exploring  retrieval methods, adapting models for multilingual interactions, and ensuring safe, confabulation-free outputs. Future steps involve reducing hallucinations further through contrastive and reinforcement learning and exploring potential adaptations for similar projects.

Clippers 10/22: David Palzer on End-to-End Neural Diarization

Title: Simplifying End-to-End Neural Diarization: Generic Speaker Attractors are Enough

Abstract: In this work, we propose a simplified approach to neural speaker diarization by removing the Encoder-Decoder Attractor (EDA) mechanism and replacing it with a linear layer. This modification significantly reduces the model’s parameter count, allowing us to increase the depth of the backbone network by stacking additional Conformer blocks. To further enhance efficiency, we replace the Shaw relative positional encoding in the Conformer blocks with ALiBi positional bias, which improves the handling of short/long-range dependencies while decreasing computational complexity. Our results show that this streamlined model achieves comparable performance to previous diarization systems utilizing dynamic attractors, suggesting that Generic Speaker Attractors—global static learned attractors—can be as effective as dynamic attractors in representing speakers. Furthermore, we observe that the clustering effect, a key feature of previous EDA-based models, is preserved in our approach. These findings suggest that the EDA mechanism may not be necessary for high-quality speaker diarization, and that a more straightforward architecture can yield competitive results.

Clippers 10/15: Vishal Sunder on a Non-autoregressive Model for Joint STT and TTS

Title: A Non-autoregressive Model for Joint STT and TTS

Abstract: In this work, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.

Clippers 10/8: Jingyi Chen on Speech Emotion Cloning

EmoClone: Speech Emotion Cloning
Jingyi Chen

In this paper, we introduce EmoClone, an end-to-end speech-to-speech model that replicates the emotional tone of a reference speech from a short audio sample, reproducing the reference speaker’s exact emotion in new outputs, regardless of content or voice differences. Unlike traditional Emotional Voice Conversion (EVC) models that use emotion text labels to alter the input speech’s emotional state, EmoClone is designed to faithfully clone a broad range of emotional expressions beyond these preset categories, making it ideal for applications requiring precise emotional fidelity, such as personalized voice generation and interactive media. Experimental results show that EmoClone leads to improved performance for content and speaker identity preservation, while achieving a comparable emotion accuracy to SOTA methods.

Clippers 9/24: Amy Chun on Linguistic Age Prediction

Children’s language development is a critical factor in creating engaging and age-appropriate interactions in conversational AI systems. As children grow, their communication evolves in sentence complexity, vocabulary use, and conversational style. However, many current AI-driven systems struggle to dynamically adjust to these developmental changes, especially in interactive environments like the COSI Museum, where engaging, personalized conversations can foster learning and curiosity. In this talk, I will discuss how our research aims to bridge this gap by predicting a child’s age based on linguistic features to create more engaging and age-appropriate interactions.

Clippers 9/17: Discussion of Self-Taught Evaluators paper

Michael White will lead a discussion of Meta’s Self-Taught Evaluators paper.


Self-Taught Evaluators

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

https://arxiv.org/abs/2408.02666

Model-based evaluation is at the heart of successful model development — as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.

Clippers 9/10: Alyssa Allen on GEM Data-to-Text Shared Task

This week in Clippers, I’ll be workshopping a poster that I’m bringing to INLG later this month. Myself, Ash, Yi-Chien, Tomiris, and Mike participated in this Data-to-Text GEM shared task. We were tasked with generating text for triple sets where each triple was of the form Subject | Property | Object. This was done for factual, counterfactual, and fictional triple sets. We experimented with English, Spanish, Chinese, and Russian — ultimately submitted outputs for English and Spanish. I appreciate all feedback on the content and layout of the poster, but (perhaps more importantly) I’d like to know what questions I’ll likely be asked at the conference based on our work.