Clippers 2/4: Sam Stevens on DSPy, compiling prompts, and similar work

As language models continue to evolve, the complexity of prompt engineering has grown in parallel. My talk examines the fundamental insights of DSPy through the lens of plib, a minimalist implementation that highlights key principles often overlooked in current LLM research. I argue that automated few-shot example selection can match or exceed carefully crafted zero-shot prompts, challenging the conventional wisdom of prompt engineering. The framework introduces a novel perspective on compute scaling in language models, suggesting “prompt compilation” as a fourth axis alongside pre-training, post-training, and inference-time computation. By treating prompt optimization as a reinforcement learning problem with verifiable rewards, plib offers a systematic approach to example selection. I argue that this style of thinking enables the decomposition of complex language tasks into modular sub-programs, a capability that proves challenging with traditional prompting methods. I will illustrate how many contemporary developments in LLM applications are natural extensions of principles already present in DSPy’s design, arguing for a renewed examination of these foundational ideas in the context of modern language model development.
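To make the “prompt compilation” framing concrete, here is a minimal sketch of few-shot example selection driven by a verifiable reward. This is not plib’s or DSPy’s actual API; all names (compile_prompt, call_model, and so on) and the random-search strategy are illustrative assumptions.

```python
import random

def compile_prompt(devset, demo_pool, call_model, n_trials=20, k=4, seed=0):
    """Search for a k-shot demonstration set that maximizes a verifiable
    reward (here, exact match) on a small dev set.

    devset:     list of (question, gold_answer) pairs with known answers
    demo_pool:  list of (question, answer) pairs usable as demonstrations
    call_model: fn(prompt: str) -> str wrapping whatever LLM is being compiled
    """
    rng = random.Random(seed)

    def build_prompt(demos, question):
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
        return f"{shots}\n\nQ: {question}\nA:"

    def reward(demos):
        # Verifiable reward: fraction of dev questions answered exactly right.
        hits = sum(
            call_model(build_prompt(demos, q)).strip() == gold.strip()
            for q, gold in devset)
        return hits / len(devset)

    best_demos, best_score = [], reward([])  # zero-shot baseline
    for _ in range(n_trials):
        demos = rng.sample(demo_pool, k)
        score = reward(demos)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score
```

The point of the sketch is that the demonstrations, not the model weights, are the object being optimized, which is what makes “compilation” a distinct axis from pre-training, post-training, and inference-time compute.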

Clippers 1/28: Cory Shain (Stanford) on language in the functional connectome of brains and models

Title: Language in the functional connectome of brains and models

Speaker: Cory Shain, Stanford Linguistics

Abstract: AI has turned into a complex systems science, much like neuroscience has always been. And increasingly, precision functional connectivity techniques in neuroscience are revealing that despite the daunting complexity of the human brain, there are natural “cuts” in the system, not just in terms of physiology, but in terms of cognitive function. In this talk, I will present recent work in the lab showing that one of those cuts is language. I will show evidence from an ongoing large-scale neuroimaging study (1200 participants) that an unsupervised technique for parcellating each participant’s brain into networks reliably discovers a frontotemporal network of interconnected regions that is highly selective for language in that individual. This network is both closely adjacent to multiple functionally distinct networks within individuals and “loosely tethered” (Vázquez-Rodríguez et al., 2019) to anatomy. I will further show that, within the network, three putatively distinct linguistic processes (lexical semantics, syntax, and combinatorial semantics) distribute broadly, rather than localizing to different hubs. Together with a growing body of other research, these results suggest that language is “nearly decomposable” (Simon, 1962) as an integrated network in the brain. I will sketch how the lab is now pursuing the implications of this insight for neuroscience, its possible translations to neurosurgery and neural engineering, and its potential relevance to AI theory and practice.

Clippers 1/21: Vishal Sunder on Advancing End-to-End Speech AI with Knowledge Transfer

Title: Advancing End-to-End Speech AI with Knowledge Transfer

Abstract:

My thesis explores end-to-end (E2E) approaches to improve speech AI by addressing limitations of cascaded systems, such as ASR error propagation and large, misaligned models. The thesis focuses on three key tasks: speech understanding, speech assessment, and joint speech recognition and synthesis, leveraging knowledge transfer (KT) from auxiliary sources like large language models (LLMs), dialog history, and related tasks.

For speech understanding, E2E models integrate semantic knowledge from LLMs for tasks like intent extraction and slot filling using tokenwise contrastive pretraining (TCP). This approach is extended to the RNN transducer (RNN-T) model to enhance ASR and spoken language understanding (SLU). Differentiable cascading of ASR and SLU incorporates intermediate non-autoregressive objectives, improving intent classification and slot filling across datasets. Additionally, dialog history is incorporated through hierarchical and conformer-based conversation models, enhancing dialog act classification.
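As a rough illustration of what a token-wise contrastive objective can look like (a sketch under my own assumptions, not the thesis implementation), the snippet below aligns speech-side token representations with the LLM embeddings of the corresponding text tokens using a symmetric InfoNCE loss.

```python
import torch
import torch.nn.functional as F

def tokenwise_contrastive_loss(speech_tokens, text_tokens, temperature=0.07):
    """Contrastive alignment of speech and text token representations.

    speech_tokens: (N, D) aligned token representations from the speech encoder
    text_tokens:   (N, D) embeddings of the corresponding text tokens from the LLM
    """
    speech = F.normalize(speech_tokens, dim=-1)
    text = F.normalize(text_tokens, dim=-1)
    logits = speech @ text.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(speech.size(0), device=speech.device)
    # Symmetric InfoNCE: each speech token should match its own text token
    # (and vice versa), with the other tokens in the batch as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```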

In speech assessment, two sub-problems are addressed: E2E disfluency detection/classification and real-time reading tracking for children. A hierarchical detection-classification (HiDeC) method mitigates class imbalance, while pointer-network models, trained on ASR alignment maps, track reading positions effectively.

For joint speech recognition and synthesis, a non-autoregressive multimodal framework processes speech and text inputs, independently or combined, and trains on unpaired datasets. Iterative refinement enhances performance, achieving competitive results in STT and TTS tasks.

These contributions advance robust E2E systems that are compact and resilient to ASR errors, bypassing cascaded approaches for efficient and effective speech AI.

Clippers 11/26: Sara Court on the Limitations of LLMs for Low-Resource Translation

Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
https://arxiv.org/pdf/2406.15625

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world’s 7,000+ languages and their speakers.
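For readers unfamiliar with this kind of pipeline, the sketch below shows one plausible way to assemble a translation prompt from the three context types the paper ablates; the actual prompt format and retrieval code used in the paper may differ.

```python
def build_translation_prompt(source_sentence, morpheme_glosses, grammar_notes,
                             parallel_examples):
    """Assemble a translation prompt from retrieved context (illustrative only).

    morpheme_glosses:  list of (Quechua morpheme, Spanish gloss) pairs
    grammar_notes:     list of short grammar descriptions (strings)
    parallel_examples: list of (Quechua sentence, Spanish translation) pairs
    """
    parts = ["Translate the following sentence from Southern Quechua to Spanish."]
    if morpheme_glosses:
        parts.append("Dictionary entries:")
        parts += [f"- {m}: {g}" for m, g in morpheme_glosses]
    if grammar_notes:
        parts.append("Grammar notes:")
        parts += [f"- {note}" for note in grammar_notes]
    if parallel_examples:
        parts.append("Example translations:")
        parts += [f"- {src} -> {tgt}" for src, tgt in parallel_examples]
    parts.append(f"Sentence: {source_sentence}")
    parts.append("Spanish translation:")
    return "\n".join(parts)
```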

Clippers 11/19: Yi-Chien Lin on Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Title: Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Description:

There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive ‘quality-power’ effect whereby lower language model (LM) perplexity yields better psychometric predictors, which favors larger models. Recent experiments using these models’ negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models, but other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have shown improved fit to reading times as model perplexity decreases, favoring larger models again. However, these studies using model vectors introduce a potential confound in that they simultaneously vary the number of predictors, which increases the number of degrees of freedom of the model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect. Work presented in this talk is ongoing.
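To see why the number of predictors matters, note that a d-dimensional LM vector gives the regression d predictors and d extra degrees of freedom, whereas surprisal contributes only one; in-sample fit can therefore improve simply because more predictors are available. The toy sketch below (my own illustration with numpy and scikit-learn, not the talk’s analysis pipeline) compares fits with the full vectors, with the vectors compressed to a fixed number of components, and with surprisal alone.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def matched_predictor_fit(lm_vectors, surprisal, reading_times, k=1):
    """Compare regression fit (in-sample R^2) under matched predictor counts.

    lm_vectors:    (n_words, d) array of LM hidden states
    surprisal:     (n_words,) array of negative log word probabilities
    reading_times: (n_words,) array of psychometric measurements
    """
    # Full LM vectors: d predictors, d degrees of freedom.
    full_r2 = LinearRegression().fit(lm_vectors, reading_times).score(
        lm_vectors, reading_times)
    # Vectors compressed to k predictors, matching a small predictor budget.
    reduced = PCA(n_components=k).fit_transform(lm_vectors)
    reduced_r2 = LinearRegression().fit(reduced, reading_times).score(
        reduced, reading_times)
    # Surprisal alone: a single predictor.
    surp = surprisal.reshape(-1, 1)
    surp_r2 = LinearRegression().fit(surp, reading_times).score(
        surp, reading_times)
    return {"full_vectors": full_r2, f"pca_{k}": reduced_r2, "surprisal": surp_r2}
```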

Clippers 11/12: Tomiris Kaumenova on a synthetic dataset for developing a colonoscopy prep virtual assistant

In Clippers this week, I will dry run my QP1 presentation. I will discuss our approach to constructing a synthetic dataset for developing a virtual assistant for colonoscopy preparation. The focus is on generating factually accurate but diverse dialogues between an AI Coach and a patient through prompt engineering with Llama 3.1 70B. In terms of factuality, I analyze errors in AI Coach responses across different prompt strategies: no few-shot, few-shot, and few-shot with chain-of-thought. For diversity, I compare theme-specific patient prompts with a “baseline” prompt using both diversity metrics and manual evaluation. I would appreciate feedback on the structure and format of my presentation, as well as any questions that might help me prepare for a broader audience with backgrounds other than CL.
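As a sketch of how the three prompting conditions might differ (an illustrative template only; the actual prompts used with Llama 3.1 70B are not reproduced here):

```python
def build_coach_prompt(patient_turn, guideline_facts, strategy="few_shot_cot",
                       examples=None):
    """Build an AI Coach prompt under one of three strategies:
    "no_few_shot", "few_shot", or "few_shot_cot".

    guideline_facts: list of colonoscopy-prep facts the reply must respect
    examples:        list of (patient_turn, coach_reply) demonstration pairs
    """
    examples = examples or []
    parts = ["You are an AI Coach helping a patient prepare for a colonoscopy.",
             "Answer using only the facts below.",
             "Facts:"]
    parts += [f"- {fact}" for fact in guideline_facts]
    if strategy in ("few_shot", "few_shot_cot"):
        parts += [f"Patient: {p}\nCoach: {c}" for p, c in examples]
    if strategy == "few_shot_cot":
        parts.append("Before answering, reason step by step about which facts "
                     "apply, then give the final reply after 'Coach:'.")
    parts.append(f"Patient: {patient_turn}\nCoach:")
    return "\n".join(parts)
```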

Clippers 10/29: Ash Lewis on the COSI Virtual Museum Tour Guide

This presentation outlines the development, challenges, and future plans for a virtual museum tour guide for the COSI Language Pod. Originally derived from the Virtual Patient project, the guide initially relied on a static question-answering system that required frequent retraining and could answer only a limited set of questions. The transition to a more dynamic, retrieval-augmented generation (RAG) model aims to increase responsiveness, robustness, and resource efficiency, with minimal dependency on costly, corporate AI systems. Key development phases include leveraging open-source, mid-sized LLMs and knowledge distillation techniques to balance robustness and control. Further enhancements include exploring retrieval methods, adapting models for multilingual interactions, and ensuring safe, confabulation-free outputs. Future steps involve reducing hallucinations further through contrastive and reinforcement learning and exploring potential adaptations for similar projects.
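For context, the core of a retrieval-augmented pipeline like the one described above can be sketched in a few lines; the embedding and generation functions are deliberately left abstract, since the project’s concrete model choices are not specified here.

```python
import numpy as np

def answer_with_rag(question, documents, embed, generate, top_k=3):
    """Minimal retrieval-augmented generation loop (a sketch, not the
    project's actual pipeline).

    documents: list of text passages about COSI exhibits
    embed:     fn(list[str]) -> np.ndarray of shape (n, d), any text encoder
    generate:  fn(prompt: str) -> str, any open-source LLM wrapper
    """
    doc_vecs = embed(documents)
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every passage.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top = np.argsort(-sims)[:top_k]
    context = "\n\n".join(documents[i] for i in top)
    prompt = ("You are a tour guide for the COSI science museum. Answer using "
              "only the passages below; if they do not contain the answer, "
              "say so.\n\n"
              f"{context}\n\nVisitor question: {question}\nAnswer:")
    return generate(prompt)
```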

Clippers 10/22: David Palzer on End-to-End Neural Diarization

Title: Simplifying End-to-End Neural Diarization: Generic Speaker Attractors are Enough

Abstract: In this work, we propose a simplified approach to neural speaker diarization by removing the Encoder-Decoder Attractor (EDA) mechanism and replacing it with a linear layer. This modification significantly reduces the model’s parameter count, allowing us to increase the depth of the backbone network by stacking additional Conformer blocks. To further enhance efficiency, we replace the Shaw relative positional encoding in the Conformer blocks with ALiBi positional bias, which improves the handling of short- and long-range dependencies while decreasing computational complexity. Our results show that this streamlined model achieves comparable performance to previous diarization systems utilizing dynamic attractors, suggesting that Generic Speaker Attractors (global, static, learned attractors) can be as effective as dynamic attractors in representing speakers. Furthermore, we observe that the clustering effect, a key feature of previous EDA-based models, is preserved in our approach. These findings suggest that the EDA mechanism may not be necessary for high-quality speaker diarization, and that a more straightforward architecture can yield competitive results.
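For reference, the ALiBi bias that replaces Shaw relative positional encoding can be computed as below (the standard ALiBi recipe, here with symmetric distances for a bidirectional encoder; the diarization model’s exact variant may differ).

```python
import torch

def alibi_bias(num_heads, seq_len):
    """Return a (num_heads, seq_len, seq_len) bias added to attention logits
    before the softmax; no learned positional parameters are needed.
    """
    # Head-specific slopes: a geometric sequence starting at 2^(-8/num_heads).
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    # Absolute distance between query position i and key position j.
    distance = (positions[None, :] - positions[:, None]).abs().float()
    # Larger distance -> more negative bias -> less attention.
    return -slopes[:, None, None] * distance[None, :, :]
```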

Clippers 10/15: Vishal Sunder on a Non-autoregressive Model for Joint STT and TTS

Title: A Non-autoregressive Model for Joint STT and TTS

Abstract: In this work, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
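A heavily simplified sketch of the iterative refinement loop (the interface is hypothetical and only illustrates the feedback idea, not the actual model):

```python
def iterative_refinement(model, speech=None, text=None, n_iters=3):
    """Run a joint STT/TTS model, feeding its own hypotheses back as input.

    model: callable taking (speech, text), either of which may be None, and
           returning (text_hypothesis, speech_hypothesis)
    """
    # Initial joint pass from whichever inputs are available.
    text_hyp, speech_hyp = model(speech, text)
    for _ in range(n_iters):
        # Feed the current hypotheses back in so the next pass can refine
        # both the STT and the TTS prediction jointly.
        text_hyp, speech_hyp = model(speech_hyp, text_hyp)
    return text_hyp, speech_hyp
```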

Clippers 10/8: Jingyi Chen on Speech Emotion Cloning

EmoClone: Speech Emotion Cloning
Jingyi Chen

In this paper, we introduce EmoClone, an end-to-end speech-to-speech model that replicates the emotional tone of a reference speech from a short audio sample, reproducing the reference speaker’s exact emotion in new outputs, regardless of content or voice differences. Unlike traditional Emotional Voice Conversion (EVC) models that use emotion text labels to alter the input speech’s emotional state, EmoClone is designed to faithfully clone a broad range of emotional expressions beyond these preset categories, making it ideal for applications requiring precise emotional fidelity, such as personalized voice generation and interactive media. Experimental results show that EmoClone leads to improved performance for content and speaker identity preservation, while achieving emotion accuracy comparable to SOTA methods.