Posts

Clippers 1/21: Vishal Sunder on Advancing End-to-End Speech AI with Knowledge Transfer

Title: Advancing End-to-End Speech AI with Knowledge Transfer

Abstract:

My thesis explores end-to-end (E2E) approaches to improve speech AI by addressing limitations of cascaded systems, such as ASR error propagation and large, misaligned models. The thesis focuses on three key tasks: speech understanding, speech assessment, and joint speech recognition and synthesis, leveraging knowledge transfer (KT) from auxiliary sources like large language models (LLMs), dialog history, and related tasks.

For speech understanding, E2E models integrate semantic knowledge from LLMs for tasks like intent extraction and slot filling using tokenwise contrastive pretraining (TCP). This approach is extended to the RNN transducer (RNN-T) model to enhance ASR and spoken language understanding (SLU). Differentiable cascading of ASR and SLU incorporates intermediate non-autoregressive objectives, improving intent classification and slot filling across datasets. Additionally, dialog history is incorporated through hierarchical and conformer-based conversation models, enhancing dialog act classification.

In speech assessment, two sub-problems are addressed: E2E disfluency detection/classification and real-time reading tracking for children. A hierarchical detection-classification (HiDeC) method mitigates class imbalance, while pointer-network models, trained on ASR alignment maps, track reading positions effectively.

For joint speech recognition and synthesis, a non-autoregressive multimodal framework processes speech and text inputs, independently or combined, and trains on unpaired datasets. Iterative refinement enhances performance, achieving competitive results in STT and TTS tasks.

These contributions advance robust E2E systems that are compact and resilient to ASR errors, bypassing cascaded approaches for efficient and effective speech AI.

Clippers 1/14: Christian Clark on Linear Recency Bias and Transformers’ Fit to Reading Times

Title: Linear Recency Bias During Training Improves Transformers’ Fit to Reading Times

Abstract:
Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates from a Transformer that includes ALiBi during training and inference show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi’s mixture of slopes—which determine the rate of memory decay in each attention head—may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
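As a rough illustration of the recency bias described above, the following sketch builds ALiBi-style biases in plain Python. It assumes the geometric slope schedule from Press et al. (2022) for a power-of-two number of heads. The function names (`alibi_slopes`, `alibi_bias`, `attention_weights`) and the causal masking are hypothetical choices made for this example, not the paper's code.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes, assuming num_heads is a power of two.

    Head h (1-indexed) gets slope 2^(-8h / num_heads), so each head
    penalizes distance at a different rate.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

def alibi_bias(slope, seq_len):
    """Bias added to attention scores for one head.

    bias[i][j] = -slope * (i - j) for keys at or before query i;
    future positions get -inf (causal mask, an assumption of this sketch).
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, slope):
    """Add the linear recency bias to raw scores, then normalize each row."""
    n = len(scores)
    bias = alibi_bias(slope, n)
    return [softmax([scores[i][j] + bias[i][j] for j in range(n)]) for i in range(n)]

if __name__ == "__main__":
    slopes = alibi_slopes(8)
    uniform = [[0.0] * 4 for _ in range(4)]  # toy scores with no content preference
    for slope in (slopes[0], slopes[-1]):
        print(slope, attention_weights(uniform, slope)[-1])
```

With uniform raw scores, a head with a steep slope concentrates its attention on the most recent tokens, while a head with a shallow slope spreads attention more evenly over the context. This is one way to picture the per-head "rate of memory decay" mentioned in the abstract.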

Clippers 11/26: Sara Court on the Limitations of LLMs for Low-Resource Translation

Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
https://arxiv.org/pdf/2406.15625

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world’s 7,000+ languages and their speakers.

Clippers 11/19: Yi-Chien Lin on Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Title: Controlling for Number of Predictors from Large Language Models to Predict Human Neural and Behavioral Data

Description:

There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive ‘quality-power’ effect of (inverse) language model (LM) perplexity on psychometric predictors, which favors larger models. Recent experiments using these models’ negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models. Other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have nonetheless shown improved fit to reading times as model perplexity decreases, favoring larger models again. However, these studies using model vectors introduce a potential confound: they simultaneously vary the number of predictors, which increases the number of degrees of freedom of the model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect. Work presented in this talk is ongoing.
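For readers unfamiliar with the "negative log word probability" predictor mentioned above, a minimal sketch follows. It assumes base-2 logarithms (surprisal in bits). The function name `surprisal_bits` and the example probabilities are hypothetical and do not come from the cited studies.

```python
import math

def surprisal_bits(prob):
    """Surprisal of a word: the negative base-2 log of its probability in context."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

if __name__ == "__main__":
    # Made-up conditional probabilities for three words in a sentence.
    for word, prob in [("the", 0.5), ("cat", 0.25), ("meowed", 0.125)]:
        print(word, surprisal_bits(prob))
```

This yields a single predictor per word, whereas predicting from LM vectors supplies one predictor per vector dimension. That difference in predictor count is the possible confound the talk examines.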

Clippers 11/12: Tomiris Kaumenova on a synthetic dataset for developing a colonoscopy prep virtual assistant

In Clippers this week, I will dry run my QP1 presentation. I will discuss our approach to constructing a synthetic dataset for developing a virtual assistant for colonoscopy preparation. The focus is on generating factually accurate but diverse dialogues between an AI Coach and a patient through prompt engineering with Llama 3.1 70B. In terms of factuality, I analyze errors in AI Coach responses across different prompt strategies: no few-shot, few-shot, and few-shot with chain-of-thought. For diversity, I compare theme-specific patient prompts with a “baseline” prompt using both diversity metrics and manual evaluation. I would appreciate feedback on the structure and format of my presentation, as well as any questions that might help me prepare for a broader audience with backgrounds other than CL.

Clippers 10/29: Ash Lewis on the COSI Virtual Museum Tour Guide

This presentation outlines the development, challenges, and future plans for a virtual museum tour guide for the COSI Language Pod. Originally derived from the Virtual Patient project, the guide initially relied on a static question-answering system that required frequent retraining and could answer only a limited set of questions. The transition to a more dynamic, retrieval-augmented generation (RAG) model aims to increase responsiveness, robustness, and resource efficiency, with minimal dependency on costly, corporate AI systems. Key development phases include leveraging open-source, mid-sized LLMs and knowledge distillation techniques to balance robustness and control. Key enhancements include exploring retrieval methods, adapting models for multilingual interactions, and ensuring safe, confabulation-free outputs. Future steps involve reducing hallucinations further through contrastive and reinforcement learning and exploring potential adaptations for similar projects.

Clippers 10/22: David Palzer on End-to-End Neural Diarization

Title: Simplifying End-to-End Neural Diarization: Generic Speaker Attractors are Enough

Abstract: In this work, we propose a simplified approach to neural speaker diarization by removing the Encoder-Decoder Attractor (EDA) mechanism and replacing it with a linear layer. This modification significantly reduces the model’s parameter count, allowing us to increase the depth of the backbone network by stacking additional Conformer blocks. To further enhance efficiency, we replace the Shaw relative positional encoding in the Conformer blocks with ALiBi positional bias, which improves the handling of short/long-range dependencies while decreasing computational complexity. Our results show that this streamlined model achieves comparable performance to previous diarization systems utilizing dynamic attractors, suggesting that Generic Speaker Attractors—global static learned attractors—can be as effective as dynamic attractors in representing speakers. Furthermore, we observe that the clustering effect, a key feature of previous EDA-based models, is preserved in our approach. These findings suggest that the EDA mechanism may not be necessary for high-quality speaker diarization, and that a more straightforward architecture can yield competitive results.

Clippers 10/15: Vishal Sunder on a Non-autoregressive Model for Joint STT and TTS

Title: A Non-autoregressive Model for Joint STT and TTS

Abstract: In this work, we take a step towards jointly modeling automatic speech recognition (speech-to-text, STT) and speech synthesis (text-to-speech, TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy in which the partial hypothesis at the output is fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.

Clippers 10/8: Jingyi Chen on Speech Emotion Cloning

EmoClone: Speech Emotion Cloning
Jingyi Chen

In this paper, we introduce EmoClone, an end-to-end speech-to-speech model that replicates the emotional tone of a reference utterance from a short audio sample, reproducing the reference speaker’s exact emotion in new outputs, regardless of content or voice differences. Unlike traditional Emotional Voice Conversion (EVC) models that use emotion text labels to alter the input speech’s emotional state, EmoClone is designed to faithfully clone a broad range of emotional expressions beyond these preset categories, making it well suited to applications requiring precise emotional fidelity, such as personalized voice generation and interactive media. Experimental results show that EmoClone improves content and speaker identity preservation while achieving emotion accuracy comparable to SOTA methods.

Clippers 9/24: Amy Chun on Linguistic Age Prediction

Children’s language development is a critical factor in creating engaging and age-appropriate interactions in conversational AI systems. As children grow, their communication evolves in sentence complexity, vocabulary use, and conversational style. However, many current AI-driven systems struggle to dynamically adjust to these developmental changes, especially in interactive environments like the COSI Museum, where engaging, personalized conversations can foster learning and curiosity. In this talk, I will discuss how our research aims to bridge this gap by predicting a child’s age based on linguistic features to create more engaging and age-appropriate interactions.