This presentation outlines the development, challenges, and future plans for a virtual museum tour guide for the COSI Language Pod. Derived from the Virtual Patient project, the guide initially relied on a static question-answering system that required frequent retraining and could answer only a limited set of questions. The transition to a more dynamic, retrieval-augmented generation (RAG) model aims to increase responsiveness, robustness, and resource efficiency while minimizing dependence on costly, corporate AI systems. Key development phases include leveraging open-source, mid-sized LLMs and knowledge distillation techniques to balance robustness and control. Planned enhancements include exploring retrieval methods, adapting models for multilingual interaction, and ensuring safe, confabulation-free outputs. Future steps involve further reducing hallucinations through contrastive and reinforcement learning and exploring adaptations for similar projects. A minimal sketch of the retrieval step appears below.
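The sketch below illustrates the retrieval half of a RAG pipeline under stated assumptions: a small open-source sentence embedder, cosine-similarity retrieval over a toy exhibit corpus, and a prompt that instructs the downstream LLM to answer from the retrieved context. The corpus, model name, and prompt format are placeholders, not the Language Pod's actual system.

```python
# Minimal RAG retrieval sketch (illustrative only; corpus, embedder choice,
# and prompt wording are assumptions, not the project's actual pipeline).
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus of exhibit descriptions the guide could draw on.
corpus = [
    "The Language Pod demonstrates how children acquire speech sounds.",
    "The dinosaur gallery features a full Tyrannosaurus rex skeleton cast.",
    "The planetarium offers daily shows about the solar system.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any open-source embedder works
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity (vectors are normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(question: str) -> str:
    """Prepend retrieved passages so a mid-sized open LLM answers from them."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("What can I see in the Language Pod?"))
```

Grounding the answer in retrieved passages, rather than the model's parametric memory, is what lets the guide stay current without retraining and helps limit confabulation.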
Month: October 2024
Clippers 10/22: David Palzer on End-to-End Neural Diarization
Title: Simplifying End-to-End Neural Diarization: Generic Speaker Attractors are Enough
Abstract: In this work, we propose a simplified approach to neural speaker diarization by removing the Encoder-Decoder Attractor (EDA) mechanism and replacing it with a linear layer. This modification significantly reduces the model’s parameter count, allowing us to increase the depth of the backbone network by stacking additional Conformer blocks. To further enhance efficiency, we replace the Shaw relative positional encoding in the Conformer blocks with ALiBi positional bias, which improves the handling of short/long-range dependencies while decreasing computational complexity. Our results show that this streamlined model achieves comparable performance to previous diarization systems utilizing dynamic attractors, suggesting that Generic Speaker Attractors—global static learned attractors—can be as effective as dynamic attractors in representing speakers. Furthermore, we observe that the clustering effect, a key feature of previous EDA-based models, is preserved in our approach. These findings suggest that the EDA mechanism may not be necessary for high-quality speaker diarization, and that a more straightforward architecture can yield competitive results.
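The core idea can be pictured with a short sketch (module names, dimensions, and the projection layer are illustrative assumptions, not the paper's code): instead of generating attractors dynamically with an EDA module, a fixed set of globally learned attractor vectors is compared against the Conformer frame embeddings to yield per-speaker activity probabilities.

```python
# Sketch of a diarization head with Generic Speaker Attractors:
# global, statically learned attractor vectors replacing the EDA mechanism.
# Names and shapes are illustrative, not taken from the paper's implementation.
import torch
import torch.nn as nn

class GenericAttractorHead(nn.Module):
    def __init__(self, d_model: int = 256, max_speakers: int = 4):
        super().__init__()
        # One learned attractor per potential speaker, shared across recordings.
        self.attractors = nn.Parameter(torch.randn(max_speakers, d_model) * 0.02)
        # A simple linear projection stands in for the removed EDA machinery.
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        """frame_emb: (batch, time, d_model) from the Conformer backbone.
        Returns per-frame speaker activities of shape (batch, time, max_speakers)."""
        logits = self.proj(frame_emb) @ self.attractors.t()
        return torch.sigmoid(logits)  # multi-label: overlapping speech allowed

# Example: 2 recordings, 500 frames, 256-dim embeddings.
head = GenericAttractorHead()
activity = head(torch.randn(2, 500, 256))
print(activity.shape)  # torch.Size([2, 500, 4])
```

Because the attractors are ordinary parameters rather than the output of an encoder-decoder, the parameter budget freed up can be spent on a deeper Conformer backbone, as described in the abstract.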
Clippers 10/15: Vishal Sunder on a Non-autoregressive Model for Joint STT and TTS
Title: A Non-autoregressive Model for Joint STT and TTS
Abstract: In this work, we take a step towards jointly modeling speech-to-text recognition (STT) and text-to-speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. Owing to its multimodal nature, the proposed model can also be trained with unpaired speech or text data. We further propose an iterative refinement strategy in which the partial hypothesis at the output is fed back to the input of the model, iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline on all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics. A sketch of the refinement loop follows.
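The refinement loop can be sketched as follows. The `JointModel` stand-in, its shared frame-level alignment of speech and text, and the blank-token initialization are hypothetical simplifications used only to show how partial hypotheses re-enter the input of a non-autoregressive model.

```python
# Sketch of iterative refinement for a joint, non-autoregressive STT/TTS model.
# `JointModel` and its input/output conventions are hypothetical placeholders.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Stand-in for a multimodal model mapping (speech, text) -> (speech, text).
    For simplicity, speech and text share a frame-level alignment here."""
    def __init__(self, d: int = 256, vocab: int = 1000, n_mels: int = 80):
        super().__init__()
        self.speech_in = nn.Linear(n_mels, d)
        self.text_in = nn.Embedding(vocab, d)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=2
        )
        self.text_out = nn.Linear(d, vocab)     # STT head
        self.speech_out = nn.Linear(d, n_mels)  # TTS head

    def forward(self, speech, text):
        h = self.body(self.speech_in(speech) + self.text_in(text))
        return self.speech_out(h), self.text_out(h)

def refine(model, speech, n_iters: int = 3):
    """Start from an empty text hypothesis and feed predictions back each round."""
    text_hyp = torch.zeros(speech.shape[:2], dtype=torch.long)  # blank tokens
    for _ in range(n_iters):
        speech_hyp, text_logits = model(speech, text_hyp)
        text_hyp = text_logits.argmax(dim=-1)  # partial hypothesis re-enters the input
    return speech_hyp, text_hyp

model = JointModel()
speech = torch.randn(2, 120, 80)  # (batch, frames, mel bins)
mel_out, tokens_out = refine(model, speech)
print(mel_out.shape, tokens_out.shape)
```

Each pass conditions the non-autoregressive predictions on the previous round's output, which is what allows both the STT and TTS hypotheses to improve without autoregressive decoding.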
Clippers 10/8: Jingyi Chen on Speech Emotion Cloning
Title: EmoClone: Speech Emotion Cloning
Jingyi Chen
Abstract: In this paper, we introduce EmoClone, an end-to-end speech-to-speech model that replicates the emotional tone of a reference speech from a short audio sample, reproducing the reference speaker’s exact emotion in new outputs regardless of content or voice differences. Unlike traditional Emotional Voice Conversion (EVC) models that use emotion text labels to alter the input speech’s emotional state, EmoClone is designed to faithfully clone a broad range of emotional expressions beyond these preset categories, making it well suited to applications that require precise emotional fidelity, such as personalized voice generation and interactive media. Experimental results show that EmoClone improves content and speaker identity preservation while achieving emotion accuracy comparable to SOTA methods.
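At a high level, reference-based emotion cloning can be pictured as conditioning a speech decoder on an emotion embedding extracted from the reference clip. The sketch below uses hypothetical module names and shapes and is not EmoClone's actual architecture; it only illustrates the conditioning pattern.

```python
# Illustrative conditioning scheme for reference-based emotion cloning.
# All module names and shapes are hypothetical; this is not EmoClone's code.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Maps reference mel frames to a single fixed-size emotion embedding."""
    def __init__(self, n_mels: int = 80, d: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d, batch_first=True)

    def forward(self, ref_mel):
        _, h = self.rnn(ref_mel)   # final hidden state: (1, batch, d)
        return h.squeeze(0)        # (batch, d)

class ConditionedDecoder(nn.Module):
    """Generates output mel frames from content features plus the emotion embedding."""
    def __init__(self, d_content: int = 256, d_emo: int = 128, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_content + d_emo, 256), nn.ReLU(), nn.Linear(256, n_mels)
        )

    def forward(self, content, emo):
        emo = emo.unsqueeze(1).expand(-1, content.size(1), -1)  # broadcast over time
        return self.net(torch.cat([content, emo], dim=-1))

enc, dec = EmotionEncoder(), ConditionedDecoder()
ref_mel = torch.randn(2, 200, 80)   # short reference clip carrying the target emotion
content = torch.randn(2, 300, 256)  # content/speaker features of the source utterance
out_mel = dec(content, enc(ref_mel))
print(out_mel.shape)                # torch.Size([2, 300, 80])
```

Because the emotion signal comes from the reference audio itself rather than a discrete label, this kind of conditioning is not restricted to a preset set of emotion categories, which is the contrast with label-driven EVC drawn in the abstract.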