Title: Advancing End-to-End Speech AI with Knowledge Transfer
Abstract:
My thesis explores end-to-end (E2E) approaches that improve speech AI by addressing limitations of cascaded systems, such as the propagation of automatic speech recognition (ASR) errors and the size and misalignment of independently trained component models. The thesis focuses on three key tasks: speech understanding, speech assessment, and joint speech recognition and synthesis, leveraging knowledge transfer (KT) from auxiliary sources such as large language models (LLMs), dialog history, and related tasks.
For speech understanding, E2E models integrate semantic knowledge from LLMs for tasks such as intent extraction and slot filling via tokenwise contrastive pretraining (TCP). This approach is then extended to the RNN transducer (RNN-T) model to enhance both ASR and spoken language understanding (SLU). A differentiable cascade of ASR and SLU incorporates intermediate non-autoregressive objectives, improving intent classification and slot filling across multiple datasets. Additionally, dialog history is incorporated through hierarchical and Conformer-based conversation models, improving dialog act classification.
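To make the TCP idea concrete, the following is a minimal sketch of a tokenwise contrastive (InfoNCE-style) objective: each speech-side token embedding is pulled toward the LLM embedding of the same token and pushed away from the other tokens in the utterance. All names, dimensions, and the toy embeddings are illustrative assumptions, not the thesis implementation.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tokenwise_contrastive_loss(speech, text, temperature=0.1):
    """InfoNCE-style loss: speech token i should match text token i,
    contrasted against every other text token in the utterance."""
    total = 0.0
    for i, s in enumerate(speech):
        logits = [cos(s, t) / temperature for t in text]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # negative log-softmax of the positive
    return total / len(speech)

# Toy check: speech embeddings nearly aligned with the LLM embeddings
# should score a lower loss than unrelated random embeddings.
random.seed(0)
T, D = 6, 8
text = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in text]
unrelated = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
loss_aligned = tokenwise_contrastive_loss(aligned, text)
loss_random = tokenwise_contrastive_loss(unrelated, text)
```

In practice the speech encoder is trained so that this loss falls, distilling the LLM's token-level semantic space into the acoustic representation before fine-tuning on SLU targets.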
In speech assessment, two sub-problems are addressed: E2E disfluency detection and classification, and real-time reading tracking for children. A hierarchical detection-classification (HiDeC) method mitigates class imbalance, while pointer-network models, trained on alignment maps derived from ASR, track reading positions effectively.
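The pointer-network view of reading tracking can be sketched as attention over the positions of the passage text: an acoustic query is scored against every word position, and the softmax over scores is a distribution over where the reader currently is. The one-hot embeddings and query below are toy assumptions for illustration only.

```python
import math

def pointer_distribution(query, positions, temperature=1.0):
    """Attention-style pointer: score each text position against the
    acoustic query, then softmax the scores into a distribution
    over reading positions."""
    scores = [sum(q * p for q, p in zip(query, pos)) / temperature
              for pos in positions]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy passage of three words with one-hot position embeddings; the
# acoustic query most resembles word 2, so the pointer should land there.
passage = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
query = [0.1, 0.2, 0.9]
dist = pointer_distribution(query, passage)
predicted = max(range(len(dist)), key=dist.__getitem__)
```

Training such a pointer on ASR alignment maps supervises the distribution toward the aligned word index at each time step, so tracking degrades gracefully when the child skips or repeats words.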
For joint speech recognition and synthesis, a non-autoregressive multimodal framework processes speech and text inputs, either independently or combined, and can be trained on unpaired datasets. Iterative refinement further improves performance, achieving competitive results on speech-to-text (STT) and text-to-speech (TTS) tasks.
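The iterative-refinement step can be illustrated with a mask-predict-style control loop, standard for non-autoregressive decoders: all positions are predicted in parallel, the least confident ones are re-masked, and prediction repeats with a shrinking mask budget. The toy predictor below, which simply knows the target phrase and grows more confident as context fills in, is an assumption for demonstration, not the thesis model.

```python
def mask_predict(length, predict_fn, iterations=3, mask="<mask>"):
    """Mask-predict-style iterative refinement for a non-autoregressive
    decoder: predict every position in parallel, then re-mask the least
    confident positions and predict again with more context."""
    tokens = [mask] * length
    for it in range(iterations, 0, -1):
        preds, confs = predict_fn(tokens)           # parallel prediction
        tokens = list(preds)
        n_mask = (length * (it - 1)) // iterations  # linearly shrinking mask budget
        for i in sorted(range(length), key=lambda i: confs[i])[:n_mask]:
            tokens[i] = mask
    return tokens

# Toy predictor: returns the target phrase, with confidence that rises
# as more of the sequence is already unmasked (i.e., more context).
target = "end to end speech".split()

def toy_predict(tokens):
    ctx = sum(t != "<mask>" for t in tokens) / len(tokens)
    confs = [0.5 + 0.5 * ctx if t == "<mask>" else 1.0 for t in tokens]
    return list(target), confs

out = mask_predict(len(target), toy_predict, iterations=3)
```

In a real system `predict_fn` is the shared multimodal decoder, so the same refinement loop serves both STT (refining text given speech) and TTS (refining acoustic tokens given text).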
These contributions advance robust E2E systems that are compact and resilient to ASR errors, replacing cascaded pipelines with efficient and effective speech AI.