Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT into speech encoder neural networks. This work takes a step toward performing such distillation in a more efficient and fine-grained manner, aligning speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder, such that these can be directly compared and aligned with BERT-based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition directly from speech yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment, especially when speech is noisy, giving an absolute improvement of up to 8% over previous results.
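For intuition, here is a minimal PyTorch sketch of how token-level cross-modal attention and a tokenwise contrastive loss could be combined. It is not the paper's implementation: the use of dot-product attention with BERT tokens as queries, in-utterance negatives, and the temperature value are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def token_alignment_loss(speech_frames, bert_tokens, temperature=0.07):
    """Hypothetical sketch of tokenwise contrastive alignment.

    speech_frames: (T, d) frame-level outputs of the speech encoder
    bert_tokens:   (N, d) BERT contextual embeddings for the N tokens
    """
    # Cross-modal attention: each BERT token attends over the speech frames
    # to pool a token-level contextual embedding from the speech side.
    attn_logits = bert_tokens @ speech_frames.T / speech_frames.shape[-1] ** 0.5  # (N, T)
    speech_tokens = attn_logits.softmax(dim=-1) @ speech_frames                   # (N, d)

    # Tokenwise contrastive (InfoNCE-style) loss: the i-th pooled speech
    # embedding should match the i-th BERT embedding, with the other tokens
    # of the utterance acting as negatives.
    speech_tokens = F.normalize(speech_tokens, dim=-1)
    bert_tokens = F.normalize(bert_tokens, dim=-1)
    logits = speech_tokens @ bert_tokens.T / temperature                           # (N, N)
    targets = torch.arange(logits.shape[0])
    return F.cross_entropy(logits, targets)
```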
Month: March 2022
Clippers 3/22: Lingbo Mo on complex question answering
Complex question answering (CQA) requires multi-hop reasoning to combine multiple pieces of evidence, ideally from different knowledge sources. Given the limited labeled data in any single knowledge source and the cost of human annotation, we study knowledge transfer for CQA between heterogeneous sources, namely a text corpus and a knowledge base (KB). To facilitate knowledge transfer between sources, we first propose a unified framework, SimultQA, which bridges KBQA and TextQA systems and can leverage supervision from both sources. Through experiments on CWQ and HotpotQA, two popular datasets originally designed for KBQA and TextQA respectively, we explore how knowledge is transferred between sources under the pre-training and fine-tuning paradigm, and find that knowledge transfer between heterogeneous sources consistently improves QA performance. We also conduct fine-grained analysis and hybrid evaluation experiments to further explain what knowledge has been transferred.
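The abstract does not spell out SimultQA's internals, but the pre-training and fine-tuning transfer setup itself can be sketched. The snippet below is only an assumed illustration with a generic seq2seq model (T5) and hypothetical data variables; the actual framework for unifying KBQA and TextQA supervision is more involved.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical sketch: "transfer" = pre-train on one QA source, then
# fine-tune the same weights on the other source.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(question, answer):
    # Examples from either source are linearized into (question, answer)
    # text pairs so a single seq2seq model can consume both.
    inputs = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Stage 1: pre-train on the source dataset (e.g., KB-derived QA pairs).
#   for q, a in kbqa_pairs: train_step(q, a)        # kbqa_pairs: hypothetical
# Stage 2: fine-tune on the target dataset (e.g., text-based QA pairs).
#   for q, a in textqa_pairs: train_step(q, a)      # textqa_pairs: hypothetical
```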
Clippers 3/8: Ash Lewis and Ron Chen on the AlexaPrize Taskbot Challenge
On Tuesday, Ron Chen and I will discuss our ongoing work on the AlexaPrize Taskbot Challenge. The competition, currently in its semi-finals stage, involves 9 teams developing taskbots that assist an Alexa user in working through a recipe or DIY task in a step-by-step, engaging manner. We will demonstrate our taskbot and outline our efforts on topics including dialogue management, response generation, question answering, and user engagement. We hope to solicit feedback both on technical aspects of the work and on ways in which the bot can be made more engaging and intuitive for users.
Clippers 3/1: Byung-Doh Oh on Analyzing the predictive power of neural LM surprisal
This work presents an in-depth analysis of an observation that contradicts the findings of recent work in computational psycholinguistics, namely that smaller GPT-2 models that show higher test perplexity nonetheless generate surprisal estimates that are more predictive of human reading times. Analysis of the surprisal values shows that rare proper nouns, which are typically tokenized into multiple subword tokens, are systematically assigned lower surprisal values by the larger GPT-2 models. A comparison of residual errors from regression models fit to reading times reveals that regression models with surprisal predictors from smaller GPT-2 models have significantly lower mean absolute errors on words that are tokenized into multiple tokens, while this trend is not observed on words that are kept intact. These results indicate that the ability of larger GPT-2 models to predict internal pieces of rare words more accurately makes their surprisal estimates deviate from humanlike expectations that manifest in self-paced reading times and eye-gaze durations.
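To make the tokenization point concrete, the following sketch shows how a word's surprisal is typically obtained when GPT-2 splits it into several subword pieces: the word's surprisal is the sum of its subword surprisals. This is not the paper's code, and the example word and context are made up for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisal(context, word):
    """Surprisal (in bits) of `word` given `context`, summed over its subwords."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, word_ids], dim=1)
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(dim=-1)
    total = 0.0
    # The token at position t is predicted from positions < t,
    # i.e., from the logits at position t - 1.
    for t in range(ctx_ids.shape[1], ids.shape[1]):
        total += -logprobs[0, t - 1, ids[0, t]].item()
    return total / torch.log(torch.tensor(2.0)).item()   # nats -> bits

# A rare proper noun is usually split into several subword pieces, so its
# surprisal aggregates over multiple subword predictions.
print(tokenizer.tokenize(" Oberlin"), word_surprisal("She studied at", "Oberlin"))
```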