Towards a Coreference-aware Measure of Surprisal
This talk will describe ongoing work to model coreference as an incremental process, discussing model design, current results, and open challenges. Coreference is the semantic identity relationship between entities. Humans are able to effortlessly produce and comprehend language that describes coreference relations. While much work has explored coreference from a psycholinguistic angle, extensive modeling efforts have come from a more task-oriented NLP domain that does not seek to model cognitively plausible mechanisms. The current work attempts to bridge the two approaches by modeling coreference as part of an incremental semantic parsing process. Ultimately, the model will be evaluated on parsing performance, coreference performance, and how well its predictions correlate with human processing data.
Learning to disambiguate by combining multiple sense representations
This talk will discuss ongoing work investigating the combination of multiple sense representation methods for word sense disambiguation (WSD). A variety of recent methods have been proposed for learning representations of semantic senses in different domains, and there is some evidence that different methods capture complementary information for WSD. We consider a simple but competitive cosine similarity-based model for WSD, and augment it by learning to produce a context-sensitive linear transformation of representations of candidate senses. In addition to transforming the input sense space, our method allows us to jointly project multiple sense representations into a single space. We find that a single learned projection matches or outperforms directly updated sense embeddings for single embedding methods, and demonstrate that combining multiple representations improves over any individual method alone. Further, by transforming and conjoining complete embedding spaces, we gain the ability to transfer model knowledge to ambiguous terms not seen during training; we are currently investigating the effectiveness of this transfer.
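The scoring step described above can be illustrated with a minimal sketch. It assumes that each candidate sense has one vector per representation method, that these vectors are concatenated, and that a single projection matrix maps the result into the context space before cosine similarity is computed. The function names, sense labels, and toy values are hypothetical, and the matrix here is fixed by hand, whereas the abstract describes a learned, context-sensitive transformation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def project(matrix, vec):
    """Apply a linear map, given as a list of rows, to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def disambiguate(context_vec, candidates, projection):
    """Pick the candidate sense whose projected vector is closest to the context.

    candidates maps a sense label to a list of vectors, one per
    representation method; the vectors are concatenated before projection.
    """
    scores = {}
    for sense, representations in candidates.items():
        joined = [x for rep in representations for x in rep]
        scores[sense] = cosine(context_vec, project(projection, joined))
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): two senses, two 2-d representations per sense.
candidates = {
    "bank%finance": [[1.0, 0.0], [0.8, 0.2]],
    "bank%river": [[0.0, 1.0], [0.1, 0.9]],
}
# Hand-set 2x4 projection that averages the two representations;
# in the approach described above this matrix would be learned.
projection = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
context_vec = [0.9, 0.1]

best_sense, sense_scores = disambiguate(context_vec, candidates, projection)
```

Because the projection acts on the concatenated sense space as a whole rather than on individual sense vectors, the same matrix can be applied to senses of terms that never appeared in training, which is the transfer setting mentioned at the end of the abstract.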
Evaluating state-of-the-art models of speaker commitment
When a speaker, Mary, utters “John did not discover that Bill lied”, we take Mary to be committed to Bill having lied, whereas in “John didn’t say that Bill lied”, we do not take her to be. Extracting such inferences arising from speaker commitment (also known as event factuality) is crucial for information extraction and question answering. In this talk, we evaluate state-of-the-art models for speaker commitment and natural language inference on the CommitmentBank, an English dataset of naturally occurring discourses annotated with speaker commitment towards the content of the complement (“lied” in the example) of clause-embedding verbs (“discover”, “say”) under four entailment-canceling environments (negation, conditional, question, and modal). The CommitmentBank thus focuses on specific linguistic constructions and can be viewed as containing “adversarial” examples for speaker commitment models. We perform a detailed error analysis of the models’ outputs by breaking down items into classes according to various linguistic features. We show that these models can achieve good performance on certain classes of items, but fail to generalize to the diverse linguistic constructions that are present in natural language, highlighting directions for improvement.
Prediction is All You Need: A Large-Scale Study of the Effects of Word Frequency and Predictability in Naturalistic Reading
A number of psycholinguistic studies have factorially manipulated words’ contextual predictabilities and corpus frequencies and shown separable effects of each on measures of human sentence processing, a pattern which has been used to support distinct processing effects of prediction on the one hand and strength of memory representation on the other. This paper examines the generalizability of this finding to more realistic conditions of sentence processing by studying effects of frequency and predictability in three large-scale naturalistic reading corpora. Results show significant effects of word frequency and predictability in isolation but no effect of frequency over and above predictability, and thus do not provide evidence of distinct effects. The non-replication of separable effects in a naturalistic setting raises doubts about the existence of such a distinction in everyday sentence comprehension. Instead, these results are consistent with previous claims that apparent effects of frequency are underlyingly effects of predictability.
Improving classification of speech transcripts
Off-the-shelf speech recognition systems can yield useful results and accelerate application development, but general-purpose systems applied to specialized domains can introduce errors that are acoustically small but semantically catastrophic. Furthermore, sufficient audio data may not be available to develop custom acoustic models for niche tasks. To address these problems, we propose a method to improve performance in text classification tasks that use speech transcripts as input, without any in-domain audio data. Our method augments available typewritten text training data with inferred phonetic information so that the classifier will learn semantically important acoustic regularities, making it more robust to transcription errors from the general-purpose ASR system. We successfully pilot our method in a speech-based virtual patient used for medical training, recovering up to 62% of the errors incurred by feeding a small test set of speech transcripts to a classification model trained on typescript.
Exploring Mimic Loss for Robust ASR
We have recently devised a non-local criterion, called mimic loss, for training a model for speech denoising. This objective, which uses feedback from a senone classifier trained on clean speech, ensures that the denoising model produces spectral features that are useful for speech recognition. We combine this knowledge transfer technique with the traditional local criterion to train the speech enhancer. We achieve a new state of the art on the CHiME-2 corpus by feeding the denoised outputs to an off-the-shelf Kaldi recipe. An in-depth analysis of mimic loss reveals that this performance correlates with better reproduction of consonants with low average energy.
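One way to picture the combined objective is the sketch below. It assumes that both terms are mean squared errors: the local term compares denoised and clean features directly, and the mimic term compares the outputs of a frozen classifier on the two inputs. The single linear layer standing in for the senone classifier, the `mimic_weight` parameter, and all values are hypothetical illustrations, not details taken from the abstract.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def classifier_outputs(weights, features):
    """Toy stand-in for a frozen senone classifier: one linear layer."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def combined_loss(denoised, clean, classifier_weights, mimic_weight=1.0):
    """Local (fidelity) loss plus a weighted mimic loss for one frame.

    The fidelity term compares the features directly. The mimic term
    compares what the frozen classifier computes from each input, so it
    penalizes only the differences that matter to the recognizer.
    """
    fidelity = mse(denoised, clean)
    mimic = mse(
        classifier_outputs(classifier_weights, denoised),
        classifier_outputs(classifier_weights, clean),
    )
    return fidelity + mimic_weight * mimic, fidelity, mimic


# Toy frame (hypothetical): the second feature is badly reconstructed,
# and the classifier weights that feature twice as heavily as the first.
clean_frame = [1.0, 2.0]
denoised_frame = [1.0, 1.0]
frozen_weights = [[1.0, 0.0], [0.0, 2.0]]

total, fidelity_term, mimic_term = combined_loss(
    denoised_frame, clean_frame, frozen_weights
)
```

In this toy case the error falls on a feature the classifier relies on, so the mimic term is larger than the fidelity term. During training only the denoising model would be updated; the classifier stays fixed.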
Explicitly Incorporating Tense/Aspect to Facilitate Creation of New Virtual Patients
The Virtual Patient project has collected a fair amount of data from student interactions with a patient presenting with back pain, but there is a desire to include a more diverse array of patients. With adequate training examples, treating the question identification task as a single label classification problem has been fairly successful. However, the current approach is not expected to work well to identify the novel questions that are important for patients with different circumstances, because these new questions have little training support. Exploring the label sets reveals some generalities across patients, including the importance of temporal properties of the symptoms. Including temporal information in the canonical question representations may allow us to leverage external data to mitigate the data sparsity issue for questions unique to new patients. I will solicit feedback on an approach to create a frame-like question representation that incorporates this temporal information, as revealed by the tense and linguistic aspect of clauses in the queries.
Alternate Uses for Domain Adaptation and Neural Machine Translation
Recent advances in Neural Machine Translation (NMT) have had ripple effects in other areas of NLP. The advances I am concerned with in this talk have to do with using NMT sentence encodings in downstream NLP tasks. After verifying an experiment where Wang et al. (2017) used this technique for sentence selection, I would now like to use this approach for paraphrase identification. In this talk, I will discuss Wang et al.’s experiment, my reimplementation, and my plans for integrating similar approaches for augmenting data used in the Virtual Patient project.
Tailoring “language agnostic” blackboxes to Arabic Dialects
Many state-of-the-art NLP technologies aspire to be language-agnostic but perform disproportionately poorly on Arabic and its dialects. Identifying and understanding the linguistic phenomena that cause these performance drops, and developing language-specific solutions, can shed light on how such technologies might be adapted to broaden their typological coverage. This talk will discuss several recent projects involving Arabic dialects that I have worked on, including pan-dialectal dictionary induction, morphological modeling, and spelling normalization. For each of these projects, I will discuss the linguistic traits of Arabic that challenge language-agnostic approaches and the language-specific adaptations we employed to resolve such challenges. Finally, I will speculate on the generalizability of our solutions to other languages.
Learning from the best: A teacher-student framework for multilingual models for low-resource languages
Automatic Speech Recognition (ASR) in low-resource languages is difficult because transcribed speech is scarce. The amount of training data for any specific language in this category does not exceed 100 hours of speech. Recently, it has been found that knowledge obtained from a huge multilingual dataset (~1500 hours) is advantageous for ASR systems in low-resource settings: neural speech recognition models pre-trained on this dataset and then fine-tuned on language-specific data outperform models trained on language-specific data only. However, pre-training these models requires considerable time and resources, especially for models with recurrent connections. This work investigates the effectiveness of Teacher-Student (TS) learning to transfer knowledge from a recurrent speech recognition model (TDNN-LSTM) to a non-recurrent model (TDNN) in the context of multilingual speech recognition. Our results are interesting on more than one level. First, we find that student TDNN models trained using TS learning from a recurrent model (TDNN-LSTM) perform much better than their counterparts pre-trained using supervised learning. Second, these student models are trained only with language-specific data instead of the bulky multilingual dataset. Finally, the TS architecture allows us to leverage untranscribed data (previously untouched during supervised training), resulting in further improvement in the performance of the student TDNNs.
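A common formulation of teacher-student training, used here as an assumption because the abstract does not name the loss, minimizes the KL divergence between the teacher's and the student's output distributions for each frame. The sketch below uses hypothetical logits and function names. It also shows why untranscribed audio becomes usable: the target is the teacher's output distribution, so no reference label is needed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum for z in logits]


def teacher_student_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one frame; no transcript label is used."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical per-frame logits over three output classes.
teacher_frame = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
diverging_student = [0.0, 0.0, 0.0]

loss_match = teacher_student_loss(teacher_frame, matching_student)
loss_diverge = teacher_student_loss(teacher_frame, diverging_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it would be averaged over all frames of the training audio.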