Multimodal processing and cross-lingual relation extraction at BBN
I will show the architecture of a system we have built to process visual, audio, and text information in parallel to support hypothesis generation. Then I will talk about a specific research thrust into relation extraction, a text-based technology, using BERT embeddings and annotation projection to perform relation extraction in Russian and Ukrainian.
Real-time Mispronunciation Detection for Kids’ Speech
Modern mispronunciation detection and diagnosis systems have seen significant gains in accuracy due to the introduction of deep learning. However, these systems have not been evaluated for their ability to run in real time, an important factor in applications that provide rapid feedback. In particular, the state of the art uses bi-directional recurrent networks, where a uni-directional network may be more appropriate. Teacher-student learning is a natural approach to improving a uni-directional model, but when using a CTC objective, this is limited by poor alignment of outputs to evidence. We address this limitation with two loss terms for improving the alignments of our models. One is an “alignment loss” term that encourages outputs only when features do not resemble silence. The other uses a uni-directional model as a teacher model to align the bi-directional model. Our proposed model uses these aligned bi-directional models as teacher models. Experiments on the CSLU kids’ corpus show that these changes decrease the latency of the outputs and improve the detection rates, with a trade-off between these goals.
Do you know that there’s still a chance? Identifying speaker commitment for natural language understanding
Marie-Catherine de Marneffe
When we communicate, we infer a lot beyond the literal meaning of the words we hear or read. In particular, our understanding of an utterance depends on assessing the extent to which the speaker stands by the event she describes. An unadorned declarative like “The cancer has spread” conveys firm speaker commitment to the cancer having spread, whereas “There are some indicators that the cancer has spread” imbues the claim with uncertainty. It is not only the absence vs. presence of embedding material that determines whether or not a speaker is committed to the event described: from (1) we will infer that the speaker is committed to there being war, whereas in (2) we will infer that the speaker is committed to relocating species not being a panacea, even though the clauses that describe the events in (1) and (2) are both embedded under “(s)he doesn’t believe”.
(1) The problem, I’m afraid, with my colleague here, he really doesn’t believe that it’s war.
(2) Transplanting an ecosystem can be risky, as history shows. Hellmann doesn’t believe that relocating species threatened by climate change is a panacea.
In this talk, I will first illustrate how looking at pragmatic information about what speakers are committed to can improve NLP applications. Previous work has tried to predict the outcome of contests (such as the Oscars or elections) from tweets. I will show that by distinguishing tweets that convey firm speaker commitment toward a given outcome (e.g., “Dunkirk will win Best Picture in 2018”) from ones that only suggest the outcome (e.g., “Dunkirk might have a shot at the 2018 Oscars”) or tweets that convey the negation of the event (e.g., “Dunkirk is good but not academy level good for the Oscars”), we can outperform previous methods. Second, I will evaluate current models of speaker commitment, using the CommitmentBank, a dataset of naturally occurring discourses developed to deepen our understanding of the factors at play in identifying speaker commitment. We found that a linguistically informed model outperforms an LSTM-based one, suggesting that linguistic knowledge is needed to achieve robust language understanding. Both models, however, fail to generalize to the diverse linguistic constructions present in natural language, highlighting directions for improvement.
Constrained Decoding for Neural NLG from Compositional Representations in Task-Oriented Dialogue
(joint work with Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani and Rajen Subba)
Neural methods for natural language generation (NNLG) arrived with much fanfare a few years ago and became the dominant method employed in the recent E2E NLG Challenge. While neural methods promise flexible, end-to-end trainable models, recent studies have revealed their inability to produce satisfactory output for longer or more complex texts as well as how the black-box nature of these models makes them difficult to control. In this talk, I will propose using tree-structured semantic representations, like those used in traditional rule-based NLG systems, for better discourse-level structuring and sentence-level planning. I will then introduce a constrained decoding approach for sequence-to-sequence models that leverages this representation to improve semantic correctness. Finally, I will demonstrate promising results on a new conversational weather dataset as well as the E2E dataset and discuss remaining challenges.
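The idea of checking generated output against a tree-structured input can be illustrated with a minimal sketch. This is not the talk's decoder: it is a simplified, post-hoc filter over already-generated candidates, whereas the abstract describes constraints applied during sequence-to-sequence decoding. The bracketed format, the labels (`INFORM`, `condition`, `date`), and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def nonterminals(bracketed):
    """Count labels that directly follow an opening bracket, e.g. '[INFORM'."""
    return Counter(
        tok[1:] for tok in bracketed.split()
        if tok.startswith("[") and len(tok) > 1
    )


def satisfies_constraints(input_mr, candidate):
    """True if the candidate realizes exactly the non-terminals of the input."""
    return nonterminals(candidate) == nonterminals(input_mr)


def filter_candidates(input_mr, candidates):
    """Keep only candidates whose tree annotations cover the input tree."""
    return [c for c in candidates if satisfies_constraints(input_mr, c)]


# Hypothetical meaning representation and two annotated candidate outputs.
mr = "[INFORM [condition rain ] [date today ] ]"
candidates = [
    "[INFORM it will [condition rain ] ]",               # drops the date
    "[INFORM expect [condition rain ] [date today ] ]",  # covers everything
]
kept = filter_candidates(mr, candidates)
```

Here only the second candidate survives, because the first omits the `date` node. A decoder-internal version would apply a comparable check incrementally so that invalid hypotheses are pruned before they are completed.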
Title: fMRI reveals language-specific predictive coding during naturalistic sentence comprehension
Abstract: Much research in cognitive neuroscience supports prediction as a canonical computation of cognition in many domains. Is such predictive coding implemented by feedback from higher-order domain-general circuits, or is it locally implemented in domain-specific circuits? What information sources are used to generate these predictions? This study addresses these two questions in the context of language processing. We present fMRI evidence from a naturalistic comprehension paradigm (1) that predictive coding in the brain’s response to language is domain-specific, and (2) that these predictions are sensitive both to local word co-occurrence patterns and to hierarchical structure. Using a recently developed continuous-time deconvolutional regression technique that supports data-driven hemodynamic response function discovery from continuous BOLD signal fluctuations in response to naturalistic stimuli, we found effects of prediction measures in the language network but not in the domain-general, multiple-demand network. Moreover, within the language network, surface-level and structural prediction effects were separable. The predictability effects in the language network were substantial, with the model capturing over 37% of explainable variance on held-out data. These findings indicate that human sentence processing mechanisms generate predictions about upcoming words using cognitive processes that are sensitive to hierarchical structure and specialized for language processing, rather than via feedback from high-level executive control mechanisms.
We demonstrate a natural language understanding module for a question-answering dialog agent in a resource-constrained virtual patient domain, which combines both rule-based and machine learning approaches. We further validate the model development work by performing a replication study using live subjects, broadly confirming the findings from the development process using a fixed dataset, but highlighting important deficits. In particular, the hybrid approach continues to show substantial improvements over either rule-based or machine learning approaches individually, even handling unseen classes with some success; however, the system has unexpected difficulty handling out-of-domain questions. We attempt to mitigate this issue with moderate success, and provide analysis of the problem to suggest future improvements.
Nanjiang Jiang and Marie-Catherine de Marneffe’s paper entitled “Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment” has won Best Short Paper at the ACL 2019 Annual Meeting. It is one of three papers to receive a best paper award out of 660 accepted papers in total.
Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders
This study deploys binary stochastic neural autoencoder networks as models of infant language learning in two typologically unrelated languages (Xitsonga and English). Results show that the drive to model auditory percepts leads to latent clusters that partially align with theory-driven phonemic categories. Evaluation of the degree to which theory-driven phonological features are encoded in the latent bit patterns shows that some (e.g. [±approximant]) are well represented by the network in both languages, while others (e.g. [±spread glottis]) are less so. Together, these findings suggest that many reliable cues to phonemic structure are immediately available to infants from bottom-up perceptual characteristics alone, but that these cues must eventually be supplemented by top-down lexical and phonotactic information to achieve adult-like phone discrimination. These results also suggest differences in degree of perceptual availability between features, yielding testable predictions as to which features might depend more or less heavily on top-down cues during child language acquisition.
Towards a Coreference-aware Measure of Surprisal
This talk will describe ongoing work to model coreference as an incremental process, discussing current results, model design, and current challenges. Coreference is the semantic identity relationship between entities. Humans are able to effortlessly produce and comprehend language that describes coreference relations. While much work has explored coreference from a psycholinguistic angle, extensive modeling efforts have come from a more task-oriented NLP domain that does not seek to model cognitively plausible mechanisms. The current work attempts to bridge the two approaches by modeling coreference as part of an incremental semantic parsing process. Ultimately the model will be evaluated on parsing performance, coreference performance, and how well its predictions correlate with human processing data.
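For readers unfamiliar with the term in the title, surprisal is conventionally defined as the negative log probability of a word given its preceding context. The sketch below shows only that standard definition; it says nothing about how the talk's coreference-aware model would estimate the probability, and the function name is a placeholder.

```python
import math


def surprisal_bits(prob):
    """Surprisal in bits of an event with probability `prob`: -log2(prob)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# A word the model assigns probability 0.25 carries 2 bits of surprisal.
example = surprisal_bits(0.25)
```

Less probable words yield higher surprisal, which is the quantity typically compared against human processing measures such as reading times.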
Learning to disambiguate by combining multiple sense representations
This talk will discuss ongoing work investigating the combination of multiple sense representation methods for word sense disambiguation (WSD). A variety of recent methods have been proposed for learning representations of semantic senses in different domains, and there is some evidence that different methods capture complementary information for WSD. We consider a simple but competitive cosine similarity-based model for WSD, and augment it by learning to produce a context-sensitive linear transformation of representations of candidate senses. In addition to transforming the input sense space, our method allows us to jointly project multiple sense representations into a single space. We find that a single learned projection matches or outperforms directly updated sense embeddings for single embedding methods, and demonstrate that combining multiple representations improves over any individual method alone. Further, by transforming and conjoining complete embedding spaces, we gain the ability to transfer model knowledge to ambiguous terms not seen during training; we are currently investigating the effectiveness of this transfer.
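A cosine similarity-based WSD model of the kind mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speakers' implementation: the sense identifiers, the two-dimensional toy vectors, and the function names are all hypothetical, and the `transform` matrix is a fixed stand-in for the learned, context-sensitive linear transformation the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def apply_linear(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def disambiguate(context_vec, candidate_senses, transform=None):
    """Return the (sense_id, score) pair with the highest cosine similarity.

    If `transform` is given, it is applied to every candidate sense vector
    before scoring (an assumed placeholder for a learned projection).
    """
    best_sense, best_score = None, float("-inf")
    for sense_id, sense_vec in candidate_senses.items():
        if transform is not None:
            sense_vec = apply_linear(transform, sense_vec)
        score = cosine(context_vec, sense_vec)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense, best_score


# Toy example with hypothetical sense vectors for the word "bank".
senses = {"bank%finance": [1.0, 0.0], "bank%river": [0.0, 1.0]}
context = [0.9, 0.1]
swap = [[0.0, 1.0], [1.0, 0.0]]  # toy transform that exchanges the two axes

plain_choice = disambiguate(context, senses)[0]
swapped_choice = disambiguate(context, senses, transform=swap)[0]
```

Without a transform the context vector is closest to the finance sense; applying the axis-swapping matrix to the sense vectors reverses the choice, which shows how a projection of the sense space can change the prediction.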