With the recent surge of interest in deep learning, linguists within the NLP community have used carefully constructed linguistic examples for targeted assessment of models' linguistic capabilities, to see what models really know and where they fall short. In the spirit of these studies, my project investigates neural network behavior on a linguistic phenomenon that has not received much attention: cataphora (i.e., when a referring expression such as a pronoun precedes its antecedent). I investigate the behavior of two models on cataphora: WebNLG (a model trained for NLG as described in Li et al. 2020, based on the pretrained T5 model of Raffel et al. 2019), and the Joshi model (a model finetuned for coreference resolution as described in Joshi et al. 2019, based on the pretrained BERT model of Devlin et al. 2019). The general idea is to test whether these models can distinguish acceptable from unacceptable examples involving cataphora. The factors I will investigate include 1) preposed (i.e., fronted) vs. postposed clauses; 2) cataphora across subordination vs. coordination of clauses; and 3) a special case of pragmatic subordination with contrastive “but”.
Ash Lewis and Lingbo Mo will present an update on their work, beginning with a paper called Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction. Since they last presented, they have conducted further experiments and begun planning for a “real user” study. They will also share their thoughts on potential future work for feedback. An abstract of the paper can be found below.
Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction
Existing studies on semantic parsing focus on mapping a natural-language utterance to a logical form (LF) in one turn. However, because natural language may contain ambiguity and variability, this is a difficult challenge. In this work, we investigate an interactive semantic parsing framework that explains the predicted LF step by step in natural language and enables the user to make corrections through natural-language feedback for individual steps. We focus on question answering over knowledge bases (KBQA) as an instantiation of our framework, aiming to increase the transparency of the parsing process and help the user trust the final answer. We construct INSPIRED, a crowdsourced dialogue dataset derived from the ComplexWebQuestions dataset. Our experiments show that this framework has the potential to greatly improve overall parse accuracy. Furthermore, we develop a pipeline for dialogue simulation to evaluate our framework w.r.t. a variety of state-of-the-art KBQA models without further crowdsourcing effort. The results demonstrate that our framework promises to be effective across such models.
Enriching Linguistic Analyses by Modelling Neutral and Controversial Items
Typically, linguistic analyses are performed over datasets composed of text items where each item is assigned a category that represents a phenomenon. This category is obtained by combining multiple human annotations. Items considered for analyses are often those which exhibit a clear polarizing phenomenon (e.g. either polite or impolite). However, language can sometimes exhibit none of those phenomena (neither polite nor impolite) or a combination of phenomena (e.g. polite and impolite). This is evident in NLU datasets as they contain a significant number of items on which annotators disagreed, or agreed that they do not exhibit any phenomenon. The goal is to discover linguistic patterns associated with those items. This helps in further enriching linguistic analyses by providing insight into how language could be interpreted by different listeners.
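As a rough illustration of how multiple annotations might be combined so that neutral and controversial items are kept rather than discarded, here is a minimal Python sketch. The function name `categorize`, the `"controversial"` label, and the 0.6 agreement threshold are all assumptions made for this example; the abstract does not specify an aggregation rule.

```python
from collections import Counter


def categorize(annotations, agreement_threshold=0.6):
    """Combine one item's annotations into a single category.

    Hypothetical rule (not from the abstract): if the most frequent
    label covers at least `agreement_threshold` of the annotations,
    return it (this may be "neutral" when annotators agree the item
    shows no phenomenon); otherwise mark the item "controversial".
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    top_label, top_count = counts.most_common(1)[0]
    if top_count / len(annotations) < agreement_threshold:
        return "controversial"
    return top_label


if __name__ == "__main__":
    print(categorize(["polite", "polite", "polite", "impolite"]))
    print(categorize(["polite", "impolite", "polite", "impolite"]))
    print(categorize(["neutral", "neutral", "neutral"]))
```

With a threshold above 0.5, a tie between labels can never pass as agreement, so tied items always fall into the controversial group.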
SYSML: StYlometry with Structure and Multitask Learning: Implications for Darknet Forum Migrant Analysis
Darknet market forums are frequently used to exchange illegal goods and services between parties who use encryption to conceal their identities. These markets are hosted on the Tor network, which guarantees additional anonymization from IP and location tracking, making it challenging to link malicious users who operate multiple accounts (sybils). Additionally, users migrate to new forums when one is closed, making it difficult to link users across multiple forums. We develop a novel stylometry-based multitask learning approach for natural language and interaction modeling using graph embeddings to construct low-dimensional representations of short episodes of user activity for authorship attribution. We provide a comprehensive evaluation of our methods across four different darknet forums, demonstrating their efficacy over the state-of-the-art, with a lift of up to 2.5X on Mean Retrieval Rank and 2X on Recall@10.
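To make the retrieval-style evaluation more concrete, the following Python sketch computes Recall@k and a mean reciprocal rank over ranked candidate lists. It is only an illustration under stated assumptions: the abstract does not define "Mean Retrieval Rank", so the reciprocal-rank formulation here is one common reading, and the function names and the `(ranked_ids, true_id)` input format are hypothetical.

```python
def rank_of_match(ranked_ids, true_id):
    """Return the 1-based rank of true_id in ranked_ids, or None if absent."""
    try:
        return ranked_ids.index(true_id) + 1
    except ValueError:
        return None


def recall_at_k(queries, k=10):
    """Fraction of queries whose true author appears in the top k candidates.

    `queries` is a list of (ranked_ids, true_id) pairs (assumed format).
    """
    hits = sum(1 for ranked_ids, true_id in queries if true_id in ranked_ids[:k])
    return hits / len(queries)


def mean_reciprocal_rank(queries):
    """Average of 1/rank of the true author; a missing author contributes 0."""
    total = 0.0
    for ranked_ids, true_id in queries:
        rank = rank_of_match(ranked_ids, true_id)
        if rank is not None:
            total += 1.0 / rank
    return total / len(queries)


if __name__ == "__main__":
    # Toy data: each episode is scored against three candidate authors.
    toy_queries = [
        (["a", "b", "c"], "a"),  # true author ranked first
        (["a", "b", "c"], "c"),  # true author ranked third
        (["a", "b", "c"], "z"),  # true author not retrieved
    ]
    print(recall_at_k(toy_queries, k=2))
    print(mean_reciprocal_rank(toy_queries))
```

Under these definitions, a "2X lift on Recall@10" would mean that the true author appears among the top ten candidates for twice as many query episodes as with the baseline.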