Do languages differ in semantic transparency of derived words? Using word vectors to explore English and Russian
This study explores whether the semantic relationship of derived words to their bases is similarly sensitive to word frequency in English and Russian. High-frequency derived words are thought to be memorized by speakers, rather than being parsed into constituents. As a result, such words may become semantically opaque, implying that frequent words have lower average transparency. We investigated whether distributional differences of English and Russian derivational suffixes translate into differences in semantic transparency, using cosine similarity of word vectors. Our results show a positive correlation between derived word frequency and semantic transparency, contrary to expectations. This may reflect suffix-specific effects.
With the recent explosion and hype of deep learning, linguists within the NLP community have used carefully constructed linguistic examples to do targeted assessment of model linguistic capability, to see what models really know and where they fall short. In the spirit of these studies, my project aims to investigate neural network behavior on a linguistic phenomenon that has not received much attention: cataphora (i.e. when a referring expression such as a pronoun precedes its antecedent). I investigate the behavior of two models on cataphora: WebNLG (a model trained for NLG as described in Li et al 2020, based on pretrained T5 model in Raffel et al 2019), and the Joshi model (a finetuned model for coreference resolution described in Joshi et al 2019, based on the pretrained BERT model in Devlin et al 2019). The general idea is to test whether these models can distinguish acceptable and unacceptable examples involving cataphora. Some factors I will be investigating include 1) preposed (ie fronted) vs. postposed clauses. 2) cataphora across subordination vs. coordination of clauses. 3) a special case of pragmatic subordination with contrastive “but”.
Ash Lewis and Lingbo Mo will present an update on their work, beginning with a paper called Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction. Since they last presented, they have conducted further experiments and begun planning for a “real user” study. They will also share their thoughts on potential future work for feedback. An abstract of the paper can be found below.
Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction
Existing studies on semantic parsing focus on mapping a natural-language utterance to a logical form (LF) in one turn. However, because natural language may contain ambiguity and variability, this is a difficult challenge. In this work, we investigate an interactive semantic parsing framework that explains the predicted LF step by step in natural language and enables the user to make corrections through natural-language feedback for individual steps. We focus on question answering over knowledge bases (KBQA) as an instantiation of our framework, aiming to increase the transparency of the parsing process and help the user trust the final answer. We construct INSPIRED, a crowdsourced dialogue dataset derived from the ComplexWebQuestions dataset. Our experiments show that this framework has the potential to greatly improve overall parse accuracy. Furthermore, we develop a pipeline for dialogue simulation to evaluate our framework w.r.t. a variety of state-of-the-art KBQA models without further crowdsourcing effort. The results demonstrate that our frameworkpromise s to be effective across such models.
Enriching Linguistic Analyses by Modelling Neutral and Controversial Items
Typically, linguistic analyses are performed over datasets composed of text items where each item is assigned a category that represents a phenomenon. This category is obtained by combining multiple human annotations. Items considered for analyses are often those which exhibit a clear polarizing phenomenon (e.g. either polite or impolite). However, language can sometimes exhibit none of those phenomena (neither polite nor impolite) or a combination of phenomena (e.g. polite and impolite). This is evident in NLU datasets as they contain a significant number of items on which annotators disagreed, or agreed that they do not exhibit any phenomenon. The goal is to discover linguistic patterns associated with those items. This helps in further enriching linguistic analyses by providing insight into how language could be interpreted by different listeners.
SYSML: StYlometry with Structure and Multitask Learning: Implications for Darknet Forum Migrant Analysis
Darknet market forums are frequently used to exchange illegal goods and services between parties who use encryption to conceal their identities. The Tor network is used to host these markets, which guarantees additional anonymization from IP and location tracking, making it challenging to link across malicious users using multiple accounts (sybils). Additionally, users migrate to new forums when one is closed, making it difficult to link users across multiple forums. We develop a novel stylometry-based multitask learning approach for natural language and interaction modeling using graph embeddings to construct low-dimensional representations of short episodes of user activity for authorship attribution. We provide a comprehensive evaluation of our methods across four different darknet forums demonstrating its efficacy over the state-of-the-art, with a lift of up to 2.5X on Mean Retrieval Rank and 2X on Recall@10.
Modeling Plural Inflection Class Structure in Maltese
Theoretical and typological research in morphology define an inflectional paradigm as the collection of related word forms associated with a given lexeme. When multiple lexemes share the same paradigm, they in turn define an inflection class. Recent work in morphology uses information theory to quantify the complexity of a language’s inflectional system in terms of interpredictability across word forms and paradigms. These studies provide precise synchronic descriptions of inflectional structure, but are unable to account for how or why these systems emerge in language-specific ways. I’ll be presenting on ongoing research for my QP1 that addresses this question by modeling the relative influence of three factors – phonological form, semantic meaning, and etymological origin – on the organization of plural inflection classes in Maltese.
Are syntactic categories real?
People can express novel, precise complex ideas — plans with sophisticated contingencies, predictive models of interrelated uncertain events, and more — which seems to suggest a formal, compositional semantics in which sentences are divided into categories with associated semantic functions. But state-of-the-art NLP systems – transformers like BERT and GPT-3 — don’t work like that. This talk will review evidence about syntactic categories from sentence processing experiments and grammar inductions simulations conducted over the past few years in the OSU computational cognitive modeling lab, and hazard some guesses about the cognitive status of syntactic categories.
Title: Semi-Supervised Heterogeneous Feature Learning in a Large-Scale Conversational AI System
Abstract: This paper aims to improve an important downstream component of a large-scale industrial conversational AI system. The component is called the Skill Routing Component (SRC) and is responsible for a variety of tasks. As the last component before executing user requests, SRC utilizes many textual and symbolic features obtained from heterogeneous upstream components like automatic speech recognition (ASR) and natural language understanding (NLU), which necessitates the need for an efficient way to utilize these features. To achieve this, we propose a unified transformer model which in contrast to the traditional methods encodes the heterogeneous features into a shared latent space. Next, there is an inherent connection between SRC tasks and upstream NLU tasks. We utilize noisy NLU data for pre-training the unified SRC model via specifically curated objectives and fine-tune it separately on the different SRC tasks. Our method shows an average improvement of 1.8% on four SRC tasks over the state-of-the-art baseline.
Title: Towards end-to-end integration of dialog history for improved spoken language understanding.
Abstract: Dialog history plays an important role in spoken language understanding (SLU) performance in a dialog system. For end-to-end (E2E) SLU, previous work has used dialog history in text form, which makes the model dependent on a cascaded automatic speech recognizer (ASR). This rescinds the benefits of an E2E system which is intended to be compact and robust to ASR errors. In this work, we propose a hierarchical conversation model that is capable of directly using dialog history in speech form, making it fully E2E. We also distill semantic knowledge from the available gold conversation transcripts by jointly training a similar text-based conversation model with an explicit tying of acoustic and semantic embeddings. We also propose a novel technique that we call DropFrame to deal with the long training time incurred by adding dialog history in an E2E manner. On the HarperValleyBank dialog dataset, our E2E history integration outperforms a history independent baseline by 7.7% absolute F1 score on the task of dialog action recognition. Our model performs competitively with the state-of-the-art history based cascaded baseline, but uses 48% fewer parameters. In the absence of gold transcripts to fine-tune an ASR model, our model outperforms this baseline by a significant margin of 10% absolute F1 score.
Byung-Doh Oh will be presenting his work unsupervised grammar induction, followed by some attempts to extend the project.
Character-based PCFG Induction for Modeling the Syntactic Acquisition of Morphologically Rich Languages
Unsupervised PCFG induction models, which build syntactic structures from raw text, can be used to evaluate the extent to which syntactic knowledge can be acquired from distributional information alone. However, many state-of-the-art PCFG induction models are word-based, meaning that they cannot directly inspect functional affixes, which may provide crucial information for syntactic acquisition in child learners. This work first introduces a neural PCFG induction model that allows a clean ablation of the influence of subword information in grammar induction. Experiments on child-directed speech demonstrate first that the incorporation of subword information results in more accurate grammars with categories that word-based induction models have difficulty finding, and second that this effect is amplified in morphologically richer languages that rely on functional affixes to express grammatical relations. A subsequent evaluation on multilingual treebanks shows that the model with subword information achieves state-of-the-art results on many languages, further supporting a distributional model of syntactic acquisition.