Clippers 1/24: Sandro Maskharashvili on Discourse Relations in NLG

Discourse Relations: Their Role and Use in Natural Language Generation

Speakers make extensive use of discourse connectives (e.g., but, and, so, although) when communicating information-rich messages: discourse connectives express abstract relations, called discourse relations, between the pieces of information they connect, which helps the hearer understand the message the speaker intends to communicate. Traditional computational linguistic (CL) approaches to natural language processing rely heavily on modeling discourse relations, in both natural language generation (NLG) and parsing tasks. The recent emergence of neural network-based approaches to natural language modeling has led to remarkable advances in many CL tasks, including NLG. Nevertheless, when it comes to discourse-level phenomena, particularly the coherent use of discourse connectives, improvements are less obvious. First, I will present results of my doctoral research on the design of symbolic, grammatical approaches to discourse, which are in line with traditional CL approaches but overcome some important obstacles faced by previous approaches. Then, I will review studies we have been systematically carrying out to establish whether neural network-based approaches can be extended or revised to overcome the issues they face. Based on our results, I will argue that reinstating the central, ubiquitous status of discourse relations, by encoding them explicitly in natural language meaning representations, significantly improves the correct and coherent generation of discourse connectives with neural network-based approaches. Finally, I will discuss ample possibilities for exploring synergies between traditional, grammatical approaches and state-of-the-art neural network-based ones to overcome critical issues, such as data limitations for low-resourced languages and the interpretability of neural network-based models of language.

Clippers 11/29: Chris Brew on NLP Beyond Academia

What’s it like to be a research scientist/data scientist in industry?

I’ll expand on my short answer, which is in the next paragraph.

It varies with the DNA of the organization. For example, the places I have been earn money in different ways and value different things.
  • ETS (non-profit running tests like the GRE and TOEFL)
  • Nuance (speech products, often on contract to undisclosed big company)
  • Thomson Reuters (broad spectrum information provider)
  • Digital Operatives (subcontractor to the security industrial complex)
  • Facebook Applied AI (trying to suppress “harmful content”)
  • Facebook Linguistic Engineering (linguistic data and processes FTW)
  • LivePerson (chatbot services and products for Fortune 500-ish clients)
  • LexisNexis (information with a legal flavor, mostly for lawyers)

If you are a student now you are acquiring skills that will please and amaze people who are in business.

  • Communication. Do as much as you can, to as many audiences as you can, orally and in writing.
  • Evidence. There is great value in collecting evidence and using it to change your mind when you turn out to be wrong.
  • Persistence. Dealing with the fact that the original plan didn’t work as expected, but the problem still needs solving.
Absent from the list of skills is any particular technical tool. If I were giving this talk in 1990, people would be asking whether they could keep using Prolog or Lisp in the commercial world; in 2000, whether XML and XSLT were going to be important; or now, whether the company uses Keras, PyTorch, or MXNet. These are/were all perfectly valid questions, but the answers change as quickly as anything else on the Internet, so don’t count on that kind of expertise to get you where you want to go.

Clippers 11/22: Pranav Maneriker on Scaling Laws and Structure for Stylometry on Reddit

The problem of authorship identification (AID) consists of predicting whether two documents were composed by the same author. I will describe the creation of the Colossal Reddit User Dataset (CRUD), a corpus consisting of comment histories by five million anonymous Reddit users. The corpus comprises 2.2 billion Reddit comments from January 2015 to December 2019. To our knowledge, CRUD is the most extensive corpus of its kind and, as such, may prove a valuable resource for researchers interested in various aspects of user modeling, such as modeling author style. We will also discuss preliminary experimental results from scaling AID models on large datasets, inspired by related work on scaling laws for neural language models. Finally, we will discuss ongoing research on the role of interaction graph structures in AID.

Clippers 11/15: Christian Clark on Categorial Grammar Induction

Grammar induction is the task of learning a set of syntactic rules from an unlabeled text corpus. Much recent work in this area has focused on learning probabilistic context-free grammar (PCFG) rules; however, these rules are not sufficiently expressive to capture the full variety of structures found in human languages. Bisk and Hockenmaier (2012) present a system for inducing a Combinatory Categorial Grammar, a more expressive formalism, but this system learns from sentences with part-of-speech tags rather than unlabeled data. I will present my initial work toward implementing a categorial grammar induction system that can learn from unlabeled data using a neural network–based architecture.

Clippers 11/8: Vishal Sunder on Textual Knowledge Transfer for Speech Understanding

Title: Fine-grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding

Abstract: RNN Transducer (RNN-T) technology is very popular for building deployable models for end-to-end (E2E) automatic speech recognition (ASR) and spoken language understanding (SLU). Since these are E2E models operating on speech directly, there remains potential to improve their performance using purely text-based models like BERT, which have strong language understanding capabilities. In this work, we propose a new training criterion for RNN-T based E2E ASR and SLU to transfer BERT’s knowledge into these systems. In the first stage of our proposed mechanism, we improve ASR performance by using a fine-grained, tokenwise knowledge transfer from BERT. In the second stage, we fine-tune the ASR model for SLU such that the above knowledge is explicitly utilized by the RNN-T model for improved performance. Our techniques improve ASR performance on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and on the recently released SLURP dataset, on which we achieve a new state-of-the-art performance. For SLU, we show significant improvements on the SLURP slot filling task, outperforming HuBERT-base and reaching a performance close to HuBERT-large. Compared to large transformer-based speech models like HuBERT, our model is significantly more compact and uses only 300 hours of speech pretraining data.
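The "fine-grained, tokenwise knowledge transfer" in the first stage can be pictured as a distillation-style objective: for each token position, the student's predictive distribution is pulled toward the text teacher's. The sketch below is only a generic illustration of such a tokenwise loss, not the authors' exact criterion; the function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tokenwise_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean tokenwise KL(teacher || student) over a sequence.

    student_logits, teacher_logits: (seq_len, vocab) arrays, e.g. the
    ASR model's and BERT's per-token predictions (hypothetical setup).
    """
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

In practice such a term would be added, with some weight, to the usual RNN-T transducer loss; the temperature controls how much of the teacher's soft probability mass is transferred.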

Clippers 10/25: Lingbo Mo on Controllable Decontextualization

Yes/No or polar questions represent one of the main linguistic question categories. They consist of a main interrogative clause, for which the answer is binary (assertion or negation). Polar questions and answers (PQA) represent a valuable knowledge resource, present in many communities and other curated QA sources, such as forums or e-commerce applications. Using answers to polar questions alone in other contexts is not trivial: answers are contextualized, and presume that the interrogative question clause and any shared knowledge between the asker and answerer are provided. We address the problem of controllable rewriting of answers to polar questions into decontextualized and succinct factual statements. We propose a Transformer sequence-to-sequence model that utilizes soft constraints to ensure controllable rewriting, such that the output statement is semantically equivalent to its PQA input. We evaluate on three separate PQA datasets, using both automated and human evaluation metrics, and show the effectiveness of our proposed approach compared with existing baselines.

Clippers 10/18: Amad Hussain on Data Augmentation using Paraphrase Generation and Mix-Up

Amad Hussain and Henry Leonardi

Abstract: Low-resource dialogue systems often contain a high proportion of few-shot class labels, leading to challenges in utterance classification performance. A potential solution is data augmentation through paraphrase generation, but this method can introduce harmful data points in the form of low-quality paraphrases. We explore this challenge in a case study using a virtual patient dialogue system, which contains a long-tail distribution of few-shot labels. We investigate the efficacy of paraphrase augmentation through Neural Example Extrapolation (Ex2) using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques based on Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques are validated by training and evaluating a downstream self-attentive RNN model with and without MIXUP. Initial results indicate that paraphrase augmentation improves downstream model performance, though with less benefit than augmenting with MIXUP. Furthermore, we show mixed results for paraphrase augmentation in combination with MIXUP, as well as for the efficacy of paraphrase validation. These results indicate a trade-off between filtering out misleading paraphrases and preserving paraphrase diversity. In light of these initial findings, we identify promising areas of future work that have the potential to address this trade-off and better leverage paraphrase augmentation, especially in coordination with MIXUP. As this is work in progress, we hope to have a productive conversation about the feasibility of our future directions, as well as any larger limitations or directions we should consider.
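For readers unfamiliar with MIXUP (Zhang et al., 2018): it augments training data by taking convex combinations of example pairs and their label vectors. A minimal sketch, with illustrative names and the commonly used Beta-distributed mixing coefficient:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix a pair of feature vectors and their label vectors.

    A coefficient lam ~ Beta(alpha, alpha) interpolates both the inputs
    and the (one-hot or soft) labels, yielding a synthetic training example.
    """
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

In text classification, the interpolation is typically applied to embeddings or hidden states rather than raw tokens, which is presumably how it would combine with a self-attentive RNN encoder.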

Clippers 10/11: Willy Cheung on Targeted Linguistic Evaluation of Cataphora

Due to their state-of-the-art performance on natural language processing tasks, large neural language models have garnered significant interest of late. To better understand their linguistic abilities, researchers have used the targeted linguistic evaluation paradigm to test neural models in a more linguistically controlled manner. Following this line of work, I am interested in investigating how neural models handle cataphora, i.e., when a pronoun precedes what it refers to (e.g., when [he] gets to work, [John] likes to drink a cup of coffee). I will present work that uses stimuli from existing cataphora studies, running GPT-2 on them and comparing its results to the experimental data. A number of issues arise in comparing to existing studies, motivating a new study to collect data better suited to testing neural models. I show the setup for my pilot experiment and some preliminary results, and end with some ideas for future directions of this work.
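Targeted evaluations of this kind commonly compare a model's surprisal over a critical region (here, plausibly the antecedent of the cataphoric pronoun) across matched conditions. As a hedged sketch, assuming per-token probabilities have already been extracted from a model such as GPT-2:

```python
import numpy as np

def region_surprisal(token_probs, region):
    """Summed surprisal in bits, -log2 p(token), over a critical region.

    token_probs: per-token next-word probabilities assigned by the model.
    region: indices of the tokens forming the critical region.
    """
    s = -np.log2(np.asarray(token_probs, dtype=float))
    return float(s[list(region)].sum())
```

A condition effect would then show up as a difference in region surprisal between, say, sentences where the cataphoric reading is licensed and ones where it is not; the exact design here is an assumption, not the pilot's actual setup.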

Clippers 10/4: Andy Goodhart on Sentiment Analysis for Multi-Label Classification

Title: Perils of Legitimacy: How Legitimation Strategies Sow the Seeds of Failure in International Order

Abstract: Autocratic states are challenging U.S. power and the terms of the post-WWII security order. U.S. policy debates have focused on specific military and economic responses that might preserve the United States’ favorable position, while largely taking for granted that the effort should be organized around a core of like-minded liberal states. I treat this U.S. emphasis on promoting a liberal narrative of international order as an effort to make U.S. hegemony acceptable to domestic and foreign audiences; it is a strategy to legitimate a U.S.-led international hierarchy and mobilize political cooperation. Framing legitimacy in liberal terms is only one option, however. Dominant states have used a range of legitimation strategies that present unique advantages and disadvantages. The main choice these hierarchs face is whether to emphasize the order’s ability to solve problems or to advocate for a governing ideology like liberalism. This project aims to explain why leading states in the international system choose performance- or ideologically-based legitimation strategies and the advantages and disadvantages of each.

This research applies sentiment analysis techniques (designed to characterize text based on positive or negative language) to the multi-label classification of foreign policy texts. The goal is to take a corpus of foreign policy speeches and documents that include rhetoric intended to justify an empire or hegemon’s international behavior and build a data set that shows variation in this rhetoric over time. Custom dictionaries reflect vocabulary used by each hierarch to articulate its value proposition to subordinate political actors. The output of the model is the percentage of each text committed to performance- and ideologically-based legitimation strategies. Using sentiment analysis for document classification represents an improvement over supervised machine learning techniques because it does not require the time-consuming step of creating training sets. It is also better suited to multi-label classification, in which each document belongs to multiple categories. Supervised machine learning techniques are better suited to texts that are either homogeneous in their category (e.g., a press release is either about health care or about foreign policy) or easily divided into sections that belong to homogeneous categories.
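The dictionary-based scoring described above can be sketched in a few lines. This is a minimal illustration, not the project's actual dictionaries or tokenizer; the labels and terms below are invented for the example.

```python
def legitimation_shares(text, dictionaries):
    """Fraction of a document's tokens matching each strategy's dictionary.

    dictionaries: {label: set of terms}. A token can match more than one
    dictionary, so the per-label shares are independent, which is what
    makes this multi-label rather than single-label classification.
    """
    tokens = text.lower().split()
    n = len(tokens) or 1  # avoid division by zero on empty documents
    return {label: sum(tok in terms for tok in tokens) / n
            for label, terms in dictionaries.items()}
```

For example, with a hypothetical performance dictionary {"prosperity", "stability"} and ideology dictionary {"liberty", "democracy"}, a speech's output would be one share per strategy, tracked over time across the corpus.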