Clippers 10/4: Andy Goodhart on sentiment analysis for multi-label classification

Title: Perils of Legitimacy: How Legitimation Strategies Sow the Seeds of Failure in International Order

Abstract: Autocratic states are challenging U.S. power and the terms of the post-WWII security order. U.S. policy debates have focused on specific military and economic responses that might preserve the United States’ favorable position while largely taking for granted that the effort should be organized around a core of like-minded liberal states. I treat this U.S. emphasis on promoting a liberal narrative of international order as an effort to make U.S. hegemony acceptable to domestic and foreign audiences; it is a strategy to legitimate a U.S.-led international hierarchy and mobilize political cooperation. Framing legitimacy in liberal terms is only one option, however. Dominant states have used a range of legitimation strategies, each of which presents distinct advantages and disadvantages. The main choice these hierarchs face is whether to emphasize the order’s ability to solve problems or to advocate for a governing ideology like liberalism. This project aims to explain why leading states in the international system choose performance-based or ideologically based legitimation strategies and the advantages and disadvantages of each.

This research applies sentiment analysis techniques (which were designed to characterize text based on positive or negative language) to the multi-label classification of foreign policy texts. The goal is to take a corpus of foreign policy speeches and documents that include rhetoric intended to justify an empire or hegemon’s international behavior and build a data set that shows variation in this rhetoric over time. Custom dictionaries reflect the vocabulary used by each hierarch to articulate its value proposition to subordinate political actors. The output of the model is the percentage of each text committed to performance-based and ideologically based legitimation strategies. Using sentiment analysis for document classification represents an improvement over supervised machine learning techniques, because it does not require the time-consuming step of creating training sets. It is also better suited to multi-label classification, in which each document belongs to multiple categories. Supervised machine learning techniques are better suited to texts that are either homogeneous in their category (e.g., a press release is either about health care or about foreign policy) or easily divided into sections that belong to homogeneous categories.
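
As a rough illustration of the dictionary-based scoring described above, here is a minimal Python sketch. The two word lists, the tokenizer, and the function name `legitimation_shares` are hypothetical stand-ins chosen for this example; the abstract does not give the actual custom dictionaries or the scoring details, so this should be read as one plausible reading rather than the author's method.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries; the real custom dictionaries are not
# given in the abstract and would be built per hierarch.
DICTIONARIES = {
    "performance": {"security", "stability", "prosperity", "trade", "order"},
    "ideology": {"freedom", "democracy", "liberty", "rights", "values"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def legitimation_shares(text, dictionaries=DICTIONARIES):
    """Return, per label, the percentage of tokens found in that label's dictionary.

    A text can score above zero on several labels at once, which is what
    makes the output multi-label rather than a single category.
    """
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter()
    for tok in tokens:
        for label, vocab in dictionaries.items():
            if tok in vocab:
                counts[label] += 1
    return {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label in dictionaries
    }

speech = "We defend freedom and democracy, and we promise security and trade."
shares = legitimation_shares(speech)
print(shares)
```

Because the scores come straight from word counts, no labeled training set is needed, and a single speech can carry both kinds of legitimation language at once.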

Clippers 9/27: Micha Elsner on community-centered morphological annotation

Towards community-centered morphological annotation
Micha Elsner

I’ll be discussing joint work with Sara Court, Maria Copot, Noah Diewald and Stephanie Antetomaso, covering work from our recent ComputEL publication and a slightly updated version for Language Documentation & Archiving.

I hope to discuss both the existing work (for which an abstract is below) and also some of the upcoming challenges as we attempt to develop the learning part of the process into a usable and deployable part of the user experience.

There are many challenges in morphological fieldwork annotation: it relies heavily on segmentation and feature labeling (which have both practical and theoretical drawbacks), it’s time-intensive, and the annotator needs to be linguistically trained and may still annotate things inconsistently. We propose a workflow that relies on unsupervised and active learning grounded in Word-and-Paradigm morphology (WP). Machine learning has the potential to greatly accelerate the annotation process and allow a human annotator to focus on problematic cases, while the WP approach makes for an annotation system that is word-based and relational, removing the need to make decisions about feature labeling and segmentation early in the process and allowing speakers of the language of interest to participate more actively, since linguistic training is not necessary. We present a proof-of-concept for the first step of the workflow: in a realistic fieldwork setting, annotators can process hundreds of forms per hour.

Clippers 9/20: Byung-Doh Oh on the larger-gets-worse behavior of G/OPT surprisal

Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Byung-Doh Oh and William Schuler

This work presents a replication and post-hoc analysis of recent surprising findings that larger GPT-2 language model variants that show lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times (Oh et al., 2022). First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for five GPT-Neo variants and eight OPT variants on two separate datasets, providing strong empirical support for this trend. Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and overpredicting reading times of nouns that are heavily constrained by the discourse. These results suggest that the propensity of larger Transformer-based models to ‘memorize’ sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pretrained language models to study human language processing.
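
For readers less familiar with the two quantities being compared, the sketch below computes per-token surprisal and perplexity from a list of token probabilities, using the standard definitions (surprisal as negative log probability, perplexity as the exponentiated mean surprisal). The probabilities are made up for illustration and do not come from any of the models in the abstract; in practice they would be read off a language model's output distribution.

```python
import math

def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 P(token | context)."""
    return -math.log2(prob)

def perplexity(probs):
    """Perplexity of a sequence: 2 raised to the mean per-token surprisal."""
    mean_surprisal = sum(surprisal_bits(p) for p in probs) / len(probs)
    return 2 ** mean_surprisal

# Made-up conditional probabilities for a four-token sequence.
token_probs = [0.5, 0.25, 0.125, 0.125]

surprisals = [surprisal_bits(p) for p in token_probs]
ppl = perplexity(token_probs)
print(surprisals, ppl)
```

In reading-time studies of this kind, the per-token surprisal values serve as predictors in a regression, while perplexity summarizes how well the model predicts the text overall.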

Clippers 9/13: Sam Stevens on Foundation Model Encryption

We use autoregressive models’ capability to encode token sequences as a novel symmetric-key cipher. Because any given message has a near-infinite number of possible representations, we aim to demonstrate CPA-security for our proposed cipher empirically.

Clippers 8/30: Shuaichen Chang on Robustness Evaluation for Text-to-SQL

Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries on unseen databases. However, recent studies reveal that text-to-SQL models are vulnerable to adversarial perturbations. In this paper, we propose a comprehensive robustness evaluation benchmark based on Spider, a cross-domain text-to-SQL benchmark, to evaluate the robustness of models. We design 17 realistic perturbations of databases, natural language questions, and SQL queries to systematically measure the robustness of text-to-SQL models from various task-specific aspects. We leverage the structural nature of the task for database and SQL perturbations and use a large pretrained language model (PLM) to simulate human users for natural language question perturbations. We conduct a diagnostic study of state-of-the-art models on robustness with our evaluation set. The experimental results reveal that even the best model suffers a performance drop of around 50% on certain perturbations. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.

Clippers 4/19: Ash Lewis on Question Generation in Interactive Semantic Parsing

I will be presenting the work I’ve been doing on my QP2, in which I am attempting to use RSA (Rational Speech Act) approaches to improve a question generation model. This work is an extension of previous work on Transparent Interactive Semantic Parsing, in which we develop a dialogue agent that helps a human user query a knowledge base in natural language. The dialogue agent parses an NL utterance into a SPARQL query, decomposes it into pieces, retrieves answers, then translates the entire process into a series of natural language sub-questions so that the user can validate the results or make corrections as necessary. The current work focuses on the sub-question generation sub-task, in which it is very important for the question to accurately and coherently represent the meaning of its SPARQL query. To this end, I experiment with RSA-style approaches that explicitly model a listener to improve the generator. Specifically, in this work I focus on a “reconstructor”-based method in which a listener model is trained to recover the original meaning representation (SPARQL query) from the output of a base speaker model. I will show my experiments with self-training using the reconstructor-based model and detail my in-progress work with a “distractor”-based approach, in which the model attempts to generate an utterance that distinguishes an input from possible distractor inputs.

Clippers 4/5: Willy Cheung on neural networks and cataphora

In the last few years, deep learning approaches using the pretraining/finetuning paradigm have become state-of-the-art on a number of language tasks. Due to the success of pretrained neural language models, the following question has been raised: to what extent can good general linguistic representations be learned from language modeling alone? One line of research that aims to test this treats pretrained neural language models as linguistic experiment subjects, using the probabilities output by neural models as a proxy for acceptability on linguistic data in minimal pairs. With this approach, I will present tests of GPT-2 on data from one particular cataphora study, and will also discuss ongoing work in this vein.

Clippers 3/29: Vishal Sunder on Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems

Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner, where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT-based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment, especially when speech is noisy, giving an absolute improvement as high as 8% over previous results.
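
To give a concrete feel for what a tokenwise contrastive objective might look like, here is a minimal pure-Python sketch of a generic InfoNCE-style loss over token embeddings. This is an assumption made for illustration, not the loss from the paper: the abstract does not specify the formulation, and the function names, the cosine similarity, the one-directional (speech-to-text) form, and the temperature value are all hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tokenwise_contrastive_loss(speech_embs, text_embs, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed form, not the paper's exact loss).

    For token i, the speech embedding should be more similar to the text
    embedding of token i (the positive) than to the text embeddings of
    the other tokens (the negatives).
    """
    losses = []
    for i, s in enumerate(speech_embs):
        logits = [cosine(s, t) / temperature for t in text_embs]
        # Numerically stable log-sum-exp over all candidate text tokens.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-softmax probability of the matching token.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy two-token example: matching order versus swapped order.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned = tokenwise_contrastive_loss(speech, [[1.0, 0.0], [0.0, 1.0]])
misaligned = tokenwise_contrastive_loss(speech, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

In this toy case the loss is close to zero when each speech token embedding matches its own text token embedding and large when the tokens are swapped, which is the behavior a token-level alignment objective is meant to reward.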

Clippers 3/22: Lingbo Mo on complex question answering

Complex question answering (CQA) requires multi-hop reasoning to combine multiple pieces of evidence, ideally from different knowledge sources. Considering the insufficient labeled data in a single knowledge source and the expense of human annotation, we study knowledge transfer for CQA between heterogeneous sources, including a text corpus and a knowledge base (KB). To facilitate knowledge transfer between sources, we first propose a unified framework, SimultQA, to bridge KBQA and TextQA systems, which can leverage supervision from both sources. By conducting experiments on CWQ and HotpotQA, two popular datasets originally designed for KBQA and TextQA respectively, we explore how knowledge is transferred between sources following the pre-training and fine-tuning paradigm, and find that knowledge transfer between heterogeneous sources consistently improves QA performance. We also conduct fine-grained analysis and hybrid evaluation experiments to further explain what knowledge has been transferred.