Multi-Source Morphological Reinflection with Reinforcement Learning
This project develops an approach that uses reinforcement learning to guide multi-source morphological reinflection (MRI), the task of transforming words from one inflectional form to another. When encountering a new inflected form of a word, humans may rely on their knowledge of the language's morphological rules, as well as their experience with similar forms in the past, to infer the correct inflection. Kann et al. (2017) develop a multi-source MRI model that receives a target tag and multiple source form-tag pairs for a lemma. Their model outperforms single-source reinflection models because different source forms can provide complementary information. Although Kann et al. do not specify how the source form-tag pairs are chosen, selecting appropriate pairs as reference words is key to modeling morphological reinflection. Our project uses reinforcement learning to select reference words during the reinflection process: an RL agent learns to select an appropriate source form-tag pair based on the context of the lemma and its morphological features, as well as its experience with similar examples in the past, much as humans select an appropriate inflected form based on context and past experience with the language. Since this project is still ongoing, I would greatly appreciate any suggestions or feedback.
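As a toy illustration of the idea (not the project's actual model), the source-selection step can be framed as a bandit problem: an agent learns which candidate source form-tag pair tends to yield the best reinflection reward (e.g., negative edit distance between the predicted and gold target forms). The candidates and reward function below are hypothetical stand-ins:

```python
import random

def epsilon_greedy_selector(candidates, reward_fn, episodes=300, epsilon=0.1, seed=0):
    """Toy epsilon-greedy agent over candidate source (form, tag) pairs.

    reward_fn is a stand-in for reinflection quality (e.g., negative
    edit distance to the gold target form); it is purely illustrative.
    Returns the learned value estimate for each candidate.
    """
    rng = random.Random(seed)
    counts = [0] * len(candidates)
    values = [0.0] * len(candidates)
    for _ in range(episodes):
        if rng.random() < epsilon:
            i = rng.randrange(len(candidates))  # explore
        else:
            i = max(range(len(candidates)), key=lambda j: values[j])  # exploit
        r = reward_fn(candidates[i])
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]  # incremental mean update
    return values
```

In the actual project the choice would presumably be conditioned on the lemma and target tag (a contextual policy rather than a context-free bandit); this sketch only shows the learn-from-reward loop.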
This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. Prior research in SQL comment generation has focused on comments that summarize entire SQL queries and on translations of SQL into templated English (Eleftherakis et al., 2021; Narechania et al., 2021). These approaches can help users comprehend SQL but are limited in their ability to guide users through the query steps and connect formal notation with intuitive concepts. To address this limitation, the project aims to generate line-by-line comments that leverage language from user questions, connecting formal SQL notation with user-friendly concepts (e.g., “tallest” or “alphabetical order”).
Due to a lack of pre-existing training data, 100 SQL queries from the SPIDER dataset (Yu et al., 2018) have been manually annotated. These 100 examples will then be used as a base for generating a more robust training set through self-training and prompting. I have been experimenting with using ChatGPT to generate comments for more queries as well as fine-tuning BART for the task. This approach will allow us to scale the training set quickly and minimize time spent writing comments by hand. This presentation will discuss the annotation process and preliminary results for comment generation using the above methods.
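For illustration, a prompt for eliciting line-by-line comments from a chat LLM might look like the following; the function name and wording are assumptions, not the prompts actually used in the project:

```python
def build_comment_prompt(question, sql):
    """Sketch of a prompt asking an LLM for line-by-line SQL comments
    that reuse wording from the user's own question. Illustrative only."""
    return (
        "You are annotating SQL for non-expert users.\n"
        f"User question: {question}\n"
        "Add a short comment above each line of the query below, reusing "
        "the user's own wording (e.g., \"tallest\") where possible.\n\n"
        f"{sql}"
    )

prompt = build_comment_prompt(
    "Who is the tallest student?",
    "SELECT name FROM students ORDER BY height DESC LIMIT 1",
)
```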
Tackling Training with Imbalanced Datasets: An Investigation of MixUp and Paraphrase Augmentation for Downstream Classification
Low-resource dialogue systems often contain a high proportion of few-shot class labels, leading to challenges in utterance classification performance. A possible solution is data augmentation through paraphrase generation, but this method has the potential to introduce harmful data points in the form of low-quality paraphrases. We explore this challenge as a case study using a virtual patient dialogue system, which contains a long-tail distribution of few-shot labels. In previous work, we investigated the efficacy of paraphrase augmentation using both in-domain and out-of-domain data, as well as the effects of paraphrase validation techniques using Natural Language Inference (NLI) and reconstruction methods. These data augmentation techniques were validated through training and evaluation of a downstream self-attentive RNN model with and without MixUp (embedding interpolation during training). The results were mixed and indicated a trade-off between reduction of misleading paraphrases and paraphrase diversity.
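As a reference point, MixUp in this setting interpolates pairs of utterance embeddings and their label vectors during training. A minimal sketch, assuming one-hot label vectors and a Beta-distributed mixing coefficient as in the original MixUp formulation:

```python
import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2, rng=None):
    """Interpolate two utterance embeddings and their one-hot label vectors.

    The mixing weight lam is drawn from Beta(alpha, alpha); both the
    embedding and the label are the same convex combination of the pair.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label
```

With small alpha, lam tends toward 0 or 1, so most mixed examples stay close to one of the two originals; this is one knob the filtration ideas below could interact with.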
In this talk, I will go over potential training paradigms and paraphrase filtration mechanisms that expand on this previous work. Ideas include example sampling techniques, variable loss during MixUp, and paraphrase filtration using training loss. The hope is that one, or some combination, of these methods will improve model generalizability and class-imbalanced training. No single direction is obviously best, so feedback on these ideas will be much appreciated!
Title: End-to-end word-level disfluency detection and classification in children’s reading assessment.
Abstract: Disfluency detection and classification on children’s speech has great potential for teaching reading skills. Word-level assessment of children’s speech can help teachers effectively gauge their students’ progress. Hence, we propose a novel attention-based model that performs word-level disfluency detection and classification in a fully end-to-end (E2E) manner, making it fast and easy to use. We develop a word-level disfluency annotation scheme, which we use to annotate a dataset of children’s read speech, the Reading Races dataset (READR). We also annotate disfluencies in the existing CMU Kids corpus. The proposed model significantly outperforms traditional cascaded baselines, which use forced alignments, on both datasets. To deal with the inevitable class imbalance in the datasets, we propose a novel technique called HiDeC (Hierarchical Detection and Classification), which yields detection improvements of 23% and 16% and classification improvements of 3.8% and 19.3% in relative F1-score on the READR and CMU Kids datasets, respectively.
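The abstract does not spell out HiDeC's internals, but the name suggests a two-stage structure in which a binary detector handles the dominant fluent class and a multi-class classifier only sees the flagged words. A guess at that structure, with stub `detect` and `classify` functions:

```python
def hidec_predict(tokens, detect, classify):
    """Hypothetical two-stage hierarchical prediction: a binary detector
    flags disfluent words, and only flagged words are passed to the
    multi-class disfluency classifier, so the classifier is never
    trained or run on the overwhelming fluent majority."""
    labels = []
    for tok in tokens:
        if detect(tok):
            labels.append(classify(tok))  # fine-grained disfluency type
        else:
            labels.append("fluent")
    return labels
```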
In Clippers on Tuesday, I’m going to present on the beginning stages of a new project. I’m attempting to design a response generation model for the COSI museum avatar — a virtual question-answering guide at the Language Pod that can answer questions about the pod, linguistics, and other exhibits at COSI. Currently, the avatar, which is modeled after the Virtual Patient project, returns “canned” responses to questions, meaning that it has prescribed, static answers for a set of in-domain questions to which it tries to match user inputs. This can result in a fairly unnatural conversation; if the avatar interprets two utterances as the same question, it will repeat the exact same answer. The goal of my current project is to migrate to a response generation model that will be more contextually aware and answer questions dynamically, while also adapting to constant changes in content as exhibits in the museum change. To do so, I’m attempting to leverage the capabilities of OpenAI’s ChatGPT to generate training data for a smaller model that will hopefully avoid the pitfalls of LLMs, such as toxic behavior. The plan is to eventually train a document-grounded generation model that responds directly to user inputs rather than needing to first map them to prescribed questions. This project is in the early exploratory phases, so I’m hoping to get lots of feedback on design choices and suggestions for other avenues to explore.
Abstract: Large language models (LLMs) have shown a strong generalization capability in the cross-domain text-to-SQL task without using in-domain examples. However, with a few in-domain annotations as demonstration examples, LLMs’ performance can be further improved. In this work, we first investigate the crucial elements of in-domain examples. Based on our findings, we propose to create demonstration examples with minimal in-domain annotation to improve the generalization ability of LLMs.
While there is much recent interest in studying why Transformer-based large language models make predictions the way they do, the complex computations performed within each layer have traditionally posed a strong bottleneck. To mitigate this shortcoming, this work presents a linear decomposition of final hidden states from autoregressive language models based on each initial input token, which is exact if the activation function is piecewise linear. This decomposition allows the definition of probability distributions that ablate the contribution of input tokens, which can be used to analyze their influence on model probabilities over a sequence of upcoming words with only one forward pass from the model. Using the change in next-word probabilities as a measure of importance, this work examines which context words make the biggest contribution to language model predictions. Regression experiments suggest that Transformer-based language models rely primarily on collocational associations, followed by linguistic factors such as syntactic dependencies and coreference relationships, in making next-word predictions. Additionally, analyses using these measures to predict syntactic dependencies and coreferent mention spans show that collocational association and repetitions of the same token largely explain the language model’s predictions on the respective tasks.
Title: N-Pathic Speaker Diarization
Abstract: Speaker diarization is mainly studied through clustering speaker embeddings. However, the clustering approach has two major limitations: it doesn’t directly minimize diarization errors and can’t handle speaker overlaps. To address these problems, End-to-End Neural Diarization (EEND) was introduced. The Encoder-Decoder-Attractor (EDA) was also proposed for recordings with an unknown speaker count. In this paper, we present two improvements: (1) N-Pathic, a base model that uses chunked data to reduce the attention mechanism’s sequence length and memory usage, and (2) an improved EDA architecture with increased data efficiency through non-sequence-dependent modules. Our proposed method was evaluated on simulated mixtures, real telephone calls, and real dialogue recordings.
Discourse Relations: Their Role and Use in Natural Language Generation
Speakers make extensive use of discourse connectives (e.g., but, and, so, although) while communicating messages with rich information: discourse connectives express abstract relations, called discourse relations, between the pieces of information they connect. This facilitates understanding the message the speaker wants to communicate. Traditional computational linguistic (CL) approaches to natural language processing rely heavily on modeling discourse relations, in both natural language generation (NLG) and parsing tasks. The recent emergence of neural network-based approaches to natural language modeling led to remarkable advances in many CL tasks, including NLG. Nevertheless, when it comes to discourse-level phenomena, particularly the coherent use of discourse connectives, improvements are less obvious. First, I will present results of my doctoral research concerning the design of symbolic, grammatical approaches to discourse, which are in line with traditional CL approaches to discourse but overcome some important obstacles that previous approaches face. Then, I will review studies we have been systematically carrying out to establish whether neural network-based approaches can be extended or revised to overcome the issues they face. Based on our results, I will argue that reinstating the central, ubiquitous status of discourse relations, by explicitly encoding them in natural language meaning representations, significantly enhances correct and coherent generation of discourse connectives with neural network-based approaches. Finally, I will discuss ample possibilities for exploring synergies between traditional, grammatical approaches and state-of-the-art neural network-based ones to overcome critical issues, such as data limitation problems for low-resource languages and the interpretability of neural network-based models of language.
What’s it like to be a research scientist/data scientist in industry?
I’ll expand on my short answer, which is in the next paragraph.
It varies with the DNA of the organization. For example, the places I have been earn money in different ways and value different things.
- ETS (non-profit running tests like the GRE and TOEFL)
- Nuance (speech products, often on contract to undisclosed big company)
- Thomson Reuters (broad spectrum information provider)
- Digital Operatives (subcontractor to the security industrial complex)
- Facebook Applied AI (trying to suppress “harmful content”)
- Facebook Linguistic Engineering (linguistic data and processes FTW)
- LivePerson (chatbot services and products for Fortune 500-ish clients)
- LexisNexis (information with a legal flavor, mostly for lawyers)
If you are a student now, you are acquiring skills that will please and amaze people who are in business.
- Communication. Do as much as you can, to as many audiences as you can, orally and in writing.
- Evidence. There is great value in collecting evidence and using it to change your mind when you turn out to be wrong.
- Persistence. Dealing with the fact that the original plan didn’t work as expected, but the problem still needs solving.
Absent from the list of skills is any particular technical tool. If I were giving this talk in 1990, people would be asking whether they could keep using Prolog or Lisp in the commercial world, or in 2000 whether XML and XSLT were going to be important, or now, whether the company uses Keras, PyTorch or MxNet. These are/were all perfectly valid questions, but the answers change as quickly as anything else on the Internet, so don’t count on that kind of expertise to get you where you want to go.