What can computational methods do for sociolinguistics?
This talk provides a brief overview of computational sociolinguistics, an emerging field with the twin goals of improving NLP systems using sociolinguistics and of answering sociolinguistic questions using NLP and other computational methods. I briefly discuss what sociolinguistics can do for NLP, then turn to what NLP/computational methods can do for sociolinguistics, using two examples from my research: (1) using SVMs for word sense disambiguation on Twitter data to compare regional variation in African American versus white US English, and (2) using hierarchical cluster analysis to study individual differences in patterns of social meaning. Finally, I discuss future directions for computational sociolinguistics.
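To give a concrete sense of the second method mentioned above, here is a minimal sketch of hierarchical cluster analysis applied to individual differences: listeners (rows) are clustered by their rating profiles (columns). All names, ratings, and dimensions are invented for illustration; the talk's actual data and features differ.

```python
# Hedged sketch: hierarchical clustering of listeners by their
# social-meaning rating profiles. Data are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows: listeners; columns: hypothetical ratings of a linguistic variant
# (e.g., how "friendly" / "professional" / "casual" it sounds).
ratings = np.array([
    [4.2, 1.1, 3.8],
    [4.0, 1.3, 3.9],
    [1.2, 4.5, 1.0],
    [1.0, 4.7, 1.3],
])

# Ward linkage merges listeners with similar rating profiles first.
Z = linkage(ratings, method="ward")

# Cut the dendrogram into two groups of listeners.
clusters = fcluster(Z, t=2, criterion="maxclust")
print(clusters)  # the first two listeners pattern together, as do the last two
```

Cutting the dendrogram at different levels lets the analyst ask how many distinct patterns of social meaning the listener population exhibits.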
A typology of ambiguity in medical concept normalization datasets
Medical concept normalization (MCN; also called biomedical word sense disambiguation) is the task of assigning unique concept identifiers (CUIs) to mentions of biomedical concepts. Several MCN datasets focusing on Electronic Health Record (EHR) data have been developed over the past decade, and although challenges due to conceptual ambiguity have been identified in methodological research, the types of lexical ambiguity exhibited by clinical MCN datasets have not been systematically studied. I will present preliminary results of an ongoing analysis of benchmark clinical MCN datasets, describing an initial, domain-specific typology of lexical ambiguity in MCN annotations. I will also discuss desiderata for future MCN research aimed at addressing these challenges in both methods and evaluation.
Lexica distinguishing all morphologically related forms of each lexeme are crucial to many downstream technologies, yet building them is expensive. We propose a frugal paradigm completion approach that predicts all related forms in a morphological paradigm from as few manually provided forms as possible. It induces typological information during training which it uses to determine the best sources at test time. We evaluate our language-agnostic approach on 7 diverse languages. Compared to popular alternative approaches, ours reduces manual labor by 16-63% and is the most robust to typological variation.
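As a toy illustration of the paradigm-completion setting, the sketch below learns suffix-replacement rules between paradigm cells from training paradigms, then predicts a full paradigm from a single provided source form. The Spanish-like verbs and cell names are invented, and the actual approach is far more sophisticated (it additionally learns which source forms are worth eliciting).

```python
# Hedged toy sketch of paradigm completion via suffix-replacement rules.
# Training data and cell names are invented for illustration.
def suffix_rule(src, tgt):
    """Longest-common-prefix suffix rule mapping src -> tgt."""
    i = 0
    while i < min(len(src), len(tgt)) and src[i] == tgt[i]:
        i += 1
    return src[i:], tgt[i:]

train = [
    {"inf": "hablar", "1sg": "hablo", "3sg": "habla"},
    {"inf": "cantar", "1sg": "canto", "3sg": "canta"},
]

# One rule per (source cell, target cell) pair, taken from the first
# paradigm that exhibits it (a real system would score competing rules).
rules = {}
for paradigm in train:
    for src_cell, src in paradigm.items():
        for tgt_cell, tgt in paradigm.items():
            rules.setdefault((src_cell, tgt_cell), suffix_rule(src, tgt))

def complete(src_cell, form, cells):
    """Predict every requested cell from a single provided form."""
    out = {}
    for cell in cells:
        cut, add = rules[(src_cell, cell)]
        stem = form[: len(form) - len(cut)] if cut else form
        out[cell] = stem + add
    return out

print(complete("inf", "bailar", ["inf", "1sg", "3sg"]))
# {'inf': 'bailar', '1sg': 'bailo', '3sg': 'baila'}
```

The frugality question is then which single cell (here, the infinitive) predicts the rest of the paradigm most reliably, which is exactly where the induced typological information comes in.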
Discovery of Semantic Factors in Virtual Patient Dialogues
The NLP community has become fixated on very deep Transformer models for semantic classification tasks, but some research suggests these models are not well suited to tasks with a large label space or data scarcity issues, and their speed at inference time is still unacceptable for real-time uses such as dialogue systems. We adapt a simple one-layer recurrent model utilizing a multi-headed self-attention mechanism for a dialogue task with hundreds of labels in a long-tail distribution over a few thousand examples. By independently forcing the representations of each attention head through low-dimensional bottlenecks, we demonstrate significant improvements over a strong text CNN baseline on rare labels. The bottlenecks require the model to learn efficient representations, thus discovering factors of the (syntacto-)semantics of the input space that generalize from frequent labels to rare labels. The resulting models lend themselves well to interpretation, and analysis shows clear clustering of representations that span labels in ways that align with human understanding of the semantics of the inputs.
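The core idea above, independent low-dimensional bottlenecks per attention head, can be sketched as follows. This is a simplified NumPy forward pass with random weights and illustrative dimensions; the recurrent encoder, training procedure, and actual configuration from the work are omitted.

```python
# Hedged sketch: multi-headed self-attention where each head's output is
# forced through its own low-dimensional bottleneck before the heads are
# combined. Dimensions and weights are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads, d_bottleneck = 6, 32, 4, 2
x = rng.normal(size=(seq_len, d_model))  # stand-in for encoder states

head_outputs = []
for _ in range(n_heads):
    # Standard per-head scaled dot-product attention...
    d_head = d_model // n_heads
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_head))
    head = attn @ v
    # ...followed by an independent low-dimensional projection per head,
    # which pressures each head toward an efficient, factor-like code.
    W_down = rng.normal(size=(d_head, d_bottleneck))
    head_outputs.append(head @ W_down)

# The concatenated bottlenecked heads feed the label classifier.
features = np.concatenate(head_outputs, axis=-1)
print(features.shape)  # (6, 8) = (seq_len, n_heads * d_bottleneck)
```

Because each head must squeeze its representation through only `d_bottleneck` dimensions, the heads tend to specialize, which is what makes the learned factors inspectable in the analysis described above.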