Given raw (in our case, textual) sentences as input, the Paradigm Discovery Problem (PDP) (Elsner et al., 2019; Erdmann et al., 2020) involves a bi-directional clustering of words into paradigms and cells. For instance, solving the PDP requires one to determine that ring and rang belong to the same paradigm, while bring and bang do not, and that rang and banged belong to the same cell, as they realize the same morphosyntactic property set, i.e., past tense. Solving the PDP is a prerequisite for bootstrapping a solution to what’s often referred to as the Paradigm Cell Filling Problem (PCFP) (Ackerman et al., 2009), i.e., predicting the forms that fill as-yet-unrealized cells in partially attested paradigms. That is to say, if I want the plural of thesis but have only seen the singular, I can only predict theses if I’ve solved the PDP in a way that lets me generalize about how number is realized.
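To make the two clusterings concrete, here is a minimal Python sketch of what a solved PDP might look like as a data structure, using the words from the examples above. The labels (RING, "past", and so on), the ANALYSIS mapping, and the helper names are all hypothetical and chosen for illustration; an actual PDP system sees only raw sentences and has to induce both groupings itself.

```python
from collections import defaultdict

# Hypothetical analysis: word form -> (paradigm label, cell label).
# These labels are assumptions for illustration, not system output.
ANALYSIS = {
    "ring": ("RING", "present"),
    "rang": ("RING", "past"),
    "bang": ("BANG", "present"),
    "banged": ("BANG", "past"),
    "bring": ("BRING", "present"),
    "thesis": ("THESIS", "singular"),
}

# Assumed sets of contrasting cells (one set per part of speech here).
CELL_SETS = [{"present", "past"}, {"singular", "plural"}]


def build_table(analysis):
    """Cluster forms in both directions: by paradigm and by cell."""
    paradigms = defaultdict(dict)  # paradigm -> {cell: form}
    cells = defaultdict(set)       # cell -> {forms}
    for form, (paradigm, cell) in analysis.items():
        paradigms[paradigm][cell] = form
        cells[cell].add(form)
    return dict(paradigms), dict(cells)


def same_paradigm(analysis, a, b):
    """True if two forms were assigned to the same paradigm."""
    return analysis[a][0] == analysis[b][0]


def same_cell(analysis, a, b):
    """True if two forms were assigned to the same cell."""
    return analysis[a][1] == analysis[b][1]


def unfilled_cells(paradigms, cell_sets):
    """List the (paradigm, cell) slots with no attested form.

    These empty slots are the prediction targets of the PCFP.
    """
    missing = []
    for paradigm, filled in paradigms.items():
        for cell_set in cell_sets:
            if set(filled) & cell_set:
                for cell in sorted(cell_set - set(filled)):
                    missing.append((paradigm, cell))
    return sorted(missing)


paradigms, cells = build_table(ANALYSIS)
targets = unfilled_cells(paradigms, CELL_SETS)
```

Under these assumed labels, targets contains the plural slot of THESIS and the past slot of BRING. Filling those slots (theses, brought) is the PCFP, and it only becomes well defined once the table exists.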
Two forthcoming works address constrained versions of the PDP by focusing on a single part of speech at a time (Erdmann et al., 2020; Kann et al., 2020). For my dissertation, I am trying to adapt the system of Erdmann et al. (2020) to handle the unconstrained PDP by addressing scalability and overfitting issues that lock the system into poor predictions about the size of paradigms and prematurely eliminate potentially rewarding regions of the search space. This will be a very informal talk; I’m just looking to get feedback on some issues I keep running into.
High frequency marker categories in grammar induction
High frequency marker words have been shown to be crucial in first language acquisition, where they provide reliable clues for speech segmentation and the grammatical categorization of words. Recent work on model selection for grammar induction has also hinted that high frequency marker words play a similar role when grammars are induced distributionally. In this work, we first expand the notion of high frequency marker words to high frequency marker categories, so as to include languages where grammatical relations between words are expressed by morphology rather than word order. Through analysis of data from previous work and experiments with novel induction models, this work shows that high frequency marker categories are the main driver of accurate grammar induction.
Title: An unsupervised discrete-state sequence model of human language acquisition from speech
Abstract: I will present a progress report on an ongoing attempt to apply discrete-state multi-scale recurrent neural networks as models of child language acquisition from speech. The model is inspired by prior arguments that abstract linguistic representations (e.g. phonemes and words) constrain the acoustic form of natural language utterances, and thus that attempting to efficiently store and anticipate auditory signals may emergently guide child learners to discover underlying linguistic structure. In this study, the artificial learner is a recurrent neural network arranged in interacting layers. Information exchange between adjacent layers is governed by binary detector neurons. When the detector neuron fires between two layers, those layers exchange their current analyses of the input signal in the form of discrete binary codes. Thus, in line with much existing linguistic theory, the model exploits both bottom-up and top-down signals to produce a representation of the input signal that is segmental, discrete, and featural. The learner adapts this behavior in service of four simultaneous unsupervised objectives: reconstructing the past, predicting the future, reconstructing the segment given a label, and reconstructing the label given a segment. Each layer treats the layer below as data, and thus learning is partially driven by attempting to model the learner’s own mental state, in line with influential hypotheses from cognitive neuroscience. The model solves a novel task (unsupervised joint segmentation and labeling of phonemes and words from speech), and it is therefore difficult to establish an overall state-of-the-art performance threshold. However, results for the subtask of unsupervised word segmentation currently lag well behind the state of the art.
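As a rough illustration of the detector-gated exchange described in the abstract, the following Python sketch shows only the bottom-up direction: a lower layer accumulates frames until a binary detector fires, then hands a discrete binary code for the finished segment to the layer above. Everything here is an assumption made for illustration, including the function names, the 0.5 threshold, the mean-pooling of frames, and the input values. The actual model is a trained recurrent network with top-down signals as well, none of which is reproduced here.

```python
def binarize(values, threshold=0.5):
    """Round real-valued activations to a discrete binary code."""
    return tuple(1 if v > threshold else 0 for v in values)


def segment(frames, boundary_scores, threshold=0.5):
    """Group frames into segments, closing one each time the detector fires.

    frames: equal-length activation vectors from the lower layer.
    boundary_scores: one detector activation per frame (hypothetical values).
    Returns (binary_code, n_frames) pairs, one per segment passed upward.
    Frames after the last firing are left unsegmented and are not returned.
    """
    segments = []
    buffer = []
    for frame, score in zip(frames, boundary_scores):
        buffer.append(frame)
        if score > threshold:  # the binary detector neuron fires
            # Mean-pool the buffered frames (an assumed summary step).
            mean = [sum(column) / len(buffer) for column in zip(*buffer)]
            segments.append((binarize(mean), len(buffer)))
            buffer = []  # the lower layer starts a new segment
    return segments


# Toy input: five two-dimensional frames with two detector firings.
frames = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9], [0.3, 0.8]]
scores = [0.1, 0.9, 0.2, 0.1, 0.8]
codes = segment(frames, scores)
```

On this toy input the detector fires after the second and fifth frames, so the upper layer receives two codes: (1, 0) spanning two frames and (0, 1) spanning three. The point of the sketch is only that the output is segmental (variable-length spans), discrete (binary codes), and featural (one bit per dimension).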