goodhart.19

Abstract:

The problem of authorship identification (AID) consists of predicting whether two documents were composed by the same author. I will describe the creation of the Colossal Reddit User Dataset (CRUD), a corpus consisting of the comment histories of five million anonymous Reddit users. The corpus comprises 2.2 billion Reddit comments posted between January 2015 and December 2019. To our knowledge, CRUD is the most extensive corpus of its kind and, as such, may prove a valuable resource for researchers interested in various aspects of user modeling, such as modeling author style. We will also discuss preliminary experimental results from scaling AID models on large datasets, an effort inspired by related work on scaling laws for neural language models. Finally, we will discuss ongoing research on the role of interaction graph structures in AID.

Grammar induction is the task of learning a set of syntactic rules from an unlabeled text corpus. Much recent work in this area has focused on learning probabilistic context-free grammar (PCFG) rules; however, these rules are not sufficiently expressive to capture the full variety of structures found in human languages. Bisk and Hockenmaier (2012) present a system for inducing a Combinatory Categorial Grammar, a more expressive formalism, but this system learns from sentences with part-of-speech tags rather than unlabeled data. I will present my initial work toward implementing a categorial grammar induction system that can learn from unlabeled data using a neural network–based architecture.

Author: goodhart.19

Clippers 11/22: Pranav Maneriker on Scaling Laws and Structure for Stylometry on Reddit

Clippers 11/15: Christian Clark on Categorial Grammar Induction