Clippers 11/22: Pranav Maneriker on Scaling Laws and Structure for Stylometry on Reddit


The problem of authorship identification (AID) consists of predicting whether two documents were composed by the same author. I will describe the creation of the Colossal Reddit User Dataset (CRUD), a corpus consisting of the comment histories of five million anonymous Reddit users. The corpus comprises 2.2 billion Reddit comments from January 2015 to December 2019. To our knowledge, CRUD is the most extensive corpus of its kind and, as such, may prove a valuable resource for researchers interested in various aspects of user modeling, such as modeling author style. I will also discuss preliminary experimental results from scaling AID models on large datasets, work inspired by related research on scaling laws for neural language models. Finally, I will discuss ongoing research on the role of interaction graph structures in AID.