Clippers 11/29: Chris Brew on NLP Beyond Academia

What’s it like to be a research scientist/data scientist in industry?

My short answer is in the next paragraph; the rest of this post expands on it.

It varies with the DNA of the organization. For example, the places I have been earn money in different ways and value different things.
  • ETS (non-profit running tests like the GRE and TOEFL)
  • Nuance (speech products, often on contract to undisclosed big company)
  • Thomson Reuters (broad spectrum information provider)
  • Digital Operatives (subcontractor to the security industrial complex)
  • Facebook Applied AI (trying to suppress “harmful content”)
  • Facebook Linguistic Engineering (linguistic data and processes FTW)
  • LivePerson (chatbot services and products for Fortune 500-ish clients)
  • LexisNexis (information with a legal flavor, mostly for lawyers)

If you are a student now you are acquiring skills that will please and amaze people who are in business.

  • Communication. Do as much as you can, to as many audiences as you can, orally and in writing.
  • Evidence. There is great value in collecting evidence and using it to change your mind when you turn out to be wrong.
  • Persistence. Dealing with the fact that the original plan didn’t work as expected, but the problem still needs solving.

Absent from this list is any particular technical tool. If I were giving this talk in 1990, people would be asking whether they could keep using Prolog or Lisp in the commercial world; in 2000, whether XML and XSLT were going to be important; and now, whether the company uses Keras, PyTorch or MxNet. These are/were all perfectly valid questions, but the answers change as quickly as anything else on the Internet, so don’t count on that kind of expertise to get you where you want to go.

Clippers 11/22: Pranav Maneriker on Scaling Laws and Structure for Stylometry on Reddit

The problem of authorship identification (AID) consists of predicting whether two documents were composed by the same author. I will describe the creation of the Colossal Reddit User Dataset (CRUD), a corpus consisting of comment histories of five million anonymous Reddit users. The corpus comprises 2.2 billion Reddit comments from January 2015 to December 2019. To our knowledge, CRUD is the most extensive corpus of its kind and, as such, may prove a valuable resource for researchers interested in various aspects of user modeling, such as modeling author style. We will also discuss preliminary experimental results from scaling AID models on large datasets, inspired by related work on scaling laws for neural language models. Finally, we will discuss ongoing research on the role of interaction graph structures in AID.
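For readers unfamiliar with the task, the pairwise framing of AID can be sketched as a similarity check between style embeddings. This is a minimal illustration, not the talk's model: the embeddings are assumed to come from some trained style encoder, and the threshold is a hypothetical tuning parameter.

```python
import numpy as np

def same_author_score(emb_a, emb_b):
    """Cosine similarity between two documents' style embeddings.

    In practice the embeddings would be produced by a trained style
    encoder; here they are just arbitrary vectors for illustration.
    """
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_same_author(emb_a, emb_b, threshold=0.8):
    """AID as pairwise verification: same author iff similarity >= threshold."""
    return same_author_score(emb_a, emb_b) >= threshold
```

The verification framing matters at Reddit scale: with millions of anonymous users, per-author classification is infeasible, whereas pairwise scoring needs no closed author set.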

Clippers 11/15: Christian Clark on Categorial Grammar Induction

Grammar induction is the task of learning a set of syntactic rules from an unlabeled text corpus. Much recent work in this area has focused on learning probabilistic context-free grammar (PCFG) rules; however, these rules are not sufficiently expressive to capture the full variety of structures found in human languages. Bisk and Hockenmaier (2012) present a system for inducing a Combinatory Categorial Grammar, a more expressive formalism, but this system learns from sentences with part-of-speech tags rather than unlabeled data. I will present my initial work toward implementing a categorial grammar induction system that can learn from unlabeled data using a neural network–based architecture.
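To make the PCFG formalism mentioned above concrete, here is a toy example of scoring a parse tree under a hand-written PCFG. The grammar and sentence are invented for illustration; induction systems learn such rule probabilities from raw text rather than writing them down.

```python
# Toy PCFG: maps (LHS, RHS) to a probability. For each left-hand side,
# the probabilities of its rules sum to 1. These rules are hypothetical.
PCFG = {
    ("S",  ("NP", "VP")):   1.0,
    ("NP", ("she",)):       0.6,
    ("NP", ("fish",)):      0.4,
    ("VP", ("eats", "NP")): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree = product of its rule probabilities.

    A tree is a tuple (label, child_1, ..., child_n); leaves are strings.
    """
    label, *children = tree
    # The right-hand side is the sequence of child labels (or words).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = PCFG[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

# P("she eats fish") under this parse: 1.0 * 0.6 * 1.0 * 0.4 = 0.24
tree = ("S", ("NP", "she"), ("VP", "eats", ("NP", "fish")))
```

The expressiveness limitation in the abstract is about exactly this rule format: context-free rules cannot directly encode the argument-structure distinctions that categorial grammars capture in their category labels.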

Clippers 11/8: Vishal Sunder on Textual Knowledge Transfer for Speech Understanding

Title: Fine-grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding

Abstract: RNN Transducer (RNN-T) technology is very popular for building deployable models for end-to-end (E2E) automatic speech recognition (ASR) and spoken language understanding (SLU). Since these are E2E models operating on speech directly, there remains a potential to improve their performance using purely text-based models like BERT, which have strong language understanding capabilities. In this work, we propose a new training criterion for RNN-T based E2E ASR and SLU to transfer BERT’s knowledge into these systems. In the first stage of our proposed mechanism, we improve ASR performance by using a fine-grained, tokenwise knowledge transfer from BERT. In the second stage, we fine-tune the ASR model for SLU such that the above knowledge is explicitly utilized by the RNN-T model for improved performance. Our techniques improve ASR performance on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and on the recently released SLURP dataset, on which we achieve a new state-of-the-art performance. For SLU, we show significant improvements on the SLURP slot filling task, outperforming HuBERT-base and reaching a performance close to HuBERT-large. Compared to large transformer-based speech models like HuBERT, our model is significantly more compact and uses only 300 hours of speech pretraining data.
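The general shape of tokenwise knowledge transfer can be sketched as a per-token distillation loss between a text teacher's and a speech student's distributions. This is a generic distillation sketch under my own assumptions (temperature-scaled KL per token, averaged over the sequence), not the specific criterion proposed in the talk.

```python
import numpy as np

def tokenwise_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean per-token KL(teacher || student) over a token sequence.

    Both inputs have shape (num_tokens, vocab_size). The teacher would be
    a text model like BERT and the student an E2E speech model; here they
    are just arrays of logits. Temperature softens both distributions.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    t = softmax(np.asarray(teacher_logits, dtype=float) / temperature)
    s = softmax(np.asarray(student_logits, dtype=float) / temperature)
    # KL divergence per token, then averaged across the sequence.
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))
```

The "fine-grained" aspect in the abstract is that supervision is applied token by token rather than at the utterance level, which is what lets a text model's predictions guide the speech model's internal representations.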