We were pleased to host Dan Garrette from Google, who gave a talk in the NLP/AI series the previous Friday.
Title: Learning from Weak Supervision: Combinatory Categorial Grammars and Historical Document Transcription
Abstract:
As we move NLP toward domains and languages where supervised training resources are not available, there is an increased need to learn models from less annotation. In this talk, I will describe two projects on learning from weak supervision. First, I will discuss work on learning combinatory categorial grammars (CCGs) from incomplete information. In particular, I will show how universal, intrinsic properties of the CCG formalism can be encoded as priors and used to guide the learning of supertaggers and parsers. These universal priors can, in turn, be combined with corpus-specific knowledge derived from limited amounts of available annotation to further improve performance. Second, I will present work on learning to automatically transcribe historical documents that feature heavy use of code-switching and non-standard orthographies that include obsolete spellings, inconsistent diacritic use, typos, and archaic shorthands. Our state-of-the-art model is able to induce language-specific probabilistic mappings from language model data with standard orthography to the document-specific orthography on the page by jointly modeling both variant-preserving and normalized transcriptions. I will conclude with a discussion of how our work has opened up new avenues of research for scholars in the digital humanities, with a focus on transcribing books printed in Mexico in the 1500s.
Bio:
Dan is a research scientist at Google in NYC. He was previously a postdoctoral researcher at the University of Washington working with Luke Zettlemoyer, and obtained his PhD at the University of Texas at Austin under the direction of Jason Baldridge and Ray Mooney.
Host: Alan Ritter