Clippers Tuesday: Manirupa Das on Query Expansion for IR

At Clippers Tuesday, Manirupa will present “A Phrasal Embedding–based General Language Model for Query Expansion in Information Retrieval”:

Traditional knowledge graphs driven by knowledge bases can represent facts about and capture relationships among entities very well, thus performing quite accurately in factual information retrieval. However, in addressing the complex information needs of subjective queries requiring adaptive decision support, these systems can fall short as they are not able to fully capture novel associations among potentially key concepts. In this work, we explore a novel use of language model–based document ranking to develop a fully unsupervised method for query expansion by associating documents with novel related concepts extracted from the text. To achieve this, we extend the word embedding-based generalized language model due to Ganguly et al. (2015) to employ phrasal embeddings, and evaluate its performance on an IR task using the TREC 2016 clinical decision support challenge dataset. Our model, used for query expansion both directly and via a feedback loop, shows statistically significant improvement not just over various baselines utilizing standard MeSH terms and UMLS concepts for query expansion (Rivas et al., 2014), but also over our word embedding-based language model baseline, built on top of a standard Okapi BM25-based document retrieval system.
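For readers new to the idea, here is a minimal sketch of embedding-based query expansion feeding a BM25 ranker. It is not the model from the talk: it replaces the generalized language model with a plain nearest-neighbour lookup, and the phrase vectors, documents, and function names (`expand_query`, `bm25_scores`) are all hypothetical toy values chosen for illustration. Phrases are assumed to be pre-joined into single tokens such as `chest_pain`.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=1):
    """Append the k nearest embedded phrases for each original query term."""
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        neighbours = sorted(
            (p for p in embeddings if p != term and p not in expanded),
            key=lambda p: cosine(embeddings[term], embeddings[p]),
            reverse=True,
        )
        expanded.extend(neighbours[:k])
    return expanded


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            if tf[q] == 0:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores


# Toy phrase embeddings (made-up 2-d vectors, not trained on anything).
embeddings = {
    "chest_pain": [1.0, 0.1],
    "myocardial_infarction": [0.9, 0.2],
    "fracture": [0.0, 1.0],
}
docs = [
    ["myocardial_infarction", "treatment", "aspirin"],
    ["fracture", "treatment", "cast"],
    ["chest_pain", "evaluation"],
]

query = ["chest_pain"]
expanded = expand_query(query, embeddings, k=1)
base = bm25_scores(query, docs)
boosted = bm25_scores(expanded, docs)
print(expanded)
print(base)
print(boosted)
```

With the unexpanded query, only the document that literally contains `chest_pain` receives a nonzero score. After expansion, the document about `myocardial_infarction` is also retrieved, which is the kind of related-concept association that query expansion is meant to surface.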

NLP/AI, previously: Dan Garrette (Google) on CCG Parsing and Historical Document Transcription

The previous Friday, we were pleased to host Dan Garrette from Google, who gave a talk in the NLP/AI series.

Title: Learning from Weak Supervision: Combinatory Categorial Grammars and Historical Document Transcription

As we move NLP toward domains and languages where supervised training resources are not available, there is an increased need to learn models from less annotation. In this talk, I will describe two projects on learning from weak supervision. First, I will discuss work on learning combinatory categorial grammars (CCGs) from incomplete information. In particular, I will show how universal, intrinsic properties of the CCG formalism can be encoded as priors and used to guide the learning of supertaggers and parsers. These universal priors can, in turn, be combined with corpus-specific knowledge derived from limited amounts of available annotation to further improve performance. Second, I will present work on learning to automatically transcribe historical documents that feature heavy use of code-switching and non-standard orthographies that include obsolete spellings, inconsistent diacritic use, typos, and archaic shorthands. Our state-of-the-art model is able to induce language-specific probabilistic mappings from language model data with standard orthography to the document-specific orthography on the page by jointly modeling both variant-preserving and normalized transcriptions. I will conclude with a discussion of how our work has opened up new avenues of research for scholars in the digital humanities, with a focus on transcribing books printed in Mexico in the 1500s.

Dan is a research scientist at Google in NYC. He was previously a postdoctoral researcher at the University of Washington working with Luke Zettlemoyer, and obtained his PhD at the University of Texas at Austin under the direction of Jason Baldridge and Ray Mooney.

Host: Alan Ritter

Clippers Tuesday: Joo-Kyung Kim on Cross-lingual Transfer Learning for POS Tagging

This Tuesday, Joo-Kyung Kim will be talking about his current work on cross-lingual transfer learning for POS tagging:

POS tagging is a relatively easy task given sufficient training examples, but since each language has its own vocabulary space, parallel corpora are usually required to utilize POS datasets in different languages for transfer learning. In this talk, I introduce a cross-lingual transfer learning model for POS tagging, which utilizes language-general and language-specific representations with auxiliary objectives such as language-adversarial training and language modeling. Evaluating on POS datasets from Universal Dependencies 1.4, I show preliminary results that the proposed model can be effectively used for cross-lingual transfer learning without any parallel corpora or gazetteers.