Word representations are a key technology in the NLP toolbox, but extending their success into representations of phrases and knowledge base entities has proven challenging. In this talk, I will present a method for jointly learning embeddings of words, phrases, and entities from unannotated text, using only a list of mappings between entities and surface forms. I compare these against prior methods that have relied on explicitly annotated text or the rich structure of knowledge graphs, and show that our learned embeddings better capture similarity and relatedness judgments and some relational domain knowledge.
I will also discuss experiments on augmenting the embedding model to learn soft entity disambiguation from contexts, and using member words to augment the learning of phrases. These additions harm model performance on some evaluations, and I will show some preliminary analysis of why the specific modeling approach for these ideas may not be the right one. I hope to brainstorm ideas on how to better model joint phrase-word learning and contextual disambiguation, as part of ongoing work.
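As a rough illustration of how a list of entity-to-surface-form mappings alone can supply training signal from unannotated text, the sketch below scans a token sequence for known surface forms and returns each match with its candidate entities. It is not the method presented in the talk: the mapping contents, the entity identifiers, and the `annotate` function are all hypothetical, and a real system would go on to feed the matched words, phrases, and entities into an embedding objective.

```python
# Hypothetical mapping from surface forms (as token tuples) to entity IDs.
# An ambiguous surface form simply lists more than one candidate entity.
SURFACE_TO_ENTITIES = {
    ("heart", "attack"): ["ENT_myocardial_infarction"],
    ("cold",): ["ENT_common_cold", "ENT_low_temperature"],
}


def annotate(tokens, mapping=SURFACE_TO_ENTITIES, max_len=3):
    """Return (start, end, phrase, candidate_entities) for each surface-form match."""
    matches = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            # Case-insensitive exact match against the mapping keys.
            span = tuple(t.lower() for t in tokens[start:end])
            if span in mapping:
                matches.append((start, end, " ".join(span), mapping[span]))
    return matches


if __name__ == "__main__":
    for match in annotate("He had a heart attack after a cold".split()):
        print(match)
```

Each match shares its surrounding context among the member words, the phrase, and every candidate entity. The ambiguous surface form "cold" is the case that soft, context-based disambiguation would need to resolve.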
Virtual patients are an effective, cost-efficient tool for training medical professionals to interview patients in a standardized environment. Technological limitations have thus far restricted these tools to typewritten interactions; however, as speech recognition systems have improved, full-scale deployment of a spoken dialogue system for this purpose has edged into the range of feasibility. To build the best such system possible, we propose to take advantage of work done to improve the functioning of virtual patients in the typewritten domain. Specifically, our approach is to noisily map spoken utterances into text using off-the-shelf speech recognition, whereupon the text can be used to train existing question classification architectures. We expect that phoneme-based CNNs may mitigate recognition errors in the same way that character-based CNNs mitigate, e.g., spelling errors in the typewritten domain. In this talk I will present the architecture of the system being developed to collect speech data, the experimental design, and some baseline results.
Automatic paraphrasing with lexical substitution
Generating automatic paraphrases with lexical substitution is a difficult task, but can be useful for supplementing data in domain-specific machine learning tasks. The Virtual Patient Project is a clear example of this problem: we have limited domain-specific training data but need to accurately identify a user’s intended question, an example of which we may have seen only once. In this talk, I will present the progress Amad Hussein, Michael White, and I have made in automatically generating paraphrases, using unsupervised lexical substitution with WordNet, word embeddings, and the Paraphrase Database. Although our oracle accuracy in automatically classifying question types is currently only moderately above our baseline, the gains are modestly significant and give an estimate of what can be accomplished with human filtering. We propose future work in this direction that utilizes machine translation and phrase-level substitution.
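To make the idea of lexical substitution concrete, here is a minimal sketch that generates paraphrase candidates by swapping individual words for listed substitutes. The `SYNONYMS` table and the `paraphrases` function are hypothetical: the table is a toy stand-in for the lookups that WordNet, word embeddings, or the Paraphrase Database would provide, and the sketch does no filtering or ranking of candidates.

```python
from itertools import product

# Hypothetical toy substitution table (an assumption for illustration only);
# a real system would draw candidates from WordNet, embeddings, or PPDB.
SYNONYMS = {
    "pain": ["ache", "discomfort"],
    "medications": ["medicines", "drugs"],
}


def paraphrases(sentence, synonyms=SYNONYMS, limit=10):
    """Generate up to `limit` paraphrases by word-level substitution."""
    tokens = sentence.split()
    # Each position offers the original token plus any listed substitutes.
    options = [[tok] + synonyms.get(tok.lower(), []) for tok in tokens]
    results = []
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate != sentence:  # skip the unchanged sentence
            results.append(candidate)
        if len(results) >= limit:
            break
    return results


if __name__ == "__main__":
    for p in paraphrases("do you have any pain"):
        print(p)
```

Unfiltered substitution of this kind over-generates, since not every substitute fits every context, which is why some form of filtering, automatic or human, matters for the quality of the added training data.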