At Clippers on Tuesday, Wuwei Lan will present his EMNLP 2017 paper (with Wei Xu).
Title: Automatic Paraphrase Collection and Identification in Twitter
A paraphrase is a restatement of the meaning of a text or passage in other words, and is useful in many NLP applications, including machine translation, question answering, semantic parsing and textual similarity. Paraphrase resources are valuable but hard to obtain at large scale, especially for sentence-level paraphrases. Here we propose a simple way to automatically collect large numbers of sentential paraphrases from Twitter: grouping tweets that share a URL. We present the largest human-labeled gold-standard corpus, with 51,524 pairs, as well as a silver-standard corpus that can grow by 30k pairs per month at 70% precision. Using this Twitter paraphrase dataset, we experimented with deep learning models for automatic paraphrase identification. We find that, even without pretrained word embeddings, character or subword embeddings alone achieve state-of-the-art or competitive results on social media datasets, which is useful in domains with more out-of-vocabulary words or more spelling variation.
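
As a rough illustration of the collection idea in the abstract (grouping tweets through shared URLs), here is a minimal Python sketch. It is not the authors' pipeline: the function name candidate_pairs, the URL regex, and the sample tweets are all hypothetical, and it assumes URLs have already been expanded to a canonical form so that two tweets linking to the same page carry the same string. The pairs it produces are only candidates; they would still need human labeling or a trained classifier to decide which are true paraphrases.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a simple pattern for http(s) links; real tweets may need
# URL expansion and normalization before grouping.
URL_PATTERN = re.compile(r"https?://\S+")


def candidate_pairs(tweets):
    """Group tweets by the URLs they contain and return candidate pairs.

    Each returned item is (url, text_a, text_b), where both texts
    linked to the same URL. The URL itself is stripped from the text.
    """
    groups = defaultdict(list)
    for tweet in tweets:
        urls = set(URL_PATTERN.findall(tweet))
        # Remove the links and collapse whitespace to get the bare text.
        text = " ".join(URL_PATTERN.sub(" ", tweet).split())
        for url in urls:
            # Skip empty texts and exact duplicates (e.g. retweets).
            if text and text not in groups[url]:
                groups[url].append(text)

    pairs = []
    for url, texts in groups.items():
        pairs.extend((url, a, b) for a, b in combinations(texts, 2))
    return pairs


# Hypothetical sample data, for illustration only.
tweets = [
    "Storm closes schools http://example.com/a",
    "Schools shut because of the storm http://example.com/a",
    "New phone released today http://example.com/b",
]
for url, a, b in candidate_pairs(tweets):
    print(url, "|", a, "|", b)
```

Only the first two sample tweets share a URL, so they form the single candidate pair; the third tweet has no partner and is dropped.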