At Clippers Tuesday, Zhen Wang will present joint work with Huan Sun on separating code from natural language text.
Title: Separating Text and Code for Next Utterance Classification in Stack Overflow
Abstract: In this talk, we will discuss our ongoing work on (1) developing tools to separate natural language text and programming code in a Stack Overflow (SO) comment, and (2) applying them to the Next Utterance Classification (NUC) task. In SO, a comment is posted after a question or answer post, and usually contains follow-up questions, suggestions, opinions, etc. It is often a mixture of two different modalities, natural language and programming language, which distinguishes it from comments on other social media such as Twitter and Facebook. This bimodal mixture makes SO comments more difficult for machines to understand. We hypothesize that separating code and natural text should be the first step for tasks involving understanding programming-related text. While careful comment writers may use special formatting to distinguish natural words from programming tokens, noisy SO comments like “You will first need to: import collections # to use defaultdict” that simply mix text and code together are also very common. Therefore, in our first task, we study automatically separating code and text in noisy SO comments, which is cast as a sequence labeling problem. In our preliminary experiments, we tested a series of baseline models, including a traditional CRF with hand-crafted features and state-of-the-art neural methods for the named entity recognition (NER) task. Our results show that for tokens that can appear in both programming and natural language contexts, such as “exception”, “timeout”, and “flatten”, the baseline models cannot predict their labels accurately. We are trying to improve the baseline models using domain-specific knowledge as well as more advanced neural architectures.
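To make the sequence labeling framing concrete, here is a minimal sketch in Python of how the example comment above might be tokenized and turned into per-token features of the kind a CRF could consume. The tag set ("T" for text, "C" for code), the tokenizer, the feature names, and the hand-assigned labels are all illustrative assumptions, not the annotation scheme or features used in the work itself.

```python
import re

def tokenize(comment):
    # Hypothetical tokenizer: identifiers/words, digit runs, or single symbols.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S", comment)

def token_features(tokens, i):
    # Hypothetical hand-crafted features for the token at position i.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_underscore": "_" in tok,
        "is_camel_case": bool(re.search(r"[a-z][A-Z]", tok)),
        "is_symbol": not (tok[0].isalnum() or tok[0] == "_"),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

comment = "You will first need to: import collections # to use defaultdict"
tokens = tokenize(comment)

# Illustrative labels assigned by hand (an assumption, not gold annotation):
# "T" = natural-language text, "C" = code.
labels = ["T", "T", "T", "T", "T", "T", "C", "C", "C", "T", "T", "C"]

features = [token_features(tokens, i) for i in range(len(tokens))]
for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

A sequence labeler would be trained to predict one label per token from such features (or from learned embeddings in the neural case). Ambiguous tokens such as "exception" or "timeout" get the same surface features in both contexts, which is one way to see why they are hard for the baselines.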
In our second task, we investigate whether separately modeling text and code can help the Next Utterance Classification (NUC) task on SO comments, which is to classify whether an utterance is a response to another. To train, validate, and test models, we design special rules to collect context-response pairs from Stack Overflow comments containing both natural language and code snippets. We implemented Siamese networks with tied Bi-LSTMs for the NUC task, with and without treating code snippets differently from natural text. Beyond the current work, our research plan is to mine the rich resources in Stack Overflow, understand text-code mixed data, and develop programming-related intelligent assistants in the long run.
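The Siamese idea can be sketched in a few lines of Python: one shared ("tied") encoder is applied to both the context and the candidate response, and the two encodings are compared to produce a score. In this sketch a bag-of-words count vector stands in for the Bi-LSTM encoder and cosine similarity stands in for the learned classifier; the `mask_code` option, the `<CODE>` placeholder, and the "T"/"C" token labels are hypothetical ways of treating code differently from text, not the method used in the work.

```python
import math
from collections import Counter

def encode(tokens, labels, mask_code=False):
    # Shared encoder: the same function (same "weights") encodes both inputs.
    # A bag-of-words count vector is a stand-in for a Bi-LSTM here.
    if mask_code:
        # Hypothetical treatment: collapse every code token ("C") into one symbol.
        tokens = ["<CODE>" if lab == "C" else tok for tok, lab in zip(tokens, labels)]
    return Counter(tok.lower() for tok in tokens)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def score(context, response, mask_code=False):
    # context and response are (tokens, labels) pairs; higher means more
    # likely that the response follows the context.
    return cosine(encode(*context, mask_code=mask_code),
                  encode(*response, mask_code=mask_code))

# Made-up example pair; labels: "T" = text, "C" = code.
context = (["use", "defaultdict", "instead"], ["T", "C", "T"])
response = (["defaultdict", "works", "thanks"], ["C", "T", "T"])

print(score(context, response))
print(score(context, response, mask_code=True))
```

Because the encoder is shared, the score is symmetric in its two inputs; a trained model would replace the cosine with a learned comparison layer and a threshold or softmax over candidate responses.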
Any suggestions and comments are highly appreciated.