Clippers 4/11: Alyssa Allen on Line-by-Line Comment Generation for SQL

This work is rooted in a larger project aimed at developing a dialogue system that helps non-expert SQL users comprehend database query outputs. Prior research in SQL comment-generation has focused on comments which summarize entire SQL queries and translations of SQL to templated English (Eleftherakis et al., 2021; Narechania et al., 2021). These approaches can be helpful in comprehending SQL but are limited in their ability to guide users through the query steps and connect formal notation with intuitive concepts. To address this limitation, the project aims to generate line-by-line comments that leverage language from user questions, connecting formal SQL notation with user-friendly concepts (e.g. “tallest” or “alphabetical order”).

Due to a lack of pre-existing training data, 100 SQL queries from the SPIDER dataset (Yu et al., 2018) have been manually annotated. These 100 examples will then be used as a base for generating a more robust training set through self-training and prompting. I have been experimenting with using ChatGPT to generate comments for more queries as well as fine-tuning BART for the task. This approach will allow us to scale the training set quickly and minimize time spent writing comments by hand. This presentation will discuss the annotation process and preliminary results for comment generation using the above methods.