Leveraging LLM as a Judge Techniques to Identify Errors in Generated Clause-Level SQL Comments
Database query languages (e.g., SQL) require users to be knowledgeable about data structures in order to understand how the system interpreted their question and subsequently generated its output. Prior research in SQL explainability has largely focused on generating natural language summaries at the query level (Eleftherakis et al., 2021; Kokkalis et al., 2012) or on translating queries into natural language templates at the phrase or clause level, where each command in the query starts a clause (Narechania et al., 2021; Tian et al., 2023). In this work, we 1) use LLMs to generate easy-to-understand, step-by-step comments that bridge the gap between the user question and the technical structure of the query, and 2) leverage an LLM as a judge to determine whether a generated comment contains an error.
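As a minimal illustration of the generation step, the Python sketch below splits a query into clauses at each top-level SQL command and prompts a model to comment on each clause in the user's own wording. The helper call_llm is a hypothetical stand-in for any chat-completion API, and the prompt wording is an assumption rather than the prompt used in this work.

    import re

    def split_into_clauses(sql: str) -> list[str]:
        """Split a query on top-level SQL keywords; each keyword starts a clause."""
        keywords = r"\b(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY|LIMIT)\b"
        parts = re.split(keywords, sql, flags=re.IGNORECASE)
        # re.split keeps the captured keywords; rejoin each keyword with the text that follows it
        return [
            (parts[i] + parts[i + 1]).strip()
            for i in range(1, len(parts) - 1, 2)
        ]

    def generate_clause_comments(question: str, sql: str, call_llm) -> list[str]:
        """Ask the LLM for one plain-language comment per clause (call_llm is hypothetical)."""
        comments = []
        for clause in split_into_clauses(sql):
            prompt = (
                f"User question: {question}\n"
                f"SQL clause: {clause}\n"
                "Explain this clause in one sentence using the user's own wording "
                "and avoiding database terminology."
            )
            comments.append(call_llm(prompt).strip())
        return comments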
Generated comments should 1) maintain semantic fidelity between the user question and the SQL query, 2) use the user's language, 3) use anaphoric information as needed, and 4) avoid technical database terminology. We find that comments generated by LLMs exhibit less semantic fidelity to the SQL query than templated comments but align more closely with the language of the user's question. Rather than improving the semantic fidelity of the generation models, we explore using LLMs as evaluators: the judge's verdict, combined with its confidence, is used to identify generated comments that mislead the user or are unfaithful to the content of the SQL query. Ultimately, this project aims to develop a dialogue system that increases the transparency of database query outputs for non-expert SQL users.
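The sketch below shows one way the judging step could be wired up, again assuming the hypothetical call_llm helper. The judge is asked for a faithfulness verdict plus a self-reported confidence score, and a comment is flagged when it is judged unfaithful or the judge is not confident; the JSON output format and the 0.7 threshold are illustrative assumptions, not the settings used in this work.

    import json

    JUDGE_PROMPT = """You are checking a comment written for one clause of a SQL query.
    User question: {question}
    SQL clause: {clause}
    Generated comment: {comment}

    Does the comment faithfully describe what the clause does with respect to the
    user question? Respond with JSON: {{"faithful": true or false, "confidence": 0-1}}."""

    def flag_comment(question: str, clause: str, comment: str, call_llm,
                     confidence_threshold: float = 0.7) -> bool:
        """Return True if the comment should be flagged as potentially misleading."""
        response = call_llm(JUDGE_PROMPT.format(
            question=question, clause=clause, comment=comment))
        verdict = json.loads(response)
        # Flag comments the judge deems unfaithful, or where its confidence is low
        return (not verdict["faithful"]) or verdict["confidence"] < confidence_threshold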