Clippers 8/30: Shuaichen Chang on Robustness Evaluation for Text-to-SQL

Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries on unseen databases. However, recent studies reveal that text-to-SQL models are vulnerable to adversarial perturbations. In this paper, we propose a comprehensive robustness evaluation benchmark built on Spider, a cross-domain text-to-SQL benchmark. We design 17 realistic perturbations of databases, natural questions, and SQL queries to systematically measure the robustness of text-to-SQL models along task-specific dimensions. We leverage the structural nature of the task for the database and SQL perturbations, and we use large pretrained language models (PLMs) to simulate human users for the natural question perturbations. With our evaluation set, we conduct a diagnostic robustness study of state-of-the-art models. The experimental results reveal that even the best model suffers a performance drop of around 50% under certain perturbations. We also present a breakdown analysis with respect to text-to-SQL model designs and provide insights for improving model robustness.
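
To give a concrete feel for what a structural, schema-side perturbation might look like, here is a minimal Python sketch: a column is renamed to a synonym in both the database schema and the gold SQL, so a robust model given the original question should still produce an equivalent query. The synonym table and helper names here are invented for illustration; this is not the paper's actual perturbation code or its set of 17 perturbations.

```python
import re

# Hypothetical synonym map for schema columns (illustrative only).
COLUMN_SYNONYMS = {
    "singer_name": "artist_name",
}

def perturb_schema(columns):
    """Rename columns to synonyms, simulating a schema-side perturbation."""
    return [COLUMN_SYNONYMS.get(col, col) for col in columns]

def perturb_sql(sql):
    """Apply the same renaming to the gold SQL so it stays executable
    against the perturbed database."""
    for old, new in COLUMN_SYNONYMS.items():
        sql = re.sub(rf"\b{old}\b", new, sql)
    return sql

columns = ["singer_id", "singer_name", "age"]
gold_sql = "SELECT singer_name FROM singer WHERE age > 30"

print(perturb_schema(columns))  # ['singer_id', 'artist_name', 'age']
print(perturb_sql(gold_sql))    # SELECT artist_name FROM singer WHERE age > 30
```

A robustness check would then feed the unmodified natural question plus the perturbed schema to the model and test whether its prediction executes to the same result as the perturbed gold SQL.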