It wasn't clear to me which evaluation method was being used. The chart in the blog post says Execution Accuracy, but the numbers appear to correlate with "Exact Set Match" (comparing the SQL itself) rather than "Execution With Values" (comparing the values in the result sets). For example, DIN-SQL + GPT-4 achieves an 85.3% "Execution With Values" score. Is that what is being used here?
Hello, thank you very much for your meticulous comment. The 85.3% accuracy reported in our paper (I'm one of the authors of the DIN-SQL paper) is on the test set. In the blog post, however, we report performance on the development set, which is 74.2%.
See the following for more info:
https://yale-lily.github.io/spider
https://github.com/taoyds/spider/tree/master/evaluation_exam...
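For readers unfamiliar with the two metrics, here is a minimal sketch of how they can disagree on the same prediction. This is not the official Spider evaluator (the real Exact Set Match implementation parses queries and compares clause components as sets); the helper names and toy schema below are my own illustration.

```python
import sqlite3

def exact_set_match(pred_sql: str, gold_sql: str) -> bool:
    # Crude stand-in for "Exact Set Match": compares the SQL text itself
    # after normalizing whitespace and case. The real evaluator parses the
    # queries and compares clause components, but the key point holds:
    # it judges the SQL, not the data it returns.
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred_sql) == norm(gold_sql)

def execution_match(pred_sql: str, gold_sql: str,
                    conn: sqlite3.Connection) -> bool:
    # Stand-in for "Execution With Values": run both queries and compare
    # the (unordered) result sets, including the actual values.
    pred_rows = sorted(conn.execute(pred_sql).fetchall())
    gold_rows = sorted(conn.execute(gold_sql).fetchall())
    return pred_rows == gold_rows

# Toy database to show the metrics can disagree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 30), ("B", 40)])

gold = "SELECT name FROM singer WHERE age > 35"
pred = "SELECT name FROM singer WHERE age >= 36"  # different SQL, same rows

print(exact_set_match(pred, gold))        # False: the SQL text differs
print(execution_match(pred, gold, conn))  # True: both return [('B',)]
```

A prediction can thus fail one metric and pass the other, which is why the two scores for the same model on the same split generally differ.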