The big assumption here is that the data you'll be querying is, in fact, decently structured. I like the idea, but I've come across too many structures that aren't organized to be queryable at all, because they've been constructed by people without any understanding of relational concepts, or indeed data integrity or normalization.
"If the 3rd character in "customer ID" is 7 or 8, and the start date is after 2007, then they are a "premium" customer, and the maximum order amount doesn't apply, if they're shipping an order to Michigan, Texas or Florida".
"If customer ID is greater than 80000 and the date is after November 2009, then the real customer number should be reduced by 10000, because we had a problem and needed to reuse customer IDs". (meaning, an invoice for customer 12000 on November 2004 is related to a different customer than customer 12000 for an invoice on November 2010).
"If the employee's start date is >2005, then check table 1 for employee data if their last name starts with A-M, and table 2 if their last name starts with N-Z, otherwise check table 0 (legacy) Oh, and in table 4, if the employee ID starts with L, that means "legacy", so use table 0 to find their information, but remove the L".
These are situations I've run into in the last few years, and I'm sure many of you have similar WTFs in your experience. If someone has their data in good, solid, structured formats/tables, natural language syntax might be fun/easy/exciting, but those people can also be served by things like Crystal Reports, some books, and a few hours of learning. The companies that most desperately need NLP->SQL probably also have the worst data.
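For a sense of what any query layer would have to know before it could answer a question against data like this, here is a rough Python sketch of the three rules above written out as functions. The function names, the two-letter state codes, and the exact readings of "after 2007", "after November 2009" and ">2005" are my assumptions for illustration; none of this comes from the real systems. It also treats the "L" prefix as applying to any employee ID, not only IDs found in table 4.

```python
from datetime import date
from typing import Tuple

# Assumption: "after November 2009" means on or after 1 December 2009.
REUSE_CUTOFF = date(2009, 12, 1)
# Assumption: states are stored as two-letter codes.
EXEMPT_STATES = {"MI", "TX", "FL"}


def is_premium(customer_id: str, start_date: date) -> bool:
    """Rule 1: 3rd character of the ID is 7 or 8 and the start date is after 2007."""
    return (
        len(customer_id) >= 3
        and customer_id[2] in ("7", "8")
        and start_date.year > 2007  # assumption: "after 2007" means 2008 onward
    )


def max_order_applies(customer_id: str, start_date: date, ship_state: str) -> bool:
    """Rule 1, continued: premium customers shipping to MI/TX/FL skip the cap."""
    return not (is_premium(customer_id, start_date) and ship_state in EXEMPT_STATES)


def real_customer_id(customer_id: int, invoice_date: date) -> int:
    """Rule 2: reused IDs above 80000 map back by 10000 after the cutoff."""
    if customer_id > 80000 and invoice_date >= REUSE_CUTOFF:
        return customer_id - 10000
    return customer_id


def employee_lookup(employee_id: str, last_name: str, start_date: date) -> Tuple[int, str]:
    """Rule 3: return (table number, ID to look up in that table)."""
    if employee_id.startswith("L"):
        # "L" means legacy: use table 0 and drop the prefix.
        return 0, employee_id[1:]
    if start_date.year > 2005 and last_name:
        first = last_name[0].upper()
        if "A" <= first <= "M":
            return 1, employee_id
        if "N" <= first <= "Z":
            return 2, employee_id
    return 0, employee_id
```

Note that rule 1 needs the customer ID as a string and rule 2 needs it as a number, which is part of the problem: none of this is visible in the schema, so a natural language front end has nothing to infer it from.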
Yeah, there's certainly a level of schema insanity beyond which we won't be able to offer a lot of value. In those cases we can still consume data that looks like subject-predicate-object, but the onus would be on integrators to supply that, and then you have security and timeliness issues.
People with good data, even those with the expertise to query it, often still have end users who want that data. The cycle of 'call IT department, ask for data, wait for data' or 'email SaaS provider, request report, wait for report' can be short-circuited in these cases, and I believe that's of value.
"If the 3rd character in "customer ID" is 7 or 8, and the start date is after 2007, then they are a "premium" customer, and the maximum order amount doesn't apply, if they're shipping an order to Michigan, Texas or Florida".
"If customer ID is greater than 80000 and the date is after November 2009, then the real customer number should be reduced by 10000, because we had a problem and needed to reuse customer IDs". (meaning, an invoice for customer 12000 on November 2004 is related to a different customer than customer 12000 for an invoice on November 2010).
"If the employee's start date is >2005, then check table 1 for employee data if their last name starts with A-M, and table 2 if their last name starts with N-Z, otherwise check table 0 (legacy) Oh, and in table 4, if the employee ID starts with L, that means "legacy", so use table 0 to find their information, but remove the L".
These are situations I've run in to in the last few years, and I'm sure many of you have similar WTFs in your experiences. If someone has their data in good, solid, structured formats/tables, natural language syntax might fun/easy/exciting, but those people can also be served by things like Crystal Reports, some books, and a few hours of learning. The companies that most desperately need NLP->SQL probably also have the worst data.