We've been chasing this rabbit since the beginning. It currently seems to be uncatchable for the use cases that would be most valuable to us - writing complex queries we've never seen before. Our use case seems to confound any notion of training or fine-tuning, since the cases we need the most help with are also the ones we have the fewest examples of.
Instead of going for generative, few-shot models, I am starting to look at the other end of the spectrum: Binary classification into deterministic query building.
With ChatGPT, you cannot realistically explain why you got some output in a way that anyone other than an AI/ML expert would find satisfying. With binary classifiers, you can precisely explain how some input resulted in some output in terms that a business person could easily digest - "You mentioned this table so it assumed you wanted to constrain on XYZ. Here's the trace from the classifiers...".
I've proposed a few schemes where you define groups of classifiers for each SQL-building concern: which tables are involved, which columns, whether a join or aggregate is implied, the general business context, etc. Clearly there are holes in this scheme, but in our domain we could plausibly fine-tune our humans to be a little more verbose in their use of the automagic SQL vending machine. Several hours spent training humans is probably a lot cheaper and easier than getting ChatGPT et al. to consistently play by our rules.
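To make that concrete, here's a minimal sketch of what one group of classifiers feeding a deterministic builder might look like. All table/column names, and the classifiers themselves, are hypothetical placeholders, not our actual setup:

```python
# Minimal sketch: groups of binary classifiers feeding a deterministic query builder.
# The classifiers are passed in as plain callables scoring a question between 0 and 1.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Decision:
    concern: str       # e.g. "table:orders"
    fired: bool
    score: float       # classifier confidence, kept for the explanation trace


def build_query(question: str,
                table_clfs: Dict[str, Callable[[str], float]],
                column_clfs: Dict[str, Callable[[str], float]],
                threshold: float = 0.5):
    trace: List[Decision] = []

    tables = []
    for table, clf in table_clfs.items():
        score = clf(question)                       # one binary classifier per table
        trace.append(Decision(f"table:{table}", score >= threshold, score))
        if score >= threshold:
            tables.append(table)

    columns = []
    for column, clf in column_clfs.items():
        score = clf(question)                       # one binary classifier per column
        trace.append(Decision(f"column:{column}", score >= threshold, score))
        if score >= threshold:
            columns.append(column)

    # Deterministic assembly: no generation, just templating over whatever fired.
    sql = f"SELECT {', '.join(columns) or '*'} FROM {' JOIN '.join(tables)}"
    return sql, trace                               # the trace is the explanation
```

The returned trace is what lets you give the business-friendly explanation above: which classifier fired, on what, and with what confidence.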
I think you've fallen into the trap of "AIs don't generalize, they memorize." But they do in fact generalize. The reason ChatGPT is so valuable is precisely because it can help out with situations that have never been seen before, not because it merely unlocks old preexisting knowledge. The fella who saved their dog with ChatGPT comes to mind. https://nypost.com/2023/03/27/chatgpt-saved-my-dogs-life-aft...
I think you have fallen into the trap of mistaking interpolation for generalization.
Working with these models every day, it's clear that they can interpolate between points in latent space and generate sensible answers to unseen questions, but it's also clear that they don't generalize. I've seen far too many examples of models failing to display any sense of generalization to believe otherwise.
That's not to say that interpolation in a rich latent space of text isn't very useful. But it's not the same level of abstraction that comes from true generalization in this space.
>it's also clear that they don't generalize. I've seen far too many examples of models failing to display any sense of generalization to believe otherwise
Failing to generalize on whatever you have in mind is not evidence that the models are incapable of generalization. If you really think so, just be prepared to write off a good chunk of humans as well lol.
Given a latent space with enough dimensions, it's hard to imagine cases of generalization that aren't just interpolation between points in that space (given enough data).
The one possible exception is logical inference, and this problem seems tractable with tool use or programming.
I object on geometric grounds. You can't interpolate outside the convex hull; if you have an outlier in any dimension then you're going to need to extrapolate. That seems to me like a reasonable way to ask for generalisation that isn't interpolation.
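A toy illustration of the geometric point, assuming SciPy is available and using random points as a stand-in for latent vectors: a point outside the convex hull of the training data can't be reached by interpolation.

```python
# Toy illustration: a query point outside the convex hull of the training data
# cannot be reached by interpolation, only by extrapolation.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
train = rng.uniform(0, 1, size=(200, 3))   # stand-in for points in latent space
hull = Delaunay(train)

points = np.array([[0.5, 0.5, 0.5],        # well inside the training data
                   [0.5, 0.5, 5.0]])       # an outlier in just one dimension

# find_simplex returns -1 for points outside the hull.
print(hull.find_simplex(points) >= 0)      # [ True False]: the outlier needs extrapolation
```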
- ChatGPT initially gives the same diagnosis the vets did, Babesiosis
- It notes that Babesiosis may have either been a misdiagnosis or there may be a secondary condition/infection causing the remaining symptoms after the Babesiosis treatment didn't resolve all of them
- It suggests such a hypothetical secondary condition could be IMHA, which the article notes is an extremely common complication of Babesiosis with this specific dog breed
- A quick Google search brings up a fair amount of literature about the association between Babesiosis and IMHA
So in fact this is the opposite of a never before seen situation, ChatGPT was just regurgitating common comorbidities of Babesiosis and the vets in question are terrible at their job.
Are you suggesting that it was commonplace for machine learning models to be able to extrapolate medical case reports into an actual diagnosis? Even being able to read and understand a case report is a minor miracle in extrapolation, and it’s interesting how far the goal posts move.
I have another example of ChatGPT not generalizing but just being a really good statistical model. I needed a solution to a problem you can't find on Google, a variation of another problem that also can't be found on Google. I attempted to obtain around 15 lines of code from ChatGPT that would solve the problem, but it consistently failed to produce the correct solution. I spent a few hours trying different prompts, pointing out its errors and receiving apologies, only for it to generate another incorrect solution while acknowledging its mistake.
Solving out of distribution problems correctly seems almost impossible for it.
GPT 3.5 or 4? Surprisingly, it makes a huge difference. I think a lot of people's impressions are from 3.5, but many startups couldn't have been built on it, whereas with 4 they can.
If it was 4 I’d be curious about the specific problem if you’d be willing to link to the chat.
I had a similar issue with GPT-4: I was looking for a library that, given a grammar and a string, would produce a list of the next valid symbols.
GPT-4 only ever suggested grammar validators, to the point that I had given up and was going to write a grammar generator. So I started looking for the equivalent of ANTLR in Python, and within three searches I found nltk.grammar, which actually solves the original problem.
It's not a new library either, so I'm dumbfounded as to why GPT-4 couldn't make sense of my request.
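For what it's worth, here's a rough sketch of how I'd attempt the "next valid symbols" question with NLTK's chart parser. The toy grammar is made up, and expanding predicted nonterminals down to concrete terminals is left out:

```python
import nltk

# Toy grammar made up purely for illustration.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'the' N
    VP -> V NP
    N  -> 'dog' | 'cat'
    V  -> 'sees'
""")
parser = nltk.ChartParser(grammar)


def next_symbols(tokens):
    chart = parser.chart_parse(tokens)
    expected = set()
    for edge in chart.edges():
        # Incomplete edges that reach the end of the input say what could come next.
        if edge.end() == len(tokens) and not edge.is_complete():
            expected.add(edge.nextsym())
    return expected


print(next_symbols(['the', 'dog', 'sees']))   # e.g. {NP}: an NP is expected next
```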
I tried this to see. It does suggest using validators, though in a way that mostly solves the problem. The failure mode is when the string is valid either as a complete parse or as a prefix that could be extended. It does have a bit of a workaround for that.
No, because if I add a space to the original string I don't get a valid string.
Or rather, given the context, a space is a valid production from the last error, but then I would need to roll back the incomplete "do" terminal and lose that context.
> With ChatGPT, you cannot realistically explain why you got some output in a way that anyone other than an AI/ML expert would find satisfying
IMHO OP is talking about "explainability" of the results, which is notoriously bad for current AI. For certain applications (idk if SQL would be one, but mortgage applications might be) it is required to be able to explain how the computer arrived at the decision.
Yeah, louie.ai's query generation internals feel more like classic program synthesis, where it's more about staged compilation, iterative refinement, etc., and we use GPT-3/GPT-4 for each stage rather than a traditional solver. "Which tables would help this query, which parts should go to text search vs vector search, ...", and we iteratively rewrite and pass along until we reconstitute.
At the same time, we're also moving to make a lot more of that process AI-controlled, just across LLM calls (e.g., LLM-dictated ones), so it's a funny maturity process.
Curious to learn more about your use case. If fine-tuning is only ineffective for your most complex queries (and presumably those are less frequent as well, since you mentioned you have few examples), then couldn't you use fine-tuning to handle the simpler queries (presumably the lion's share) and thus free up excess man hours to focus on the more complex queries? Is there any benefit to AI being able to answer 90% of queries vs 0%?
I tried something similar to this, so I assume the idea is to classify each table as being part of the resulting query or not, and then use the schemas of the selected tables in the prompt for the question. I found it just as useful to give it the full schema and ask it for a list of relevant tables, and certainly more cost effective.
In the end, the one who possesses knowledge will turn it into value by going the most obvious and straightforward way: writing a piece of useful SQL or code straight from their head.
GPT is a tool for learners, and they will keep shooting themselves in the foot until they learn; only the weapon is different now.
Look at the legal system, written in natural language. It is misleading and doesn't work well, so we have to gather thousands of people around courts doing non-deterministic work in order to process it. Natural language is a tool for learning, not for making systems. You have to shrink your vocabulary down to code at some point in order to make systems, and you can do that much faster and better than GPT.
True. The goal would be to write the complex SQL query once, then file it away as golden SQL so that in the future a similarly complex query could be done in natural language.
Are you planning on submitting this model to be evaluated against the Spider holdout set?
Also, wondering if anyone has found research on the inverse of this approach to the problem, i.e., instead of training the model to understand the data, you improve the data to be more understandable to a model? This seems more promising when you are looking at enterprise use cases without much training data. Spider seems like quite a simple dataset compared to the ones I encounter on the job, and LLMs struggle even with those.
In theory one could create domain-specific (or industry-specific) templates for data. However, coming up with a universal structure might be challenging since data is so varied.
Since the issue is often the context, plugging in data dictionaries (and passing those to the LLM) can help.
Yes, we have already submitted the model for evaluation on the Spider holdout test set. While your suggestion is certainly intriguing, implementing a universal solution could be quite challenging, as it would heavily depend on the specifics of the dataset.
I don’t think it’s necessarily about a “universal” solution, just “better”: make the column names more verbose, change numeric enums to text ones, disambiguate column names, etc. One of the Spider datasets has a stadium table where one of the column names is “average”, which means average capacity, but it’s super ambiguous. If you asked an LLM to “make these table columns more verbose” I bet it would call that “average_capacity”, and all of a sudden some NLQ queries that confused the function and the column name would start to work.
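A rough sketch of that renaming pass using the OpenAI chat API; the model choice, prompt wording, and DDL here are just placeholders:

```python
# Sketch: ask an LLM to propose more verbose, unambiguous column names.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

# Hypothetical DDL for the Spider stadium example mentioned above.
ddl = "CREATE TABLE stadium (stadium_id INT, name TEXT, capacity INT, average INT)"

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "Rename ambiguous columns to verbose, self-describing names. "
                    "Return one 'old_name -> new_name' line per column you rename."},
        {"role": "user", "content": ddl},
    ],
)
print(resp.choices[0].message.content)   # hopefully something like: average -> average_capacity
```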
I often do complicated reports in SQL or complicated transformations in SQL for system migrations. To really write a query and 'get it right' you usually need insider knowledge that you can't glean from the column names.
I see in their training set they've got comments about columns too. e.g.
"Country" TEXT NOT NULL, - country where the singer born
But that's still not enough.
You also need a bunch of information about the real business that the data is describing. And you also need to analyse the whole database - is that field actually used? What are the common values for this picklist? What does that status actually mean in terms of business? If there are two of those rows that match the join but I want to avoid duplicates, which one do I take? - the newest, or the one with a certain status? etc.
Also, with complicated queries or reports, there's quite a big element of "do the results look roughly as I expect?" - e.g. does the total income for last month look about right? Then you might do a detailed reconciliation, find out your total is 1% off due to not accounting for refunds, and then you go and adjust your query and try again.
> You also need a bunch of information about the real business that the data is describing.
While the article focuses on finetuning GPT-3.5-turbo, how you use the text-to-SQL engine within the architecture of your overall solution is for you to decide. Providing this business context from vectorized context stores in the actual prompt would be a step in the right direction.
Why don't you collect a sample of the data so it knows what to expect, then? Include 10 handpicked rows in your prompt and it should be able to pick up the patterns. If you find it doesn't handle something, append that row and restart.
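Something like this, as a minimal sketch; the schema and rows are made up:

```python
# Made-up schema and rows; the point is just to show the model real value formats.
schema = "transactions(txn_date DATE, txn_type TEXT, amount NUMERIC)"
sample_rows = [
    ("2023-08-01", "payment", 49.00),
    ("2023-08-03", "refund", -19.99),
    # ... up to ~10 handpicked rows covering the edge cases you care about
]

prompt = (
    f"Schema: {schema}\n"
    "Sample rows:\n"
    + "\n".join(str(row) for row in sample_rows)
    + "\nQuestion: total income last month, excluding refunds.\nSQL:"
)
print(prompt)
```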
So if they're doing 50,000,000 tokens to do the fine-tune, at $0.008/1K training tokens and 2 epochs, that's about $800 to do the fine-tune. Not prohibitive from a business standpoint.
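Quick sanity check on the arithmetic, using the numbers quoted above:

```python
# Back-of-the-envelope fine-tuning cost at the quoted training price.
tokens = 50_000_000
price_per_1k = 0.008   # USD per 1K training tokens
epochs = 2
print(tokens / 1000 * price_per_1k * epochs)   # 800.0
```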
A tutorial on how to fine-tune a GPT3.5 model for Natural Language to SQL tasks and a comparison of its performance vs Retrieval Augmented Generation.
Based on the results, fine-tuning can match and outperform RAG (the approach matches the state of the art on accuracy while being far faster and cheaper). The big challenge for fine-tuning tasks like this is building the training datasets. Things should get even more interesting when GPT-4 is opened for fine-tuning.
This brings up a question I've had about OpenAI's fine-tuning APIs.
The training format is a series of question/answer pairs: i.e., what we might think of as "supervised" learning. It can be challenging to build good data sets for this scenario.
But transformer-based models actually don't require supervised learning: the bulk training of the GPT family is to just throw masses of text at it and have it do next-token prediction.
What's going on with the difference? When I "fine-tune" GPT-3.5 turbo, am I actually training the transformer, or the RLHF model that sits on top? Or both?
More to the point, is there any way to fine-tune OpenAI's models in an "unsupervised" fashion? I.e., if I want to teach a model SQL, do I need a curated data set of question/answer pairs, or can I just dump in a bunch of schemas and SQL queries that will in theory make the model "better at SQL" in a generic way?
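For reference, the gpt-3.5-turbo fine-tuning endpoint expects supervised chat examples as JSONL, so even "just schemas and queries" end up framed as prompt/completion pairs. A minimal sketch, with illustrative contents:

```python
import json

# One illustrative training example in the chat fine-tuning JSONL format.
example = {
    "messages": [
        {"role": "system", "content": "You translate questions into SQL for the given schema."},
        {"role": "user", "content": "Schema: singer(name TEXT, country TEXT)\n"
                                    "Question: singers from France"},
        {"role": "assistant", "content": "SELECT name FROM singer WHERE country = 'France'"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```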
My understanding is that there’s no RLHF model sitting on top of ChatGPT. There’s a separate reward model during training, but that reward model is just used to fine-tune the decoder model. This is sort of confusing in the literature, because they will switch from calling it a “language model” to a “policy”, once they’ve used PPO, but if you trace the citations back to [1], you can see that the policy is initialized with a pretrained language model (in that case, GPT-2).
What I don’t know is how they’re applying RLHF after user fine-tuning. Are they redoing the PPO with the original reward model after tuning on your input? Are they just letting it slide, and hoping that the fine-tuning doesn’t cause the model to forget the RLHF? It’s unclear from what I’ve read.
I think natural language to SQL may not be as great as it sounds in many real-world applications. You often have permissions, tenants, rules about who-can-see-what. I love the idea of letting users use natural language to query and possibly even bulk-update data. But if the app has rules like those, this would just be a SQL injection vulnerability. You could possibly limit it - just allow query over certain views or something - but the risk of the GPT crafting a dangerous query is still high.
What would be more useful IMO is natural language to OData, GraphQL, and OpenAPI/Swagger. Then you could let users do ad-hoc queries but only against data they are allowed to see. I did a PoC using GPT-3 to query OData and it was pretty fun, but it did occasionally get the query wrong. I also ran into the context window issue: it would get lost when fed a larger OData schema.
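A minimal sketch of the constrained approach I have in mind, with a made-up entity name and base URL: let the model emit only the OData query options, whitelist what's allowed, and validate before executing with the caller's own credentials.

```python
# Entity name, fields, and base URL are made up for illustration.
ALLOWED_ENTITY = "MyOrders"              # a view already scoped to the caller's data
ALLOWED_OPTIONS = {"$filter", "$select", "$orderby", "$top"}


def validate_odata(query_string: str) -> bool:
    options = dict(part.split("=", 1) for part in query_string.split("&") if "=" in part)
    return set(options) <= ALLOWED_OPTIONS   # reject $expand, navigation tricks, etc.


llm_output = "$filter=Status eq 'Open'&$top=50"   # pretend this came from the model
if validate_odata(llm_output):
    url = f"https://example.com/odata/{ALLOWED_ENTITY}?{llm_output}"
    print(url)   # issue the request with the caller's own credentials, not a service account
```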
It wasn't clear to me what evaluation method was being used. The chart in the blog says Execution Accuracy, but the numbers used appear to correlate with "Exact Set Match" (comparing on SQL) instead of "Execution With Values" (comparing on result-set values). For example, DIN-SQL + GPT-4 achieves an 85.3% "Execution With Values" score. Is that what is being used here?
Hello, thank you very much for your meticulous comment. The 85.3% accuracy reported in our paper (I'm one of the authors of the DIN-SQL paper) pertains to the test set. However, in the blog post, we are reporting the performance on the development set, which stands at 74.2%.
I think natural language to SQL is a worthwhile goal and will have its use cases, but perhaps fine-tuning a model on a semantic layer and having it talk to a semantic data modeling API will yield a more useful/practical result. You may even find the semantic layer makes a natural language approach moot in many cases.
In Zillion* I added an experimental feature that uses OpenAI to form a report API call from natural language. Fine tuning (vs prompt tuning) this on the semantic model with gpt-3.5 would probably yield some notable improvement in abilities, as I view it as more of a toy feature at the moment.
I think there are probably a million posts and papers talking about text to SQL as it's a problem that's been tackled for a long time now...just with more vigor recently now that LLMs provide an improved/shiny approach. I think many companies now offer or consider themselves semantic layers in some fashion (google "semantic layer" + tableau, cube.dev, dbt...). I have not come across any blog or paper comparing text-to-sql directly to text-semantic-api. Certainly the latter is not as flexible as the former in that it can't produce every possible SQL query. In practice (for me at least) that's often not necessary, or is necessary rarely so you just do your direct SQL as needed and use your semantic layer >95% of the time.
From our good friend GPT-4 who can summarize it better than I can...
A semantic data layer is an abstraction layer that sits between the raw data stored in databases and the applications or analytics tools that use this data. The purpose of this layer is to provide a consistent, business-friendly interface to data that might be stored in a variety of formats, tables, or systems. By doing so, the semantic data layer helps in simplifying complex data structures into more meaningful, understandable models.
Key Features:
Unified View: It provides a unified view of data from multiple sources, making it easier for users to access and understand the data without having to know the underlying structure or complexity.
Data Governance: It can enforce business rules, security policies, and data quality measures, ensuring that data is consistent, compliant, and accessible only by authorized users.
Flexibility: A semantic layer is often designed to be flexible, allowing business users to adapt to changes in the data model without requiring changes to the applications themselves.
Query Simplification: The layer simplifies the process of querying data by providing a more user-friendly way of accessing and manipulating data, often through a drag-and-drop interface or other graphical tools.
Decoupling: It decouples application development from data source changes. When underlying data changes, you don't necessarily have to update all the applications that use it; you might only need to update the semantic layer.
Data Integration: It can integrate data from multiple sources, providing a single "source of truth" for business users and applications.
Purpose:
Simplification: Make data more accessible and easier to understand for non-technical users.
Consistency: Ensure that everyone is working from the same definitions and business rules.
Efficiency: Reduce the time and complexity involved in generating reports, analytics, and other data-driven functions.
Data Quality: Help to enforce data quality and governance policies.
Security: Provide a mechanism for enforcing security rules on who can see or modify data.
Adaptability: Enable quicker adaptation to changes in business needs or data structures.
By providing a semantic data layer, organizations can ensure that their data is not only high-quality and secure but also that it can be easily used for making informed decisions. This is particularly valuable in today's data-driven business environment.
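As a sketch of the "text to semantic API" idea (the dimension, metric, and field names are invented, not Zillion's actual API): the LLM fills in a constrained report request, and anything outside the semantic model gets rejected before execution.

```python
# Allowed dimensions/metrics and the report fields are invented for illustration.
ALLOWED_DIMENSIONS = {"partner", "campaign", "date"}
ALLOWED_METRICS = {"revenue", "leads", "sales"}


def validate_report_request(req: dict) -> dict:
    assert set(req.get("dimensions", [])) <= ALLOWED_DIMENSIONS, "unknown dimension"
    assert set(req.get("metrics", [])) <= ALLOWED_METRICS, "unknown metric"
    return req


# Pretend this came back from the LLM for "revenue by partner since August 1":
llm_report_request = {
    "metrics": ["revenue"],
    "dimensions": ["partner"],
    "criteria": [["date", ">=", "2023-08-01"]],
}
report = validate_report_request(llm_report_request)
# ...hand `report` to the semantic layer's report API rather than executing raw SQL
```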
I believe this article underscores the significance and efficacy of fine-tuning for specific tasks. Looking ahead, I envision the integration of fine-tuned models with RAG agents and models, further enhancing overall performance.
Spider isn't anything like the queries that analysts and data scientists write against DBs. I don't think it even has many joins. Do you not have access to a more realistic training set?
Agree that it’s not representative of real world queries, but I think it’s more of a “measure progress against a simple but consistent baseline” and SOTA is still struggling so it’s clearly not ready for real queries. Most of the papers on this topic mention a lack of large volumes of high quality training data… Spider is probably the best public one right now.
I agree that Spider queries are not necessarily representative of the SQL you might see in the wild from real users, but looking at some analysis I did of the dataset around 43% of the queries had joins, and a number had 3, 4, or 5-way joins.
Although Spider is better known in the text-to-SQL world, you're right that BIRD may provide a better testing ground. Comparing against the current leaderboard on that benchmark is on the docket!
The cost per question seems super high. I can't even think of an API where a single call would cost 1 cent. You better have a good pricing model to follow up on this.
I mean yes, I have managed BigQuery jobs that cost thousands of dollars per run, but the actual call is not the thing that is expensive, and I'm in control of what it costs me. It's not exactly the same scenario imo.
I think the focus here isn't necessarily on compute cost. When companies hire data scientists or analysts, they're niche-skilled and expensive. If those people spend 50-60% of their time courting ad-hoc questions from various people in the org, the cost of that employee's time (and the money spent on them doing menial tasks that are a waste of their skillset) is the biggest factor.
This is basically what put me off when trying to use GPT-3.5 for this with SQL LangChain. And you couldn't go to a cheaper OpenAI LLM because then accuracy plummeted. GPT-3.5 is the sweet spot.
Sure, some use cases might work, but it's not going to be a thing that Just Works™ for products, even accuracy issues aside. There's just so much data to feed into each and every prompt, schemas and all. Many of them too if you want to enable joins.
1¢ per minute is fair, as it would take the best translator much more time than that. 1¢ per SQL query seems less fair to me, but we could argue it would also cost more to ask a dev.
Fine-tuned is 1.2¢/1K tokens in and 1.6¢/1K tokens out.
So it'll likely be closer to 2¢ per query depending on what you're doing.
I'm not saying it's not useful, but at 2¢ per query you have to be more "purposeful", as the costs could certainly add up depending on how you use it, compared to 0.2¢.
Most RAG approaches use a vectorDB and embeddings for schema linking. In this case the fine-tuning is handling schema linking and there is no vectorDB.
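For contrast, a minimal sketch of the RAG-style schema linking being described, assuming the OpenAI embeddings API and using illustrative table descriptions:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Illustrative table descriptions; in practice these come from your schema/docs.
tables = {
    "singer": "singer(name, country, age): one row per singer",
    "concert": "concert(concert_id, stadium_id, year): one row per concert",
}


def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])


table_names = list(tables)
table_vecs = embed(list(tables.values()))


def link_schema(question, top_k=1):
    # Cosine similarity between the question and each table description.
    q = embed([question])[0]
    scores = table_vecs @ q / (np.linalg.norm(table_vecs, axis=1) * np.linalg.norm(q))
    return [table_names[i] for i in np.argsort(-scores)[:top_k]]


print(link_schema("How many singers are from France?"))   # likely ['singer']
```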