We've been chasing this rabbit since the beginning. It currently seems to be uncatchable for the use cases that would be most valuable to us - writing complex queries we've never seen before. Our use case seems to confound any notion of training or fine-tuning, since the cases we need the most help with are also the ones we have the fewest examples of.
Instead of going for generative, few-shot models, I am starting to look at the other end of the spectrum: Binary classification into deterministic query building.
With ChatGPT, you cannot realistically explain why you got some output in a way that anyone other than an AI/ML expert would find satisfying. With binary classifiers, you can precisely explain how some input resulted in some output in terms that a business person could easily digest - "You mentioned this table so it assumed you wanted to constrain on XYZ. Here's the trace from the classifiers...".
I've proposed a few schemes where you define groups of classifiers for each SQL-building concern: which tables are involved, which columns, whether a join or aggregate is implied, the general business context, etc. Clearly there are holes in this scheme, but in our domain we could plausibly fine-tune our humans to be a little more verbose in their use of the automagic SQL vending machine. Several hours spent training humans is probably a lot cheaper and easier than getting ChatGPT et al. to consistently play by our rules.
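To make that concrete, here's a minimal sketch of what one group of classifiers feeding a deterministic builder might look like. All table/column names, and the classifiers themselves, are hypothetical placeholders, not our actual setup:

```python
# Minimal sketch: groups of binary classifiers feeding a deterministic query builder.
# The classifiers are passed in as plain callables scoring a question between 0 and 1.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Decision:
    concern: str       # e.g. "table:orders"
    fired: bool
    score: float       # classifier confidence, kept for the explanation trace


def build_query(question: str,
                table_clfs: Dict[str, Callable[[str], float]],
                column_clfs: Dict[str, Callable[[str], float]],
                threshold: float = 0.5):
    trace: List[Decision] = []

    tables = []
    for table, clf in table_clfs.items():
        score = clf(question)                       # one binary classifier per table
        trace.append(Decision(f"table:{table}", score >= threshold, score))
        if score >= threshold:
            tables.append(table)

    columns = []
    for column, clf in column_clfs.items():
        score = clf(question)                       # one binary classifier per column
        trace.append(Decision(f"column:{column}", score >= threshold, score))
        if score >= threshold:
            columns.append(column)

    # Deterministic assembly: no generation, just templating over whatever fired.
    sql = f"SELECT {', '.join(columns) or '*'} FROM {' JOIN '.join(tables)}"
    return sql, trace                               # the trace is the explanation
```

The returned trace is what lets you give the business-friendly explanation above: which classifier fired, on what, and with what confidence.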
I think you've fallen into the trap of "AIs don't generalize, they memorize." But they do in fact generalize. The reason ChatGPT is so valuable is precisely because it can help out with situations that have never been seen before, not because it merely unlocks old preexisting knowledge. The fella who saved their dog with ChatGPT comes to mind. https://nypost.com/2023/03/27/chatgpt-saved-my-dogs-life-aft...
I think you have fallen into the trap of mistaking interpolation for generalization.
Working with these models every day, it's clear that they can interpolate between points in latent space and generate sensible answers to unseen questions, but it's also clear that they don't generalize. I've seen far too many examples of models failing to display any sense of generalization to believe otherwise.
That's not to say that interpolation in a rich latent space of text isn't very useful. But it's not the same level of abstraction that comes from true generalization in this space.
>it's also clear that they don't generalize. I've seen far too many examples of models failing to display any sense of generalization to believe otherwise
Failing to generalize on whatever you have in mind is not evidence that the models are incapable of generalization. If you really think so, just be prepared to write off a good chunk of humans as well lol.
Given a latent space with enough dimensions, it's hard to imagine cases of generalization that aren't just interpolation between points in that space (given enough data).
The one possible exception is logical inference, and this problem seems tractable with tool use or programming.
I object on geometric grounds. You can't interpolate outside the convex hull; if you have an outlier in any dimension then you're going to need to extrapolate. That seems to me like a reasonable way to ask for generalisation that isn't interpolation.
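A toy illustration of the geometric point, assuming SciPy is available and using random points as a stand-in for latent vectors: a point outside the convex hull of the training data can't be reached by interpolation.

```python
# Toy illustration: a query point outside the convex hull of the training data
# cannot be reached by interpolation, only by extrapolation.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
train = rng.uniform(0, 1, size=(200, 3))   # stand-in for points in latent space
hull = Delaunay(train)

points = np.array([[0.5, 0.5, 0.5],        # well inside the training data
                   [0.5, 0.5, 5.0]])       # an outlier in just one dimension

# find_simplex returns -1 for points outside the hull.
print(hull.find_simplex(points) >= 0)      # [ True False]: the outlier needs extrapolation
```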
- ChatGPT initially gives the same diagnosis the vets did, Babesiosis
- It notes that Babesiosis may have either been a misdiagnosis or there may be a secondary condition/infection causing the remaining symptoms after the Babesiosis treatment didn't resolve all of them
- It suggests such a hypothetical secondary condition could be IMHA, which the article notes is an extremely common complication of Babesiosis with this specific dog breed
- A quick Google search brings up a fair amount of literature about the association between Babesiosis and IMHA
So in fact this is the opposite of a never before seen situation, ChatGPT was just regurgitating common comorbidities of Babesiosis and the vets in question are terrible at their job.
Are you suggesting that it was commonplace for machine learning models to be able to extrapolate medical case reports into an actual diagnosis? Even being able to read and understand a case report is a minor miracle in extrapolation, and it’s interesting how far the goal posts move.
I have another example of ChatGPT not generalizing but just being a really good statistical model. I needed a solution to a problem you can't find on Google, a variation of another problem that also can't be found on Google. I attempted to obtain around 15 lines of code from ChatGPT that would solve the problem, but it consistently failed to produce the correct solution. I spent a few hours trying different prompts, pointing out its errors and receiving apologies, only for it to generate another incorrect solution while acknowledging its mistake.
Solving out of distribution problems correctly seems almost impossible for it.
GPT 3.5 or 4? Surprisingly, it makes a huge difference. I think a lot of people's impressions are from 3.5, but many startups couldn't have been built on it, whereas with 4 they can.
If it was 4 I’d be curious about the specific problem if you’d be willing to link to the chat.
I had a similar issue with GPT-4: I was looking for a library that, given a grammar and a string, would produce a list of the next valid symbols.
GPT-4 only ever suggested grammar validators, to the point that I had given up and was going to write a grammar generator. So I started looking for the equivalent of ANTLR in Python, and within three searches I found nltk.grammar, which actually solves the original problem.
It's not a new library either, so I'm dumbfounded as to why GPT-4 couldn't make sense of my request.
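For what it's worth, here's a rough sketch of how I'd attempt the "next valid symbols" question with NLTK's chart parser. The toy grammar is made up, and expanding predicted nonterminals down to concrete terminals is left out:

```python
import nltk

# Toy grammar made up purely for illustration.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'the' N
    VP -> V NP
    N  -> 'dog' | 'cat'
    V  -> 'sees'
""")
parser = nltk.ChartParser(grammar)


def next_symbols(tokens):
    chart = parser.chart_parse(tokens)
    expected = set()
    for edge in chart.edges():
        # Incomplete edges that reach the end of the input say what could come next.
        if edge.end() == len(tokens) and not edge.is_complete():
            expected.add(edge.nextsym())
    return expected


print(next_symbols(['the', 'dog', 'sees']))   # e.g. {NP}: an NP is expected next
```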
I tried this to see. It does suggest using validators, though in a way that mostly solves the problem. The failure mode is when the string is valid either as a complete parse or as a prefix that could be extended. It does have a bit of a workaround for that.
No, because if I add a space to the original string I don't get a valid string.
Or rather, given the context, a space is a valid production from the last error, but then I would need to roll back the incomplete "do" terminal and lose that context.
> With ChatGPT, you cannot realistically explain why you got some output in a way that anyone other than an AI/ML expert would find satisfying
IMHO OP is talking about "explainability" of the results, which is notoriously bad for current AI. For certain applications (idk if SQL would be one, but mortgage applications might be) it is required to be able to explain how the computer arrived at the decision.
Yeah, louie.ai's query generation internals feel more like classic program synthesis, where it's more about staged compilation, iterative refinement, etc., and we use GPT-3/GPT-4 for each stage rather than a traditional solver. "Which tables would help this query, which parts should go to text search vs vector search, ...", and we iteratively rewrite and pass along until we reconstitute.
At the same time, we're also moving to make a lot more of that process AI-controlled, just across LLM calls (e.g., LLM-dictated ones), so it's a funny maturity process.
Curious to learn more about your use case. If fine-tuning is only ineffective for your most complex queries (and presumably those are less frequent as well, since you mentioned you have few examples), then couldn't you use fine-tuning to handle the simpler queries (presumably the lion's share) and thus free up excess man hours to focus on the more complex queries? Is there any benefit to AI being able to answer 90% of queries vs 0%?
I tried something similar to this, so I assume the idea is to classify each table as being part of the resulting query or not, and then use the schemas of the selected tables in the prompt for the question. I found it just as useful to give it the full schema and ask it for a list of relevant tables, and certainly more cost effective.
In the end, the one who possesses knowledge will turn it into value by going the most obvious and straightforward way: writing a piece of useful SQL or code straight from their head.
GPT is a tool for learners, and they will keep shooting themselves in the foot until they learn; only the weapon is different now.
Look at the legal system, written in natural language. It is misleading and doesn't work well, so we have to gather thousands of people around courts doing non-deterministic work in order to process it. Natural language is a tool for learning, not for making systems. You have to shrink your vocabulary down to code at some point in order to make systems, and you can do that much faster and better than GPT.
True. The goal would be to write the complex SQL query once, then file it away as golden SQL so that in the future a similarly complex query could be done in natural language.
Are you planning on submitting this model to be evaluated against the Spider holdout set?
Also, wondering if anyone has found research on the inverse of this approach to the problem, i.e., instead of training the model to understand the data, you improve the data to be more understandable to a model? This seems more promising when you are looking at enterprise use cases without much training data. Spider seems like quite a simple dataset compared to the ones I encounter on the job, and LLMs struggle even with those.
In theory one could create domain-specific (or industry-specific) templates for data. However, coming up with a universal structure might be challenging since data is so varied.
Since the issue is often the context, plugging in data dictionaries (and passing those to the LLM) can help.
Yes, we have already submitted the model for evaluation on the Spider holdout test set. While your suggestion is certainly intriguing, implementing a universal solution could be quite challenging, as it would heavily depend on the specifics of the dataset.
I don’t think it’s necessarily about a “universal” solution, just “better”: make the column names more verbose, change numeric enums to text ones, disambiguate column names, etc. One of the Spider datasets has a stadium table where one of the column names is “average”, which means average capacity, but it’s super ambiguous. If you asked an LLM to “make these table columns more verbose” I bet it would call that “average_capacity”, and all of a sudden some NLQ queries that confused the function and the column name would start to work.
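A rough sketch of that renaming pass using the OpenAI chat API; the model choice, prompt wording, and DDL here are just placeholders:

```python
# Sketch: ask an LLM to propose more verbose, unambiguous column names.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

# Hypothetical DDL for the Spider stadium example mentioned above.
ddl = "CREATE TABLE stadium (stadium_id INT, name TEXT, capacity INT, average INT)"

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "Rename ambiguous columns to verbose, self-describing names. "
                    "Return one 'old_name -> new_name' line per column you rename."},
        {"role": "user", "content": ddl},
    ],
)
print(resp.choices[0].message.content)   # hopefully something like: average -> average_capacity
```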
I often do complicated reports in SQL or complicated transformations in SQL for system migrations. To really write a query and 'get it right' you usually need insider knowledge that you can't glean from the column names.
I see in their training set they've got comments about columns too. e.g.
"Country" TEXT NOT NULL, - country where the singer born
But that's still not enough.
You also need a bunch of information about the real business that the data is describing. And you also need to analyse the whole database - is that field actually used? What are the common values for this picklist? What does that status actually mean in terms of business? If there are two of those rows that match the join but I want to avoid duplicates, which one do I take? - the newest, or the one with a certain status? etc.
Also, with complicated queries or reports, there's quite a big element of "do the results look roughly as I expect?" - e.g. does the total income for last month look about right? Then you might do a detailed reconciliation, find out your total is 1% off due to not accounting for refunds, and then you go and adjust your query and try again.
> You also need a bunch of information about the real business that the data is describing.
While the article focuses on finetuning GPT-3.5-turbo, how you use the text-to-SQL engine within the architecture of your overall solution is for you to decide. Providing this business context from vectorized context stores in the actual prompt would be a step in the right direction.
Why don't you collect a sample of the data so it knows what to expect, then? Include 10 handpicked rows in your prompt and it should be able to pick up the patterns. If you find it doesn't handle something, append that row and restart.
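Something like this, as a minimal sketch; the schema and rows are made up:

```python
# Made-up schema and rows; the point is just to show the model real value formats.
schema = "transactions(txn_date DATE, txn_type TEXT, amount NUMERIC)"
sample_rows = [
    ("2023-08-01", "payment", 49.00),
    ("2023-08-03", "refund", -19.99),
    # ... up to ~10 handpicked rows covering the edge cases you care about
]

prompt = (
    f"Schema: {schema}\n"
    "Sample rows:\n"
    + "\n".join(str(row) for row in sample_rows)
    + "\nQuestion: total income last month, excluding refunds.\nSQL:"
)
print(prompt)
```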
So if they're doing 50,000,000 tokens to do the fine-tune, at $0.008/1K training tokens and 2 epochs, that's about $800 to do the fine-tune. Not prohibitive from a business standpoint.
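Quick sanity check on the arithmetic, using the numbers quoted above:

```python
# Back-of-the-envelope fine-tuning cost at the quoted training price.
tokens = 50_000_000
price_per_1k = 0.008   # USD per 1K training tokens
epochs = 2
print(tokens / 1000 * price_per_1k * epochs)   # 800.0
```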
A tutorial on how to fine-tune a GPT3.5 model for Natural Language to SQL tasks and a comparison of its performance vs Retrieval Augmented Generation.
Based on the results, fine-tuning can match and outperform RAG (the approach matches the state of the art on accuracy while being far faster and cheaper). The big challenge for fine-tuning tasks like this is building the training datasets. Things should get even more interesting when GPT-4 is opened for fine-tuning.
This brings up a question I've had about OpenAI's fine-tuning APIs.
The training format is a series of question/answer pairs: i.e., what we might think of as "supervised" learning. It can be challenging to build good data sets for this scenario.
But transformer-based models actually don't require supervised learning: the bulk training of the GPT family is to just throw masses of text at it and have it do next-token prediction.
What's going on with the difference? When I "fine-tune" GPT-3.5 turbo, am I actually training the transformer, or the RLHF model that sits on top? Or both?
More to the point, is there any way to fine-tune OpenAI's models in an "unsupervised" fashion? I.e., if I want to teach a model SQL, do I need a curated data set of question/answer pairs, or can I just dump in a bunch of schemas and SQL queries that will in theory make the model "better at SQL" in a generic way?
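For reference, the gpt-3.5-turbo fine-tuning endpoint expects supervised chat examples as JSONL, so even "just schemas and queries" end up framed as prompt/completion pairs. A minimal sketch, with illustrative contents:

```python
import json

# One illustrative training example in the chat fine-tuning JSONL format.
example = {
    "messages": [
        {"role": "system", "content": "You translate questions into SQL for the given schema."},
        {"role": "user", "content": "Schema: singer(name TEXT, country TEXT)\n"
                                    "Question: singers from France"},
        {"role": "assistant", "content": "SELECT name FROM singer WHERE country = 'France'"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```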
My understanding is that there’s no RLHF model sitting on top of ChatGPT. There’s a separate reward model during training, but that reward model is just used to fine-tune the decoder model. This is sort of confusing in the literature, because they will switch from calling it a “language model” to a “policy”, once they’ve used PPO, but if you trace the citations back to [1], you can see that the policy is initialized with a pretrained language model (in that case, GPT-2).
What I don’t know is how they’re applying RLHF after user fine-tuning. Are they redoing the PPO with the original reward model after tuning on your input? Are they just letting it slide, and hoping that the fine-tuning doesn’t cause the model to forget the RLHF? It’s unclear from what I’ve read.
I think natural language to SQL may not be as great as it sounds in many real-world applications. You often have permissions, tenants, rules about who-can-see-what. I love the idea of letting users use natural language to query and possibly even bulk-update data. But if the app has rules like those, this would just be a SQL injection vulnerability. You could possibly limit it - just allow query over certain views or something - but the risk of the GPT crafting a dangerous query is still high.
What would be more useful IMO is natural language to OData, GraphQL, and OpenAPI/Swagger. Then you could let users do ad-hoc queries but only against data they are allowed to see. I did a PoC using GPT-3 to query OData and it was pretty fun, but it did occasionally get the query wrong. I also ran into the context window issue: it would get lost when fed a larger OData schema.
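A minimal sketch of the constrained approach I have in mind, with a made-up entity name and base URL: let the model emit only the OData query options, whitelist what's allowed, and validate before executing with the caller's own credentials.

```python
# Entity name, fields, and base URL are made up for illustration.
ALLOWED_ENTITY = "MyOrders"              # a view already scoped to the caller's data
ALLOWED_OPTIONS = {"$filter", "$select", "$orderby", "$top"}


def validate_odata(query_string: str) -> bool:
    options = dict(part.split("=", 1) for part in query_string.split("&") if "=" in part)
    return set(options) <= ALLOWED_OPTIONS   # reject $expand, navigation tricks, etc.


llm_output = "$filter=Status eq 'Open'&$top=50"   # pretend this came from the model
if validate_odata(llm_output):
    url = f"https://example.com/odata/{ALLOWED_ENTITY}?{llm_output}"
    print(url)   # issue the request with the caller's own credentials, not a service account
```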
It wasn't clear to me what evaluation method was being used. The chart in the blog says Execution Accuracy, but the numbers used appear to correlate with "Exact Set Match" (comparing on SQL) instead of "Execution With Values" (comparing on result-set values). For example, DIN-SQL + GPT-4 achieves an 85.3% "Execution With Values" score. Is that what is being used here?
Hello, thank you very much for your meticulous comment. The 85.3% accuracy reported in our paper (I'm one of the authors of the DIN-SQL paper) pertains to the test set. However, in the blog post, we are reporting the performance on the development set, which stands at 74.2%.
I think natural language to SQL is a worthwhile goal and will have its use cases, but perhaps fine-tuning a model on a semantic layer and having it talk to a semantic data modeling API will yield a more useful/practical result. You may even find the semantic layer makes a natural language approach moot in many cases.
In Zillion* I added an experimental feature that uses OpenAI to form a report API call from natural language. Fine tuning (vs prompt tuning) this on the semantic model with gpt-3.5 would probably yield some notable improvement in abilities, as I view it as more of a toy feature at the moment.
I think there are probably a million posts and papers talking about text to SQL as it's a problem that's been tackled for a long time now...just with more vigor recently now that LLMs provide an improved/shiny approach. I think many companies now offer or consider themselves semantic layers in some fashion (google "semantic layer" + tableau, cube.dev, dbt...). I have not come across any blog or paper comparing text-to-sql directly to text-semantic-api. Certainly the latter is not as flexible as the former in that it can't produce every possible SQL query. In practice (for me at least) that's often not necessary, or is necessary rarely so you just do your direct SQL as needed and use your semantic layer >95% of the time.
From our good friend GPT-4 who can summarize it better than I can...
A semantic data layer is an abstraction layer that sits between the raw data stored in databases and the applications or analytics tools that use this data. The purpose of this layer is to provide a consistent, business-friendly interface to data that might be stored in a variety of formats, tables, or systems. By doing so, the semantic data layer helps in simplifying complex data structures into more meaningful, understandable models.
Key Features:
Unified View: It provides a unified view of data from multiple sources, making it easier for users to access and understand the data without having to know the underlying structure or complexity.
Data Governance: It can enforce business rules, security policies, and data quality measures, ensuring that data is consistent, compliant, and accessible only by authorized users.
Flexibility: A semantic layer is often designed to be flexible, allowing business users to adapt to changes in the data model without requiring changes to the applications themselves.
Query Simplification: The layer simplifies the process of querying data by providing a more user-friendly way of accessing and manipulating data, often through a drag-and-drop interface or other graphical tools.
Decoupling: It decouples application development from data source changes. When underlying data changes, you don't necessarily have to update all the applications that use it; you might only need to update the semantic layer.
Data Integration: It can integrate data from multiple sources, providing a single "source of truth" for business users and applications.
Purpose:
Simplification: Make data more accessible and easier to understand for non-technical users.
Consistency: Ensure that everyone is working from the same definitions and business rules.
Efficiency: Reduce the time and complexity involved in generating reports, analytics, and other data-driven functions.
Data Quality: Help to enforce data quality and governance policies.
Security: Provide a mechanism for enforcing security rules on who can see or modify data.
Adaptability: Enable quicker adaptation to changes in business needs or data structures.
By providing a semantic data layer, organizations can ensure that their data is not only high-quality and secure but also that it can be easily used for making informed decisions. This is particularly valuable in today's data-driven business environment.
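As a sketch of the "text to semantic API" idea (the dimension, metric, and field names are invented, not Zillion's actual API): the LLM fills in a constrained report request, and anything outside the semantic model gets rejected before execution.

```python
# Allowed dimensions/metrics and the report fields are invented for illustration.
ALLOWED_DIMENSIONS = {"partner", "campaign", "date"}
ALLOWED_METRICS = {"revenue", "leads", "sales"}


def validate_report_request(req: dict) -> dict:
    assert set(req.get("dimensions", [])) <= ALLOWED_DIMENSIONS, "unknown dimension"
    assert set(req.get("metrics", [])) <= ALLOWED_METRICS, "unknown metric"
    return req


# Pretend this came back from the LLM for "revenue by partner since August 1":
llm_report_request = {
    "metrics": ["revenue"],
    "dimensions": ["partner"],
    "criteria": [["date", ">=", "2023-08-01"]],
}
report = validate_report_request(llm_report_request)
# ...hand `report` to the semantic layer's report API rather than executing raw SQL
```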
I believe this article underscores the significance and efficacy of fine-tuning for specific tasks. Looking ahead, I envision the integration of fine-tuned models with RAG agents and models, further enhancing overall performance.
Spider isn't anything like the queries that analysts and data scientists write against DBs. I don't think it even has many joins. Do you not have access to a more realistic training set?
Agree that it’s not representative of real world queries, but I think it’s more of a “measure progress against a simple but consistent baseline” and SOTA is still struggling so it’s clearly not ready for real queries. Most of the papers on this topic mention a lack of large volumes of high quality training data… Spider is probably the best public one right now.
I agree that Spider queries are not necessarily representative of the SQL you might see in the wild from real users, but looking at some analysis I did of the dataset around 43% of the queries had joins, and a number had 3, 4, or 5-way joins.
Although Spider is better known in the text-to-SQL world, you're right that BIRD may provide a better testing ground. Comparing against the current leaderboard on that benchmark is on the docket!
The cost per question seems super high. I can't even think of an API where a single call would cost 1 cent. You better have a good pricing model to follow up on this.
I mean yes, I have managed BigQuery jobs that cost thousands of dollars per run, but the actual call is not the thing that is expensive, and I'm in control of what it costs me. It's not exactly the same scenario imo.
I think the focus here isn't necessarily on compute cost. When companies hire data scientists or analysts, they're niche-skilled and expensive. If those people spend 50-60% of their time courting ad-hoc questions from various people in the org, the cost of that employee's time (and the money spent on them doing menial tasks that are a waste of their skillset) is the biggest factor.
This is basically what put me off when trying to use GPT-3.5 for this with SQL LangChain. And you couldn't go to a cheaper OpenAI LLM because then accuracy plummeted. GPT-3.5 is the sweet spot.
Sure, some use cases might work, but it's not going to be a thing that Just Works™ for products, even accuracy issues aside. There's just so much data to feed into each and every prompt, schemas and all. Many of them too if you want to enable joins.
1¢ per minute is fair, as it would take the best translator much more time than that. 1¢ per SQL query seems less fair to me, but we could argue it would also cost more to ask a dev.
Fine-tuned is 1.2¢/1K tokens in and 1.6¢/1K tokens out.
So it'll likely be closer to 2¢ per query depending on what you're doing.
I'm not saying it's not useful, but at 2¢ per query you have to be more "purposeful", as the costs could certainly add up depending on how you use it, compared to 0.2¢.
Most RAG approaches use a vectorDB and embeddings for schema linking. In this case the fine-tuning is handling schema linking and there is no vectorDB.
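For contrast, a minimal sketch of the RAG-style schema linking being described, assuming the OpenAI embeddings API and using illustrative table descriptions:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Illustrative table descriptions; in practice these come from your schema/docs.
tables = {
    "singer": "singer(name, country, age): one row per singer",
    "concert": "concert(concert_id, stadium_id, year): one row per concert",
}


def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])


table_names = list(tables)
table_vecs = embed(list(tables.values()))


def link_schema(question, top_k=1):
    # Cosine similarity between the question and each table description.
    q = embed([question])[0]
    scores = table_vecs @ q / (np.linalg.norm(table_vecs, axis=1) * np.linalg.norm(q))
    return [table_names[i] for i in np.argsort(-scores)[:top_k]]


print(link_schema("How many singers are from France?"))   # likely ['singer']
```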