For an article about LLM+OLAP, it doesn't spend much time on the LLM part. Specifically, it seems like their strategy is to use an LLM to generate a DSL query for an unnamed semantic layer; everything downstream of that is normal warehousing, with the semantic layer handling the actual SQL generation.
I wish it spent more time on how they trained their LLM to reliably generate parsable queries for the semantic layer, and what the accuracy rate was between what the user intended and what they actually got.
I do think the only way an LLM-based analytics tool can succeed is via a semantic layer rather than direct SQL, since database schemas fail to encode a lot of information about the data (e.g. a warehouse might not even know that user.customer_id = customer.id).
Yeah, similar to what you and the other commenter from Definite said, we (Delphi)[0] find semantic layers way better for this kind of work than just going straight to a database/data warehouse.
One thing you really need with LLMs is consistency. Text-to-SQL kind of lets the LLM do whatever it wants - join tables that shouldn't be joined, define aggregates one way in one query and another way in the next.
Because semantic layers define how tables should join, how measures are calculated, etc., people get consistent results from one query to the next, which builds trust in the LLM.
They're kind of an updated version of OLAP cubes if you're familiar with those.
Typically semantic layers sit on top of a data warehouse, let you define metrics using code or a UI, and provide APIs or SQL connectors so that you can query them.
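To make that concrete, here's a toy sketch of the "metrics defined in code, queried through an API" idea. Every name here (METRICS, JOINS, compile_query, the orders/customers tables) is invented for illustration; real semantic layers like Cube are far richer than this.

```python
# Toy semantic layer: metrics and join paths are defined once, and the
# query API only accepts those definitions - so a caller (e.g. an LLM)
# can't invent its own aggregates or join conditions.
METRICS = {
    "revenue": "SUM(orders.amount)",
    "order_count": "COUNT(DISTINCT orders.id)",
}
JOINS = {("orders", "customers"): "orders.customer_id = customers.id"}

def compile_query(metric, group_by=None):
    if metric not in METRICS:
        raise KeyError(f"unknown metric: {metric}")  # no free-form SQL allowed
    select = ([group_by] if group_by else []) + [f"{METRICS[metric]} AS {metric}"]
    sql = f"SELECT {', '.join(select)} FROM orders"
    if group_by and group_by.startswith("customers."):
        # The join condition comes from the layer, not from the caller.
        sql += " JOIN customers ON " + JOINS[("orders", "customers")]
    if group_by:
        sql += f" GROUP BY {group_by}"
    return sql

print(compile_query("revenue", group_by="customers.region"))
```

The point is that "revenue" always compiles to the same SQL, no matter how the question was phrased.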
Agreed, that's exactly what we're doing with Definite[0]. We spin up Cube[1] for all our customers and the results vs. directly generating SQL are much better. Cube has some other really nice out of the box features too (e.g. caching).
Eh, many of them have some way to provide markup even when it's informational only, because a data catalog or dictionary is required to use most large OLAP products.
E.g. Snowflake lets you declare all the foreign keys you want, but does nothing with that info except make it available for you to use.
Sure, some OLAP databases let you add the same metadata that an OLTP database gives you as constraints, especially enterprise ones. A lot still don't, like ClickHouse, afaik.
No OLAP database I know of would let you encode other semantic-layer things like aggregations or metrics, e.g. defining a DAU/MAU metric as "The distinct number of users logged in that day vs the distinct number of users in the 28 days before that day."
Those types of definitions usually live in the semantic layer or BI layer, which an LLM analysis tool would need to solve for.
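For illustration, that DAU/MAU definition could compile down to something like the SQL below - shown here against a toy SQLite table, with the table name, schema, and sample rows all made up for the example:

```python
import sqlite3

# Toy logins table standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    (1, "2024-03-01"), (2, "2024-03-01"), (1, "2024-03-10"),
    (3, "2024-03-15"), (1, "2024-03-29"), (2, "2024-03-29"),
])

def dau_mau(conn, day):
    # DAU: distinct users logged in on `day`.
    dau = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM logins WHERE login_date = ?",
        (day,)).fetchone()[0]
    # Denominator: distinct users in the 28 days *before* `day`,
    # per the definition quoted above.
    mau = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM logins "
        "WHERE login_date >= date(?, '-28 days') AND login_date < ?",
        (day, day)).fetchone()[0]
    return dau / mau if mau else None

print(dau_mau(conn, "2024-03-29"))
```

The subtlety (does the 28-day window include the day itself? calendar days or sessions?) is exactly the kind of decision you want made once in the layer, not per-query by the model.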
From making a few variations on data chatbots in the past year, I found that my favorite / most fun to use ones seem to be more "chain-of-thought" and conversational rather than "retrieval-augmented" style.
Less about one-shotting the answer, and more about showing its work, if it errors, letting it self-correct. Latency goes up, but quality of the entire conversation also goes up, and feels like it builds more trust with the user. Key steps are asking it to "check its work", and watching it work through new code etc. (I open-sourced one version of this: https://github.com/approximatelabs/datadm that can be run entirely locally / privately)
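The self-correction loop described above can be sketched roughly like this; `ask_llm` and `execute` are stand-ins for any chat-completion call and code runner, not the actual datadm implementation:

```python
# Hedged sketch of a "show its work, let it self-correct" loop:
# the model proposes code, we run it, and on failure we feed the
# error back so it can fix its own mistake.
def run_with_self_correction(ask_llm, question, execute, max_retries=3):
    history = [{"role": "user", "content": question}]
    for _ in range(max_retries):
        code = ask_llm(history)                      # model proposes code
        history.append({"role": "assistant", "content": code})
        try:
            result = execute(code)                   # run it
        except Exception as err:
            # Surface the error to the model and loop again.
            history.append({"role": "user",
                            "content": f"That raised: {err!r}. Please fix it."})
            continue
        # A "check your work" verification step could slot in here.
        return result, history
    raise RuntimeError("model failed to produce working code")
```

Latency scales with the number of retries, but the user watches each attempt, which is where the trust-building happens.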
From their article, I'm surprised they got something working well by going through an intermediate DSL -- that's moving even further away from the source material the LLMs are trained on, so it's an entirely new thing to either teach or assume is part of the in-context learning.
All that said, interesting: I'll definitely have to try out tencentmusic/supersonic and see how it feels myself.
Has anyone attempted to use Doris or evaluated it against ClickHouse? I have to admit I never heard about it before. Is it used beyond Tencent-owned companies?
I would really like to see (and work for) a company that is building novel understanding of actual data and schemas with LLMs. Characterizing data and a limited set of transforms for an LLM should produce much more reliable tools than just piping raw text to a non-enhanced LLM. Has anyone seen companies doing this?
It will be difficult because of how organizations work. For example, finance and accounting people only care about shipped sales, because that's when revenue is recognized, whereas marketing and supply chain people think of demand sales (when the order was placed). So you would need something able to interpret the difference depending on the audience, or train the audience to be explicit in their questioning.
Same goes for calendar vs. fiscal year for companies that have different fiscal and calendar begin dates. Something as simple as "2023 YTD" will mean different things depending on the audience within an organization.
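As a tiny illustration of the "2023 YTD" ambiguity: the same as-of date produces different ranges under a calendar year vs. a fiscal year. The 1 February fiscal start below is an invented example:

```python
from datetime import date

def ytd_range(as_of, year_start_month=1, year_start_day=1):
    """Return the (start, end) dates that "YTD" means for this audience."""
    start = date(as_of.year, year_start_month, year_start_day)
    if start > as_of:  # the current (fiscal) year began in the prior calendar year
        start = date(as_of.year - 1, year_start_month, year_start_day)
    return start, as_of

as_of = date(2023, 1, 15)
print(ytd_range(as_of))        # calendar year: starts 2023-01-01
print(ytd_range(as_of, 2, 1))  # fiscal year starting Feb 1: starts 2022-02-01
```

Same question, two different date ranges - which is why the definition has to live somewhere explicit.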
When Veezoo connects to a database / dwh for the first time, an initial Semantic Layer / Knowledge Graph gets built automatically based on the data itself. We try to recognize how columns link to other tables, identify units, and pick up other semantic information, e.g. whether something is a "Location" or a "Country", and so on.
The whole conversational "plain english" querying then operates on top of the semantic layer, ensuring business logic (and other governance topics) are always respected.
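In the spirit of that auto-built layer, here's a toy name-based heuristic for guessing table links (not Veezoo's actual approach - real products use much richer signals like data profiling and value overlap):

```python
import re

def guess_foreign_keys(schema):
    """Guess that a column like "customer_id" links to a table named
    "customer" (or "customers") that has an "id" column.

    schema: {table_name: [column_name, ...]}
    """
    guesses = []
    for table, cols in schema.items():
        for col in cols:
            m = re.fullmatch(r"(\w+)_id", col)
            if not m:
                continue
            target = m.group(1)
            for cand in (target, target + "s"):  # try singular and plural
                if cand in schema and cand != table and "id" in schema[cand]:
                    guesses.append((f"{table}.{col}", f"{cand}.id"))
                    break
    return guesses

schema = {
    "users": ["id", "customer_id", "name"],
    "customer": ["id", "region"],
}
print(guess_foreign_keys(schema))  # recovers users.customer_id = customer.id
```

This is exactly the `user.customer_id = customer.id` kind of link mentioned upthread that the warehouse itself often doesn't know about.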
That's what we're doing with Definite[0]. We spin up Cube[1] for all our customers and the results vs. directly generating SQL are much better. Cube has some other really nice out of the box features too (e.g. caching).
It looks like ClickHouse's competitors are catching up quickly. Particularly StarRocks, which was first a fork of Apache Doris and then a rewrite. It claims a faster query engine with a cost-based optimizer and cross-table joins. I wonder if ClickHouse will release something major soon too.
I was thinking the same. I had a hobby project where I uploaded CSVs to DuckDB and then generated queries with ChatGPT, but building a semantic layer on top of DuckDB sounds much better.
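A minimal version of that setup might look like the following, using sqlite3 as a stdlib stand-in for DuckDB (DuckDB's Python API is very similar, and it can ingest the file directly with read_csv_auto). The CSV contents and metric names are made up; the "semantic layer" here is just a dict of vetted queries the model picks from instead of writing free-form SQL:

```python
import csv
import io
import sqlite3

# Pretend uploaded CSV file.
csv_text = "id,amount\n1,10.5\n2,4.0\n3,7.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (:id, :amount)", rows)

# Tiny "semantic layer": the LLM chooses a named metric; it never
# writes raw SQL against the table.
QUERIES = {
    "total_amount": "SELECT SUM(amount) FROM orders",
    "order_count": "SELECT COUNT(*) FROM orders",
}

def answer(metric):
    return conn.execute(QUERIES[metric]).fetchone()[0]

print(answer("total_amount"))  # 22.0
```

Even this dict-of-queries version gives the consistency benefit people in this thread are describing, since "total_amount" always means the same thing.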
> I wish it spent more time on how they trained their LLM to reliably generate parsable queries for the semantic layer, and what the accuracy rate was between what the user intended and what they actually got.

> I do think the only way an LLM-based analytics tool can succeed is via a semantic layer rather than direct SQL, since database schemas fail to encode a lot of information about the data (e.g. a warehouse might not even know that user.customer_id = customer.id).
Malloy could be an interesting target here.