Actually this is quite a bit different. This is more a columnar-store version of SQLite, basically an embedded OLAP database, which is pretty cool. I'm not aware of other column stores in this niche; most are distributed systems meant for big data and so are much more complicated to set up and manage.
Since LLVM is so slow, you really have to validate whether JIT actually helps your queries (obviously, like for anything... duh). In my case I managed to slow the DB to a crawl with queries that were estimated to be super expensive, but 95% of the plan was never actually executed.
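The failure mode described above follows from how the JIT decision is made: PostgreSQL triggers compilation based on the planner's *estimated* total plan cost (the `jit_above_cost` setting, default 100000), before anything executes. A rough sketch of that threshold check, with the cost number being the only real PostgreSQL value here:

```python
# Sketch of PostgreSQL's JIT trigger: the decision is driven purely by the
# planner's estimated total cost, made before execution starts. A plan whose
# expensive branches are mostly never reached still pays the full
# compilation price up front.

JIT_ABOVE_COST = 100_000  # PostgreSQL's default for jit_above_cost

def should_jit(estimated_total_cost: float) -> bool:
    """Estimate-driven threshold check; actual runtime behavior is ignored."""
    return estimated_total_cost >= JIT_ABOVE_COST

# A plan estimated at 500k cost units gets every expression compiled, even
# if at runtime most of the plan is short-circuited and never executed.
print(should_jit(500_000))  # True: compile everything
print(should_jit(50_000))   # False: stay interpreted
```

This is why an estimate that is wildly above the work actually performed turns JIT from a win into pure overhead.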
Yea, we really need to improve the handling of those cases. I think there are four major angles:
1) I'd hoped to get caching for JITed queries into 13 (or at least the major prerequisite), but that looks like it might miss the mark (job changes are disruptive, even if they end up allowing for more development time). The nicest bit is that the necessary changes also result in significantly better generated code.
2) Background JIT compilation. Right now the JIT compilation happens in the foreground. We really ought to only do the IR generation in the foreground, then do the compilation in the background while continuing with interpreted execution. Only once codegen is done would we redirect to the JITed program (there'd be a bit more overhead during the interpreted phase, rechecking whether to redirect yet, but not much).
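The scheme in point 2 can be sketched with a background thread and a cheap per-call recheck. All names here (`AdaptiveExecutor`, `compile_fn`, `interpret_fn`) are invented for illustration; this is not PostgreSQL code, just the shape of the idea:

```python
# Hypothetical sketch: start interpreted execution immediately, compile in a
# background thread, and switch to the compiled function once it's ready.
import threading

class AdaptiveExecutor:
    def __init__(self, compile_fn, interpret_fn):
        self._interpret = interpret_fn
        self._compiled = None  # set by the background thread when codegen finishes
        # Kick off expensive codegen without blocking query startup.
        self._worker = threading.Thread(target=self._compile, args=(compile_fn,))
        self._worker.start()

    def _compile(self, compile_fn):
        # Stands in for the expensive LLVM optimization + codegen phase.
        self._compiled = compile_fn()

    def evaluate(self, row):
        # The cheap recheck mentioned above: one attribute read per call
        # (in a real engine you'd check once per batch, not per row).
        fn = self._compiled
        return fn(row) if fn is not None else self._interpret(row)

# Usage: rows are processed from the start; early rows may take the
# interpreted path, later rows the JITed one, with identical results.
ex = AdaptiveExecutor(compile_fn=lambda: (lambda r: r * 2),
                      interpret_fn=lambda r: r + r)
print(ex.evaluate(21))  # 42 on either path; which path ran depends on timing
```

The key property is that compilation latency never blocks the first row, which directly addresses the "plan was never executed" case: if the query finishes before codegen does, the compiled code is simply discarded.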
3) Improve costing logic. E.g. we don't take the size of the necessary generated code into account at the moment, and we should. The parallel worker count isn't taken into account either.
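To make point 3 concrete, a cost model along these lines would weigh expected execution savings against compilation work, where compilation work scales with the amount of code to generate and multiplies with the worker count (each parallel worker JIT-compiles its copy of the plan independently). Every constant here is a made-up illustrative number, not PostgreSQL's:

```python
# Hypothetical JIT cost model: compare estimated savings from compiled
# execution against total compilation cost across all parallel workers.

def jit_is_worthwhile(est_exec_cost: float,
                      n_expressions: int,
                      n_workers: int,
                      per_expr_compile_cost: float = 50.0,
                      speedup_fraction: float = 0.2) -> bool:
    # Each worker compiles the plan's expressions independently, so the
    # compile bill grows linearly with both code size and worker count.
    compile_cost = n_expressions * per_expr_compile_cost * max(1, n_workers)
    savings = est_exec_cost * speedup_fraction
    return savings > compile_cost

# The same query can be worth JITing serially but not with 8 workers,
# since all 8 would each compile the same 200 expressions.
print(jit_is_worthwhile(100_000, n_expressions=200, n_workers=1))  # True
print(jit_is_worthwhile(100_000, n_expressions=200, n_workers=8))  # False
```

A threshold on estimated total cost alone, as used today, captures neither of those two terms.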
4) Improve optimization pipeline. There are plenty of cases where we don't run beneficial and fairly cheap optimization passes, and plenty of cases where we run really expensive passes that are unlikely to be helpful.
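One way to read point 4 is that the pipeline should be selected per function rather than fixed. A toy sketch of that idea, where the pass names mirror common LLVM passes but the selection heuristic is entirely invented for illustration:

```python
# Hypothetical pass selection: always run cheap, almost-always-profitable
# passes; only run expensive ones when the function is big enough and loopy
# enough for them to plausibly pay off.

CHEAP_PASSES = ["mem2reg", "instcombine", "simplifycfg"]  # cheap, broadly useful
EXPENSIVE_PASSES = ["gvn", "licm", "loop-vectorize"]      # costly, need hot loops

def select_passes(ir_instruction_count: int, has_loops: bool) -> list:
    passes = list(CHEAP_PASSES)
    # Tiny straight-line functions skip the expensive passes entirely;
    # the 500-instruction cutoff is an arbitrary illustrative threshold.
    if has_loops and ir_instruction_count > 500:
        passes += EXPENSIVE_PASSES
    return passes

print(select_passes(50, has_loops=False))
# ['mem2reg', 'instcombine', 'simplifycfg']
print(select_passes(2_000, has_loops=True))
# ['mem2reg', 'instcombine', 'simplifycfg', 'gvn', 'licm', 'loop-vectorize']
```

The point isn't this particular heuristic; it's that a one-size-fits-all pipeline guarantees both kinds of mistakes described above.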