I understand SQL, and I kind of understand LLVM, but I don't understand why SQL ...

cryptonector · on Feb 24, 2020

Because an RDBMS need not be I/O-bound. It might be compute bound (e.g., if the dataset fits in memory), so optimizing the compute-side can help. RDBMes generally compile queries into a "plan" that is then interpreted to execute it (SQLite3 compiles queries into bytecode, while PostgreSQL compiles them into AST-like tree structures). JITting certain portions of a query plan can help it go faster.

anarazel · on Feb 24, 2020

In fact, I'd say it's extremely common for database workloads to be compute bound on modern hardware.

For one, storage bandwidth has increased massively (my laptop's NVMe drives can do 2 x 3.2GB/s reads, you can get quite a few into even small servers). While memory latency and CPU throughput have not increased to the same degree.

To benefit from the increases in CPU throughput, one needs to take advantage of superscalar execution. Which isn't significantly possible with interpreted execution.

> SQLite3 compiles queries into bytecode, while PostgreSQL compiles them into AST-like tree structures

FWIW, expressions are compiled into bytecode in PostgreSQL as well. While there'd plenty benefit of doing that for query trees as well, there are not quite as much raw execution speed reason for it as there is for expressions (as the individual "steps" are much coarser, so the tree walk overhead is proportionally smaller).

timgarmstrong · on Feb 25, 2020

Being compute-bound is pretty standard for analytic queries (i.e. computing aggregate things over larger data sets). A lot of workloads do have high reuse rates of data so you'll get a lot of data cached in memory, and a lot of the processing is pretty CPU-intensive. Columnar data formats can also achieve very high compression rates, so a relatively small amount of data read off disk turns into a large number of rows. Plus, real queries often have insanely complex expressions (giant case statements, for example), that can burn a lot of compute.

It's very different from a OLTP workload where a query will read 10s or 100s of rows via a btree index.

ww520 · on Feb 24, 2020

This would be good for in-memory database when the IO bottleneck becoming less a bottleneck.

alecco · on Feb 24, 2020

Actually for the case it was on SSD. Most of the access is to compressed columnar data. The main table had 500+ columns of ~140m rows but with very low entropy (e.g. gender, city/district). Each column had its own file. And queries were relatively simple. At 300MB/s (at the time) it was pretty fast. Data was mmap since as in most real life scenarios, the access to the columns had a power law distribution. Most queries could resolve with RAM data (~24GB/s at the time). Data size was over 10x the size of RAM (~128GB). IO is fast if you do mostly sequential access. Note server NVMe today is ~8x faster so the principle still applies.

The main thing here is the simplicity of the architecture and combining HyPer, careful mmap usage, and having SQLite engine do all the other non-critical work. So this way this system could be implemented on time by 1 main developer (rockstar dev, Marcelo) and a tiny little help from me.

(Actual author of the idea/architecture here, not mentioned on the paper, but that's life, haha)

ww520 · on Feb 25, 2020

That's pretty sweet information. For columnar db, LLVM definitely helps as columnar wants to cram as much small type data as possible in memory and runs the CPU over them. I would imagine the condition evaluation in the where-clause would be benefited the most since it's mostly an evaluation of dynamic expression.

blattimwind · on Feb 24, 2020

> Note server NVMe today is ~8x faster so the principle still applies.

But memory isn't. With NVMe the disk-memory-gap has shrunk considerably over previous SSDs and HDD RAIDs.

ajross · on Feb 24, 2020

It's not completely uncommon for SQL code execution, especially in complicated legacy systems, to become a performance bottleneck. You normally think of data access being the resource the server is trying to manage, but occasionally it needs to churn on that data as its querying it (because the computed values can't be stored for some crazy reason). In such situations, it would surely be nice to have SQL code itself able to access low level optimizations.

jessermeyer · on Feb 24, 2020

Most SQL engines implement the query engine as a virtual machine that traverses the result on a row by row basis, under the presumption that I/O is the bottleneck. In other words, this is very slow on modern hardware.

stingraycharles · on Feb 24, 2020

Ok you have to elaborate a bit further here. Does this mean they are actually leveraging llvm to, at runtime, compile queries?

saagarjha · on Feb 24, 2020

Yes, that's what the paper says.

gopalv · on Feb 24, 2020

> I don't understand why SQL on LLVM?

SQL is mostly "do the same thing a billion times", which aligns very closely with generating code for the inner loop, particularly when the inner loop can be generated without branches.

SQL codegen is basically an industry standard for large-scale sql query processing on top of a large-scale column store. I work one of the few engines at this scale which does not do go query stage codegen & instead uses source codegen vectorized primitives instead of query time (Apache Hive, specifically & the only other one in the same style is Google's supersonic engine).

Apache Impala - SQL to LLVM IR (C++)

https://llvm.org/devmtg/2013-11/slides/Wanderman-Milne-Cloud... [PDF]

Gandiva - SQL to LLVM IR (Java)

https://github.com/dremio/gandiva

MemSQL - SQL to MPL to LLVM IR

http://highscalability.com/blog/2016/9/7/code-generation-the...

Greenplum - SQL to LLVM IR

http://engineering.pivotal.io/post/codegen-gpdb-qx/

Postgres 9.4 Vitesse - SQL to LLVM IR

<https://www.postgresql.org/message-id/CAJNt7%3DZ6w5%2BwyeTKK...

Postgres 11 LLVM - SQL to LLVM IR

https://www.postgresql.org/docs/11/jit-reason.html

SparkSQL - SQL to Java source (Janino + javac)

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark...

And as far as I know Redshift's horrible query compilation performance is derived from using C++ codegen which forks gcc inside (not sure why this is so slow though, almost feels like untuned gcc flags or something).

https://docs.aws.amazon.com/redshift/latest/dg/c-query-plann...