This is a distributed SQL engine, not a database. We store no data; you store your data in HDFS, S3, POSIX filesystems, NFS, etc., and we let you query those filesystems directly, in the file formats you already have. You can see the file formats cuDF supports here: https://github.com/rapidsai/cudf/tree/branch-0.9/cpp/src/io
Greatly increased processing capacity. The GPUs we use can simply execute orders of magnitude more instructions per second than a CPU.
Decompression and parsing of formats like CSV and Parquet happen on the GPU, orders of magnitude faster than the best CPU alternatives.
You can take the output of your queries, hand it to machine learning jobs via zero-copy IPC, and get the results back the same way. We are all about interoperability with the RAPIDS ecosystem.
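In RAPIDS the zero-copy hand-off happens in GPU memory, but the core idea can be illustrated with a CPU analogue. This is a minimal sketch using only the Python standard library's `multiprocessing.shared_memory`; the `roundtrip` helper is made up for the example and is not part of any RAPIDS API:

```python
from multiprocessing import shared_memory

def roundtrip(payload: bytes) -> bytes:
    """Place payload in a shared segment, then read it through a second handle.

    Both handles map the same OS-level memory segment, so nothing is copied
    between producer and consumer -- the essence of zero-copy IPC.
    """
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        shm.buf[:len(payload)] = payload
        # A consumer process would attach by name; here we attach in-process.
        peer = shared_memory.SharedMemory(name=shm.name)
        try:
            return bytes(peer.buf[:len(payload)])
        finally:
            peer.close()
    finally:
        shm.close()
        shm.unlink()
```

The GPU version works the same way in spirit: a handle to the device buffer is passed to the ML job, which reads the query result in place instead of serializing it.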
Is there any reason why a SQL source isn't in that list? I'm wondering if there's a way to join SQL sources with file-storage sources; an example of this would be filtering or enrichment operations.
When you say SQL format, do you mean being able to read the output of a JDBC or ODBC driver?
If that's the case, then it's mostly just a matter of time. You are not the first person to ask about this, and now that there are Java bindings in cuDF, this might become a reality in the next few months.
Or do you mean being able to read a database's file format natively?
If this is the case, there are many reasons:
1. Many formats are poorly documented or not documented at all.
2. Even if you decide to read some other DB's format natively, those formats change over time.
3. You have little control over how and where the data is laid out.
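The driver-output route above is essentially the join/enrichment pattern the question asked about: pull rows out of the database through its driver, then join them with rows read from files. A minimal plain-Python sketch, using `sqlite3` as a stand-in for a JDBC/ODBC-accessible database and an in-memory CSV as the file-storage source (the `regions` table, column names, and `enrich_from_sql` helper are all hypothetical, and this is not BlazingSQL code):

```python
import csv
import io
import sqlite3

def enrich_from_sql(csv_text: str) -> list:
    """Enrich CSV rows (file source) with a lookup table from a SQL source."""
    # "SQL source": a lookup table reached through a database driver.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE regions (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO regions VALUES (?, ?)",
                   [(1, "us-east"), (2, "eu-west")])

    # "File source": rows read straight from CSV.
    rows = csv.DictReader(io.StringIO(csv_text))

    # Enrichment: join each file row against the SQL table.
    out = []
    for r in rows:
        row = db.execute("SELECT name FROM regions WHERE id = ?",
                         (int(r["region_id"]),)).fetchone()
        out.append((r["event"], row[0]))
    db.close()
    return out
```

For example, `enrich_from_sql("event,region_id\nlogin,1\nclick,2\n")` returns `[("login", "us-east"), ("click", "eu-west")]`. In the engine itself, the driver output would be materialized as a cuDF table and joined on the GPU like any other table.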
I've read the website, but I couldn't find a hint that the engine is distributed. Even the Spark benchmarks compare a single instance against multiple nodes.
Is it distributed? How do I set it up in a distributed mode?
Does it support nested Parquet (something that even Spark itself struggles to support inside SQL)?
You can try it out yourself here: https://colab.research.google.com/drive/1r7S15Ie33yRw8cmET7_...
Or use the image on Docker Hub: https://hub.docker.com/r/blazingdb/blazingsql/