Hello HN!
I built stanchion to scratch my own itch. I have personal applications running on SQLite where I want to store log and metric data (basically forever) but don't want to add complexity by bringing in a new database or a whole separate server. Some of these applications run on a Raspberry Pi, where storage and compute are limited, but I still want data-warehouse-like capabilities.
I envision stanchion being used in similar scenarios: on phones and resource-limited devices, in applications that are already using SQLite. I know that there are alternatives like DuckDB (which is very cool), but I want stanchion to be "good enough" that it is useful without having to add a whole new database technology.
If you think stanchion may be a good fit for your use case and you are interested in contributing, please test it and provide feedback by opening issues for any bugs, difficulties, or missing features you would need! Ideas are also welcome in this thread or as a GitHub issue. Of course, stars are always appreciated as well. The CONTRIBUTING doc in the repository has more details.
- Dan
The column-oriented data is stored in large BLOBs inside regular SQLite tables. It uses SQLite's incremental BLOB I/O API [0] to read and write that data in a column-oriented way.
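For anyone who hasn't used that API: here's a minimal sketch in C (not stanchion's actual code; the table and column names are made up) of reading a slice of a BLOB cell without pulling the whole value into memory:

    /* Assumes a table like: CREATE TABLE segments (id INTEGER PRIMARY KEY, data BLOB); */
    #include <sqlite3.h>

    static int read_slice(sqlite3 *db, sqlite3_int64 rowid,
                          void *buf, int len, int offset) {
        sqlite3_blob *blob;
        int rc = sqlite3_blob_open(db, "main", "segments", "data",
                                   rowid, 0 /* read-only */, &blob);
        if (rc != SQLITE_OK) return rc;
        /* Read len bytes starting at the given byte offset within the BLOB. */
        rc = sqlite3_blob_read(blob, buf, len, offset);
        sqlite3_blob_close(blob);
        return rc;
    }

Writing works the same way via sqlite3_blob_write, presumably paired with zeroblob() preallocation, since incremental writes can't resize a BLOB.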
However, this project (and other SQLite extensions) will eventually hit a limit with SQLite's virtual table API. When you implement a virtual table, you can apply a number of query optimizations. For example, SQLite will tell your virtual table implementation which WHERE constraints apply to the virtual table, any ORDER BY terms, which columns are SELECT'ed, and other limited information. This allows extension developers to do things like predicate and projection pushdown to make queries faster.
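To make the pushdown point concrete, here's a rough, illustrative xBestIndex sketch in C (the column number, plan number, and cost are made up), showing where that WHERE / ORDER BY / column-usage information arrives:

    /* Called by SQLite during query planning for a virtual table. */
    static int myvtab_best_index(sqlite3_vtab *vtab, sqlite3_index_info *info) {
        (void)vtab;  /* unused in this sketch */
        for (int i = 0; i < info->nConstraint; i++) {
            const struct sqlite3_index_constraint *c = &info->aConstraint[i];
            if (!c->usable) continue;
            /* e.g. a WHERE term like "timestamp >= ?" on column 0 */
            if (c->iColumn == 0 && c->op == SQLITE_INDEX_CONSTRAINT_GE) {
                info->aConstraintUsage[i].argvIndex = 1; /* hand the value to xFilter */
                info->aConstraintUsage[i].omit = 1;      /* the vtab enforces it fully */
                info->idxNum = 1;                        /* our own plan identifier */
            }
        }
        /* info->aOrderBy describes ORDER BY terms; info->colUsed is a bitmask of
         * referenced columns, which is what enables projection pushdown. */
        info->estimatedCost = 1000.0;
        return SQLITE_OK;
    }

xFilter then receives the constraint values and can skip data that can't match, which is where a column-oriented layout can pay off.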
Unfortunately, it doesn't offer many ways to make analytical queries faster. For example, no matter what you do, a `SELECT COUNT(*) FROM my_vtab` will always iterate through every single row in your virtual table to determine the count. There's no "shortcut" to provide top-level counts. The same goes for other aggregate functions like SUM() or AVG(): SQLite will perform a full scan and compute the result itself.
So for this project, while a column-oriented layout could make analytical queries like that much faster, the SQLite API does limit you quite a bit. I'm sure there are workarounds (custom UDFs or exposing other query interfaces), but they would be hefty to add.
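As a sketch of the UDF route (everything here is hypothetical, including the function name and where the count would come from; it only shows the sqlite3_create_function mechanism):

    #include <sqlite3.h>

    /* Hypothetical scalar function returning a precomputed row count,
     * so callers can avoid a full SELECT COUNT(*) scan of the vtab. */
    static void row_count_func(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
        (void)argc; (void)argv;
        /* A real extension would read a counter kept in its shadow/metadata
         * tables; a constant stands in for it here. */
        sqlite3_result_int64(ctx, 123456);
    }

    static int register_funcs(sqlite3 *db) {
        return sqlite3_create_function(db, "stanchion_row_count", 0,
                                       SQLITE_UTF8 | SQLITE_DETERMINISTIC,
                                       NULL, row_count_func, NULL, NULL);
    }

Then something like `SELECT stanchion_row_count()` could answer instantly, but it pushes work onto users that `SELECT COUNT(*)` should ideally handle transparently, which is why it feels hefty.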
That being said, I still love this project! I'd really love to see whether there's a storage size benefit, and I will definitely contribute more when I get a chance. Great job Dan!
[0] https://www.sqlite.org/c3ref/blob_open.html