
Wow, thank you for reading it. I really appreciate that.

I am enamoured with Rockset's converged indexes. They solve the problems of WHERE queries, columnar analytic workloads, and row-based iteration of data.

I think we tend to look at data as being relatively static and don't denormalise as much as we could for performance and data locality, so we put up with slow microservices and distributed systems with lots of IO. Once data is in Postgres, you don't shift it to other machines that often.

I think the next step for distributed systems is key rebalancing and key SCHEDULING. I plan to design a system that shifts data around to create nodes where particular queries are fast because that server holds all the data needed to fulfil the query. Autoshard at Google is interesting too. It requires denormalisation and data synchronization. I am looking at multi-master Postgres or writing my own simple synchronizer, but that would be eventually consistent.

I have also implemented a multiversion concurrency control solution and an early Raft implementation that needs to be wired into a server. I tend to build components and then plug them together. I am looking at Google's Spanner and TrueTime.

I shall look at your solution.

How do you avoid duplicating data and keep only a single copy of it? How is it columnar or indexed by value?




Each of my data objects (called Didgets, short for Data Widgets) that stores tag or column information is a simple key-value store. I store each unique value once and keep links between it and any keys mapped to it.

So if you have a 1 million row table of U.S. customers that has a 'state' column, then each state value (e.g. 'California', 'Iowa', 'Florida', etc.) is only stored once. Each value is reference counted, so if you had 50,000 customers who lived in California, that value would have a reference count of 50,000. There would then be 50,000 links between the 'California' value and their respective keys (in this case the row numbers).

If one of your customers moves from California to Texas, updating that value just decrements the reference count for California and increments the reference count for Texas. Then the row key for that customer is re-linked from the California value to the Texas value.
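
In rough Python, the structure behaves something like this (a simplified toy sketch, not the actual Didget format):

    from collections import defaultdict

    class Column:
        # Toy value-interned column: each unique value is stored once,
        # with a reference count and links to the row keys that use it.
        def __init__(self):
            self.refcount = defaultdict(int)   # value -> how many rows use it
            self.keys_for = defaultdict(set)   # value -> row keys linked to it
            self.value_of = {}                 # row key -> its current value

        def set(self, key, value):
            old = self.value_of.get(key)
            if old is not None:                # re-link: decrement old, increment new
                self.refcount[old] -= 1
                self.keys_for[old].discard(key)
            self.refcount[value] += 1
            self.keys_for[value].add(key)
            self.value_of[key] = value

    state = Column()
    state.set(1, 'California')
    state.set(2, 'California')   # 'California' stored once, refcount == 2
    state.set(2, 'Texas')        # the move: California -> 1, Texas -> 1

The value string itself is never copied per row; only the counts and the links change.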

A query like "SELECT name, state, zip FROM <table> WHERE state LIKE 'T%';" causes the code to find all the values in the state column that start with the letter 'T' (Texas, Tennessee). It then finds all the keys mapped to those two values, loads in the data for the name and zip columns, and finds any values mapped to those keys.
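
Against that toy sketch, the query walk is roughly (just the shape of the work, not the real code):

    # The other projected columns of the same toy table.
    name, zip_code = Column(), Column()
    name.set(2, 'Alice')
    zip_code.set(2, '75001')     # made-up data for row key 2

    # WHERE state LIKE 'T%': scan the distinct values, not the million rows.
    t_values = [v for v, n in state.refcount.items() if n > 0 and v.startswith('T')]

    # Gather every row key linked to any matching value.
    keys = set().union(*(state.keys_for[v] for v in t_values))

    # Project name, state, zip for just those keys, one column at a time.
    rows = [(name.value_of.get(k), state.value_of[k], zip_code.value_of.get(k))
            for k in keys]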

It is incredibly fast, partly because it can be multi-threaded (one thread finds the names mapped to the keys while another finds the zip codes). Analytics are fast too, since all the values are reference counted. It can instantly tell you the top 10 states where your customers live.
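
The top 10 falls straight out of the reference counts in the same toy sketch; no row scan is needed:

    # Top 10 states by customer count: just sort the per-value reference counts.
    top10 = sorted(state.refcount.items(), key=lambda kv: kv[1], reverse=True)[:10]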

Here is a short video showing how fast it can do it compared to the same data set stored in Postgres. https://www.youtube.com/watch?v=OVICKCkWMZE


That's really interesting.

What's also coincidental is that I recently read a Quora post about network-model databases where references are direct. They have the linking problem that CODASYL worked towards solving: you need to update all the links when there are changes. My idea 18, a data synchronization system, is to produce one system to handle that. It could also be used for cache invalidation, which is an interesting problem in itself.

Your solution also reminds me of a graph database, where nodes and edges are explicitly stored, but in your case you are using reference counting, which is new to me.

Thanks for sharing your knowledge on this.

From a logical point of view, explicit storage of links should be more efficient than a hash join or nested loop join. But is there a trade-off in write amplification? Joins in Postgres are materialised at query time but are efficient thanks to B-trees, whereas in your model the data links are all materialised up front. I think links in graph databases such as Neo4j and Dgraph are materialised directly too.
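
To make my question concrete, here is how I picture the two shapes of work (the second half is only my guess at your model):

    import bisect

    # Join resolved at query time (Postgres-style): probe a sorted index per row.
    # A sorted list plus bisect stands in for the B-tree: O(log n) per probe,
    # but a write only has to touch the index entry for the changed row.
    zip_index = sorted([(1, '94016'), (2, '75001'), (3, '60601')])

    def probe(index, key):
        i = bisect.bisect_left(index, (key,))
        return index[i][1] if i < len(index) and index[i][0] == key else None

    joined_at_query_time = [(k, probe(zip_index, k)) for k in (1, 2, 3)]

    # Materialised links: the key -> value mapping is already stored, so the
    # "join" is a direct follow at read time, but every write also has to
    # maintain the stored links, which is where I would expect amplification.
    links = {1: '94016', 2: '75001', 3: '60601'}
    joined_via_links = [(k, links[k]) for k in (1, 2, 3)]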


Updating values may or may not result in links being changed. For example, if you imported data where 'Illinois' was misspelled as 'Ilinois' for the 10,000 customers who lived there, updating the value to the correct spelling might not change any of the links.

All the links are stored within a hashed set of data blocks where similar links are stored near each other. This helps minimize any write amplification. You might be able to update 10,000 links while only needing to write out a few blocks to disk.
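
As a very rough illustration of why that grouping helps (a toy model, nothing like the real block layout):

    # Toy model: links live in fixed-size blocks, with nearby row keys kept in
    # the same block, so a bulk update only dirties a handful of blocks.
    LINKS_PER_BLOCK = 4096

    def block_of(row_key):
        return row_key // LINKS_PER_BLOCK

    # Re-linking 10,000 consecutive rows touches only a few blocks.
    dirty_blocks = {block_of(k) for k in range(20_000, 30_000)}
    print(len(dirty_blocks))   # -> 4 blocks written for 10,000 link updates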



