Implicit In-order Forests: Zooming a billion trace events at 60fps

Mr_P · on May 1, 2021

If generalized to 2D (and over the unit sphere), you'd get Google's S2 library (https://s2geometry.io/).

In 3D, the same can be done with morton codes (or a 3D hilbert curve) to build an implicit octree. Some raytracing systems use this for fast BVH construction.

In either case, this approach differs from mipmaps in that the underlying data can be sparse, and is simply stored in a sorted array.

tracyhenry · on May 1, 2021

We built a general 2D zooming visualization system which supports data in PostgreSQL: https://github.com/tracyhenry/kyrix. Under the hood, it uses PostgreSQL quad tree index to fetch data on demand.

We have also tested with Citus, which helped us scaled to billions of objects. Demo: https://youtu.be/ccES97ni_vI

londons_explore · on April 30, 2021

Isn't this the idea behind MIPMAPS in computer graphics?

In the tracing world, I believe the opensource pulseview/sigrok does this. It makes the UI very responsive even with gigabytes of data. I just wish it also integrated data compression of some kind so I could fit more trace than I have RAM (it can't be all that hard to intelligently compress a super repetitive signal in a way which still allows this fast zooming and scrolling - replacing some tree nodes with LZ77 style backreferences ought to do the trick)

trishume · on April 30, 2021

Yah mipmaps are an N-dimensional generalization of the breadth first layout of implicit aggregation, where the aggregation function is averaging.

It may in theory be possible to generalize the in-order layout I talk about in a similar way, but I'm not sure it would be that useful, maybe it would allow you to append rows or columns to your mipmapped image more easily, but I don't know of any applications where that's useful.

teej · on April 30, 2021

Check out Facebook’s Gorilla paper (https://www.vldb.org/pvldb/vol8/p1816-teller.pdf) which features some clever compression techniques for time series data generated at regular intervals.

Veserv · on April 30, 2021

What do you mean by needing to fit more trace than you have RAM? Are you saying those systems require you to load the entire raw trace into RAM before constructing the relevant data structures for efficient zooming? If so, that seems very limited since you can just use a storage-friendly data structure that also supports efficient zoom so you can put the bulk of the data on disk or across the network and just use your RAM as a cache for the full data structure. This is no worse than a RAM-only solution and allows you to operate on effectively arbitrarily sized traces for when you need to handle those hundred TB traces.

londons_explore · on May 1, 2021

The use case is "I have binary data coming in at 5 Gbytes/sec from a logic analyzer, and I normally want say 30 seconds of trace"

That data is too big for RAM, and coming in too fast to write to disk. But the data is usually very compressible. Some channels will be a million '0's. Other channels will be a clock signal of '0 1 0 1 0 1' forever. Other signals will be more complex, but probably still be near repeats of patterns.

I just want to compress it in realtime in such a way the GUI tools I want to use can still work. They need to do fast-ish random access on the data, and answer questions about chunks of the data (eg. is all the data between time X and time Y all 0's).

Veserv · on May 1, 2021

Ah, I see, it is a data ingress problem not a UI/UX/visualization problem and you are in that awkward gap between RAM and disk bandwidths. So, you want a scheme to reduce ingress bandwidth which would allow you to store more data since you can not offload it anywhere else. From this perspective you would want a log optimized for useful entropy per unit storage which you can translate into a different data structure offline later on that is optimized for UI/UX. The most likely scenarios for that would require a very low level integration with the logging infrastructure itself and be either, as you say, a lossless streaming compression or some kind of lossy aggregation/compression, probably domain-specific, that retains the important data. The key problem here being creating an algorithm fast enough to operate at those data rates given that you get at most a few clocks per byte and no way of pushing the data to more parallel compute otherwise you would be able to just push it into storage.

benlivengood · on May 1, 2021

https://github.com/google/snappy is ~250 MB/second per core and you can compress in chunks of whatever size is convenient for your GUI to uncompress and render.

danielheath · on May 1, 2021

> too big for RAM

64gb ram sticks are about $700 AUD, which brings the problem into range via RAM.

> coming in too fast to write to disk

Good PCIe4 NVMe drives are pretty fast these days. Samsung 980 PRO is under $300 for 1tb, and claims it can write 5gb/sec.

Even if you only get 2 gb/sec, that brings your RAM requirements down to 90gb, which is not an exotic configuration for a high-performance desktop.

I'm eternally astonished by how far hardware has come. Of course, preprocessing the stream on a cheap FPGA would be more elegant, but it's pretty amazing that consumer hardware could do it at all.

cortesoft · on May 1, 2021

Ok, then imagine you are getting data at 10/g a second.

touisteur · on May 1, 2021

Of from a 200g adapter (or 4). Connect-X 6 EN. Network (LAN at least) almost always has at least 1 order of magnitude over storage... Oh yes I could put in a 10-disk nvme raid array and a liquid cooling system for all this. But then why not compress on the fly if I can anyway...

londons_explore · on May 1, 2021

My current solution for anyone reading is "capture just 0.5 seconds of trace, and try and time that 0.5 seconds to contain whatever it is I want to see"

formerly_proven · on May 1, 2021

blosc comes to mind.

quotemstr · on April 30, 2021

I don't like it when programs rely on VM overcommit in such a way that they break on other systems. Even if you don't care about Windows (which is a strict accounting no-overcommit system), you should care about Linux, which can be configured to do strict accounting the way Windows does.

If you want to use the big address-space carve-out trick, you can, but the right way to do it is to PROT_NONE the parts of the address space you aren't using and install a signal handler to commit bits of your carveout as they're used.

JoshTriplett · on April 30, 2021

Processing a signal and mapping memory via that signal is less efficient than letting the kernel do it for you.

You'll get better performance by using userfaultfd.

quotemstr · on May 1, 2021

That too, but if you commit in big enough chunks, you amortize the signal handling overhead down to almost nothing.

contravariant · on April 30, 2021

Not sure how they got the 3N space overhead for Fenwick trees. If you do it right then Fenwick trees won't have any space overhead (in fact the conversion from an array to a Fenwick tree can be done in place). Though I suppose they might suffer from catastrophic cancellation if you're not careful.

orlp · on May 1, 2021

I'm the author of that code. I have a semi-detailed writeup here: https://drive.google.com/file/d/17aK9ifP79abRuafXSD24jVVI3Sb...

Within one Fenwick tree we have two sorts of paths starting at any node, one going to a virtual node marked with 0, and one going to a node marked with positive infinity. The interesting property is that these two paths for any two starting nodes always intersect once.

In standard Fenwick trees you use one of those two paths to store data at the nodes visited during inserting, and retrieve that data using the other path at nodes visited during reading. This allows you to compute arbitrary prefix or suffix sums (depending on whether you use the 'up' or the 'down' path for insertion or reading).

However in the data structure I proposed we use both paths for storing and retrieving, allowing us to compute prefix and suffix sums. Finally to answer any sort of query we also need to store the original data in another array. Then any query for range [p, q] can be partitioned into three pieces: [p, m), {m}, (m, q] where m is the unique midpoint where the path going up from p and down from q intersect.

trishume · on April 30, 2021

Standard Fenwick trees can only do prefix sums, which only get you general range queries on things with a subtraction operator, not operations like maximum.

The reddit comment I link contains an implementation that allegedly does arbitrary range queries, but it's nigh-incomprehensible so I can't tell how or why it uses 3 arrays.

contravariant · on May 1, 2021

I see, yeah I can't help you there either. I don't see how a tree based approach would ever need more than twice the amount of space.

rokob · on April 30, 2021

I had some time series data which was append only like this. I started looking into skip lists and realized I might as well make it deterministic where the height of a node is basically popcnt. It ended up looking and behaving a lot like this.

trishume · on April 30, 2021

Cool! I thought about using skip lists a bunch before I settled on this, trying to think of various ways to reduce complexity and memory usage. My best skip lists designs still had some pointer overhead that the implicit approach avoids, but it was pretty small and they seemed reasonably simple. I briefly tried thinking of what an implicit skip list would be, but then just ended up thinking about implicit search trees.

lambda_obrien · on May 1, 2021

Check out http://btrdb.io/ which is a similar idea. It's a distributed database as well. It was designed for storing billions or even trillions of records for electrical data from hundreds of sources at once.

It was written by a couple of friends of mine and others, so I'm not sure how it compares to the article link, as I'm not a databases expert.