timgarmstrong's comments (Hacker News)

This is true, but in the case where files are read-only, just reading directly from the files with fread()/read()/etc. works pretty well. You do have to pay the cost of a system call and a copy from the OS buffer cache into your user-space buffer, but OTOH when the page isn't in the buffer cache, the cost of reading the required data from storage is more predictable than the cost of faulting in all the 4 KB pages you're reading.


Every solution for robust joins involves sorting, hashing, or partitioning of some form, and if you squint hard, partitioning and hashing are very closely related to sorting, so I understand why it seems samey conceptually: it kinda boils down to different schemes for partitioning the data.

The implementations and performance characteristics really aren't close at all.

The hash-based join starts in memory and degrades only under memory pressure, and, if it's a broadcast join, doesn't require data to be redistributed across nodes for each join. It also allows streaming one input through the join without sorting or otherwise buffering it in the join.

The old-style sort-merge join on map reduce requires both inputs to the join to be written to storage, shuffled across nodes and fully sorted even in the best case.


>The old-style sort-merge join on map reduce requires both inputs to the join to be written to storage, shuffled across nodes and fully sorted even in the best case

It isn't a full sort of both inputs. Each node sorts only the data it holds locally.


The theme does come up in some papers - Google's F1 papers for example.

A lot of it is doing the engineering work to make known techniques work - you can have a broadcast join that spills to disk (using a hybrid hash join or similar) and then layer on other techniques to make the spilling more incremental and reduce the penalties from spilling (e.g. bloom filters). It's just an order-of-magnitude increase in complexity to go from a simple broadcast join to a robust one.


If I were running an open source software business, I would simply provide long-term support updates for free and also have lots of money to fund ongoing development.


If you're using open source software and not paying anyone, sometimes shit happens and you will be surprised or disappointed and have no recourse. Even if everyone starts off with the best of intentions.

We could debate forever about whether fault lies with projects overpromising, or users having unrealistic expectations, or whatever else, but I don't think that changes the situation.

If you are paying someone for the software/support, shit still happens, but you have a relationship and ways to get recourse.


shared_ptrs, if misused, can cause a lot of memory management issues, because you end up with a web of object lifetime dependencies that is hard to reason about. I think this is a severely underrated problem in C++, particularly for newer developers. shared_ptr is extremely powerful, but it makes it easy to gloss over issues of object lifetimes and ownership, when in fact those issues are still very important for building robust systems.

In the code bases I've worked on, we have pushed hard to have unique ownership wherever possible and only use shared_ptr where there was a clear need. You can still have a huge object graph kept alive by a single unique_ptr, but it happens less often and it's easier to trace back and fix.

A nice thing about the handle approach is that it makes it a lot harder to build up these object graphs or in general to implement anything without being explicit about object lifetime.

I've seen parts of the handle/object pool approach misapplied and cause more trouble than it's worth, though. It's a good idea for self-contained subsystems where there are a limited number of object types that you would apply this to. I don't think it scales to 100s or 1000s of distinct types of objects, because then you're going to have headaches dealing with the sheer number of object pools.

I've also seen object pooling be implemented proactively as an "optimisation" to avoid calling malloc() but then become a bottleneck because of lock contention in the object pool.


Being compute-bound is pretty standard for analytic queries (i.e. computing aggregate things over larger data sets). A lot of workloads do have high reuse rates of data so you'll get a lot of data cached in memory, and a lot of the processing is pretty CPU-intensive. Columnar data formats can also achieve very high compression rates, so a relatively small amount of data read off disk turns into a large number of rows. Plus, real queries often have insanely complex expressions (giant case statements, for example), that can burn a lot of compute.

It's very different from an OLTP workload where a query will read 10s or 100s of rows via a btree index.



