> Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size
Really good advice - this is a hard-earned lesson for many folks. I've worked with quite a few data scientists who were brilliant at experimental design but not necessarily experts in the field of comp sci. Their relatively simple Python scripts would run nice and fast initially. As time passed and the organization grew, their scripts would start to run slower and slower as the datasets scaled and swapping to disk started occurring, etc. In some cases they would completely lock up shared machines, taking a good chunk of the team offline for a bit.
Anyway, Daniel Lemire's blog is a fantastic resource. I highly recommend taking a look through his publications and open source contributions. I was able to save my employer a lot of money by building on time series compression algorithms [1] and vectorized implementations [2][3] that he has provided.
> Really good advice - this is a hard-earned lesson for many folks. I've worked with quite a few data scientists who were brilliant at experimental design but not necessarily experts in the field of comp sci. Their relatively simple Python scripts would run nice and fast initially. As time passed and the organization grew,
The problem of "as the organization grew,..." is the problem you want to have. 90% of organizations instead have the problem "As we failed to get any interesting results or any funding, and our contributors started drifting away to other projects,..." and sometimes the reason they have that problem is precisely because the initial contributors were too concerned about doing things "the right way" and weren't satisfied with "relatively simple Python scripts [that] run nice and fast initially".
When your limiting factor is programmer effort, there are two kinds of systems: the kind that runs as slowly as the users and developers can tolerate, and the kind that people don't use because it lacks the functionality they want, functionality that some other system has. Probably a slower one.
Of course you should make things go as fast as you can, and performance can be critical not only for ergonomics but also for the ability to iterate rapidly, and sometimes the performance is the functionality, but when you find yourself facing down the shitty performance of "founder code" that locks up shared machines, don't fall into the trap of wishing that the people had done it right in the first place, before "time passed and the organization grew". In the startup graveyard there are nine other groups who tried to do the same thing, but couldn't quite get it right. Some of them probably had damn fast code, which you aren't looking at, because their organization never grew.
And I say that as someone who spent a fair bit of the last week looking at assembly language and thinking about compiler optimization tradeoffs.
The flip side is that sometimes unnecessary optimizations can slow progress towards the actual goal, especially if the data transformations are sophisticated math.
I have run into this problem more than once. Usually I am not creating the initial data, but am importing it. Usually I have ignored optimizations initially and just loaded and processed the data. Then I go back and hack on the subroutines loading the data and splitting it into records. I was the only programmer working on both projects though.
Once I did it with a large RDF (XML) file that was a list of books. Initially I loaded and processed it in Perl with XML::Simple, but XML::Twig let me deal with things one record at a time.
The other time I had a large Java linked list holding objects, where I grabbed the head, processed it, and then shrunk the list (removeFirst()) as I proceeded (I probably could have optimized the loading even more).
As a one-person project, punting on the big load optimization worked for me, as I had a working program even before optimizing. Had it been a larger project, the best time to optimize might have differed.
In the context of high-performance, revenue-sensitive data transformations, it'd be interesting to see how reusable free/open-source components would fare, especially with regard to licensing.
I can see MIT/permissive licenses being less attractive than usual in an environment like that, because the author might perceive reduced chance of reaping rewards (versus aiding competitors) from publishing components.
If improvements led to greater-than-zero reward for publishers and consumers alike, though (as enforced by copyleft approaches), then perhaps the results would be different. All conjecture, really.
I didn't find the solutions to be all that helpful; it was essentially "pipeline your processing." If all I'm doing is a programmatic "cat | grep | cut", that's fine. It gets messy when I want to sort the data or join it with another medium-large dataset, and this is why people load into memory in the first place. Is there a sane path where I'm not immediately jumping to Hadoop or Presto? Maybe I can offload the join to SQLite?
Yes! If your data is too big to comfortably manipulate in memory with R/Python/your language of choice, and the data is (or can be made to be) tabular, the next best place for it is a relational database. If you don't have an existing server that's well-suited, use SQLite or DuckDB, the "SQLite of analytics". Definitely use a columnar database if you're going to be doing complex queries and joins.
This is battle-worn experience and I will absolutely die on this hill. Relational databases JUST WORK.
The only exception is when you need to do complex ML on big data, in which case, take your pick from Apache Spark, Dask, or some similar framework. Hopefully these will be increasingly supplemented by in-database ML going forward, so that there is no reason to ever leave the database.
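To make the "offload the join to SQLite" route concrete, here's a minimal sketch using only Python's standard library (the table layout and file names are made up for illustration):
    import csv, sqlite3
    con = sqlite3.connect("scratch.db")  # on disk, so memory use stays flat
    con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
    con.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
    def load(table, path, ncols):
        # Stream the CSV straight into the database, one row at a time.
        with open(path, newline="") as f:
            rows = csv.reader(f)
            next(rows)  # skip the header line
            con.executemany("INSERT INTO %s VALUES (%s)" % (table, ",".join("?" * ncols)), rows)
    load("orders", "orders.csv", 3)
    load("customers", "customers.csv", 2)
    con.execute("CREATE INDEX idx_orders_cust ON orders(customer_id)")
    con.commit()
    # The join, grouping, and sort all happen inside SQLite, not in Python's memory.
    query = """SELECT c.region, SUM(o.total) AS revenue
               FROM orders o JOIN customers c USING (customer_id)
               GROUP BY c.region ORDER BY revenue DESC"""
    for region, revenue in con.execute(query):
        print(region, revenue)
The same shape works nearly verbatim with DuckDB if you want columnar performance for the analytical queries.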
The best thing to do when it comes to sorting data is to not do that. Sorts are frequently used in merges or to find some top-n / bottom-n set of instances.
If you've got to sort (or writing an alternative is too time-consuming), do it after eliminating as much data (variables, records) as possible.
Hash tables (or other comparable tools such as btrees or Bloom filters) can be used for dataset comparisons or key-value lookups.
Arrays can be used to keep a running count of top-n records (this may need to be sorted on insert, though comparing largest/smallest entry can mitigate this).
Random sampling is very often as good as complete processing, especially where the goal is to find the general characteristics of a dataset rather than an exhaustive view or precise tabulation. Even in accounting and financial operations, sampling methods may be acceptable (see for example, mechanical copyright royalties on performance of recorded music broadcasts, or advertising and circulation rates in publishing). It's fairly staggering how clearly a small sample (as few as 30 records, often only a few hundred or thousand) can describe an immense dataset (millions or billions of entries) with reasonable accuracy, so long as it is randomly selected. And if the first sample seems off, try pulling a few others and seeing how they compare. (Effectively: Monte Carlo method.)
The place where sampling methods break down most is in iterative-calculation forecasts (e.g., weather modeling), where larger aggregations (typically measured as datapoints over both area and time) greatly reduce accuracy. Though even with very precise reporting density, today's weather forecasts tend to be limited to 7--10 days. Sometimes randomness simply wins.
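Going back to the sampling point: if you want a uniform random sample in a single streaming pass without knowing the dataset size up front, reservoir sampling does it in constant memory. A small Python sketch (generic, not tied to any particular dataset):
    import random
    def reservoir_sample(records, k, seed=None):
        # Algorithm R: keep a uniform random sample of k records from a stream
        # of unknown length, using O(k) memory and a single pass.
        rng = random.Random(seed)
        sample = []
        for i, rec in enumerate(records):
            if i < k:
                sample.append(rec)
            else:
                j = rng.randrange(i + 1)  # uniform index in [0, i]
                if j < k:
                    sample[j] = rec
        return sample
    # e.g. a few hundred random lines from a file of millions, memory stays flat:
    # sample = reservoir_sample(open("huge.csv"), 300)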
> Arrays can be used to keep a running count of top-n records (this may need to be sorted on insert, though comparing largest/smallest entry can mitigate this).
Honored and wise Morbius, I humbly beg to differ; top-N record arrays never need to be sorted on insert, because a binary heap is 25 lines of code. Then all you need for top-N is:
    for record in input_file:
        if record <= aMinheap.minimum:
            continue
        aMinheap.extract_minimum()
        aMinheap.add(record)
I think that, to produce the top 100, this takes about 8 comparisons per record added to the candidate set, rather than the 1000 or so required by your suggested method. The difference won't matter for a big enough unordered dataset, though, since almost all the time will be taken by discarding records in the initial comparison.
I was thinking quickselect was even faster when your data does fit in RAM, but I can't see how that could be true, even though it's linear time: if you have a million records, then the first partitioning step of quickselect will examine all of them (doing one comparison for each), the second partitioning step will examine on average half of them, and the third partitioning step will examine on average somewhat more than a quarter of them (because the expected size for the result of the second partitioning step is more than 250,000 records), and so we end up doing something like e comparisons per record on average rather than 1.
But, with the running-candidate-set approach you suggested, if the million records are in random order, after we're past the first 1000 records, we do only a single comparison for at least 90% of the remaining records (because they're less than the top 100 from the first 1000), and after we're past the first 10,000 records, for at least 99%. So if we did 8000 comparisons in the first 1000, then inserted 900 of the next 9000 records for another 7200 comparisons (plus 9000 comparisons to reject the other 8100), and then another 900 of the next 90,000 records (plus 90,000 initial comparisons), we're at 121,400 comparisons for the first 100,000 records, 1.214 comparisons per record. So, for a small top-N query, quickselect loses the race before it's finished with its second partitioning step. Or am I overlooking something?
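For what it's worth, in Python you don't even have to write the 25 lines yourself: the standard library's heapq gives you the same loop. A sketch (the n parameter and record source are illustrative, not the code above):
    import heapq
    def top_n(records, n):
        # One pass, O(n) memory: keep a min-heap of the n largest records seen so far.
        heap = []
        for rec in records:
            if len(heap) < n:
                heapq.heappush(heap, rec)
            elif rec > heap[0]:               # heap[0] is the smallest of the current top n
                heapq.heapreplace(heap, rec)  # pop the minimum, push the new record
        return sorted(heap, reverse=True)
    # top_n((float(line) for line in open("numbers.txt")), 100)
(heapq.nlargest(n, iterable) wraps the same idea in a single call.)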
I'd hand-coded a "top-n" utility in awk some time back.
I think my first iteration used bubble sort. That was still faster than a sort | head script, by a lot.
(I think I later improved on that, possibly by retaining min/max records.)
Even in awk, and even when choosing large values for n (e.g., thousands or tens of thousands of records), this easily beat the sort utility in CPU time, wall-clock time, and memory usage, for many millions of input records.
(Most of my coding tends to be hackish and brutish. Algorithms are not my strong suit.)
> I think my first iteration used bubble sort. That was still faster than a sort | head script, by a lot.
Haha! Gotta get Barack Obama to give you some algorithms lessons.
Thinking about it further, the initial heapify phase of heapsort is linear-time, and it's the later heap-popping phase that takes most of the runtime; if the command-line `sort` used heapsort for in-memory data, maybe `sort | head` would have been competitive or even faster? At least for n>32 or so? The usual algorithm for `sort` is mergesort, because that makes it perform acceptably for things that don't fit in RAM, but mergesort has to do most of its work before it can start producing output.
> To be fair, if the rest of your pipeline runs in the megabytes per second, then memory allocation might as well be free from a speed point of view.
This is important, but I think sometimes people struggle with this sort of thinking. I struggled to explain a similar concept to a junior engineer recently. He was very keen to try to optimize part of a process that wasn't the bottleneck. I tried a couple of approaches, like benchmarking various parts under different conditions and modeling how much speeding up different components would improve the end-to-end time.
I wasn't convincing, unfortunately, so he implemented some changes that successfully sped up one part but didn't improve end-to-end performance. I think sometimes you need to see it with your own eyes.
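One thing that sometimes helps make the argument concrete is Amdahl's law: if a stage is only a fraction p of the end-to-end time, speeding it up by a factor s can never buy more than 1 / ((1 - p) + p / s) overall. A tiny illustration (the numbers are made up):
    def overall_speedup(p, s):
        # Amdahl's law: p = fraction of total time spent in the optimized stage,
        # s = speedup factor achieved on that stage.
        return 1.0 / ((1.0 - p) + p / s)
    print(overall_speedup(0.10, 10))   # ~1.10x: a 10x win on a 10% stage buys ~10% overall
    print(overall_speedup(0.10, 1e9))  # ~1.11x even if that stage became essentially free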
At my first job I spent about US$10k on a super fast compile server which didn't speed up our slow compiles because the bottleneck was the shared 10BaseT Ethernet to the NFS fileserver where we were storing both the source code and the build artifacts. I should have listened to my boss who was telling me it probably wouldn't help.
I'd add: every time you read the data, do something so that next time it's easier. Using Java and pcapng files as an example: the first time you read your data, you should at least build a simple block/packet index so that next time you won't have to read it all again. Same for all kinds of sparse data. I've had great success using 'simple' things like roaring bitmaps as 'field x is present in block y of stream z' indexes. I save the compressed roaring bitmap(s) in a SQLite DB and next time I open the data file I use it. This can be grafted on quite quickly.
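As a rough illustration of the shape of the idea (in Python rather than Java, with a plain byte-offset index standing in for the roaring-bitmap one, and file/table names made up):
    import sqlite3
    def build_index(data_path, index_path):
        # First pass: remember the byte offset of every record so later runs
        # can seek straight to what they need instead of re-reading the file.
        con = sqlite3.connect(index_path)
        con.execute("CREATE TABLE IF NOT EXISTS record_offset (recno INTEGER PRIMARY KEY, offset INTEGER)")
        def offsets(f):
            pos = 0
            for recno, line in enumerate(f):  # here a "record" is just a line
                yield recno, pos
                pos += len(line)
        with open(data_path, "rb") as f:
            con.executemany("INSERT OR REPLACE INTO record_offset VALUES (?, ?)", offsets(f))
        con.commit()
        return con
    def read_record(data_path, con, recno):
        # Later runs: jump straight to record `recno` without scanning the file.
        (offset,) = con.execute("SELECT offset FROM record_offset WHERE recno = ?", (recno,)).fetchone()
        with open(data_path, "rb") as f:
            f.seek(offset)
            return f.readline()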
I've realized over the years that the 'load everything' approach is often linked to a lack of understanding of machine limitations and little training in stream processing and how efficient and scalable it is.
I'd blame maths teaching that focuses on the abstract operation (the full formula) and never takes its 'implementation' past simple examples. Or I'd blame Excel as the gateway drug to numerical computing. But mostly, it's probably 'just' that not that many people happen to encounter data 'that' big (yet it's not 'big' data), and when they do they're often not helped in finding 'progressive' solutions. A running variance/average isn't hard to understand, but you must know it exists... Basic stream processing can be achieved without too many changes (depending on the algorithms, of course). Simple indexes can be quite easy to build... But often we sell them 'you need to go full DB' or 'this is a job for Hadoop or infrastructure-nightmare-tool-of-the-day'. Not everyone points you to https://ujmp.org/ with sparse/dense 1d/2d/3d matrix structures and operations and different storage options (disk-backed, etc.).
Most of the time when I meet data scientists in difficulty, after an hour of explaining how I do X using roaring bitmaps or sparse structures, or an hour spent building a file/field index using very robust (old) libraries in their language/environment of choice, I see them build pretty solid and scalable pipelines...
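And since the running average/variance trick really is tiny, here it is in plain Python (Welford's online algorithm, a generic sketch):
    def running_mean_variance(stream):
        # Welford's online algorithm: one pass, O(1) memory, numerically stable.
        n, mean, m2 = 0, 0.0, 0.0
        for x in stream:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        return n, mean, variance
    # e.g. over a file far bigger than RAM:
    # n, mean, var = running_mean_variance(float(line) for line in open("values.txt"))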
At work we are batch-processing a decent amount of data every night using Spark. The way the workload can be split into parts allows the following strategy:
- Generate jobs containing only metadata and distribute those jobs to workers. There, the actual data required for that specific job is queried on the fly from the DB.
For a similar task however we later switched to the following strategy:
- Load all data to be processed at once, use all sorts of distributed transformations and aggregations on the full data set and do the splitting at the end of the workflow.
The reason why we switched? The only real reason was that it seemed more like the "Big Data" way to do things. With the first approach we wouldn't actually need all the fancy Spark functionality, right? We would only be abusing the framework as a fancy way to distribute mostly independent workloads.
However, I very much regretted that decision later as it made our life harder in many ways. For example, I could easily execute and thus debug one of the former jobs locally within a minute. Try that when the workflow is designed in a way that it needs to load several gigabytes before applying any logic. To be fair, the total load on the db was somewhat lower using the second approach, but that just wasn't worth it.
This is nothing new. Back in the late 2000s, I did this to deal with a huge quantity of XML being sent over the wire to desktop PCs. The original version needed 20GB of RAM. I changed it to just pick up the tags needed and parse the stream as it came. Time was massively reduced too.
I see the same mistakes done with JSON nowadays.
Basically, if you don’t need the DOM in its entirety, don’t parse it all.
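In Python the streaming version is only a few lines with the standard library's ElementTree.iterparse (the element names here are hypothetical); third-party libraries such as ijson offer the same incremental style for JSON:
    import xml.etree.ElementTree as ET
    def titles(path):
        # Stream the document and keep only the tags we need, discarding each
        # record as soon as it has been handled so memory stays flat.
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "book":            # hypothetical record element
                yield elem.findtext("title")  # pick up just the tag we need
                elem.clear()                  # free the finished subtree
    # for title in titles("catalog.xml"):
    #     print(title)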
This is, in my experience, the most common performance antipattern. What makes it even worse is the fact that people quite often do this in the name of performance. Which is rarely a good idea.
[1] Decoding billions of integers per second through vectorization https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2203
[2] https://github.com/lemire/FastPFor
[3] https://github.com/searchivarius/PyFastPFor