I've written this comment before: in 2007, there was a period where I used to run an entire day's worth of trade reconciliations for one of the US's primary stock exchanges on my laptop (I was the on-site engineer). It was a Perl script, and it completed in minutes. A decade later, I watched incredulously as a team tried to spin up a Hadoop cluster (or Spark -- I forget which) over several days, to run a workload an order of magnitude smaller.
> over several days, to run a work load an order of magnitude smaller
Here I sit, running a query on a fancy cloud-based tool we pay nontrivial amounts of money for, which takes ~15 minutes.
If I download the data set to a Linux box I can do the query in 3 seconds with grep and awk.
Oh, but that is not The Way. So here I sit, waiting ~15 minutes every time I need to fine-tune and test the query.
Also, of course, the query is now written in the vendor's homegrown weird query language which is lacking a lot of functionality, so whenever I need to do some different transformation or pull apart data a bit differently, I get to file a feature request and wait a few months for it to be implemented. On the Linux box I could just change my awk parameters a little bit (or throw Perl in the pipeline for heavier lifting) and be done in a minute. But hey, at least I can put the ticket in blocked state for a few months while waiting for the vendor.
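For concreteness, the kind of throwaway pipeline I have in mind is something like this (log format, file name and field positions are made up for illustration):

```
# filter the downloaded dataset, bucket by hour, count (illustrative format: field 1 is an ISO timestamp)
grep 'ERROR' dump-2024-01-15.log \
  | awk '{ counts[substr($1, 1, 13)]++ } END { for (h in counts) print h, counts[h] }' \
  | sort
```

When the question changes, you change the awk body; no feature request required.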
Oh how true this is. At my current work we use _kubernetes_ for absolutely no reason at all other than the guy in charge of infra wanted to learn it.
The result?
1. I don't have access to basic logs for debugging because apparently the infra guy would have to give me access to the whole cluster.
2. Production ends up dying from time to time because apparently they don't know how to set it up.
3. The boss likes him more because he's using big boy tools.
yeah but who was getting better stuff on their resume? didn't you get the memo about perl?
Just because your throw-away 40-line script worked from cron for five years without issue doesn't mean that a seven-node Hadoop cluster didn't come with benefits. You got to write in a language called "Pig"! So fun.
maybe we should all start to add "evaluated a Hadoop cluster for X applications and saved the company $1M a year (in time, headcount, and uptime) by going with a 40-line Perl script"
> yeah but who was getting better stuff on their resume? didn't you get the memo about perl?
That is why Rust is so awesome. It still lets me get stuff on my resume, while still producing an executable that runs on my laptop with high performance.
"Consistently shows disregards for costs, performance and practicality to deliver predictable increases in the team size, budget and number of direct reports"
I was in the field at the time and I agree. I thought it had to be what the big boys used. Then I realized that my job involves huge amounts of structured data and our MySQL instance handled everything quite well.
This was nicely foreseen in the original MapReduce paper, where the authors write:
> The issues of how to parallelize the computation, distribute the data, and
> handle failures conspire to obscure the original simple computation with large
> amounts of complex code to deal with these issues. As a reaction to this
> complexity, we designed a new abstraction that allows us to express the simple
> computations we were trying to perform but hides the messy details of
> parallelization, fault-tolerance, data distribution and load balancing in
> a library.
If you are not meeting this complexity (and today, with 16 TB of RAM and 192 cores, many jobs don't) then MapReduce/Hadoop is not for you...
There is an incentive for people to go horizontal rather than permitting themselves to go vertical.
Makes sense: in university we are told that vertical has limits and that we should prioritise horizontal. But I feel a little like the "mid-wit" meme; once we realise how far vertical can go, we can end up using significantly fewer resources in aggregate (as there is overhead in distributed systems, of course).
I also think we are disincentivised from going vertical as most cloud providers prioritise splitting workloads, most people don't have 16TiB of RAM available to them, but they might have a credit card on file for a cloud provider/hyperscaler.
*EDIT*: The largest AWS instance is, I think, the x2iedn.metal with 128 vCPUs and 4 TiB RAM.
*EDIT2*: u-24tb1.metal seems larger, at 448 vCPUs and 24 TiB of memory, but I'm not sure if you can actually use it for anything that's not SAP HANA.
Horizontal scaling did have specific incentives when Map Reduce got going and today also in the right parameter space.
For example, I think Dean & Ghemawat reasonably describe what their incentives were: saving capital by reusing an already distributed set of machines while conserving network bandwidth. In Table 1 they write that the average job duration was around 10 minutes, involving around 150 computers, and that on average 1.2 workers died per such job!
The computers had 2-4 GiB of memory, 100 Mbit Ethernet and IDE HDDs. In 2003, when they got MapReduce going, Google's total R&D budget was $90 million. There was no cloud, so if you wanted a large machine you had to pay up front.
What they did with Map Reduce is a great achievement.
But I would advise against scaling horizontally right from the start because we may need to scale horizontally at some time in future. If it will fit on one machine, do it on one.
Maybe it was a great achievement for Google, but outside of Google I guess approximately nobody rolling out MapReduce or Hadoop read Dean & Ghemawat, resulting in countless analysts waiting 10 minutes to view a few spreadsheet sized tables that used to open in Excel in a few seconds.
MapReduce came along at a moment in time where going horizontal was -essential-. Storage had kept increasing faster than CPU and memory, and CPUs in the aughts encountered two significant hitches: the 32-bit to 64-bit transition and the multicore transition. As always, software lagged these hardware transitions; you could put 8 or 16GB of RAM in a server, but good luck getting Java to use it. So there was a period of several years where the ceiling on vertical scalability was both quite low and absurdly expensive. Meanwhile, hard drives and the internet got big.
For MapReduce specifically, one of the big issues was the speed at which you could read data from an HDD and transfer it across the network. The MapReduce paper benchmarks were done on computers with 160 GB HDDs (so 3x smaller than a typical NVMe SSD today) which had sequential reads of maybe 40 MB/s (100x slower than an NVMe drive today!) and random reads of <1 MB/s (also very much slower than an NVMe drive today).
On the other hand they had 2GHz Xeon CPUs!
Table 1 in the paper suggests that average read throughput per worker for typical jobs was around 1MB/s.
Maybe more like 80MB/s? But yeah, good point: sequential reads were many times faster than random access, yet on a single disk sequential transfer rates were still not keeping up with increases in storage capacity or CPU speed. MapReduce/Hadoop gave you a way to have lots of disks operating sequentially.
I’ll add context that NUMA machines with lots of CPUs and RAM used to cost six to seven figures. Some setups were eight figures. They had proprietary software, too.
People came up with horizontal scaling across COTS hardware, often called Beowulf clusters, to have more computing for cheaper. They’d run UNIX or Linux with a growing collection of open-source tools. They’d be able to get the most out of their compute by customizing it to their needs.
So vertical scaling was exorbitantly expensive and less flexible at the time, too.
If my laptop is closed then data collection will still happen, as collection and processing are different systems; but my ability to mutate the data hands-on is affected.
Thus any downtime of my laptop is not really a problem.
See also: Jupyter notebooks, Excel, etc.
I will also point out that robustness in distributed systems is not as cut and dried, for two reasons:
1: These are not considered hot-path, mission-critical systems, so they will be neglected by SRE.
2: Complexity is increased in distributed systems, so you have a higher likelihood of failure until a lot of effort has been put into them.
Yes, I believe we are talking about different things. In my experience the Hadoop (or MapR) cluster ended up getting used for a bunch of heterogeneous workloads running simultaneously at different priorities. High-priority workloads were production-impacting batch jobs where downtime would be noticed by users. Lower-priority workloads were as you describe--analysts running ad-hoc jobs to support BI, data science, operations, etc.
HBase also ran on that infrastructure, serving real-time workloads. Downtime on any of the HBase clusters would be a high-severity outage.
So minutes/mo of downtime would certainly have unacceptable business impact. Another important thing is replication. Drives do fail, and if a single drive failure brings down prod how long would that take to fix?
To be clear in general my opinions are aligned with the article, I think using the whole machine at high utilization is the only environmentally (and financially) responsible way. But I don't believe it's true that purely vertical scaling is realistic for most businesses.
EDIT: there are also security and compliance concerns that rule out the scenario of copying data onto an employee laptop. I guess what I'm trying to get at is the scenario seems a little contrived.
> degenerated level of sysadmin competence that we forgot even what RAID is
At the risk of troll-feeding, what are you hoping to accomplish with this? Of course I haven't "forgot even what RAID is", and I'm confident my competence is not "degenerated".
In this magical world where we can fit the entire "data lake" on one box of course we can replicate with RAID, but you've still got a spof. So this only works if downtime is acceptable, which I'll concede maybe it could be iff this box is somehow, magically, detached from customer facing systems.
But they never really are. Assuming even that there aren't ever customer impacting reads from this system, downtime in the "data lake" means all the systems which write to it have to buffer (or shed) data during the outage. Random, frequent off-nominal behavior is a recipe for disaster IME. So this magic data box can't be detached, really.
I've only ever worked at companies which are "always on" and have multi-petabyte data sets. I guess if you can tolerate regular outages and/or your working data set is so small that copying it around willy-nilly is acceptably cheap go for it! I wish my life was that simple.
I'm certainly not trolling, but unfortunately I think you've completely misunderstood the context of this entire discussion.
If you really have multi-petabyte datasets then probably you are at the scale where distributed storage and systems will be superior.
The point of this conversation is that most people are not at this scale but think they are. IE: they sincerely believe that a dataset does not fit in ram of a single box because it's 1TiB or they think because it doesn't sit on a single 16TiB drive then a distributed system is the only solution.
The original post is an argument about that; that a single node can outcompete a large cluster, so you should avoid clustering until it really cannot fit on a single box anymore.
Your addendum was that reliability is a large factor. Mostly this does not bear much resemblance to reality. You might be surprised to learn that reliability follows a curve: you get very close to high reliability with a single machine, you diminish it enormously when you move to a distributed system, and you only start approaching higher reliability once you have put a lot more effort into that distributed system.
My comment about RAID was simply because it's very obvious that a single drive failure should not take a single machine down; similarly, a CPU fault or memory fault can also be configured to not take down a machine. That you didn't understand this is either a failing of our industry's knowledge; or, if you did understand it, then the comment was disingenuous and intentionally misleading, which is worse.
I've also only worked at companies that were "always on" but that's less true than you think also.
I have never worked anywhere that insisted that all machines are on all the time, which is really what you're arguing. There is no reason to have a processing box turned on when there's no processing that's required.
Storage and aggregation: sure, those are live systems and should be treated as such, but it is never a single system that both ingests and processes. Sometimes they have the same backing store, but usually there is an ETL process and that ETL process is elastic, bursty, etc. and its outputs are what people are actually doing reports based on.
RAID is literally designed to prevent data corruption using parity from data and gives resilience in the event of drive failures, even intermittent ones.
Like all "additional components", RAID controllers come with their own quirks and I have heard of rare cases of RAID controllers being the cause of data loss, but RAID as a concept is designed to combat bit-rot and lossy hardware.
ZFS in the same vein is also designed around this concept and attempts to join RAID, an LVM and a filesystem to make "better" choices on how to handle blocks of data. Since RAID only sees raw blocks and is not volume or filesystem aware there are cases where it's slower.
That said, I have to also mention that when I was investigating HBase there was no way to force consistency of data: there was no fsync() call in the code, it only writes to the VFS, and you have to pray your OS flushes the cache to disk before the system fails. HBase's parity is configured by HDFS, which is essentially doing exactly what RAID does, except only to the VFS and without parity bits.
RAID is distributed across drives on one machine. That whole machine can fail. Plus, it can take a while to recover the machine or array and it is common for another drive to fail during recovery.
HDFS is distributed across multiple machines, each of which can have RAID. It is unlikely that enough machines will fail to lose data.
One of my favorite posts. I'll always upvote this. Of course there are use cases one or two standard deviations outside the mean that require truly massive distributed architectures, but not your shitty csv / json files.
Reflecting on a decade in the industry I can say cut, sort, uniq, xargs, sed, etc etc have taken me farther than any programming language or ec2 instance.
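A concrete (entirely made-up) example of the kind of job I mean, using xargs to fan work out across every core of one box instead of reaching for a cluster:

```
# count matching lines across a pile of log files, running one grep per core (paths are illustrative)
find /data/logs -name '*.log' -print0 \
  | xargs -0 -P "$(nproc)" -n 16 grep -h 'checkout_failed' \
  | wc -l
```

That is most of what those "big data" jobs actually did, minus the cluster.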
My work sent me to a Hadoop workshop in 2016 where in the introduction the instructor said Hadoop would replace the traditional RDBMS within five years. We went on to build a system to search the full text of Shakespeare for word instances that took a solid minute to scan maybe 100k of text. An RDBMS with decent indexes could have done that work instantly; hell, awk | grep | sort | uniq -c could have done that work instantly.
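Roughly the pipeline I have in mind, from memory, with shakespeare.txt standing in for whatever corpus the workshop used (treat the details as a sketch):

```
# word-frequency table for the whole corpus; head shows the most common words
tr -cs '[:alpha:]' '\n' < shakespeare.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn | head

# or, for instances of one specific word:
tr -cs '[:alpha:]' '\n' < shakespeare.txt | grep -ci '^hamlet$'
```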
It’s been 8 years and I think RDBMS is stronger than ever?
That colored the entire course with a “yeah, right”. Frankly, is Hadoop even still popular? Sure, it's still around, but I don't hear much about it anymore. I never ended up using it professionally; I do most of my heavy data processing in Go and it works great.
Hadoop has largely been replaced by Spark which eliminates a lot of the inefficiencies from Hadoop. HDFS is still reasonably popular, but in your use case, running locally would still be much better.
My current task at my day job is analyzing a large amount of data stored in a Spark cluster. I'd say, so far, 80% of the job has been extracting data from the cluster so that I can work with it interactively with DuckDB.
This data is all read-only; I suspect a set of PostgreSQL servers would perform much better.
Yes. My job involves pulling a ton of data off Redshift into Parquet files, and then working with them using DuckDB (sooo much faster — DuckDB is parallelized, vectorized and just plain fast on Parquet datasets)
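For flavour, the laptop side of that workflow is often just one query over the extracted files; something like the following, with invented file and column names (DuckDB queries Parquet files, including globs, in place):

```
# top users by event count, straight off the extracted Parquet files -- no cluster involved
echo "
  SELECT user_id, count(*) AS n
  FROM 'events-*.parquet'
  GROUP BY user_id
  ORDER BY n DESC
  LIMIT 10;
" | duckdb
```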
Why Postgres? DuckDB is column-based and Postgres is row-based. For analytics workloads, I’m having a hard time thinking of a scenario where Postgres wins in terms of performance.
If your data is too big to fit into DuckDB, consider Clickhouse, which is also column-based and understands standard SQL.
In terms of the actual performance? Sure. In terms of the overhead, the mental-model shift, the library changes, the version churn and problems with Scala/Spark libraries, the black-box debugging? No, still really inefficient.
Most of the companies I have worked with that actively have spark deployed are using it on queries with less than 1TB of data at a time and boy howdy does it make no sense.
I haven't really encountered most of the problems you mentioned, but I agree it can certainly be inefficient in terms of runtime. That said, I think if you're already using HDFS for data storage, being able to easily bolt on Spark does make for nice ease of use.
These posts always remind me of the [Manta Object Storage](https://www.tritondatacenter.com/triton/object-storage) project by Joyent. This project was basically a combination of object storage with the added ability to run arbitrary programs against your data in situ. The primary, and key, difference being that you kept the data in place and distributed the program to the data storage nodes (the opposite of most data processing as I understand it), I think of this as a superpowered version of using [pssh](https://linux.die.net/man/1/pssh) to grep logs across a datacenter. Yet another idea before its time. Luckily, Joyent [open sourced](https://github.com/TritonDataCenter/manta) the work, but the fact that it still hasn't caught on as "The Way" is telling.
Some of the projects I remember from the Joyent team were: dumping recordings of local mariokart games to manta and running analytics on the raw video to generate office kart racer stats, the bog standard dump all the logs and map/reduce/grep/count them, and I think there was one about running mdb postmortems on terabytes of core dumps.
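The low-tech version of that idea, for anyone who hasn't seen it: pssh (linked above) lets you push the computation to wherever the logs already sit instead of hauling them somewhere central. Host file and paths here are hypothetical:

```
# run the same grep on every storage node and collect the per-host counts inline (illustrative paths)
pssh -h storage-hosts.txt -i 'grep -c "order_failed" /var/log/app/current.log'
```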
On a similar reasoning, in 2008 or such, I observed that, while our Java app would be able to run more user requests per second than our Python version, it’d take months for the Java app to overtake the Python one in total requests served because it’d have to account for a 6 month head start.
Far too often we waste time optimising for problems we don’t have, and, most likely, will never have.
I've worked places where it would be 1000x harder getting a spare laptop from the IT closet to run some processing than it would be to spend $50k-100k at Azure.
Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I’ve heard this anecdote on HN before but without ever seeing actual evidence it happened, it reads like an old wives tale and I’m not sure I believe it.
I’ve worked on a Hadoop cluster and setting it up and running it takes quite serious technical skills and experience and those same technical skills and experience would mean the team wouldn’t be doing it unless they needed it.
Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60GB of data? Does that make any sense at all?
Each node in our Hadoop cluster had 64 GiB of RAM (which is the max amount you should have for a single-node Java application, where 32 GiB is allocated for heap, FWIW); we had, I think, 6 of these nodes, for a total of 384 GiB of memory.
Our storage was something like 18TiB across all nodes.
It would be a big machine, but our entire cluster could easily fit. The largest machine on the market right now is something like 128 CPUs and 20 TiB of memory.
384GiB was available in a single 1U rackmount server at least as early as 2014.
Storage is basically unlimited with direct-attached-storage controllers and rackmount units.
I had an HP from 2010 that supported 1.5TB of ram with 40 cores, but it was 4U. I'm not sure what the height has to do with memory other than a 1U doesn't have the luxury of the backplane(s) being vertical or otherwise above the motherboard, so maybe it's limited space?
There are different classes of servers: the 4U ones are pretty much as powerful as it gets, with many sockets (usually 4) and a huge fabric.
1Us are extremely commodity, basically as “low end” as it gets, so I like to use them as a baseline.
A 1U that can take 1.5 TiB of RAM might be part of the same series of machines that also includes a 4U model that could do 10 TiB. But those are hugely expensive, both to buy and to run.
> Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I was a SQL Server DBA at Cox Automotive. Some director/VP caught the Hadoop around 2015 and hired a consultant to set us up. The consultant's brother worked at Yahoo and did foundational work with it.
Consultant made us provision 6 nodes for Hadoop in Azure (our infra was on Azure Virtual Machines) each with 1 TB of storage. The entire SQL Server footprint was 3 nodes and maybe 100 GB at the time, and most of that was data bloat. He complained about such a small setup.
The data going into Hadoop was maybe 10 GB, and consultant insisted we do a full load every 15 minutes "to keep it fresh". The delta for a 15 minute interval was less than 20 MB, maybe 50 MB during peak usage. Naturally his refresh script was pounding the primary server and hurting performance, so we spent additional money to set up a read replica for him to use.
Did I mention the loading process took 16-17 minutes on average?
You can quit reading now, this meets your request, but in case anyone wants a fuller story:
Hadoop was used to feed some kind of bespoke dashboard product for a customer. Everyone at Cox was against using Microsoft's products for this, while the entire stack was Azure/.Net/SQL Server...go figure. Apparently they weren't aware of PowerBI, or just didn't like it.
I asked someone at MS (might have been one of the GuyInACube folks, I know I mentioned it to him) to come in and demo PowerBI, and in a 15-minute presentation he absolutely demolished everything they had been working on for a year. There was a new data group director who was pretty chagrined about it; I think they went into panic mode to ensure the customer didn't find out.
The customer, surprisingly, wasn't happy with the progress or outcome of this dashboard, and were vocally pointing out data discrepancies compared to the production system. Some of them days or even a week out of date.
Once the original contract was up, and time to renew, the Hadoop VP now had to pay for the project from his budget, and about 60 days later it was mysteriously cancelled. The infra group was happy, as our Azure expenses suddenly halved, and our database performance improved 20-25%.
The customer seemed to be happy, they didn't have to struggle with the prototype anymore, and wow, where did all these SSRS reports that were perfectly fine come from? What do you mean they were there all along?
In 2014 I was at Oracle Open World. A 3rd-party hardware vendor was pitching (and finding customers for) Hadoop "clusters" that had 8 CPU cores. Basically their pitch was that Oracle hardware (ex-Sun) started at a dense full rack of about a million USD or so, but with the 3rd party you could have a Hadoop "cluster" in 2U and for $20K. The Oracle thing was actually quite price-competitive at the time, if you needed Hadoop. The 3rd-party thing was overpriced for what it was.
Yet, I am sure that 3rd party hardware vendor made out like bandits.
I worked at a corp that had built a Hadoop cluster for lots of different heterogeneous datasets used by different teams. It was part of a strategy to get "all our data in one place". Individually, these datasets were small enough that they would have fitted perfectly fine on single (albeit beefy for the time) machines. Together, they arguably qualified as big data, and justification for the decision to use Hadoop was because some analytics users occasionally wanted to run queries that spanned all of these datasets. In practice, these kind of queries were rare and not very high value, so the business would have been better off just not doing them, and keeping the data on a bunch of siloed SQL Servers (or, better, putting some effort into tiering the rarely used data onto object storage).
I wonder if companies built Hadoop clusters for large jobs and then also use them for small ones.
At work, they run big jobs on lots of data on big clusters. The processing pipeline also includes small jobs. It makes sense to write them in Spark and run them in the same way on the same cluster. The consistency is a big advantage and that cluster is going to be running anyway.
Moore's law and its analogues make this harder to back-predict than one might think, though. A decade ago computers had only about an eighth (rough upper bound) of the resources modern machines tend to have at similar price points.
This is exactly the point of the article. From the conclusion:
> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.
This will not stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their „Data Lake“ via Spark.
And this isn’t even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives loss of institutional knowledge three layoffs down the line.
Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume that they will need to scale to millions of QPS; most of the time this is incorrect, and when it is not, the requirements have changed and it needs to be rebuilt anyway.
I think it completely matters - yes these orgs are a lot more wasteful, but there is still an opportunity to save money here, especially in this economy, if not also for the internal political wins.
I’ve spent time in some of the largest distributed computing deployments and cost was always a constant factor we had to account for. The easiest promos were always “I saved X hundred million” because it was hard to argue against saving money. And these happened way more than you would guess.
> I’ve spent time in some of the largest distributed computing deployments
Yeah obviously if you run hundreds or thousands of servers then efficiency matters a lot, but then there isn't really the option to use a single machine with a lot of RAM instead, is there?
I'm talking about the typical BigCorp whose core business is something else than IT, like insurance, construction, mining, retail, whatever. Saving a single AKS cluster just doesn't move the needle.
Yeah I see your point where it just doesn’t matter, especially going back to the original point where it may not be at scale now, but you don’t want to go through the budget / approval process when you need it etc.
I think my original point was more in the “engineers want to do cool, scalable stuff” realm - and so any solution has to support scaling out to the n’th degree.
Organisational factors pull a whole new dimension into this.
I mean yeah, definitely - it blows my mind how much tolerance for needless complexity the average engineer has. The principal/agent mismatch applies universally, and beyond that it is also a coordination problem - when every engineer plays by the "resume driven development" rules, opting out may not be the best move, individually.
The long term maintainability is an important point that most comments here ignore. If you need to run the command once or twice every now and then in an ad hoc way then sure hack together a command line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it work long after Jeff quits.
Sometimes the killer feature of that data analytics pipeline isn't scalability, but robustness, reproducibility and consistency.
> "email Jeff and ask him to run his script" isn't scalable
Sure, it's not.
But the alternative to that doesn't have to be building some monster cluster to process a few gigabytes.
You can write a good script (instead of hacking one together), put it in source control and pull it from there automatically to the production server and run it regularly from cron. Now you have your robustness, reproducibility and consistency as well as much higher performance, for about one-ten-thousandth of the cost.
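As a sketch of what that can look like (repo location, user, schedule and file names are all illustrative), the whole "pipeline" is a cron entry that pulls the versioned script and runs it:

```
# /etc/cron.d/nightly-report -- assumes the repo is already cloned at /opt/reports; "etl" is a hypothetical service user
15 2 * * * etl  cd /opt/reports && git pull --quiet && ./run_report.sh >> /var/log/nightly-report.log 2>&1
```

Everything is in source control, the run is unattended and logged, and nobody has to email Jeff.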
I bought a Raspberry Pi 4 for Christmas. It's connected to my dev laptop directly via wired connection. My self imposed challenge for this year is to try to offload as much work to this little Pi as I can. So I'm a fan of this approach.
Even with large-scale data, Hadoop/Spark tend to be used in ways that make no sense, as if something being self-described as big data means that as soon as you cross some threshold you SHOULD be using it.
Recently had an argument with a senior engineer on our team because a pipeline that processed several PB of data, scaled to 1000+ machines and was by all accounts a success was just a Python script using multiprocessing, distributed with ECS, and didn't use Spark.
Common command line tools are often the best for analyzing and understanding HPC clusters and issues. People have often asked me for tools and web pages to figure out how to understand and figure out issues in our cluster, or asked if we could use some tool like Hadoop, Spark, or some Azure/GCP/AWS tool to do it faster. I've said that if they want to spend the effort to use those tools, it could be valuable; but if it takes me 10min to use those tools and <1min using command line tools, I'll always fall back to the command line.
That's not to say that fancy tools don't have their use; but people often forget how much you can do with a few simple commands if you understand a pipeline and how the commands work.
Yes, I’ve used this approach in my Swiss Army Llama project with huge numbers of vectors and it can scale massively. Also it’s free! These vector db as a service companies charge insane prices for this! It really does feel like snake oil to me.
There was some environmental researcher (studying ice surface movements at the South Pole, I think) who rewrote his calculations from Nvidia GPU computing to a plain C program. The GPU job used to take months; afterwards, seconds.
This is one of my favourite posts.
Part of my PhD was based on this post. https://discovery.ucl.ac.uk/id/eprint/10085826/ (Section 4.1).
I presented this section of my research in a conference and won best paper award as well.
If we write dedicated tools, the speed boosts can be enormous. We can process 1 billion rows from a simple CSV in just 2 seconds. In slow Java.
It just requires some skills, which are hard to find nowadays.
Like a lot of things, people tend to make a decision for horizontal vs vertical and then stick with it even as the platforms or "physics" change underneath them over time. Same for memory bandwidth (which people, like Sun, thought would remain more of a bottleneck than it actually turned out to be).
What is the largest data set people here are processing daily for ETL on one machine? What tools are you using, and what does the job do? I want to know how capable new libraries like polars are, and how far you can delay transitioning to Spark. Are terabyte datasets feasible yet?
I remember reading this article several years ago. Good to see it again. I remember when everyone thought their data was big data. How provincial we were.
272MB/s is "consumer spinning rust RAID 5 via USB3" speeds. I know this because that's roughly what my NAS backup machine does on writes/reads. From a NAS on gbit it's overpowered by 150%, for sure.
Actually the awk solution in the blog post doesn’t load the entire dataset into memory. It is not limited by RAM. Even if you make the input 100x larger, mawk will still be hundreds of times faster than Hadoop. An important lesson here is streaming. In our field, we often process >100GB data in <1GB memory this way.
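A minimal sketch of that streaming pattern (file layout and column numbers are hypothetical): memory grows with the number of distinct keys, not with the size of the input, so hundreds of GB can flow through a tiny process.

```
# sum column 3 per key in column 1, one line at a time -- only the running totals are ever held in memory
zcat events-*.csv.gz \
  | mawk -F, '{ sum[$1] += $3 } END { for (k in sum) print k, sum[k] }'
```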
This. For many analytical use cases the whole dataset doesn't have to fit into memory.
Still: of course worthwhile to point out how oversized a compute cluster approach is when the whole dataset would actually fit into memory of a single machine.
It's so funny how there is almost no 'science' - or 'engineering' - in modern 'computer science' or 'software engineering'. The finding on OP's website, of what is or isn't fast for what purpose, should not be surprising us, 79 years after the first programmable computer. Yet we go about our work, blissfully ignorant of the actual capabilities of the things we're working with.
We don't have hypotheses, experiments, and results published, of what a given computing system X, made up of Y, does or doesn't achieve. There are certainly research papers, algorithms and proof-of-concepts, but (afaict) no scientific evidence for most of the practices we follow and results we get.
We don't have engineering specifications or tolerances for what a given thing can do. We don't have calculations for how to estimate, given X computing power, and Y system or algorithm, how much Z work it can do. We don't even have institutional knowledge of all the problems a given engineering effort faces, and how to avoid those problems. When we do have institutional knowledge, it's in books from 4 decades ago, that nobody reads, and everyone makes the same mistakes again and again, because there is no institutional way to hold people to account to avoid these problems.
What we do have, is some tool someone made, that then millions of dollars is poured into using, without any realistic idea whatsoever what the result is going to be. We hope that we get what we want out of it once we're done building something with it. Like building a bridge over a river and hoping it can handle the traffic.
There are two reasons creating software will never (in my lifetime) be considered an engineering discipline:
1) There are (practically) no consequences for bad software.
2) The rate of change is too high to introduce true software development standards.
Modern engineering best practice is "follow the standards". The standards were developed in blood -- people were either injured or killed, so the standard was developed to make sure it didn't happen again. In today's society, no software defects (except maybe aircraft and medical devices) are considered severe enough for anyone to call for the creation and enforcement of standards. Even Teslas full-self-driving themselves into parked fire trucks and killing the occupants doesn't seem enough.
Engineers that design buildings and bridges also have an advantage not available to computers: physics doesn't change, at least not at scales and rates that matter. When you have a stable foundation it is far easier to develop engineering standards on that foundation. Programmers have no such luxury. Computers have only been around for less than 100 years, and the rate of change is so high in terms of architecture and capabilities that we are constantly having to learn "new physics" every few years.
Even when we do standardize (e.g. x86 ISA) there is always something bubbling in research labs or locked behind NDAs that is ready to overthrow that standard and force a generation of programmers into obsolescence so quickly there is no opportunity to realistically convey a "software engineering culture" from one generation to the next.
I look forward to the day when the churn slows down enough that a true engineering culture can develop.
Imagine what scenario we would be in if they laid down the Standards of Software Engineering (tm) 20 years ago. Most of us would likely be chafing against guidelines that make our lives much worse for negative benefit.
In 20 years we'll have a much better idea of how to write good software under economic constraints. Many things we try to nail down today will only get in the way of future advancements.
My hope is that we're starting to get close though. After all, 'general purpose' languages seem to be converging on ML* style features.
* - think standard ML not machine learning. Static types, limited inference, algebraic data types, pattern matching, no null, lambdas, etc.
The Mythical Man-Month came out in 1975. It was written after the development of OS/360, which was released in 1966. Of the many now-universally-acknowledged truths about software development contained in that book, No Silver Bullet encapsulates why "in 20 years" we will still not have a better idea:
> There is no single development, in either technology or management technique, which by itself promises even one order of magnitude improvement within a decade in productivity, in reliability, in simplicity.
I like to over-simplify that quote down to:
Humans are too stupid to write software any better than they do now.
We have been writing software for 70 years and the real world outcomes have not gotten a lot better than when we started. There are improvements in how the software is developed, but the end result is still unpredictable. Without thorough quality control - which is often disdained, and there is no requirement to perform - the result is often indistinguishable whether it was created by geniuses or amateurs.
That's why I would much rather have "chafing guidelines" that control the morass, than to continue to wade through it and get deeper and deeper. If we can't make it "better", we can at least make it more predictable, and control for the many, many, many problems that we keep repeating over and over as if they're somehow new to us after 70 years.
"Guidelines" can't stop researchers from exploring new engineering materials and techniques. Just having standard measures, practices, and guidelines, does not stop the advancement of true science. But it does improve the real-world practice of engineering, and provides more reliable outcomes. This was the reason professional engineering was created, and why it is still used today.
"It's so funny how there is almost no 'science' - or 'engineering' - in modern 'computer science' or 'software engineering'"
It may not have been clear in 2014, but it is now: data scientists are not computer scientists or software engineers. So tarring software engineers with data scientists' practices is really a low blow. Not that we're perfect by any means, but that data point you're drawing a line through isn't even on the graph you're trying to draw.
I was unlucky enough to brush that world about a year ago. I am grateful I bounced off of it. It was surreal how much infrastructure data science has put into place just to deal with their mistake of choosing Python as their fundamental language. They're so excited about the frameworks being developed over years to do streaming of things that a "real" compiled language can either easily do on a single node, or could easily stream. They simply couldn't process the idea that I was not excited about porting all my code to their streaming platform because my code was already better than that platform. A constant battle with them assuming I just must not Get It and must just not understand how awesome their new platforms were, and me trying to explain how much of a downgrade it was for me.
"We don't have calculations for how to estimate, given X computing power, and Y system or algorithm, how much Z work it can do."
Yeah, we do, actually. I use this sort of stuff all the time. Anyone who works competently at scale does, it's a basic necessity for such things. Part of the mismatch I had with the data scientists was precisely that I had this information and not only did they not, they couldn't even process that it does exist and basically seemed to assume I must just be lying about my code's performance. It just won't take the form you expect. It's not textbooks. It can't be textbooks. But that's not the criterion of whether such data exists.
We do actually have some methods of calculating expected performance. For instance, we know that a Zen 4 CPU can do four 256-bit operations per clock, with some restrictions on what combinations are allowed. We are never going to hit 4 outright in real code, but 3.5 is a realistic target for well-optimised code. We can use 1 instruction to detect newline characters within those 32 bytes, then a few more to find the exact location, then a couple to determine if the line is a result, and a few more to extract that result. Given a high density of newlines this will mean something on the order of 10 instructions per 32-byte block searched. Multiply the numbers and we expect to process approximately 11 bytes per clock cycle. On a 5 GHz CPU that would mean we would expect to be done in 32 ms, give or take. And the data would of course need to be in memory already for this time to be feasible, as loading it from disk takes appreciably longer.
Of course you have to spend some effort to actually get code this fast, and that probably isn't worth it for the one-shot job. But jobs like compression, video codecs, cryptography and that newfangled AI stuff all have experts that write code in this manner, for generally good reasons, and they can all ballpark how a job like this can be solved in a close to optimal fashion.
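Fittingly, the back-of-envelope arithmetic itself fits in a one-liner. Assuming roughly 1.75 GB of input (the figure that makes the 32 ms estimate above come out), ~11 bytes per cycle and a 5 GHz clock:

```
# bytes / (bytes per cycle) / (cycles per second) -> seconds; printed as milliseconds
awk 'BEGIN { bytes = 1.75 * 2^30; printf "%.0f ms\n", bytes / 11 / 5e9 * 1000 }'
# prints ~34 ms, i.e. "32 ms, give or take"
```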
The way I see it is that we're in an era analogous to what came immediately after alchemy. We're all busy building up phlogiston like theories that will disprove themselves in a decade or two.
But this is better than where we just came from. Not that long ago, you would build software by getting a bunch of wizards together in a basement and hope they produce something that you can sell.
If things feel worse (I hope) that's because the rest of us muggles aren't as good as the wizards that came before us. But at least we're working in a somewhat tractable fashion.
The mathematical frameworks for construction were first laid out ~1500s (iirc). And people had been doing it since time immemorial. The mathematics for computation started about 1920-30s. And there's currently no mathematics for the comprehensibility of blocks of code. [Sure there's cyclomatic complexity and Weyuker's 9 properties, but I've got zero confidence in either of them. For example, neither of them account for variable names, so a program with well named variables is just as 'comprehensible' as a program with names composed of 500MB of random characters. Similarly, some studies indicate that CC has worse predictive power of the presence of defects than lines of code. And from what I've seen in Weyuker, they haven't shown that there's any reason to assume that their output is predictive of anything useful.]
It can be, but usually isn't. Similarly, dropping a feather and a bowling ball simultaneously might be science, or might not be. Did I make observations? Or am I just delivering some things to my friend at the bottom?
I for one do not miss having "big data" mentioned in every meeting, talk, memo, etc. Sure, it's AI now, but even that doesn't get as annoying and misunderstood as the big data fad was.
> geeks think they are rational beings, while they are completely influenced by buzz, marketing, and their emotions. Even more so than the average person, because they believe they are less susceptible to it than normies, so they have a blind spot.
> but even that doesn't become as annoying and misunderstood as the big data fad was.
Must be nostalgia. AI is much, much worse. And, even more importantly, not only is it an annoying buzzword, it is already threatening lives (see the mushroom guide written by AI) and democracies (see the "singing Modi" and "New Hampshire Officials to Investigate A.I. Robocalls").
Also, both OpenAI and Anthropic argued that if licenses were required to train LLMs on copyrighted content, today’s general-purpose AI tools simply could not exist.
This might be naive, but I agree that AI hype will never be as annoying as Big Data hype.
At least 90% of the time when people mention wanting to use AI for something, I can at least see why they think AI will help them (even if I think it will be challenging in practice).
99% of the time when people talk about big data it is complete bullshit.
So I can use a command line tool to process queries that are processing 100 TB of data? The last time I used Hadoop it was on a cluster with roughly 8PB of data.
"Can be 235x faster" != "will always be 235x faster", nor indeed "will always be faster" or "will always be possible".
The point is not that there are no valid uses for Hadoop, but that most people who think they have big data do not have big data. Whereas your use case sounds like it (for the time being) genuinely is big data, or at least at a size where it is a reasonable tradeoff and judgement call.
As to people's beliefs on this, here's a Forbes article on Big Data [1] (yes, I know Forbes is now a glorified blog for people to pay for exposure). It uses as an example a company with 2.2 million pages of text and diagrams. Unless those are far above average, they fit in RAM on a single server, or on a small RAID array of NVMe drives.
That's not Big Data.
I've indexed more than that as a side-project on a desktop-class machine with spinning rust.
The people who think that is big data are the audience of this, not people with actual big data.