Maybe we need a reminder every now and then that text files are quite capable; apart from being faster to parse in a lot of situations, they have the benefit of being "future safe" and easy to back up and compress as well.
The common text format for lots of this, CSV, is absolutely awful. It's fine until it totally isn't; it was just the best option for a lot of use cases. I'd argue it's now been entirely replaced by parquet: significantly faster, broad support, proper types, and more.
If you don't already have parquet deployed, there's a wide gulf (in skill set, overhead, etc.) between CSV and parquet.
If you don't mind being old-school, the data is ASCII text, and you're tired of some of CSV's little issues, then ASCII has the FS, GS, RS, and US control characters - specifically intended for such uses. Micro conceptual overhead, none of the CSV issues which screw up many *nix text-handling programs and little script files, and a decent modern filesystem can handle the compression separately.
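A minimal sketch in Python of what that looks like (the helper names are mine; the US and RS byte values come from the ASCII standard):

```python
# US (0x1F) separates fields, RS (0x1E) separates records. Commas, quotes
# and newlines inside fields then need no quoting or escaping at all.
US = "\x1f"
RS = "\x1e"

def encode(rows):
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["id", "note"], ["1", 'contains, commas and "quotes"\nand a newline']]
assert decode(encode(rows)) == rows
```

FS (0x1C) and GS (0x1D) sit one level up again, for separating groups of records and whole files.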
For many cases of passing data between systems, I'd say the gulf is dramatically lower than it has been until even fairly recently. Support for many standard packages is just there, swapping out "read_csv" for "read_parquet" if you're a pandas shop may even be enough. More and more tools read these directly, and it opens up a whole load of better options for processing data.
> If you don't mind being old-school, the data is ASCII text, and you're tired of some of CSV's little issues, then ASCII has the FS, GS, RS, and US control characters
I have used those before, and yet I still had those characters appear in data. The only places I'd ever seen them were in the wiki page and in customer delivered data. Absolute pain to dig through and remove.
On top of that "if your data is ASCII" is something I'd be nervous about for many use cases even if it is right now.
Beyond that, then you need everyone to swap out their parsing to use those characters.
CSV is fine until it totally blows up in your face. All it takes is one "oh it's fine we'll use awk" stage somewhere or a CSV parser that isn't good enough and one person to put a newline where nobody had expected it before.
> I have used those before, and yet I still had those characters appear in data...
Oh, yes - which is why I emphasized "is". But ASCII text is easy to test for, which lets you fast-track into exception handling - "Tell Sales that Customer data is not as represented", "Trouble-shoot internal data source", etc.
(My experience is that substantial Customer data is never, ever as initially represented. Nor as represented after you point out the first set of issues with it. Nor as represented after you point out the second set of issues. Nor as...)
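For what it's worth, the "easy to test for" part can be one small function - a sketch where the allowed set is printable ASCII plus common whitespace and the four separator characters:

```python
# Accept printable ASCII (0x20-0x7E), tab/LF/CR, and FS/GS/RS/US (0x1C-0x1F);
# anything else routes the payload into the exception-handling path.
ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D} | set(range(0x1C, 0x20))

def is_clean_ascii(data: bytes) -> bool:
    return all(b in ALLOWED for b in data)

assert is_clean_ascii(b"name\x1fvalue\x1e")          # separators are fine
assert not is_clean_ascii("na\u00efve".encode("utf-8"))  # non-ASCII bytes are not
assert not is_clean_ascii(b"\x00binary junk")        # neither is NUL
```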
> Oh, yes - which is why I emphasized "is". But ASCII text is easy to test for, which lets you fast-track into exception handling - "Tell Sales that Customer data is not as represented", "Trouble-shoot internal data source", etc.
Oh, with that I mean the ASCII control characters appearing in inputs. So some columns would have record-end markers in them, for example.
If I'm able to make everyone dealing with the reading and writing add specific characters to be used for start/end/etc I'd rather just tell them to swap to a parquet reader unless they've got a really good reason.
As someone in the process of setting up a pipeline and currently using pandas.to_csv as my output, I'm curious what makes you recommend parquet in particular? How does it compare to HDF or Feather?
It depends on what you're trying to optimise; parquet is a very good all-round option. HDF I've never really gotten into, as it always felt like a good solution only if I move everything over. It's great if your use case fits.
Feather is a layer on top of arrow and was a proof of concept (so I'm not sure how heavily it's used now), and arrow is fast becoming the interchange format. It's exactly laid out as things will be in memory - which means zero copy for shuttling it around from one place to another. I _think_ there is less support for feather but that is likely changing as everything converges.
Parquet should be
* Faster to write
* Faster to read (even if you're reading the whole file, which actually isn't required, the format helps you read just sections of the columns you need)
* Smaller
* Better at handling actual floating-point values
than CSV, while having actual standards alongside it. Be a little wary of letting pandas guess the right column types for you if you're creating partitioned files, btw.
When you're working with pandas, etc (check out Dask) you can pretty much just swap out some reading and writing functions. You can also use pyarrow directly if you need to be very careful about column types.
For your use case you may want to explicitly use a single column for the features that is a list, I'm not sure if that's better/worse than having so many columns. If a reader may want to find just some images where a small subset of features are > X, you might benefit from multiple columns so that the reader only processes the data it needs.
Worth testing out, but I expect you should be able to try it out in an afternoon if you're already working with pandas/similar. Just install pyarrow and use a to_parquet. Things like dask (or straight pyarrow) give you partitioned files as output if you want too, if there's a useful column or columns to split on https://arrow.apache.org/docs/python/parquet.html#partitione...
Just a heads up, pandas to_csv is laughably slow. You're literally better off converting the pandas df to arrow and then serialising to CSV with arrow, though arrow's CSV support is quite limited.
Well at least it’s better than numpy’s csv functions. For my use case (~4 million images, one row per image, 2048 fp32 columns for features extracted with resnet), parsing/encoding csv with pandas was 10x faster than with numpy. Of course serializing the arrays to/from npy was 10x faster than any csv library.
Yes - parquet tooling is non-existent compared to CSV particularly on the command line. And the cross-language/platform support is a mess - good luck reading Pandas generated parquet on .NET or in a (non-spark) JVM environment.
There are many reasons why CSV is flawed for the purposes of storing tabular data (e.g. loss of column type information) but the alternatives are just so unergonomic that CSV remains a viable choice in many situations.
Column-oriented formats such as Parquet can be awful too. For example, if you have five columns of numbers you want to multiply together, Parquet is going to be the worst option available because of the extreme cache thrashing that the CPU will encounter.
Structure packing[1] and consideration of locality of reference[2] would need to be applied for high performance applications where a computer scientist has considered the algorithm needing to be implemented and the most efficient data format that the source data would need to be provided in.
Exactly. Basically most big data lives in HDFS, and HDFS is part of Hadoop. Even if you use Spark or Flink, I would classify that as using Hadoop (under the hood).
Aside from Apache Spark, what's replaced it and does it still face the same speed of access limitations compared to just zipping through giant CSVs with awk or whatever streaming APIs you write with your own preferred language?
I use Spark for a number of jobs for language-specific features still but I think within 2 years all custom code will be trivially invoked as native UDFs in SQL data warehouses (ie Snowflake, which has essentially solved big-data performance as a going concern).
I just write SQL in Snowflake and it replaces 95% of what I would otherwise have done in custom MapReduce or Spark code.
What I really dislike about modern cloud DWHs such as Snowflake is that they hide a lot of things from me. Since I'm not a CTO who worries about not delivering, but a junior DE who actually wants to learn things, I really prefer the old ways, where we had to manage our own infrastructure and our own code for ETL. These kinds of things cannot be learned "just for fun" because one has to work in a real environment.
What do you mean? Of course you can still learn them "just for fun" if you want. There are plenty of columnar data warehouses (memsql, greenplum, vertica, clickhouse, etc) and data processing frameworks (spark, flink, etc) that you can look at, implement and run yourself.
What I'm saying is that you can surely learn the basics from personal use, but it's completely different from real usage, and that can only be trained on the job. And those jobs are getting fewer as everyone moves to the cloud.
SQL will always be faster than Hadoop and MapReduce. The main reason to use those other, slower services is that developers are not used to SQL or declarative programming, and insist on having the code written in a procedural way.
That's completely backwards. Mapreduce-like approaches are how SQL datastores are implemented underneath; the absolute best case for SQL is to equal hand-tuned mapreduce-like performance, and often it will be slower (you're at the mercy of your query planner to pick the right indices, do joins in the right order, etc.). The main reason people use SQL is because they find it easier to express a query that way (which is completely legitimate - if your query planner is good enough most of the time, you've got better things to be doing than hand-tuning your query execution).
No, that does not seem correct. SQL Datastores are not "map-reduce underneath", they have optimized datastructures for efficient querying (i.e. indices). Map-reduce is equivalent to those cases in SQL database where you have full table scan in your query plan - basically brute-forcing your way through the dataset.
You can (and often should) have indices in a map-reduce situation as well - you just build them in an explicit, visible way. But in most of the relevant use cases you're doing some kind of aggregation over the whole table, so indices don't help any.
And if your primary use-case is column-wise aggregation over the whole table, in SQL you'd use a (compressed) column store rather than a row store as your table storage method.
To be fair, Parquet, which is commonly used in Big Data solutions is a column store format. So, once you normalize your data and save it as Parquet you can have efficient column-wise aggregation - but that assumes some preprocessing step.
That makes no sense. SQL is a query language, commonly implemented by relational databases.
In the early 2000s, columnar relational data warehouses were not sophisticated and scalable enough to handle the scale of data encountered at Yahoo, Google and other internet companies. MapReduce (and the many evolutions of Hadoop ecosystem) was created to scale processing through low-level instructions and algorithms.
Eventually columnar data warehouses caught up and are now capable of handling petabyte scale, regardless of whatever language you use to query them. The fundamental storage and compute primitives haven't really changed that much, just offered in a much more user-friendly way now.
SQL itself is just a query language, it's the underlying cloud based data warehouse that fulfills the role of what map/reduce used to do in terms of parallelization transparently.
1) Why would you want to maintain your own Spark infrastructure? Spark on Kube is a huge improvement over YARN but you still have to deal with OOMEs, filled disks, Kube upgrades, pushing custom images to container registries, etc etc etc.
2) Snowflake is probably 10-50x as performant as Spark for data manipulation. I don't know what kind of unholy demonic incantations Snowflake is doing on the backend to support their SQL performance, but it's really freaking fast. There's just no other way to cut it.
I've spent 5-10 years eking every ounce of performance I can get out of a Hadoop/Spark cluster. I'm not trying to be unreasonable about this. I would love for OSS to be competitive; it's great for the world, and it would be great for my skill set and earning potential.
But it's not a contest, and if you think standalone Spark is going to be a viable competitor in a couple years, you are deluding yourself. Make informed choices about your career and investment.
A lot of articles I read about snowflake involves data vault which is a massive turn off. And when their tech lead (Kent Graziano) is a prominent figure in the DV bullshit...
Snowflake and DV have no interdependency whatsoever. Snowflake is just a database. Whether you use DV to model the data inside of it or dimensional modelling or "big wide tables" is completely up to you, there's nothing about it that requires or benefits DV in particular.
You should try Databricks, especially the new Photon engine powering Spark. In general more performant than Snowflake in SQL and a lot more flexible. (There are some cases in which Databricks would be slower but the perf is improving rapidly.)
Databricks has an extremely bad API. So, sure, your Spark jobs might be a little bit faster some times, but why would you use it if you can't even read logs of running jobs?
Databricks is amazing, the Delta Live Table technology is incredible. It's very hard to approach problems like Data Lineage and Data Quality, but that platform does it in the right way.
My only concern is that they offer just a managed cloud product. That's cool for startups, but large enterprises sometimes need more governance and ownership than that.
I spent a while putting together a POC of Flink for a use case at client behest a couple of years ago. Really struggled with it. Seemed slow as heck too. I'd much rather write a few hundred extra lines of my own code to do stuff like that (ingest -> process -> output). Modern software fashion trends seem to create as much work as they try to save, as extensive and magical as a lot of the features are.
Whether a technology can replace Hadoop in an organization depends on many factors, but some technologies that solve at least in part similar problem are Apache Storm, Spark, Flink, Kafka Streams,
and maybe BigQuery?
Or, as the original article says, some companies just use some command line tools, shell scripts.
It's been a couple of years since I was interested in Data Engineering, so my knowledge on this topic is some years behind.
I've not seen Storm being used anywhere sane for a few years at least now, and from a glance at job postings it looks unlikely. Spark, Kafka Streams etc. are definitely used in a modern data platform from my experience.
I think we're seeing a big shift with Hadoop-like workloads being moved onto cloud providers, so BigQuery, Amazon EMR etc.
I'm curious what constitutes "big data" anymore. In an intermediate machine learning course, we train on nearly a petabyte of data using Google Colab and Jupyter Notebooks. Nobody discusses the size of the data requiring any special treatment due to its size... would not 95% of a petabyte be "big data"?
What course are you taking? Imagenet is only 150 GB, and Common Crawl is only 320 TB.
Big data is a moving target, but I'm comfortable defining it as data too large to fit in memory. Obviously, you can always get a bigger node; my rule of thumb is that if you need generators, you are working with big data.
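The generator rule of thumb in miniature - stream an aggregate line by line instead of loading the file (toy file here, obviously; "big data" is the case where the non-streaming version dies):

```python
from pathlib import Path

# Toy stand-in for a file too big to load at once.
Path("numbers.txt").write_text("\n".join(str(i) for i in range(1000)))

def values(path):
    """Yield one integer per line without materialising the whole file."""
    with open(path) as f:
        for line in f:
            yield int(line)

total = sum(values("numbers.txt"))  # peak memory: one line, not one file
assert total == 499500
```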
Glue often uses EMR under the hood, which is often Spark. And Athena is PrestoDB, as far as I know it has nothing to do with Hadoop other than you can use it to query Hadoop data stores.
The way I see it, Hadoop is still in common use as the storage layer for Spark and related implementations, whether that is in the form of HDFS or something like EMRFS:
Quote from AWS: "EMRFS is an implementation of the Hadoop file system ..."
Ah - that's a good point. Usually when people say "Hadoop", I assume they're referring to HDFS, but there is (was...) the Hadoop MR system that ran on top of it that has been almost entirely replaced by Spark.
> What about if you had 500 petabytes of data to analyze?
Then we probably need the big tools.
> Big data does exist in the wild.
So does little data.
The problem is that a "one-size-fits-all" approach has become common, not just in data analysis; think of all the low-medium traffic webpages that use giant, complex frameworks and huge distributed systems just to display what is essentially a small CRUD app that would have been a LOT easier to cobble together in plain JS on a simple LAMP server.
What the article shows is the importance on deciding for the right tool for the job: When I want to plant a little tree in my backyard, bringing one of these
https://upload.wikimedia.org/wikipedia/commons/0/01/Bucket_w...
to dig the hole is proooobably overengineering it a tiny little bit, and will likely take longer than getting a shovel.
Those 500 PB imply a competent staff, a world-class infrastructure, and worthwhile applications. These things don't materialize suddenly: they start maturing when the data set is 500 GB or even just planned and they usually mature enough to switch technology multiple times as the scale rises.
The article still applies to the things which have replaced Hadoop. The operational, development, and computational overhead of distributing/re-aggregating work remains huge.
Large-scale storage clusters like Hadoop, Cassandra, ElasticSearch are generally slow, expensive, hard to set up properly and require a lot of monitoring to stay healthy. Use them only when other solutions won't do.
If your data will fit in a set of text files or a cluster of relational databases, use those. Even if you plan on storing a gazillion TB of data, it's faster to iterate app logic on nimble storage solutions first, before entertaining something bigger.
Where the large scale storage clusters shine is when the sheer scale of data won't fit in anything else, i.e. there's no other (sane) choice.
> Where the large scale storage clusters shine is when the sheer scale of data won't fit in anything else, i.e. there's no other (sane) choice.
And when people say "won't fit in anything else", do explore your options before! RAM storage can be very, very, very big and the cost for using it is minuscule compared to what it used to be!
On that note, there is this great website for seeing if there are servers that can handle fitting all your data in RAM: https://yourdatafitsinram.net/
Systems like Power System E980 can handle up to 64TB RAM, which is a lot of data. Just like parent said here, do try to fit things on machines like this before even trying out large-scale storage clusters, because they are a hassle to deal with and generally not worth the cost.
Edit: For the Cloud-hosting crowd out there, I took a quick look at https://instances.vantage.sh/ and sorted it by memory. Largest instances you can get at AWS is "U-24TB1 Metal", which comes with 24TiB of RAM (and 448 vCPU so the computation itself gets as fast as possible too [granted you can parallelize it]), which should fit most cases of "big data" I've seen in the wild today. Unsure about running costs though, but I'm 90% sure it's cheaper to have that instance for a couple of hours than the cost for having to deal with storage clusters, which are also slower.
>> And when people say "won't fit in anything else", do explore your options before! RAM storage can be very, very, very big and the cost for using it is minuscule compared to what it used to be!
That's why the "big data" industry also encourages collection of absolutely every bit of data you can find. They want you to need their tools. You may not think there's a use for it, but vague promises of AI finding needles in the haystack are used to get you to keep it and bloat your system.
This is why the first, possibly most important phase of a data analytics job is to filter down to the subset of the data that you actually need to compute the result.
Collecting it all on the front end can be very effective, as long as you can easily filter. If you can't, you are just causing yourself more problems.
> Systems like Power System E980 can handle up to 64TB RAM, which is a lot of data.
Doesn't that kind of system cost millions of dollars?
Hadoop and the like are popular because they run on COTS hardware that costs peanuts. It hardly makes any sense to argue that spending over a million dollars is enough to drop a solution that's primarily adopted because you do not have that kind of money to throw at a problem.
You could also use SSI (single system image) system architectures, including distributed shared memory, to seamlessly (at least wrt. software implementation) scale up from efficient single-node processing to an arbitrarily large, clustered system. AIUI this is how things are traditionally done in mainframe processing, and it is also regarded as a very meaningful possibility in HPC.
That concept has fallen out of favour in HPC, but it'll be interesting to see if it makes a comeback.
As recently as 2016, an SGI UV3000 rack-scale SMP machine with 16TB of RAM was the sort of thing you'd see on a trade-show floor - now you can get that much memory in a 4U chassis, and things will only improve further if/when Optane DIMMs take off.
I can share with you that my solar-powered Pi3B+ hosting yourdatafitsinram is holding up quite well, being linked to on HN - a funny contrast to the kind of systems it links to in the table.
Even a 10+ year old DL380G8 - can hold 1.5TB+ RAM and 24/48 cores and that hardware is dirt cheap on the second hand market.
> I can share with you that my solar-powered Pi3B+ hosting yourdatafitsinram is holding up quite well, being linked to on HN - a funny contrast to the kind of systems it links to in the table.
If all a server does is serve mostly static content from memory then it's not expected to handle a demanding workload beyond networking.
Therefore I fail to see the point of trying to downplay someone else's needs for TB of RAM just because you serve a site on a raspberry pi, as if that's the full extent of what anyone needs to do with a computer, particularly when the OP mentions big data.
I highly recommend the book Designing Data-Intensive Applications. It talks at length about different complex data systems and then warns readers that they should only use them if absolutely necessary.
> Where the large scale storage clusters shine is when the sheer scale of data won't fit in anything else, i.e. there's no other (sane) choice.
In general, this can be an effective approach, but at least fulltext search is another story.
Storing hundreds of MBs (actually, I think even tens of MBs can be problematic) in text files or a db like MySQL (whose FT engine is terrible) will result in slow fulltext searches.
Another way to speed grep up is to use something like ripgrep. Something I picked up lately, as a noob with bash scripting, is that you can run a whole bunch of things from a single bash script and they will be automatically allocated to different cores on whatever node/machine you are on. Just append `&` to each line and stick a `wait` at the end and voila, you have a very hack-y but robust way of running a bunch of processes in parallel.
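The same fan-out-then-wait pattern translates directly to Python's standard library for anyone who'd rather not do it in bash (`job` is a stand-in for real per-task work; this relies on fork-style process start, as on Linux):

```python
from concurrent.futures import ProcessPoolExecutor

def job(n):
    return n * n  # stand-in for real per-task work

# Launch everything, then block until all of it finishes -
# the equivalent of backgrounding with `&` and ending with `wait`.
with ProcessPoolExecutor() as pool:            # one worker per core by default
    results = list(pool.map(job, range(8)))    # exiting the block is the "wait"

assert results == [0, 1, 4, 9, 16, 25, 36, 49]
```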
Forking in the background is one way, you can also use GNU "parallel" ("apt-get install parallel" for those on Debian and the likes).
For example I've got all my CD ripped to FLAC files, but my car only takes mp3 or wav... So I did a batch convert of FLAC to mp3, making sure to put all cores at work by piping the output of "find" into "parallel".
Some commands also allow you to parallelize directly (like, say, "make -j 16 ..." to build using 16 cores/hyperthreads).
Sadly, wget2 doesn't support WARC last time I checked, but wget2 comes with a `--max-threads` parameter that together with `--mirror` and `--tries` makes it trivial to mirror even the slowest websites out there.
Edit: your parallel to `parallel` made me think of wget2 as I often see scripts that use `parallel` together with `wget` when `wget2` can be much better to use alone instead of pairing the two. Just wanted to add some context.
I don't know. I'm not a maintainer of those projects. There's no obvious reason to me other than that it's hard to adapt a legacy code base that assumes a single thread to something that uses multiple threads. IIRC, there have been multiple attempts to make GNU grep multi-threaded. I believe folks have even submitted patches to the project. I don't follow it closely enough to know why they haven't been accepted.
Sometimes the answer to "why don't they do this" is just uninteresting. :-)
If you don't have to do any complex sorting or grouping, yes a simple script works way better. It doesn't have the overhead of using the scheduler, or distributing the data into chunks on many servers. Also consider using sqlite, postgres, or your EDW if you have one. Tools like CSVkit and XSV are useful for preprocessing, exploration.
I've seen many ETL scripts written where a simple SQL statement would have been better. SQL queries tend to work after a few queries have been verified to be correct, ETL jobs in languages like java can dump mysterious stack traces referencing many frameworks breaking due to data issues, memory issues, or unhandled cases.
As the predictable discussion unfolds below (above probably?), using the best subset you can means your analytics workloads can potentially scale from sqlite to bigquery, going through spark or over to clickhouse.
SQL is everywhere and it is fast and kinda portable. I have done sqlite analytics jobs over 4-6GB databases on a desktop machine 10 years ago to generate really complex reports. It worked wonderfully and was the shortest time from raw data (xml, html, csvs) to useful results.
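That kind of single-machine SQL job in miniature, using the sqlite3 module that ships with Python (table and values made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # or a file path for anything persistent
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)

# SQL does the grouping; no ETL framework, no stack traces.
totals = dict(con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
assert totals == {"north": 15.0, "south": 7.5}
```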
> I've seen many ETL scripts written where a simple SQL statement would have been better. SQL queries tend to work after a few queries have been verified to be correct, ETL jobs in languages like java can dump mysterious stack traces referencing many frameworks breaking due to data issues, memory issues, or unhandled cases.
This is so true. I write data pipelines at work. I only use SQL to move data around for this very reason. You never get any bug to fix with SQL. You sometimes need to adjust some queries to some new bad input, but never ever will you see 'index out of bounds' or 'wrong argument' or whatever.
The mis-use of "big data" tools like Hadoop to do with entire giant clusters of systems what can be done more quickly on a MacBook Air encapsulates quite a bit of what is wrong with all modern software development: overengineering, dick size based infrastructure decisions ("look at how big my cluster is!"), stacking up tools to pad one's resume, mindlessly copying what bigger companies do, needless layers of abstraction, etc.
Look up the largest hard drive of any type that you can find for sale. Now spec out a consumer or small business grade NAS with 2-4 of those drives. If your data will fit there, you do not have "big data." If the cost bothers you consider that the cloud footprint (or on-prem mini data center) required to use your big sexy "big data" approach will cost far more than one of those NAS systems, possibly every month.
The only real exception is if you need performance and the computations you are doing are CPU bound or highly parallelizable. If you need rapid turnaround you may want some kind of distributed replicated cluster approach that can do things in parallel. For the majority of jobs though these are periodic or internal facing analytics jobs and getting the results faster is not worth 10X-100X the hardware cost or cloud bill and 10X the developer time.
Hmm, no mention of GNU Parallel — that's my go-to replacement for xargs. It has numerous options to treat input as blocks of text, lines, etc and can even run in a distributed fashion over SSH.
Bash is a godsend for quick debugging and I can see the temptation to start writing production code using bash. It basically boils down to a few things IMO:
- large bash scripts are hard to read/maintain
- complex modelling chains need intermediary points in the processing
On the latter point, I can't count the number of times where being able to query an Athena database has saved a lot of headaches. The overhead from parquet and AWS bills pales in comparison. I'm sure almost everyone already agrees with me here, but it's a classic case of the whole being more than the sum of its parts.
At my old company, Hadoop, JSON, and Parquet were solutions looking for a problem to solve. Yes, we handled lots of data, but none of it was time critical. No matter, we had BIG DATA™. Bow when I say that. We weren't Google or Twitter. We weren't surgically retrieving data, and while we had lots of data, we didn't have huge data centers; Dremel wasn't relevant to us.
It's funny! It would take our Hadoop team three weeks to get data together for our use. I often didn't have three weeks. In those times when I needed to use the data from the previous day, I'd just grab the raw data, organize it, process it, and be done with it in a few hours, using Unix/Linux tools and a bit of mathemagical wizardry.
"You're supposed to use Hadoop."
"You wanted to know what happened yesterday."
"I did!"
"If you want it from Hadoop, it will be ready in three weeks. Probably. That's if they have everything done."
- be a clone/repo for disparate databases so you don't need to figure out access/security/location or impact production systems
- an interface to management types that aren't technical or don't have tech people to do these things
- should provide a "librarian" knowledge of the enterprise's data and data sources
- should have knowledge on how to analyze data using different tools
- be able to schedule movements/reports and manage that
If you don't need any of that, then ... yeah, don't use it. But those sets of requirements should be useful to anything that deems itself an "enterprise".
There is no point running something like Hadoop or Spark or Flink if your problem is not bottlenecked by the resources of a single machine.
E.g. how would this solution work if the data were much larger, or the problem required more context (looking at more than one line at a time) and therefore more memory?
Answer: it wouldn't.
That's it. For a long time, what a lot of companies that jumped on the big data bandwagon didn't realize is that they didn't actually have big data.
I'm sure there's some arbitrary lines that can be drawn, but anything under 1 TB is not big data anymore and can be processed on a single machine.
Another thing to consider is data volume, though. A popular use case for Hadoop was to take e.g. server access logs - high-volume data that traditionally you wouldn't hold onto for long - and get something meaningful out of it.
At this point in time (2022) I consider everything below say 40TB not big (textual) data at all. It can be compressed 40TB -> 10TB (or less) and that fits fine on a single 16T drive.
For many questions, you won't need all the raw data, so you end up with some form of projection of the data that is maybe 1/10 in size, so 10TB -> 1TB. Heck, if you tune GNU sort a bit, it will blast through that TB quite quickly.
If you just want cold storage, you can put 10TB of compressed textual data on a spinning hard drive. If you want to run some processing of that data within a workday, you need multiple drives in parallel (still possible on a single server). However, if you want to process it in less than 30 minutes, you need a cluster.
If you can fit it into RAM on a single machine, you probably shouldn't be using complex distributed systems for working with it. The developer time spent setting it up and fixing arcane bugs due to the distributed system is likely to cost you more than a single monster server will.
If your processes generate large amounts of data rapidly, you can get a decent idea when you should start working on a move to a "big data" solution by looking at your rate of growth.
...though in my experience, stuff like logs can be tarballed and tossed on S3 as insurance just in case you really do need to get an overview of some pattern over the past three years. Mostly only about the past six months of logs are really worth keeping on hand actively, IMO.
I wonder what the best way is to process huge amounts of time-series data (several thousand records per second). When I search for "time series database" I get things like MySQL, InfluxDB, TimescaleDB etc, but all of them are way too powerful and have their own query language, storage engine, and so on, which are hard to learn and manipulate. If you run out of storage/memory/CPU, or want to query something the database doesn't support, there is no way out without waiting for the database vendor to provide some new feature.
Now I'm just writing all data to plain text files, one JSON object per line, and query and process them with cli tools like jq. Regular compression tools like zstd and pattern matchers like grep work. All the Unix philosophy applies and it's easy to do anything I want without being restricted to the features of a certain database.
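One nice property of that one-object-per-line layout: when a query outgrows a jq one-liner, any language can take over with the same streaming pattern. A sketch with made-up field names:

```python
import io
import json

# Made-up sample records, one JSON object per line (what jq would see on stdin).
lines = io.StringIO(
    '{"ts": "2022-01-01T00:00:00Z", "sensor": "a", "temp": 21.5}\n'
    '{"ts": "2022-01-01T00:00:01Z", "sensor": "b", "temp": 35.0}\n'
)

# Rough equivalent of `jq 'select(.temp > 30)'`: stream line by line,
# holding only one record in memory at a time.
hot = []
for line in lines:
    rec = json.loads(line)
    if rec["temp"] > 30:
        hot.append(rec)
```

In real use `lines` would be a file handle (or `zstdcat file.jsonl.zst` piped to stdin), and the loop shape stays the same regardless of file size.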
TimescaleDB "upgrades" fairly gracefully from regular PostgreSQL; there's a bunch of Timescale-specific functions, but it's not that much: mostly it's just regular SQL. I don't know how well it performs with thousands of records per second though.
I have a similar issue. I've been using feather files/parquet files for storage, and just using pandas to do analysis. There is an issue where the initial load/convert essentially doubles the memory usage of the file itself as it converts to a pandas dataframe. This can be avoided if you use a feather file and follow its recommendations for a zero-copy conversion (no NaNs/nulls).
I think it's a bit more flexible than using cli tools since you can set some sort of time index and query specific timeslices fairly easily
25 years ago my company needed to match lists against a collection of numbers not to call. We probably ran lists of 20k daily or more.
So I wrote up a simple python program to match against a special format txt file. A few seconds for big lists at most.
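A hedged sketch of why that kind of job finishes in seconds (the file format, numbers, and names here are invented): load the do-not-call numbers into a set once, and every lookup after that is O(1).

```python
# Stand-in for loading the special-format txt file; real code might be
# something like: dnc = set(open("dnc.txt").read().split())
dnc = {"5551234567", "5559876543"}

# A day's call list to scrub (invented numbers).
call_list = ["5551111111", "5559876543", "5552222222"]

# Set membership is a hash lookup, so scrubbing even a 20k-record list
# is effectively instant.
ok_to_call = [n for n in call_list if n not in dnc]
```

The database version below presumably paid a per-record round trip for the same membership test, hence records per second instead of per microsecond.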
Higher-up IT people were horrified, and commissioned a proper oracle soliton. It ran at around 1-2 records per second, so all day for a typical list. They were quite proud of this.
I think they spent 25-50 grand on IBM support to get it setup.
They were not happy when we said we would only consider using it if they could make it 5000 times faster.
I know this was a typo, but the idea of Oracle upselling someone on a quantum field theory based solution to a string matching problem amuses me greatly.
I went to a Hadoop workshop in 2016 where the speaker was insistent Hadoop would replace traditional relational databases in the next 5 years. It’s been six and I think the death of relational databases has still been greatly exaggerated.
Having done OLAP and used other DB models like ISAM and xBase, I am a firm believer that relational databases will keep being the horse worth betting on.
Whatever else keeps popping up against them is only useful in special use cases, not needed for 90% of the common ones - and even then, it isn't like relational database vendors are frozen in time, never improving their products.
Things look to be quite cyclic; ten years ago, NoSQL was pushed as The Future, but these days companies still build their core data on top of a traditional relational database. I don't even know which NoSQL is even popular. It feels like it's been pushed to the fringes, e.g. key/value stores, or the Q in a CQRS setup where it acts more as a readonly cache.
"newbie", maybe get a look outside, MongoDB is used by very large and successful companies.
Fortnite which makes multiple $B a year is heavily using MongoDB.
Processing 1.75 GB of data this way is still taking a plane to the supermarket down the street. Hell, perhaps just to the fridge. It is nowhere near big data when a ten-year-old low-end phone can hold it in memory.
Depends if it's just another store within walking distance. I think the gist of the GP comment was that they use a big tool for a small job. Many small jobs don't make the big tool any more useful.
I did a vaguely similar thing once. We once had an EMR/Hadoop-based process that simply joined some thousands of few-KB files from S3 into one big file totaling 20GB or so. It worked, but took a while. After a bit of mucking with it, I noticed that the built-in tools for it used the wrong S3 API for the purpose. I wrote a simple dumb single-threaded Ruby script using the right API, and it ran like 4x faster IIRC.
Same lesson - before you reach for Big Data tools, make sure you've fully explored simple conventional solutions.
I've had to ingest and explore enough dirty CSV/TSV data that I use CLI tools only to get a first glimpse of what's there.
Whenever I face anything non-trivial I convert CSV to parquet, and preferably write intermediate parquet data sets. Then it's DuckDB / SQL queries for slicing and dicing.
I have yet to encounter an awk/uniq/sed combo for intersecting a bunch of CSVs with 30+ columns that is readable to a newcomer.
On the other hand, parquet becomes a turtle if one tries to squeeze, say, 12k numerical columns into it.
> On the other hand, parquet becomes a turtle if one tries to squeeze, say, 12k numerical columns into it.
I thought parquet was columnar stored? Is this a fault of parquet or just the sheer number of columns trying to get accessed?
I agree with your general premise though. I'd rather take a dirty dataset, throw it into S3, spin up a Redshift cluster, do what I need, spin down the cluster. You can work with billions of records fairly easily with plain old SQL and c-store databases.
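The "plain old SQL" workflow looks roughly like this. A stand-in sketch: stdlib sqlite3 here plays the role of S3 + Redshift, and three fake rows stand in for billions of records (sqlite3 is a row store, not a c-store - the point is only the load-then-query shape, not the engine):

```python
import sqlite3

# "Load" the dirty dataset (in real life: COPY from S3 into the cluster).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 2.5), (2, 7.0)],
)

# Then every question is just SQL.
top = con.execute(
    "SELECT user_id, SUM(amount) AS total FROM events"
    " GROUP BY user_id ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # → (1, 12.5)
```

On a c-store engine the same GROUP BY only reads the two columns it touches, which is what makes billions of rows tractable.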
This article is such a good demonstration (IMO) that computers are really fast, but that we have gotten really bad at actually making reasonable use of them.
I've seen this article before and still consider it valid, especially for one-offs. Another similar and valuable resource is "https://datascienceatthecommandline.com/" Data Science at the Command Line, by Jeroen Janssens. Free to read at the link.
I hate the failure semantics of cron. Unreasonably hard to get right or to not run duplicate jobs. I’ve replaced it with systemd timers and never looked back.
NFS is such a pain in the ass for this use case. If the client crashes, everything’s corrupted. If the server crashes, everything’s corrupted. You can’t even atomically write 4k with O_APPEND. Why anyone would choose to build distributed systems your way is beyond me
I remember when this first was posted and laughed. My investment in learning unix tools has helped me a ton just as a single dev working on small datasets but I can totally see how it could work here.
This is what we've found. For a certain kind of simulation, we can process data on a single core at ~0.5x line rate (I think this could be a lot faster if we took this problem seriously), so if we move to some arrangement where hosts have to download the data in order to process it, they can only use 2 cores each, and every 2 cores we want to use requires another storage box to download the data from! Paying for hardware to increase "line rate" and sending zstd-compressed files over the wire might get us a factor of ~60 here, but even then we pretty rapidly run into issues where the data has to be placed on a bunch of distinct hosts in order to be useful, since 120 cores is not a lot of machines.
In the future we expect to have workloads that do less computation per piece of data, which makes the need to move our compute to our data much more acute.
How many sends are we talking about? Into how many messages is the data turned, how often does it get sent around?
If I send 1MiB of data by packing it up into messages of 10 bytes each, it will likely be slower than sending 10MiB in a single message. Messages == overhead. Envelopes, packing, unpacking, parsing, assembling, etc. all eat up cycles.
This is a distributed sum. You don't need extra messages being sent around. You send 7 messages giving each node 1/7th of all games. Then 1 message per node is sent with all the tallies to be merged together. And then maybe an extra message to give you the results if you weren't the one that merged it. 15 messages shouldn't take 26 minutes to send.
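The message pattern above can be simulated in a few lines (toy sketch: the article's data is chess game results, so that's what the made-up records are, and the "messages" here are just function calls):

```python
from collections import Counter

# Made-up chess results; imagine millions of these.
games = ["1-0", "0-1", "1/2-1/2", "1-0", "1-0", "0-1", "1/2-1/2", "1-0"]

# 7 messages out: each node gets 1/7th of all games.
nodes = 7
shards = [games[i::nodes] for i in range(nodes)]

# Each node tallies its shard locally - no cross-talk needed.
local_tallies = [Counter(shard) for shard in shards]

# 7 messages back: merge the per-node tallies into the final counts.
total = sum(local_tallies, Counter())
```

Fifteen or so small messages total, which is why the 26 minutes can't be explained by communication overhead alone.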
In this case the dataset was small enough to fit in my laptop's DRAM without straining it. If we assume the 7 machines are in the data centre, that means the two numbers to compare are the main memory reference (100ns) versus the data centre round trip time (500us). That's a factor of 5,000.
If your intuition told you those 7 machines are going to be faster, then you really should invest the time to internalise those numbers. The article is 100% correct - every programmer should know them.
I think you are missing my point. It's a difference between 12 seconds on a laptop and 26 minutes on 7 c1.medium instances. Yes there is overhead in shipping the data around, but there isn't 26 minutes worth of overhead.
If a c1.medium can process at the same speed as his laptop it should take less than 3 seconds worst case. And yes 7 machines should be faster at this scale.
> If a c1.medium can process at the same speed as his laptop
As it happens, I got to perform that experiment. Sort of. I was moving stuff on physical Dell hardware to a virtualised environment, at the behest of MBA's. I was a bit concerned about it as we pushed the existing hardware hard - it had overnight stuff it had to get finished by morning. It had a 30% buffer.
It wasn't even close. The "virtual environment" was 15 times slower. It wasn't the CPU. That was slower, but not by much, and it could have been remedied by firing up more VMs on different physical machines. It was disc storage. They insisted their big beefy SAN they spent 100's of K on would be faster than a locally connected SSD. But a SAN operates over a network, while SSDs sit 150mm away on a multi-gigabit link. The I/O was mostly random.
If they had consulted "Latency Numbers Every Programmer Should Know" they could have predicted the outcome - just as I had.
BTW, I also benchmarked the laptop. Despite what you apparently think, it isn't much slower at doing a single task than the big Dell server, and I wouldn't expect it to be. The one difference is the Dell server comes with a factor of 10 more RAM, SSD, HDD and networking, and it seems to be able to drive all those devices at full speed in parallel. But that would not have helped in this instance as the task is too small. I know from experience that for this sort of task my laptop will easily outperform a c1.medium.
This is one of those "it depends" things. As usual, it's good to build an intuition for what you expect. But it is also worthwhile benchmarking it now and again.
At a previous job, we were pushing against going to "networked temp disk" instead of "local temp disk", on the assumption that local storage would be faster than remote storage. But, actual benchmarking showed that the "networked temp disk" was about two to three times faster. Mostly because almost all IOPS on each machine went to servicing network disk requests, so trying to squeeze in on one machine's IOPS caused IO stall times that trying to squeeze into N machines' IO queue didn't see.
It's also an "are you mostly reading consecutive blocks" versus "are you doing essentially random, scattered reads" question (for read workloads; write workloads are a bit different, as it approaches impossible to speculatively write data that has not yet passed through a write(2) call).
It often isn’t about the execution time, if it’s a daily job and takes 26 minutes who cares?
Using something like databricks means it is easy to schedule and manage jobs, easy to write jobs that work in good enough time, easy to troubleshoot when things go wrong.
It comes with a well documented security mode and a support contract when needed.
Developers can be onboarded quickly and work code reviewed and managed.
Logging and diagnostics are available and you can report on metrics easily.
That isn’t true with custom data pipelines written in shell scripts.
The value isn’t in the pure execution time, it is in everything around it.
Processes that take ages are never "easy to write" or "easy to troubleshoot". They turn into shitty code, because you won't get any sane person to spend weeks instead of days on that. It's bad for morale.
It's the same narrative people use to justify many complex or slow things.
> The value isn’t in the pure execution time, it is in everything around it.
Especially for those of us who get paid for developing, designing, supporting, architecturing, meetinging, catering, conferencing and maintaining everything around it :-)
> That isn’t true with custom data pipelines written in shell scripts.
Why not? Cron can schedule jobs, the documentation for standard shell tools is among the best, almost every developer can handle bash scripts, and logging can be done via syslog.
This is like the classic Dropbox hackernews comment:
> For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
Sure, you can stitch several services together and it will work for your needs, but for most users there is a benefit to a centrally managed and complete solution.
Most users who have to tackle actual big-data problems, meaning analysing things on the order of several TB or more?
Sure, they will absolutely benefit.
But there isn't just big data. There is also little data, where what is analysed is on the order of a few GiB or less, and everything in between.
I am not saying "use shell for everything!" I am saying "the right tool for the right job". A 15t excavator is probably not a good choice if I want to plant a small tree in my backyard, and a gardening shovel will probably not serve me well when I wanna start building a scyscraper.
That's pretty much how Dask works - except Dask doesn't throw away data types by passing newline-delimited strings between each partitioned IPC pipeline process, nor depend on the new hire appropriately trapping errors in their shell script, leaving log messages that are chronologically non-sequential and text-only.
> This package provides a JupyterLab extension to manage Dask clusters, as well as embed Dask's dashboard plots directly into JupyterLab panes.
Something like ml-hub allows MLops teams to create resource-quota'd containers with k8s and IAM, though even signed code can DoS an unauditable system with no logs of which processes ran which signed archive of which code at what time, with bash and ssh.
https://github.com/ml-tooling/ml-hub
Outside of legacy systems, Hadoop isn't widely used anymore.