Maybe we need a reminder every now and then that text files are quite capable; apart from being faster to parse in a lot of situations, they have the benefit of being "future safe" and easy to back up and compress as well.
The common text format for lots of this, CSV, is absolutely awful. It's fine until it totally isn't; it was just the best option for a lot of use cases. I'd argue it's now been entirely replaced by parquet: significantly faster, broad support, proper types, and more.
If you don't already have parquet deployed, there's a wide gulf (in skill set, overhead, etc.) between CSV and parquet.
If you don't mind being old-school, the data is ASCII text, and you're tired of some of CSV's little issues, then ASCII has the FS, GS, RS, and US control characters - specifically intended for such uses. Micro conceptual overhead, none of the CSV issues which screw up many *nix text-handling programs and little script files, and a decent modern filesystem can handle the compression separately.
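A minimal sketch in Python of what that looks like (the helper names are mine; the US and RS byte values come from the ASCII standard):

```python
# US (0x1F) separates fields, RS (0x1E) separates records. Commas, quotes
# and newlines inside fields then need no quoting or escaping at all.
US = "\x1f"
RS = "\x1e"

def encode(rows):
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["id", "note"], ["1", 'contains, commas and "quotes"\nand a newline']]
assert decode(encode(rows)) == rows
```

FS (0x1C) and GS (0x1D) sit one level up again, for separating groups of records and whole files.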
For many cases of passing data between systems, I'd say the gulf is dramatically lower than it has been until even fairly recently. Support for many standard packages is just there, swapping out "read_csv" for "read_parquet" if you're a pandas shop may even be enough. More and more tools read these directly, and it opens up a whole load of better options for processing data.
> If you don't mind being old-school, the data is ASCII text, and you're tired of some of CSV's little issues, then ASCII has the FS, GS, RS, and US control characters
I have used those before, and yet I still had those characters appear in data. The only places I'd ever seen them were in the wiki page and in customer delivered data. Absolute pain to dig through and remove.
On top of that "if your data is ASCII" is something I'd be nervous about for many use cases even if it is right now.
Beyond that, then you need everyone to swap out their parsing to use those characters.
CSV is fine until it totally blows up in your face. All it takes is one "oh it's fine we'll use awk" stage somewhere or a CSV parser that isn't good enough and one person to put a newline where nobody had expected it before.
> I have used those before, and yet I still had those characters appear in data...
Oh, yes - which is why I emphasized "is". But ASCII text is easy to test for, which lets you fast-track into exception handling - "Tell Sales that Customer data is not as represented", "Trouble-shoot internal data source", etc.
(My experience is that substantial Customer data is never, ever as initially represented. Nor as represented after you point out the first set of issues with it. Nor as represented after you point out the second set of issues. Nor as...)
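For what it's worth, the "easy to test for" part can be one small function - a sketch where the allowed set is printable ASCII plus common whitespace and the four separator characters:

```python
# Accept printable ASCII (0x20-0x7E), tab/LF/CR, and FS/GS/RS/US (0x1C-0x1F);
# anything else routes the payload into the exception-handling path.
ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D} | set(range(0x1C, 0x20))

def is_clean_ascii(data: bytes) -> bool:
    return all(b in ALLOWED for b in data)

assert is_clean_ascii(b"name\x1fvalue\x1e")          # separators are fine
assert not is_clean_ascii("na\u00efve".encode("utf-8"))  # non-ASCII bytes are not
assert not is_clean_ascii(b"\x00binary junk")        # neither is NUL
```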
> Oh, yes - which is why I emphasized "is". But ASCII text is easy to test for, which lets you fast-track into exception handling - "Tell Sales that Customer data is not as represented", "Trouble-shoot internal data source", etc.
Oh, with that I mean the ASCII control characters appearing in inputs. So some columns would have record-end markers in them, for example.
If I'm able to make everyone dealing with the reading and writing add specific characters to be used for start/end/etc I'd rather just tell them to swap to a parquet reader unless they've got a really good reason.
As someone in the process of setting up a pipeline and currently using pandas.to_csv as my output, I'm curious what makes you recommend parquet in particular? How does it compare to HDF or Feather?
It depends on what you're trying to optimise; parquet is a very good all-round option. HDF I've never really gotten into, as it always felt like a good solution only if I move everything over. It's great if your use case fits.
Feather is a layer on top of arrow and was a proof of concept (so I'm not sure how heavily it's used now), and arrow is fast becoming the interchange format. It's exactly laid out as things will be in memory - which means zero copy for shuttling it around from one place to another. I _think_ there is less support for feather but that is likely changing as everything converges.
Parquet should be
* Faster to write
* Faster to read (even if you're reading the whole file, which actually isn't required, the format helps you read just sections of the columns you need)
* Smaller
* Better at handling actual floating-point values
than CSV, while having actual standards alongside it. Be a little wary of letting pandas guess the right column types for you if you're creating partitioned files, btw.
When you're working with pandas, etc (check out Dask) you can pretty much just swap out some reading and writing functions. You can also use pyarrow directly if you need to be very careful about column types.
For your use case you may want to explicitly use a single column for the features that is a list, I'm not sure if that's better/worse than having so many columns. If a reader may want to find just some images where a small subset of features are > X, you might benefit from multiple columns so that the reader only processes the data it needs.
Worth testing out, but I expect you should be able to try it out in an afternoon if you're already working with pandas/similar. Just install pyarrow and use a to_parquet. Things like dask (or straight pyarrow) give you partitioned files as output if you want too, if there's a useful column or columns to split on https://arrow.apache.org/docs/python/parquet.html#partitione...
Just a heads up, pandas to_csv is laughably slow. You're literally better off converting the pandas df to arrow and then serialising to CSV with arrow, though arrow's CSV support is quite limited.
Well at least it’s better than numpy’s csv functions. For my use case (~4 million images, one row per image, 2048 fp32 columns for features extracted with resnet), parsing/encoding csv with pandas was 10x faster than with numpy. Of course serializing the arrays to/from npy was 10x faster than any csv library.
Yes - parquet tooling is non-existent compared to CSV particularly on the command line. And the cross-language/platform support is a mess - good luck reading Pandas generated parquet on .NET or in a (non-spark) JVM environment.
There are many reasons why CSV is flawed for the purposes of storing tabular data (e.g. loss of column type information) but the alternatives are just so unergonomic that CSV remains a viable choice in many situations.
Column-oriented formats such as Parquet can be awful too. For example, if you have five columns of numbers you want to multiply together, Parquet is going to be the worst option available because of the extreme cache thrashing that the CPU will encounter.
Structure packing[1] and consideration of locality of reference[2] would need to be applied for high performance applications where a computer scientist has considered the algorithm needing to be implemented and the most efficient data format that the source data would need to be provided in.
Exactly. Basically most big data lives in HDFS, and HDFS is part of Hadoop. Even if you use Spark or Flink, I would classify that as using Hadoop (under the hood).
Aside from Apache Spark, what's replaced it and does it still face the same speed of access limitations compared to just zipping through giant CSVs with awk or whatever streaming APIs you write with your own preferred language?
I use Spark for a number of jobs for language-specific features still but I think within 2 years all custom code will be trivially invoked as native UDFs in SQL data warehouses (ie Snowflake, which has essentially solved big-data performance as a going concern).
I just write SQL in Snowflake and it replaces 95% of what I would otherwise have done in custom MapReduce or Spark code.
What I really dislike about modern cloud DWHs such as Snowflake is that they hide a lot of things from me. Since I'm not a CTO who worries about not delivering, but a junior DE who actually wants to learn things, I really prefer the old ways, where we had to manage our own infrastructure and our own code for ETL. These kinds of things cannot be learned "just for fun" because one has to work in a real environment.
What do you mean? Of course you can still learn them "just for fun" if you want. There are plenty of columnar data warehouses (memsql, greenplum, vertica, clickhouse, etc) and data processing frameworks (spark, flink, etc) that you can look at, implement and run yourself.
What I'm saying is that you can surely learn the basics from personal use, but it's completely different from real usage, and that can only be trained on the job. And those jobs are getting fewer as everyone moves to the cloud.
SQL will always be faster than Hadoop and MapReduce. The main reason to use those other, slower services is that developers are not used to SQL or declarative programming, and insist on having the code written in a procedural way.
That's completely backwards. Mapreduce-like approaches are how SQL datastores are implemented underneath; the absolute best case for SQL is to equal hand-tuned mapreduce-like performance, and often it will be slower (you're at the mercy of your query planner to pick the right indices, do joins in the right order, etc.). The main reason people use SQL is because they find it easier to express a query that way (which is completely legitimate - if your query planner is good enough most of the time, you've got better things to be doing than hand-tuning your query execution).
No, that does not seem correct. SQL Datastores are not "map-reduce underneath", they have optimized datastructures for efficient querying (i.e. indices). Map-reduce is equivalent to those cases in SQL database where you have full table scan in your query plan - basically brute-forcing your way through the dataset.
You can (and often should) have indices in a map-reduce situation as well - you just build them in an explicit, visible way. But in most of the relevant use cases you're doing some kind of aggregation over the whole table, so indices don't help any.
And if your primary use-case is column-wise aggregation over the whole table, in SQL you'd use a (compressed) column store rather than a row store as your table storage method.
To be fair, Parquet, which is commonly used in Big Data solutions is a column store format. So, once you normalize your data and save it as Parquet you can have efficient column-wise aggregation - but that assumes some preprocessing step.
That makes no sense. SQL is a query language, commonly implemented by relational databases.
In the early 2000s, columnar relational data warehouses were not sophisticated and scalable enough to handle the scale of data encountered at Yahoo, Google and other internet companies. MapReduce (and the many evolutions of Hadoop ecosystem) was created to scale processing through low-level instructions and algorithms.
Eventually columnar data warehouses caught up and are now capable of handling petabyte scale, regardless of whatever language you use to query them. The fundamental storage and compute primitives haven't really changed that much, just offered in a much more user-friendly way now.
SQL itself is just a query language, it's the underlying cloud based data warehouse that fulfills the role of what map/reduce used to do in terms of parallelization transparently.
1) Why would you want to maintain your own Spark infrastructure? Spark on Kube is a huge improvement over YARN but you still have to deal with OOMEs, filled disks, Kube upgrades, pushing custom images to container registries, etc etc etc.
2) Snowflake is probably 10-50x as performant as Spark for data manipulation. I don't know what kind of unholy demonic incantations Snowflake is doing on the backend to support their SQL performance, but it's really freaking fast. There's just no other way to cut it.
I've spent 5-10 years eking every ounce of performance I can get out of a Hadoop/Spark cluster. I'm not trying to be unreasonable about this. I would love for OSS to be competitive; it's great for the world, and it would be great for my skill set and earning potential.
But it's not a contest, and if you think standalone Spark is going to be a viable competitor in a couple years, you are deluding yourself. Make informed choices about your career and investment.
A lot of articles I read about snowflake involves data vault which is a massive turn off. And when their tech lead (Kent Graziano) is a prominent figure in the DV bullshit...
Snowflake and DV have no interdependency whatsoever. Snowflake is just a database. Whether you use DV to model the data inside of it or dimensional modelling or "big wide tables" is completely up to you, there's nothing about it that requires or benefits DV in particular.
You should try Databricks, especially the new Photon engine powering Spark. In general more performant than Snowflake in SQL and a lot more flexible. (There are some cases in which Databricks would be slower but the perf is improving rapidly.)
Databricks has an extremely bad API. So, sure, your Spark jobs might be a little bit faster some times, but why would you use it if you can't even read logs of running jobs?
Databricks is amazing, the Delta Live Table technology is incredible. It's very hard to approach problems like Data Lineage and Data Quality, but that platform does it in the right way.
My only concern is that they offer just a managed cloud product. That's cool for startups, but large enterprises sometimes need more governance and ownership than that.
I spent a while putting together a POC of Flink for a use case at client behest a couple of years ago. Really struggled with it. Seemed slow as heck too. I'd much rather write a few hundred extra lines of my own code to do stuff like that (ingest -> process -> output). Modern software fashion trends seem to create as much work as they try to save, as extensive and magical as a lot of the features are.
Whether a technology can replace Hadoop in an organization depends on many factors, but some technologies that solve at least in part similar problem are Apache Storm, Spark, Flink, Kafka Streams,
and maybe BigQuery?
Or, as the original article says, some companies just use some command line tools, shell scripts.
It's been a couple of years since I was interested in Data Engineering, so my knowledge on this topic is some years behind.
I've not seen Storm being used anywhere sane for a few years at least now, and from a glance at job postings it looks unlikely. Spark, Kafka Streams etc. are definitely used in a modern data platform from my experience.
I think we're seeing a big shift with Hadoop-like workloads being moved onto cloud providers, so BigQuery, Amazon EMR etc.
I'm curious what constitutes "big data" anymore. In an intermediate machine learning course, we train on nearly a petabyte of data using Google Colab and Jupyter Notebooks. Nobody discusses the size of the data requiring any special treatment due to its size... would not 95% of a petabyte be "big data"?
What course are you taking? Imagenet is only 150 GB, and Common Crawl is only 320 TB.
Big data is a moving target, but I'm comfortable defining it as data too large to fit in memory. Obviously, you can always get a bigger node; my rule of thumb is that if you need generators, you are working with big data.
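The generator rule of thumb in miniature - stream an aggregate line by line instead of loading the file (toy file here, obviously; "big data" is the case where the non-streaming version dies):

```python
from pathlib import Path

# Toy stand-in for a file too big to load at once.
Path("numbers.txt").write_text("\n".join(str(i) for i in range(1000)))

def values(path):
    """Yield one integer per line without materialising the whole file."""
    with open(path) as f:
        for line in f:
            yield int(line)

total = sum(values("numbers.txt"))  # peak memory: one line, not one file
assert total == 499500
```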
Glue often uses EMR under the hood, which is often Spark. And Athena is PrestoDB, as far as I know it has nothing to do with Hadoop other than you can use it to query Hadoop data stores.
The way I see it, Hadoop is still in common use as the storage layer for Spark and related implementations, whether that is in the form of HDFS or something like EMRFS:
Quote from AWS: "EMRFS is an implementation of the Hadoop file system ..."
Ah - that's a good point. Usually when people say "Hadoop", I assume they're referring to HDFS, but there is (was...) the Hadoop MR system that ran on top of it that has been almost entirely replaced by Spark.
> What about if you had 500 petabytes of data to analyze?
Then we probably need the big tools.
> Big data does exist in the wild.
So does little data.
The problem is that a "one-size-fits-all" approach has become common, not just in data analysis; think of all the low-medium traffic webpages that use giant, complex frameworks and huge distributed systems just to display what is essentially a small CRUD app that would have been a LOT easier to cobble together in plain JS on a simple LAMP server.
What the article shows is the importance on deciding for the right tool for the job: When I want to plant a little tree in my backyard, bringing one of these
https://upload.wikimedia.org/wikipedia/commons/0/01/Bucket_w...
to dig the hole is proooobably overengineering it a tiny little bit, and will likely take longer than getting a shovel.
Those 500 PB imply a competent staff, a world-class infrastructure, and worthwhile applications. These things don't materialize suddenly: they start maturing when the data set is 500 GB or even just planned and they usually mature enough to switch technology multiple times as the scale rises.
The article still applies to the things which have replaced Hadoop. The operational, development, and computational overhead of distributing/re-aggregating work remains huge.
Large-scale storage clusters like Hadoop, Cassandra, ElasticSearch are generally slow, expensive, hard to set up properly and require a lot of monitoring to stay healthy. Use them only when other solutions won't do.
If your data will fit in a set of text files or a cluster of relational databases, use those. Even if you plan on storing a gazillion TB of data, it's faster to iterate app logic on nimble storage solutions first, before entertaining something bigger.
Where the large scale storage clusters shine is when the sheer scale of data won't fit in anything else, i.e. there's no other (sane) choice.
> Where the large scale storage clusters shine is when the sheer scale of data won't fit in anything else, i.e. there's no other (sane) choice.
And when people say "won't fit in anything else", do explore your options before! RAM storage can be very, very, very big and the cost for using it is minuscule compared to what it used to be!
On that note, there is this great website for seeing if there are servers that can handle fitting all your data in RAM: https://yourdatafitsinram.net/
Systems like Power System E980 can handle up to 64TB RAM, which is a lot of data. Just like parent said here, do try to fit things on machines like this before even trying out large-scale storage clusters, because they are a hassle to deal with and generally not worth the cost.
Edit: For the Cloud-hosting crowd out there, I took a quick look at https://instances.vantage.sh/ and sorted it by memory. Largest instances you can get at AWS is "U-24TB1 Metal", which comes with 24TiB of RAM (and 448 vCPU so the computation itself gets as fast as possible too [granted you can parallelize it]), which should fit most cases of "big data" I've seen in the wild today. Unsure about running costs though, but I'm 90% sure it's cheaper to have that instance for a couple of hours than the cost for having to deal with storage clusters, which are also slower.
>> And when people say "won't fit in anything else", do explore your options before! RAM storage can be very, very, very big and the cost for using it is minuscule compared to what it used to be!
That's why the "big data" industry also encourages collection of absolutely every bit of data you can find. They want you to need their tools. You may not think there's a use for it, but vague promises of AI finding needles in the haystack are used to get you to keep it and bloat your system.
This is why the first, possibly most important phase of a data analytics job is to filter down to the subset of the data that you actually need to compute the result.
Collecting it all on the front end can be very effective, as long as you can easily filter. If you can't, you are just causing yourself more problems.
> Systems like Power System E980 can handle up to 64TB RAM, which is a lot of data.
Doesn't that kind of system cost millions of dollars?
Hadoop and the like are popular because they run on COTS hardware that costs peanuts. It hardly makes any sense to argue that spending over a million dollars is enough to drop a solution that's primarily adopted because you do not have that kind of money to throw at a problem.
You could also use SSI (single system image) system architectures, including distributed shared memory, to seamlessly (at least wrt. software implementation) scale up from efficient single-node processing to an arbitrarily large, clustered system. AIUI this is how things are traditionally done in mainframe processing, and it is also regarded as a very meaningful possibility in HPC.
That concept has fallen out of favour in HPC, but it'll be interesting to see if it makes a comeback.
As recently as 2016, an SGI UV3000 rack-scale SMP machine with 16TB of RAM was the sort of thing you'd see on a trade-show floor - now you can get that much memory in a 4U chassis, and things will only improve further if/when Optane DIMMs take off.
I can share with you that my solar-powered Pi3B+ hosting yourdatafitsinram is holding up quite well, being linked to on HN - a funny contrast to the kind of systems it links to in the table.
Even a 10+ year old DL380G8 - can hold 1.5TB+ RAM and 24/48 cores and that hardware is dirt cheap on the second hand market.
> I can share with you that my solar-powered Pi3B+ hosting yourdatafitsinram is holding up quite well, being linked to on HN - a funny contrast to the kind of systems it links to in the table.
If all a server does is serve mostly static content from memory then it's not expected to handle a demanding workload beyond networking.
Therefore I fail to see the point of trying to downplay someone else's needs for TB of RAM just because you serve a site on a raspberry pi, as if that's the full extent of what anyone needs to do with a computer, particularly when the OP mentions big data.
I highly recommend the book Designing Data-Intensive Applications. It talks at length about different complex data systems and then warns readers that they should only use them if absolutely necessary.
> Where the large scale storage clusters shine is when the sheer scale of data won't fit in anything else, i.e. there's no other (sane) choice.
In general, this can be an effective approach, but at least fulltext search is another story.
Storing hundreds of MBs (actually, I think even tens of MBs can be problematic) in text files or a db like MySQL (whose FT engine is terrible) will result in slow fulltext searches.
Another way to speed grep up is to use something like ripgrep. Something I picked up lately, as a noob with bash scripting, is that you can run a whole bunch of things from a single bash script and they will be automatically allocated to different cores on whatever node/machine you are on. Just append `&` to each line and stick a `wait` at the end and voila, you have a very hack-y but robust way of running a bunch of processes in parallel.
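The same fan-out-then-wait pattern translates directly to Python's standard library for anyone who'd rather not do it in bash (`job` is a stand-in for real per-task work; this relies on fork-style process start, as on Linux):

```python
from concurrent.futures import ProcessPoolExecutor

def job(n):
    return n * n  # stand-in for real per-task work

# Launch everything, then block until all of it finishes -
# the equivalent of backgrounding with `&` and ending with `wait`.
with ProcessPoolExecutor() as pool:            # one worker per core by default
    results = list(pool.map(job, range(8)))    # exiting the block is the "wait"

assert results == [0, 1, 4, 9, 16, 25, 36, 49]
```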
Forking in the background is one way, you can also use GNU "parallel" ("apt-get install parallel" for those on Debian and the likes).
For example I've got all my CD ripped to FLAC files, but my car only takes mp3 or wav... So I did a batch convert of FLAC to mp3, making sure to put all cores at work by piping the output of "find" into "parallel".
Some commands also allow you to parallelize directly (like, say, "make -j 16 ..." to build using 16 cores/hyperthreads).
Sadly, wget2 doesn't support WARC last time I checked, but wget2 comes with a `--max-threads` parameter that together with `--mirror` and `--tries` makes it trivial to mirror even the slowest websites out there.
Edit: your parallel to `parallel` made me think of wget2 as I often see scripts that use `parallel` together with `wget` when `wget2` can be much better to use alone instead of pairing the two. Just wanted to add some context.
I don't know. I'm not a maintainer of those projects. There's no obvious reason to me other than that it's hard to adapt a legacy code base that assumes a single thread to something that uses multiple threads. IIRC, there have been multiple attempts to make GNU grep multi-threaded. I believe folks have even submitted patches to the project. I don't follow it closely enough to know why they haven't been accepted.
Sometimes the answer to "why don't they do this" is just uninteresting. :-)
If you don't have to do any complex sorting or grouping, yes a simple script works way better. It doesn't have the overhead of using the scheduler, or distributing the data into chunks on many servers. Also consider using sqlite, postgres, or your EDW if you have one. Tools like CSVkit and XSV are useful for preprocessing, exploration.
I've seen many ETL scripts written where a simple SQL statement would have been better. SQL queries tend to work after a few queries have been verified to be correct, ETL jobs in languages like java can dump mysterious stack traces referencing many frameworks breaking due to data issues, memory issues, or unhandled cases.
As the predictable discussion unfolds below (above probably?), using the best subset you can means your analytics workloads can potentially scale from sqlite to bigquery, going through spark or over to clickhouse.
SQL is everywhere and it is fast and kinda portable. I have done sqlite analytics jobs over 4-6GB databases on a desktop machine 10 years ago to generate really complex reports. It worked wonderfully and was the shortest time from raw data (xml, html, csvs) to useful results.
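That kind of single-machine SQL job in miniature, using the sqlite3 module that ships with Python (table and values made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # or a file path for anything persistent
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)

# SQL does the grouping; no ETL framework, no stack traces.
totals = dict(con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
assert totals == {"north": 15.0, "south": 7.5}
```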
> I've seen many ETL scripts written where a simple SQL statement would have been better. SQL queries tend to work after a few queries have been verified to be correct, ETL jobs in languages like java can dump mysterious stack traces referencing many frameworks breaking due to data issues, memory issues, or unhandled cases.
This is so true. I write data pipelines at work. I only use SQL to move data around for this very reason. You never get any bug to fix with SQL. You sometimes need to adjust some queries to some new bad input, but never ever will you see 'index out of bounds' or 'wrong argument' or whatever.
The mis-use of "big data" tools like Hadoop to do with entire giant clusters of systems what can be done more quickly on a MacBook Air encapsulates quite a bit of what is wrong with all modern software development: overengineering, dick size based infrastructure decisions ("look at how big my cluster is!"), stacking up tools to pad one's resume, mindlessly copying what bigger companies do, needless layers of abstraction, etc.
Look up the largest hard drive of any type that you can find for sale. Now spec out a consumer or small business grade NAS with 2-4 of those drives. If your data will fit there, you do not have "big data." If the cost bothers you consider that the cloud footprint (or on-prem mini data center) required to use your big sexy "big data" approach will cost far more than one of those NAS systems, possibly every month.
The only real exception is if you need performance and the computations you are doing are CPU bound or highly parallelizable. If you need rapid turnaround you may want some kind of distributed replicated cluster approach that can do things in parallel. For the majority of jobs though these are periodic or internal facing analytics jobs and getting the results faster is not worth 10X-100X the hardware cost or cloud bill and 10X the developer time.
Hmm, no mention of GNU Parallel — that's my go-to replacement for xargs. It has numerous options to treat input as blocks of text, lines, etc and can even run in a distributed fashion over SSH.
Bash is a godsend for quick debugging and I can see the temptation to start writing production code using bash. It basically boils down to a few things IMO:
- large bash scripts are hard to read/maintain
- complex modelling chains need intermediary points in the processing
On the latter point, I can't count the number of times where being able to query an Athena database has saved a lot of headaches. The overhead from parquet and AWS bills pales in comparison. I'm sure almost everyone already agrees with me here, but it's a classic case of the whole being more than the sum of its parts.
At my old company, Hadoop, JSON, and Parquet were solutions looking for a problem to solve. Yes, we handled lots of data, but none of it was time critical. No matter, we had BIG DATA™. Bow when I say that. We weren't Google or Twitter. We weren't surgically retrieving data, and while we had lots of data, we didn't have huge data centers; Dremel wasn't relevant to us.
It's funny! It would take our Hadoop team three weeks to get data together for our use. I often didn't have three weeks. In those times when I needed to use the data from the previous day, I'd just grab the raw data, organize it, process it, and be done with it in a few hours, using Unix/Linux tools and a bit of mathemagical wizardry.
"You're supposed to use Hadoop."
"You wanted to know what happened yesterday."
"I did!"
"If you want it from Hadoop, it will be ready in three weeks. Probably. That's if they have everything done."
- be a clone/repo for disparate databases so you don't need to figure out access/security/location or impact production systems
- an interface to management types that aren't technical or don't have tech people to do these things
- should provide a "librarian" knowledge of the enterprise's data and data sources
- should have knowledge on how to analyze data using different tools
- be able to schedule movements/reports and manage that
If you don't need any of that, then ... yeah, don't use it. But those sets of requirements should be useful to anything that deems itself an "enterprise".
There is no point running something like Hadoop or Spark or Flink if your problem is not bottlenecked by the resources of a single machine.
E.g. how would this solution work if the data were much larger, or the problem required more context (looking at more than one line at a time) and therefore more memory?
Answer: it wouldn't.
That's it. For a long time, what a lot of companies that jumped on the big data bandwagon didn't realize is that they didn't actually have big data.
I'm sure there's some arbitrary lines that can be drawn, but anything under 1 TB is not big data anymore and can be processed on a single machine.
Another thing to consider is data volume, though. A popular use case for Hadoop was to take e.g. server access logs - high-volume data that traditionally you wouldn't hold onto for long - and get something meaningful out of it.
At this point in time (2022) I consider everything below say 40TB not big (textual) data at all. It can be compressed 40TB -> 10TB (or less) and that fits fine on a single 16T drive.
For many questions, you won't need all the raw data, so you end up with some form of projection of the data that is maybe 1/10 in size, so 10TB -> 1TB. Heck, if you tune GNU sort a bit, it will blast through that TB quite quickly.
If you just want cold storage, you can put 10TB of compressed textual data on a spinning hard drive. If you want to run some processing of that data within a workday, you need multiple drives in parallel (still possible on a single server). However, if you want to process it in less than 30 minutes, you need a cluster.
If you can fit it into RAM on a single machine, you probably shouldn't be using complex distributed systems for working with it. The developer time spent setting it up and fixing arcane bugs due to the distributed system is likely to cost you more than a single monster server will.
If your processes generate large amounts of data rapidly, you can get a decent idea when you should start working on a move to a "big data" solution by looking at your rate of growth.
...though in my experience, stuff like logs can be tarballed and tossed on S3 as insurance just in case you really do need to get an overview of some pattern over the past three years. Mostly only about the past six months of logs are really worth keeping on hand actively, IMO.
I wonder what the best way is to process huge amounts of time-series data (several thousand records per second). When I search for "time series database" I get things like MySQL, InfluxDB, TimescaleDB etc, but all of them are way too powerful and have their own query language, storage engine, and so on, which are hard to learn and manipulate. If you run out of storage/memory/CPU, or want to query something the database doesn't support, there is no way out without waiting for the database vendor to provide some new feature.
Now I'm just writing all data to plain text files, one JSON object per line, and query and process them with cli tools like jq. Regular compression tools like zstd and pattern matchers like grep work. All the Unix philosophy applies and it's easy to do anything I want without being restricted to the features of a certain database.
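One nice property of that one-object-per-line layout: when a query outgrows a jq one-liner, any language can take over with the same streaming pattern. A sketch with made-up field names:

```python
import io
import json

# Made-up sample records, one JSON object per line (what jq would see on stdin).
lines = io.StringIO(
    '{"ts": "2022-01-01T00:00:00Z", "sensor": "a", "temp": 21.5}\n'
    '{"ts": "2022-01-01T00:00:01Z", "sensor": "b", "temp": 35.0}\n'
)

# Rough equivalent of `jq 'select(.temp > 30)'`: stream line by line,
# holding only one record in memory at a time.
hot = []
for line in lines:
    rec = json.loads(line)
    if rec["temp"] > 30:
        hot.append(rec)
```

In real use `lines` would be a file handle (or `zstdcat file.jsonl.zst` piped to stdin), and the loop shape stays the same regardless of file size.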
TimescaleDB "upgrades" fairly gracefully from regular PostgreSQL; there's a bunch of Timescale-specific functions, but it's not that much: mostly it's just regular SQL. I don't know how well it performs with thousands of records per second though.
I have a similar issue. I've been using feather files/parquet files for storage, and just using pandas to do analysis. There is an issue where the initial load/convert essentially doubles the memory usage of the file itself as it converts to a pandas dataframe. This can be avoided if you use a feather file and follow its recommendations for a zero-copy conversion (no NaNs/nulls).
I think it's a bit more flexible than using cli tools since you can set some sort of time index and query specific timeslices fairly easily
25 years ago my company needed to match lists against a collection of numbers not to call. We probably ran lists of 20k daily or more.
So I wrote up a simple python program to match against a special format txt file. A few seconds for big lists at most.
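A hedged sketch of why that kind of job finishes in seconds (the file format, numbers, and names here are invented): load the do-not-call numbers into a set once, and every lookup after that is O(1).

```python
# Stand-in for loading the special-format txt file; real code might be
# something like: dnc = set(open("dnc.txt").read().split())
dnc = {"5551234567", "5559876543"}

# A day's call list to scrub (invented numbers).
call_list = ["5551111111", "5559876543", "5552222222"]

# Set membership is a hash lookup, so scrubbing even a 20k-record list
# is effectively instant.
ok_to_call = [n for n in call_list if n not in dnc]
```

The database version below presumably paid a per-record round trip for the same membership test, hence records per second instead of per microsecond.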
Higher-up IT people were horrified, and commissioned a proper oracle soliton. It ran at around 1-2 records per second, so all day for a typical list. They were quite proud of this.
I think they spent 25-50 grand on IBM support to get it setup.
They were not happy when we said we would only consider using it if they could make it 5000 times faster.
I know this was a typo, but the idea of Oracle upselling someone on a quantum field theory based solution to a string matching problem amuses me greatly.
I went to a Hadoop workshop in 2016 where the speaker was insistent Hadoop would replace traditional relational databases in the next 5 years. It’s been six and I think the death of relational databases has still been greatly exaggerated.
Having done OLAP and used other DB models like ISAM and xBase, I am a firm believer that relational databases will keep being the horse worth betting on.
Whatever else keeps popping up against them is only useful in special use cases, not needed for 90% of the common ones - and even then, it isn't like relational database vendors are frozen in time, never improving their products.
Things look to be quite cyclic; ten years ago, NoSQL was pushed as The Future, but these days companies still build their core data on top of a traditional relational database. I don't even know which NoSQL is even popular. It feels like it's been pushed to the fringes, e.g. key/value stores, or the Q in a CQRS setup where it acts more as a readonly cache.
"newbie", maybe get a look outside, MongoDB is used by very large and successful companies.
Fortnite which makes multiple $B a year is heavily using MongoDB.
Processing 1.75 GB of data this way is still taking a plane to the supermarket down the street. Hell, perhaps just to the fridge. It is nowhere near big data when a ten-year-old low-end phone can hold it in memory.
Depends if it's just another store within walking distance. I think the gist of the GP comment was that they use a big tool for a small job. Many small jobs don't make the big tool any more useful.
I did a vaguely similar thing once. We once had an EMR/Hadoop-based process that simply joined some thousands of few-KB files from S3 into one big file totaling 20GB or so. It worked, but took a while. After a bit of mucking with it, I noticed that the built-in tools for it used the wrong S3 API for the purpose. I wrote a simple dumb single-threaded Ruby script using the right API, and it ran like 4x faster IIRC.
Same lesson - before you reach for Big Data tools, make sure you've fully explored simple conventional solutions.
I've had to ingest and explore enough dirty CSV/TSV data that I use CLI tools only to get a first glimpse of what's there.
Whenever I face anything non-trivial I convert CSV to parquet, and preferably write intermediate parquet data sets. Then it's DuckDB / SQL queries for slicing and dicing.
I have yet to encounter an awk/uniq/sed combo for intersecting a bunch of CSVs with 30+ columns that is readable to a newcomer.
On the other hand, parquet becomes a turtle if one tries to squeeze, say, 12k numerical columns into it.
> On the other hand, parquet becomes a turtle if one tries to squeeze, say, 12k numerical columns into it.
I thought parquet was columnar stored? Is this a fault of parquet or just the sheer number of columns trying to get accessed?
I agree with your general premise though. I'd rather take a dirty dataset, throw it into S3, spin up a Redshift cluster, do what I need, spin down the cluster. You can work with billions of records fairly easily with plain old SQL and c-store databases.
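The "plain old SQL" workflow looks roughly like this. A stand-in sketch: stdlib sqlite3 here plays the role of S3 + Redshift, and three fake rows stand in for billions of records (sqlite3 is a row store, not a c-store - the point is only the load-then-query shape, not the engine):

```python
import sqlite3

# "Load" the dirty dataset (in real life: COPY from S3 into the cluster).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 2.5), (2, 7.0)],
)

# Then every question is just SQL.
top = con.execute(
    "SELECT user_id, SUM(amount) AS total FROM events"
    " GROUP BY user_id ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # → (1, 12.5)
```

On a c-store engine the same GROUP BY only reads the two columns it touches, which is what makes billions of rows tractable.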
This article is such a good demonstration (IMO) that computers are really fast, but that we have gotten really bad at actually making reasonable use of them.
I've seen this article before and still consider it valid, especially for one-offs. Another similar and valuable resource is "https://datascienceatthecommandline.com/" Data Science at the Command Line, by Jeroen Janssens. Free to read at the link.
I hate the failure semantics of cron. Unreasonably hard to get right or to not run duplicate jobs. I’ve replaced it with systemd timers and never looked back.
NFS is such a pain in the ass for this use case. If the client crashes, everything’s corrupted. If the server crashes, everything’s corrupted. You can’t even atomically write 4k with O_APPEND. Why anyone would choose to build distributed systems your way is beyond me
I remember when this first was posted and laughed. My investment in learning unix tools has helped me a ton just as a single dev working on small datasets but I can totally see how it could work here.
This is what we've found. For a certain kind of simulation, we can process data on a single core at ~0.5x line rate (I think this could be a lot faster if we took this problem seriously), so if we move to some arrangement where hosts have to download the data in order to process it, they can only use 2 cores each, and every 2 cores we want to use requires another storage box to download the data from! Paying for hardware to increase "line rate" and sending zstd-compressed files over the wire might get us a factor of ~60 here, but even then we pretty rapidly run into issues where the data has to be placed on a bunch of distinct hosts in order to be useful, since 120 cores is not a lot of machines.
In the future we expect to have workloads that do less computation per piece of data, which makes the need to move our compute to our data much more acute.
How many sends are we talking about? Into how many messages is the data turned, how often does it get sent around?
If I send 1MiB of data by packing it up into messages of 10 bytes each, it will likely be slower than sending 10MiB in a single message. Messages == overhead. Envelopes, packing, unpacking, parsing, assembling, etc. all eat up cycles.
This is a distributed sum. You don't need extra messages being sent around. You send 7 messages giving each node 1/7th of all games. Then 1 message per node is sent with all the tallies to be merged together. And then maybe an extra message to give you the results if you weren't the one that merged it. 15 messages shouldn't take 26 minutes to send.
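The message pattern above can be simulated in a few lines (toy sketch: the article's data is chess game results, so that's what the made-up records are, and the "messages" here are just function calls):

```python
from collections import Counter

# Made-up chess results; imagine millions of these.
games = ["1-0", "0-1", "1/2-1/2", "1-0", "1-0", "0-1", "1/2-1/2", "1-0"]

# 7 messages out: each node gets 1/7th of all games.
nodes = 7
shards = [games[i::nodes] for i in range(nodes)]

# Each node tallies its shard locally - no cross-talk needed.
local_tallies = [Counter(shard) for shard in shards]

# 7 messages back: merge the per-node tallies into the final counts.
total = sum(local_tallies, Counter())
```

Fifteen or so small messages total, which is why the 26 minutes can't be explained by communication overhead alone.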
In this case the dataset was small enough to fit in my laptop's DRAM without straining it. If we assume the 7 machines are in the data centre, that means the two numbers to compare are the main memory reference (100ns) versus the data centre round trip time (500us). That's a factor of 5,000.
If your intuition told you those 7 machines are going to be faster, then you really should invest the time to internalise those numbers. The article is 100% correct - every programmer should know them.
I think you are missing my point. It's a difference between 12 seconds on a laptop and 26 minutes on 7 c1.medium instances. Yes there is overhead in shipping the data around, but there isn't 26 minutes worth of overhead.
If a c1.medium can process at the same speed as his laptop it should take less than 3 seconds worst case. And yes 7 machines should be faster at this scale.
> If a c1.medium can process at the same speed as his laptop
As it happens, I got to perform that experiment. Sort of. I was moving stuff on physical Dell hardware to a virtualised environment, at the behest of MBA's. I was a bit concerned about it as we pushed the existing hardware hard - it had overnight stuff it had to get finished by morning. It had a 30% buffer.
It wasn't even close. The "virtual environment" was 15 times slower. It wasn't the CPU. That was slower, but not by much, and it could have been remedied by firing up more VMs on different physical machines. It was disc storage. They insisted their big beefy SAN they spent 100's of K on would be faster than a locally connected SSD. But a SAN operates over a network, while SSDs sit 150mm away on a multi-gigabit link. The I/O was mostly random.
If they had consulted "Latency Numbers Every Programmer Should Know" they could have predicted the outcome - just as I had.
BTW, I also benchmarked the laptop. Despite what you apparently think, it isn't much slower at doing a single task than the big Dell server, and I wouldn't expect it to be. The one difference is the Dell server comes with a factor of 10 more RAM, SSD, HDD and networking, and it seems to be able to drive all those devices at full speed in parallel. But that would not have helped in this instance as the task is too small. I know from experience that for this sort of task my laptop will easily outperform a c1.medium.
This is one of those "it depends" things. As usual, it's good to build an intuition for what you expect. But it is also worthwhile benchmarking it now and again.
At a previous job, we were pushing against going to "networked temp disk" instead of "local temp disk", on the assumption that local storage would be faster than remote storage. But, actual benchmarking showed that the "networked temp disk" was about two to three times faster. Mostly because almost all IOPS on each machine went to servicing network disk requests, so trying to squeeze in on one machine's IOPS caused IO stall times that trying to squeeze into N machines' IO queue didn't see.
It's also an "are you mostly reading consecutive blocks" versus "are you doing essentially random, scattered reads" question (for read workloads; write workloads are a bit different, as it approaches impossible to speculatively write data that has not yet passed through a write(2) call).
It often isn’t about the execution time, if it’s a daily job and takes 26 minutes who cares?
Using something like databricks means it is easy to schedule and manage jobs, easy to write jobs that work in good enough time, easy to troubleshoot when things go wrong.
It comes with a well documented security mode and a support contract when needed.
Developers can be onboarded quickly and work code reviewed and managed.
Logging and diagnostics are available and you can report on metrics easily.
That isn’t true with custom data pipelines written in shell scripts.
The value isn’t in the pure execution time, it is in everything around it.
Processes that take ages are never "easy to write" or "easy to troubleshoot". They turn into shitty code, because you won't get any sane person to spend weeks instead of days on that. It's bad for morale.
It's the same narrative people use to justify many complex or slow things.
> The value isn’t in the pure execution time, it is in everything around it.
Especially for those of us who get paid for developing, designing, supporting, architecturing, meetinging, catering, conferencing and maintaining everything around it :-)
> That isn’t true with custom data pipelines written in shell scripts.
Why not? Cron can schedule jobs, the documentation for standard shell tools is among the best, almost every developer can handle bash scripts, and logging can be done via syslog.
This is like the classic Dropbox hackernews comment:
> For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
Sure, you can stitch several services together and it will work for your needs, but for most users there is a benefit to a centrally managed and complete solution.
Most users who have to tackle actual big-data problems, meaning analysing things on the order of several TB or more?
Sure, they will absolutely benefit.
But there isn't just big data. There is also little data, where what is analysed is on the order of a few GiB or less, and everything in between.
I am not saying "use shell for everything!" I am saying "the right tool for the right job". A 15t excavator is probably not a good choice if I want to plant a small tree in my backyard, and a gardening shovel will probably not serve me well when I wanna start building a scyscraper.
That's pretty much how Dask works - except Dask doesn't throw away data types by passing newline-delimited strings between each partitioned IPC pipeline process, nor depend on the new hire appropriately trapping errors in their shell script, leaving log messages that are chronologically non-sequential and text-only.
> This package provides a JupyterLab extension to manage Dask clusters, as well as embed Dask's dashboard plots directly into JupyterLab panes.
Something like ml-hub allows MLops teams to create resource-quota'd containers with k8s and IAM, though even signed code can DoS an unauditable system with no logs of which processes ran which signed archive of which code at what time, with bash and ssh.
https://github.com/ml-tooling/ml-hub
Outside of legacy systems, Hadoop isn't widely used anymore.