Ask HN: Proliferation of Big Data Tools
1 point by nraf on Jan 6, 2021 | 2 comments
There's a proliferation of big data tools out there (Flink, Kafka, Spark, Storm, Hadoop, Redshift, Athena, BigQuery, etc.) and it's overwhelming trying to understand what niche each of them is trying to fill in the space.

Was wondering if there are any resources that help break down the domain and which tools are appropriate for solving specific problems.

There's a ton of articles comparing a few of the technologies against each other, but I'm looking for something a bit more structured and in-depth.




>There's a ton of articles comparing a few of the technologies against each other, but I'm looking for something a bit more structured and in-depth.

The problem is that most of the articles out there are written by data virgins: people who never got any. What exacerbates this problem is that they are distributed through huge publications, especially on Medium, which means a huge number of people get the wrong idea.

This is due to incentives: these people's incentive is not to actually get things done, build products, or execute projects. Their output is an article. You don't really incur technical debt by writing an article about how awesome a solution is, but if I weren't disciplined and just mindlessly followed what some "guru" or, worse, some guy who never shipped anything, wrote on their blog, I would incur massive technical debt.

What I do is tedious: keep an eye on different tools out there, read the documentation, play with them, and follow the code and issue trackers to see the direction the projects are taking. I'm not eager to adopt any of them. It's just for general culture.

What makes it less overwhelming is that I'm not looking for trouble. I'm not picking up solutions and trying to find problems to solve with them. I have problems and search for solutions to them. This is hard when one does not have specific problems to solve; in that case every product out there is a potential tool in the box and it gets overwhelming. One trap is solving imaginary problems one does not have, which comes from not actually working in the field, because working on real problems focuses your attention and makes you want Judo to use in real life, not Aikido to use on imaginary or compliant opponents.

The question then becomes: what problems do you have specifically that make you interested in these tools? Once you specify your problem, you can look for the right tool. If I were you, I wouldn't waste time on the vast majority of Medium posts out there because, as I said, most of those who write are not doing, and most of those who do are not writing, unfortunately. I'm biased. This is very opinionated and corresponds to what I have seen. I would very much like to be wrong and save myself some time.


From talking to investors in the space, it seems like most people in the data field understand the tool differences pretty clearly, but they're really hard to understand as an outsider. What angle are you coming at this from?

While they're not in-depth, I thought these articles did a good job of clarifying what each tool is not.

https://towardsdatascience.com/25-hot-new-data-tools-and-wha...

https://towardsdatascience.com/20-more-hot-data-tools-and-wh...

Most tools boil down to the following categories, with some overlapping into 2 or 3 of them (see the sketch after the list for how these stages fit together).

- Sourcing & Extraction (Fivetran, Stitch, Xplenty)

- Testing & Alerting (Monte Carlo, Toro Data, Anomalo, Great Expectations)

- Cleaning & Transformation (dbt, Talend, Alteryx)

- Storage (BigQuery, Redshift, Snowflake)

- Delivery & Syncing (Census, Hightouch)

- Analyzing & Querying (PopSQL, Jupyter, Count, Hex Tech)

- Visualizing & Reporting (Looker, Tableau, Domo)

- Model Development (SageMaker, DataRobot, Algorithmia)

- Orchestration & Automation (Shipyard, Airflow, Prefect)
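
If it helps to see how those categories compose, here's a minimal, purely illustrative Python sketch where each stand-in function plays the role of one category. The function names and data are made up; in practice each step would be a dedicated tool from the list above.

  # Toy pipeline: each function stands in for one category above.
  # Names and data are hypothetical; real stacks would swap in
  # Fivetran/dbt/BigQuery/Looker/Airflow etc. for these stand-ins.

  def extract() -> list[dict]:
      # Sourcing & Extraction: pull raw records from an upstream system.
      return [{"user_id": 1, "amount": "42.50"}, {"user_id": 2, "amount": None}]

  def test(rows: list[dict]) -> list[dict]:
      # Testing & Alerting: alert on bad data before it spreads downstream.
      bad = [r for r in rows if r["amount"] is None]
      if bad:
          print(f"alert: {len(bad)} rows missing 'amount', dropping them")
      return [r for r in rows if r["amount"] is not None]

  def transform(rows: list[dict]) -> list[dict]:
      # Cleaning & Transformation: cast types, rename, reshape.
      return [{"user_id": r["user_id"], "amount": float(r["amount"])} for r in rows]

  def load(rows: list[dict], warehouse: list[dict]) -> None:
      # Storage: append to the warehouse (a list standing in for BigQuery et al.).
      warehouse.extend(rows)

  def report(warehouse: list[dict]) -> None:
      # Analyzing & Visualizing: a query plus a "dashboard" print.
      total = sum(r["amount"] for r in warehouse)
      print(f"total spend across {len(warehouse)} rows: {total:.2f}")

  def run_pipeline() -> None:
      # Orchestration & Automation: order the steps and run them on a schedule.
      warehouse: list[dict] = []
      load(transform(test(extract())), warehouse)
      report(warehouse)

  if __name__ == "__main__":
      run_pipeline()

Delivery & Syncing tools would then push the warehouse results back out to operational systems, and Model Development tools would consume the same cleaned tables for training.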



