Hacker News new | past | comments | ask | show | jobs | submit login

Dask is already this! They have a dataframe replacement, a numpy array replacement, and some lower level primitives like dask.delayed too. Plus, the nice thing is that it's already being used with large amounts of data, and the warts (which were plentiful two years ago) are rapidly reducing.

http://matthewrocklin.com/blog/work/2018/06/26/dask-scaling-...




Nice, I've heard of it before but never used. How is deployment compared to Spark?


The deployment is /way/ more simple, especially if you already have a python environment (even more so with conda), which is one of the main attractions for me. The other is that the API is way richer - spark dataframes tie your hands down in so many ways and require you to write tons of custom code for routine stuff, while pandas (and dask) has builtins for almost everything imaginable these days.

The tradeoff is that spark and hadoop in general have invested /serious/ efforts into resilience, while dask only really provides protection against worker failure, but not really scheduler failure. In practice is this really an issue? shrug it depends. "Works for me". How many tasks do you run in parallel? If you're doing ad-hoc analysis on a very large dataset, dask might be a great fit. If you have a data warehouse use case and have tons of people running analytics queries, then you have uptime requirements, and spark might be a better fit. That's where I ended up.


Dask requires users to keep in mind the computation graph and persistence. From the sound of it, this is a lighter-weight version for non-compute-focused data scientists (who probably are R users anyway).


> Dask requires users to keep in mind the computation graph and persistence

I saw this statement in the official documentation too ... in practice this isn't a big burden - if you're going to reuse something, persist it. That's about it. "Keep in mind the computation graph" doesn't mean anything - the docs and guides really gloss over why I don't have to do this. Not thinking about these is effectively a decision made one way or another.

The papers offer a bit more insight: Looks like the novelty is in the scheduler that was built for low task latencies and high throughput: wonderful for all kinds of compute-intensive workloads (they talk about Reinforcement Learning, for example) and definitely fulfills a need. They especially talk about how some tasks are dynamic in nature and spawn new tasks unpredictably and how you might not know the whole task graph in advance, or how some work can be done either locally or globally in a transparent manner, or how if you have many tiny tasks, the task overhead can kill you. But don't see what this buys for traditional table style data-heavy analytic workloads which have none of these problems.

Don't get me wrong - I'm excited about more contenders. The neat stuff about ray seems to be the scheduling model they have, the fact that they built in fault tolerance by storing all state globally in a sharded data store (redis in this case), but also they use Apache Arrow and the plasma store, which means zero-copy data transfer between workers and other data stores - awesome. Dask should probably also consider some of this. But pandas + ray just seems like an odd couple - it seems like ray's scheduler was built for the opposite use case.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: