Pandas on Ray – Early Lessons from Parallelizing Pandas (cs.berkeley.edu)
84 points by xmo on July 7, 2018 | 19 comments



How does this compare to dask.distributed? Dask dataframes are also a wrapper around the pandas API.

edit: They explain the differences in a section of this blog post: https://rise.cs.berkeley.edu/blog/pandas-on-ray/


Does anyone here know if Ray is some sort of YARN competitor? If not, what problem space is it in?


I don't know what YARN is, but Ray is a distributed heterogeneous computing framework. It's meant to make it easy to take fairly arbitrary computational programs (with machine learning as an important test/use case) and run and debug them in parallel across lots of machines, with high performance, in a natural fashion and without drastic changes. [1] The advantage over (say) Hadoop is that it allows heterogeneous programming models and isn't limited to or designed around (say) MapReduce; you can invoke functions in parallel in a dynamic fashion pretty easily.

[1] You can get an idea of that here: https://ray.readthedocs.io/en/latest/
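To give a flavor of the programming model, here's roughly what dynamic parallel function invocation looks like (a minimal sketch based on the docs, untested here):

    import ray

    ray.init()  # start Ray locally; cluster setup differs

    @ray.remote
    def square(x):
        # runs as a task, potentially on another worker/machine
        return x * x

    # each call returns a future immediately
    futures = [square.remote(i) for i in range(4)]

    # block until all results are ready
    print(ray.get(futures))  # [0, 1, 4, 9]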


It feels to me like more of a Spark competitor, aimed at people who want more flexibility than Spark and don't need as much cluster setup (a Redis cluster is the hardest part). It might even be able to use YARN to do some of the resource management.


Ray is, like YARN, a resource manager, but lightweight and fast. Much as Erlang does for its lightweight process model, Ray can start tens of thousands of containers with much lower latency than YARN, Mesos, or Kubernetes. It has found a niche in distributed reinforcement learning (deep RL) and is establishing a beachhead there. Now it faces the chasm...



For those of you confused like me, "Pandas on Ray has moved into the Modin project"

http://ray.readthedocs.io/en/latest/pandas_on_ray.html


This could be really helpful for implementations that were written with relatively small datasets in mind but now need to be scaled up. However, for someone starting from scratch, it is not clear what advantages they plan to offer over Spark with the DataFrame API.


Unclear what this is good for.


It is a work-in-progress distributed pandas implementation. It lets you spread massive dataframes, both their data and their computations, over multiple machines, whereas pandas is limited to your local machine. It's not quite there yet. [1]

[1] http://modin.readthedocs.io/en/latest/pandas_on_ray.html#usi...
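The intended usage is a drop-in import swap, roughly like this (sketch based on the Modin docs; the module path has changed as the project has moved around):

    # before: import pandas as pd
    import modin.pandas as pd

    # same pandas API, but reads and computations are
    # partitioned across cores/machines by Ray underneath
    df = pd.read_csv("big.csv")
    print(df.groupby("key").mean())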


Pandas replacing PySpark? Sign me up.


Dask is already this! They have a dataframe replacement, a numpy array replacement, and some lower-level primitives like dask.delayed too. Plus, the nice thing is that it's already being used with large amounts of data, and the warts (which were plentiful two years ago) are rapidly disappearing.

http://matthewrocklin.com/blog/work/2018/06/26/dask-scaling-...
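For comparison, the dask.dataframe version of a pandas workflow is lazy until you call .compute() (rough sketch; the "key"/"value" columns are made up):

    import dask.dataframe as dd

    # one logical dataframe backed by many pandas partitions
    df = dd.read_csv("data-*.csv")

    # builds a task graph; nothing runs yet
    result = df.groupby("key").value.mean()

    # execute the graph, using threads/processes locally
    # or a cluster via dask.distributed
    print(result.compute())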


Nice, I've heard of it before but never used it. How does deployment compare to Spark?


The deployment is /way/ simpler, especially if you already have a Python environment (even more so with conda), which is one of the main attractions for me. The other is that the API is way richer: Spark dataframes tie your hands in so many ways and require you to write tons of custom code for routine stuff, while pandas (and dask) have built-ins for almost everything imaginable these days.
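For concreteness, standing up a dask.distributed cluster by hand is roughly a couple of processes plus a client connection (sketch; the hostname is made up):

    $ pip install dask distributed

    # on the scheduler node (listens on tcp://<host>:8786 by default)
    $ dask-scheduler

    # on each worker node
    $ dask-worker tcp://scheduler-host:8786

    # then, from your Python session
    >>> from dask.distributed import Client
    >>> client = Client("tcp://scheduler-host:8786")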

The tradeoff is that Spark, and Hadoop in general, have invested /serious/ effort into resilience, while dask only really protects against worker failure, not scheduler failure. In practice, is this really an issue? *shrug* It depends. "Works for me." How many tasks do you run in parallel? If you're doing ad-hoc analysis on a very large dataset, dask might be a great fit. If you have a data warehouse use case and tons of people running analytics queries, then you have uptime requirements, and Spark might be a better fit. That's where I ended up.


Dask requires users to keep the computation graph and persistence in mind. From the sound of it, this is a lighter-weight option for non-compute-focused data scientists (who are probably R users anyway).


> Dask requires users to keep in mind the computation graph and persistence

I saw this statement in the official documentation too ... in practice it isn't a big burden: if you're going to reuse something, persist it. That's about it. "Keep in mind the computation graph" doesn't mean much by itself, and the docs and guides really gloss over why I wouldn't have to do this here. Not thinking about these things is effectively a decision made one way or another.
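Concretely, the "persist what you reuse" rule is one line (sketch; column names are made up):

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv")

    # materialize the partitions in (cluster) memory once,
    # so later computations don't re-read and re-parse the CSVs
    df = df.persist()

    # both of these now start from the persisted partitions
    print(df.x.mean().compute())
    print(df.y.std().compute())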

The papers offer a bit more insight: the novelty looks to be a scheduler built for low task latency and high throughput, which is wonderful for all kinds of compute-intensive workloads (they talk about reinforcement learning, for example) and definitely fills a need. They especially discuss how some tasks are dynamic in nature and spawn new tasks unpredictably, so you might not know the whole task graph in advance; how some work can be done either locally or globally in a transparent manner; and how, if you have many tiny tasks, the per-task overhead can kill you. But I don't see what this buys for traditional table-style, data-heavy analytic workloads, which have none of these problems.
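The "tasks spawning tasks" point is the distinctive part: a Ray task can launch further tasks whose existence isn't known up front. A toy sketch:

    import ray

    ray.init()

    @ray.remote
    def fib(n):
        if n < 2:
            return n
        # each task dynamically spawns two more tasks, so the
        # full task graph is never known in advance
        a, b = fib.remote(n - 1), fib.remote(n - 2)
        return ray.get(a) + ray.get(b)

    print(ray.get(fib.remote(10)))  # 55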

Don't get me wrong, I'm excited about more contenders. The neat stuff about Ray seems to be its scheduling model, the fact that it gets fault tolerance by storing all state globally in a sharded data store (Redis in this case), and also its use of Apache Arrow and the Plasma store, which means zero-copy data transfer between workers and other data stores. Awesome. Dask should probably consider some of this too. But pandas + Ray just seems like an odd couple: Ray's scheduler looks built for the opposite use case.
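The zero-copy part means large numpy-backed objects live once in a shared-memory object store instead of being pickled per worker, roughly like this (sketch):

    import numpy as np
    import ray

    ray.init()

    # stored once in the Plasma object store (shared memory)
    big = np.zeros((10_000, 10_000))
    ref = ray.put(big)

    @ray.remote
    def total(arr):
        # workers on the same node read the array zero-copy
        return arr.sum()

    print(ray.get(total.remote(ref)))  # 0.0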


It looks like that would be the long-term goal. I haven't used Spark enough to understand its downsides.


Thanks for the clarification. I can now see how that might be useful.


This looks interesting. Thanks for sharing; I'll have my DS team try it out.




