One of the things that most pisses me off about numpy, pandas, and the whole scipy ecosystem is that everything is "immediate mode": all operations evaluate _instantly_ to new arrays or dataframes. There's no opportunity for any component to evaluate a whole expression tree and optimize it, e.g., by loop hoisting.
The right way to design a data analysis DSL is to do it the way Python's dask does it: build an operation graph and execute it as the last step. The trouble is that Dask doesn't get it right either because, as part of its graph formation, it computes sizes of operands, and computing operand sizes can involve huge amounts of computation, and so dask, in effect, is also an "immediate mode" system.
What I really want to see is something lazy that can do sane query planning and that can work within limited system resources. Maybe one day I'll open source the work I've done in this space. Query languages are infinitely nicer for analytics than data processing libraries.
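A minimal sketch of that deferred-graph idea in Python (all names here are made up for illustration): operations only record a tree, and nothing is computed until an explicit evaluate step, which is where a real system would get its chance to optimize.

```python
import numpy as np

class Expr:
    """Deferred elementwise expression: builds a tree, evaluates on demand."""
    def __init__(self, fn, *children):
        self.fn = fn
        self.children = children

    def evaluate(self):
        # A real system would optimize the tree here (fusion, CSE, planning)
        # before executing; this toy version just recurses.
        return self.fn(*(c.evaluate() if isinstance(c, Expr) else c
                         for c in self.children))

def lazy(fn):
    return lambda *args: Expr(fn, *args)

lsin, lsqrt, lcos = lazy(np.sin), lazy(np.sqrt), lazy(np.cos)

a = np.arange(4.0)
tree = lcos(lsqrt(lsin(a)))  # nothing computed yet
result = tree.evaluate()     # the single explicit execution step
```

The point is only the shape of the API: build first, run once at the end, with an optimization hook in between.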
Apache Spark has a great 'lazy' story, and I recommend it as a compute supplement for pandas and numpy. I'm always fascinated by the lengths the authors go through to optimise queries, from the cluster to one's machine.
Also, the integration with Apache Arrow over the years has borne much fruit, to the extent that a pyspark user can cross over to Pandas with little serde cost.
Fun story, about 18 months ago I was consulting for a telco. One of my tasks was to run a segmentation model on their subscriber base. They wanted me to use SAS in their cluster, but because I had 'brought my infra/laptop', I told them that I could do it with Python instead.
I got the gist of it working in Pandas, but it was 'slow'. They asked me if I was going to run it on a sample instead, but I said I'd port it to Spark in the evening.
The following morning I presented them with results, and the porting to Spark took about 2 hours (I hadn't used SparkML before), and the thing took an hour to run on the whole population.
It really is remarkable that Python's numerical computing libraries have such poor performance. When doing chains of elementwise operations on large arrays (such as the toy example `cos(sqrt(sin(pow(array, 2))))` applied elementwise), Julia appears to outperform Python by a factor of 2! Numpy cannot avoid computing each intermediate array, which means it has to allocate a ton of wasteful memory. Meanwhile Julia does the smart thing: it fuses all the operations into one and applies that single fused operation elementwise, allocating only a single new array.
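To make the intermediate-allocation point concrete, here is the naive numpy version of that toy chain (the array size is picked arbitrarily); every step returns a fresh array:

```python
import numpy as np

a = np.random.rand(1_000_000)

# Each call below allocates a brand-new million-element array:
t1 = np.power(a, 2)  # intermediate 1
t2 = np.sin(t1)      # intermediate 2
t3 = np.sqrt(t2)     # intermediate 3
out = np.cos(t3)     # result: four allocations for one logical pass
```

A fused implementation would touch each element once and allocate only `out`.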
Pandas likewise does not defer computations, which means that evaluating Boolean expressions that reference the same data multiple times must make multiple passes over that data. Absurd.
numpy appears to have optional arguments for the storage location of outputs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c... - Could you elaborate a little further? The syntax might not be as nice as another language or framework, but it's not "unavoidable".
Disclaimer: have never used numpy, have used python fairly extensively.
You are right about the `out` argument, I'd forgotten about that. But even avoiding the wasteful memory allocations, numpy is still about 60% slower than Julia, since it makes multiple passes over the input data. (If there's a way to get numpy to make just a single pass over the data and remain performant, I'd love to know.)
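For what it's worth, the `out=` variant looks like this (array size arbitrary); it cuts the allocations down to one scratch buffer, but still makes one pass over the data per operation:

```python
import numpy as np

a = np.random.rand(1_000_000)
buf = np.empty_like(a)  # the single reused output buffer

# Still four passes over the data, but only one extra allocation total.
np.power(a, 2, out=buf)
np.sin(buf, out=buf)
np.sqrt(buf, out=buf)
np.cos(buf, out=buf)
```

So `out=` addresses the allocation complaint but not the multiple-passes one.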
The first thing I'd reach for is to refactor into a list comprehension. Looks like this is the proper way to do it:
    for x in np.nditer(a, op_flags=['readwrite']):
        x[...] = cos(sqrt(sin(pow(x, 2))))
That's some gnarly syntax though. I've never seen that ellipsis operator before?
Edit: just read about `Ellipsis`. I'm a fan, even if it's sort of nonstandard across libraries. Those readwrite flags are a travesty though, but at least you can paper over it with a helper function.
Something like:
    def np_apply(a, f):
        for x in np.nditer(a, op_flags=['readwrite']):
            x[...] = f(x)

    np_apply(a, lambda x: cos(sqrt(sin(pow(x, 2)))))
or
    def np_apply(a, *argv):
        for x in np.nditer(a, op_flags=['readwrite']):
            y = x
            for f in argv:
                y = f(y)
            x[...] = y

    np_apply(a, lambda x: pow(x, 2), sin, sqrt, cos)
Edit3: There's a way to turn this into "pythonic" list comprehension code, but it would probably only make it look prettier rather than more performant.
Yes, you can use the out arguments, but doing so 1) complicates your code, and 2) isn't necessarily a win. You have to do the same number of memory write bus cycles whether you write to a specified output array or to some new array.
I agree with point (1), wholeheartedly - by default the syntax is ugly unless you wrap it in a utility. (2) isn't true as far as I can tell - you can keep reusing the same output buffer, or double buffer if the output can't be the same as the input.
Either way see https://news.ycombinator.com/item?id=22786958 where I figured out the syntax to do it in place, which shouldn't result in any allocations at all and should solve both your concerns.
Regarding #2: whether you use the same buffer or a new buffer, you still have to actually write to the memory. The bottleneck at numpy scale isn't the memory allocation (mmap or whatever) but the actual memory writes (saturating the memory bus), and you need to perform the same number of writes no matter which destination array you use.
1. Memory allocation isn't free
2. Doing multiple loops over the buffer (once per operation) is going to be slower than doing all the operations at once, both because of caching and the opportunity to do the operations on a register and write at the end (though who knows if the python interpreter will do that).
Julia is very, very good. I say this as a person who has invested about a decade in the Python data science ecosystem. Numpy will probably always be copy-heavy, whereas Julia's JIT can optimize as aggressively as the impressive type system allows.
I never understood why we limit ourselves to sending SQL to databases. Instead of sending SQL we should be sending actual logical plans to the database. These logical plans could then be optimized by the database and executed. This would solve two issues I have with SQL:
1. Without CTEs SQL isn't that expressive and with CTEs queries can be hard to understand. Logical query plans on the other hand are rather easy to understand and can express more queries than SQL can.
2. SQL is hard to modify programmatically. To most software, SQL is simply a string without any inherent meaning. A logical query plan, on the other hand, is a tree structure that can be introspected and modified.
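As a toy illustration of the difference (node and table names are hypothetical), a logical plan is just a tree the client can build, walk, and rewrite, which makes programmatic inspection trivial compared to string-valued SQL:

```python
from dataclasses import dataclass
from typing import Any

# Toy logical-plan nodes; a real system would have many more operators.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    child: Any

@dataclass
class Project:
    columns: list
    child: Any

plan = Project(["name"], Filter("age > 30", Scan("users")))

def tables(node):
    """Introspection example: collect every table the plan touches."""
    if isinstance(node, Scan):
        return [node.table]
    return tables(node.child)

def add_filter(node, predicate):
    """Modification example: push an extra predicate down onto the Scan."""
    if isinstance(node, Scan):
        return Filter(predicate, node)
    node.child = add_filter(node.child, predicate)
    return node
```

With SQL strings, both of these operations require parsing the query back into exactly this kind of tree first.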
That's basically the same thing you want, except that you want it implemented on dataframes while I want it implemented for database systems.
>What I really want to see is something lazy that can do sane query planning and that can work within limited system resources. Maybe one day I'll open source the work I've done in this space.
Eigen does this via expression templates; it essentially optimises the computation graph at compile time.
> Eigen does this via expression templates; it essentially optimises the computation graph at compile time.
That's wrong too. The right query execution plan can depend heavily on the sizes and statistical distributions of the operands. Eigen doesn't have enough information at compile time to make good decisions in all cases. This kind of computation really calls for a JIT.
> Don't put methods on it like pandas.DataFrame. You won't get the API right the first many times and you'll end up with a million methods.
I was actually thinking about this for a bit. Since the data layout seems unlikely to ever change backwards-incompatibly, can you not define the API entirely in traits on the struct, and package them into one super trait (e.g. DataFrameAPIVersion1)? Then you can simply import the DataFrame struct as well as the version of the API you'll use, and be 100% compatible with any other code regardless of which API version it uses to manipulate the dataframes.
Then application-specific functions would need a different syntax from the built-in methods, like df.pipe(f). I think that is not desirable. Better to have them all be detached from the frame.
"Pure functions" is usually a bad idea with large numeric data structures if they are non-lazy - because you'll end up copying the data.
One of the common pure alternatives is to make them lazy - i.e., all those pure methods don't do anything but rather just collect the information on what to do with your data. But then you need to write an execution engine for arbitrary computation graph, which ain't easy.
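A minimal sketch of that "pure but lazy" approach (names are made up; a list stands in for a real frame): each method just records a pending operation, and only `execute` does any work, which is where a real engine would run its planner.

```python
class LazyFrame:
    """Pure, lazy wrapper: each method returns a new LazyFrame that only
    records the pending operation; nothing runs until execute()."""
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops

    def map_values(self, fn):
        return LazyFrame(self.data, self.ops + (("map", fn),))

    def filter_values(self, pred):
        return LazyFrame(self.data, self.ops + (("filter", pred),))

    def execute(self):
        # A real engine would optimize the op list here (fuse consecutive
        # maps, reorder filters, ...) before running it.
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

lf = LazyFrame([1, 2, 3, 4]).map_values(lambda x: x * 10).filter_values(lambda x: x > 15)
result = lf.execute()  # only now does any work happen
```

The hard part, as the parent says, is everything hiding behind that `execute` comment.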
> "Pure functions" is usually a bad idea with large numeric data structures if they are non-lazy - because you'll end up copying the data.
This isn't necessarily true, thanks to Rust's strong ownership model. Methods can take `self` (without a reference) as an argument, which means the method takes ownership of the value. There is no copy in `double` in the example below, yet it is pure:
    #[derive(Clone)]
    struct A {
        v: Vec<i32>,
    }

    impl A {
        fn double(mut self) -> Self {
            for x in &mut self.v {
                *x *= 2;
            }
            self
        }
    }

    fn main() {
        let a = A { v: vec![2, 3, 4] };
        let b = a.double();
        dbg!(b.v[1]);
    }
If you now tried to access `a` the compiler would error out, saying you're trying to access a moved-from variable. If you still wanted to keep the original `a` around you simply write `let b = a.clone().double()`.
Yes, that's the approach that I'm taking with the library, but with a simple execution flow.
One of the things that used to bite me a lot with Pandas was having to manually free memory from intermediate dataframes that I would create. I'm expecting Rust to work better in this case, because when executing a 3-step pipeline, step 2 consumes the output from step 1, and the dataframe is freed from memory.
I'll write an experiment on this in the coming weeks/months, to see if my assumption would work.
i use pure functions wherever i can, but is this always the right approach when dealing with large dataframes? i imagine when you're chaining a few methods, it'd generate a large number of intermediate results that immediately get discarded/transformed again:
    frame.foo_columns().bar_rows().baz()
the result of .foo_columns() is used linearly – it gets passed to .bar_rows() immediately, with no other references to it. maybe this'd be a good place for rust's safe-mutability magic?
A good Rust equivalent to Pandas is really the main thing keeping me from switching to Rust for about half of my day job. This is incredibly promising, and I am following your work with great interest. Thank you for working on this!
Out of interest, what are the common tasks that you'd be looking to achieve? We can't replace Pandas and its ecosystem in the short-term, so for me a Rust-backed backend to a dataframe that is compatible with Pandas would be a win.
We use pandas for writing assertions on time series data collected from simulated and real firmware. E.g.:
- two signals X and Y should always be within 5% of each other while condition Z is true.
- signal A should transition from value M to N within time window t1-t2
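The first check could be sketched in pandas roughly like this (column names, the sample trace, and the 5% relative tolerance are all made up for illustration):

```python
import pandas as pd

# Toy trace: columns are hypothetical signal names.
df = pd.DataFrame({
    "x": [1.00, 1.02, 5.00, 1.01],
    "y": [1.01, 1.00, 1.00, 1.00],
    "z": [True, True, False, True],  # condition gating the check
})

# While z holds, x and y must stay within 5% of each other.
gated = df[df["z"]]
within = ((gated["x"] - gated["y"]).abs() / gated["y"].abs()) <= 0.05
assert within.all(), "signals diverged while condition was active"
```

Note the row where x = 5.00 is never checked, because z is false there.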
Would love something in rust to match the rest of the infrastructure. Python also has plenty of warts with regards to packaging, cross compilation and deployment
Not the OP, but also interested in similar applications of rust. The two main benefits I see are
1) Speed and memory efficiency, I've often worked in python/pandas and been frustrated by its performance. If I have a job that takes 12 hours to churn through all of the data, flipping over to rust may accelerate my iteration time to ~7.2-72 minutes.
2) Straightforward FFI bindings into other languages. Like C you can trivially bind rust code into python. Language choice can then be a bit of a two-way door back to python.
You pretty much nailed my reasons as well. Trying to explain that updating model results for this month's data will take 12 hours usually does not go over well. The learning curve to move stuff to Rust would be considered well worth it.
It depends on what type of object you're using in pandas as well as whether you ever need to use a User Defined function e.g. pd.Series.map. Crossing between python objects and C objects is incredibly slow, and python defined UDFs will usually run at python speed.
In most of my projects I'm unable to avoid using a udf/python data manipulation at some point.
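As a tiny illustration of the gap (exact timings vary by machine), both lines below compute the same result, but `map` calls a Python function once per element while the arithmetic version stays in compiled loops:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000))

slow = s.map(lambda x: x * 2 + 1)  # Python-level UDF: one interpreter call per element
fast = s * 2 + 1                   # vectorized: a couple of C loops over the data
```

On large series the vectorized form is typically one to two orders of magnitude faster.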
Getting a fast & safe Rust UDF layer that targets SPIR/CUDA/PTX would be quite interesting wrt enabling RAPIDS.ai (libcudf + python bindings) as well. It'd enable getting rid of slow & quirky numba etc - I remember Mozilla had GPGPU Rust codegen experiments here awhile back...
rustc does have a working nvptx target today, though it’s not supported nearly as well as the mainstream cpu targets, and some things you would really want for gpu programming (e.g. shared memory address space) are not currently exposed in the rust language. But kernels written in rust can compile to ptx; you’ll still need to write glue code.
Yeah, this would be about extending it to columnar analytics funcs, like `df['x'].apply(f)` and `df.query("x > 10 and y < 10")`. I realized I may be wrong about the compiler-speed part; not sure if it'd be faster than numba for codegen nowadays :)
No; it's a pity that a lot of newly built libraries and platforms get inspired by the pandas workflow in search of something better, instead of looking at the R ecosystem, where people actually enjoy the available tools and workflows.
Everyone does. On the other hand I, and many others I assume, aren't willing to move to R for the sake of the one superior tool over python. Two if you count ggplot2, I suppose.
Yeah but for giving someone an intro to what dataframes are, I think pandas might leave a sour taste where dplyr would allow someone to learn comfortably.
Although, one might run into a "true level" situation. This is when Morty feels what true level is like, then says "everything is crooked, reality is poison" when he has to live in the "fake level" world.
What bothers me is: why can't Rust developers just import a C++ library that does the job? What novelty would a Rust version of the same thing bring really? Why not focus on real innovation, and use wrappers for things that were already built in another language a decade ago?
> why can't Rust developers just import a C++ library that does the job?
We can and frequently do. Rust even seems to have one of the better packaging stories around building C++ libs, albeit with several heavyweight dependencies (C++ compiler + toolchain, libc, etc.). Granted, calling into it from Rust means we need a C ABI, but that's rarely a terrible problem.
> What novelty would a Rust version of the same thing bring really?
I find Rust easier to cross compile for starters. I still haven't tamed emscripten for compiling C++ as WASM - the Rust toolchains by comparison make compiling Rust to WASM trivial. Rust's larger stdlib, standard package manager, and healthy library ecosystem means whatever I'm cross compiling is less likely to need to resort to platform specific nonsense in need of manual porting.
But the main draw of Rust to me is the ability to write memory safe code and race-safe native code, reducing how many heisenbugs I encounter and the portion of the codebase where they might lurk. If your codebase is large, and mostly memory/race safe, this is a huge advantage. If your codebase is large, and only 50% memory/race safe, you're still frequently in debugging hell inside a large haystack.
If you take a super solid C++ dependency, maybe you kinda retain that advantage, but most C++ dependencies I take on don't meet that high bar (including those me and my coworkers write!) Alternatively, I could take on a relatively mediocre Rust dependency, and I'll probably still retain that memory/race-safety advantage. There are exceptions to this of course thanks to Rust's many opt-in escape hatches - but I also find it much easier to audit Rust code for their use and the possible problems they may cause, versus C++ where every array access is suspect.
Do you think all Rust developers are connected together like some sort of hivemind? What an open source developer chooses to spend their time on doesn't require the consideration or consent of the entire Rust community, nor does it speak for all of it.
That said there are plenty of reasons why they shouldn't just import c++ libraries, one of them being that c++ is not the same language as rust.
A certain mindset is bothered by observing others apparently wasting effort they don't need to. It has nothing to do with any benefit or loss for the observer. Very roughly, one can think of it as a sort of altruism focused on efficiency.
This mindset may be appeased by demonstrating why you don't believe your actions are actually a waste of effort.
it has to be done at some point. I still have vcpkg on my Windows machine, because I needed to install it and a few GB of other things, just to use PostgreSQL in Rust a few years ago (I think Diesel).
If someone hadn't implemented the libpq protocol in Rust, I likely wouldn't have been able to write binary COPY support for the dataframe in one afternoon.
This reduces the barrier for people to productively use their favourite languages. It might be unproductive at first, but we see the benefits as time goes on.
> This reduces the barrier for people to productively use their favourite languages. It might be unproductive at first, but we see the benefits as time goes on.
It is vastly unproductive to reproduce anything useful in 10 different languages just because our cross-language tooling is awful.
The lack of cross-language developer tooling leads to the ridiculous situation where every language has its own package manager and is completely unable to use a package made for another language.
There are reasons for that... but they are much more political than technical, and that's the sad state we are in.
I don't understand how it is completely unproductive. Surely there must be some advantage that Rust has over C++? If that advantage exists, then there is your justification for reinventing something that already exists. If all you do is reuse existing C++ libraries in Rust, then why even bother with Rust in the first place? Why not just use C++ directly? Most of the value of using Rust is derived from the things that are written in Rust.
Let me give you an example, this time with Groovy. Groovy runs on the JVM, and that means it benefits from the existing JVM ecosystem, right? Well, of course you get the benefit of existing libraries, but there was a crucial difference between using them through Groovy/Grails and using them natively through Java. There were lots of Grails plugins that were very thin layers above the Java libraries, yet somehow the usability increased a hundredfold. A small plugin with maybe 1000 lines of code ended up providing more value to our business than the part of the library that was written in Java.