One of the things that most pisses me off about numpy, pandas, and the whole scipy ecosystem is that everything is "immediate mode": all operations evaluate _instantly_ to new arrays or dataframes. There's no opportunity for any component to evaluate a whole expression tree and optimize it, e.g., by loop hoisting.
The right way to design a data analysis DSL is to do it the way Python's dask does it: build an operation graph and execute it as the last step. The trouble is that Dask doesn't get it right either because, as part of its graph formation, it computes sizes of operands, and computing operand sizes can involve huge amounts of computation, and so dask, in effect, is also an "immediate mode" system.
What I really want to see is something lazy that can do sane query planning and that can work within limited system resources. Maybe one day I'll open source the work I've done in this space. Query languages are infinitely nicer for analytics than data processing libraries.
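A minimal sketch of that deferred-graph idea in Python (all names here are made up for illustration): operations only record a tree, and nothing is computed until an explicit evaluate step, which is where a real system would get its chance to optimize.

```python
import numpy as np

class Expr:
    """Deferred elementwise expression: builds a tree, evaluates on demand."""
    def __init__(self, fn, *children):
        self.fn = fn
        self.children = children

    def evaluate(self):
        # A real system would optimize the tree here (fusion, CSE, planning)
        # before executing; this toy version just recurses.
        return self.fn(*(c.evaluate() if isinstance(c, Expr) else c
                         for c in self.children))

def lazy(fn):
    return lambda *args: Expr(fn, *args)

lsin, lsqrt, lcos = lazy(np.sin), lazy(np.sqrt), lazy(np.cos)

a = np.arange(4.0)
tree = lcos(lsqrt(lsin(a)))  # nothing computed yet
result = tree.evaluate()     # the single explicit execution step
```

The point is only the shape of the API: build first, run once at the end, with an optimization hook in between.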
Apache Spark has a great 'lazy' story, and I recommend it as a compute supplement for pandas and numpy. I'm always fascinated by the lengths the authors go through to optimise queries, from the cluster to one's machine.
Also, the integration with Apache Arrow over the years has borne much fruit, to the extent that a pyspark user can cross over to Pandas with little serde cost.
Fun story, about 18 months ago I was consulting for a telco. One of my tasks was to run a segmentation model on their subscriber base. They wanted me to use SAS in their cluster, but because I had 'brought my infra/laptop', I told them that I could do it with Python instead.
I got the gist of it working in Pandas, but it was 'slow'. They asked me if I was going to run it on a sample instead, but I said I'd port it to Spark in the evening.
The following morning I presented them with results, and the porting to Spark took about 2 hours (I hadn't used SparkML before), and the thing took an hour to run on the whole population.
It really is remarkable that Python's numerical computing libraries have such poor performance. When doing chains of elementwise operations on large arrays (such as the toy example `cos(sqrt(sin(pow(array, 2))))` applied elementwise), Julia appears to outperform Python by a factor of 2! Numpy cannot avoid computing each intermediate array, which means it has to allocate a ton of wasteful memory. Meanwhile Julia does the smart thing: it fuses all the operations into one and applies that single fused operation elementwise, allocating only a single new array.
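To make the intermediate-allocation point concrete, here is the naive numpy version of that toy chain (the array size is picked arbitrarily); every step returns a fresh array:

```python
import numpy as np

a = np.random.rand(1_000_000)

# Each call below allocates a brand-new million-element array:
t1 = np.power(a, 2)  # intermediate 1
t2 = np.sin(t1)      # intermediate 2
t3 = np.sqrt(t2)     # intermediate 3
out = np.cos(t3)     # result: four allocations for one logical pass
```

A fused implementation would touch each element once and allocate only `out`.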
Pandas likewise does not defer computations, which means that evaluating Boolean expressions that reference the same data multiple times must make multiple passes over that data. Absurd.
numpy appears to have optional arguments for the storage location of outputs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c... - Could you elaborate a little further? The syntax might not be as nice as another language or framework, but it's not "unavoidable".
Disclaimer: have never used numpy, have used python fairly extensively.
You are right about the `out` argument, I'd forgotten about that. But even avoiding the wasteful memory allocations, numpy is still about 60% slower than Julia, since it makes multiple passes over the input data. (If there's a way to get numpy to make just a single pass over the data and remain performant, I'd love to know.)
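For what it's worth, the `out=` variant looks like this (array size arbitrary); it cuts the allocations down to one scratch buffer, but still makes one pass over the data per operation:

```python
import numpy as np

a = np.random.rand(1_000_000)
buf = np.empty_like(a)  # the single reused output buffer

# Still four passes over the data, but only one extra allocation total.
np.power(a, 2, out=buf)
np.sin(buf, out=buf)
np.sqrt(buf, out=buf)
np.cos(buf, out=buf)
```

So `out=` addresses the allocation complaint but not the multiple-passes one.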
The first thing I'd reach for is to refactor into a list comprehension. Looks like this is the proper way to do it:
    for x in np.nditer(a, op_flags=['readwrite']):
        x[...] = cos(sqrt(sin(pow(x, 2))))
That's some gnarly syntax though. I've never seen that ellipsis operator before?
Edit: just read about `Ellipsis`. I'm a fan, even if it's sort of nonstandard across libraries. Those readwrite flags are a travesty though, but at least you can paper over it with a helper function.
Something like:
    def np_apply(a, f):
        for x in np.nditer(a, op_flags=['readwrite']):
            x[...] = f(x)

    np_apply(a, lambda x: cos(sqrt(sin(pow(x, 2)))))
or
    def np_apply(a, *argv):
        for x in np.nditer(a, op_flags=['readwrite']):
            y = x
            for f in argv:
                y = f(y)
            x[...] = y

    np_apply(a, lambda x: pow(x, 2), sin, sqrt, cos)
Edit3: There's a way to turn this into "pythonic" list comprehension code, but it would probably only make it look prettier rather than more performant.
Yes, you can use the out arguments, but doing so 1) complicates your code, and 2) isn't necessarily a win. You have to do the same number of memory write bus cycles whether you write to a specified output array or to some new array.
I agree with point (1), wholeheartedly - by default the syntax is ugly unless you wrap it in a utility. (2) isn't true as far as I can tell - you can keep reusing the same output buffer, or double buffer if the output can't be the same as the input.
Either way see https://news.ycombinator.com/item?id=22786958 where I figured out the syntax to do it in place, which shouldn't result in any allocations at all and should solve both your concerns.
Regarding #2: whether you use the same buffer or a new buffer, you still have to actually write to the memory. The bottleneck at numpy scale isn't the memory allocation (mmap or whatever) but the actual memory writes (saturating the memory bus), and you need to perform the same number of writes no matter which destination array you use.
1. Memory allocation isn't free
2. Doing multiple loops over the buffer (once per operation) is going to be slower than doing all the operations at once, both because of caching and the opportunity to do the operations on a register and write at the end (though who knows if the python interpreter will do that).
Julia is very, very good. I say this as a person who has invested about a decade in the Python data science ecosystem. Numpy will probably always be copy-heavy, whereas Julia's JIT can optimize as aggressively as the impressive type system allows.
I never understood why we limit ourselves to sending SQL to databases. Instead of sending SQL we should be sending actual logical plans to the database. These logical plans could then be optimized by the database and executed. This would solve two issues I have with SQL:
1. Without CTEs SQL isn't that expressive and with CTEs queries can be hard to understand. Logical query plans on the other hand are rather easy to understand and can express more queries than SQL can.
2. SQL is hard to modify programmatically. To most software, SQL is simply a string without any inherent meaning. A logical query plan, on the other hand, is a tree structure that can be introspected and modified.
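As a toy illustration of the difference (node and table names are hypothetical), a logical plan is just a tree the client can build, walk, and rewrite, which makes programmatic inspection trivial compared to string-valued SQL:

```python
from dataclasses import dataclass
from typing import Any

# Toy logical-plan nodes; a real system would have many more operators.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    child: Any

@dataclass
class Project:
    columns: list
    child: Any

plan = Project(["name"], Filter("age > 30", Scan("users")))

def tables(node):
    """Introspection example: collect every table the plan touches."""
    if isinstance(node, Scan):
        return [node.table]
    return tables(node.child)

def add_filter(node, predicate):
    """Modification example: push an extra predicate down onto the Scan."""
    if isinstance(node, Scan):
        return Filter(predicate, node)
    node.child = add_filter(node.child, predicate)
    return node
```

With SQL strings, both of these operations require parsing the query back into exactly this kind of tree first.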
That's basically the same thing you want, except that you want it implemented on dataframes while I want it implemented for database systems.
>What I really want to see is something lazy that can do sane query planning and that can work within limited system resources. Maybe one day I'll open source the work I've done in this space.
Eigen does this via expression templates; it essentially optimises the computation graph at compile time.
> Eigen does this via expression templates; it essentially optimises the computation graph at compile time.
That's wrong too. The right query execution plan can depend heavily on the sizes and statistical distributions of the operands. Eigen doesn't have enough information at compile time to make good decisions in all cases. This kind of computation really calls for a JIT.
> Don't put methods on it like pandas.DataFrame. You won't get the API right the first many times and you'll end up with a million methods.
I was actually thinking about this for a bit. Since the data layout seems unlikely to ever change backwards-incompatibly, can you not define the API entirely in traits on the struct, and package them into one super trait (e.g. DataFrameAPIVersion1)? Then you can simply import the DataFrame struct as well as the version of the API you'll use, and be 100% compatible with any other code regardless of which API version it uses to manipulate the dataframes.
Then application-specific functions would need a different syntax from the built-in methods, like df.pipe(f). I think that is not desirable. Better to have them all be detached from the frame.
"Pure functions" is usually a bad idea with large numeric data structures if they are non-lazy - because you'll end up copying the data.
One of the common pure alternatives is to make them lazy - i.e., all those pure methods don't do anything but rather just collect the information on what to do with your data. But then you need to write an execution engine for arbitrary computation graph, which ain't easy.
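A minimal sketch of that "pure but lazy" approach (names are made up; a list stands in for a real frame): each method just records a pending operation, and only `execute` does any work, which is where a real engine would run its planner.

```python
class LazyFrame:
    """Pure, lazy wrapper: each method returns a new LazyFrame that only
    records the pending operation; nothing runs until execute()."""
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops

    def map_values(self, fn):
        return LazyFrame(self.data, self.ops + (("map", fn),))

    def filter_values(self, pred):
        return LazyFrame(self.data, self.ops + (("filter", pred),))

    def execute(self):
        # A real engine would optimize the op list here (fuse consecutive
        # maps, reorder filters, ...) before running it.
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

lf = LazyFrame([1, 2, 3, 4]).map_values(lambda x: x * 10).filter_values(lambda x: x > 15)
result = lf.execute()  # only now does any work happen
```

The hard part, as the parent says, is everything hiding behind that `execute` comment.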
> "Pure functions" is usually a bad idea with large numeric data structures if they are non-lazy - because you'll end up copying the data.
This isn't necessarily true, thanks to Rust's strong ownership model. Methods can take `self` (without a reference) as an argument, which means the method takes ownership of the value. There is no copy in `double` in the example below, yet it is pure:
    #[derive(Clone)]
    struct A {
        v: Vec<i32>,
    }

    impl A {
        fn double(mut self) -> Self {
            for x in &mut self.v {
                *x *= 2;
            }
            self
        }
    }

    fn main() {
        let a = A { v: vec![2, 3, 4] };
        let b = a.double();
        dbg!(b.v[1]);
    }
If you now tried to access `a` the compiler would error out, saying you're trying to access a moved-from variable. If you still wanted to keep the original `a` around you simply write `let b = a.clone().double()`.
Yes, that's the approach that I'm taking with the library, but with a simple execution flow.
One of the things that used to bite me a lot with Pandas was having to manually free memory from intermediate dataframes that I would create. I'm expecting Rust to work better in this case, because when executing a 3-step pipeline, step 2 consumes the output from step 1, and the dataframe is freed from memory.
I'll write an experiment on this in the coming weeks/months, to see if my assumption would work.
i use pure functions wherever i can, but is this always the right approach when dealing with large dataframes? i imagine when you're chaining a few methods, it'd generate a large number of intermediate results that immediately get discarded/transformed again:
    frame.foo_columns().bar_rows().baz()
the result of .foo_columns() is used linearly – it gets passed to .bar_rows() immediately, with no other references to it. maybe this'd be a good place for rust's safe-mutability magic?
A good Rust equivalent to Pandas is really the main thing keeping me from switching to Rust for about half of my day job. This is incredibly promising, and I am following your work with great interest. Thank you for working on this!
Out of interest, what are the common tasks that you'd be looking to achieve? We can't replace Pandas and its ecosystem in the short-term, so for me a Rust-backed backend to a dataframe that is compatible with Pandas would be a win.
We use pandas for writing assertions on time series data collected from simulated and real firmware. E.g.:
- two signals X and Y should always be within 5% of each other while condition Z is true.
- signal A should transition from value M to N within time window t1-t2
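The first check could be sketched in pandas roughly like this (column names, the sample trace, and the 5% relative tolerance are all made up for illustration):

```python
import pandas as pd

# Toy trace: columns are hypothetical signal names.
df = pd.DataFrame({
    "x": [1.00, 1.02, 5.00, 1.01],
    "y": [1.01, 1.00, 1.00, 1.00],
    "z": [True, True, False, True],  # condition gating the check
})

# While z holds, x and y must stay within 5% of each other.
gated = df[df["z"]]
within = ((gated["x"] - gated["y"]).abs() / gated["y"].abs()) <= 0.05
assert within.all(), "signals diverged while condition was active"
```

Note the row where x = 5.00 is never checked, because z is false there.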
Would love something in rust to match the rest of the infrastructure. Python also has plenty of warts with regards to packaging, cross compilation and deployment
Not the OP, but also interested in similar applications of rust. The two main benefits I see are
1) Speed and memory efficiency, I've often worked in python/pandas and been frustrated by its performance. If I have a job that takes 12 hours to churn through all of the data, flipping over to rust may accelerate my iteration time to ~7.2-72 minutes.
2) Straightforward FFI bindings into other languages. Like C you can trivially bind rust code into python. Language choice can then be a bit of a two-way door back to python.
You pretty much nailed my reasons as well. Trying to explain that updating model results for this month's data will take 12 hours usually does not go over well. The learning curve to move stuff to Rust would be considered well worth it.
It depends on what type of object you're using in pandas as well as whether you ever need to use a User Defined function e.g. pd.Series.map. Crossing between python objects and C objects is incredibly slow, and python defined UDFs will usually run at python speed.
In most of my projects I'm unable to avoid using a udf/python data manipulation at some point.
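As a tiny illustration of the gap (exact timings vary by machine), both lines below compute the same result, but `map` calls a Python function once per element while the arithmetic version stays in compiled loops:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000))

slow = s.map(lambda x: x * 2 + 1)  # Python-level UDF: one interpreter call per element
fast = s * 2 + 1                   # vectorized: a couple of C loops over the data
```

On large series the vectorized form is typically one to two orders of magnitude faster.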
Getting a fast & safe Rust UDF layer that targets SPIR/CUDA/PTX would be quite interesting wrt enabling RAPIDS.ai (libcudf + python bindings) as well. It'd enable getting rid of slow & quirky numba etc - I remember Mozilla had GPGPU Rust codegen experiments here awhile back...
rustc does have a working nvptx target today, though it’s not supported nearly as well as the mainstream cpu targets, and some things you would really want for gpu programming (e.g. shared memory address space) are not currently exposed in the rust language. But kernels written in rust can compile to ptx; you’ll still need to write glue code.
Yeah, this would be about extending it to columnar analytics funcs, like `df['x'].apply(f)` and `df.query("x > 10 and y < 10")`. I realized I may be wrong about the compiler-speed part; not sure if it'd be faster than numba for codegen nowadays :)
No; it's a pity that a lot of newly built libraries and platforms get inspired by the pandas workflow in search of something better, instead of looking at the R ecosystem, where people actually enjoy the available tools and workflows.
Everyone does. On the other hand I, and many others I assume, aren't willing to move to R for the sake of the one superior tool over python. Two if you count ggplot2, I suppose.
Yeah but for giving someone an intro to what dataframes are, I think pandas might leave a sour taste where dplyr would allow someone to learn comfortably.
Although, one might run into a "true level" situation. This is when Morty feels what true level is like, then says "everything is crooked, reality is poison" when he has to live in the "fake level" world.
What bothers me is: why can't Rust developers just import a C++ library that does the job? What novelty would a Rust version of the same thing bring really? Why not focus on real innovation, and use wrappers for things that were already built in another language a decade ago?
> why can't Rust developers just import a C++ library that does the job?
We can and frequently do. Rust even seems to have one of the better packaging stories around building C++ libs, albeit with several heavyweight dependencies (C++ compiler + toolchain, libc, etc.). Granted, calling into it from Rust means we need a C ABI, but that's rarely a terrible problem.
> What novelty would a Rust version of the same thing bring really?
I find Rust easier to cross compile for starters. I still haven't tamed emscripten for compiling C++ as WASM - the Rust toolchains by comparison make compiling Rust to WASM trivial. Rust's larger stdlib, standard package manager, and healthy library ecosystem means whatever I'm cross compiling is less likely to need to resort to platform specific nonsense in need of manual porting.
But the main draw of Rust to me is the ability to write memory safe code and race-safe native code, reducing how many heisenbugs I encounter and the portion of the codebase where they might lurk. If your codebase is large, and mostly memory/race safe, this is a huge advantage. If your codebase is large, and only 50% memory/race safe, you're still frequently in debugging hell inside a large haystack.
If you take a super solid C++ dependency, maybe you kinda retain that advantage, but most C++ dependencies I take on don't meet that high bar (including those me and my coworkers write!) Alternatively, I could take on a relatively mediocre Rust dependency, and I'll probably still retain that memory/race-safety advantage. There are exceptions to this of course thanks to Rust's many opt-in escape hatches - but I also find it much easier to audit Rust code for their use and the possible problems they may cause, versus C++ where every array access is suspect.
Do you think all Rust developers are connected together like some sort of hivemind? What an open source developer chooses to spend their time on doesn't require the consideration or consent of the entire Rust community, nor does it speak for all of it.
That said there are plenty of reasons why they shouldn't just import c++ libraries, one of them being that c++ is not the same language as rust.
A certain mindset is bothered by observing others apparently wasting effort they don't need to. It has nothing to do with any benefit or loss for the observer. Very roughly, one can think of it as a sort of altruism focused on efficiency.
This mindset may be appeased by demonstrating why you don't believe your actions are actually a waste of effort.
it has to be done at some point. I still have vcpkg on my Windows machine, because I needed to install it and a few GB of other things, just to use PostgreSQL in Rust a few years ago (I think Diesel).
If someone hadn't implemented the libpq protocol in Rust, I likely wouldn't have been able to write binary COPY support for the dataframe in one afternoon.
This reduces the barrier for people to productively use their favourite languages. It might be unproductive at first, but we see the benefits as time goes on.
> This reduces the barrier for people to productively use their favourite languages. It might be unproductive at first, but we see the benefits as time goes on.
It is vastly unproductive to reproduce anything useful in 10 different languages just because our cross-language tooling is awful.
The lack of cross-language developer tooling leads to the ridiculous situation where every language has its own package manager and is completely unable to use a package made for another language.
There are reasons for that... but they are much more political than technical, and that's the sad state we are in.
I don't understand how it is completely unproductive. Surely there must be some advantage that Rust has over C++? If that advantage exists, then there is your justification for reinventing something that already exists. If all you do is reuse existing C++ libraries in Rust, then why even bother with Rust in the first place? Why not just use C++ directly? Most of the value of using Rust is derived from the things that are written in Rust.
Let me give you an example, this time with Groovy. Groovy runs on the JVM, and that means it benefits from the existing JVM ecosystem, right? Well, of course you get the benefit of existing libraries, but there was a crucial difference between using them through Groovy/Grails and using them natively through Java. There were lots of Grails plugins that were very thin layers above the Java libraries, yet somehow the usability increased a hundredfold. A small plugin with maybe 1000 lines of code ended up providing more value to our business than the part of the library that was written in Java.