In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API, and that matters to me more than anything else. But I'm glad to see Polars moving away from the kludgey sprawl of the Pandas API towards the perfection of dplyr... while also being blazingly fast!
Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!
*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.
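To be concrete, the boilerplate I mean looks something like this (a toy example of my own, not from the post):

import polars as pl

df = pl.DataFrame({"height": [170, 185, 190]})

# Every column reference has to go through the library namespace...
tall = df.filter(pl.col("height") > 180)

# ...whereas dplyr lets you write the bare column name: filter(df, height > 180)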
I've been working on a dataframe library for Elixir that's built on top of Polars and heavily influenced by dplyr, if you're interested in checking it out: https://github.com/elixir-nx/explorer
The convention in Julia is that a package that defines a type Abc is called Abcs.jl. Also, DataFrames.jl provides its own manipulation functions which DataFramesMeta is a wrapper around using metaprogramming, hence the name.
That makes sense, but I still think the meta name is confusing. I mean, as a user the fact that it was implemented using metaprogramming techniques has no bearing, it's an implementation detail. Actually, my brain never thought to associate meta in this context with metaprogramming. Makes sense in hindsight, but still confusing.
But still, I can't really come up with a nicer name. VerbalDataFrames to match the dplyr verbs idiom?
Also worth plugging the raw speed of R's data.table package, which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.
I don't like it as much as dplyr and I stand behind that. It's too "clever", especially with respect to joins.
Everything is fine "once you understand how to use it", even assembly code, but it's not equally expressive or intuitive. So I don't value data.table speed that much, it's my thinking and typing speed that's usually the limiting factor. I would always recommend dplyr over anything else for someone learning how to use tables.
I also can't help but point out that data.table has the worst first FAQ answer I've ever seen in software documentation: https://cran.r-project.org/web/packages/data.table/vignettes.... Just astonishingly bad. I could write an essay about the unique and diverse ways in which this thing is both incredibly poorly organized and deeply user-hostile.
But if you truly have a need for speed on large datasets, it may be for you.
In what way does data.table trump dplyr? Genuinely interested in knowing.
While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.
dplyr also grew into a full-fledged suite of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]
The easiest way to think about it: data.table is for people who are doing a lot of exploratory data analysis every day. If you're doing the same thing over and over, it makes sense to create a DSL specific to that task and optimize the hell out of it. That's basically data.table.
dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.
I disagree. Doing data manipulation one action at a time in a piped sequence is easiest to reason about because the state right before you apply a new operation is always clear.
data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.
To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):
dplyr:
df %>%
  group_by(a) %>%
  filter(b < mean(b))

data.table:
DT[DT[, .I[b < mean(b)], by = .(a)]$V1]
Compared to the simple, declarative feel of the dplyr version, there's a lot of weird stuff going on in the data.table version. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff.
(And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)
The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.
The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
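Polars' lazy API is in the same family, for what it's worth: you still write one step at a time, and the optimizer sees the whole chain before anything runs. A rough sketch with made-up data, doing roughly the same grouped filter as above:

import polars as pl

lf = pl.DataFrame({"a": [1, 1, 2, 2], "b": [1.0, 3.0, 5.0, 9.0]}).lazy()

# Nothing executes until collect(), so the whole pipeline is optimized as one query.
result = (
    lf
    .filter(pl.col("b") < pl.col("b").mean().over("a"))  # grouped filter via a window expression
    .collect()
)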
One plus with dplyr is that I can share the code with non-R programmers (and even some non-programmers) and they can follow what is happening pretty easily, while data.table takes some more explanation.
The dplyr API is not ideal in my experience: overly verbose, with confusing group/melt/cast operators. I much, much prefer data.table. In your edit you mention concision; data.table is practically the platonic ideal of that!
Meh. Some people will never stop using Perl or APL because you can get anything done in five random characters (well, anything the language is optimized to express, everything else is a lot harder). I respect it but it's not for me.
The tidyverse has the most advanced and intuitive versions of all the things you mention IMO. It has evolved a lot in the past couple years and your impressions of it could be out of date.
There is also the dtplyr backend for data.table speed with dplyr syntax, but I don't even bother because dplyr is almost always fast enough for me.
I did go check out what's new in the tidyverse after your comment and was pleased to see new functions like pivot_wider and pivot_longer replacing the extremely confusing mess of spread and gather. So it's great to see the ecosystem evolving toward better usability. However, I would hardly count it as a victory when late in the game you have to change the API for some core data manipulation functions because you made them too confusing the first time around.
I think you are also maybe assuming everyone has the same use-case as you for data manipulation libraries. If you are coming from a non-programming context and picking up R for the first time, no doubt tidyverse is the way to do that. The verbosity is obviously a benefit if you're having to read someone else's code and are not interested in learning a DSL just to understand what columns are being filtered on or dropped or whatever.
But if you are doing data analysis full time and are writing thousands of lines of throwaway EDA code a week, most of it only to be seen by yourself, the concision and speed that data.table offers is basically second to none, in any language. Rapid iteration for you personally is the point. Less typing is good, because you're trying to move as fast as possible to explore hypotheses. Execution speed on medium sized data is important, because a few extra seconds on every run matters a lot when you are running 500 micro-batches of analysis code a day. And as the h2o benchmarks show, data.table is still quite a bit faster than dplyr. Obviously not everyone needs the speed, but a lot of us do!
It's my hypothesis that pretty much everyone who loves data.table is a finance/trading type person who as you say needs to quickly write tons of throwaway exploratory code to analyze large stock price datasets or the like.
I would probably prefer data.table to dplyr in that use case as well. The creator of data.table clearly comes from that background and wrote it for those kinds of workloads.
I will also admit that the latest data.table tutorials suggest a lot of improvement over time. data.table made some truly WTF decisions in its early versions and has backtracked on all of them. The join API is much more reasonable now and it supports non-equijoins, which for many people could be the decider vs dplyr just by itself.
The dplyr API has only evolved so much because Hadley set insanely high standards for how powerful and intuitive it should be. So personally I don't count it against them that they didn't get it 100% right the first time.... even though I personally have been burned a couple times by all the changes. I think it's worth it for what they have achieved.
Not that it's all roses. Tidyverse stack traces have become kind of horrible. They're dozens and dozens of layers deep and you have to be pretty experienced to sift through the noise. I'm an old hand and know how to deal with it, which is probably the way a lot of people feel about their favorite table package... even (gag) Pandas.
In my case your guess is completely correct, as I learned data.table in a financial company analyzing large insurance datasets :)
I apologize if I came across as a hardliner. Sometimes I feel like data.table is not well advertised for how capable it is, so I will defend the library if given the chance. It's surprising how many "big data workloads" you can replace with a high-memory cloud instance and a simple data.table script. Cheers to using the right tool for the right job.
My background is (somewhat) similar but for some reason I've been a stick in the mud keeping on with dplyr. Somehow I feel the verbosity helps make sure I get things right. And a lot of my code is not throwaway, I have pipelines I have to maintain and teach others to maintain. (There's a guy on my team replacing some of my uglier code with simple data.table non-equijoins and I can't even argue with him as I mentioned earlier!)
I'm glad you're having success with data.table and I totally support you against the forces of evil trying to make us use Spark or whatever is the latest big data nonsense to analyze a few million rows.
It's like how we may not agree what project management tool to use but we all agree it's not JIRA :)
I may end up switching to data.table after all. I find dplyr easier to reason about for complex production pipelines that need to be precisely "correct", but all the package developers are raising the bar all the time and data.table may be OK for this use case by now. I definitely do feel the pain point of dplyr slowness here and there.
Is there a dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross-language devs. Maybe there is some language barrier to implementing an elegant/faithful version in Python?
There have been a number of interesting attempts at this. They have names like dplython, and haven't really caught on widely. Python isn't really the best language to build a dplyr-like API in, since both the structure and the culture of the language work against the metaprogramming and non-standard evaluation needed to create such DSLs.
I built my own data frame implementation on top of NumPy specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but it should feel familiar and is much simpler and more consistent than Pandas. And no indexes or axes.
Having done this, a couple of notes on what will unavoidably differ in Python:
* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip the enclosing parentheses in Python, though; method chains look a bit verbose.
* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains, group-wise aggregation, etc. (rough sketch after this list). I'd be interested if someone has a better solution.
* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
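To illustrate the lambda point from the second bullet, here's what the workaround looks like in pandas terms (a sketch with made-up data; my own library's spelling differs a bit):

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "b": [1.0, 2.0, 10.0]})

# Without non-standard evaluation, the "current" frame in a chain has to be
# named explicitly via a lambda; this is the grouped filter from upthread.
out = (
    df
    .assign(b2=lambda d: d["b"] * 2)
    .loc[lambda d: d["b"] < d.groupby("g")["b"].transform("mean")]
)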
You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along and saw the direction the language was going, it's the reason I now mainly use Python. I just could not put up with the non-standard evaluation, where everything ends up being a 100+ line script instead of composable functions, or with the breaking API changes every 6 months.
And if you want you and your team mates to hate you when they need to work on your code later, it's perfect: random, mystery functions all over the place.
> No Index
> They are not needed. Not having them makes things easier. Convince me otherwise
Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column).
The library in general seems interesting. I'm not 100% sold on the syntax (as usual, projection is called select...), but it is not pandas, which is already a huge plus.
Yeah, this confusion is in the API as well (you can pass a projection to the IO readers). We used `select` because of SQL. In the logical plan we make the correct distinction between selection and projection, but you don't see that very much in the API.
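Concretely, with a toy frame:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Projection (choose columns) is spelled `select`, because SQL...
projected = df.select(["a"])

# ...while selection (choose rows) is spelled `filter`.
selected = df.filter(pl.col("a") > 1)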
There are so many dataframe libraries, many of which have APIs closely following pandas, but not drop-in replacements. I wish we could agree on a standard describing the core parts of what a dataframe must do, such that code depending only on those operations can easily move between dataframes.
This was my PhD focus. We identified a core "dataframe algebra"[1] that encompasses all of pandas (and R/S data.frames): a total of 16 operators that cover all 600+ operators of pandas. What you describe was exactly our aim. It turns out there are a lot of operators that are really easy to support and make fast, and that gets you about 60% or so of the way to supporting all of pandas. Then there are really complex operators that may alter the schema in a way that is undeterminable before the operation is carried out (think a row-wise or column-wise `df.apply`). The flexibility that pandas offers is something we were able to express mathematically, and with that math we can start to optimize the dataframe holistically, rather than chipping away at small parts of pandas that are embarrassingly parallel.
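For example (a toy pandas illustration of that last point, not from the paper):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# The column names of the result depend on the values in `b`, so the output
# schema can't be known until the function has actually run over the data.
out = df.apply(lambda row: pd.Series({f"col_{row['b']}": row["a"]}), axis=1)
# out has columns col_x and col_y, with NaN where a row didn't produce that key.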
Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use-cases (which is not a bad thing). It can be frustrating for users who may have no idea what they can do with a given tool just because it is called "dataframe", but I don't know how to fix that.
I've used pandas off and on for the better part of seven or eight years and, unlike with other Python libraries, I always feel like I'm starting from scratch every time I begin a new project, following a tutorial or the official API docs on a second screen just to remember how to do fairly basic stuff.
The reason is that once I'm done building whatever model I've needed it works so well I don't have to touch it again for a few years and I forget everything I learned (or the API changes again).
In Julia there's something better, called Tables.jl. It's not exactly an API for dataframes (what would be the point of that? You don't need many implementations of dataframes, you just need one great one). Instead it's an API for table-shaped data. Dataframes are containers for table-shaped data.
I wrote a library that wraps polars DataFrame and Series objects to allow you to use them with the same syntax as with pandas DataFrame and Series objects. The goal is not to be a replacement for polars' objects and syntax, but rather to (1) Allow you to provide (wrapped) polars objects as arguments to existing functions in your codebase that expect pandas objects and (2) Allow you to continue writing code (especially EDA in notebooks) using the pandas syntax you know and (maybe) love while you're still learning the polars syntax, but with the underlying objects being all-polars. All methods of polars' objects are still available, allowing you to interweave pandas syntax and polars syntax when working with MppFrame and MppSeries objects.
Furthermore, the goal should always be to transition away from this library over time, as the LazyFrame optimizations offered by polars can never be fully taken advantage of when using pandas-based syntax (as far as I can tell). In the meantime, the code in this library has allowed me to transition my company's pandas-centric code to polars-centric code more quickly, which has led to significant speedups and memory savings even without being able to take full advantage of polars' lazy evaluation. To be clear, these gains have been observed both when working in notebooks in development and when deployed in production API backends / data pipelines.
I'm personally just adding methods to the MppFrame and MppSeries objects whenever I try to use pandas syntax and get AttributeErrors.
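For anyone curious what that looks like mechanically, here's a minimal sketch of the idea (not the actual library code): wrap a polars DataFrame, implement pandas-style spellings on top of it, and fall back to polars for everything else.

import polars as pl

class MppFrame:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    # pandas-style spellings implemented on top of polars
    def rename(self, columns: dict) -> "MppFrame":
        return MppFrame(self._df.rename(columns))

    def head(self, n: int = 5) -> "MppFrame":
        return MppFrame(self._df.head(n))

    # anything not wrapped falls through to the underlying polars object,
    # which is what lets you interleave pandas and polars syntax
    def __getattr__(self, name):
        return getattr(self._df, name)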
They have a benchmark for expressiveness (as opposed to performance). Part of this inquiry has been to form a "standard library" of dataframe operations.
Polars could bring the best of both worlds together if it could codegen Python API calls to their Rust equivalents. A user conducts ad-hoc analysis and rapid development in Python. When the work is ready to ship, the user invokes a codegen step to transform it into the equivalent Rust API calls, resulting in a new Rust module.
I’ve been using it for the past quarter. In addition to the speed, I’m very pleased with the pyspark-esque api. This means migrating code from research to production is that much easier.
I'm confused. Polars is built on top of the Rust bindings for Apache Arrow. Arrow already has Python bindings. What does this project add by creating a new Python binding on top of the Rust binding?
I'm reading all these comments and keep asking myself if I'm missing something, because I honestly sort of like pandas' API?
Sure dplyr is nice -- it felt that way on rare occasions that I got to use it, at least -- but you get used to anything.
So since I'm using Python and know it quite well, I'm just more comfortable sticking with Python's pandas framework rather than switching to R for data processing.
This question was asked the last time the author posted this, a few months ago. I'm surprised they didn't update the benchmarks. Kind of makes me think Vaex is faster.
If you use Pandas daily, maybe you get used to it and can ignore the issues, but for anyone using Pandas occasionally, it's a huge pain every time trying to figure out how to use it. The API is not intuitive and the documentation is very verbose and unclear. And the top Stack Overflow answers are often the "old way" of doing something, because yet another way of doing the same thing has since been added to the API.
For some people pandas seems to click. Good for you. I always struggle with google and the manual to get even simple things done.
I can never figure out if I'm going to get a Series or a DataFrame out of an operation. It seems to edit rows when I think it'll edit columns, and I constantly have to explicitly reset the index to avoid getting into problems.
I think dplyr is easy to read and write. It does get longer than other alternatives, but the readability is imho so good that it doesn't feel verbose.
it's just so bloated and verbose. many ways to do the same things, annoying defaults (how is column not the default axis to drop?), indices are beyond frustrating (have never met anyone who doesn't just reset them after a groupby), inconvenient to do custom aggregations, very slow, not opinionated enough
Then there are the inherent Python issues, like dates and times, poor support for non-standard evaluation, and handling mixed data types and nulls.
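The index and Series-vs-DataFrame gotchas, in miniature:

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1, 2, 3]})

s = df["v"]     # single brackets -> Series
d = df[["v"]]   # double brackets -> DataFrame

# groupby().sum() moves the grouping key into the index...
grouped = df.groupby("g").sum()

# ...so most people immediately flatten it back into a column:
flat = grouped.reset_index()   # or: df.groupby("g", as_index=False).sum()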
I've never seen the term "dataframe" used as it is on this website, and the commenters here all seem to use it. Judging by the examples it seems to just refer to a "row" from e.g. a CSV or SQL query. So is that all it is, or am I missing something?
Does anybody here know dataframe systems that are able to handle file sizes bigger than the available RAM? Is polars able to handle this? I am only aware of disk.frame (diskframe.com), but don't know how well it performs.
To you and all the other sibling comments: Thanks a lot! Exactly what I have been looking for!
With regard to Vaex, I would really be interested in an independent benchmark comparing it to dask, spark, data.table etc. However, I have seen in the comments that others also can't find that.
It looks interesting but phrases like "embarrassingly parallel execution" make my marketing hype detectors trigger. Maybe they could tone down their self promotion just a touch. Also "Even though Polars is completely written in Rust (no runtime overhead!) ...". I find that hard to believe.
It's a term for the nature of a problem, not a library or software package. It looks like they have designed the API so that "embarrassingly parallel" problems can naturally be computed using Polars. That would be fantastic, much better than Pandas. The way they write it sounds like marketing fluff to me and that's a shame because Polars looks like a useful thing.
“Embarrassingly parallel execution” means that it parallelizes (only) problems that are embarrassingly parallel. The meaning is clear — if you want to be really pedantic about it, problems are “parallelizable” and only execution is “parallel”, but “embarrassingly parallelizable” is too many syllables.
They sorted the results by speed of 1st run. For a language like Julia, which is JIT-compiled, that's not a fair comparison, considering that you compile once and run millions of times.
Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
EDIT: Also, let's think critically about some of the entries there. Most of them are languages, but then you have things like Arrow, which is a data format; Spark, which is an engine; and ClickHouse and DuckDB, which are databases. The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is. They were built for different purposes. These are borderline meaningless comparisons.
- On advanced query: 3rd, 6th, 6th, 4th (up 1), - (out of memory).
> The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is.
Not true. From a quick peek at the benchmark code, ClickHouse and Spark use in-memory tables. I assume the other engines do too.
Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with a hot cache).
Also, on the second run Julia is not the fastest. Julia would not be faster than Rust; it's got a garbage collector. This is what you see in the join benchmarks, which really push the allocator.
Next to that, the databases run in in-memory mode, so there is no disk overhead. Spark is slower because of the JVM plus row-wise data.
> Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with a hot cache).
Here's my view: the author of that page has commented here on HN; if my claim were as outrageously wrong as you claim, he would've corrected it.
As mentioned in that thread, GC and strings, and especially the combination of the two, can be very much a downer in terms of Julia performance. That's actually pretty surprising, since strings are often as important as, if not more important than, numbers for a lot of data processing needs.
I'd also say in terms of compilation time, some autocaching layer outside of precompilation would do wonders.
Agreed, and I was looking for an option to sort by second run.
One trick I've tried to some effect is to run the Julia code on a smaller data size first, so the compilation gets done, and then repeat on the large one so it isn't interrupted by compilation. Not sure if this is a recommended approach. Benchmarking Julia is a pain for this reason: compilation always gets mixed up with runtime. But it hasn't prevented me from using it interactively. Pretty happy with it actually.
Not really. They are designed to showcase a common use case across multiple technologies.
The beauty of this benchmark is that there is a hardware limit included so that it forces you to create novel solutions to perform well.
>Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
Not sure where you're getting that, but even on the second run Julia doesn't really compete with data.table/Polars.
It's obvious that you're promoting duck eggs at the expense of, say, chicken eggs or quail eggs or even ostrich eggs. Maybe you could tone that down a bit.
> considering that you compile once and run millions of times.
If you're writing data pipelines then yes, but a lot of Pandas users use it interactively. As much as I'd rather use Julia, the last time I tried it I found myself waiting for computation far more often than with a Jupyter/Python workflow.
Give it another try. They've improved the first run times quite a bit over the last few versions. Package precompilation has gotten way better as well.
The "embarrassingly parallel" bit is aimed at the expression API. It allows one to write multiple expressions, and all of them get executed in parallel. (So "embarrassingly" meaning they don't have to communicate or use locks.)
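For example, each of these expressions is independent, so they can be evaluated on separate threads without coordinating:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Independent expressions in one select: no shared state, no locks.
out = df.select([
    pl.col("a").sum().alias("a_sum"),
    pl.col("b").mean().alias("b_mean"),
    (pl.col("a") * pl.col("b")).max().alias("ab_max"),
])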