Polars: Fast DataFrame library for Rust and Python (pola.rs)
238 points by daureg on Dec 16, 2021 | 124 comments



In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API and that matters to me more than anything else. But I'm glad to see Polars moves away from the kludgey sprawl of the Pandas API towards the perfection of dplyr... while also being blazingly fast!

Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!

*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.
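
For a concrete sense of the boilerplate in question, a small Polars sketch (assuming a recent release; the data is made up) with the usual aliasing trick for trimming it:

    import polars as pl

    df = pl.DataFrame({"a": [1, 1, 2], "b": [1.0, 3.0, 2.0]})

    # Spelled out in full: every column reference goes through pl.col(...)
    out = df.filter(pl.col("b") < pl.col("b").mean())

    # A common way to cut the noise: alias the column constructor
    c = pl.col
    out = df.filter(c("b") < c("b").mean())

Not dplyr-style bare column names, but the alias keeps things reasonably terse.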


I've been working on a dataframe library for Elixir that's built on top of Polars and that's heavily influenced by dplyr if you're interested in checking it out: https://github.com/elixir-nx/explorer


that's really cool, thanks!


DataFramesMeta.jl might be exactly what you are looking for then! The syntax is very close to dplyr, but has performance benefits thanks to Julia.

Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/


DataFramesMeta is great!

But I always get confused by the name. Since DataFrames.jl is lower level shouldn't that be DataFramesBase.jl and the meta package be DataFrames.jl?


Yes it absolutely needs a new name!


The convention in Julia is that a package that defines a type Abc is called Abcs.jl. Also, DataFrames.jl provides its own manipulation functions, which DataFramesMeta wraps using metaprogramming, hence the name.


That makes sense, but I still think the meta name is confusing. I mean, as a user the fact that it was implemented using metaprogramming techniques has no bearing, it's an implementation detail. Actually, my brain never thought to associate meta in this context with metaprogramming. Makes sense in hindsight, but still confusing.

But still, I can't really come up with a nicer name. VerbalDataFrames to match the dplyr verbs idiom?


Yeah, I agree it's not a good name. I think using the word macro instead of meta is more useful to the user, something like DataFramesMacros.jl.


One of the piping macro packages + dataframes.jl works as well.


Also worth plugging the speed of R’s data.table package, which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.


I don't like it as much as dplyr and I stand behind that. It's too "clever", especially with respect to joins.

Everything is fine "once you understand how to use it", even assembly code, but it's not equally expressive or intuitive. So I don't value data.table speed that much, it's my thinking and typing speed that's usually the limiting factor. I would always recommend dplyr over anything else for someone learning how to use tables.

I also can't help but point out that data.table has the worst first FAQ answer I've ever seen in software documentation: https://cran.r-project.org/web/packages/data.table/vignettes.... Just astonishingly bad. I could write an essay about the unique and diverse ways in which this thing is both incredibly poorly organized and deeply user-hostile.

But if you truly have a need for speed on large datasets, it may be for you.


The FAQ isn't for new data.table users.

https://rdatatable.gitlab.io/data.table/articles/datatable-i...

Which is why it isn't really linked anywhere else.


There is an official dplyr extension that leverages data.table: https://dtplyr.tidyverse.org/


In what way does data.table trump dplyr? Genuinely interested in knowing.

While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.

dplyr also grew into a full-fledged suite of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]


the easiest way to think about it is data.table is for people who are doing a lot of exploratory data analysis every day. If you're doing the same thing over and over, it makes sense to create a DSL specific to that task and optimize the hell out of it. that's basically data.table.

dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.


Anecdotal data: I found that data.table ingestion speed with fread() trumps absolutely everything else.


This observation is pretty widely shared.


The claimed difficulty of reading it is overstated.

    dt[rows, columns, groups]

Assuming your dplyr code is generally split-apply-combine, the data.table version is shorter and easier to reason about.

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/


I disagree. Doing data manipulation one action at a time in a piped sequence is easiest to reason about because the state right before you apply a new operation is always clear.

data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.

To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):

dplyr:

  df %>% 
    group_by(a) %>%
    filter(b < mean(b))

data.table:

  DT[DT[, .I[b < mean(b)],
        by = .(a)]$V1]

Compared to the simple, declarative feel of the dplyr version, there's a lot of weird stuff going on in the data.table one. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff.

(And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)

The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.

The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
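
For instance, the grouped filter above reads as a one-step-at-a-time pipe in Polars' lazy API while still being optimized as a single query (a sketch assuming a recent Polars release):

    import polars as pl

    df = pl.DataFrame({"a": [1, 1, 2, 2], "b": [1.0, 3.0, 2.0, 4.0]})

    # Each step is one operation, but nothing runs until .collect(),
    # so the engine still sees (and can optimize) the whole query.
    out = (
        df.lazy()
          .filter(pl.col("b") < pl.col("b").mean().over("a"))  # grouped filter
          .collect()
    )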


Your code golf example makes no sense.

    b_mean <- dt[, mean(b)]

    dt[b < b_mean, by = .(a)]

Unlike the dplyr solution, the dt solution is robust and we can independently test to make sure the mean of b makes sense.

The very easy-to-reason-about concept of dt[rows, columns, groups] makes the code extremely clear.

Your translation example is absolutely bonkers because it’s trying to pigeonhole the simplicity of dt into the nonsense that is dplyr.


The easiest to understand data frame API syntax is SQL: select cols from df where rows match condition group by grouping cols.

data.table syntax is just like that. But less verbose. Plus super fast. No reason to not love it.


I agree that if that's all you do with data, data.table makes it easy.


One plus with dplyr is that I can share the code with non-R programmers (and even some non-programmers) and they can follow what is happening pretty easily, while data.table takes some more explanation.


The dplyr API is not ideal in my experience. Overly verbose and confusing group/melt/cast operators. I much much prefer data.table. In your edit you mention concision; data.table is practically the platonic ideal of that!


Meh. Some people will never stop using Perl or APL because you can get anything done in five random characters (well, anything the language is optimized to express, everything else is a lot harder). I respect it but it's not for me.

The tidyverse has the most advanced and intuitive versions of all the things you mention IMO. It has evolved a lot in the past couple years and your impressions of it could be out of date.

There is also the dtplyr backend for data.table speed with dplyr syntax, but I don't even bother because dplyr is almost always fast enough for me.


I did go check out what's new in the tidyverse after your comment and was pleased to see new functions like pivot_wider and pivot_longer replacing the extremely confusing mess of spread and gather. So it's great to see the ecosystem evolving toward better usability. However, I would hardly count it as a victory when late in the game you have to change the API for some core data manipulation functions because you made them too confusing the first time around.

I think you are also maybe assuming everyone has the same use-case as you for data manipulation libraries. If you are coming from a non-programming context and picking up R for the first time, no doubt tidyverse is the way to do that. The verbosity is obviously a benefit if you're having to read someone else's code and are not interested in learning a DSL just to understand what columns are being filtered on or dropped or whatever.

But if you are doing data analysis full time and are writing thousands of lines of throwaway EDA code a week, most of it only to be seen by yourself, the concision and speed that data.table offers is basically second to none, in any language. Rapid iteration for you personally is the point. Less typing is good, because you're trying to move as fast as possible to explore hypotheses. Execution speed on medium sized data is important, because a few extra seconds on every run matters a lot when you are running 500 micro-batches of analysis code a day. And as the h2o benchmarks show, data.table is still quite a bit faster than dplyr. Obviously not everyone needs the speed, but a lot of us do!


It's my hypothesis that pretty much everyone who loves data.table is a finance/trading type person who as you say needs to quickly write tons of throwaway exploratory code to analyze large stock price datasets or the like.

I would probably prefer data.table to dplyr in that use case as well. The creator of data.table clearly comes from that background and wrote it for those kind of workloads.

I will also admit that the latest data.table tutorials suggest a lot of improvement over time. data.table made some truly WTF decisions in its early versions and has backtracked on all of it. The join API is much more reasonable now and it supports non-equijoins, which for many people could be the decider vs dplyr just by itself.

The dplyr API has only evolved so much because Hadley set insanely high standards for how powerful and intuitive it should be. So personally I don't count it against them that they didn't get it 100% right the first time.... even though I personally have been burned a couple times by all the changes. I think it's worth it for what they have achieved.

Not that it's all roses. Tidyverse stack traces have become kind of horrible. They're dozens and dozens of layers deep and you have to be pretty experienced to sift through the noise. I'm an old hand and know how to deal with it, which is probably the way a lot of people feel about their favorite table package... even gag Pandas.


In my case your guess is completely correct, as I learned data.table in a financial company analyzing large insurance datasets :)

I apologize if I came across as a hardliner. Sometimes I feel like data.table is not well advertised for how capable it is, so I will defend the library if given the chance. It's surprising how many "big data workloads" you can replace with a high-memory cloud instance and a simple data.table script. Cheers to using the right tool for the right job.


My background is (somewhat) similar but for some reason I've been a stick in the mud keeping on with dplyr. Somehow I feel the verbosity helps make sure I get things right. And a lot of my code is not throwaway, I have pipelines I have to maintain and teach others to maintain. (There's a guy on my team replacing some of my uglier code with simple data.table non-equijoins and I can't even argue with him as I mentioned earlier!)

I'm glad you're having success with data.table and I totally support you against the forces of evil trying to make us use Spark or whatever is the latest big data nonsense to analyze a few million rows.

It's like how we may not agree what project management tool to use but we all agree it's not JIRA :)

I may end up switching to data.table after all. I find dplyr easier to reason about for complex production pipelines that need to be precisely "correct", but all the package developers are raising the bar all the time and data.table may be OK for this use case by now. I definitely do feel the pain point of dplyr slowness here and there.


True that data.table is much simpler and faster; that's one of the reasons I switched from dplyr to data.table.


Is there a dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross-language devs. Maybe there is some language barrier to implementing an elegant/faithful version in Python?


There have been a number of interesting attempts at this. They have names like dplython, and haven't really caught on widely. Python isn't really the best language to build a dplyr-like API in, since both the structure and the culture of the language are against metaprogramming and nonstandard evaluation to create DSLs.
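
The closest idioms in plain pandas lean on lambdas and query strings to defer column references, which gives a pipe-like feel but never quite the dplyr ergonomics (a rough pandas sketch):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

    # Without non-standard evaluation, column references have to be
    # deferred with lambdas or strings:
    out = (
        df.assign(c=lambda d: d.a + d.b)   # ~ mutate(c = a + b)
          .query("c > 15")                 # ~ filter(c > 15), string-based
          .sort_values("c")                # ~ arrange(c)
    )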


Agreed, dplyr is great.

I built my own data frame implementation on top of NumPy, specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but it should feel familiar, and much simpler and more consistent than Pandas. And no indexes or axes.

Having done this, a couple of notes on what will unavoidably differ in Python:

* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip the enclosing parentheses in Python though; method chains look a bit verbose.

* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains and group-wise aggregation etc. I'd be interested if someone has a better solution.

* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
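
To make the NA point concrete (plain pandas/Polars behavior, nothing specific to the library linked below):

    import pandas as pd
    import polars as pl

    # NumPy-backed pandas has no integer NA: a missing value forces a
    # cast to float64 and shows up as NaN.
    pd.Series([1, 2, None]).dtype            # float64

    # The newer nullable dtypes keep the integer type and use pd.NA...
    pd.Series([1, 2, None], dtype="Int64")   # Int64 with <NA>

    # ...while Arrow-backed libraries like Polars have null built in.
    pl.Series([1, 2, None])                  # Int64 with a null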

https://github.com/otsaloma/dataiter


>NumPy (and Pandas) is still missing a proper missing value (NA).

But if it's missing a missing value, doesn't that mean that it has a proper missing value?

I'll let myself out now...


You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along, and saw the direction the language was going, it's the reason I now mainly use Python. I just could not put up with the non-standard evaluation (so everything ends up being a 100+ line script instead of composable functions) and the breaking API changes every 6 months.


Still very small, but Nim's dataframe library (datamancer) has a dplyr API (and it is fast): https://github.com/SciNim/Datamancer

Being in Nim, it will also be easy to add sweet DSLs.


You don't need to write "import pandas; pandas.bla()", you can do "from pandas import *; anything_in_pandas()" if you want quick and dirty.


And if you want you and your team mates to hate you when they need to work on your code later, because you’ve got random, mystery functions all over the place.


> dplyr

Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?


From the python docs:

  > No Index
  > They are not needed. Not having them makes things easier. Convince me otherwise

Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column).
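
A tiny pandas-vs-Polars illustration of the point (hedged on method spellings, which have shifted across Polars versions: recent releases use group_by, older ones groupby):

    import pandas as pd
    import polars as pl

    pdf = pd.DataFrame({"a": [1, 1, 2], "b": [1.0, 3.0, 2.0]})
    # pandas: the group key moves into the index, so it usually gets
    # flattened straight back out again.
    pdf.groupby("a").mean().reset_index()

    # Polars: no index at all; the result is just another flat table.
    pl.DataFrame({"a": [1, 1, 2], "b": [1.0, 3.0, 2.0]}).group_by("a").mean()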

The library in general seems interesting. I'm not 100% sold on the syntax (as usual, projection is called select...), but it is not pandas, which is already a huge plus.


> (as usual, projection is called select...)

Yeah.. this confusion is in the API as well (you can pass a projection to IO readers). We used `select` because of SQL. In the logical plan we make the correct distinction between selection and projection, but you don't see that very much in the API.
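
Roughly how that plays out in practice (a sketch; "data.csv" is a hypothetical file, API names per recent Polars releases):

    import polars as pl

    # Projection (choosing columns) can be handed to the reader...
    df = pl.read_csv("data.csv", columns=["a", "b"])

    # ...while selection (choosing rows) is expressed as a filter.
    df = df.filter(pl.col("a") > 0)

    # With the lazy API, both get pushed down into the scan automatically.
    df = (
        pl.scan_csv("data.csv")
          .select(["a", "b"])
          .filter(pl.col("a") > 0)
          .collect()
    )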


Hmmm .. in the linked benchmarks [1], DataFrames.jl (Julia library) appears to be fairly competitive.

[1] https://h2oai.github.io/db-benchmark/


There are so many dataframe libraries, many of which have APIs closely following pandas, but not drop-in replacements. I wish we could agree on a standard describing the core parts of what a dataframe must do, such that code depending only on those operations can easily move between dataframes.


This was my PhD focus. We identified a core "dataframe algebra"[1] that encompasses all of pandas (and R/S data.frames): a total of 16 operators that cover all 600+ operators of pandas. What you describe was exactly our aim. It turns out there are a lot of operators that are really easy to support and make fast, and that gets you about 60% or so of the way to supporting all of pandas. Then there are really complex operators that may alter the schema in a way that is undeterminable before the operation is carried out (think a row-wise or column-wise `df.apply`). The flexibility that pandas offers is something we were able to express mathematically, and with that math we can start to optimize the dataframe holistically, rather than chipping away at small parts of pandas that are embarrassingly parallel.

Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use-cases (which is not a bad thing). It can be frustrating for users who may have no idea what they can do with a given tool just because it is called "dataframe", but I don't know how to fix that.

[1] https://arxiv.org/pdf/2001.00888
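
A small pandas example of the data-dependent-schema problem described above (a hedged sketch, not taken from the paper):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3]})

    # The column names of the result depend on the data itself, so no
    # optimizer can know the output schema before running the lambda.
    wide = df.apply(
        lambda row: pd.Series({f"col_{i}": row.x * i for i in range(row.x)}),
        axis=1,
    )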


Awesome work, thanks!


This is really cool! Thx for sharing


Worse than that, pandas has a terrible API to start with. Going from the QueryVerse to Pandas feels like going back in time.


I've used pandas off and on for the better part of seven or eight years and, unlike other Python libraries, I always feel like I'm starting from scratch every time I begin a new project, following a tutorial/the official API docs on a second screen just to remember how to do fairly basic stuff.

The reason is that once I'm done building whatever model I've needed it works so well I don't have to touch it again for a few years and I forget everything I learned (or the API changes again).


There is an effort for this: https://github.com/data-apis/dataframe-api



In Julia there's something better, called Tables.jl. It's not exactly an API for dataframes (what would be the point of that? You don't need many implementations of dataframes, you just need one great one). Instead it's an API for table-shaped data. Dataframes are containers for table-shaped data.


https://github.com/austospumanto/minimal-pandas-api-for-pola...

pip install minimal-pandas-api-for-polars

I wrote a library that wraps polars DataFrame and Series objects to allow you to use them with the same syntax as with pandas DataFrame and Series objects. The goal is not to be a replacement for polars' objects and syntax, but rather to (1) Allow you to provide (wrapped) polars objects as arguments to existing functions in your codebase that expect pandas objects and (2) Allow you to continue writing code (especially EDA in notebooks) using the pandas syntax you know and (maybe) love while you're still learning the polars syntax, but with the underlying objects being all-polars. All methods of polars' objects are still available, allowing you to interweave pandas syntax and polars syntax when working with MppFrame and MppSeries objects.

Furthermore, the goal should always be to transition away from this library over time, as the LazyFrame optimizations offered by polars can never be fully taken advantage of when using pandas-based syntax (as far as I can tell). In the meantime, the code in this library has allowed me to transition my company's pandas-centric code to polars-centric code more quickly, which has led to significant speedups and memory savings even without being able to take full advantage of polars' lazy evaluation. To be clear, these gains have been observed both when working in notebooks in development and when deployed in production API backends / data pipelines.

I'm personally just adding methods to the MppFrame and MppSeries objects whenever I try to use pandas syntax and get AttributeErrors.
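
For anyone curious about the general shape of such a wrapper, a hypothetical and heavily simplified sketch of the idea (not the library's actual API):

    import polars as pl

    class PandasishFrame:
        """Hold a Polars DataFrame, translate a few pandas-style calls,
        and forward everything else to Polars."""

        def __init__(self, df: pl.DataFrame):
            self._df = df

        def rename(self, columns: dict) -> "PandasishFrame":
            # pandas: df.rename(columns={...})  ->  polars: df.rename({...})
            return PandasishFrame(self._df.rename(columns))

        def __getattr__(self, name):
            # Anything not translated yet falls through to the Polars object,
            # so native Polars syntax keeps working on the same wrapper.
            return getattr(self._df, name)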


Types for Tables was posted to HN last week:

https://news.ycombinator.com/item?id=29509439

They have a benchmark for expressiveness (as opposed to performance). Part of this inquiry has been to form a "standard library" of Dataframes operations.


I believe Codd took a stab at it a few years ago. He had some success, but didn't break into data science.


The guy who coined a new term every time he had a new product to sell?


That's SQL isn't it?


Well, relational algebra/calculus, but close enough.


God please anything to liberate me from pandas, which has one of the wildest APIs I've ever had to routinely work with.


Polars could bring the best of both worlds together if it could codegen Python API calls to their Rust equivalents. A user conducts ad-hoc analysis and rapid development in Python; when the work is ready to ship, the user invokes a codegen to transform it into the equivalent Rust API calls, resulting in a new Rust module.


I’ve been using it for the past quarter. In addition to the speed, I’m very pleased with the PySpark-esque API. This means migrating code from research to production is that much easier.


I'm confused. Polars is built on top of the Rust bindings for Apache Arrow. Arrow already has Python bindings. What does this project add by creating a new Python binding on top of the Rust binding?


Polars is not using Rust bindings for Arrow, it uses a Rust implementation called arrow2: https://github.com/pola-rs/polars/blob/master/polars/polars-...

Arrow2: https://lib.rs/crates/arrow2


… and it’s using arrow2, not the official, unsafe, arrow crate. Great, it means we can use it!


I'm reading all these comments and keep asking myself if I'm missing something, because I honestly sort of like pandas' API?

Sure dplyr is nice -- it felt that way on rare occasions that I got to use it, at least -- but you get used to anything.

So since I'm using Python and know it quite well, I'm just more comfortable sticking with Python's pandas framework rather than switching to R for data processing.


How does it compare to Vaex?


This question was asked the last time the author posted this, a few months ago. I’m surprised they didn’t update the benchmarks. Kind of makes me think Vaex is faster.


The benchmarks are hosted by H2O.ai, not by the polars team. Vaex is not in that benchmark.

I don't believe Vaex would be faster though. They aim at larger than RAM data processing, not maximum in-memory performance like we do.


That's the real question


What makes Pandas so bad and what makes Dplyr so great?

I have used Pandas a lot for data analysis and for data integration duct tape scenarios. For me it has been a low bar for achieving a lot.


If you use Pandas daily, maybe you get used to it and can ignore the issues, but for anyone using Pandas occasionally, it's a huge pain every time trying to figure out how to use it. The API is not intuitive and the documentation is very verbose and unclear. And the top stackoverflow answers are often the "old way" of doing something when yet another way of doing the same thing has been added to the API.


For some people pandas seems to click. Good for you. I always struggle with google and the manual to get even simple things done.

I can never figure out if I am gonna get a series or a data frame out of an operation. It seems to edit rows when I think it’ll edit columns and I constantly have to explicitly reset the index not to get into problems.
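
The classic example of that Series-vs-DataFrame guessing game:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

    df["a"]      # Series
    df[["a"]]    # DataFrame (note the double brackets)
    df.loc[0]    # also a Series -- but it's a row, not a column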

I think dplyr is easy to read and write. It does get longer than other alternatives, but the readability is imho so good at it doesn’t feel verbose.


it's just so bloated and verbose. many ways to do the same things, annoying defaults (how is column not the default axis to drop?), indices are beyond frustrating (have never met anyone who doesn't just reset them after a groupby), inconvenient to do custom aggregations, very slow, not opinionated enough

then there are the inherent python issues like dates and times, poor support for nonstandard evaluation, handling mixed data types and nulls


I could never use Pandas without SO and the documentation, and I've been using it for almost 10 years.

I have no idea what the developers' intention is most of the time.


Aha, so you're productive right?


I've never seen the term "dataframe" used as it is on this website, and the commenters here seem to all use it. Judging by the examples it seems to just refer to a "row" from e.g. a CSV or SQL query. So is that all it is, or am I missing something?


A data frame is one of the basic, built-in data structures in R, which was released in 1993. And R was based on an even older S.

So it’s not a new thing.

If you don’t work in computational statistics / data science it might not be a well known term, though.


A "dataframe" is a "table"


It's a column oriented data structure.


How would this compare to loading a sqlite database into memory and performing queries with it?


Polars would be 10-100x faster, but so would DuckDB!


Wow, that’s amazing. I’ll definitely try it out. Do you know if there is any built-in functionality related to data compression or data loaders?


Does anybody here know dataframe systems that are able to handle file sizes bigger than the available RAM? Is polars able to handle this? I am only aware of disk.frame (diskframe.com), but don't know how well it performs.


I believe Vaex can do this, in addition to GPU processing and reading direct from s3. https://github.com/vaexio/vaex


To you and all the other sibling comments: Thanks a lot! Exactly what I have been looking for!

With regard to Vaex, I would really be interested in an independent benchmark comparing it to dask, spark, data.table etc. However, I have seen in the comments that others also can't find that.


The H2O benchmarks cover dataframe operations:

https://h2oai.github.io/db-benchmark/

It has pandas, dask, Spark, data.table, Polars, etc. Sadly, Vaex is currently missing from this suite.



You either stream them, or use bigger VMs.
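
For the streaming route with Polars specifically, something like the following sketch (assuming a recent release with the streaming engine; the file name is made up and the flag spelling has changed between versions):

    import polars as pl

    # Process the file in chunks instead of loading it all into RAM.
    out = (
        pl.scan_csv("big.csv")               # hypothetical larger-than-RAM file
          .group_by("key")
          .agg(pl.col("value").sum())
          .collect(streaming=True)           # flag name varies by version
    )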


Apache Spark.


Vaex


spark dataframe api..


We've been thinking about trying this out at work for some of our data pipelines/simplified models. The speed/ergonomics look great.


Is there a plugin to use this as a visidata backend? I quite like their UX.


It's great to see innovation in this area.


I wouldn't really call it innovation, it's more just a project trying to bring to python something similar to the tidyverse from R.




It looks interesting but phrases like "embarrassingly parallel execution" make my marketing hype detectors trigger. Maybe they could tone down their self promotion just a touch. Also "Even though Polars is completely written in Rust (no runtime overhead!) ...". I find that hard to believe.


"Embarrassingly parallel" is a technical term, not a marketing term. https://en.wikipedia.org/wiki/Embarrassingly_parallel


It's a term for the nature of a problem, not a library or software package. It looks like they have designed the API so that "embarrassingly parallel" problems can naturally be computed using Polars. That would be fantastic, much better than Pandas. The way they write it sounds like marketing fluff to me and that's a shame because Polars looks like a useful thing.


“Embarrassingly parallel execution” means that it parallelizes (only) problems that are embarrassingly parallel. The meaning is clear — if you want to be really pedantic about it, problems are “parallelizable” and only execution is “parallel”, but “embarrassingly parallelizable” is too many syllables.


Why?

The benchmarks speak volumes.

https://h2oai.github.io/db-benchmark/


The benchmarks speak volumes of dishonesty.

They sorted the results by speed of 1st run. For a language like Julia, which is JIT-compiled, that's not a fair comparison, considering that you compile once and run millions of times.

Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.

EDIT: Also..... let's think critically about some of the entries there. Most of them are languages, but then you have things like Arrow, which is a data format, Spark, which is an engine, and ClickHouse and DuckDB, which are databases. The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is. They were built for different purposes. These are borderline meaningless comparisons.


> Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...).

Not true. If we'd rank them by second run Julia would be:

- On simple query: 1st, 1st, 4th, 1st, 5th (down 1).

- On advanced query: 3rd, 6th, 6th, 4th (up 1), - (out of memory).

> The databases (and spark) will have to read from disk. They have no chance of competing with anything that's reading from ram, no matter how slow it is.

Not true. Upon a quick peek at the benchmark code, ClickHouse and Spark use in-memory tables. I assume the other engines do too.


Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding the compilation) and the second run (with hot cache).

Also, in the second run, Julia is not the fastest. Julia would not be faster than Rust; it's got a garbage collector. This is what you see in the join benchmarks that really push the allocator.

Next to that, the databases run in in-memory mode, so there is no disk overhead. Spark is slower because of the JVM + row-wise data.


> Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding the compilation) and the second run (with hot cache).

Here's my view: the author of that page has commented here on HN; if my claim was as outrageously wrong as you say, he would've corrected it.


yeah, but your claim was "Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run"

notice this isn't even a language vs language benchmark. it's libraries and frameworks.

plus I don't think even the author of the julia library in question would agree with your statement: https://discourse.julialang.org/t/the-state-of-dataframes-jl...

as mentioned in that thread, GC and strings, or especially a combination of the two, can be very much a downer in terms of julia performance. That's actually pretty surprising since strings are often as important if not more important than numbers for a lot of data processing needs.

I'd also say in terms of compilation time, some autocaching layer outside of precompilation would do wonders.


> Julia would not be faster than Rust, its got a garbage collector.

Having a garbage collector does not intrinsically make things slower. Especially so outside of the benchmarking microcosm.


that said, Julia currently has a slow GC so it does hurt. GC performance is being worked on though. I have high hopes for a year or 2.


Agree .. and I was looking for an option to sort by second run.

One trick I've tried to some effect is to run the Julia code on a smaller data size so the compilation gets done, and then repeat on the large one so it doesn't get interrupted by compilation. Not sure if this is a recommended approach. Benchmarking Julia is a pain for this reason - compilation always gets mixed up with runtime. But it hasn't prevented me from using it interactively. Pretty happy with it actually.


>The benchmarks speak volumes of dishonesty.

Not really. They are designed to showcase a common use case across multiple technologies.

The beauty of this benchmark is that there is a hardware limit included so that it forces you to create novel solutions to perform well.

>Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.

Not sure where you're getting that, but even on the second run Julia doesn't really compete with DT/Polars.


the benchmarks are a bit out of date (missing DataFrames 1.2/1.3, Julia 1.7, CSV 0.9). I'm planning on running an updated version this weekend.


If you wouldn't mind, please update DuckDB as well!


Can you make a PR to https://github.com/oscardssmith/db-benchmark? I don't know DuckDB, so I don't know what the change would be.


It's obvious that you're promoting duck eggs at the expense of, say, chicken eggs or quail eggs or even ostrich eggs. Maybe you could tone that down a bit.


Julia doesn't really compete with anything, despite having some cool tech behind it.

It's like -- Julia is the Rory Gilmore of programming languages.


> considering that you compile once and run millions of times.

If you’re writing data pipelines then yes, but a lot of Pandas users use it interactively. As much as I’d rather use Julia, the last time I tried it I found myself waiting for computation far more often than with a Jupyter/Python workflow.


Give it another try. They've improved the first run times quite a bit over the last few versions. Package precompilation has gotten way better as well.


Glad to hear it, I will!


DataFrames 1.3 specifically is a lot faster.


Maybe you should hop on the website of duckdb before commenting...


The "embarrassingly parallel" claim is aimed at the expression API. It allows one to write multiple expressions, and all of them get executed in parallel. (So "embarrassingly" meaning they don't have to communicate and use locks.)
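
Concretely, something like this (a sketch; expression spellings per recent Polars releases):

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

    # The three expressions are independent, so the engine can evaluate
    # them on separate threads with no shared state or locks.
    out = df.select(
        pl.col("a").sum().alias("a_sum"),
        pl.col("b").mean().alias("b_mean"),
        (pl.col("a") * pl.col("b")).sum().alias("ab_dot"),
    )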


He is basically describing benefits of the Rust language, so it's perfectly credible.


How so? Does Rust have zero runtime overhead? I would find that hard to believe.


It’s a compiled optimized language. Along with C++, it’s one of the few languages to have essentially no runtime overhead.



