
I've been working on a data frame implementation for Python. I think API-wise we can do a lot better than Pandas. After having seen and used dplyr in R almost daily, I miss the convenience of a clear, consistent API and the chaining of operations whenever I have to use something else. I don't know yet if this project makes sense in terms of speed and corner-case handling. I haven't done any real-world work with it yet, but at least it's been a good learning project.

https://github.com/otsaloma/dataiter

https://github.com/otsaloma/dataiter/blob/master/dataiter/da...



As someone who uses pandas daily and detests it violently, godspeed.


One interesting use case for a Pandas replacement is AWS Lambda functions. If you have a skinnier package that can get 80% of the data-processing niceness whilst using up a smaller % of the Lambda function's size limit, this could come in very handy for many people.

Also agree that the dplyr syntax is cleaner.
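For comparison, here's a minimal sketch (with made-up toy data) of how the same filter-then-sort pipeline reads in pandas using method chaining. It works, but mixing string expressions (`.query`) with column names and keyword arguments is arguably less uniform than dplyr's verb-plus-pipe style:

```python
import pandas as pd

# Hypothetical toy data, loosely mirroring the vehicles example below.
df = pd.DataFrame({
    "make": ["Saab", "Volvo", "Saab"],
    "year": [1999, 2001, 1985],
    "hwy":  [23, 25, 19],
})

# pandas approximates pipeline style with method chaining.
result = (
    df.query("make == 'Saab'")   # filter rows with a string expression
      .sort_values("year")       # order by a column name
      .head(3)                   # keep the first 3 rows
)
```
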


Nice project! Not a quarantine project, but we've been building data frame abstractions in Python for genetics [1] [2]. We spent a lot of time studying the existing abstractions (pandas, R/dplyr, pyspark, etc.). Designing a data frame in Python is an interesting and challenging problem. Our design is far from perfect, but I think we've found an interesting design point. Here's your example in Hail:

  >>> vehicles = hl.import_table('vehicles.csv', impute=True, delimiter=',', quote='"')
  >>> t = vehicles.filter(vehicles.make == "Saab")
  >>> t = t.order_by(t.year)
  >>> t.show(3)
  +-------+--------+-------+-------+----------------+-------------------+---------------------+-------+----------+-----------+-------+-------+
  |    id | make   | model |  year | class          | trans             | drive               |   cyl |    displ | fuel      |   hwy |   cty |
  +-------+--------+-------+-------+----------------+-------------------+---------------------+-------+----------+-----------+-------+-------+
  | int32 | str    | str   | int32 | str            | str               | str                 | int32 |  float64 | str       | int32 | int32 |
  +-------+--------+-------+-------+----------------+-------------------+---------------------+-------+----------+-----------+-------+-------+
  |   380 | "Saab" | "900" |  1985 | "Compact Cars" | "Automatic 3-spd" | "Front-Wheel Drive" |     4 | 2.00e+00 | "Regular" |    19 |    16 |
  |   381 | "Saab" | "900" |  1985 | "Compact Cars" | "Automatic 3-spd" | "Front-Wheel Drive" |     4 | 2.00e+00 | "Regular" |    21 |    16 |
  |   382 | "Saab" | "900" |  1985 | "Compact Cars" | "Manual 5-spd"    | "Front-Wheel Drive" |     4 | 2.00e+00 | "Regular" |    23 |    17 |
  +-------+--------+-------+-------+----------------+-------------------+---------------------+-------+----------+-----------+-------+-------+
  showing top 3 rows
Hail's tables are functional. Operations like `filter` and `order_by` return new tables. That means it would be an error to use `vehicles.year` in the `order_by`, since the input and the sort expression refer to different tables. Unfortunately, this means you can't use `.` chaining.
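That binding of expressions to a specific table can be illustrated with a toy sketch. This is not Hail's implementation, just the general pattern: every operation returns a new table, field references remember which table they came from, and a stale reference can be rejected:

```python
# Toy sketch of functional-table semantics (not Hail's actual code).

class Expr:
    """A field reference bound to the table it was read from."""
    def __init__(self, table, name):
        self.table = table
        self.name = name

class Table:
    def __init__(self, rows):
        self.rows = rows

    def __getattr__(self, name):
        # Attribute access yields an expression bound to *this* table.
        return Expr(self, name)

    def order_by(self, expr):
        if expr.table is not self:
            raise ValueError("expression refers to a different table")
        return Table(sorted(self.rows, key=lambda r: r[expr.name]))

t = Table([{"year": 1999}, {"year": 1985}])
t2 = t.order_by(t.year)   # fine: t.year is bound to t
# t2.order_by(t.year) would raise, since t.year is bound to the old table
```
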

A little more background on the project: Hail's raison d'être is a 3-dimensional generalization of data frames we use for genetic data, called a MatrixTable [3]. Conceptually, it is a matrix-of-dicts rather than a list-of-dicts.
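A rough illustration of that distinction, using hypothetical variant/sample names (this is the concept only, not Hail's representation):

```python
# A data frame is conceptually a list-of-dicts: one dict per row.
table = [
    {"id": 380, "make": "Saab", "year": 1985},
    {"id": 381, "make": "Saab", "year": 1985},
]

# A MatrixTable is conceptually a matrix-of-dicts: each cell is a dict,
# indexed by both a row key and a column key (e.g. variant x sample).
matrix = {
    ("variant1", "sampleA"): {"genotype": "0/1", "depth": 32},
    ("variant1", "sampleB"): {"genotype": "1/1", "depth": 28},
    ("variant2", "sampleA"): {"genotype": "0/0", "depth": 30},
}
```
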

Genetic data is massive, so all of this is lazy and works on out-of-core data. The Python front end constructs an IR representing the query, which is fed through a query optimizer (written in Scala) and executed by a backend. We're working on multiple backends, but our primary backend right now is Spark.
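The lazy-IR idea can be sketched in a few lines. This is an assumption-laden toy, not Hail's actual IR or optimizer: operations only record a tree of nodes, and a trivial interpreter (standing in for the optimizer + backend) runs it on demand:

```python
# Toy sketch of a lazy query IR (not Hail's actual IR).

class Node:
    def __init__(self, op, child=None, **args):
        self.op, self.child, self.args = op, child, args

# Constructors build IR nodes; nothing executes here.
def scan(rows):          return Node("scan", rows=rows)
def filter_(node, pred): return Node("filter", node, pred=pred)
def order_by(node, key): return Node("order_by", node, key=key)

def execute(node):
    """Walk the IR tree and produce rows (the 'backend')."""
    if node.op == "scan":
        return list(node.args["rows"])
    rows = execute(node.child)
    if node.op == "filter":
        return [r for r in rows if node.args["pred"](r)]
    if node.op == "order_by":
        return sorted(rows, key=lambda r: r[node.args["key"]])

ir = order_by(
    filter_(scan([{"make": "Volvo", "year": 2001},
                  {"make": "Saab", "year": 1999},
                  {"make": "Saab", "year": 1985}]),
            lambda r: r["make"] == "Saab"),
    "year")
rows = execute(ir)   # nothing runs until this call
```

A real system would rewrite the tree (pushing filters down, fusing passes) before execution; here the point is only that building and running the query are separate steps.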

[1] https://hail.is/docs/0.2/index.html

[2] https://hail.is/docs/0.2/hail.Table.html

[3] https://hail.is/docs/0.2/hail.MatrixTable.html



