Interesting approach.
I'm currently not satisfied with Pandas, which seems to be the de facto tool for processing tables.
I find its query API really unnatural, especially for filtering.
Do you have any performance benchmarks? Is this aimed more at playing around in a notebook, or at use inside a full data processing pipeline?
Pandas is great, but I've had a few frustrating experiences -- dealing with Decimal and float columns is a pain (mixing the two in calculations can yield missing data without any warning).
However, that wasn't the reason I built convtools. I needed to process reports while touching only some columns (without failing if an unrelated column is no longer processable), so I needed to reuse and combine Python expressions across multiple procedures.
There are no benchmarks at the moment; you can pass debug=True to the gen_converter method to see the generated code and judge whether it's optimal for your use case.
This is a Python library which generates simple Python code:
- without unnecessary conditions and loops
- without keeping all items of an iterable in memory when aggregating (it leverages reducers)
- without using any C extensions.
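As a rough illustration of the reducer idea (a hypothetical sketch, not convtools' actual generated output), aggregation code of this style is a single-pass loop that updates reducer state instead of collecting all rows:

```python
# Hypothetical sketch of the kind of single-pass, reducer-based code
# such a generator might emit -- not convtools' actual output.
def aggregate_sum_count(items):
    # reducer state: a running sum and count; no list of rows is kept
    total = 0
    count = 0
    for item in items:
        total += item["value"]
        count += 1
    return {
        "sum": total,
        "count": count,
        "avg": total / count if count else None,
    }

# works on any iterable, including generators, in constant memory
rows = ({"value": v} for v in (1, 2, 3, 4))
print(aggregate_sum_count(rows))  # → {'sum': 10, 'count': 4, 'avg': 2.5}
```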
not op, but the API does not feel coherently designed; it has the same complete-but-hard-to-learn vibe as php's standard library.
there are no mypy stubs for the module, so ide autocomplete generally just doesn't work (and likely never will work properly, since the API's return types are often inconsistent depending on what you pass in)
The docs are detailed, but most of the meat is in long module-level manual pages, which are difficult to use as a quick reference.
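As one small example of input-dependent return types in pandas (indexing with a label vs. a list of labels):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Indexing with a single label returns a Series, but indexing with a
# one-element list of labels returns a DataFrame -- two different types
# from what looks like the same operation:
print(type(df["a"]))    # pandas Series
print(type(df[["a"]]))  # pandas DataFrame
```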
basically I have been using pandas for about a year now and I still hit around one multi-hour 'how do I do this seemingly basic operation' dive into Stack Overflow/GitHub/etc. per week.
the pandas code itself is very difficult to understand, since it's built around weird python metaprogramming/mixin patterns and needs to do a fair amount of optimised stuff in Cython anyway.
with that said, I've still been using pandas for a year and it lets me do my job, so hey, it's not all bad. designing a general-purpose API like this correctly the first time is probably impossible, and I'm really grateful for the work the pandas devs have put in.