Show HN: ConvTools – generates Python code of conversions, aggregations, joins (github.com/itechart-almakov)
67 points by westandskif on March 29, 2020 | 25 comments



Just my initial response.

Maybe find a better first example? This one is quite code-dense:

    conv = c.aggregate({
       "a": c.reduce(c.ReduceFuncs.Array, c.item("a")),
       "a_sum": c.reduce(c.ReduceFuncs.Sum, c.item("a")),
       "b": c.reduce(c.ReduceFuncs.ArrayDistinct, c.item("b")),
    }).gen_converter()
    conv(input_data)
when compared to a trivial native python equivalent:

    conv = lambda data: {'a': [el['a'] for el in data],
                         'a_sum': sum([el['a'] for el in data]),
                         'b': list(set([el['b'] for el in data]))}
    conv(input_data) 
which appears to have the same functionality. This is quite off-putting, and it took me a while to dig down and find why convtools offers more than just an extra abstraction layer to learn. Perhaps pick an example that shows off the non-trivial functions, like joins or group_by?


Just to add to my previous answer: the trivial native python equivalent doesn't have the same functionality, because it consumes the data iterator 3 times in your case, while convtools would consume it only once.
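To make the single-pass point concrete, a hand-written equivalent that consumes the iterator only once would look roughly like this (the function name and details are my own sketch, not library code):

```python
# Single-pass equivalent in plain Python: one loop, so it works even
# when `data` is a one-shot iterator.
def convert(data):
    a, a_sum, b_seen, b = [], 0, set(), []
    for el in data:
        a.append(el["a"])
        a_sum += el["a"]
        if el["b"] not in b_seen:  # distinct values, first-seen order
            b_seen.add(el["b"])
            b.append(el["b"])
    return {"a": a, "a_sum": a_sum, "b": b}
```

Note how much bookkeeping the single-pass version needs compared to the three-comprehension one-liner; this is the kind of code a generated converter can produce for you.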


Thank you! I will add join and group_by examples shortly


looks really cool! congrats, seems like a lot of work went into this, and code generation is always fun.

i do have some feedback about the readme though...

maybe i'm not the intended audience, but the examples didn't work great for me - the readme just shows the code without the inputs/outputs, so you kind of have to guess what it does. ("show me your tables, not your flowcharts" etc.)

i also think you should add some more basic examples / common tasks, e.g. converting AOS to SOA:

  [{'a': 5,  'b': 'foo'},
   {'a': 10, 'b': 'bar'} ]
    
  c.fun['stuff'](data) # look how concise!
  
  {'a': [5,     10   ],
   'b': ['foo', 'bar']}
and build up to more complex stuff from there, to help readers get a feel for the library.
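(The `c.fun['stuff']` call above is just a placeholder, not real convtools API. For reference, a plain-Python AOS-to-SOA conversion is short too, assuming non-empty rows with uniform keys:)

```python
# Array-of-structs -> struct-of-arrays, plain Python for comparison.
# Assumes rows is non-empty and every row has the same keys.
def aos_to_soa(rows):
    return {k: [row[k] for row in rows] for k in rows[0]}

aos_to_soa([{'a': 5, 'b': 'foo'}, {'a': 10, 'b': 'bar'}])
# -> {'a': [5, 10], 'b': ['foo', 'bar']}
```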

and i think the examples should be a bit higher up in the readme – ofc wanting to describe how cool the implementation is is natural :) but honestly, when i'm looking at a library like this, i want to be able to quickly assess whether it might be useful for me - the implementation is kind of secondary in most cases.

now, despite what i wrote above, i'd love to hear some stuff about the implementation :) as someone who also wrote a library that does runtime python codegen, what's your approach to that?


thank you very much for the feedback! this is very valuable! :)

Regarding the README, I'll improve it within the next few days.

As for the approach, the main assumption was that everything is simple as long as you deal with expressions only, so I've introduced every expression I needed as a conversion object (each able to generate its code within the context).

Exceptions are custom code generating parts (e.g. aggregate, reducers) and the part where I break down piped conversions into a series of statements in the top level converter.

Another tricky piece was supporting parametrization - e.g. c.input_arg here - https://convtools.readthedocs.io/en/latest/cheatsheet.html#c... So it was necessary to make every conversion know about every inner dependency it has, so that all dependencies pop up and the function signatures & parameters needed during internal generation of functions are known.


sounds interesting, i'll have a look at the code when i have time. i've mostly done compiler stuff like this with the "one function with a huge switch on the expression type" approach, curious to see what the more OOP-ish way looks like.

btw: wow, that cheatsheet is exactly what i had in mind on my first comment, that's the kind of stuff i'd like in a readme! maybe a few excerpts with a link to the whole thing.

some more remarks if you're interested:

---

in that cheatsheet it'd be cool to also show the generated code for each example, maybe in a collapsible box or sth – in that context the actual semantics of a convtools expression are useful to know.

---

have you thought about some magic syntactic sugar? the current approach is kind of visually heavy, since you're basically writing an AST by hand. with some __dunder__ hacking you could easily (?) add a "magic" api like

  from convtools.magic import magic as m
  
  c.item('key') ->
  m['key']
  
  c.call_method('foo', ...) ->
  m.meth.foo(...)
  
  c.call_function('bar', ...)  ->
  m.func.bar(...)
or something similar. it might be a bit too magical for some tastes, but if you're constructing a python expression, it kind of makes sense to use python syntax for that
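(To show the idea isn't far-fetched, here is a minimal sketch of such a proxy using `__getitem__` recording - the `Magic` class and its behavior are entirely my invention, not convtools API:)

```python
# Minimal "magic" proxy: item accesses are recorded into a path
# instead of being executed, then replayed against real data on call.
class Magic:
    def __init__(self, path=()):
        self.path = path

    def __getitem__(self, key):
        # record the access, return a new proxy (immutable chaining)
        return Magic(self.path + (("item", key),))

    def __call__(self, data):
        # replay the recorded path against actual input data
        for op, key in self.path:
            if op == "item":
                data = data[key]
        return data

m = Magic()
m["key"]({"key": 42})        # -> 42
m["a"]["b"]({"a": {"b": 1}})  # -> 1
```

A real implementation would record attribute accesses and calls the same way, and would generate code from the recorded path rather than interpreting it.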


oh, thanks, I'll add links to the cheatsheet and quickstart pages to the README, it really makes sense!

As for the magic stuff, I contemplated designing the API with this approach, but I changed my mind because it would be difficult to tell which python expressions are evaluated at the moment of a conversion definition AND which in the compiled code.

However if we imagine this "magic" API, then it could be even closer to normal python code:

  m["key"].some_method(...)
which would resolve everything under the hood.

===

as for the collapsible generated-code examples -- I've jotted it down :)


> it would be difficult to tell which python expressions are evaluated at the moment of a conversion definition AND which in the compiled code

this is ofc a valid concern in any metaprogramming situation. has this been a problem in your experience? i'm guessing e.g. generating a conversion based on a list of fields is a thing someone might do, but it feels like a minority usecase (at least to me, someone with no actual experience with using the library :p)


Re: whether it's been a problem in my experience -- sort of, yes.

So now I'm doing my best to observe: PEP-20 the 2nd commandment with the hope that I'm not violating the 1st commandment badly :) https://www.python.org/dev/peps/pep-0020/

Also I see another upside of this no-magic syntax: it is distinctive -- there's no way to mix up convtools-related code with any other python code.


In re: the "magic", it seems to me that you could use a mock object and get the required info from it.


In terms of the implementation, it's kind of trivial to implement the "magic", but it would be both confusing and inconsistent; see below.

e.g. imagine a case where you'd want to call datetime.strptime, partially initializing it at the moment of conversion definition. at the moment it is:

  c.call_func(
      datetime.strptime,
      c.item("updated"),
      "%Y-%m-%d"
  )
but it's unclear to me how the "magic" approach would deal with the case above.


Replace your current namespace with a mock object? You can eval or exec code with any dictionary-flavored object, and mocks can imitate dicts.

edit: no, you can't in Python 3.

I'll try to have a closer look later today.
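For what it's worth, the Python 3 restriction is asymmetric: eval/exec require globals to be a real dict, but accept any mapping as locals, so a dict subclass can still intercept name lookups in that slot. A minimal sketch (my own illustration, not anything convtools does):

```python
# A dict subclass used as the `locals` mapping of eval() can intercept
# every name lookup via __missing__.
class Recorder(dict):
    def __missing__(self, name):
        # called by dict.__getitem__ on subclasses when a key is absent
        return f"<looked up {name}>"

# globals must be a plain dict, but locals may be any mapping:
result = eval("some_unknown_name", {}, Recorder())
# result == "<looked up some_unknown_name>"
```

This works because name resolution inside eval checks the locals mapping first, and for non-exact-dict mappings CPython goes through the regular subscript protocol.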


What do you think about a DSL alternative?


We would need to bring the whole python into it :/


looks really cool -- will evaluate when the API is stable -- would hate to adopt it only to have it change from under me


Interesting approach. I'm currently not satisfied with Pandas, which seems to be the de facto tool for processing tables. I find its query API really unnatural, especially for filtering.

Do you have any performance benchmarks? Is this more aimed at playing around in a notebook, or at use inside a full data-processing pipeline?


Pandas is great, but I had a few frustrating experiences -- dealing with Decimal & float columns is a pain (data goes missing without any sign when using both in calculations).

However, this was not the reason I built convtools: I needed to process reports, touching only some columns (without failing if an unrelated column is no longer processable). So I needed to reuse and combine python expressions across multiple procedures.

There are no benchmarks at the moment; you can pass debug=True to the gen_converter method to see the generated code and judge whether it's optimal for your use case. This is a python library which generates simple python code:

  - without unnecessary conditions and loops

  - without keeping all items of an iterable in memory to aggregate (it leverages reducers)

  - making no use of C-extensions
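The reducer idea can be illustrated in plain Python: the example below (my own sketch, mirroring the group-by-sum example shown elsewhere in this thread) keeps only one accumulator per group rather than materializing all rows.

```python
# Single-pass group-by-sum: holds one running sum per group key
# instead of keeping every row in memory.
def group_by_sum(rows):
    sums = {}
    for key, value in rows:
        sums[key] = sums.get(key, 0) + value
    return [{k: v} for k, v in sums.items()]

group_by_sum([(0, 1), (0, 2), (0, 3), (1, 10), (1, 12)])
# -> [{0: 6}, {1: 22}]
```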


What doesn't satisfy you in Pandas?


not op, but the API does not feel coherently designed; it has the same complete-but-hard-to-learn vibe as php's standard library.

there are no mypy module stubs, so ide autocomplete generally just doesn't work (and likely never will work properly, as the API is often inconsistent in its return types depending on what you pass to it)

The docs are detailed, but most of the meat is in great long module-level manual pages, which are difficult to use as a quick reference. basically I have been using pandas for about a year now and I still hit around one multi-hour 'how do I do this seemingly basic operation' dive into stack overflow/GitHub/etc per week.

pandas code itself is very difficult to understand, since it is based around weird python metaprogramming mixin patterns and needs to do a fair amount of optimised stuff in cython anyway.

with that said, I have still been using Pandas for a year and it lets me do my job, so hey, it's not all bad. designing a general-purpose api like this correctly the first time is probably impossible, and I'm really grateful for the work the pandas devs have achieved.


Pretty awesome.

What about errors at conversion time? Is there any help for that or do you just get the raw traceback of the generated code?


On exception it populates linecache, so tracebacks are normal and you can debug with pdb post-mortem debugging - https://docs.python.org/3/library/pdb.html#pdb.post_mortem
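The linecache trick is a common pattern for runtime codegen; a minimal sketch of the general idea (convtools' actual internals may differ) looks like this:

```python
import linecache

# Generated source under a fake filename.
source = "def converter(data):\n    return data['a']\n"
filename = "<generated:converter_1>"
code = compile(source, filename, "exec")

# Register the source in linecache so tracebacks and pdb can display
# the generated lines; entry format is (size, mtime, lines, fullname).
linecache.cache[filename] = (len(source), None, source.splitlines(True), filename)

ns = {}
exec(code, ns)
ns["converter"]({"a": 1})  # -> 1
# If converter() raises, the traceback now shows the generated source line.
```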

Nothing else at the moment, but I've written this point down to contemplate in the near future -- thank you!


Ah, that's cool. Cheers!


Looks neat! Would you ever consider adding a mode that skips code generation?


It's not possible to skip the code generation part, because a resulting converter is always compiled from the code written under the hood. Could you please share your concern about this? I'd really like to understand it better!

JFYI: it's possible to skip running the "gen_converter" method and just use "execute" -- it runs "gen_converter" under the hood:

  c.group_by(
      c.item(0)
  ).aggregate({
      c.item(0): c.reduce(c.ReduceFuncs.Sum, c.item(1))
  }).execute([
      (0, 1), (0, 2), (0, 3),
      (1, 10), (1, 12),
  ])

  Out[5]: [{0: 6}, {1: 22}]
The downside is that you won't be reusing the converter.


Looks promising!



