The idea was to focus on querying tools. ujson and orjson (as well as the json m...

davidatbu · on April 15, 2022

> resulting in large programs with lots of boilerplate

That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.

Thank you for this benchmark! I'll probably switch to spyql now from jq.

> So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.

Yes, I definitely think this is worth mentioning upfront in the future, since, IIUC, orjson's core uses Rust (the serde library, specifically). The initial title gave me the impression that a pure-Python json parsing-and-querying solution was the fastest out there.

A parallel I think is helpful to think about is saying something like "the fastest BERT implementation is written in Python[0]". While the linked implementation is written in Python, it offloads the performance-critical parts to C/C++ through TensorFlow.

I'm not sure how such claims advance our understanding of the tradeoffs of programming languages. I initially thought I was going to have to change my impression that "python is not a good tool to implement fast parsing/querying", but I haven't, so I do think the title is a bit misleading.

[0] https://github.com/google-research/bert

dmoura · on April 15, 2022

Thank you for your feedback! I understand your point of view, let me share mine.

spyql is 100% Python code and it is not a thin layer over something else. Every row of data goes through a query engine built in python that takes care of evaluating the query, filtering and aggregating data, among other stuff. The only part that is offloaded to standard or external modules is the decoding and encoding from/to specific data formats. In the case of this benchmark, spyql uses the orjson module to convert each input json object into a python dict, one at a time.

Due to the nature of Python as an interpreted language, it is natural that Python modules leverage C (or Rust) to provide highly efficient implementations of core functionalities. For instance, the json module of the standard library is implemented in C. If we used the json module in the benchmark instead of orjson, spyql would remain one of the fastest and lightest tools for querying json data. Using orjson gives an extra boost of performance.

If you think it is worthwhile, I can add another benchmark entry where spyql uses the standard json lib. The queries would be exactly the same, I just need to use in the query `FROM json` instead of `FROM orjson`.
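The per-row flow described above can be sketched roughly as follows. This is a minimal illustration, not spyql's actual code: the `query_total` function, the `size` field, and the sample rows are hypothetical. It only shows the split between a swappable decoder (orjson if installed, the standard json module otherwise) and filtering/aggregation done in plain Python.

```python
import io
import json

try:
    import orjson  # optional third-party decoder (Rust-backed)
    loads = orjson.loads
except ImportError:
    loads = json.loads  # standard-library fallback

def query_total(lines, min_size):
    """Decode one JSON object per line, then filter and aggregate in Python."""
    count = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = loads(line)  # one dict per input row
        size = row.get("size", 0)
        if size >= min_size:  # the "WHERE" part, evaluated in Python
            count += 1
            total += size  # the aggregation, also in Python
    return count, total

# Hypothetical sample input: one JSON object per line.
sample = io.StringIO(
    '{"name": "a", "size": 10}\n'
    '{"name": "b", "size": 250}\n'
    '{"name": "c", "size": 400}\n'
)
result = query_total(sample, min_size=100)
print(result)
```

Swapping the decoder changes only the `loads` binding; the loop that does the querying stays the same either way.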

davidatbu · on April 15, 2022

I honestly totally agree with your POV now. I don't think another benchmark entry would be worthwhile either.

Thank you again for making this tool!

jammycrisp · on April 15, 2022

> I should mention that spyql leverages orjson, which has a considerable impact on performance

Even with orjson, you're still paying the cost of creating a new PyObject for every node in the JSON blob. orjson is well engineered (as is the backing serde-json decoder), but any JSON decoder that isn't using naive algorithms is mostly bound by the cost of creating PyObjects. Allocating in Python is _slow_.

I wrote a quick benchmark (https://gist.github.com/jcrist/de29815389eaed4eaf5b24fbcfdab...) showing a handwritten query that accesses only a few fields in a 13 MiB JSON file. The same query is repeated with a number of different Python JSON libraries. Results:

    $ python bench_repodata_query.py 
    msgspec: 45.018014032393694 ms
    simdjson: 61.94157397840172 ms
    orjson: 105.34720402210951 ms
    ujson: 121.9699690118432 ms
    json: 113.79130696877837 ms

While `orjson` is faster than `ujson`/`json` here, it's only ~6% faster (in this benchmark). `simdjson` and `msgspec` (my library, see https://jcristharif.com/msgspec/) are much faster due to them avoiding creating PyObjects for fields that are never used.
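For anyone who wants to try a comparison of this kind locally, a rough harness might look like the sketch below. It is not the linked gist: the document is synthetic, the `packages`/`name`/`size` layout is made up for illustration, and only libraries that happen to be installed are timed, so numbers will differ from the ones above.

```python
import importlib
import json
import time

def available_decoders(names=("json", "ujson", "orjson")):
    """Return {module_name: loads} for each module that can be imported."""
    found = {}
    for name in names:
        try:
            found[name] = importlib.import_module(name).loads
        except ImportError:
            pass  # library not installed; skip it
    return found

def time_query(loads, raw, repeats=5):
    """Best-of-N time (in ms) to decode `raw` and read two fields per entry."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        doc = loads(raw)
        _ = [(p["name"], p["size"]) for p in doc["packages"]]
        best = min(best, time.perf_counter() - start)
    return best * 1000

# Synthetic input (hypothetical layout), much smaller than a real 13 MiB file.
raw = json.dumps(
    {"packages": [{"name": f"pkg{i}", "size": i, "depends": ["python"] * 5}
                  for i in range(1000)]}
)

timings = {name: time_query(loads, raw)
           for name, loads in available_decoders().items()}
for name, ms in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ms:.3f} ms")
```

Each of these decoders builds the full object tree, including the unused `depends` lists, which is the cost discussed above.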

If spyql's query engine can determine the fields it will access statically before processing, you might find using `msgspec` for JSON gives a nice speedup (it'll also type check the JSON if you know the type of each field). If this information isn't known though, you may find using `pysimdjson` (https://pysimdjson.tkte.ch/) gives an easy speed boost, as it should be more of a drop-in for `orjson`.
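As a rough sketch of the statically-known-fields case: declaring only the fields a query needs lets `msgspec` skip the rest of the document. The `Package` type, its fields, and the sample document are hypothetical, and the plain `json` fallback exists only so the snippet still runs when `msgspec` is not installed.

```python
import json

try:
    import msgspec  # third-party; assumed installed for the fast path
except ImportError:
    msgspec = None

# Hypothetical document with more fields than the query needs.
DOC = (b'{"name": "numpy", "version": "1.22.3", '
       b'"depends": ["python"], "size": 123, "md5": "abc"}')

if msgspec is not None:
    class Package(msgspec.Struct):
        # Only the fields the query reads; unknown fields are skipped
        # by default, so no Python objects are created for them.
        name: str
        size: int

    decoder = msgspec.json.Decoder(Package)

    def read_fields(raw):
        pkg = decoder.decode(raw)  # also validates the declared types
        return pkg.name, pkg.size
else:
    def read_fields(raw):
        obj = json.loads(raw)  # builds every field, used or not
        return obj["name"], obj["size"]

print(read_fields(DOC))
```

The declared types double as validation: with `msgspec` installed, a document whose `size` is not an integer is rejected at decode time.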

the_duke · on April 15, 2022

When a Python tool is unexpectedly fast, the answer is almost always: because the expensive part is implemented externally in a low-level language.