The idea was to focus on querying tools. ujson and orjson (as well as the json module from Python's standard library) offer JSON encoding and decoding but not a query language: you need to implement the query logic in Python, resulting in large programs with lots of boilerplate. Still, I agree that Pandas is an outlier... it was included due to its popularity for querying datasets.
I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library and orjson as JSON decoder/encoder. Performance-wise, for 1GB of input data, orjson reduces processing time by 20-30%. So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.
> resulting in large programs with lots of boilerplate
That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.
Thank you for this benchmark! I'll probably switch to spyql now from jq.
> So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.
Yes, I definitely think this is worth mentioning upfront in the future, since, IIUC, orjson's core uses Rust (the serde library, specifically). The initial title gave me the impression that a pure-Python JSON parsing-and-querying solution was the fastest out there.
A parallel I find helpful is a claim like "the fastest BERT implementation is written in Python[0]". While the linked implementation is written in Python, it offloads the performance-critical parts to C/C++ through TensorFlow.
I'm not sure how such claims advance our understanding of the tradeoffs of programming languages. I initially thought I was going to change my mind about my impression that "python is not a good tool to implement fast parsing/querying", but I haven't, so I do think the title is a bit misleading.
Thank you for your feedback! I understand your point of view, let me share mine.
spyql is 100% Python code and it is not a thin layer over something else. Every row of data goes through a query engine built in python that takes care of evaluating the query, filtering and aggregating data, among other stuff. The only part that is offloaded to standard or external modules is the decoding and encoding from/to specific data formats. In the case of this benchmark, spyql uses the orjson module to convert each input json object into a python dict, one at a time.
Due to the nature of Python as an interpreted language, it is natural that Python modules leverage C (or Rust) to provide highly efficient implementations of core functionality. For instance, the json module of the standard library is implemented in C. If we used the json module in the benchmark instead of orjson, spyql would remain one of the fastest and lightest tools for querying JSON data. Using orjson gives an extra performance boost.
If you think it is worthwhile, I can add another benchmark entry where spyql uses the standard json lib. The queries would be exactly the same, I just need to use in the query `FROM json` instead of `FROM orjson`.
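To make the row-at-a-time flow above concrete, here is a minimal sketch. It is not spyql's actual code: `count_by`, the sample rows, and the field names are hypothetical, and the only assumption carried over from the thread is that the decoder (orjson or the standard json module) is swappable while filtering and aggregation stay in plain Python.

```python
import io
import json

try:
    import orjson  # optional third-party decoder (Rust core)
    decode = orjson.loads
except ImportError:
    decode = json.loads  # standard-library fallback (C accelerated)


def count_by(lines, key, predicate):
    """Decode one JSON object per line, filter it, and count rows per key value."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = decode(line)  # each row becomes a regular Python dict
        if predicate(row):
            counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


# Hypothetical JSON-lines input, one object per line.
sample = io.StringIO(
    '{"lang": "python", "stars": 10}\n'
    '{"lang": "go", "stars": 3}\n'
    '{"lang": "python", "stars": 7}\n'
)
result = count_by(sample, "lang", lambda r: r["stars"] > 5)
print(result)
```

Only the `decode` call leaves Python here; the loop, the predicate, and the aggregation are interpreted, which is the part a query engine like the one described would own.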
> I should mention that spyql leverages orjson, which has a considerable impact on performance
Even with orjson, you're still paying the cost of creating a new PyObject for every node in the JSON blob. orjson is well engineered (as is the backing serde-json decoder), but any JSON decoder that isn't using naive algorithms is mostly bound by the cost of creating PyObjects. Allocating in Python is _slow_.
I wrote a quick benchmark (https://gist.github.com/jcrist/de29815389eaed4eaf5b24fbcfdab...) showing a handwritten query that accesses only a few fields in a 13 MiB JSON file. The same query is repeated with a number of different Python JSON libraries. Results:
$ python bench_repodata_query.py
msgspec: 45.018014032393694 ms
simdjson: 61.94157397840172 ms
orjson: 105.34720402210951 ms
ujson: 121.9699690118432 ms
json: 113.79130696877837 ms
While `orjson` is faster than `ujson`/`json` here, it's only ~6% faster (in this benchmark). `simdjson` and `msgspec` (my library, see https://jcristharif.com/msgspec/) are much faster because they avoid creating PyObjects for fields that are never used.
If spyql's query engine can determine the fields it will access statically before processing, you might find using `msgspec` for JSON gives a nice speedup (it'll also type check the JSON if you know the type of each field). If this information isn't known though, you may find using `pysimdjson` (https://pysimdjson.tkte.ch/) gives an easy speed boost, as it should be more of a drop-in for `orjson`.
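As a rough illustration of the "declare the fields up front" idea, the sketch below decodes into a `msgspec.Struct` that lists only the fields the query touches. The `Package` type, the sample record, and its field names are made up for this example and are not taken from the gist; the fallback branch exists only so the snippet still runs where `msgspec` is not installed.

```python
import json

# Hypothetical record: the query only needs "name" and "size".
RAW = b'{"name": "numpy", "size": 123, "depends": ["python", "libblas"], "md5": "abc"}'

try:
    import msgspec

    class Package(msgspec.Struct):
        # Only the fields the query needs; other keys in the JSON are skipped.
        name: str
        size: int

    def load(raw):
        # Typed decode: validates name/size and ignores unknown fields.
        pkg = msgspec.json.decode(raw, type=Package)
        return pkg.name, pkg.size

except ImportError:

    def load(raw):
        # Fallback: builds the full dict, including the unused fields.
        obj = json.loads(raw)
        return obj["name"], obj["size"]


name, size = load(RAW)
print(name, size)
```

With the typed path, the `depends` list and `md5` string never become Python objects, which is where the savings described above would come from.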