Dammit, I re-read the paper and I was wrong, you were right. But I can't delete/change my comment, so please downvote away. (I hate wrong information on the internet, and to be the source of it is a horrible feeling. Sorry.)
So the big wins come when only a couple of fields need to be read and the majority of the data can be ignored. They've even integrated this into Spark. It'd be really nice to see the code released!
"A key challenge for achieving these features is to jump directly to the correct position of a queried field with-out having to perform expensive tokenizing steps to find the field" - I think jsoniter does that (or something similar) http://jsoniter.com
SIMD optimization is actually not as easy as it sounds.
You can use SIMD to greatly speed up scanning for ", for example, but with JSON that alone is not enough, because there could be an embedded " in the string, escaped as \".
And if you find a ", checking whether the single previous character is a \ is not enough either, because it could be \\" (an escaped backslash at the end of the string). You have to count the whole run of backslashes before the quote: an odd count means the quote is escaped, an even count means it really ends the string.
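Concretely, the candidate-then-verify shape looks something like this (a rough Rust sketch of the idea, not code from the paper or from any real parser; x86_64 assumed, where SSE2 is part of the baseline):

```rust
// Find the closing quote of a JSON string with SSE2, then resolve
// escapes with a backslash-run parity check. x86_64 only.
use std::arch::x86_64::*;

/// Index of the real (unescaped) closing quote in `s`, where `s`
/// starts just after the opening quote.
fn find_string_end(s: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i + 16 <= s.len() {
        // Compare 16 bytes at once against '"'; movemask packs the
        // per-byte results into a 16-bit mask.
        let mut mask = unsafe {
            let chunk = _mm_loadu_si128(s.as_ptr().add(i) as *const __m128i);
            let hits = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(b'"' as i8));
            _mm_movemask_epi8(hits) as u32
        };
        // Every set bit is only a CANDIDATE quote; each one still
        // needs the escape check described above.
        while mask != 0 {
            let pos = i + mask.trailing_zeros() as usize;
            if !is_escaped(s, pos) {
                return Some(pos);
            }
            mask &= mask - 1; // clear lowest bit, try the next candidate
        }
        i += 16;
    }
    // Scalar fallback for the tail of the buffer.
    (i..s.len()).find(|&p| s[p] == b'"' && !is_escaped(s, p))
}

/// Escaped iff preceded by an ODD run of backslashes: \" is escaped,
/// \\" is a real closing quote after an escaped backslash.
fn is_escaped(s: &[u8], pos: usize) -> bool {
    s[..pos].iter().rev().take_while(|&&b| b == b'\\').count() % 2 == 1
}

fn main() {
    let s = br#"he said \"hi\\" and left"#; // bytes after the opening quote
    println!("{:?}", find_string_end(s)); // Some(14): the quote after \\
}
```

A production parser would want to vectorize the escape check itself too (long backslash runs make the scalar look-behind a worst case), which is exactly where it stops being easy.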
The main takeaway of this, to me, is: if you design a language like JSON, make the grammar easily parsable. Escape " as \22 for example, or =22, or basically anything not containing the escaped character itself; then you can use SIMD to look for the end of the string very efficiently.
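Under that hypothetical escaping scheme the whole candidate loop and parity check above disappear; the first " byte in the buffer is always the real end of the string:

```rust
// Hypothetical grammar where '"' is escaped as \22 and so can never
// appear raw inside a string: the first '"' byte is always the end.
// A real implementation would do this single pass with a SIMD memchr.
fn find_string_end_simple(s: &[u8]) -> Option<usize> {
    s.iter().position(|&b| b == b'"')
}
```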
My main takeaway is that if you care about performance that much, just ETL the data into Avro, Thrift, Protobuf, HDF5, NetCDF, Parquet, Arrow, anything but plain text.
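The ETL itself can be a one-time loop like this (a minimal sketch using the serde_json and bincode crates, with bincode standing in for any of the binary formats above; the Record struct and its fields are invented for illustration):

```rust
// One-time ETL: pay the JSON parsing cost once, write compact binary
// records, and never tokenize text again on the hot path.
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

// Made-up record shape; substitute whatever your logs actually hold.
#[derive(Serialize, Deserialize)]
struct Record {
    user_id: u64,
    event: String,
    ts: i64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input = BufReader::new(File::open("events.jsonl")?);
    let mut out = BufWriter::new(File::create("events.bin")?);
    for line in input.lines() {
        let rec: Record = serde_json::from_str(&line?)?; // pay this once
        out.write_all(&bincode::serialize(&rec)?)?;      // cheap ever after
    }
    Ok(())
}
```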
When I was grinding away on a 128B-edge graph on my laptop, literally 85% of the time was spent parsing integers. Get your data into native binary as soon as you can (edit: or out of plain text at least, per parent).
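Concretely, the difference is between a per-number parse and a straight byte copy (a toy Rust comparison; the data is invented, but the shape of the two loops is the point):

```rust
// Text vs. binary integer reading. On big edge lists the parse loop
// on the left is the 85% mentioned above; the binary loop does no
// parsing at all, just fixed-width little-endian reads.
use std::io::{Cursor, Read};

fn main() {
    let n = 1_000_000u32;

    // Text path: one string parse per number.
    let text: String = (0..n).map(|i| format!("{i}\n")).collect();
    let sum_text: u64 = text.lines().map(|l| l.parse::<u64>().unwrap()).sum();

    // Binary path: raw little-endian u32s, four bytes each.
    let mut bin = Vec::with_capacity(n as usize * 4);
    for i in 0..n {
        bin.extend_from_slice(&i.to_le_bytes());
    }
    let mut rdr = Cursor::new(&bin);
    let mut buf = [0u8; 4];
    let mut sum_bin = 0u64;
    while rdr.read_exact(&mut buf).is_ok() {
        sum_bin += u32::from_le_bytes(buf) as u64;
    }
    assert_eq!(sum_text, sum_bin);
}
```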
Yep. Once I was working with JSON logs of 50GB and up, converting them to Parquet (plus Snappy compression) was really liberating. It also led me to Apache Drill, one of the best on-prem data exploration tools IMO.
https://github.com/pikkr/pikkr