Dammit, I re-read the paper and I was wrong, you were right. But I can't delete/change my comment, so please downvote away. (I hate wrong information on the internet, and to be the source of it is a horrible feeling. Sorry.)
So the big wins come when only a couple of fields need to be read and the majority of the data can be ignored. They've even integrated this into Spark. It'd be really nice to see the code released!
"A key challenge for achieving these features is to jump directly to the correct position of a queried field with-out having to perform expensive tokenizing steps to find the field" - I think jsoniter does that (or something similar) http://jsoniter.com
SIMD optimization is actually not as easy as it sounds.
You can use SIMD to greatly speed up scanning for ", for example, but with JSON that alone is not enough, because there could be an embedded " in the string, escaped as \".
And if you find a ", checking whether the single previous character is a \ is not enough either, because it could be \\" (an escaped backslash at the end of the string). You have to count the whole run of backslashes before the quote: an odd count means the quote is escaped, an even count means it really ends the string.
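Concretely, the candidate-then-verify shape looks something like this (a rough Rust sketch of the idea, not code from the paper or from any real parser; x86_64 assumed, where SSE2 is part of the baseline):

```rust
// Find the closing quote of a JSON string with SSE2, then resolve
// escapes with a backslash-run parity check. x86_64 only.
use std::arch::x86_64::*;

/// Index of the real (unescaped) closing quote in `s`, where `s`
/// starts just after the opening quote.
fn find_string_end(s: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i + 16 <= s.len() {
        // Compare 16 bytes at once against '"'; movemask packs the
        // per-byte results into a 16-bit mask.
        let mut mask = unsafe {
            let chunk = _mm_loadu_si128(s.as_ptr().add(i) as *const __m128i);
            let hits = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(b'"' as i8));
            _mm_movemask_epi8(hits) as u32
        };
        // Every set bit is only a CANDIDATE quote; each one still
        // needs the escape check described above.
        while mask != 0 {
            let pos = i + mask.trailing_zeros() as usize;
            if !is_escaped(s, pos) {
                return Some(pos);
            }
            mask &= mask - 1; // clear lowest bit, try the next candidate
        }
        i += 16;
    }
    // Scalar fallback for the tail of the buffer.
    (i..s.len()).find(|&p| s[p] == b'"' && !is_escaped(s, p))
}

/// Escaped iff preceded by an ODD run of backslashes: \" is escaped,
/// \\" is a real closing quote after an escaped backslash.
fn is_escaped(s: &[u8], pos: usize) -> bool {
    s[..pos].iter().rev().take_while(|&&b| b == b'\\').count() % 2 == 1
}

fn main() {
    let s = br#"he said \"hi\\" and left"#; // bytes after the opening quote
    println!("{:?}", find_string_end(s)); // Some(14): the quote after \\
}
```

A production parser would want to vectorize the escape check itself too (long backslash runs make the scalar look-behind a worst case), which is exactly where it stops being easy.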
The main takeaway of this, to me, is: if you design a language like JSON, make the grammar easily parsable. Escape " as \22 for example, or =22, or basically anything not containing the escaped character itself; then you can use SIMD to look for the end of the string very efficiently.
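Under that hypothetical escaping scheme the whole candidate loop and parity check above disappear; the first " byte in the buffer is always the real end of the string:

```rust
// Hypothetical grammar where '"' is escaped as \22 and so can never
// appear raw inside a string: the first '"' byte is always the end.
// A real implementation would do this single pass with a SIMD memchr.
fn find_string_end_simple(s: &[u8]) -> Option<usize> {
    s.iter().position(|&b| b == b'"')
}
```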
My main takeaway is that if you care about performance that much, just ETL the data into Avro, Thrift, Protobuf, HDF5, NetCDF, Parquet, Arrow, anything but plain text.
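The ETL itself can be a one-time loop like this (a minimal sketch using the serde_json and bincode crates, with bincode standing in for any of the binary formats above; the Record struct and its fields are invented for illustration):

```rust
// One-time ETL: pay the JSON parsing cost once, write compact binary
// records, and never tokenize text again on the hot path.
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

// Made-up record shape; substitute whatever your logs actually hold.
#[derive(Serialize, Deserialize)]
struct Record {
    user_id: u64,
    event: String,
    ts: i64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input = BufReader::new(File::open("events.jsonl")?);
    let mut out = BufWriter::new(File::create("events.bin")?);
    for line in input.lines() {
        let rec: Record = serde_json::from_str(&line?)?; // pay this once
        out.write_all(&bincode::serialize(&rec)?)?;      // cheap ever after
    }
    Ok(())
}
```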
When I was grinding away on a 128B-edge graph on my laptop, literally 85% of the time was spent parsing integers. Get your data into native binary as soon as you can (edit: or out of plain text at least, per parent).
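Concretely, the difference is between a per-number parse and a straight byte copy (a toy Rust comparison; the data is invented, but the shape of the two loops is the point):

```rust
// Text vs. binary integer reading. On big edge lists the parse loop
// on the left is the 85% mentioned above; the binary loop does no
// parsing at all, just fixed-width little-endian reads.
use std::io::{Cursor, Read};

fn main() {
    let n = 1_000_000u32;

    // Text path: one string parse per number.
    let text: String = (0..n).map(|i| format!("{i}\n")).collect();
    let sum_text: u64 = text.lines().map(|l| l.parse::<u64>().unwrap()).sum();

    // Binary path: raw little-endian u32s, four bytes each.
    let mut bin = Vec::with_capacity(n as usize * 4);
    for i in 0..n {
        bin.extend_from_slice(&i.to_le_bytes());
    }
    let mut rdr = Cursor::new(&bin);
    let mut buf = [0u8; 4];
    let mut sum_bin = 0u64;
    while rdr.read_exact(&mut buf).is_ok() {
        sum_bin += u32::from_le_bytes(buf) as u64;
    }
    assert_eq!(sum_text, sum_bin);
}
```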
Yep. Once I was working with JSON logs of 50GB and up, converting them to Parquet (plus Snappy compression) was really liberating. It also led me to Apache Drill, one of the best on-prem data exploration tools IMO.
https://github.com/pikkr/pikkr