Mison: A Fast JSON Parser for Data Analytics [pdf] (vldb.org)
103 points by fanf2 on Sept 19, 2017 | 18 comments



Someone has already implemented it in Rust.

https://github.com/pikkr/pikkr


Looks like this is just a library so far. It'd be nice to see something that could compete with q using this.


Do you mean the tool 'jq'?


Whoops, yep.



They quite clearly state later in the paper that Jackson is the fastest.


But it's not. It's just the default/most popular one.

It's upwards of 2x slower than the fastest Java one, as clearly shown in the links.


I only read the paper quickly, but Jackson was really, really fast. Note that their charts measure throughput, so a larger bar is better.


Dammit, I re-read the paper and I was wrong; you were right. But I can't delete/change my comment, so please downvote away. (I hate wrong information on the internet, and being the source of it is a horrible feeling. Sorry.)


So the big wins come when only a couple of fields need to be read and the majority of the data can be ignored. They've even integrated this into Spark. It'd be really nice to see the code released!


> It'd be really nice to see the code released!

I'm surprised the conference accepted the paper without the source code being made public.


"A key challenge for achieving these features is to jump directly to the correct position of a queried field with-out having to perform expensive tokenizing steps to find the field" - I think jsoniter does that (or something similar) http://jsoniter.com


SIMD optimization is actually not as easy as it sounds. You can use SIMD to greatly speed up looking for ", for example, but with JSON that is not enough, because there could be an embedded " in the string, escaped as \". And if you find a ", checking whether the previous character is a \ is not enough either, because it could be \\" (an escaped backslash at the end of the string).
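
Roughly, the scalar check being described looks like this (a sketch only, not Mison's or pikkr's actual code; the function name is made up, and a real parser would only run this backslash count on candidate quotes already located with SIMD):

    /// Find the index of the closing, unescaped '"' in `bytes`, scanning from `from`.
    /// A quote is escaped iff the run of backslashes immediately before it has odd
    /// length: \" is escaped, \\" is not (the backslashes escape each other).
    fn find_closing_quote(bytes: &[u8], from: usize) -> Option<usize> {
        let mut i = from;
        while i < bytes.len() {
            if bytes[i] == b'"' {
                let mut backslashes = 0;
                while backslashes < i - from && bytes[i - 1 - backslashes] == b'\\' {
                    backslashes += 1;
                }
                if backslashes % 2 == 0 {
                    // Even number of preceding backslashes: this is a real closing quote.
                    return Some(i);
                }
            }
            i += 1;
        }
        None
    }

    fn main() {
        // String content with two escaped quotes, then a closing quote preceded by \\.
        let s = br#"say \"hi\" and a trailing backslash \\" rest"#;
        // Skips the two escaped quotes and reports the closing quote at offset 38.
        println!("closing quote at {:?}", find_closing_quote(s, 0));
    }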

The main takeaway of this, to me, is: if you design a language like JSON, make the grammar easily parsable. Escape " as \22, for example, or =22, or basically anything not containing the escaped character; then you can use SIMD to look for the end of the string very efficiently.
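
Under a grammar like that, where the '"' byte can never occur inside a string, the closing quote is simply the next '"' byte, so a single SIMD-friendly byte search is enough. A minimal sketch (the function name is made up; memchr here is the Rust crate of that name, which vectorizes single-byte search):

    use memchr::memchr;

    // With no escaped quotes possible, the end of the string is just the next '"'.
    fn find_closing_quote_simple(bytes: &[u8], from: usize) -> Option<usize> {
        memchr(b'"', &bytes[from..]).map(|off| from + off)
    }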


My main takeaway is that if you care about performance so much, just ETL the data to Avro, Thrift, Protobuf, HDF5, NetCDF, Parquet, Arrow, anything but plain text.


When I was grinding away on the 128B-edge graph on my laptop, literally 85% of the time was spent parsing integers. Get your data into native binary as soon as you can (edit: or out of plain text at least, per parent).
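
For illustration only (the file layout and function names are made up, not my actual pipeline): dumping the edge list as fixed-width little-endian integers means later loads are plain byte copies, with no integer parsing at all.

    use std::fs::File;
    use std::io::{BufWriter, Read, Write};

    // Write (src, dst) pairs as fixed-width little-endian u64s instead of text.
    fn write_edges(path: &str, edges: &[(u64, u64)]) -> std::io::Result<()> {
        let mut w = BufWriter::new(File::create(path)?);
        for &(src, dst) in edges {
            w.write_all(&src.to_le_bytes())?;
            w.write_all(&dst.to_le_bytes())?;
        }
        w.flush()
    }

    // Reading back just reinterprets bytes; no text-to-integer conversion.
    fn read_edges(path: &str) -> std::io::Result<Vec<(u64, u64)>> {
        let mut buf = Vec::new();
        File::open(path)?.read_to_end(&mut buf)?;
        let mut edges = Vec::with_capacity(buf.len() / 16);
        for chunk in buf.chunks_exact(16) {
            let mut src = [0u8; 8];
            let mut dst = [0u8; 8];
            src.copy_from_slice(&chunk[0..8]);
            dst.copy_from_slice(&chunk[8..16]);
            edges.push((u64::from_le_bytes(src), u64::from_le_bytes(dst)));
        }
        Ok(edges)
    }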


Yep. After working with JSON logs from 50GB up, converting them to Parquet (plus Snappy compression) was really liberating. This also led me to Apache Drill, one of the best on-prem data exploration tools IMO.


Reminds me of the fastest JSON parser written in D[0].

[0] http://forum.dlang.org/thread/20151014090114.60780ad6@marco-...


I am surprised they don't mention succinct trees in the references. Looks like this was invented independently.



