My thoughts on MessagePack

ynniv · on June 11, 2012

cheald has an excellent comment regarding MessagePack, JSON, and Protocol Buffers in the post from 15 hours ago: http://news.ycombinator.com/item?id=4091051

   MessagePack is not smaller than gzip'd JSON
   MessagePack is not faster than JSON when a browser is involved

Other comments from that post include:

   MessagePack is not human readable
   MessagePack has issues with UTF-8
   Small packets also have significant TCP/IP overhead

Really, anyone who hasn't read the other comments should: http://news.ycombinator.com/item?id=4090831

frsyuki · on June 11, 2012

Although the original blog post focuses on JavaScript and browsers, MessagePack itself doesn't mainly focus on them.

A major use case of MessagePack is to store serialized objects in memcached. A blog post written by Pinterest describes this use case (http://engineering.pinterest.com/posts/2012/memcache-games/). They use MessagePack with Python which is faster than one with JavaScript. They could store more objects in a server without performance declination (e.g. gzip).

It's true that MessagePack is not always faster than JSON (e.g. within browsers), and it's not always smaller than other serialization methods (e.g. with gzip compression). So we should consider that which serialization methods should I use for "my" case.

There are also general tendency which is helpful to select MessagePack or JSON:

    MessagePack is faster to serialize binary data such as thumbnail images.
    MessagePack is better to reduce overheads to exchange small objects between servers.
    JSON is better to use it with browsers.

alexgartrell · on June 11, 2012

> They could store more objects in a server without performance declination (e.g. gzip).

The performance declination argument is bullshit. Network's a million [0] times slower than gzip.

Truth be told, once you're on the network, you're already screwed w.r.t. most serialization. The only thing efficient compression/decompression is going to buy you is lower CPU (memcached servers run at like 2% CPU util, even under heavy load [1]).

Memcache at Facebook actually uses the ascii protocol, and the memcached implementation is a braindead strtok parser (some of our other stuff uses ragel -- you'll have a hell of a time out optimizing ragel with the right compiler flags -- I've tried and failed).

Just use whichever serialization format has the best API, because I can say with near certainty that it's not going to be a perf problem for you if you're touching disk, network, etc.

[0] Obviously a made up number, but it's way slower. Especially if you're unlucky and lose a packet or something.

[1] With the exception of weird kernel spin lock contention issues, which can happen if you're not sharding your UDP packets well and trying to reply from 8+ cores on 1 UDP socket. You probably aren't.

catwell · on June 11, 2012

I +1 that. I have working experience with MessagePack and I can confirm it works for the following use cases:

* RPC communication between servers where binary data is exchanged and its structure is not always the same (ie. difficult to use something that requires an IDL).

* Serialization and storage of objects that will be sent over the network (note: you can batch MessagePack objects just by concatenating them).

* Communication between a server and a native mobile application. Native applications live in a binary world whereas Web applications live in a text-based world where JSON is better.

The human readability argument is poor: the JSON that is sent over a network is not usually human readable, so you would use a prettyfier to read it anyway. Moreover a MessagePack message is standalone / self-describing, ie. you don't need an IDL description to read it. So in both case, reading the message is just adding another block to a pipeline...

jamesaguilar · on June 11, 2012

That test you linked to which claims that messagepack is 4x faster seems to rely on the serialized text staying in-process. The vrefbuffer is only zero-copy as long as you don't need to send it to any API which reads strings or char buffers (e.g. any RPC or network-oriented mechanism). Am I reading it right?

latch · on June 11, 2012

I've used it to store blobs of data in databases, specifically when space tends to directly equate to memory (like redis and mongodb). You can take this pretty far and apply different serialization or compression algorithms based on the data (and store a field that says which approach was used for when you deserialize it).

Using this from the browser is not the first thing that came to my mind.

kenneth · on June 11, 2012

Storing a message-pack serialized object in MongoDB is silly, because Mongo is essentially storying BSON-serialized object on disk. You're double-serializing data in two competing formats.

latch · on June 11, 2012

BSON isn't meant to be compact, it's meant to be quick and efficient to serialize and deserialize. You store it inside MongoDB as bindata, and you'll save space.

frsyuki · on June 11, 2012

MessagePack includes a concept named "type conversion" to support types which are not supported by its wire format. With the concept, we can serialize/deserialize user-defined classes as well as strings with encodings.

So far, MessagePack for Java, C++ and D implement the concept.

shykes · on June 11, 2012

Zerorpc (http://github.com/dotcloud/zerorpc-python) uses msgpack as its primary serialization format. Among other things it is significantly more efficient for floating point data. At dotCloud we use it to move many millions of system metrics per day, so it adds up. Also worth noting: msgpack maps perfectly to json, so there's nothing keeping you from translating it to json on its way to the browser, where json is indeed a better fit. In practice this doesn't affect performance since you typically need to parse and alter the message at the browser boundary anyway, for some sort of access control.

frsyuki · on June 11, 2012

We're using MessagePack in a Rails application to log user behaviors and analyze them.

Compared with other serialization libraries such as Protocol Buffers, Avro or BSON, one of the advantages of MessagePack is compatibility with JSON. (In spite of its name, BSON has special types which cause incompatibility with JSON)

It means we can exchange objects sent from browsers (in JSON format) between servers written in different languages without losing information.

I will not use MessagePack with browsers but it's still useful to use it with web applications.

rkalla · on June 11, 2012

If JSON compatibility is an issue, have you looked at UBJSON? http://ubjson.org/

May be a bit bigger than msgpack but is damn-near human readable even in its binary format and really easy to encode/decode. Also 1:1 compatibility with JSON.

Compatibility and simplicity were the core design tenantes. It may not be the right choice, just throwing it out there incase it helps.

Disclaimer: I am the author of the spec.

chmike · on June 14, 2012

I looked at it. It's design process is not completed. One strong negative point is that it enforces big endian integer encoding.

Another one is that it doesn't use the value space of tags as efficiently as message pack. I would use the unused space to encode small string size in the tag since objects (associative arrays) have generally many short identifier strings as keys.

I sent these as comments and requests for change but didn't receive any response yet. I don't know how open its design process is.

kiyoto · on June 11, 2012

Sounds cool. I would take a look. I think what this space (serializers) needs is objective/holistic evaluations of pros and cons of different approaches. (disclaimer: I am involved with MessagePack, although not a committer of any of its drivers).

dkhenry · on June 11, 2012

As much as we would like to jump back on the MessagePack is no JSON alternative here, I would like to commend The author on taking the criticism posted earlier like a mature adult and explaining his point of view. Even admitting that some of the benchmarks might have been misleading.

kenneth · on June 11, 2012

I've personally been using BSON[1] as a binary alternative to JSON, and it's worked out great. I've written an Objective-C wrapper around the C lib, in case anybody's interested. Every other language has a solid implementation from 10gen (the MongoDB guys). It's a solid format with a clear spec that's extensible and fully supports UTF-8.

[1]: http://bsonspec.org/

stock_toaster · on June 11, 2012

I am kind of partial to tnetstring passed through lzf right now.

samsoffes · on June 11, 2012

I think it's hilarious that 2 of the 3 comments on the gist are about how he formatted his Markdown. sigh

I think MessagePack is great. Good for him for trying to make something better. He openly admits it's not better all the time. Why not help?

TheRevoltingX · on June 11, 2012

Very interesting discussion. I work on a 2D MMORPG for Android. This is extremely relevant to me. I have a few questions though.

What if you take compression and deserialization out of the picture? For example, in my server I have a hash like data structure that gets turned into JSON for browsers and byte array for mobile clients.

For example, because the data has to be transferred at fast rates and will be going over mobile networks. The size of the packet matters because every millisecond counts.

Then to read the data, I simply read the stream of bytes and build the objects I need on the client. This has to happen mostly without allocations for example on Android to avoid the GC.

So a few questions: Does deserializing JSON cause any memory allocations? If you're not tokenizing the data and don't need to parse it, will it be a significant gain over s serialized byte protocol or JSON?

In any case, I'll experiment on my end and perhaps blog about my own findings.

frsyuki · on June 11, 2012

Disclaimer: I'm authoer of MessagePack for C++/Ruby and committer of one for Java.

As for strings, JSON has to allocate memory and copy to deserialize strings because strings are escaped.

MessagePack does't have to allocate/copy because the serialized format of strings is same as the format in memory. But it depends on the implementation whether actually it doesn't allocate/copy.

C++ and Ruby implementations try to suppress allocation and copying (zero-copy). But Java implementation doesn't support zero-copy feature so far (we have plan to do so. Here is "TODO" comment: https://github.com/msgpack/msgpack-java/blob/master/src/main...).

As for the other types, C++ implementation (and new Ruby implementation which is under development) has memory pool which optimizes those memory allocations petterns. But it's hard to implement such optimizations for Java because JVM (and Dalvik VM) doesn't allow to hook object allocation.

TheRevoltingX · on June 11, 2012

Interesting, thanks so much for the response. I'll keep this in mind as I continue to develop my app. It's still between custom byte stream, json, thrift, etc.

But MsgPack looks interesting as well and, if anything, these blog posts have brought it into the light for me.

I looked at the java class and what might help is if you can set a buffer size and use that buffer to store the data in the buffer and expand it if necessary. But that seems like a lot of work. But yeah, not sure if you can optimize based on usage patterns due to the constraint you said. In any case, great stuff and thanks for the info.

keithseahus · on June 11, 2012

The best way of serialization at present. I like the thought, performance and various implementations.

twelvechairs · on June 11, 2012

As nice as Message Pack might be, 'The best way of serialization' I'm not sure is a very helpful statement. It can't be 'the best' because 'the best' depends on the specifics of what you are doing. As noted in the OP: "...its pros and cons should be carefully considered, and there are many situations where it simply does not offer enough advantage...".

I, for instance, am still using the much-less-cool yaml, because I need to reference the same object at multiple points within the same serialization. JSON and (AFAIK) msgpack just dont do that, so in this case there is simply no argument. It took me far too much playing around to figure this out, because the internet is full of "JSON > yaml" and similar broad statements, and very few plain descriptions of what the actual different use cases for each type of serialization might be.

stavros · on June 11, 2012

Isn't JSON a subset of YAML? So isn't it, quite literally, that YAML > JSON?

TylerE · on June 11, 2012

Only if you don't consider having a totally compliant implementations for pretty much every platform a feature. YAML is a big hairy beast.

barrkel · on June 11, 2012

"Best" is a subjective value judgement here. Difference scenarios call for different strategies.

Once upon a time, I had to write a serializer that was expected to be run hundreds or thousands of times a second, once for every request in an AJAX app. The solution I chose was to not serialize at all; always keep the data in a binary blob (a byte array) and write a facade of ephemeral objects on top of it, essentially containing nothing more than their offsets into the blob. Because the ratio of read/write operations to serialization operations was so low, this made a lot more sense than building an object graph, only to throw it away a millisecond or two later.