I worked on a product inside Google which used protos (v1) as the data format to a web front end, and in practice that system was a failure, in part due to the decision to use protos. The deserialization cost of protocol buffers is too high if you're doing complex data throughput; even though the data size is smaller, it's better to send larger gzipped JSON, which will be decompressed in native code and deserialized into JS objects also in native code. We weren't using ProtoBuf.js but our own internal JavaScript implementation of a similar library, and doing all of this in JS was too expensive. Granted, we were sending around protos that had multi-megabyte payloads at times.
We eventually rewrote our app to send protos in JSON format to the front end, while letting our backends still pass around native protos, and it worked a lot better.
Things have changed a lot since your experience, I think. For one, a different encoding called "JSPB" has become the de facto standard for doing Protocol Buffers in JavaScript, at least inside Google. JSPB is parseable with JSON.parse(), so it avoids the speed issues you experienced.
And looking forward, JavaScript parsing of protobuf binary format has gotten a lot faster, thanks in large part to newer JavaScript technologies like TypedArray. Ideally JSPB would be deprecated as a wire format in favor of fast JavaScript parsing of binary protobufs, but this would of course be contingent on the performance being acceptable.
I'm not sure that TypedArray will help that much. For web apps, most of the data is strings and at some point you have to deserialize the strings so that regular JavaScript code can work with them (rather than asm.js code which would work with the bytes directly).
The proof of concept would be to send an array of strings as bytes in a TypedArray, deserialize it to an array of JavaScript strings using JavaScript (not native code), and show that this is about as fast as doing the same thing using JSON.parse(). It seems likely that JSON.parse() will have an easier time creating all those JavaScript strings and other objects at once from native code.
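A rough sketch of that benchmark in TypeScript (the length-prefixed framing here is invented for illustration, not real protobuf framing):

    // Decode length-prefixed UTF-8 strings out of a TypedArray in pure JS.
    function decodeStrings(buf: Uint8Array): string[] {
      const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
      const decoder = new TextDecoder("utf-8");
      const out: string[] = [];
      let pos = 0;
      while (pos < buf.byteLength) {
        const len = view.getUint32(pos, true); // little-endian length prefix
        pos += 4;
        out.push(decoder.decode(buf.subarray(pos, pos + len)));
        pos += len;
      }
      return out;
    }

    // Baseline: the same payload shipped as JSON and parsed natively.
    function decodeStringsJson(json: string): string[] {
      return JSON.parse(json) as string[];
    }

Timing both over a few megabytes of strings (e.g. with performance.now()) would settle the question.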
What benefits do I get from ProtoBuf, apart from the standard binary wire format?
JSON is just more popular as a serialization format. It doesn't matter what programming language or OS I am on, there is almost always a built-in library that de/serializes JSON at reasonable speed. To send the JSON objects around from one service to another, I can just gzip the string if it's big, or send a plain UTF-8 string if it's not.
ProtoBuf has to provide more value for people like me to switch. I would rather try out Apache Avro first as a replacement for what I am doing right now.
In my opinion, the biggest benefit from using protobuf is that the schema exists in a .proto file. This can be used to provide all sorts of conveniences.
With a plain JSON-based API, you copy and paste field names out of sample code or the documentation. If you spell a field name wrong, there will be no error on the client. If you're lucky, the server might error out because it didn't recognize the property name, but it also might not. If you send an integer when the server was expecting a string, the server might automatically convert or it might not.
With protobuf, the schema is explicit in a .proto file. That means that the client library can tell you, at the precise moment that you say msg.misspledFieldName, that the field name doesn't exist. Or if you try to put an integer in there instead of a string, it can tell you about that too. Basically it makes for a tighter feedback loop, which is almost always better.
In statically-typed languages like C++ or Java, the schema can be used to generate static types too, so it's actually a compile-time error when you misspell a field name.
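As a sketch of what that looks like in practice, here is a hypothetical .proto message and the sort of TypeScript shape a codegen plugin might emit (names are invented for illustration; real generated code varies by plugin):

    // Hypothetical message, as it might appear in search.proto:
    //   message SearchRequest {
    //     string query = 1;
    //     int32 page_number = 2;
    //   }
    //
    // A generated TypeScript interface makes misspellings a compile-time error:
    interface SearchRequest {
      query: string;
      pageNumber: number;
    }

    const req: SearchRequest = { query: "protobuf", pageNumber: 1 };
    // req.misspledFieldName = "oops";  // compile error: property does not exist
    // req.pageNumber = "one";          // compile error: string is not a number

    // With hand-written JSON there is nothing to catch either mistake
    // until (maybe) the server rejects the request.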
> It doesn't matter what programming language or OS I am on, there is almost always a built-in library that de/serializes JSON at reasonable speed.
XML isn't painful because it has a schema; XML is painful because it wasn't really designed for RPC, so getting to feature parity with something like Protocol Buffers takes a whole stack of XML technologies and a huge mess of complexity.
Protocol Buffers were designed from the ground up for RPC, and as a result are far simpler and more convenient to use than XML. Seriously, nobody who uses Protocol Buffers compares them to XML, because it's not even a comparison.
The encoding for protobufs is significantly more compact than JSON. If you're logging or persistently storing your data on the server, this can cut down your storage and bandwidth costs significantly, particularly if you're operating at Google scale.
Haberman also mentioned the schema benefits.
All that said, I'm using JSON for my current startup. I view them as optimizing for different parts of the product's lifecycle: JSON lets you quickly adapt the protocol and switch out different languages for different services when you're figuring out what product to build, while Protobuf saves you money when you're trying to scale it. I'm also pretty intrigued by Cap'n Proto as a high-performance serialization format, since it fixes a lot of the problems we faced using protobufs at scale at Google, but its language support just isn't up to protobuf/JSON yet, and the protocol is quite complicated.
I use Protobufs for my startup and it has saved us an incredible amount of time building out iOS, Android, and Web clients. With a small team, any time we can shave by not having to re-write the modeling layer in all of these languages is a big win. As the writer of the APIs, I publish the new Protobuf models/services and then can switch over and instantly start working with real objects in Swift or Java.
Coming from a larger startup, I've also experienced the pains of trying to maintain JSON objects between different services. Protobufs have some quirks, but I think it's a great solution to get behind at any stage.
This will allow you to switch between JSON and protobuf binary on the wire easily, while using official protobuf client libraries. So you can choose easily whether you care more about size/speed efficiency or wire readability. Best of both worlds!
I work on the protobuf team at Google and would be happy to answer any questions.
I'm really bummed that you got rid of required fields in pb3. Now every consumer has to write additional code to verify that their required fields are actually available, and the proto spec is barely useful as an actual interpretable spec -- you have to specify requirements purely in comments.
On top of which, you've defined built-in default values for empty fields; this means that, without warning, an accidentally missing field will inject bad data into any consumer that doesn't carefully check for the existence of all required fields.
These are basically killer issues for us; we're not going to adopt an "update" that requires us to write JSON-style "hey, does this field exist?" code everywhere.
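To make the default-value hazard concrete, here is a minimal sketch with a hypothetical message; the field names and the validation style are invented for illustration:

    // Hypothetical proto3 message:
    //   message Order {
    //     int64 order_id = 1;  // logically required, but proto3 can't enforce it
    //     double amount = 2;
    //   }
    //
    // Sketch of the hand-rolled validation proto3 pushes onto every consumer,
    // using a plain object in place of a generated class:
    interface Order { orderId: number; amount: number; }

    function validateOrder(o: Order): void {
      // A missing order_id deserializes as 0 -- indistinguishable from a real 0,
      // so the "is it actually set?" check has to live in application code.
      if (o.orderId === 0) {
        throw new Error("order_id missing (or legitimately 0 -- can't tell)");
      }
    }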
Co-author of proto3 here. The reasons we chose lowerCamelCase are compatibility with Google's REST APIs (like the Gmail API) and readability for users who work with JSON output directly without a client library. API designers should not define confusing data schemas, let alone allow collisions. Most proto messages have a small number of fields, so avoiding name collisions is trivial for an API designer.
I think this may be to maintain style conventions with JavaScript and previous versions of Google Cloud APIs which were all based in lowerCamelCase (not 100% sure though, don't quote me on this).
A benefit of this decision is that if you create JSON manually in JavaScript (ie. without a protobuf library) your JSON objects will match JavaScript conventions.
I'm not super familiar with the Java implementation, but I believe it's pretty optimized and appears to do very well generally on that benchmark.
One unavoidable issue is that, unlike JSON, protobuf serializers have to do two passes over the message tree, because in protobuf binary format all submessages are prefixed by their length. The first pass just calculates lengths, while the second performs the actual serialization. This could potentially slow down serialization compared to JSON, especially for message trees with lots of nodes/depth.
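A stripped-down sketch of the two-pass idea (illustrative only; real protobuf serializers deal with varint sizing, wire types, and so on, all of which is elided here):

    // Pass 1: compute the encoded size of every submessage (needed because
    // each submessage is written as <tag><length><payload>).
    // Pass 2: walk the tree again and emit bytes, using the cached sizes.
    interface Node { fields: number[]; children: Node[]; cachedSize?: number; }

    function computeSize(n: Node): number {
      let size = n.fields.length;          // pretend each scalar field is 1 byte
      for (const c of n.children) {
        size += 2 + computeSize(c);        // pretend tag + length each fit in 1 byte
      }
      n.cachedSize = size;
      return size;
    }

    function serialize(n: Node, out: number[]): void {
      for (const f of n.fields) out.push(f);
      for (const c of n.children) {
        out.push(0x0a, c.cachedSize!);     // tag, then length cached in pass 1
        serialize(c, out);
      }
    }

    // JSON, by contrast, can stream the tree in a single pass, because nothing
    // needs to know a child's byte length before writing it.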
Interesting idea, and could be interesting to experiment with.
There are a lot of practical challenges though -- to provide a useful API you'd have to reverse the string at the end, since socket APIs don't generally provide streaming "WriteReverse" functions, and for good reason, because it would force them to buffer arbitrary amounts of data.
So we'd have to reverse the entire string at the end. The question is whether this would be cheaper than doing a second pass over the message tree. And also keep in mind that you would need the first pass to decode everything -- including UTF-8 data -- in reverse. But since the UTF-8 APIs for Java strings probably don't support this, you'd probably have to encode it, then reverse it to put it in the encoding buffer. That way when it gets reversed again at the end, it would be proper UTF-8.
At the end of the day, this probably wouldn't end up faster than what we do now. But can't say for sure without trying!
Reversing the output didn't come up when I did this sort of thing in C (for a different format): you can fill a buffer from the end, then just write the tail. But I guess Java does stick you with that? And I didn't have to deal with encoding to UTF-8 (ouch), or whatever other complications you might face, like streaming. Oh well! Hope the suggestion was stimulating anyway.
It's possible to encode a protobuf as JSON and we do it all the time at Google. In browsers, native JSON parsing is very fast and the data is compressed, so going to a binary format doesn't seem worthwhile. The .proto file is used basically as an IDL from which we generate code.
Personally I've found JSON encoded protobufs to be almost universally awful.
The most common method is to use an array indexed by the field number. I've seen protobufs with hundreds of fields so that's hundreds of nulls as the string "null".
The alternative is to have JSON objects with attributes named after the protobuf field names. This isn't without warts either and seems to be less prevalent in my experience.
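To make the two styles concrete, here is a hypothetical three-field message rendered both ways (field numbers and names invented; the exact index convention varies between JSPB flavors):

    // message Person { string name = 1; int32 age = 2; string email = 9; }

    // Style 1: array indexed by field number (JSPB-like). Gaps between field
    // numbers show up as filler entries:
    const asArray = '["Alice",30,null,null,null,null,null,null,"a@example.com"]';

    // Style 2: object keyed by field name (proto3 JSON-like):
    const asObject = '{"name":"Alice","age":30,"email":"a@example.com"}';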
Another problem is JavaScript doesn't support all the data types you can get in protobufs, most notably int64s.
Protobufs are relatively space efficient (eg variable width int types). JSON encoded protobufs much less so.
Perhaps the rise of browser support for raw binary data will make this less awful.
Many consider it a virtue to use the same code on the client and server. It explains things like this and GWT. Personally I think this is horribly misguided and a fool's errand. You want to decouple your client and server as much as possible (IMHO).
> The most common method is to use an array indexed by the field number. I've seen protobufs with hundreds of fields so that's hundreds of nulls as the string "null".
What you are describing here is known as the "JSPB" wire format. This is a serialization that is only ever used for JavaScript, and only used there because, historically, parsing binary protobufs in JavaScript was too slow. With TypedArray and other JavaScript enhancements, this is changing. Ideally, JSPB wire format would be phased out completely.
> The alternative is to have JSON objects with attributes named after the protobuf field names. This isn't without warts either and seems to be less prevalent in my experience.
proto3's JSON is an improvement on ASCII protobufs, but since it uses field names, it doesn't have the same backward-compatibility guarantees as a format that uses tag numbers.
It would be nice if we had a standardized JSPB wire format that used tag numbers, rather than the various unofficial implementations we have now.
There's no standard for this and it's mostly not open source as far as I know. The overall approach is called "JSPB" but there are various flavors. A typical use case is for a web app that has its own private RPC to its own servers, so interop isn't an issue. Also, web apps generally don't need or want to work with binary data, so better not to send it.
I recently became the maintainer of the Dart protobuf library which supports both JSON and binary format [1], [2]. However, the JSON format isn't necessarily compatible with other protobuf libraries you've seen.
Just a guess here: I would have an agreed-upon key name suffix like "__b64_enc" or something. Serializers take a field "foo" of type bytes and serialize it as "foo__b64_enc": b64(value), and deserializers strip and base64-decode.
edit: it's exactly what you would do if you wanted to pass any binary data as json over the wire, regardless of whether you're using protobufs. you'd just get it "for free" (meaning you wouldn't have to write the boilerplate, not that you don't have to en/decode).
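A sketch of that convention in TypeScript, using Node's Buffer for the base64 step (the suffix and helper names are just the ones guessed at above, nothing standardized):

    // Serialize: a bytes field "foo" becomes "foo__b64_enc": base64(value).
    function bytesFieldToJson(obj: Record<string, unknown>, field: string, value: Uint8Array): void {
      obj[field + "__b64_enc"] = Buffer.from(value).toString("base64");
    }

    // Deserialize: strip the suffix and decode back to bytes.
    function jsonToBytesField(obj: Record<string, unknown>, key: string): [string, Uint8Array] | null {
      if (!key.endsWith("__b64_enc")) return null;
      const field = key.slice(0, -"__b64_enc".length);
      return [field, Uint8Array.from(Buffer.from(obj[key] as string, "base64"))];
    }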
I like to use Protobuf in my server code, but then support JSON _or_ Protobuf as the encoding. So browsers can continue to use JSON, but the server gets strongly-typed Protobuf structures.
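A sketch of how that negotiation might look with an Express-style handler; the serializeBinary()/toObject() names follow the google-protobuf style of generated code, but check your own codegen's API:

    import express from "express";

    // `reply` stands in for a protobuf-generated message object.
    function sendReply(
      req: express.Request,
      res: express.Response,
      reply: { serializeBinary(): Uint8Array; toObject(): object }
    ): void {
      if (req.accepts("application/x-protobuf")) {
        // Binary on the wire for clients that ask for it (size/speed).
        res.type("application/x-protobuf").send(Buffer.from(reply.serializeBinary()));
      } else {
        // Plain JSON for browsers, and for readability in the network tab.
        res.json(reply.toObject());
      }
    }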
Oh cool, didn't know there was a new version of protocol buffers. I ended up choosing Thrift for my current project due to wider language support, but I have been frustrated with some of the limitations of the IDL (primarily no recursive data structures, so no generic storage of JSON-like objects).
Another one of the goals of proto3 is to increase language support a lot. Just this year we have added Ruby, Objective C, and C#, and keep your eyes peeled for more. proto3 is still alpha, but this is the direction things are going.
proto3 is especially designed to be paired with gRPC, which is also in alpha but also going for wide language support: http://www.grpc.io/
If you are using a statically typed language, binary formats like Protobuf are a big win, but if you are going to have the dynamic-language overhead that comes with JS, there isn't much gain to be had from binary formats.
One thing, where protobuf (at least protobuf-net) really shines, is serialization of data into a binary format which is incredibly fast. In .NET, all inbuilt alternatives are slower by a large margin.
I agree. I recently converted some large files that were previously stored using XmlSerializer to use protobuf-net, and I found an 8x increase in space efficiency, and 6-7x increase in (de)serialization efficiency. It really is a fantastic library, and if your classes are already marked up for serialization, there is very minimal work required to make the switch. For files that need not be human-readable, protobuf is definitely the way to go.
It would probably be better to try something like Cap'n Proto or SBE if you're worried about performance. Otherwise I think sticking to gzipped JSON isn't going to lag that far behind. Protocol Buffers' biggest benefit IMHO is just the .proto file for cross-language code generation.
I have it on a todo list to port an SBE parser to ScalaJS. ScalaJS already backs java ByteBuffers with javascript TypedArrays. That should be really fast, the same stuff that is being worked on for making asm.js fast will also make the Cap'n Proto / SBE approach fast, so I think this has the most promise of bringing really high-performance data transfer capabilities to the browser.
Having used both on a few projects, including a JS frontend, my advice is:
"Don't use protobufs if you don't have to".
Protobufs can be much faster, and provide a strict schema, but that comes at the price of higher maintenance costs. JSON is much simpler, easier to implement, and MUCH easier to debug. If your GPB looks like it's building properly but fails to parse, it's a huge pain to try to decode/debug the binary. You'll wish you could just print the JSON string.
If you need the speed and schema, then GPBs are great. In our case, we got a huge speed boost just by avoiding string building/parsing inherent in JSON.
Could you elaborate on the maintenance costs? We use ProtoBufjs for our own real-time whiteboarding webapp over web sockets, and in the long run having strict schemas has saved us a lot of time. We're a distributed team with different members working on the front and backends, and we frequently refer to our proto files to remember how data is transferred and how it should be interpreted (explained in our proto commented code).
Are the maintenance costs related to debugging unparsable messages? We've almost never had an issue there, so maybe we've just been lucky?
Co-author of proto3 here. Proto3 was specifically designed to make proto more friendly in a variety of environments, which includes native JSON support. New Google REST APIs are defined in proto3, and those definitions are open sourced[1].
In my experience, it's not that tough to write a 'proto-to-dict' function in python, which lets you crack open the proto and look at its juicy innards...
I'm curious if Google has a common envelope they send all service messages with. Ie. A common way of specifying pagination parameters, auth tokens etc. when sending protobuf messages between services. I've been using protobufs for my services and wrote a ServiceRequest object which has worked well. I was more just surprised about not being able to find much documentation on actual deployments as opposed to just simple tutorials.
With one end in Python 2 and the other end in Javascript, using binary protobufs seems misplaced optimization. It's nice to know the support is there (well, not in Python 3, apparently), in case you need to talk to something that speaks protobufs.
I'm looking forward to seeing protobufs in Rust as a macro. It should be possible; there's an entire regular expression compiler for Rust as a compile-time macro, which is a useful optimization.
One of the comments on that article was "YAY! JSON is wastefully large. I'd love to replace it." Is this true? I'm confused why JSON would be seen as wasteful as a format. It seems to me that with any decent compression it would be hard to get much smaller. In this case I'm not talking about the other advantages Protobuf offers, I just want to know about size.
There are basically 2 areas where JSON is really wasteful. Compression can help with both of those.
1. Dictionary keys are repeated when you have an array of similar objects.
2. Non-text data. JSON can't natively represent binary data, forcing people to use things like base64 for binary and base10 for numbers.
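A contrived illustration of both points (Node's Buffer used for the base64 step):

    // 1. Repeated keys: every element re-sends "latitude" and "longitude".
    const points = '[{"latitude":48.85,"longitude":2.35},{"latitude":51.5,"longitude":-0.12}]';

    // 2. Binary data: 3 raw bytes balloon into a quoted base64 string.
    const raw = new Uint8Array([0xde, 0xad, 0xbe]);
    const asJson = JSON.stringify({ blob: Buffer.from(raw).toString("base64") });
    // -> {"blob":"3q2+"}  (plus the overhead of the key and the quotes)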
> "YAY! JSON is wastefully large. I'd love to replace it." Is this true? I'm confused why JSON would be seen as a wasteful as a format.
It transmits type and field names. Depending on how complex your data is those strings could be a large part of the data.
{ "person": { "age": 30, "shoesize": 10 } }
The above is what, 4-5 bytes of protobuf? I'm not sure what the gzipped-json data is but likely a lot more. If you were to send a list of 100 such person objects, the difference would be smaller.
Assuming the integer fields are regular varint types (and not the "fixed" integer encoding), and assuming the tag numbers were all under 16, then this would be a six-byte protobuf.
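For the curious, the byte-level breakdown under those assumptions (person as field 1 of the outer message, age and shoesize as fields 1 and 2 of the nested one; field numbers chosen arbitrarily):

    // message Person { int32 age = 1; int32 shoesize = 2; }
    // wrapped as: message Wrapper { Person person = 1; }
    const encoded = new Uint8Array([
      0x0a, // field 1 (person), wire type 2 (length-delimited)
      0x04, // length of the nested Person message: 4 bytes
      0x08, // field 1 (age), wire type 0 (varint)
      0x1e, //   value 30
      0x10, // field 2 (shoesize), wire type 0 (varint)
      0x0a, //   value 10
    ]); // 6 bytes total, vs. roughly 35 bytes for the minified JSON above before gzip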
No, it isn't true, but regardless of what format you use, there will always be someone who's not happy. Actually, I think that applies to everything in life.
Another way to do this is to specify the protocol in protobuf but have the server translate responses and requests to and from JSON. The Java protobuf library does that for you out of the box. This is easier to implement. I would be curious to compare the performance of both approaches in different contexts.
CORBA was ridiculously complex, because they tried to make remote objects look like local ones, with messages, reference counting, naming, discovery, etc. Protobuf is just a serialization mechanism. You're thinking at a lower level of abstraction - it's all just PODs that go over the wire, you build your own RPC framework on top of that (or use gRPC, which is Google's protobuf-over-HTTP2 RPC library) and think in terms of requests & responses.
IMHO trying to make everything look like an object was a mistake, and newer RPC frameworks like gRPC, Thrift, and JSON-over-HTTP are much easier to use than the late-90s frameworks like RMI, CORBA, and DCOM. Sometimes you don't want abstraction, because it abstracts away details you absolutely need to think about.
There is definitely a place for binary serialization/deserialization and transmission. Inter-system communication is probably the best fit for binary, as is any place that needs high-speed real-time communication with payloads small enough to fit within MTU limits (game protocols over UDP, for instance). Anywhere you control both the client and the server is fine for binary.
However, I do feel there is a strange swing back to binary (Protobuf, HTTP/2, etc.). Developers are now trying to wedge it into places where it may cause more problems: it is more efficient in performance, but not in use or implementation. Plus, as mentioned in this thread, you can compress JSON to be very small over the wire, which makes binary's compactness advantage a non-issue in non-real-time cases. Going binary just to go binary is more trouble than it is worth in most cases.
- Binary, compared to keyed plain text (JSON), is harder to parse generically when you just want a few fields/keys out of dictionaries/lists.
- Binary also seems to lock down messaging more than JSON: changing explicit binary messages takes more work because of offset issues, and client/server tools must stay in sync, rather than just adding a new key that consumers can pull as needed.
- Third party implementation and parsing of JSON/XML is more forgiving making version upgrades and changes easier to do. This is especially apparent on projects that are taken over by other developers.
- The language/platform on the backend leaks into the messaging. For instance Protobuf only runs on js/python currently and has various versions. The best messaging is independent of the platform and versioning is easier.
I would bet binary formats end up causing more bugs than keyed plaintext (JSON/XML, possibly compressed), though I have nothing to back that up except my own experience, largely in game development, where networking state is almost always binary. For server/data work I wouldn't use binary unless it needs to be real-time.
That being said, Protobuf is awesome, and I hope developers use it where it is best suited and don't start obfuscating messaging for performance where it isn't really needed; better to keep things simple at every level unless you need the extra complexity.
At work we're using HTTP requests and now we added RabbitMQ in the last few months to deal with the fact that our frontend has to talk to our backend. After seeing this article it feels like we chose the wrong tool for the job; protobuf/thrift appear to be typed which would have saved us a lot of frustration as we've already run into multiple cases where the receiver or sender have messed up the type conversion or parsing.
I don't see how protobuf is mutually exclusive with RabbitMQ. RabbitMQ is a message broker and can send around byte arrays. These byte arrays can be anything, including protobuf messages.
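For example, a minimal sketch with the amqplib client (queue name and URL are placeholders; the payload is assumed to come from whatever protobuf codegen you use):

    import amqp from "amqplib";

    // `payload` is assumed to be an already-serialized protobuf message,
    // e.g. the result of someGeneratedMessage.serializeBinary().
    async function publishProto(payload: Uint8Array): Promise<void> {
      const conn = await amqp.connect("amqp://localhost");
      const ch = await conn.createChannel();
      await ch.assertQueue("orders");
      // RabbitMQ just moves bytes; the protobuf payload is opaque to the broker.
      ch.sendToQueue("orders", Buffer.from(payload), {
        contentType: "application/x-protobuf",
      });
      await ch.close();
      await conn.close();
    }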
I guess it only makes sense if you are already using protobuf everywhere else in your stack. Especially if you are leveraging gRPC[0], which is already protobuf over HTTP/2. The network tab problem could be solved by an extension, or browsers could offer the tools built-in if this were to become a trend.
It's rather painful that they don't seem to have any design docs up for their HTTP transport, leaving it to things like FAQ entries to explain these details. This was my first thought too- grpc does this.
My understanding of ASN.1 is that it has no affordance for forwards- and backwards-compatibility, which is critical in distributed systems where the components are constantly changing.
...
OK, I looked into this again (something I do once every few years when someone points it out).
ASN.1 _by default_ has no extensibility, but you can use tags, as I see you have done in your example. This should not be an option. Everything should be extensible by default, because people are very bad at predicting whether they will need to extend something later.
The bigger problem with ASN.1, though, is that it is way over-complicated. It has way too many primitive types. It has options that are not needed. The encoding, even though it is binary, is much larger than protocol buffers'. The definition syntax looks nothing like modern programming languages. And worst of all, it's very hard to find good ASN.1 documentation on the web.
It is also hard to draw a fair comparison without identifying a particular implementation of ASN.1 to compare against. Most implementations I've seen are rudimentary at best. They might generate some basic code, but they don't offer things like descriptors and reflection.
So yeah. Basically, Protocol Buffers is a simpler, cleaner, smaller, faster, more robust, and easier-to-understand ASN.1.
Has the title been updated? It currently is "Using Protobuf instead of JSON to communicate with a front end", which is not click bait at all. The author used Protobuf instead of JSON as an experiment, and concluded that there is no reason to use it.
MessagePack worked great for us: fast, compact serialization, easy to use, great platform and language support. I've never used protocol buffers, mainly because I really dislike that you have to write .proto files that are then translated to code, which IMO is an unnecessary kludge in many situations. I understand it can be useful, especially if you need to serialize the same things from different languages and don't want to write the same serialization code twice (or more), but if that's not a concern for your project, I have no idea why I should prefer protocol buffers over MsgPack.