Add extra stuff to a “standard” encoding? Sure, why not (rachelbythebay.com)
219 points by l0b0 on Sept 20, 2023 | 155 comments



> author decided it was a good idea to prepend the message with the message length encoded as a varint.

> WHY? Oh, why?!

Uh oh. Is this my HN moment?

This is exactly how I implemented it at my company. We had to write many protobuf messages to one file in bulk (in parallel). I did a fair amount of research before designing this and didn’t find any standard for separating protobuf messages (in fact, I found that there explicitly isn’t one: protobuf doesn’t care). So I thought rather than using some “special” control character, like a null byte, which would inevitably turn out to be not-so-special and collide with somebody else’s (like Schema Registry’s “magic byte”), I’d use something meaningful: the number of bytes in the following record.

As for why I chose varint instead of just picking an integer size: for one, I got nerd-sniped by varint encoding and thought it would be cool to try to implement it in Scala. Secondly, I thought that no matter what fixed-size integer I picked, my users would always surprise me and exceed it at least once, and when that happens, kaboom! I wanted to future-proof this without wasting 64 goddamn bytes in front of each message, and also I got nerd-sniped, OK?!?
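
For what it's worth, the scheme described is tiny to implement. A minimal Python sketch of varint length-prefix framing (illustrative, not the poster's actual Scala code):

    def encode_varint(n: int) -> bytes:
        # Base-128 varint: 7 bits per byte, low groups first,
        # high bit set on every byte except the last.
        out = bytearray()
        while True:
            b = n & 0x7F
            n >>= 7
            if n:
                out.append(b | 0x80)  # more bytes follow
            else:
                out.append(b)
                return bytes(out)

    def frame(record: bytes) -> bytes:
        # Prefix each record with its byte length.
        return encode_varint(len(record)) + record

    # frame(b"hello") == b"\x05hello"; a 300-byte record gets the
    # 2-byte prefix b"\xac\x02", so the prefix grows only as needed.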

Someone on my team recently shared one of these files outside the company, so I really hope she’s not talking about me; that’s a crazy coincidence if not!


Congrats :) you perfectly re-created the actual Protobuf stream format. [1]

https://protobuf.dev/reference/java/api-docs/com/google/prot...


> didn’t find any standard for separating protobuf messages

The fact that protobufs are not self-delimiting is an endless source of frustration, but I know of 2 standards for doing this:

- SerializeDelimited* is part of the protobuf library: https://github.com/protocolbuffers/protobuf/blob/main/src/go...

- Riegeli is "a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding": https://github.com/google/riegeli



Huh? This is an entirely different serialization format.


Create a MessagePack message containing a byte array, put the protobuf in that. Of course, the enterprise solution would be to hex encode it and put it in a CDATA section in an XML document.


Poster is being cute.


Just stick this in front of your frames:

    81 A2 70 62 C6 <32 bit be length> <protobuf data>


You got me. What's that format / magic number? At first I thought it was the zlib header for incompressible data, but that's not right. Is that just the MAC address of some NIC you own?


It's msgpack:

0x8n: a map containing "n" key/value pairs -> 0x81 introduces a single key/value pair

0b101XXXXX: a string where XXXXX is the string length -> 0xA2 introduces a 2-byte string

0xC6 followed by a 32-bit big-endian value N introduces an array of N bytes

0x70 0x62: "pb"

So it roughly corresponds to

  {"pb": <protobuf data> }


Doesn't have a Brainf*k implementation, disqualified.

/s


Prepending the message with a delimiter (size varint) is pretty common, even part of the reference Java implementation: https://protobuf.dev/reference/java/api-docs/com/google/prot...


> “I wanted to future proof this without wasting 64 goddamn bytes in front of each message”

64 bytes would be a 512-bit integer. That seems like excessive future proofing for the length of any message that would be transmitted before the Sun runs out of fuel.


Fun fact: just 103 more bits and it would be enough to address every point in the observable universe distinguishable by Planck length [1].

Not my thought originally, I heard it from somewhere else but can't find it. Possibly from Foone Turing.

[1] https://www.wolframalpha.com/input?i=log2%28+volume+of+unive...


I wonder whether there’s an upper bound on the largest number that can be expressed in the observable universe.

All digital representations rely on discrete states in hardware, and there’s a finite number of those in the observable universe, so there should be a finite maximum number for computers.


Depends on what you mean by “expressed”. E.g. you can represent a busy beaver that expresses the length of its output.

We don’t know if there is only a finite number of discrete states in the observable universe. The Planck length is not a discretization.


> Depends on what you mean by “expressed”

I'll have to think about this. I want a very physical meaning of the term using values rather than references. One where the largest number that can be expressed using up/down fingers with two hands is 1024, not a sign language reference to a googolplex.


If you are OK with a sum constants times powers of two, would you be ok with other hyperoperations on two?

https://en.m.wikipedia.org/wiki/Hyperoperation

I wonder if it would be helpful to restrict the question a bit, maybe something like: what is the largest number for which the number, and also all smaller magnitude integers, can be expressed.


You might enjoy:

"Measuring the intelligence of an idealized mechanical knowing agent" https://philpapers.org/archive/ALEMTI-2.pdf

"Intuitive Ordinal Notations" https://github.com/semitrivial/IONs


all expression is reference


Related to https://en.m.wikipedia.org/wiki/Berry_paradox

Just as a quick example of why it's a bit absurd: I name that number you just defined zeta. Now I make zeta' = zeta^zeta. Or whatever manipulation you like. Adding constraints is addressed in the link.


And zeta' cannot be expressed by any state of the visible Universe.

The GP question was not about encoding, and thus is not subject to compression. The largest number we can measure of anything is a pretty well defined concept.


Tbf though, I'm sure the number of cat pics floating around the internet dwarfs this number, so it depends on the data, I suppose.

Plus rather simple things like pi could produce a rather long message.


I suppose pi has a cheat code: if you made the largest possible circle in the universe you only need enough digits to distinguish points on its circumference that are a Planck distance apart. Then you can either ignore any digits beyond that or even simply make them up, as there would be no way to measure the difference.
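
Back-of-envelope, taking roughly 8.8 x 10^26 m for the diameter of the observable universe and 1.6 x 10^-35 m for the Planck length (both rough figures):

    \frac{\pi d}{\ell_P} \approx \frac{\pi \times 8.8 \times 10^{26}\,\mathrm{m}}{1.6 \times 10^{-35}\,\mathrm{m}} \approx 1.7 \times 10^{62}

So roughly 62-63 digits of pi are as many as could ever matter for measuring that circle.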


A related point: protobufs are limited to 2 GB anyway, so you wouldn’t need 64 bits. Anything bigger than about 100 MB is dangerously large as it is.


You can also define a wrapper message with a single repeated field. The resulting encoding would be `varint(field id * 8 + 2) varint(length)` plus the actual message (wire type 2 is length-delimited), so it can also be easily generated from and parsed back into a raw byte sequence without protoc.
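
A hedged sketch of that wrapper trick, framing each message as if it were an element of `repeated bytes items = 1;` in an outer message (names illustrative):

    def encode_varint(n: int) -> bytes:
        out = bytearray()
        while True:
            b = n & 0x7F
            n >>= 7
            if n:
                out.append(b | 0x80)
            else:
                out.append(b)
                return bytes(out)

    def wrap(message: bytes, field_id: int = 1) -> bytes:
        # Tag for field 1, wire type 2 (length-delimited): (1 << 3) | 2 = 0x0A.
        tag = encode_varint((field_id << 3) | 2)
        return tag + encode_varint(len(message)) + message

    # Concatenating wrap(m) for each message m yields bytes that parse as one
    # valid outer message with a repeated field, and can also be split back
    # into raw messages without protoc.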


This approach is problematic because, as far as I know, all implementations expect to deserialize from a complete buffer -- at least all official implementations. That means the entire "stream" would have to be buffered in memory.

You can, of course, simply read field by field, but few libraries expose the ability to do that simply. And the naive operation then becomes quite problematic indeed.


This is the same as how it's implemented in ClickHouse. In fact, it has two formats: `Protobuf` (with length-delimited messages) and `ProtobufSingle` (no delimiters, but only a single message can be read/written). And it is fairly common:

https://clickhouse.com/docs/en/sql-reference/formats#protobu...


Conforming implementations will fail to interoperate with their “protobuf” serialization; it’s absolutely incorrect for them to call their length-prefixed framing protobuf.


From my experimentation with writing my own push notification client for Android: this is exactly what Google does with their push messages.


If it was for fun and to learn how, that's fair. But are you aware of https://ntfy.sh?


Ah I was just experimenting with what was possible, essentially I wanted to intercept notifications from an app to use for an automation.


Can you share more?


You've basically got two ways of handling this:

Using a delimiter to mark the end of a packet. For text protocols we usually use "\n" as the delimiter. This usually requires some escaping in the packet to make it unambiguous. Two standard protocols for this are SLIP and HDLC.

Length encoding - like what you did.

The downside of the delimiter approach is that it changes the length of the packet - when you are escaping, one byte becomes two, and that's sometimes a pain if you're doing it in place in memory (less of a problem if you're streaming byte by byte). The big advantage is that it allows for resynchronisation: with length encoding, if you lose a single byte from your stream, or your length byte ever gets corrupted, you're permanently out of sync - the receiver will never again know where the start or end of a packet is. With the delimiter approach, you just lose one packet. So if you're ever doing this for a UART or network stream or something, always do the delimiter approach!
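
A minimal sketch of the delimiter approach using SLIP's framing constants (RFC 1055): END marks packet boundaries, and END/ESC bytes inside the payload are escaped, so a receiver that joins mid-stream simply waits for the next END and loses at most one packet:

    END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD  # RFC 1055

    def slip_encode(packet: bytes) -> bytes:
        out = bytearray()
        for byte in packet:
            if byte == END:
                out += bytes([ESC, ESC_END])   # escape END in payload
            elif byte == ESC:
                out += bytes([ESC, ESC_ESC])   # escape ESC in payload
            else:
                out.append(byte)
        out.append(END)                        # frame delimiter
        return bytes(out)

    def slip_decode_stream(data: bytes):
        # Yields complete packets; bytes after the last END (an
        # incomplete frame) are ignored in this sketch.
        packet, escaped = bytearray(), False
        for byte in data:
            if escaped:
                packet.append(END if byte == ESC_END else ESC)
                escaped = False
            elif byte == ESC:
                escaped = True
            elif byte == END:
                if packet:
                    yield bytes(packet)
                packet = bytearray()
            else:
                packet.append(byte)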


> So if you're ever doing this for a UART or network stream or something, always do the delimiter approach!

I'm confused. I mean for UART, sure. But a network stream is usually sent over a protocol that recovers lost data. Am I reckless for sending length-prefixed data chunks over TCP?


No, you're right that this approach is ok for TCP streams. I shouldn't have thrown that in there without clarification that I meant something more generic.

I actually use SLIP for packetisation over TCP anyway, because then syncing for any other logging or diagnostic thing that joins halfway through is easy. Basically I find it to just be a more robust system.


"I actually use SLIP for packetisation over TCP anyway"

Whoa, that's an acronym I haven't heard in a long time.

SLIP/TCP seems redundant. You're serial-encoding IP and then sending it over TCP/IP? Is this a tunnel?


TCP is a stream with no packet boundaries. If you want to separate that stream into a series of messages then you need to do packetisation. The SLIP delimiter approach is a pretty good one.

You're right that it was originally for segmenting a serial link to send separate IP packets (so you could theoretically have SLIP over TCP over IP over SLIP), but it works just as well in other contexts.


A TCP network stream will guarantee no missed bytes via behind-the-scenes retransmissions, no?


Yes, that's why it's very common to send length-prefixed messages over TCP.

(The TCP checksum is very weak at only 16 bits, but there's usually another layer like TLS that gives you integrity for free so nobody cares)


> isn’t a standard in that protobuf doesn’t care

Shove protobuf into Something Else that does packet delimitation for you. I'm fond of SQLite for offline cases as a richer alternative to sstable.


Oh who hasn’t written some LengthPrefixedProtobufRecord class. I get why Rachel would be annoyed if someone claimed it was protobuf, but even then it’s like, 8 seconds in xxd to see what’s going on.

I shudder to think how well shit must be going for this to merit a Rachel post.


Why not just define a higher level proto that contains all possible (maybe repeated) protos you might want to include? Then if one of the included protos is not present, the higher level proto will efficiently encode that, and nothing gets broken.


If you want to future proof it you need to version it. It sounds like you're trying to pack many things into a single file, so having the first few bits of the file represent a version allows you to use fixed length integers without fear of them being too small (in the future). You can reserve the "last" version for varint if you truly need it.

In general, I find adding versions to things allows for much more graceful future redesigns and that is, IMO, invaluable if you're concerned about longevity and are not confident in your ability to perfectly design something for the indefinite future.


I don't see the point. As soon as you bump the version number, old versions of the software will refuse to read newer files. So you might as well use a new format and file extension for the newer files. Of course the new software will read both formats.

The way protobuf handles versioning (old software ignores unknown fields) is far superior, and realistically everyone uses 64-bit fixed-length sizes everywhere.


YAGNI

The idea is so simple that any change would be a misfeature.


I'm a big YAGNI proponent but format version numbers are not something I would dare to YAGNI.


Even if it's varint separated messages?


You stick a magic number and a version number at the top of the file and call it done. It’s trivial and buys you a great deal.

The magic means it’s possible to identify the file type. Maybe you’ll add a tool later that operates on multiple types of files.

The version means you can evolve the contents in non-backwards-compatible ways, while maintaining the ability to read/parse the old version.

It’s a pain in the ass to add a magic or version number later; there’s a reason why nearly every file format on the planet has both.


> nearly every file format on the planet

XML

JSON

YML

TOML


  <?xml version="1.1" encoding="UTF-8" ?>
The others are serialization formats, like protobuf — they’re not file formats.


XML prolog is optional.

> they're not file formats

Brb, gonna delete all my files without file formats


If you prefer to ship fragile binary file formats for no reason other than finding it too onerous to define something as trivial as an eight-byte file header, that’s silly, but nobody is going to stop you.

If you’re going to use text-based serialization formats as your justification for the decision, however, I’d suggest you look into all the fun bugs, security issues, and weird edges cases that arise from parsers having to make a best guess at character encoding and file format when all you have to work with is the file extension, maybe a byte order mark, and heuristics over the file contents.


You mean 64 bits, I assume?

Because that's all you would need.


Your length-prefixed message frames represent a distinct serialization scheme that provides external framing for embedded messages; it’s a standalone format, and you can use whatever frame header you want without worrying about collision.

The only issue is if you were to ship a “protobuf” library that emits/consumes your (very much not protobuf) framing format.

Also, a 64 bit frame length would only be 8 bytes, not 64 :-)


This post's outrage is misplaced.

Protobuf encoders/decoders commonly implement two formats: delimited format, and non-delimited format. Typically non-delimited is the default, but delimited format is supported by many implementations including Google's main first-party ones. In fact, the Java implementation shipped with this support when Protobuf was first released 15 years ago (I wrote it). C++ originally left it as an exercise for the application (it's not too hard to implement), but I eventually added helper functions due to demand.

Both formats can be described as "standard", at least to the extent that anything in Protobuf is a standard.

So clearly the bug here is that one person was writing the delimited format and the other person was reading the non-delimited format. Maybe the confusion was the result of insufficient documentation but certainly not from a library author doing something crazy.


It's sort of misplaced, I'll agree.

Merely using the delimited format without any other sort of framing is almost always a bad idea because of precisely the ambiguity TFA discusses.

I'm pretty sure delimited streams are rarely used in the wild; instead people use something more robust/elaborate such as recordio, whose records are almost always prefixed with a few magic bytes specifically to mitigate this problem.

Edit: Also, why is there no publicly available recordio specification? Infuriating.


> I'm pretty sure delimited streams are rarely used in the wild

They are somewhat common inside Google at least.


I actually went through all projects listed in [1] because I remember this very quirk. It turns out that there are many such libraries that have two variants of encode/decode functions, where the second variant prepends a varint length. In my brief inspection there do exist a few libraries with only the second variant (e.g. Rust quick-protobuf), which is legitimately problematic [2].

But if the project in question was indeed protobuf.js (see loeg's comments), it clearly distinguishes encode/decode vs. encodeDelimited/decodeDelimited. So I believe the project should not be blamed, and the better question would be why so many people chose to add this exact helper. Well, because Google itself also had the same helper [3] [4]! So at this point protobuf should just standardize this simple framing format [5], instead of claiming that protobuf has no obligation to define one.

[1] https://github.com/protocolbuffers/protobuf/blob/main/docs/t...

[2] https://github.com/tafia/quick-protobuf/issues/130

[3] https://protobuf.dev/reference/java/api-docs/com/google/prot...

[4] https://github.com/protocolbuffers/protobuf/blob/main/src/go...

[5] Use an explicitly different name though, so that the meaning of "encoding/decoding protobuf messages" doesn't change.


Yep, and this variant to the encoding is documented at https://protobuf.dev/programming-guides/techniques/#streamin...

Definitely seems to be a routine addition to the standard supported by many libraries.


> Yep, and this variant to the encoding is documented at [...]

It only suggests the length prefix and doesn't define the exact encoding at all.


This is not a "variant". Top-level "message" has no length. Implementations that add length are free to do it in whatever way they like, but that's not part of the format because the format itself doesn't have a concept of "sequence of messages". It has lists, but those need to be inside of messages.

And since some do it in the same way, sometimes it works. The two typical approaches I found in the wild: use varint as a length and do nothing at all. Typically, the second one implies that the user, if they want to send a sequence of messages need to get creative and invent some form of connecting them together.

GRPC is the first kind. So, all those using GRPC rather than straight-up Protobuf are shielded from this problem.


Stepping up an abstraction level in this discussion... does anyone have any insight into _why_ an encoding format wouldn't want to have length prefixes standardized as part of the expected header of a message? From what I can tell, there isn't a strong argument against it. Assuming you're comfortable with limiting messages to under 2^32 bytes, an unsigned length prefix should only take four bytes per message, which doesn't seem like it would ever be a bottleneck. It allows the receiving side of a message to know up front exactly how much memory to allocate, and it makes it much easier to write correct parsing code that also handles edge cases (e.g. making it obvious to explicitly handle a message that's much larger than the amount of memory you're willing to allocate). The fact that there are formats out there that don't mandate length prefixing makes me think I might be missing something though, so I'd be interested to hear counterarguments.


In general, the issue is composition. Your full message is someone else's field. Having a header that only occurs in message-initial position breaks this.

For protobufs in particular, I have no idea. If you look at the encoding [0], you will see that the notion of submessages is explicitly supported. However, submessages are preceded by a length field, which makes the lack of a length field at the start of the top-level message a rather glaring omission. The best argument I can see is that submessages use a tag-length-value scheme instead of length-value-tag. This is because in general protobufs use a tag-value scheme, and certain types have the beginning of the value be a length field. This means that to have a consistent and composable format, you would need the message length to start at the second byte of the message. Still, that would probably be good enough for 90% of the instances where people want to apply a length header.

[0] https://protobuf.dev/programming-guides/encoding/


Protobuf is sort of unique among serialization formats in that it can be extended indefinitely in principle. (BSON is close, but has an explicit document size prefix.) For example, given the following definition:

    message Foo {
        repeated string bar = 1;
    }
Any repetition of `0A 03 41 42 43` (a value "ABC" for field #1), including the empty sequence, is a valid protobuf message. In other words, there is no explicit encoding for "this is a message"! Submessages have to be delimited because otherwise they wouldn't be distinguishable from the parent message.
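
A bare-bones Python parser for that Foo message makes the point concrete: the input is just zero or more (tag, length, bytes) groups, with no outer "this is a message" marker (single-byte lengths assumed, i.e. strings under 128 bytes):

    def parse_foo(data: bytes) -> list[bytes]:
        bar, i = [], 0
        while i < len(data):
            assert data[i] == 0x0A, "sketch handles only field 1, wire type 2"
            length = data[i + 1]               # one-byte varint length
            bar.append(data[i + 2 : i + 2 + length])
            i += 2 + length
        return bar

    print(parse_foo(b""))                 # [] -- the empty message is valid
    print(parse_foo(b"\x0a\x03ABC" * 2))  # [b'ABC', b'ABC']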


This is an interesting design difference between protobuf and Cap'n Proto: one has "repeated field of type X" while the other has "field of type list of X". So one allows adding a repeated/optional field to a schema without re-encoding any messages, while the other supports "field of type list of lists of X" etc.


> So one allows adding a repeated/optional field to a schema without re-encoding any messages,

Hmm don't they both allow that? Or am I misunderstanding what you mean here?

I guess the interesting (though only occasionally useful) thing about protobuf is if you concatenate two serialized messages of the same type and then parse the result, each repeated field in the first message will be concatenated with the same field in the second message.


Yes, "submessages"in Protobuf have the same field serialisation as strings

The field bytes dont really encode tag and type, they encode tag and size (fixed 32bit, 64bit or variable length)

Protobuf is a TLV format. In that regard, it's not unique at all.


> you will see that the notion of submessages are explicitly supported.

This is misinterpreting what actually happens. "Message" in Protobuf lingo means "a composite part". Everything that's not an integer (or boolean or enum, which are also integers) is a message. Lists and maps are messages, and so are strings. The format is designed not to embed the length of the message in the message itself, but to put it outside. Why? Nobody knows for sure, but most likely it was a mistake. After all it's C++, and by the looks of the rest of the code the author seems to have felt challenged by the language; fixing the misplaced message length once they realized the problem would have been too much work, and so it continues to this day.

For the record, I implemented a bunch of similar binary formats, eg. AMF, Thrift and BSON. The problem in Protobuf isn't some sort of theoretical impasse. It's really easy to avoid, if you give it an hour of thinking before you get to actually writing the code.


So the length of the submessage is part of its parent, and the top level message has no explicit length because it has no parent? It seems terrible for most purposes.


Precisely. This is also unique to Protobuf (at least, I don't know other formats, and I had to implement a handful, that do that).


> Having a header that only occurs in message initial position breaks this.

Why would it break it? It may make it slightly harder to parse, but since the header also determines the end of the message, anyone parsing the outer message would have a clear understanding that the inner header can be safely ignored as long as the stated outer length has not been matched.


Yeah, it's not clear to me why this is an issue either. I'd expect that parsers would be written as "parse length, if it's valid, allocate that many bytes and read them in, then parse those bytes as a message".
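
A sketch of that loop with a fixed 4-byte big-endian prefix and an explicit cap, so a hostile length can't force a huge allocation (MAX_MESSAGE is an illustrative limit, not from any spec; real socket code must also loop on short reads):

    import io
    import struct

    MAX_MESSAGE = 64 * 1024 * 1024  # 64 MiB cap on declared lengths

    def read_messages(stream):
        while True:
            prefix = stream.read(4)
            if not prefix:
                return  # clean end of stream
            if len(prefix) < 4:
                raise EOFError("truncated length prefix")
            (length,) = struct.unpack(">I", prefix)
            if length > MAX_MESSAGE:
                raise ValueError(f"declared length {length} exceeds limit")
            body = stream.read(length)
            if len(body) < length:
                raise EOFError("truncated message body")
            yield body

    # list(read_messages(io.BytesIO(b"\x00\x00\x00\x03abc"))) == [b"abc"]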


I'd say the main arguments are:

1. Many transports that you might use to transmit a Protobuf already have their own length tracking, making a length prefix redundant. E.g. HTTP has Content-Length. Having two lengths feels wrong and forces you to decide what to do if they don't agree.

2. As others note, a length prefix makes it infeasible to serialize incrementally, since computing the serialized size requires most of the work of actually serializing it.

With that said, TBH the decision was probably not carefully considered, it just evolved that way and the protocol was in wide use in Google before anyone could really change their mind.

In practice, this did turn out to be a frequent source of confusion for users of the library, who often expected that the parser would just know where to stop parsing without them telling it explicitly. Especially when people used the functions that parse from an input stream of some sort, it surprised them that the parser would always consume the entire input rather than stopping at the end of the message. People would write two messages into a file and then find when they went to parse it, only one message would come out, with some weird combination of the data from the two inputs.

Based on that experience, Cap'n Proto chose to go the other way, and define a message format that is explicitly self-delimiting, so the parser does in fact know where the message ends. I think this has proven to be the right choice in practice.

(I maintained protobuf for a while and created Cap'n Proto.)


About #1: FYI, you should never trust the Content-Length of an HTTP request. It actually says that in one version of the spec or another.

The problem with writing the length out at the beginning of the message is that you need to know the length before you write it out. For large objects that may cause memory issues/be problematic.

In many cases it works just fine. I doubt any protocol puts a "0" as the length for a dynamic length, but I can see a many-months long technical fight about that particular design decision.


> FYI you should never trust the content-length of an HTTP request. It actually says that in one of the spec versions or another.

That's not quite right. When a Content-Length is present, it is always correct by definition (it is what determines the body size; there's no way for the body to be a different size). What you probably mean is that you shouldn't ever assume that a Content-Length is available, because an HTTP message can alternatively be encoded using Transfer-Encoding: chunked, in which case the length is not known upfront. That is true, but doesn't change my point: Either Content-Length or chunked transfer encoding serve to delimit the HTTP entity-body, which makes any other in-band delimiter redundant.


About #2: it seems that every time someone creates a format with a prefixed serialized size, the next required step is creating a chunked format without that prefix.

Which IMO is perfectly OK. Chunking things is a perfectly fine way to allow for incremental communication. Just do it up-front, and you will not need the complexity of two different sizing formats.


In truth, the vast majority of protobuf users do not construct messages in a streaming way anyway; they have the whole message tree in memory upfront and then serialize it all at once.

"Streaming" in gRPC (or Cap'n Proto) involves sending multiple messages. I think this is probably the right approach. Otherwise it seems very hard to use a message as its streaming in as you can never know which fields are not present vs. just haven't arrived yet. But representing a stream as multiple messages leaves it up to the application to decide what it wants in each chunk, which makes sense.


Yeah, that's another good point I forgot to mention; it's often hard to know the difference between "I haven't received this info yet but it might still come" and "I know this info definitely isn't coming at all". I've dealt with this more often in the context of a specific schema rather than in the encoding format itself (e.g. sending back a simple "success" response when relaying a message whose real response will be delivered asynchronously, so that the other side can tell the difference between the message or response getting lost and the message being successfully sent with the response just not having arrived yet). But it's definitely possible for an encoding format to be designed in a way where it's not clear where the message should end, and having a length prefix is an effective way to deal with that as well.

I also fully agree with the "streaming can be done as multiple messages" approach. From the discussion here, it sounds like there may be some use cases where having a length prefix would be prohibitive (e.g. compression being generated on the fly), but these don't sound like typical use cases for general-purpose encoding formats; if anything, I'd expect something like a gzip response to be sent back as the entirety of a response (e.g. to an HTTP GET request for a specific file) rather than as part of a message in some custom protocol using protobuf or something similar.


> serialize incrementally

You cannot do it in Protobuf anyways (you need to allocate memory, remember? and you need to know how much to allocate, so you need to calculate the length anyways, you just throw it away after you calculate it, fun, fun fun!).


You don't necessarily have to allocate space for the whole message upfront. You can write some of it into a buffer, write that out to the socket, fill the buffer with more stuff, etc. There are actually programs that do this, although it's true that it's rarely worth the complexity.


Not going to work, really.

The way it works in real life is: say you want to serialize a list of varints. You'd need some small memory chunk (call it staging) where you write individual integers (although this is a bad idea for long lists, as you'd really want to write multiple elements at once if you have enough elements to justify spawning more threads). So, in this staging area you write those integers, more or less a byte at a time. You know each one isn't going to take more than 10 bytes in the worst case (for 64-bit values), so your staging area can be 10 bytes.

Then you need to keep track of how many bytes in total you wrote. And at this point you may start writing the field with the serialized list (in bytes). The field will contain the length of the list. So, you've already calculated the length even before you started writing. Also, you need to store those varints somewhere before you start writing the field with the list...

Protobuf isn't designed to do streaming. Well, really, it isn't designed at all. Like I wrote elsewhere, it was implemented first, and then there was an attempt to describe what was implemented and call that "design". Having implemented several formats (eg. FLV and MP4) that were designed for streaming, I'm very confident Protobuf's authors never concerned themselves with this aspect.


Among other things, length prefixing is annoying when streaming; it basically requires you to buffer the entire message, even if you could more efficiently stream it in chunks, because you need to know the length ahead of time, which you may very well not.


Remember FLV? What about MP4? Surprise! Both use length prefixing, and stream perfectly fine.

Length-prefixing is not a problem for streaming. Hierarchical data is, but even then, you have stuff like SAX (for XML).

The problem with Protobuf and why you cannot stream it is that it allows repetition of the same field, and only the last one counts. So, length-prefixing or not, you cannot stream it, unless you are sure that you don't send any hierarchical data (eg. you are sending a list of integers / floats).

Ah, also, another problem: "default fields". Until you parsed the entire message you don't know if default fields are present and whether they need to be initialized on the receiving end.


> length prefixing is annoying when streaming

This can be avoided with a magic number: if the length field is 0, the message length isn't known.


That does leave one problem: you still need a way to segment your stream. Most length-prefixed framing schemes do not have any way to segment the stream other than the length prefix. What you wind up wanting is something like chunked encoding.

(Also, using zero as a sentinel is not necessarily a good idea, since it makes zero length messages more difficult. I'd go with -1 or ~0 instead.)


Using `-1` would require using a signed integer for the length, which I guess could be done if you're fine with having the maximum length be half as long, but that also raises the question of what to do with the remaining negative values; what does a length of -10 mean?

I thought -0 was only a thing in floating-point numbers, not integers, and using floats for the length of a message sounds like a nightmare to me.


Ah, I was a little unclear. I mean ~0 as in NOT 0, an integer with all bits set. This is also the same as -1 in two's complement. So basically, I'm suggesting you use the maximum unsigned integer value as a sentinel. That doesn't work if you're using a variable-length unsigned integer like base128vlq, but in that case you could always make a special sentinel (e.g. unnecessarily set the high bit and follow it with a zero byte; this would never normally appear in a base128vlq).
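
Sketch of that sentinel: a canonical base-128 VLQ encoder never emits a continuation byte followed by 0x00 (that would be a needless leading zero group), so the two-byte sequence 80 00 can safely mean "length unknown":

    UNKNOWN_LENGTH = b"\x80\x00"  # non-canonical encoding of 0

    def encode_vlq(n: int) -> bytes:
        out = bytearray()
        while True:
            b = n & 0x7F
            n >>= 7
            if n:
                out.append(b | 0x80)
            else:
                out.append(b)
                return bytes(out)

    assert encode_vlq(0) == b"\x00"          # canonical zero is one byte,
    assert encode_vlq(0) != UNKNOWN_LENGTH   # so the sentinel never collides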


Or it doesn't have a length. For messaging protocols - or in general - magic should be avoided at all times.


If you have random access, you could leave some space and then go back and fill in the actual length value. This works better with a fixed-size integer, as you know ahead of time how much space to leave.


Just leave some empty electromagnetic waves in the cables before your message, got it.


If you’re streaming, you generally don’t have random access.


Somehow I missed the ‘streaming’ part… my bad


Some compression formats such as gzip support encoding streams.

This is useful when you don't know the size in advance, or if you compress on demand and want the receiver to start reading while the sender is still compressing.

One example could be a web service where you request dynamic content (like a huge CSV file). The client can start downloading earlier, and the server doesn't need to create a temporary file: the web service streams the results directly, encoding them in chunks.


> Some compression formats such as gzip support encoding streams.

More accurately speaking, gzip (and many other compressed file formats) does carry size information, but that information is appended after the data (required for gzip, optional for others). Protobuf doesn't carry any such information, so a better analogue would be the raw DEFLATE compressed bytestream format.


Gzip in particular has something more akin to a size checksum appended at the end, i.e., the decompressed size modulo 2^32. This is not very helpful: as the maximum compression ratio is ~1032:1, this "size" can already overflow for a gzip file that is only 4 MiB in compressed size.

https://stackoverflow.com/a/69353889/2191065


You may want to have yield/streaming semantics where the length is not known in advance.


The length would have to be known in advance in order to be written, so the message couldn't be written incrementally. You need a more complex framing scheme like Consistent Overhead Byte Stuffing for that. And many applications do want a variable number of length bytes, because i) 4 bytes is actually too long for short messages and ii) some messages can exceed 2^32 bytes. Not to say the generic varint encoding is good for this purpose, though [1].

[1] If you ever have to design one, make sure that reading the first byte is enough to determine the number of subsequent length bytes.
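
A sketch of the property footnote [1] asks for, with an illustrative (non-standard) scheme: the top two bits of the first byte give the count of extra length bytes, and the remaining six bits are the high bits of the length:

    def encode_length(n: int) -> bytes:
        for extra in range(4):                     # 0..3 extra bytes
            if n < 1 << (6 + 8 * extra):
                payload = n.to_bytes(extra + 1, "big")
                first = payload[0] | (extra << 6)  # tag in the top two bits
                return bytes([first]) + payload[1:]
        raise ValueError("length too large for this sketch (max 2**30 - 1)")

    def decode_length(stream) -> int:
        first = stream.read(1)[0]
        extra = first >> 6                         # known from byte one alone
        rest = stream.read(extra)
        return int.from_bytes(bytes([first & 0x3F]) + rest, "big")

    # encode_length(5) == b"\x05"; encode_length(300) == b"\x41\x2c" --
    # the decoder reads 0x41, sees tag 1, and knows one more byte follows.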


> 4 bytes is actually too long for short messages

Would it ever be an actual bottleneck though? If it's not actually impeding throughput, I feel like this is more of an aesthetic argument than a technical one, and one where I'd happily sacrifice aesthetics to make the code simpler.

> some message can exceed 2^32 bytes

Fair enough, but that just makes the question "would 8 bytes per message ever actually be a bottleneck", which I'm still not convinced would ever be the case


I agree that 4 byte overhead in a single item is not a big deal even when the item itself is short. But if you have multiple heterogeneous items, it becomes worthwhile to set an upper bound for the relative ratio of overhead over data. (SQLite, for example, makes extensive use of varints for this reason [1].) I personally think a single byte overhead is worthy for sub-100-byte data, while larger ones should use a fixed size prefix because the overhead is no longer significant.

[1] https://www.sqlite.org/fileformat.html#record_format


I work on an embedded OS where our IPC buffer is only about 140 bytes.

Anything more requires multiple IPCs, with lots of expensive context switches.

Wasting even one precious byte on a pointless header would absolutely be an issue in this environment.


That makes sense. I've probably been thinking more about remote network protocols, where the time it takes for a message to reach its destination will be so large that the overhead for a few extra bytes per message is negligible; for a protocol used solely within a given local system, there would be a much lower threshold of acceptable overhead.


> and it allows the receiving side of a message to know up front exactly how much memory to allocate

Not necessarily. Can you really trust the length given from a message? Couldn't a malicious sender put some fake length to fool around with memory allocation?

I was under the impression that something like this caused Heartbleed (to use one example):

* https://en.wikipedia.org/wiki/Heartbleed


Heartbleed was caused by allowing the user to specify the length of the response, not that of a request.

When receiving a message, if the user gives you a wrong length, you'll simply fail in parsing their message. Of course, it is up to you to protect against DOS attacks (like someone sending you a 5 TB message, or at least a message that claims it is 5TB) - but that is necessary regardless of whether they tell you the size ahead of time or not.

With Heartbleed, a user sent a message saying "please send me a 5MB heartbeat message", and OpenSSL would send back a 5MB re-used buffer, of which only the first few bytes were overwritten with the heartbeat content. The rest of the bytes were whatever remained from a previous use of the buffer, which could be negotiated keys etc.


I don't see how sending a bad length could cause a memory issue in this case; if a message has a length that's much longer than expected, the receiving side could just discard any future messages from that destination (or even immediately close the connection). If the message is much shorter than the data received, the bytes following it would be treated as the start of a new message, and the same logic would apply.


This is nothing general. It's just that whoever created Protobuf "forgot" to do it. There was no reason not to, given how everything else is encoded in Protobuf.

My guess is that Protobuf was implemented first and designed later. And by the time it was designed, the designer felt too lazy to do anything about the top-level message's length. There are plenty of other technical bloopers that indicate a lack of up-front design, so this wouldn't be very unlikely.


Streaming has already been mentioned. Efficiency might be another argument. If your messages are typically being sent through a channel that already has a way of indicating the end of the message then having to express the length inside the message as well would be a useless overhead in bytes sent and code complexity.


This assumes that only messages from controlled sources will be received though, right? If you're receiving messages over a TCP socket or something similar, that seems like a potentially flawed assumption; I'd think anything parsing messages coming from the network should be written in a way that explicitly accounts for malicious messages coming from the other side of a connection.

EDIT: I'm also still not any more convinced that four bytes per message would ever be a bottleneck for any general purpose protocol, but I'd be curious to hear of a case where that would actually be an issue.


> This assumes that only messages from controlled sources will be received though, right?

I don't think so. The question of whether you trust the length indication to be correct (you almost certainly shouldn't) seems to me to be independent of whether the length indication comes from inside the message or from some outside wrapper.


I might have misunderstood what you were suggesting; from rereading, it sounds like you're suggesting to rely on something like the end of a TCP packet rather than having an explicit length. If this is what you mean, my concern would be that requiring a protocol to map 1:1 to the transmission protocol's packets (or a similar construct) can be limiting; it would require all messages to fit in a single buffer, which could prevent it from working with clients or servers configured to use a different length and might make it difficult to use with other transmission protocols.

My question at the beginning of this thread was intended to be specifically about general-purpose formats like protobuf; I think relying on the semantics of TCP or something like that might be a good choice for a bespoke protocol, but it doesn't seem like a great idea for something expected to be used in a wide variety of cases.


For a slow protocol like protobuf that is rarely streamed, I agree a length prefix should be the default.

One way to make streaming work is just to allow the length value to be bigger than needed and add a padding scheme at the end of the message. This is overhead-free in terms of processing time, since fields must be decoded sequentially anyway.


> For a slow protocol like protobufs that is rarely streamed...

In my experience, protobufs are often streamed, especially in the cases where performance matters.


Protobuf over UDP can use the UDP payload length. Likewise for the many variants of self-synchronising DLE framing (DLE,STX..DLE,ETX) used on serial links.

A varint length field prepended to protobuf messages (sent over a reliable transport, such as TCP) seems sane.


Framing is a distinct concern from payload serialization.

Most protocols and serialization formats already define a form of length-prefixed framing; requiring that a protobuf payload also carry such a header would simply be a waste of bytes.

Additionally, it ensures that protobuf can be serialized and streamed without first computing the payload length, which would require serializing the entire message first.


Pedantry regarding the article:

The field prefix byte in Protobuf doesn't really encode "tag and type" as stated in the article; it encodes tag and size (whether the field is fixed 64-bit, fixed 32-bit, varint, or variable size).

This is pretty self-evident when you look at how submessages are encoded the same way as strings: both are just arbitrary variable-length blobs.

You cannot reliably determine from a Protobuf message whether a field is an integer, a double, a bool, or an enum without the schema.

Protobuf is a TLV format that just happens to have a compact binary encoding.


Isn't this exactly what `writeDelimitedTo` does? https://protobuf.dev/reference/java/api-docs/com/google/prot...

I thought this was common and well-known, but apparently not.


> we started looking at this protobuf library he had selected, and sure enough, the author decided it was a good idea to prepend the message with the message length encoded as a varint.

> WHY? Oh, why?!

> And yes, it turns out that other people have noticed this anomaly. It's screwed up encoding and decoding in their projects, unsurprisingly. We found a (still-open) bug report from 2018, among others

If anyone knows which library/language these issues the author is talking about are in, please tell us. I'd like to avoid that library if possible


> If anyone knows which library/language these issues the author is talking about are in, please tell us.

https://github.com/protobufjs/protobuf.js/issues/987 maybe (based on "And yes, that string in this post is entirely deliberate").


This is a bug related to decoding what the user believes is a protobuf file but is actually a length-prepended protobuf file. The article would complain about whoever wrote that file, not about the code that is reading it.


https://github.com/protobufjs/protobuf.js/wiki/How-to-revers...

Prepending the message with the length means the message is length-limited. Seems standard practice here.


Yep, I think it's quite standard. The BitTorrent protocol also uses it extensively: https://wiki.theory.org/BitTorrentSpecification#Bencoding


I feel like I recall seeing something similar in nanopb.


That would be the official implementations of protobuf from the people that created it.


Folks, if you send raw PB over the wire in a protocol without framing, it _needs_ to be length prefixed somehow. This type of "bug" isn't an unreasonable thing to do, nor uncommon even. If I had a gripe, it's that I hate varints. Lots of wasted cycles to save not that many bytes in the grand scheme of things, especially when you consider other forms of compression layered on top, MTU, etc.


It’s absolutely an unreasonable thing to do.

You don’t conflate framing with payload by emitting invalid non-standard framed data from a “protobuf” encoder. They’re separate concerns and need to remain that way.


If you read the post, the conclusion is:

> you skip the "helper" function that's breaking things.

Yea, OK, I'm just going to assume this helper function added framing unless told otherwise. Where in this post did you even read that framing and payload data were conflated? (Not to mention that there are better protocols that include framing metadata.)


Except that it’s part of the official implementation from google. It’s how you write multiple such messages to a stream.


There’s a non-default `writeDelimitedTo()` that emits a length-prefixed message.

The result is not protobuf and doesn’t claim to be.

Looking at my employer’s protobuf runtime for our major programming language, we don’t even support it.


reading and writing length delimited messages is part of the official implementations. [1] I don't understand the claim of "non-default". There are multiple ways to write protobuf data, and this is one of them. It's actually the only one I've ever used, because why would you not?

If your implementation doesn't support something in the reference implementation, that seems like your problem, not anyone else's.

[1] https://protobuf.dev/reference/csharp/api-docs/class/google/...


> It's actually the only one I've ever used, because why would you not?

Because it’s not part of the protobuf specification, not part of a valid protobuf message, and framing is a transport/file format concern and should not be performed by default.

If someone is expecting to receive a protobuf-encoded payload, it must not include a framing header.

> If your implementation doesn't support something in the reference implementation, that seems like your problem, not anyone else's.

Someone else’s failure to follow the spec is not our problem.


OP just told you it was in the spec, and then linked you directly to it. You're just repeating your previous post without adding anything new. Here's a more direct link:

https://protobuf.dev/reference/csharp/api-docs/class/google/...


In my job I often have to deal with developers who were supposed to implement a standard and went with a "more like guidelines" mentality. So I'm with the author here: either you claim to support protobuf and fully implement the spec as-is, or you cannot claim to support protobuf (with maybe some leeway for a "receive-only" label if you support a superset). Otherwise you have to name/specify your protocol modification.


> They don't have their own framing.

...and thus it was added.

> We finally had to get down to individual bytes from the network dump to try to sort it out.

The perils of abstraction strike again. Chances are that if you had just written the code to directly send and receive the data you wanted, since you control both ends and know exactly what the bytes will be, you'd never have run into this.


> Chances are that if you had just written the code to directly send and receive the data you wanted, since you control both ends and know exactly what the bytes will be, you'd never have run into this.

Instead you would have run into different problems..


It's indeed a shame: protobuf has plenty of spare bit patterns for the first byte (the lowermost 3 bits can't be 6 or 7, and 4 can't be used alone either), and some of them could have been used for a unified length-delimited framing format.


On one side it's not great; it could be made more obvious to the library user.

On the other side, if they wrote multiple projects and never noticed the behavior, how bad can it be?

It would be interesting if that extra header optimized the processing a lot, pushing other libraries to adopt it as an option.


> On the other side, if they wrote multiple projects and never noticed the behavior, how bad can it be?

Many projects will choose a standard encoding to get language independence, but start by using the same language and libraries on both ends of the pipe. Therefore, you might not notice the library is incompatible until quite late in a project's development, when you try to replace a component with a seemingly compatible alternative implementation.


It looks like Protobuf actually has a test suite to ensure compliance with the protocol:

https://github.com/protocolbuffers/protobuf/tree/main/confor...

Seems like a good idea for protocols in general to have an official test suite, as a way to address this problem


The conformance suite tests encode and decode for single messages; you wouldn't run it against a custom framing implementation. The gRPC interop tests are closer to what you'd want.

Framing appears higher up the stack, as an RPC transport or as structured storage like the recordio format referenced by the author. The article sounds like the client expected application/protobuf but the server sent application/custom-protobuf-framing.


I think it is easy to confuse the serialization format with framing, because many serialization formats do self-delimit. And the protobuf terminology of "message" may suggest framing...


Not protobuf, but liblo, a library I maintain for Open Sound Control, ended up supporting two different packetization standards for stream-based transmission, because there was a lack of this in the original specification.

Originally OSC was intended to be "transport independent" and in practice only used for UDP transmission, so packetization was left as a problem for implementations to figure out. Cue the predictable problem of different incompatible ideas.

In the specification update it was suggested to use SLIP encoding for packetization, but prior to this the library had already implemented exactly this kind of length-prefix encoding. So now the library allows to select one or the other for sending, and for backwards compatibility on reception the library tries to sniff which packetization protocol is in use. (Fortunately OSC has a predictable first character, and 99% of the time this does not line up with the first byte of the length prefix, so when it starts with that character, the software switches to SLIP mode and scans for the end-of-packet code, otherwise it assumes there is a length prefix.)

It's not ideal, but it works. But it would have been simpler if the original specification had mentioned stream-based transmission. I still get the occasional question about "what are these extra bytes" at the beginning of the message when TCP is used.

Overall I find that a lot of standards have had to be hacked a bit to support packetized transmission, including JSON [0], etc. It's an odd thing to leave out, but admittedly it is indeed a transport problem and most often not considered part of the "file format". So I don't know what the best solution is, but I guess it would be good if data format specifications say something about how to handle multiple serialized instances in a stream if no other standards apply.

Interestingly YAML seemed to have thought about this with the "---" and "..." symbols, even in version 1.0 of the spec [1].

[0]: https://jsonlines.org/

[1]: https://yaml.org/spec/1.0/index.html


This is very similar to a trap I fell into with some off-brand LZ4 libraries that are lying around on GitHub. One, called "golz4", sounded harmless, but it's actually designed to be compatible with this Python mistake: https://python-lz4.readthedocs.io/en/stable/lz4.block.html

Length-prefixing an otherwise standard format, by default, kept me confused for some hours.


Yes, I fall into this trap several times a decade as well. It's incredibly annoying. Especially when you're reading a spec and it says "LZ4" but implementations implicitly expect the length prefix.

Drives me nuts!


I spent several hours earlier this year helping a colleague debug a protobuf deserialisation error, where I noticed that the first byte was a varint-encoded length of the rest of the data! He did eventually get things working, though I never followed up on the root cause.

I can’t wait to get back from vacation to ask if it was this.


In fact the sender chose the standard writeDelimitedTo and the receiver chose to read without the delimiter.

Both ways are in the standard, but both ends have to agree. One could argue that this is a flaw in protobuf itself, but the problem here was not a non-standard implementation.

The problem here was lack of experience and hasty finger-pointing.


I would argue that. Preambles aren’t some quaint thing that old idiots stuck in protocols. They knew what they were doing.


Website looks down to me, here's an archive copy: https://archive.ph/OkylM


What library was that, so I know to avoid it?


https://github.com/protobufjs/protobuf.js/issues/987 maybe (based on "And yes, that string in this post is entirely deliberate").


Seems like protobuf.js has the exact same methods as Google's implementations, with the same names (encode/decode to not prepend the length, encodeDelimited/decodeDelimited to prepend it). It is hard for me to say they're adding to the standard when they're just replicating Google's libraries.

https://github.com/protobufjs/protobuf.js#toolset


Ah yes, you've got it.



Oh... Protobuf... I had the displeasure of implementing the binary encoder/decoder and IDL parser. This is such a clown fiesta with so much advertising around it... I think some venerable Google guru blessed it at some point or something like that, but when it comes down to the actual format and the tools, it's like some not-very-bright first- or second-year student designed and implemented it.

So, here are just few things that made me decide once and forever never to touch this format:

* It cannot do streaming. It pretends that it can, but it cannot. The problem is that things that should be interpreted as hash-tables or call them "structs" don't require that key-value pairs have unique keys. Also, the last duplicate wins. So, unless you finished parsing all the pairs you cannot call any handlers / construct the "struct" because you don't know if you have the right values for it.

* The grammar is written by someone who... maaaay have seen a grammar... once... long time ago. It makes absolutely no sense. It was written after this mess was somehow implemented, but was never really checked. It's pure nonsense and nobody wants to fix it because nobody really knows how it's supposed to work, nor would they know how to encode in any grammar the actual behavior of the parser.

* Some details like default values for fields that aren't sent or bad (ambiguous) syntax that mixes package names and namespaces...

* Unnecessary constraints on field names that are motivated by unnecessary functionality to translate Protobuf to JSON.

* (C++-specific) The idea of adding messages together is the pinnacle of first-year C++ programming: overriding a very commonly used operator in a way that breaks every contract that operator makes, for no gain except to confuse and inconvenience everyone using it (usually accidentally).

---

So, not surprisingly, there's another "feature" in Protobuf that needn't exist, shouldn't exist, and probably most implementations don't implement, but yee-haw! Someone did add it.

My understanding here is that the underlying problem is as follows: "messages" (the composite unit in Protobuf) don't encode their length at top level. That's idiotic, but that's how it is. So anyone who wants to send a sequence of messages needs to invent a tiny little bit on top of the Protobuf format to tell the other end how to separate the incoming messages. Different Protobuf implementations do this differently. Some use the same encoding Protobuf uses for varints, others use a fixed-length int, some send a whole special header which, besides other things, contains the length information...

My guess is that what happened is that the OP and his/her friend chanced on two implementations that didn't agree on how to separate messages sent in sequence. Quite possibly, one implementation wasn't even designed to send messages in sequence and so had no mechanism for separating them (implying that whoever uses the library should implement one), while the other had some mechanism in place. -- I've been there with eg. Scala and C sending messages to each other. This is probably more common than just Scala + C.


Everybody discovers too late that protocol buffers doesn't have any typing and then hacks on a header solution of their own. Tale as old as time.


Or everyone thinks all the complaints are haters and ignores them.



