Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I don't understand why serialization formats that separate structure and content aren't more popular.

Imagine a system every message is a UID or DID (https://www.w3.org/TR/did-core/) followed by raw binary data. The UID completely describes the shape of the rest of the message. You can also transmit messages to define new UIDs: these messages' UID is a shared global UID that everyone knows about.

Once a client learns a UID, messages are about as compact as possible. And the data defining UIDs can be much more descriptive than e.g. property names in JSON. You can send documentation and other excess data when defining the UID, because you don't have to worry about size, because you're only sending the UID once. And UIDs can reference other UIDs to reduce duplication.



You just described protobufs and all its successors.

See the “@0xdbb9ad1f14bf0b36” at the top of this capnproto file for example: https://capnproto.org/language.html

It’s a 64bit random number so it’ll never have unintentional collisions.

Also note that a capnp schema is natively represented as a capnp message. Pretty convenient for the “You can also transmit messages to define new UIDs” part of your scheme :)


Protobufs is a boring old tag-length-value format. It's kind of the worst of both worlds because it has no type information encoded in to it, meaning it's useless without the schema, while still having quite a bit of overhead.

Capn'Proto is more like a formalization of C structs in that new fields are only added to the end. If memory serves, on the wire there is no tag, type or length info (for fixed size field types), and everything is rooted at fixed offsets


Mostly right. Allow me to provide some wonky details.

Protobuf uses tag-type-values, i.e. each field is encoded with a tag specifying the field number and some basic type info before the value. The type info is only just enough information to be able to skip the field if you don't recognize it, e.g. it specifies "integer" vs. "byte blob". Some types (such as byte blob) also have a length, some (integer) do not. Nested messages are usually encoded as byte blobs with a length, but there's an alternate encoding where they have a start tag and an end tag instead ("start group" and "end group" are two of the basic types). On one hand, having a length for nested messages seems better because it means you can skip the message during deserialization if you aren't interested in it. On the other hand, it means that during serialization, you have to compute the length of the sub-message before actually serializing it, meaning the whole tree has to be traversed twice, which kind of sucks, especially when the message tree is larger than the L1/L2 cache. Ironically, most Protobuf decoders don't actually support skipping parsing of nested messages so the length that was so expensive to compute ends up being largely unused. Yet, most decoders only support length-delimited nested messages and therefore that's what everyone has to produce. Whoops.

Now on to Cap'n Proto. In a given Cap'n Proto "struct", there is a data section and a pointer section. Primitive types (integers, booleans, etc.) go into the data section. This is the part that looks like a C struct -- fields are identified solely by their offset from the start of the data section. Since new fields can be added over time, if you're reading old data, you may find the data section is too small. So, any fields that are out-of-bounds must be assumed to have default values. Fields that have complex variable-width types, like strings or nested structs, go into the pointer section. Each pointer is 64 bits, but does not work like a native pointer. Half of the pointer specifies an _offset_ of the pointed-to object, relative to the location of the pointer. The other half contains... type information! The pointer encodes enough information for you to know the basic size and shape of the destination object -- just enough information to make a copy of it even if you don't know the schema. This turns out to be super-important in practice for proxy servers and such that need to pass messages through without necessarily knowing the details of the application schema.

In short, both formats actually contain type information on the wire! But, not a full schema -- only the minimal information needed to deal with version skew and make copying possible without data loss.


I wouldn't call what protobuf encodes type information. If I recall all the group stuff is deprecated, so what's left basically boils down to 3 types: 32 bit values, 64 bit values and length prefixed values, which covers strings and sub-messages. Without the schema you can't even distinguish strings from sub-objects, as they are both length prefixed as you described.

Can you even distinguish floats and ints without a schema in protobufs? I don't remember.

I really enjoy capnproto, flatbuffers and Avro and bounce between them depending on the task at hand.


> I wouldn't call what protobuf encodes type information.

Well... I would call it type information, just not complete.

> 32 bit values, 64 bit values and length prefixed values

In protobuf, most integer types are actually encoded as varints, i.e. variable-width integers, not fixed 32-bit or 64-bit. varint encoding encodes 7 bits per byte, and uses the extra bit to indicate whether there are more bytes. (It's not a very good encoding, as it is very branch-heavy. Don't use this in new protocols.)

> Can you even distinguish floats and ints without a schema in protobufs?

You can't distinguish between float vs. fixed32. But int32 would be varint-encoded, while floats are always fixed-width, so you could distinguish between those. (You can't distinguish between int32, uint32, and sint32, though -- and sint32 in particular won't coerce "in the obvious way" to the others.)

The really unfortunate thing is you can't distinguish between strings vs. nested messages (of the length-delimited variety). So if you don't specifically know that something is a nested message then it's not safe to try parsing it...


Interesting. Ids in particular are described here: https://capnproto.org/language.html#unique-ids

I wonder if giving it a name based on the hash of the definition has been explored; like Unison [0] where all code is content addressable, but for just capnproto definitions. Is there a reason not to?

[0]: https://www.unisonweb.org


Capnp uses the name of your message, but not its full definition because that would make it impossible to extend protocols in a backwards compatible way. Without the ability to add new fields, making changes to your protocol would be impossible in large orgs.


> It’s a 64bit random number so it’ll never have unintentional collisions.

It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers. That's not inconceivable.


Yes it is. Message schemas are made by humans. Most of these messages will be extended in a backwards compatible manner over the life of a project rather than replaced entirely so their IDs don’t change. That’s kinda the point of protobufs and its successors.

I’ve probably generated 100 IDs over my lifetime.


Which puts it on the same order of magnitude as the number of people on the planet. If every person alive generated a schema (or if 1/100th of all people generate 100 IDs each like you) then we'd have a small number of collisions. More likely you'd get large numbers of schema like that if there's a widespread application of a protocol compiler that generates new schema programmatically, e.g. to achieve domain separation, and then is applied at scale. I'm not saying that's likely, just that it is not, as is claimed, inconceivable.


It's only really a problem if you use the IDs in the same system. It's highly unlikely that you'd link 4B schemas into a single binary. And anyway, if you do have a conflict, you'll get a linker error.

Cap'n Proto type IDs are not really intended to be used in any sort of global database where you look up types by ID. Luckily no one really wants to do that anyway. In practice you always have a more restricted set of schemas you're interested in for your particular project.

(Plus if you actually created a global database, then you'd find out if there were any collisions...)


If you have 4 billion of them generated there’s another 1/4billionth chance you’ll generate a duplicate.

On top of that you would not only need to generate the same ID, you would need to USE it in the same system where that is could have some semantics to not cause an error.


>It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers.

If it's 64 bit, doesn't that mean you'd need to generate ~10000000000000000000000000000000000000000000000000000000000000000 (2^64) of those numbers to have a collision, not 2^32?


If you generate randomly then, due to the birthday paradox, after generating sqrt(N) values you have a reasonable chance of collision.

The birthday paradox is named after the non-intuitive fact that with just 32 people in a room you have > 50% of 2 people having a birthday on the same day of the year.


I think it's 23 people in a room. The canonical example is people on a football (soccer) pitch. With 11 per side plus the referee there's a 50% chance that two will share the same birthday.


> 32 people

Slight correction: only 23 people, actually. So in every second football ("soccer") game, you have two people on the field with the same birthday.


Does birthday paradox apply here? It’s about any pair of people having the same birthday, whereas in this case you need someone else with a specific birthday.

For example, if you generate 2 numbers and they are the same, but are different to the capnproto number, that’s a collision but doesn’t actually matter.

EDIT: It does apply, I misunderstood what the number was being used for.



You’re right, I misunderstood what the magic number was being used for.


But if my application only uses 100 schemas, I only care about a collision if it's with one of those 100.


You have a collision if any two schemas share the id, not if a specific schema collides with any of the others. So it is exactly like the birthday paradox.


Yeah, but that collision probably doesn’t matter because there’s a bunch of other variables that need to come together for it to be an issue at all.


If the schema id is the message id, in principle it could be an issue as the protocol on the wite would be ambiguous. Then again, you should be able to detect any collisions when you register a schema with the schema repo and deal with it at that time.


I don’t understand your maths here: how is generating 4billion of them is any different from generating 3 billion except a slight raise in the probability measure?


When you reach the 4 billionth version of your protocol?


All versions of the same protocol have the same ID. That is the point of IDs -- to link together different versions of the protocol.


You're right! That makes collisions even less likely then.


MD5 is a 128 bit random number no one would ever have thought would collide. 64 bits is peanuts especially when message types are being defined dynamically


Dude that’s why I said “unintentional collisions”.

Of course you can get intentional collisions. The security model here assumes that anyone that wants to know your message’s ID can just ask.

Did you know that the Internet Protocol uses a 4-bit header to specify the format (v4 or v6) of the rest of the message? They should have used 128 bits. What a bunch of fools.


MD5 is safe against unintentional collisions.


This is protocol buffers + a global type registry. I worked on such a system.


Is it public? Id love to learn more about it.


If you read the protobuf source, you can see a bunch of places where you can hook in custom type-fetching code, e.g. in the google.protobuf.Any type.

After studying it a bit, I'm certain this is how it's used inside Google (might also be mentioned elsewhere).

All you'd really need to do is to compile all protos into a repository (you can spit out the binary descriptors from protoc), then fetch those and decode in the client.

It'd actually be quite straightforward to set up,


I think the system OP is describing is a little bit more complex. You're not just describing message types, you also have message templates; a template declares a message type and a set of prefilled fields. You save data by just sending the subset of fields that are actually changing, which is a very good abstraction for market data. The template is hydrated on the protocol parsing layer so your code only has to deal with message types itself.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: