
> Yes Avro supports a JSON encoding, but it's not canonical.

To be clear, I'm not talking about using an Avro library to encode data to JSON. I'm saying that when you decode an Avro document, the result that comes out — presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them — is a JSON document.

Where, by "JSON document" here, I don't mean "a JSON-encoded text string", but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less. The sum-type of (JSONArray | JSONObject | String | Integer | Float | true | false | null). The "top" expression type recognized by a JSON parser. Some might call such a document a "JSON DOM." But usually it's a "JSON document", same as how the ADT you get by parsing XML is usually referred to as an "XML document."

Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data. They're all alternative encodings that have equivalent lossless encoding power over the same supported inputs.



> I'm saying that when you decode an Avro document, the result that comes out (presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them) is a JSON document.

Semantic point: it's not a "document".

There are tools which will decode Avro and output the data in JSON (typically using the JSON encoding of Avro: https://avro.apache.org/docs/current/spec.html#json_encoding), but the ADT that is created is by no means a JSON document. The ADT that is created has more complex semantics than JSON; JSON is not the canonical representation.

> By which I don't mean JSON-encoded text, but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less.

Except Avro has data types that are not the exact set of types that exist in JSON. The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.

> Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data

BSON, by design, was meant to be a more efficient way to encode JSON data, so yes, it is a way to encode JSON-typed data. Avro, however, was not defined as a way to encode JSON data. It was defined as a way to encode data (with a degree of specialization for the case of Hadoop sequence files, where you are generally storing a large number of small records in one file).

A simple counterexample: Avro has a "float" type, which is a 32-bit IEEE 754 floating point number. Neither JSON nor BSON have that type.
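A quick way to see this concretely, assuming you have the fastavro package around (just a sketch, not anything canonical):

    import io
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    # Avro "float" is IEEE 754 binary32; Python floats (and typical JSON
    # number handling) are binary64. Pushing a double through an Avro
    # "float" field loses precision that a JSON round-trip would keep.
    schema = parse_schema({
        "type": "record",
        "name": "Sample",
        "fields": [{"name": "x", "type": "float"}],
    })

    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"x": 0.1})
    buf.seek(0)
    print(schemaless_reader(buf, schema))  # e.g. {'x': 0.10000000149011612}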

Technically, JSON doesn't really have types, it has values, but even if you pretend that JavaScript's types are JSON's types, there's nothing "canonical" about JavaScript's types for Avro.

Yes, you can represent JSON data in Avro, and Avro in JSON, much as you can represent data in two different serialization formats. Avro's data model is very much defined independently of JSON's data model (as you'd expect).


> The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.

My understanding was always:

1. that the "primitive Avro types" are Avro's wire types, which are separate from its representable domain types. (Sort of like how RLE-ified data has wire types of "literal string" and "repeat literal N times".)

2. that any data that would not be valid as input to a JSON encoder is not valid as input to an Avro encoder, because its wire types are defined in terms of a mapping from a set of domain types that is exactly the set of domain types accepted by JSON encoders (whether they're explicitly noted as such or not).

Or, to put that another way: an Avro schema is — besides a validation step that constrains your data into a slightly-more-normalized/cleaned format — mostly a big fat hint for how to most-efficiently pack an (IMHO strictly JSONly-typed) value into a binary encoding. Differences between "long" and "int" on the wire aren't meant to decode to different domain types (at least, by default); they're just meant to restrict the data's allowed values (like a SQL DOMAIN constraint) in ways that allow it to be more predictable, and so to be wire-encoded more optimally.
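To make the int/long point concrete: in the binary encoding, "int" and "long" are written with the exact same zig-zag varint scheme; the schema distinction only constrains the range. A rough sketch of that shared encoding (my own illustration, not library code):

    def avro_zigzag_varint(n: int) -> bytes:
        """Avro's binary encoding for both "int" and "long": zig-zag
        mapping of the sign, then a base-128 varint."""
        z = (n << 1) ^ (n >> 63)         # zig-zag: small magnitudes, small codes
        out = bytearray()
        while True:
            byte = z & 0x7F
            z >>= 7
            if z:
                out.append(byte | 0x80)  # set the continuation bit
            else:
                out.append(byte)
                return bytes(out)

    # The same value encodes identically whether the schema says "int" or "long".
    assert avro_zigzag_varint(1) == b"\x02"
    assert avro_zigzag_varint(-1) == b"\x01"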

Let me lay out some evidence for that assertion:

• Avro supports specifying e.g. "bytes" vs. {"array": "byte"} — there's literally no domain-type difference in those! But one is a wire-encoding optimization over the other.

• Avro has a "default" property, and this property—as part of the JSON-typed schema—can only take on JSON-typed values. Do you think this is an implementation constraint, or a design choice?

• Avro's enum type's "symbols" array? Once again, defined by (and therefore limited to) JSON string values.

• Avro doesn't implement an arbitrary-precision integer type, even though its wire-encoding for integers would support one just fine. Why? Seemingly only because JSON doesn't have an arbitrary-precision integer type (because JavaScript doesn't have a native BigNum type); nor does JavaScript/JSON have any obvious type to O(1)-efficiently deserialize a BigNum out into. (Deserializing BigNums to strings wouldn't be O(1).) Every other language offers a clean 1:1 mapping for bignums, but JavaScript doesn't, so JSON didn't, so Avro doesn't.

• And why do you think Avro schemas are stored as embedded explicitly-defined-to-be-JSON documents within the root-level record / .avsc file, anyway? This means that you are required to have a JSON decoder around (either at decode time, or at decoder codegen time) to decode Avro documents. Why would this be, if not because the Avro implementation is (or at least originally was) expected to decode the Avro document's wire types into the JSON library's already-defined ADTs, relying on e.g. having those "default"-parameter values already loaded in JSON-value format from the schema's decode-output, ready to be dropped seamlessly into the resulting Avro decode-output?

And the biggest knock-down argument I'm aware of:

• Avro "string" doesn't support "\u0000". Why not? Because as you've said, Avro has a "JSON encoding", which specifies one-to-one mapping for strings; and JSON doesn't support "\u0000" in strings. (Just ask Postgres's jsonb type about that.) Since an Avro string containing "\u0000" wouldn't round-trip losslessly between the JSON and binary wire-encodings, it's not allowed in strings in the binary encoding.


Since it is a serialization format, Avro's types are its wire types. However, the primitive types are just a subset of the types that Avro supports.

Based on these comments, my best guess is you got the idea that Avro was for encoding JSON because the schema declaration is encoded in JSON, but that's not nearly the same as the data model. There are some terrible implementations of Avro libraries out there that use JSON as some kind of middleware, but that's not how Avro actually works.

If there's a type model it is derived from at all, it's the Java type model.

"byte" is not a valid type in Avro. There is only "bytes", and the domain model reflects this. You can't work with individual "byte" of a "bytes" object.

Default values are encoded in the schema, and so that does limit what kinds of default values you can have, but again this is a limitation derived from the schema being defined in JSON, and from how the schema language was defined in general. So your defaults have to be represented as JSON literals, but they don't even necessarily share the type of the JSON literal (e.g. a field defined as '{"name": "foo", "type": "long", "default": 1}' does not have the same default value as '{"name": "bar", "type": "int", "default": 1}', because "foo" has a default value that is a long while "bar" has one that is an int). Note that "default values" are a property of the type, and only apply to elements inside complex data types. JSON has no equivalent concept.
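To illustrate (a sketch using fastavro; the schemas here are made up): the default never appears in the encoded data at all. It lives in the reader's schema and gets applied during schema resolution, typed according to the field's declared Avro type.

    import io
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    # The writer never knew about "foo"; the reader's schema supplies the
    # default at read time, even though the schema document can only spell
    # that default as a JSON literal.
    writer_schema = parse_schema({
        "type": "record",
        "name": "Rec",
        "fields": [{"name": "id", "type": "int"}],
    })
    reader_schema = parse_schema({
        "type": "record",
        "name": "Rec",
        "fields": [
            {"name": "id", "type": "int"},
            {"name": "foo", "type": "long", "default": 1},
        ],
    })

    buf = io.BytesIO()
    schemaless_writer(buf, writer_schema, {"id": 7})
    buf.seek(0)
    print(schemaless_reader(buf, writer_schema, reader_schema))  # {'id': 7, 'foo': 1}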

Avro's type model does have an arbitrary precision type that doesn't correlate to anything in JSON: the "decimal" logical type.
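For example (again a sketch, assuming fastavro's logical-type support):

    import io
    from decimal import Decimal
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    # "decimal" is a logical type layered on bytes/fixed, with precision and
    # scale declared in the schema; there's no JSON counterpart.
    schema = parse_schema({
        "type": "record",
        "name": "Price",
        "fields": [{
            "name": "amount",
            "type": {"type": "bytes", "logicalType": "decimal",
                     "precision": 10, "scale": 2},
        }],
    })

    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"amount": Decimal("19.99")})
    buf.seek(0)
    print(schemaless_reader(buf, schema))  # {'amount': Decimal('19.99')}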

You aren't required to use a JSON decoder to decode Avro documents, nor are you required to use a .avsc file. The Avro schema file is just the standard way to represent a schema. If you have the schema, you don't need the file. JSON schema files are one of the poorer choices in the Avro design, but you'll notice that the schema is defined the way it is specifically so that it can cover a type model well outside of JSON. You'll also notice the names of types in Avro don't directly correlate to names of types in JSON.

The \u0000 thing is a bug in avro tools, but there is nothing in the spec that prohibits having \u0000 in your strings.

I feel like in general this is like a retcon exercise, where you've reached a conclusion and are drawing evidence to prove it, while ignoring the information that contradicts it. I spoke with Cutting a fair bit when he came up with Avro, and I can assure you that, while the schema language does very intentionally use JSON, Avro is not a representation for JSON types.



