Hacker News
I accidentally used YAML.parse instead of JSON.parse and it worked (rohitpaulk.com)
346 points by rohitpaulk on Jan 24, 2022 | 201 comments



YAML was intended to be a superset, but it isn't quite, which is about the worst case scenario. See https://metacpan.org/pod/JSON::XS#JSON-and-YAML , for instance.

(I am an absolutist on this matter: to be a superset, all, that's A-L-L, valid JSON strings must also be valid YAML. A single failure makes it not a superset. At scale, any difference will eventually occur, which is why even small deviations matter.)


I’ve often heard this (YAML is a superset of JSON) but never looked into the details.

According to https://yaml.org/spec/1.2.2/, YAML 1.2 (from 2009) is a strict superset of JSON. Earlier versions were an _almost_ superset. Hence the confusion in this thread. It depends on the version…


CPAN link provided by the parent says 1.2 still isn't a superset:

> Addendum/2009: the YAML 1.2 spec is still incompatible with JSON, even though the incompatibilities have been documented (and are known to Brian) for many years and the spec makes explicit claims that YAML is a superset of JSON. It would be so easy to fix, but apparently, bullying people and corrupting userdata is so much easier.


Are these documented YAML 1.2 JSON incompatibilities listed / linked to somewhere?

I assume these are something related to non-ascii string encoding / escapes?


They are listed in that same CPAN link:

"Please note that YAML has hardcoded limits on (simple) object key lengths that JSON doesn't have and also has different and incompatible unicode character escape syntax... YAML also does not allow \/ sequences in strings"


The JSON::XS documentation linked above reports that YAML 1.2 is not a strict superset of JSON:

> Addendum/2009: the YAML 1.2 spec is still incompatible with JSON

The author also details their issues in, ah, getting some of the authors of the YAML specification to agree.


I just checked YAML 1.2 and it seems the 1024-character limit on key length is still in the spec (https://yaml.org/spec/1.2.2/, ctrl+f, 1024). So any JSON with longer keys is not compatible with YAML.


The JSON specification [1] states:

> An implementation may set limits on the length and character contents of strings.

So this length limit is not a source of incompatibility with JSON.

[1] https://datatracker.ietf.org/doc/html/rfc7159#section-9


Wow! That makes it pretty hard to know you've generated usable JSON, especially if your goal is cross-ecosystem communication.


To be fair, any JSON implementation is going to have a practical limit on the key size; it's just a bit more random and harder to figure out :)


If you mean limited by available memory, then sure but that does not apply just to key size. If you mean something else, could you elaborate?


Another reason to have a limit well below the computer's memory capacity is that one could find ill-formed documents in the wild, e.g., an unclosed quotation mark causing the "rest" of a potentially large file to be read as a key, which can quickly snowball (imagine if you need to store the keys in a database or a log, or if your algorithms need to copy the keys, etc.).


I assume JSON implementations have some limit on the key size (or on the whole document, which limits the key size), hopefully far below the available memory.


I assume and hope that they do not, if there is no rule stating that such keys are invalid. There are valid reasons for JSON to have massive keys. A simple one: depending on the programming language and libraries used, an unordered array ["a","b","c"] might be better mapped as a dictionary {"a":1,"b":1,"c":1}. Now all of your keys are semantically values, and any limit imposed on keys only makes sense if the same limit is also imposed on values.
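A tiny Python sketch of that transform (the names are my own):

```python
# An unordered array mapped as a dictionary for O(1) membership tests:
# the dictionary keys are semantically values, so a key-length limit
# effectively becomes a value-length limit.
values = ["a", "b", "c"]
as_lookup = {v: 1 for v in values}

assert "b" in as_lookup              # constant-time membership
assert set(as_lookup) == set(values)
```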


Yes, absolutely; in practice the limit seems to be on the document size rather than on keys specifically. That said, it still caps the key size (at a bit less than the max document size), and some JSON documents valid for a given JSON implementation might not be parsable by others, in which case YAML parsers are no exception ;)

I'm not even sure why I'm playing the devil's advocate, I hate Yaml actually :D


I guess it is about different implementations of some not properly formalized parts of the JSON spec.

There was also an article here some time ago but I cannot find it right now.


1024 limit is for unquoted keys, which do not occur in JSON


Have a closer look. The 1024 limit in version 1.2 is only for implicit block mapping keys, not for flow style `{"foo": "bar"}`


In the beginning was the SGML.

Then we said it's too verbose. We named some subsets XML, HTML, XLSX.

Then we said it's still too long. So we named some subsets Markdown, and YML.

Then we said it's still too long, and made JSON.

What's wrong with subsets? Ambiguity in naming things.

https://martinfowler.com/bliki/TwoHardThings.html

Is JSON the same as YML?

NO.

Norwegian?

https://news.ycombinator.com/item?id=26671136


> Then we said it's too verbose. We named some subsets XML, HTML, XLSX

If anything, XML as an SGML subset is more verbose than SGML proper; in fact, getting rid of markup declarations to yield canonical markup without omitted/inferred tags, shortforms, etc. was the entire point of XML. Of course, XML suffered as an authoring format due to verbosity, which led to the Cambrian explosion of Wiki languages (MediaWiki, Markdown, etc.).

Also, HTML was conceived as an SGML vocabulary/application [1], and for the most part still is [2] (save for mechanisms to smuggle CSS and JavaScript into HTML without the installed base of browsers displaying these as content at the time, plus HTML5's ad-hoc error recovery).

[1]: http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html

[2]: http://sgmljs.net/docs/html5.html


Well, Markdown and YML and JSON are not subsets of SGML, nobody claims they are, and nobody intended them as such. So there's that.


While neither Markdown nor JSON syntax was intended as an SGML app, that doesn't stop SGML from parsing JSON, Markdown, and other custom Wiki syntax using SHORTREF [1] ;) In fact, the original Markdown language [2] is specified as a mapping to HTML angle-bracket markup (with HTML also an SGML vocabulary), and thus it's quite natural to express that mapping using SGML SHORTREF, even though only a subset can be expressed.

[1]: https://www.balisage.net/Proceedings/vol17/html/Walsh01/Bali...

[2]: https://daringfireball.net/projects/markdown/


First they came for the angle brackets. And I did not speak out. Because I did not use XML...


You didn't use XML? But We use XML to read the comments here on this HTML web page.

But I came for the angle brackets. Because I < We, eternally.


> Then we said it's still too long. So we named some subsets Markdown, and YML.

> Then we said it's still too long, and made JSON.

JSON is older than markdown and yaml.


Thank you for correcting history! I'd forgotten >_<


I think you'll find that in the beginning were M-expressions, but they were evil, and were followed by S-expressions, which were and are and ever will be good.

SGML and its descendants are okay for document markup.

XML for data (as opposed to markup) is either evil or clown-shoes-for-a-hat insane — I can’t figure out which.

JSON is simultaneously under- and over-specified, leading to systems where everything works right up until it doesn't. It shares a lot with C and Unix in this respect.


If XML for data is bad, check out XML as a programming language. I think this has cropped up a few times; one that stuck with me was templating structures in the FutureTense app server, before it was acquired by OpenMarket and they switched to JSPs or something.

Lots of <for something> <other stuff> </for> sorts of evil.


note: HTML5 is not a subset of SGML.


For example, this valid JSON doesn't parse as YAML:

    {
        "list": [
            {},
                {}
        ]
    }
(tested on Python)

edit: whitespace didn't quite make it through HN, here:

    json.loads('{\n  "list": [\n    {},\n\t{}\n    ]\n}')
    yaml.load ('{\n  "list": [\n    {},\n\t{}\n    ]\n}')


Python's .netrc library also hasn't supported comments correctly for like 5 years. The bug was reported, it was never fixed. If I want to use my .netrc file with Python programs, I have to remove all comments (that work with every other .netrc-using program).

It's 2022 and we can't even get a plaintext configuration format from 1980 right.


> It's 2022 and we can't even get a plaintext configuration format from 1980 right.

To me, it's more depressing that we've been at this for 50-60 years and still seemingly don't have an unambiguously good plaintext configuration format at all.


I've been a Professional Config File Wrangler for two decades, and I can tell you that it's always nicer to have a config file that's built to task rather than being forced to tie yourself into knots when somebody didn't want to write a parser.

The difference between a data format and a configuration file is the use case. JSON and YAML were invented to serialize data. They only make sense if they're only ever written programmatically and expressing very specific data, as they're full of features specific to loading and transforming data types, and aren't designed to make it easy for humans to express application-specific logic. Editing them by hand is like walking a gauntlet blindfolded, and then there's the implementation differences due to all the subtle complexity.

Apache, Nginx, X11, RPM, SSHD, Terraform, and other programs have configuration files designed by humans for humans. They make it easy to accomplish tasks specific to those programs. You wouldn't use an INI file to configure Apache, and you wouldn't use an Apache config to build an RPM package. Terraform may need a ton of custom logic and functions, but X11 doesn't (Terraform actually has 2 configuration formats and a data serialization format, and Packer HCL is different than Terraform HCL). Config formats minimize footguns by being intuitive, matching application use case, and avoiding problematic syntax (if designed well). And you'd never use any of them to serialize data. Their design makes the programs more or less complex; they can avoid complexity by supporting totally random syntax for one weird edge case. Design decisions are just as important in keeping complexity down as in keeping good UX.

Somebody could take an inventory of every configuration format in existence, matrix their properties, come up with a couple categories of config files, and then plop down 3 or 4 standards. My guess is there's multiple levels of configuration complexity (INI -> "Unixy" (sudoers, logrotate) -> Apache -> HCL) depending on the app's uses. But that's a lot of work, and I'm not volunteering...


I quite like CUELang (https://cuelang.org/), although it's not yet widely supported.

It has a good balance between expressivity and readability; it has enough logic to be useful, but not so much that it begs for abuse. It can import/export YAML and JSON and features an elegant type system that lets you define both the schema and the data itself.

I hope it gains traction.


toml is pretty much the best one I have seen so far. At least for small to medium size config files.


Toml has some hairy bits. Lists of objects, lists of lists of objects, objects of lists of objects. Complex objects with top level fields...


Yep,

Although I do feel like there is a case to be made that if you need a Turing-complete configuration language, then in most cases you have failed your users by pushing too many decisions onto them instead of deciding on sensible defaults.

And if you are dealing with one of the rare cases where Turing-complete configuration is desirable, then maybe use Lua or something like that instead.


I'm not defending YAML. YAML is terrible. It's even worse with logic and/or templates (looking at you, Ansible). Toml is certainly better but I'm still baffled as to why we don't have a "better YAML". YAML could almost be okay.


There's also Lua, which is a full Turing complete language but is still pretty nice for writing config files, and is easy to embed.


Followup to my own post: don't forget about Scheme! Same nice properties as Lua, but you get some extra conveniences from using s-expressions (which can represent objects somewhat more flexibly, like XML, than Lua, which is more or less 1:1 with JSON).


There's StrictYAML[1][2]. Can't say I've used it as let's face it, most projects bind themselves to a config language - whether that be YAML, JSON, HCL or whatever - but I'd like to.

[1] https://hitchdev.com/strictyaml/

[2] https://github.com/crdoconnor/strictyaml


Yeah, I think it's because nobody sat down and methodically created it.

People create config languages that work for their use case and then it is just a happy accident if it works for other things.

I don't think anyone has put serious effort into designing a configuration language. And by that I mean collect use cases, study how other config languages do things, make drafts, and test them, etc.


Terraform's HCL is well designed.

I know a lot of people hate it but I find it to be the only configuration language that makes any sense for moderately large configs.

It’s short, readable, and unambiguous, with great IDE support. It's got built-in logic, variables, templates, functions, and references to other resources, without being a Turing-complete imperative language and without becoming an XML monstrosity.

Seriously there is nothing even close to it. Tell me one reasonable alternative in wide use that’s not just some preprocessor bolted onto yaml, like Helm charts or Ansible jinja templates.


There's a world of difference between "simple configuration needs" and "complex configuration needs".

I will take a kubernetes deployment manifest as an example that you would want to express in a hypothetically perfect configuration language. Now, eventually, you end up in the "containers" bit of the pod template inside the deployment spec.

And in that, you can (and arguably should) set resources. But, in an ideal world, when you set a CPU request (or, possibly, limit, but I will go with request for now) for an image that has a Go binary in it, you probably also want to have a "GOMAXPROCS" environment variable added that is the ceiling of your CPU allocation. And if you add a memory limit, and the image has a Java binary in it, you probably want to add a few of the Java memory-tuning flags.

And it is actually REALLY important that you don't repeat yourself here. In the small, it's fine, but if you end up in a position where you need to provide more, or less, RAM or CPU, on short notice (because after all, configuration files drive what you have in production, and mutating configuration is how you solve problems at speed, when you have an outage), any "you have to carefully put the identical thing in multiple places" is exactly how you end up with shit not fixing itself.

So, yeah, as much hate as it gets, BCL may genuinely be better than every other configuration language I have had the misfortune to work with. And one of the things I looked forward to, when I left the G, was to never ever in my life have to see or think about BCL ever again. And then I saw what the world at large are content with. It is bloody depressing is what it is.


Cuelang?


Yeah, absolutely. I think there are four corners to the square: "meant to be written by humans/meant to be written by computers" and "meant to be read by humans/not meant to be read by humans". JSON is the king of written-by-computers, read-by-humans; grpc and swift and protobuf and arrow can duke it out in the written-by-computers, not-read-by-humans corner. We are missing good options in the written-by-humans half.


Dhall and Cue come to mind as ones that _feel_ more designed

https://github.com/dhall-lang/dhall-lang

https://cuelang.org/docs/usecases/configuration/


Interesting...

Me, the programmer finds those kinda cool.

And the sysadmin in me developed a dislike of both within 1 minute of looking at them.

Honestly, I think a good configuration language should be more than a spec; it should come with a library that handles parsing/validation. See, there are two sides to configuration: the user and the program. Knowledge about the values, defaults, and types should live on the program side and should be documented. Then the user side of configuration can be clean and easy to read/write and, most important of all, allow the user to accomplish the most common configuration without having to learn a new config language on top of learning the application.


> Honestly, I think a good configuration library should be more than a spec, it should come with a library that handles parsing/validation

You just described CUELang.

The type system allows you to define a schema as well as the data, in the same file or in two separate ones. Then you can call either a CLI tool (that works on Linux, Windows, or Mac) or use the Go lib (or bind to it).

For compat, cue can import and export to yaml, json and protobuf, as well as validate them.


Isn't Dhall basically the same (=have the same set of features)?


In the same way python and js are basically the same.


Exactly. So if I'm going to learn/use one of them, there's no clear winner, really. Both also seem to have about the same amount of adoption (zero?).


Ok, you have convinced me to give it a serious look.


> Yeah, I think it's because nobody sat down and methodically created it.

I think it's the opposite. There isn't a single config format that suits all needs.

Especially when you realize config isn't a single thing.

http://mikehadlow.blogspot.com/2012/05/configuration-complex...


I guess that could also be the case.

I haven’t studied it, I am just generally feeling unhappy about most software configuration.


About Ansible, I think it gained its success partially due to YAML.

Ansible is worse than Puppet and CFEngine in many ways, but it is superior in the user interface.

It managed to not only be a config management solution, but to provide a universal config language that most apps could be configured with. So for a lot of use cases, if you know Ansible/YAML then you don't have to learn a new configuration language on top of learning a new application.


The problem with Ansible is it's not universal, because most app playbooks are configured in the worst possible way. In my experience you typically get handed an Ansible script, something you'd hoped was declarative but isn't (like a version that apt-get grabs isn't fixed, or even gets patched), then suddenly a downstream templated command fucks up, and the person who wrote the script isn't around anymore (or you don't trust their chops because they are a blowhard who worked at Google/Facebook and had a coddling ops team behind them), or worse it's from "community" and has a billion hidden settings that you can't be bothered to grok, and so you have to dig so many layers down that you are better off just fucking rewriting the Ansible script to do the one thing, which probably should have been four lines.

In any case, I found Ansible scripts to have like a 3 month half life. If we were lucky. I'm not bitter.


haha, I can go on lengthy rants about every single configuration management system that I have used.

My dream configuration system should revert to defaults when the config is removed (keeping data), have a simple/easy user interface, and have maintained modules with sane defaults for the 500 most common server software packages. I would rather there be no module than an abandoned one with unsafe defaults; that way it is clear that I would have to maintain my own if I want to use that particular piece of software. Performant: it really shouldn't take more than a few minutes to apply a config change, and no more than 30 min for an initial run.


Ansible was agent-less from the start, which made it ridiculously easy to sneak into existing infrastructure and manual workflows. I probably would not have been able to stand up Puppet or Salt or whatever, but I could run Ansible all by myself with no one to stop me :).


I'm curious what your thoughts are on a config language I'm working on.

GitHub.com/vitiral/zoa

It has both binary and textual representation (with the first byte being able to distinguish them), and the syntax is clean enough I'm planning on extending it into a markup language as well.


Even if you have sensible defaults don't you still need to be able to parse configured changes?


Not always, sometimes all other options are just wrong. Or you can auto detect the correct setting.


I understand the pragmatic reasons for it being the way it is, but I still wish TOML didn't require all strings to be quoted.


This is why I like INI. It doesn't have these problems, because it doesn't try to wrangle the notion of nested objects (or lists) in the first place. The lack of a formal spec is a problem, sure, but it's such a basic format that it's kind of self-explanatory.
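For instance, Python's standard-library configparser (one INI dialect among many; the section and keys here are made up) keeps everything flat and stringly typed:

```python
import configparser

ini = """
[server]
host = localhost
port = 8080
"""

cp = configparser.ConfigParser()
cp.read_string(ini)

# Just sections of key/value pairs: no nesting, no lists, all strings.
assert cp["server"]["host"] == "localhost"
assert cp["server"]["port"] == "8080"      # a string until you ask otherwise
assert cp["server"].getint("port") == 8080
```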


When the problem is TOML not supporting easy nesting, a solution of "Don't nest." works just as well in TOML as it does in ini. It's not really an advantage of ini. Especially when a big factor in TOML not making it easy is that TOML uses the same kind of [section]\nkey=value formatting that ini does!


You can use TOML as a better INI by limiting yourself to the key/value schema. It's still superior because:

- it has a spec

- it has other types than strings

- you can always decide you actually need nested data, and add them later


I wrote an INI parser that has numerical, boolean, timestamp, MAC address, and IP address types ;) "advantages" of not having a spec!

Seriously: for application-specific config files, the lack of a formal spec can be kind of a nice thing. You can design your parser to the exact needs of your program, with data types that makes sense for your use case. Throw together a formal grammar for use in regression testing, and you're all set.

Obviously a formal spec is essential for data interchange, but that's why JSON exists. To me, YAML is in a gray area that doesn't need to exist. The same thing goes for TOML, but to a far lesser extent.


> it has other types than strings

But isn't the config file just a string?


Everything gets serialized to a string of bytes. The point is that you can fail at parsing when the value doesn't make sense, rather than failing at some point in the future when you decide to use the value and it doesn't make sense. And if you have a defined schema, you can have your editor validate it against the schema when saving, so you don't accidentally have "FILENOTFOUND" in a Boolean.


... lack of a hex float representation ...


We do, it’s called TOML. The future is here it’s just not equally distributed.


TOML sucks for lists of tables simply because they intentionally crippled inline tables to only be able to occupy one line, for ideological reasons ("we don't need to add a pseudo-JSON"). Unless your table is small, it's going to look absolutely terrible crammed into one line.

https://github.com/toml-lang/toml/issues/516

The official way to do a list of tables is (look at how much duplication there is):

  [[main_app.general_settings.logging.handlers]]
    name = "default"
    output = "stdout"
    level = "info"

  [[main_app.general_settings.logging.handlers]]
    name = "stderr"
    output = "stderr"
    level = "error"

  [[main_app.general_settings.logging.handlers]]
    name = "access"
    output = "/var/log/access.log"
    level = "info"
vs

  handlers = [
    {
      name = "default",
      output = "stdout",
      level = "info",
    }, {
      name = "stderr",
      output = "stderr",
      level = "error",
    }, {
      name = "access",
      output = "/var/log/access.log",
      level = "info",
    },
  ]
I would still reach for TOML first if I only needed simple key-value configuration (never YAML), but for anything requiring list-of-tables I would seriously consider JSON with trailing commas instead.


I see the point and this is certainly a drawback of TOML but for me this is something of a boundary case between configuration and data.

When configuration gets so complicated that the configuration starts to resemble structured data I tend to prefer to switch to a real scripting language and generate JSON instead.


Expression languages like Nix, Jsonnet, Dhall, and Cue are really nice in these situations.


for this reason I can't see a CI platform ever seriously considering TOML

(someone may point out to me a CI platform that relies on TOML—which I welcome)


Rust is built on TOML. For better or worse.


Do you mean Cargo? Because Cargo is not a CI system. You never embed shell commands in a Cargo.toml.

If you need to program complex logic to build a crate, you don’t write TOML. You write a build.rs file in actual Rust.


If embedding shell commands in a configuration language is considered a CI system I think we are doomed.


> JSON with trailing commas

JSON5?


It's perfect until you do a lot of nesting..


...or any nesting. TOML sucks for anything non-trivial.


It makes me sad every time I see a newly announced tool that went for YAML instead of TOML.


XML is still good.


Hmm, it looks like it’s handled comments for at least a decade:

https://github.com/python/cpython/blame/d75a51bea3c2442f81d3...

Oh, maybe it’s this issue:

https://bugs.python.org/issue34132

If I’ve read it correctly, there was a regression from Python 2.x to 3.x such that you now need to format comments:

    #like this 
Instead of:

    # like this
(A space after the # isn’t accepted by the parser.)


    try:
        try:
            import orjson as json
        except:
            try:
                import rapidjson as json
            except:
                try:
                    import fast_json as json
                except:
                    import json
        foo = json.loads(string)
    except:
        try:
            import yaml
        except:
            # try harder
            import os
            try:
                assert(os.system("pip3 install yaml") == 0)
            except:
                # try even harder
                try:
                    assert(os.system("sudo apt install python3-pip && pip3 install yaml") == 0)
                except:
                    assert(os.system("sudo yum install python3-pip && pip3 install yaml") == 0)
            import yaml
        try:
            foo = yaml.safe_load(string)
        except:
            try:
                ....


Great idea.

  pip install --user yaml
increases the chances it will work


A note to readers: it's not always a good idea to put automated software installation in a place that users don't expect it.

I've seen that kind of approach cause a ton of issues the moment that the software was used in a different environment than the author expected.

It's much better IMO to fail with a message about how to install the missing dependency.


This is why there should be a way to automatically install software into a sandboxed location, e.g. a virtualenv.

Considering we are having software drive cars today it should be trivial and I would say even arguably expected that software should be able to autonomously "figure out" how to run itself and avoid conflicts with other software since that's a trivial task in comparison to navigating city streets.


Brilliant! What license is this published under?


Free Art License


Tested on python what? I was curious to see what error that produced, figuring it would be some whitespace due to the difference between the list items, but using the yamlized python that I had lying around, it did the sane thing:

    PATH=$HOMEBREW_PREFIX/opt/ansible/libexec/bin:$PATH
    pip list | grep -i yaml
    python -V
    python <<'DOIT'
    from io import StringIO
    import yaml
    print(yaml.safe_load(StringIO(
    '''
        {
            "list": [
                {},
                    {}
            ]
        }
    ''')))
    DOIT
produces

    PyYAML                6.0
    Python 3.10.1
    {'list': [{}, {}]}


With leading tabs it does not work.

  $ sed 's/\t/--->/g' break-yaml.json
  --->{
  --->--->"list": [
  --->--->--->{},
  --->--->--->{}
  --->--->]
  --->}
  $ jq -c . break-yaml.json
  {"list":[{},{}]}
  $ yaml-to-json.py break-yaml.json
  ERROR: break-yaml.json could not be parsed
  while scanning for the next token
  found character '\t' that cannot start any token
    in "break-yaml.json", line 1, column 1
  $ sed 's/\t/    /g' break-yaml.json | yaml-to-json.py
  {"list": [{}, {}]}


This is completely valid YAML.

YAML does not allow tabs in indentation, but the tabs in your example are not indentation according to the YAML spec productions.

You can see it clearly here against many YAML parsers: https://play.yaml.io/main/parser?input=CXsKCQkibGlzdCI6IFsKC...

As tinita points out, sadly PyYAML and libyaml implement this wrong.

See https://matrix.yaml.info/


That's because PyYAML doesn't implement the spec correctly.


Tabs are not valid JSON


Do you have a link for that?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... says:

> Insignificant whitespace may be present anywhere except within a JSONNumber [forbidden] or JSONString [interpreted as part of the string]

And specifically lists tab as whitespace:

> The tab character (U+0009), carriage return (U+000D), line feed (U+000A), and space (U+0020) characters are the only valid whitespace characters.

More specifically, expanding https://datatracker.ietf.org/doc/html/rfc8259#section-2 gives an array as (roughly)

> ws %x5B ws value (ws %x2C ws value)* ws %x5D ws

Where `ws` explicitly includes `%x09`. Which seems to cover this case?


Per RFC 8259:

      ws = *(
              %x20 /              ; Space
              %x09 /              ; Horizontal tab
              %x0A /              ; Line feed or New line
              %x0D )              ; Carriage return
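Python's json module follows that grammar; tabs anywhere in the insignificant whitespace are fine:

```python
import json

# Tab-indented JSON, as in the example upthread.
doc = '\t{\n\t"list": [\n\t\t{},\n\t\t{}\n\t]\n}'
assert json.loads(doc) == {"list": [{}, {}]}
```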


The grammar in https://www.json.org/json-en.html disagrees. It has

  json
    element

  element
    ws value ws

  ws
    ‘0009’ ws


Edited with string escapes, the tab didn't make it through HN.

The error from PyYaml 5.3.1:

    yaml.scanner.ScannerError: while scanning for the next token
    found character '\t' that cannot start any token
      in "<unicode string>", line 4, column 1


If it continues to be hard to share, I suggest encoding it as a base64 string so folks can decode it into a file with exactly the right contents.


This is, unwittingly, the most YAML-relevant comment in this thread.


Not base64, but this should be easy to reproduce:

  $ printf '{\n\t"list": [\n\t\t{},\n\t\t{}\n\t]\n}\n' > test.json

  $ jq < test.json 
  {
    "list": [
      {},
      {}
    ]
  }

  $ yamllint test.json 
  test.json
    2:1       error    syntax error: found character '\t' that cannot start any token (syntax)


Thanks, I'm finally able to reproduce this.

It would be great if instead of the histrionic message on CPAN (which amusingly accuses others of "mass hysteria"), the author would just say "YAML documents can't start with a tab while JSON documents can, making JSON not a strict subset of YAML".

The YAML spec should be updated to reflect this, but I wonder if a simple practical workaround in YAML parsers (like replacing each tab at the beginning of the document with two spaces before feeding it to the tokenizer) would be sufficient in the short term.
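A naive sketch of that workaround (lenient_load is a hypothetical helper; it would also rewrite tabs at the start of lines inside multi-line quoted scalars, so it is not a real fix):

```python
import re
import yaml  # PyYAML, which rejects these leading tabs

def lenient_load(text):
    # Hypothetical workaround: turn each tab in leading whitespace into
    # two spaces before handing the document to the tokenizer.
    detabbed = re.sub(
        r"^[ \t]+",
        lambda m: m.group().replace("\t", "  "),
        text,
        flags=re.M,
    )
    return yaml.safe_load(detabbed)

doc = '\t{\n\t"list": [\n\t\t{},\n\t\t{}\n\t]\n}'
print(lenient_load(doc))  # parses where plain yaml.safe_load(doc) errors out
```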


> "YAML documents can't start with a tab while JSON documents can, making JSON not a strict subset of YAML"

But YAML can start with tabs. Tabs are allowed as separating whitespace in most of the spec productions but are not allowed as indentation. Even though those tabs look like indentation, the spec productions don't interpret them as such.

See my comment above and esp see https://play.yaml.io/main/parser?input=CXsKCQkibGlzdCI6IFsKC...

Note: the YAML spec maintainers (I am one) have identified many issues with YAML which we are actively working on, but (somewhat surprisingly) we have yet to find a case where valid JSON is invalid YAML 1.2.


Thanks for the clarification. Let's fix it in PyYAML then :)

Speaking of PyYAML, I recently ran into an issue where I had to heavily patch PyYAML to prevent its parse result from being susceptible to entity expansion attacks. It would be nice to at least have a PyYAML mode to completely ignore anchors and aliases (as well as tags) using simple keyword arguments. Protection against entity expansion abuse would be nice too.


This parses fine as YAML in all the tools I've tried. Can you provide the specific versions of the libraries you're using?


They should remove the phrase "every JSON file is also a valid YAML file" from the YAML spec. 1) it isn't true, and 2) it seems like it goes against the implication made here:

> This makes it easy to migrate from JSON to YAML if/when the additional features are required.

If JSON interop is provided solely as a short-term solution that eases the transition to YAML, then I applaud the YAML designers for making a great choice.


> YAML was intended to be a superset

My impression was JSON came years after YAML, and it was somehow coincidental that YAML was almost a superset of JSON.

(Shockingly wikipedia tells me they both came out within a month of each other in 2001).


On the upside, if it's almost a superset then a data producer can make sure it is polyglot by sticking to the intersection of the two.

C++ is not a strict superset of C, but the ability to include C headers is very valuable.


I wasn't able to reproduce any of the issues listed on that page. Does anyone have an example?


I'm not a fan of YAML either, but I think you should not generate YAML files if you can avoid it. All YAML you encounter should be hand-written, so this problem should not occur.

I read "YAML is a superset of JSON" not as a logical statement, but as instructions to humans writing YAML. If you know JSON, you can use that syntax to write YAML. Just like, if you know JavaScript or Python (or to some extent PHP) object syntax, you can write JSON.

If you get a parse error, no biggie, you Alt+Tab to the editor where you are editing the config file and correct it. It is not like you are serving this over the net to some other program.


Same applies to TypeScript. It is not a superset of JavaScript, although many people think it is.

https://stackoverflow.com/a/53698835/


As long as you tell the typescript compiler not to stop when it finds type problems, all JavaScript works and compiles, right? That sounds like a superset to me. Syntactically there are no problems, and the error messages are just messages.


> As long as you tell the typescript compiler not to stop when it finds type problems, all JavaScript works and compiles, right?

Does such code count as valid TypeScript though? It sounds more as if the compiler has an option to accept certain invalid programs.

You could build a C++ compiler with a flag to warn, rather than error, on encountering implicit conversions that are forbidden by the C++ standard. The language the compiler is accepting would then no longer be standard C++, but a superset. (Same for all compiler-specific extensions of course.)

Personally I'm inclined to agree with this StackOverflow comment. [0] It's an interesting edge-case though.

[0] https://stackoverflow.com/questions/29918324/is-typescript-r...


It's syntactically and functionally correct, so despite the error messages I think 'valid' is a better label.

> You could build a C++ compiler with a flag to warn, rather than error, on encountering implicit conversions that are forbidden by the C++ standard. The language the compiler is accepting would then no longer be standard C++, but a superset. (Same for all compiler-specific extensions of course.)

The way I see it, these errors are already on par with C++ warnings. C++ won't stop you if you make a pointer null or use the wrong string as a map key.


congrats to all involved for sticking to their guns here. specs exist for a reason :D


Are people not even reading about what they are using?

Always read documentation or you will get burned by something completely innocuous. For example, that’s how you get Norway missing in your config file with YAML:

  NI: Nicaragua
  NL: Netherlands
  NO: Norway
Oops, “NO” evaluates to false like 10 other reserved words.
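For the curious, YAML 1.1 resolvers such as PyYAML's match plain (unquoted) scalars against a boolean pattern roughly like the one below (an approximation written from memory, not the library's exact source):

```python
import re

# Plain scalars matching this pattern are resolved as booleans under
# YAML 1.1 rules, which is how an unquoted NO becomes False.
YAML11_BOOL = re.compile(
    r"^(?:yes|Yes|YES|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)

for code in ("NI", "NL", "NO"):
    kind = "bool" if YAML11_BOOL.match(code) else "str"
    print(code, "->", kind)  # NI -> str, NL -> str, NO -> bool
```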


Did you read the entire YAML spec before using it (https://yaml.org/spec/1.2-old/spec.html)? And all other specs? And docs for all your dependencies? What about docs for system services on all your servers?

Software/specs/formats have edge-cases. These exist usually because of a tradeoff in usability. That's why there is YAML, JSON, TOML, etc. Choose the one that best fits your use-case and the strictness you need.


> Did you read the entire YAML spec before using it (https://yaml.org/spec/1.2-old/spec.html)?

This feels like an attempt at victim blaming. If you have to read an entire spec from top to bottom to avoid a pitfall in a relatively common operation, maybe something is wrong.

FWIW, I couldn't even find the relevant section in that spec from a quick glance. I probably would have to read a significant portion of that spec just to figure out where it went wrong.


I think you are answering the wrong guy, because he agrees with you.


> This feels like an attempt at victim blaming

I don't think I understand what this means in context of my comment. Are you referring to the parent of my comment?

Someone wrote an article about an interesting thing they discovered about a well-known spec and OP's response is "Are people not even reading about what they are using?" Did the author of the article do something wrong?


No, it's to your comment. One obvious reading of "did you read every relevant spec completely" is that if you didn't read every relevant spec 100% cover to cover then it's your fault that your software had a problem with that spec, and a reasonable person could view that as infeasible (those things get really long). Hence, it's easy to see your comment as victim blaming. (I'm not saying that I do or don't agree with that view, just trying to make sure everyone understands each other.)


Their comment was intended to point out the same thing you're pointing out to the thread root author.


Your tone suggests you think it's obviously infeasible to read the whole YAML spec before using it. But it is possible to read the whole JSON spec [1]. It takes less than a minute.

[1] https://www.json.org/json-en.html

Saying that all formats have edge cases as an excuse for YAML's glaring faults is, frankly, a cop out. Like if a bridge collapses when a leaf lands on it and saying, well, all bridges have some maximum load. Yes, but in this case it's so bad it's just not useful for anything.


My comment was to point out how ridiculous the parents comment is in response to the article.

What does the length of the JSON spec have to do with my comment? The parent comment says if you don’t read all your docs you will be bit by an innocuous bug. You linked to a short spec, but that doesn’t mean anything in this context.

> Like if a bridge collapses when a leaf lands on it and saying, well, all bridges have some maximum load.

Is that what you got from reading my comment or the article? Is that what yaml is like?


Sorry, I misread the flow of the conversation. If anything my comment made more sense as a reply to the one you replied to, rather than yours.

More specifically, I missed the first line of their comment: "Are people not even reading about what they are using?" If you miss off that line, then it sounds (to me) that they're arguing YAML is a terrible format. With that line, it turns out they think it's reasonable, so long as you read the (huge) spec first. Madness!


No problem! I should have quoted that bit at the top of my original comment, as you aren’t the only one who read it that way.


It's not addressing what you said. I don't think what you said is very relevant, as the "read the spec" doesn't add anything to a discussion about whether something is a good idea to use. Length of spec is relevant to that question.


Thanks, I haven't seen that before. It took me 10 minutes to go through and it is very clear (took a moment to realize whitespace also allows no character).

A couple of years ago I looked at tutorials and found it very confusing, but the spec is just great.


While you’re right it’s a shorter spec (setting aside if you’re right about your broader point) this [0] is a more reasonable spec. Even JSON with its microscopic spec has implementation details, inconsistencies, and errata. Is this why we can’t have nice things?

[0] - https://datatracker.ietf.org/doc/html/rfc7159


I’m not sure what this comment is trying to say. I’ve been programming for a while and I know plenty of talented, amazing engineers. Nobody reads the whole spec top to bottom for like, anything.


> edge-cases

Those are forgivable. I watched a team get burned by doing software HMAC and it turns out the underlying native function in the kernel is not thread safe. Would have caught me too.

JSON being a subset of YAML is a core feature. It was to help with adoption.

There's a difference in not reading all pages of every document and not reading anything at all. Not even a well researched blog post.

If this individual is a junior, then awesome, they learned a lot of valuable ideas and solutions.

If they're a senior, yikes. Don't use technology you can't explain when a junior is discovering its core features and writing blog posts about them.


>Did you read the entire YAML spec before using it

No, not the entire spec, but I glanced through the wiki page and a few tutorials to understand it.

>Choose the one that best fits your use-case and the strictness you need.

Well, how can one choose what fits best if they don't even research the topic?


> No, not the entire spec, but I glanced through the wiki page and a few tutorials to understand it.

Then how are you protecting yourself from “innocuous bugs” you mention without reading all of the spec?


Amazing that people use YAML if it does that.


Yep. I used to love Yaml back in the day. But two things burned me.

1. Significant whitespace in a data storage file doesn't scale. Yes, eventually someone wants to dump a giant graph of data and the library breaks.

2. Intermixed context on quoteless strings. Intermixing code and data doesn't work reliably. Everywhere I see it tried, I see it break. If you don't want quotes on your strings, then you have to put something on your keywords. A simple @no would have stopped this and other situations.

As an aside, I find it stupid to have multiple names for true, false, and null.


That's an old YAML 1.1 behavior. It was removed in 1.2 (in 2009), though some implementations (looking at you, pyyaml) are still on 1.1.


Guess what was not removed?

    1 : 1,0
    2: 01
    3: 1.0
    4: 1O
    5: 0b1
    6: 0x1
    7: 0i1
    8: hey,
    8.0: oh,
    version: [3.1, "3.1", 3.10, "3.10"]

Gives you:

    {
    "1": "1,0", 
    "2": 1, 
    "3": 1.0, 
    "4": "1O", 
    "5": 1, 
    "6": 1, 
    "7": "0i1", 
    "8": "oh,", 
    "version": [
        3.1, 
        "3.1", 
        3.1, 
        "3.10"
    ]
    }
YAML turns raw literals into strings, except when the string matches a certain format. Then it may turn it into something else, like an int or float, and you'd better know all the rules by heart and be attentive. And not introduce any typo, which of course no human ever does.
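The version gotcha in particular needs no YAML at all; it falls out of ordinary number parsing (a quick stdlib illustration):

```python
# An unquoted 3.10 is resolved as a float, so the trailing zero is gone
# by the time your program sees it -- "3.10" and "3.1" become equal.
assert float("3.10") == float("3.1") == 3.1
print(float("3.10"))  # 3.1
```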


That's why YAML is a format for computers and JSON a format for humans


And that's also why everybody uses JSON to configure CI, and YAML for REST APIs.


So YAML is fine with breaking compatibility in minor version bumps?


Semver was not yet common practice when 1.2 was released.


https://yaml.org/spec/1.2.0/ says 2009

Not breaking backwards compatibility in a minor version bump of a data format was absolutely common sense in 2009.


semver.org -- which did an amazing job popularizing this idea -- did not yet have its webpage up in July 2009 (archive.org has it in December)


In situations where I'm forced to use YAML, I just use strict JSON to avoid YAML insanity.


You just learn about the edge cases. This applies to everything. There's a number of languages where an empty list/array is truthy and a number where it's falsey. Learn the tool, do some defensive programming (always quote the strings) and you'll be fine.

(This issue was recognised and since yaml 1.2 (2009) the spec says "no" is a string, packages like https://yaml.readthedocs.io/ have migrated a while ago)


This is why I default to TOML for these sorts of lists of things.


This was a hugely bad idea; it was addressed years ago in 1.2 ... but so many implementations are still on 1.1.


YAML's parsing of `no` as `False` has not been part of the spec for 13 years now. It was changed in YAML 1.2 in 2009 to only be `true` and `false` (with variations in case allowed I think).


I always read documentation. I just have to run into issues because I haven’t read it before I do. I imagine most engineers are like this as we tend to prefer learning by doing


Reading often doesn't save you from gotchas. Usually it just makes you think "oh, right! I remember this bullshit from the spec" after you've been stung.


maybe lame question but doesn't syntax highlighting make NO different color because it's a reserved word?

If config file is autogenerated then it's a different matter.


> Read documentation

What is this strange concept you bring up? Googling saved me an entire day of reading dry documents. I may not know how the code works, but I go around telling everyone how easy coding is because of copy+paste.


I'd expect my IDE to highlight keywords!


I have a love-hate relationship with YAML.

Hate:

- human-wise ambiguity of its syntax (If I understand correctly, you "can" indent array items, but you don't have to. And then one guy says "OK, I'm gonna indent", and another guy says "Nah, I'm not gonna indent")

- still no support for datetime as a first class citizen

- strings usually don't need quotes, until they do (I prefer to always quote)

Note that two of the above points are about allowing inconsistent styles, which is a thing I hate.

Love:

- it supports comments whereas JSON does not. If the IETF ever officially updated JSON to support C-style inline (and maybe block) comments, I would absolutely ditch YAML.


You might get a kick out of Concise Encoding then (shameless plug). It focuses on security and consistency of behavior. And it supports comments and has first class date types ;-)

https://concise-encoding.org/

In particular:

* How to deal with unrepresentable values: https://github.com/kstenerud/concise-encoding/blob/master/ce...

* Mandatory limits and security considerations: https://github.com/kstenerud/concise-encoding/blob/master/ce...

* Consistent error classification and processing: https://github.com/kstenerud/concise-encoding/blob/master/ce...


The simplest solution here would be to use JSON5 (https://json5.org/) if you're after comments.

It still doesn't support / standardize dates, though.

But realistically, it's also all about the ecosystem. VSCode, for example, doesn't come with JSON5 support out of the box. GitHub and many other tools / renderers support it at least in syntax highlighting.


Maybe they should try to file an RFC update/new spec? I would be all for it, since it is backwards compatible and covers some essential new needs of a modern configuration syntax. It seems they already have some notable people included in the project, making it more plausible to succeed as a succession to RFC8259.


So use any of a dozen or so other configuration formats that have comments but don't have all the problems YAML does. TOML is probably the most popular of those right now.


TOML is good for config, but not for data exchange though. (Why would I need comments in a data exchange format? Comments are useful when you want to annotate some sample data for other developers who will consume or generate that data.)

Also, OpenAPI specs (also known colloquially as Swagger) can only be written in JSON or YAML.


Your requirements seem to be getting pretty close to XML territory.

To note, JSON specifically doesn’t have comments to close the door to annotations and other kind of meta use.


Agreed, so use JSON for data exchange and TOML for configuration. YAML isn't all that great for data exchange either.

In the cases where you have a tool that requires JSON or YAML, you could use something like cue, dhall, or jsonnet, and convert it into JSON (or YAML). Unfortunately, that's a little tricky when your build configuration itself has to be in YAML, as is the case for GitHub Actions, Travis CI, etc.


I have no hate for YAML, but I think this is the "human readable" format with the most gotchas you might ever encounter.

Every time I use YAML, I get bitten by some edge case. For this reason, I wouldn't count on JSON compatibility, except if the implementation also passes a comprehensive set of JSON tests (which the few I used in the past did not).


Even after writing YAML for years, I can never get the indentation, when to use a dash or not, and other things right without an editor that knows the spec of the YAML file I'm trying to write. It's bonkers.


Fun fact: Heroku's app.json actually uses a YAML parser so even though it isn't documented you could use YAML with it. (At least this was the case years ago, it's possible it may have changed)


This is at odds with the top comment here suggesting there are edge-case bugs. Assuming Heroku wouldn't use something buggy like that, either the top comment is wrong or Heroku uses two different parsers after inferring the type.


the top comment is wrong, YAML 1.2+ is a strict superset of JSON


It is not. YAML 1.2.2 disallows the C1 block and the surrogate block[1], while JSON allows anything except the C0 block in quoted strings[2]

1: https://yaml.org/spec/1.2.2/#chapter-5-character-productions

2: https://www.json.org/json-en.html


From later in 5.1:

> To ensure JSON compatibility, YAML processors must allow all non-C0 characters inside quoted scalars.

The wording here is admittedly confusing, but it does ensure that YAML can handle all JSON strings.
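A quick check of the JSON side with Python's stdlib (an illustration; U+0085 NEL sits in the C1 block):

```python
import json

# JSON permits any character except the C0 range unescaped inside
# quoted strings, so a raw C1 character like U+0085 parses fine.
assert json.loads('"\u0085"') == "\u0085"
# The escaped form is accepted too.
assert json.loads('"\\u0085"') == "\u0085"
print("C1 characters are legal in JSON strings")
```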


Many parsers either default to YAML pre-1.2 or do not even expose a YAML 1.2 option. PyYAML has no 1.2 option, for example. So unless Ansible is using something other than PyYAML...

Relevant (open) PR: https://github.com/yaml/pyyaml/pull/555


The top comment quotes an implementor of a YAML parser, an implementor who in an addendum specifically calls out YAML 1.2 as STILL not being a superset of JSON:

> Addendum/2009: the YAML 1.2 spec is still incompatible with JSON, even though the incompatibilities have been documented (and are known to Brian) for many years and the spec makes explicit claims that YAML is a superset of JSON. It would be so easy to fix, but apparently, bullying people and corrupting userdata is so much easier.


Unfortunately, it claims to be, but is not an actual strict superset.


JSON?


There’s an ambiguous question mark here?


I'm one of the authors of the YAML specification. https://yaml.org/spec/1.2.2/

To date we honestly have not identified a case where valid JSON is not valid YAML 1.2.

If anyone can point out a case where this is true, please file an issue here: https://github.com/yaml/yaml-spec/issues/


Yes. YAML is a superset of JSON.

It's not a feature, it's a bug. Regardless of what the YAML group says.


It is very convenient when you need to generate YAML via text template which unfortunately appears to be something that our industry has decided is reasonable. You can do {{ someval | toJson }} and have reliable escaping. Way better than "{{ someval }}" or {{ someval | toJson | indent 13 }}
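The same trick sketched in plain Python instead of Go templates (the key and value here are made up for illustration): routing the value through a JSON encoder yields a single well-formed scalar, since valid JSON scalars are valid YAML flow scalars.

```python
import json

# Hypothetical value that would break naive interpolation: it starts
# with a YAML 1.1 keyword and contains quotes and a colon.
someval = 'NO: contains "quotes" and a colon'

# Naive templating -- ambiguous or invalid YAML:
naive = f"country: {someval}"

# Escaped via a JSON encoder -- reliably one string scalar:
safe = f"country: {json.dumps(someval)}"
print(safe)
```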


Beware of security implications of using a YAML parser:

https://docs.ruby-lang.org/en/3.0/YAML.html#module-YAML-labe...

> Do not use YAML to load untrusted data. Doing so is unsafe and could allow malicious input to execute arbitrary code inside your application.


"In which the author discovers that YAML is a spec-compliant superset of JSON"

;^)


except it seems it's not really the case? https://news.ycombinator.com/item?id=30052685


Did you try the cited example? Because I did, and it parsed as expected

I do believe there are likely some incompatibility bombs hiding in either the monster specification, or undoubtedly in the various implementations, but in my experience that example is not the one to bring to YAML's court case.


While we are complaining about YAML, I'd like them to specify a canonical output form for strings...


Finally YAML does something it’s expected to do if you acquire expectations from YAML!



So there are six criticisms actually in this file, by my count.

Of those, "NO" is fixed, 0 causing octal is fixed, to me that SQL syntax doesn't look any worse than the low bar set by normal SQL, and the CI providers using different schemas doesn't really have anything to do with YAML.

So that leaves two complaints.

I'm not immediately offended by the clock thing, like I am by NO and octal, but I don't really have the right experience to say how bad it is.

And I was going to say it's bad that nesting escaped string is hard, and it's a shame when languages don't have better quoting mechanisms... then I remembered that YAML has block quotes with no need to escape inside, so that example is just wrong. And they even link later to the stackoverflow post talking about YAML block quotes.

There are problems with YAML but these examples are not good ones.

((By the way, I've seen that "There are 63 different ways to write multi-line strings in YAML" link before but only took the time to fully understand it just now, and that's a gross exaggeration that makes me question the author more than it reinforces their point.

There's a reason the original link says -5- -6- NINE (or 63*, depending on how you count).

Block quotes have 1 or 2 characters to say what to do with newlines, then might have a digit to indicate indentation. That's "60" of the "63". I suppose if it allowed multi-digit numbers there it would be "billions" of ways to write multi-line strings in YAML? That number isn't a real criticism.))


I have now learned that YAML 1.2 removed the sexagesimal parsing. So scratch that one too.


How about no emojis? Am I the only one who finds these abominations annoying in code/docs? Slack chat? Fine. npm logs with emojis? No thanks.


You’re welcome to that opinion but you’re just going to age restrict yourself and have artificially limited impact informing others ¯\_(ツ)_/¯


Okay, since this felt like common knowledge and apparently is also an unpopular opinion: emojis can and often do improve accessibility of documentation. Like any other symbolic reference in text, their use can be defined in a key and they can be used to concisely document a support matrix, or signal complexity levels of deeper links, or… even just thematically identify content in a friendly and inviting way.


Do they? Even if you use them perfectly consistently, people need to know what they mean. For some really common ones that might be fine (green check mark on CI probably means the build ran successfully), but the moment anyone gets confused or doesn't recognize a symbol I feel like accessibility drops off a cliff.


Considering that in some cultures the meanings of red and green are somewhat reversed from Western ones, I would not be so sure about that.

I remember being in the traffic management center in Tokyo for an arranged visit and thinking the entire city had come to a halt, but red meant high throughput, not stoppage.


I am a generation Z zoomer and people using emoji in code / docs scream immaturity to me.


I frequently place emoji in code files and database scripts to ensure other people aren't using the wrong encoding in their editors when working on our projects.


More emojis!

(Ok not really)


I have to admit I created my own YAML-like parser (meaning, syntax based on indentation).

My parser just outputs JSON.

Just like Python, it's simply more readable.

Readability is so important for a language; it's more important than features or correctness.


Slightly off topic, but does anybody know what lang the author is using? I think ruby?


Yep, that's Ruby.


Thank you


"Autopilot took over" - figurative autopilot or Github Autopilot?


for fun: I accidentally compiled a C program with g++ and it worked!

continue?


Accidentally read xml with html parser and it worked!



