I suspect a lot of abuse of config files comes from moving logic out of source code for bad reasons. There are good reasons for not hard-coding, say, ports and service endpoints in your source code, because it makes it easier to run the code in different environments. However, there are also bad reasons for taking things out of code. A couple that I have encountered:
Pride in creating a "generic" system that can be configured to do all kinds of new things "without touching the code." Reality check: only one or two programmers understand how to modify the config file, and changes have to go through the same life cycle as a code change, so you haven't gained anything. You've only made it harder to onboard new programmers to the project.
Hope that if certain logic is encoded in config files, then it can never get complicated. Reality check: product requirements do not magically become simpler because of your implementation decisions. The config file will become as expressive as necessary to fulfill the requirements, and the code to translate the config file into runtime behavior will become much more complex than if you had coded the logic directly.
Hope that you can get non-programmers to code review your business logic. Reality check: the DSL you embedded in your config file isn't as "human readable" as you think it is. Also, they're not going to sign up for a GitHub account and learn how to review a PR so they can do your job for you.
Marketing your product as a "no code" solution. Reality check: none for you; this is great! Your customers, on the other hand, are going to find out that "no code" means "coding in something that was never meant to be a programming language."
If you write your config in a "full-blown" programming language, then your configs are full-blown programs in that programming language. This situation just plain sucks, or at least it comes very close to (or passes) the "suck" threshold every time I experience it. Code-as-configuration demands tremendous discipline from the team.
Whereas if you abuse YAML (or JSON or XML or whatever) to create a limited and hard/impossible-to-extend DSL, you still have much more control over what can and cannot be executed by the config engine, even if the DSL happens to accidentally become Turing complete. You can embed limited shell commands in the DSL as an escape hatch, but make it difficult enough that you really have to try to make a mess.
Another example of this DSL model done mostly-right is Make.
Once you accept that idea, whether to use JSON vs YAML vs TOML vs XML vs S-expressions is just bikeshedding over syntax.
As for "why YAML in 2021" specifically? Yes, YAML is a big spec and there are a lot of ways to get strings wrong. But maybe you don't care, or your team is unlikely to ever go near the darker corners of the spec. For simple config files, YAML is just really easy to read and write. And if you do need multi-line strings, it's a whole lot easier than doing it in JSON.
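To illustrate (the `script` key is made up): the same two-line string as a YAML block scalar, with the hand-escaped JSON equivalent shown in a comment:

```yaml
# The same content in JSON would need every quote and newline escaped:
#   {"script": "echo \"hello\"\necho \"world\"\n"}
script: |
  echo "hello"
  echo "world"
```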
I'm personally a big fan of TOML, but maybe YAML is still better for highly-nested data.
Of course S-expressions are wonderful for many reasons, but they share the problem with JSON of being somewhat hard to diff and edit without support from tooling.
>If you write your config in a "full-blown" programming language, then your configs are full-blown programs in that programming language.
Yes, which means we have well-documented functionality and tooling of that language to deal with various use cases. Which is not going to be the case with your ad-hoc format based on YAML or JSON.
>Code-as-configuration demands tremendous discipline from the team.
No more discipline than any other form of programming.
>Whereas if you abuse YAML (or JSON or XML or whatever) to create a limited and hard/impossible-to-extend DSL, you still have much more control
And here is the crux of the issue. Tools that are designed so that someone can keep "more control" rather than for tool users to solve real problems. The industry is sliding back towards bad old days of batch processing because of conceit and lack of lateral thinking.
Organizations don't have access to an infinite pool of highly disciplined software engineers. The less discipline or skill required to get something done quickly and safely, the more things they can get done, with more people and more kinds of people, divided into teams with different responsibilities and different kinds of code.
This is an important point. There was even a discussion here a few months back (I remembered it being more recent, but it was 4 months ago) on an article titled "Discipline Doesn't Scale" [0]. Discipline works up to a point, but the more your system relies on discipline, the more fragile it becomes as you scale (in people, in size of the system). At some point you'll hit a wall where your system is too big or you have too many people and discipline falters as a consequence, or you get slowed down maintaining discipline beyond what's reasonable for your field and customers.
I consider myself a highly disciplined software engineer, and I still want as many guard rails for myself as possible. I am a human, I make mistakes; my schema validator does not make mistakes.
The pool of people who understand any popular scripting language is incomparably larger than the pool of people who understand your clever dialect of YAML, JSON or XML.
With a schema, you have a fully-documented and soundly, statically typed DSL. If it were a Ruby library, you'd have to read the docs anyway, and you also lose static parsing and validation.
The best of both worlds is to use a de facto standardized non-executable format like INI or JSON that nearly every language supports.
Then if you need to, you can create complex or overly long configuration files in Python by inserting keys into a dictionary and dumping to ConfigParser (or however your favourite language does things). For example, it's useful when writing a test for many permutations of something similar.
Meanwhile the parsing side is simple enough to be re-implemented in an hour when the time comes to rewrite your whole stack in C+Verilog for real ultimate performance.
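A sketch of that workflow with Python's stdlib (the section and option names here are invented for illustration): generate many similar INI sections programmatically, while the consumer side stays a plain INI parser.

```python
import configparser

# Generate many similar sections programmatically, then dump to INI.
# Section/option names are invented for illustration.
config = configparser.ConfigParser()
for i in range(3):
    config[f"worker-{i}"] = {"port": str(8000 + i), "threads": "4"}

with open("workers.ini", "w") as f:
    config.write(f)

# The consumer side stays trivial: any INI parser can read it back.
check = configparser.ConfigParser()
check.read("workers.ini")
print(check["worker-2"]["port"])  # 8002
```

The generated file stays a dumb, non-executable artifact, which is the whole point.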
The 2 main things are:
1) Using your own bespoke config format or some pet format that's not widely supported adds needless friction to writing little duct tape scripts, testing harnesses, and misc tools. It also adds unnecessary difficulty when porting parts of your program to new languages.
2) Using a Turing complete config format even if it's not bespoke makes all the drawbacks in (1) even more apparent.
>Yes, which means we have well-documented functionality and tooling of that language to deal with various use cases. Which is not going to be the case with your ad-hoc format based on YAML or JSON.
Really? Unless it's written in Haskell or something else with a very strong type system, you won't do better than JSONSchema for validating the config file.
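For instance, a minimal JSON Schema (field names invented for illustration) already rejects typo'd keys and out-of-range values, with no host-language type system involved:

```json
{
  "type": "object",
  "properties": {
    "host": { "type": "string" },
    "port": { "type": "integer", "minimum": 1, "maximum": 65535 }
  },
  "required": ["host", "port"],
  "additionalProperties": false
}
```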
>And here is the crux of the issue. Tools that are designed so that someone can keep "more control" rather than for tool users to solve real problems. The industry is sliding back towards bad old days of batch processing because of conceit and lack of lateral thinking.
Too much freedom is a bad thing. The industry is not "sliding" anywhere. We tried code-as-configuration, it required too much discipline, so the pendulum is swinging back. As pointed out elsewhere, hopefully Dhall will save us from all this by being the happy balance between expressive and chaos-limiting.
I almost agree with you, but then again I recall the use of YAML for Ansible configuration, and the pain that bolting on additional things has caused.
It has to be said there are a lot of things that are almost fully-scriptable, for example the "mutt" mail-client. It has a configuration language, but it isn't real in the sense that you can't define functions, use loops, etc. I eventually wrote my own mail-client so I could do complicated things with a real configuration language (lua in my case).
Seeing scripting languages grow up in an ad-hoc fashion often leaves you in the worst of all worlds. Once upon a time I decided I wanted to script the generation of GNU screen configuration files, for example. I made a trivial patch:
* If the .screenrc file is non-executable - read/parse.
* Otherwise execute it, and parse the result.
Been a few years now, but I think the end result was that I wrote a configuration-generator in Perl that did the necessary things. (Of course this was before I submitted the "unbindall" primitive upstream, which was one small change that made custom use of screen safer - using it as a login shell, for customers who shouldn't be able to run arbitrary things.)
> Whereas if you abuse YAML (or JSON or XML or whatever) to create a limited and hard/impossible-to-extend DSL, you still have much more control over what can and cannot be executed by the config engine, even if the DSL happens to accidentally become Turing complete. You can embed limited shell commands in the DSL as an escape hatch, but make it difficult enough that you really have to try to make a mess.
The article actually mentions Dhall as a solution. This engineering problem has been resolved.
Yeah, I am really excited about Dhall. I think this is the future, it supports the types of abstractions that we need without the mess of full templating or full turing completeness.
The one downside to Dhall is you really want to have an implementation for it in each common language. You can use it to generate YAML, but I think it would be better if tools understood Dhall and that is a bigger ask because it is a more complicated implementation.
Let's build Dhall implementations for every major language, convince Gabe to format things in a way that makes it look more familiar to non-Haskell people, and consider this problem solved.
I love Dhall and really don't understand why the industry hasn't standardised on it yet; seems like a no-brainer.
I disagree with your point that it should be supported by each language, however. I think it's much better to use something simple like JSON as a "compilation" target, since it's easy for machines to read and lets users pick the configuration backend.
Use a smart language like Dhall or bazel for managing configuration and use a mundane format like JSON for the machine, let the Dhall binary bridge the gap.
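A sketch of that split (the service names and fields are invented): humans edit and review the Dhall, the machine only ever sees the JSON emitted by `dhall-to-json`:

```dhall
let mkService =
      \(name : Text) ->
      \(port : Natural) ->
        { name = name, port = port, healthcheck = "/healthz" }

in  { services = [ mkService "api" 8080, mkService "web" 3000 ] }
```

Running `dhall-to-json` over this expands the function applications into plain, inert JSON, so no downstream tool needs a Dhall implementation.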
The trouble I have with DSLs is I don't work on the scripts often enough to become proficient with them. If I haven't looked at one for six months, then I'm going to spend most of my time googling or reading half-baked documentation.
The “bikeshedding over syntax” issue misses an important point from the post:
> [I]n many ways, XML and XSLT are better than an ad-hoc YAML based scripting language. XSLT is a documented and standardized thing, not just some ad-hoc format for specifying execution.
Standardization and reliable documentation really is an important risk mitigation compared to a “widespread” convention in YAML that might disappear (and even become confusing to new developers) if some new YAML-based API becomes more popular. In many cases this stability will not be worth the annoyances of XML, but it’s not a trivial concern.
> you still have much more control over what can and cannot be executed by the config engine, even if the DSL happens to accidentally become Turing complete
Turing completeness is a red herring when it comes to config languages, IMHO. Purity is a much more important consideration, e.g. to ensure it can't delete files or vary its output based on random network calls.
Besides, there are no truly Turing complete languages, as we are dealing with finite computers.
So my preference is to have a simple language with an explicit limit on the number of operations and the amount of memory its interpreter can access before aborting, rather than a complex config without any explicit limits on complexity, leading to exploits with stack or memory overflow in pathological cases.
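That "explicit limit on operations" idea can be sketched in a few lines (this is illustrative, not any particular tool): a tiny expression evaluator that aborts after a fixed fuel budget instead of looping or recursing forever.

```python
# A sketch of bounded evaluation: a tiny arithmetic evaluator that
# aborts after a fixed "fuel" budget of reduction steps.
class BudgetExceeded(Exception):
    pass

def evaluate(expr, budget=1000):
    steps = 0
    def go(e):
        nonlocal steps
        steps += 1
        if steps > budget:
            raise BudgetExceeded(f"aborted after {budget} operations")
        if isinstance(e, (int, float)):
            return e
        op, lhs, rhs = e  # expressions are ("+" or "*", left, right) tuples
        a, b = go(lhs), go(rhs)
        return a + b if op == "+" else a * b
    return go(expr)

print(evaluate(("+", 1, ("*", 2, 3))))  # 7
```

A pathological or malicious config then fails fast with a clear error instead of exhausting the stack or memory.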
> Besides, there are no truly Turing complete languages, as we are dealing with finite computers.
This is not true. Turing complete languages are so because their halting problem is undecidable; it is irrelevant that the computer you run a Python program on has finite memory. Check out languages like Agda, where you can do general-purpose computing but are not Turing complete, since all programs can be proved to halt.
> I'm personally a big fan of TOML, but maybe YAML is still better for highly-nested data.
That is also my experience. TOML is really cool for simple key/value stores, but keeping everything linear in the config makes nesting error-prone, e.g. with [[table.subtable.list]] to append an item to table.subtable.list. It's really easy to miss a nesting level by accident.
Also related: newcomers in TOML-land find it really confusing that appending a single line to the configuration file will append it to the most recently defined table, not as a top-level key.
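A sketch of that pitfall (table names invented): the last line looks like a top-level key but silently lands inside the most recently opened table:

```toml
top_level_key = "fine"

[server]
host = "localhost"

[[server.endpoints]]
path = "/a"

# Intended as a new top-level key, but appended here it actually
# becomes server.endpoints[0].oops:
oops = "surprise"
```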
> Another example of this DSL model done mostly-right is Make.
I think you just internalized the pain of Make. I used to be good at it, didn't program C for 20 years, and came back to it for a few projects and wanted to tear my hair out.
The PLs I'm currently working with have declarative build DSLs in the same language (mix.exs for Elixir and build.zig for Zig), and this is fantastic.
So it should be for configs. Use a truly Turing complete language if you need control flow. I think HashiCorp got this right, but by then everyone hated having to learn Ruby.
I think this is the real reason why YAML configs got popular. If you had a DSL in language X, programmers would get defensive that it was in blub and not their PL of choice. YAML was a way of being a language-agnostic neutral ground.
Regarding not wanting to learn Ruby, I wonder how much of the inertia is installation. I mean, some people seem to have visceral reactions to the syntax (I've even seen people say they dislike Elixir because it's like Ruby). But the lesson I took away from using Ruby DSLs is that users don't want to deal with figuring out how to safely install a new version without borking the system version, segregate workspaces, install packages, etc. Python suffers from that too, but for some reason we all ignore it, maybe because a lot of people consider it a newbie or "easy" language and complaining about it would make them seem like "not a real programmer".
Oh, 100%. Specifically re: HashiCorp using Ruby, there was definitely a time between 1.8 and 2.x where installing Ruby was a nightmare. That's when I quit using Ruby! Even though I loved Ruby. And when I saw HashiCorp products using Ruby as their DSL, a part of me was worried it was not a good choice for those reasons.
Python ecosystem definitely suffers from this. I tried to do some machine learning experiments and basically all of the repos I wanted to use were on 2.x and after 30 minutes of faffing around I gave up and moved onto other packages. However, the biggest pain points for Python came in the 2-3 transition (and TensorFlow x->y in general). By then Python had too much momentum and popularity (and every undergrad learns python). TensorFlow, well at least there is a competitor (torch) and so we see that TF's popularity has basically been sucked dry, and I have no doubt that a large portion of it is just how awful Google+Nvidia have been in managing the TF releases.
>Of course S-expressions are wonderful for many reasons, but they share the problem with JSON of being somewhat hard to diff and edit without support from tooling.
If you want support for trees, XPath has been there for 25 years now.
I probably need to write a blog post about this, but "full-blown programming languages" have 2 features that config files generally don't. And people often conflate them:
1. arbitrary I/O -- can I read a file from disk, open a socket, make a DNS query, etc.
2. arbitrary computation -- can I do arithmetic, can I capitalize strings, can I write a (pure) Lisp interpreter, etc.
I claim that the first IS a problem but the second ISN'T.
Arbitrary I/O is a problem because it means the configuration isn't reproducible / deterministic, so it's not debuggable. Your deployed system could be in a state that depends on the developer's laptop, and then nobody else can debug it.
The second is NOT a problem. As long as the state of the deployed system is a FUNCTION of what you have versioned/configured, then it's no problem. Functions are useful. Pure functions can also be expressed in an imperative style (another design issue that's commonly confused).
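A minimal sketch of that property (the keys and environment names are invented): the deployed config is a pure function of a versioned input, so any machine reproduces the same output and debugging stays tractable.

```python
# Config as a pure function: no I/O, no clock, no network.
# Same input -> same output, so anyone can reproduce the deployed state.
def make_config(env: str) -> dict:
    base = {"port": 8080, "log_level": "info", "replicas": 1}
    overrides = {
        "prod": {"log_level": "warning", "replicas": 4},
        "staging": {"replicas": 2},
    }
    return {**base, **overrides.get(env, {})}

print(make_config("prod"))
```

Arbitrary computation lives inside the function, but because nothing escapes to disk or network, the result is as deterministic as a static file.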
It is possible to have a programming language with well-defined semantics, the ability to have libraries and utility functions and other nice things, while not requiring Turing completeness or the need to always expose file I/O, etc. This allows reproducibility and avoids hitting things like the halting problem in your config file.
Purely my problem for not knowing, but I ran into an issue where I needed to escape characters in a password in a YAML file. Having said that, I really like YAML as a Ruby dev.
Preface: this is why people like to complain about YAML, but really I think it's a feature and not a bug that you can write strings in so many different ways, to serve the many different needs for entering text into config files.
This is probably not a common piece of YAML knowledge, but it's arguably better to use "folded" style for a password:
password: >-
  asdjoi'";j;oj;90\[2301@
Or use single quotes, which signals to the YAML parser not to treat any characters as special, but then you need to escape the literal SINGLE QUOTE character (') by doubling it:
password: 'asdjoi''";j;oj;90\[2301@'
This is completely valid yaml that reduces to the JSON equivalent:
{"password": "asdjoi'\";j;oj;90\\[2301@"}
And of course you can always write JSON syntax for when the text escaping gets hairy, because YAML is a superset of JSON.
YAML has such a comic-book reputation around here that I want to offer a defense of it. Many criticisms of YAML are either about adding non-YAML bits (like turing-complete templates) on top of it, or complaining that it's a bad JSON superset without considering things that can't be expressed in JSON.
In YAML, you can write: (forgive the contrived example)
!Person
name: Joe
login: joe
In Python, you can use "yaml.add_constructor" to automatically dispatch specific Python code when !Person tags are encountered in the YAML. You can transform text into an object tree (edit: graph, actually -- don't forget about anchors and aliases) automatically.
In the right places, this is extremely handy. It's a middle ground between the "duck typing" you normally get with un-validated markup, and the quasi-"static typing" you get with the same inputs after a schema validation. There's an analogy with type annotations in Python.
As always, there are right and wrong places to apply this kind of thing.
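A sketch of the `yaml.add_constructor` dispatch described above, using PyYAML (the Person class is just the contrived example from the comment):

```python
import yaml

class Person:
    def __init__(self, name, login):
        self.name = name
        self.login = login

def construct_person(loader, node):
    # Build a Person straight from the tagged mapping's key/value pairs.
    return Person(**loader.construct_mapping(node))

# Dispatch construct_person whenever a !Person tag is encountered.
yaml.add_constructor("!Person", construct_person, Loader=yaml.SafeLoader)

doc = """
!Person
name: Joe
login: joe
"""
person = yaml.safe_load(doc)
print(person.name, person.login)  # Joe joe
```

Registering on SafeLoader keeps the rest of the safe-loading guarantees while allowing exactly the tags you opt into.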
Thanks gsmecher.
We lean on YAML for writing automated hardware test scripts and have good reasons for it.
JSON doesn't allow comments which are needed for a distributed team to understand the meaning of the tests.
https://docs.pltcloud.com/TestPlanReference
We use YAML to capture a structured description of a large instrument deployment. Comments are absolutely key for us. The YAML is revision controlled in git and annotating it as a document (rather than as data) is a huge benefit.
Tagged YAML allowed us to replace an older system that had ten times as much code. Now, there are far fewer, much warmer code paths, and it's easier to use and far less scary to maintain as a result.
Other markups are not the only alternative to this kind of thing. I expect many developers would reach for a database (SQLite?) as a storage medium. However, keeping our data accessible in plain text has been a huge benefit.
I'm working on a project that allows you to put a special "yaml shebang" at the top of any kubernetes manifest. The shebang is actually valid yaml, since it's interpreted as a comment. This allows you to combine the declarative kubernetes spec (yaml) with an external controller (kubectl).
After making my-app.yaml executable, you can now invoke the application's config to reconcile itself:
> chmod +x ./my-app.yaml
> ./my-app.yaml apply
namespace/my-namespace created
pod/my-pod created
> ./my-app.yaml get
NAME                     STATUS   AGE
namespace/my-namespace   Active   8s

NAME         READY   STATUS    RESTARTS   AGE
pod/my-pod   1/1     Running   0          7s
> ./my-app.yaml delete
namespace "my-namespace" deleted
pod "my-pod" deleted
Obviously this is just a simple example, but there is some really cool potential here.
To be fair, this is just how executable scripts work — it's not a special yaml shebang, just a regular shebang for an executable that happens to take yaml files as its input. The underlying OS just invokes the command in the shebang and passes the file as the first parameter, exactly as it would if the shebang were for bash or python.
I never thought I would see the day when someone "discovers" the way shell scripts were intentionally designed 40 years ago. This is why perl and later python use # for comments.
It is common for lisp folks to chime in at this point to reiterate the value of s-expressions. While this is one such comment, I actually have an evolution story about this, which I shared in a talk here - http://sriku.org/posts/inevitable-lisp/
It is definitely not the fault of YAML. It has its own issues but that is another story.
The problem is when you use YAML (or an ad hoc interpreter embedded in YAML, or a templating system built on top of YAML) for things that should really be a programming language. Things like imports and functions and composition are useful. Templating is a more ergonomic form of string concatenation.
Usually, when I have to deal with these crazy complicated YAML files at work, I'm not thinking, "I wish they had just turned all the configuration into a mess of shell scripts." I'm thinking, "We really need to clean this up. It's way more complicated than is necessary."
You have a point. A mess of shell scripts doesn't really sound like an improvement. If the config is too complex and the config consumer can be changed to simplify things then that should be the way to go.
However, if the config is a Travis CI YAML file that is really just a glorified list of commands, I don't think bash or a makefile is a bad solution. Take as much as possible out of the CI config and use a neutral standard format like a makefile to encapsulate much of your logic.
Then when Travis CI stops offering free usage for open source, it's easy to move, because your format isn't specific to a vendor.
The visionary executive team at a certain company I worked at felt otherwise and poured millions of dollars and thousands of eng hours into recreating HTML/JavaScript MVC components, but with YAML.
In my opinion and in my experience, many actual programming languages are fine for defining config, and (due to you inevitably needing some control flow, and them having it built in) are often better than declarative languages. I for one have no problem with config in Python, JS, Bash, PHP, or Ruby. As long as it's in dedicated config files, and as long as it's kept as simple and as declarative as possible.
Every time I write software, in the beginning there are command line flags. Then someone goes, can I put them in a file. Then later, someone comes along and goes, it would be nice if that file could be dynamic in some way (i.e. let me insert stuff from environment variables). The next request is usually something like, could you let me have conditionals, so if SOME_VARIABLE is true one thing happens, and if false, another. Then someone comes along and goes, THIS IS DANGEROUS, probably NOT VERY SECURE, and your FORMAT SUCKS because you use some character I have to escape when templating, so could you use something modern and trendy, that totally doesn't suck like XML (or whatever format is in the engineering doghouse)?
What is funny is that a shell script does a pretty good job at giving you a nice, programmable way to invoke software. But, that is too complicated :-)
I believe what puts people off shell (myself included) are the various arcane caveats (e.g. spaces after or before operators) and little language quirks. It does not seem intuitive, and since there are better alternatives like Python, I won't bother to learn some 70s string-based language.
You can use it right now as a dev tool. If you use ShellCheck to statically check it, then running your script under Oil is complementary (and it also has some static checks): https://www.oilshell.org/why.html
Yeah, I'm a big fan of Oil and hopefully in 10-20 years it'll be the standard language of shells. Also, I did notice the spaces issue when attempting shell, but it was only from reading your post some time ago that I found out it was a deliberate language design choice.
It should be usable for "cloud config" files long before that! :) But yes changing existing code at the lower distro level has a lot more inertia.
The cloud is basically built on top of Linux distros, so that part is easier to change and is rapidly evolving.
You could say that the = issue is deliberate, but I'd say the core problem is that shell didn't start out as a programming language, or at least it was a very impoverished one without variables.
The original paper from the 70's on the Thompson shell shows that. Shell had "goto" but no variables! So name=value had to be grafted on later without breaking too many things. Words were already split by spaces, so I guess they just made
name=value
a "pseudo-word" that becomes an assignment, whereas "name = value" (with spaces) is parsed as a command named "name" run with the arguments "=" and "value".
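The difference is easy to demonstrate in any POSIX shell (variable name invented):

```shell
# With no spaces, name=value is parsed as an assignment "pseudo-word".
greeting=hello
echo "$greeting"    # prints: hello

# With spaces, the same characters are three words: a command and two args,
# so the shell tries (and fails) to run a command called "greeting".
greeting = hello 2>/dev/null || echo "tried to run a command named 'greeting'"
```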
I highly recommend https://github.com/koalaman/shellcheck It's a wonderful piece of software to help and guide users writing shell scripts. Hook it up to your editor/IDE and all those arcane caveats will be things of the past.
"The next request is usually something like, could you let me have conditionals, so if SOME_VARIABLE is true one thing happens, and if false, another."
I remember one project that started out with simple XML config, then I added conditionals. When I was starting to work on loops and reusable variables, I realized that I was writing a programming language in XML. So I started to write config in C# and compiled that dynamically instead.
I implemented a simulation framework once, where the core simulation process took in all parameters as command line options. A runner program would read a YAML file describing the parameters in arbitrarily complex ways, and invoke the sim process potentially thousands of times (doing parameter sweeps). The intent was for the underlying sim process to always be directly runnable for debugging just by copy/pasting the generated list of options, regardless of how complex the config file got. It was a year or so before somebody started passing in an intermediate config file to the sim process by command line argument...
This is actually one reason Ruby DSLs became so popular. You could take config from an XML file, translate it to the DSL, and have all the basic Ruby features (flow control included) available as well.
For example, there was an ANT library for JRuby that worked super well for using ANT constructs/libraries but in a sane language.
>What is funny is that a shell script does a pretty good job at giving you a nice, programmable way to invoke software. But, that is too complicated :-)
Ha, I agree, although I also think shell is a bit impoverished. It works but there are valid reasons people don't use it.
Prediction: we'll see a lot more shell embedded in YAML in the coming years, with the same examples shown in the original article (GitHub Actions; didn't know about Helm Charts).
I've been round that loop a few times; if you use a full language for config then you either have to impose iron discipline or you sooner or later end up needing a configuration format for your configuration format.
That is to say the configuration eventually becomes a program in itself, with a few key values... which then get pulled out into a simple config file.
Depending on the use-case, I think you may be right. Especially if you can use a language that the team understands and has tooling for and you don't take in outside configuration.
Pulumi is an interesting tool in this direction. Rather than write in something like Terraform, you just use your programming language of choice. Pulumi is just a library you use.
I think that depends on how complex the config can get, and how it is used.
For example package.json in NPM packages -> that feels like a good fit for JSON (although it would be even better if JSON had comments). On the other hand, terraform, or build languages like Make or Meson -> they are complex enough that it probably makes sense to have a standalone DSL.
I was facing the same decision recently on my project while designing a declarative DSL for web app development (kind of like web framework).
From simplest to most complex option:
- should I just let them define it all in JSON? There would be a lot of repetition at some point and it would become impractical, but it could be ok for the start.
- should I just implement a JS library that devs can use in JS to construct a config object that is then exported to JSON? That would be an embedded DSL. Sounds flexible and easy to do, but it is also overly expressive and not "cool" (ok, this is debatable).
- should I use something like Dhall? It is declarative and simple.
- should I come up with my own declarative, configuration-like DSL? It would probably end up similar to Dhall, but this means I can do whatever I want - I can make it as ergonomic and custom as I want to (which I guess is both good and bad :D!). It might also allow for nicer interop with Javascript and other languages.
In this case, we went for the last option, mostly because we felt the most important thing is ergonomics and interop, but, well, I am still curious how the other directions would have played out. Plus, at the end, we didn't yet get to the point where the language is more expressive than JSON (code example: https://github.com/wasp-lang/wasp/blob/master/examples/tutor...).
Maybe I am just missing a better design process, but it seems to me that with a language idea it is hard to say if it is good or not until you try using it.
But that does not allow you to manipulate it automatically (think about moving domains or something like this), nor does it help you to detect errors (which few tools do, but many actually should do).
So now you need the runtime for that other language in your system-wide configuration management solution. Plus all the other runtimes for languages someone decided to use as a configuration language.
Sounds alright to me, as long as you restrict to sane, reasonable languages. For example lua's "runtime" fits in a single .c file and is likely simpler than many xml or json parsers.
While YAML has some warts, the "if" statement in there really can't be blamed on YAML, as it does nothing but insert a node into the object tree. You'd need something else to act on that node. I get what the author is saying, though: you should try to keep your config logic-free.
That's like saying C doesn't have an if statement, you're just adding it to the AST, it's the CPU that does the actual jump not C. In short: a distinction without a difference.
The C language has an if-statement because it is part of the specification of the C language.
YAML does not have an if any more than JSON has an if just because you can write {"if": "foo", "then": "bar"} and some processor can process it with if-semantics.
This is neither yaml nor Jinja. It's a Helm chart, which uses Go templating to generate yaml programmatically. The point of this is all of the actual config is in Values.yaml, which generates the .Values object referenced here, and you use those values to generate all of your deployment definition, making it easy to apply environment-specific overrides by switching out a single file and leaving everything else alone.
So yeah, if Helm is complicated, sure, don't blame yaml for that. Similarly, I'm not sure all of these declarative CI examples are valid yaml either, rather than being fed through a front-end preprocessor first. They're essentially feature-identical with Jenkins declarative pipeline, minus the ability to run arbitrary Groovy code in your build scripts, though of course you can get this exact behavior if you want by feeding a HEREDOC to a sh step that invokes the Groovy interpreter.
In any case, the author calling CI workflow definitions "config" is a little misleading. They're necessarily more complicated than a properties file and need to allow you to invoke external tools. Newer language ecosystems like Go and Rust are trying to solve this by putting dependency management, compiler, packager, and testing all into one tool provided with the language installation, but even there a lot of CI/CD needs to do a lot more than that, like deploy infrastructure, build container or VM images, etc.
Author here. Thanks for reading. I am not trying to be misleading. I'm trying to point out that although each step along the way of adding things to config seems to make sense, you arrive in a bad place.
People start with some YAML, and everything is fine. Then a simple condition is added, so ok, we are treating code as data, LISP style but in YAML. Then the logic and branching grows, and we introduce templates and so on.
You start with config, then the config ends up with its own config, and eventually, you are using Skaffold to configure Helm, which generates your YAML. That can't be the right solution, can it?
The point is that CI workflow orchestration and clustered application deployment definitions are not config, at least not in the same way as “here is some hierarchy of definitions that change the behavior of an application.” These are attempts to create declarative DSLs that script workflow steps. In the Helm case specifically, the templating engine produces yaml because Kubernetes uses yaml for its manifests, and Kubernetes uses yaml for manifests because it provides a one-to-one mapping to the actual data structures used by the cluster manager. It’s way beyond config. You’re defining the entire state of a clustered application, including the infrastructure. Only the Values.yaml is config, and I don’t see how that alone is all that complicated.
I don't blame YAML for Helm's approach. I'm also not anti-templating, but templating the DSL as a string and manually setting the indentation level strikes me as a particularly hacky approach. Compare with something like https://jsonnet.org/ (no endorsement), which lets you do the same kind of substitution, but directly in the structure of the data.
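The distinction can be sketched in a few lines of Python (this is a hedged illustration of the idea, not Jsonnet itself): substituting into the config as a string versus building the structure and serializing it.

```python
import json

name = "web"

# String templating: correctness depends on the template author getting
# quoting, escaping, and nesting (or, for YAML, indentation) right by hand.
templated = '{"metadata": {"name": "%s"}}' % name

# Structured substitution (the jsonnet-style idea): build the data first,
# then serialize; nesting and quoting are handled by the serializer.
manifest = {"metadata": {"name": name}}
structured = json.dumps(manifest)

print(json.loads(templated) == json.loads(structured))  # True
```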
That looks like Jinja that outputs YAML, doesn't seem fair to blame YAML (which I think the author sort of does) just because someone wanted to bolt an include statement onto it.
For most applications it would probably be overkill, but it seems useful for applications that need to change their own configuration (e.g. by the user through a GUI), or need their configuration changed by other tools.
Yes! And then let's have anyone put their shit in there without order. So you have config and runtime data in the same area. Have multiple processes change the same value at will. Except for anything larger than a few bytes; that we just disperse across like 10 possible locations on all hard drives!
I'm fine with having a program that transforms the configuration file into a more diffable format, and then doing my diffs on the transformed files.
I like for this a sorted name=value format, where non-scalars such as arrays and hashes are flattened into the names. E.g., if the config contains an array named "users" with 3 items, the name=value pairs would flatten to names users.0, users.1, and users.2. An array named "servers" whose entries are 3 hashes with the host name and port would flatten to names servers.0.host, servers.0.port, servers.1.host, servers.1.port, servers.2.host, servers.2.port.
That gives you diffs that tell you what has actually changed in the configuration itself rather than in the formatting of the configuration file.
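A rough Python sketch of that flattening (the layout and naming here are illustrative, not any particular tool's format):

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into dotted name=value lines."""
    if isinstance(value, dict):
        items = list(value.items())
    elif isinstance(value, list):
        items = [(str(i), v) for i, v in enumerate(value)]
    else:
        return [f"{prefix}={value}"]
    out = []
    for key, v in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        out.extend(flatten(v, name))
    return out

config = {"servers": [{"host": "a", "port": 80}, {"host": "b", "port": 81}]}
print("\n".join(sorted(flatten(config))))
# servers.0.host=a
# servers.0.port=80
# servers.1.host=b
# servers.1.port=81
```

Sorting the lines before diffing means reordering keys in the original file produces an empty diff; only actual value changes show up.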
Intermediate solution: SQL command text that generates the configuration file. Command text goes into source control, and the database it generates gets deployed.
It's not about the quantity. For my application, the key driver was incremental remote update. If your deployment model is that 100% of the application configuration is linked into the container image, then I can see how flat files sound appealing. I completely agree with Dan Luu about the perils of files, though[1]. My application is a remote sensor that both requires incremental update, and is large enough to justify SQLite over something smaller.
I don't think much of that Dan Luu article applies to read-only files deployed with the application, which is often what configuration files are.
Definitely, concurrency control is an issue with mutable files; that would all by itself make me hesitant to use them as a storage solution if it were in play. But it's not when the changes to the files are made outside of the app being deployed, and they are deployed read-only, with any changes requiring an app restart to pick up, etc. That is how I have usually experienced configuration, and I have personally never run into any problem with having it in the file system.
Most of the other stuff in the article also seems to me not to apply to the standard configuration use case.
(And if it did... it would apply to your source files too, right? Whether Ruby or Python or even JVM bytecode. Yet we obviously can and do put those in the filesystem. Ultimately, read-only configuration files are just another kind of source file; they don't really have any special problems.)
But "incremental remote update" (without requiring app restart especially!) is definitely not the standard configuration use case. I agree that a 'real database' seems reasonable for that use case, whether sqlite3 or something else. Whether you try to control the configuration (that will wind up in a db) in your version control system, or just use standard db backup/clone techniques instead.
I have definitely seen some tables + SQL used for hierarchical configuration spread across artifacts/modules. My guess is that this has probably seen more deployment than is at first suspected.
The XSLT FizzBuzz example doesn't seem written "in jest", except perhaps in the loose sense of making a point about Turing completeness.
It's as straightforward and readable as XSLT allows, leaving only a small refactoring on the table (putting the line break into an unconditional output instead of repeating it in the four cases).
But it is much simpler to write it in XPath (that could then be included in XSLT)
for $i in 1 to 100
return if ($i mod 3 = 0 and $i mod 5 = 0) then "FizzBuzz"
else if ($i mod 3 = 0) then "Fizz"
else if ($i mod 5 = 0) then "Buzz"
else $i
Wrap it in string-join, if you need the output as one string with line breaks rather than a sequence list.
Although one can also write horrible clever XPath:
for $i in 1 to 100
return (("Fizz"[$i mod 3 = 0] || "Buzz"[$i mod 5 = 0])[.], $i)[1]
Using your solution, one may shorten the XSL-T (with XSL-T 3.0) to:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template name="xsl:initial-template">
<xsl:sequence select='
for $i in 1 to 100
return (("Fizz"[$i mod 3 = 0] || "Buzz"[$i mod 5 = 0])[.], $i)[1]
' />
</xsl:template>
</xsl:stylesheet>
While I understand the rationale behind the article and agree, the author does not acknowledge that XSL-T is not a "programming language" but a "templating language".
INTERCAL isn't horrible! How could the only language with a PLEASE keyword be bad?
(from wikipedia)
This last keyword provides two reasons for the program's rejection by the compiler: if "PLEASE" does not appear often enough, the program is considered insufficiently polite, and the error message says this; if too often, the program could be rejected as excessively polite.
Author here. Thanks for reading.
That reference comes directly from the INTERCAL-90 compiler. It has hilarious error messages. One of my favorites is that it is an error not to say PLEASE enough and also an error to say PLEASE too often:
E079 PROGRAMMER IS INSUFFICIENTLY POLITE
The balance between various statement identifiers is important. If less than approximately one fifth of the statement identifiers used are the polite versions containing PLEASE, that causes this error at compile time.
E099 PROGRAMMER IS OVERLY POLITE
Of course, the same problem can happen in the other direction; this error is caused at compile time if more than about one third of the statement identifiers are the polite form.
Some more fun details:
- The compiler is called `ick`
- The compiler has a `-mystery` flag which is documented as
"This option is occasionally capable of doing something but is deliberately undocumented. Normally changing it will have no effect, but changing it is not recommended."
- Numbers have to be entered in English. 12345 would be written as `ONE TWO THREE FOUR FIVE` unless you put it in roman numeral mode where the characters ‘I’, ‘V’, ‘X’, ‘L’, ‘C’, ‘D’, and ‘M’ mean 1, 5, 10, 50, 100, 500 and 1000.
- The debugger is called `yuk`
Makes me want to hook up a simple predictive algorithm such as what is used for online Rock-Paper-Scissor against computers to demonstrate how predictable people can be and add
E134 PROGRAMMER'S POLITENESS IS TOO PREDICTABLE
Politeness, once made too predictable, is too easily overlooked. This error is
caused at compile time if the programmer's politeness level is too easily
predicted by the compiler as it encounters each statement in the code.
I think all Lisps work really nicely for configuration that is combining data and control flow.
It's a shame that most people are too unfamiliar with the prefix notation.
Sexps can be nice for a config representation (in particular, they imply less structure so they work well for more cases like maps between objects (just use an alist), little DSLs in configs (so long as you’re ok with a lisp-like language), and “tagged objects” where you want to specify what kind of thing you’re describing and then extra arguments, like enums in rust). There are problems though:
1. Casing (some lisps are case sensitive others aren’t)
2. Lots of types of atom. Eg Common Lisp has like a dozen different types of number (short/long/single/double floats, integers, rationals, complex numbers thereof), weird syntax (eg put a decimal point at the end of an integer to make sure it’s read in base 10. The fact that you have to worry it won’t be is already a serious concern), strings and symbols.
3. Extra syntax/types, eg vectors, arrays, bitvectors, #., backtick and comma, quote, backslash rules (but not sure if there’s a standard way to escape characters), keywords, packages
4. Multiple similar things, eg vectors and lists, symbols and strings, alists and plists.
Many of these qualities may be useful in programming but if one treats a config representation as a thing which must be validated and parsed into the actual data, then all of this adds confusion. I think there should be only one kind of atom: the string which may be written without quotes. This way you can be flexible in parsing (some fields you might choose to parse 50% as 0.5 and other fields you might require that the number starts with a dollar sign so people don’t forget that it is referring so some amount of dollars) and make it easy for people to write config files (no errors about expecting a string and getting a symbol or vice versa).
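A small Python sketch of that per-field flexibility (the field names and parser rules here are hypothetical, just following the examples above):

```python
# Every atom arrives as a string; each field decides how to parse it.
def parse_percent(s):
    """Parse '50%' as 0.5 -- this field chooses to accept percentages."""
    if not s.endswith("%"):
        raise ValueError(f"expected a percentage, got {s!r}")
    return float(s[:-1]) / 100

def parse_dollars(s):
    """Require a leading '$' so readers don't forget this is money."""
    if not s.startswith("$"):
        raise ValueError(f"expected a dollar amount, got {s!r}")
    return float(s[1:])

raw = {"discount": "50%", "budget": "$1200"}  # everything is a string
config = {
    "discount": parse_percent(raw["discount"]),
    "budget": parse_dollars(raw["budget"]),
}
print(config)  # {'discount': 0.5, 'budget': 1200.0}
```

Because the reader hands over only strings, validation errors come from the application's own parsers, in terms the config author can understand, rather than from the reader's type system.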
Following the $ thing for amount of values, a thing I like in Common Lisp is the way the reader can be customized, and maybe this is a bad idea but I would like to have user-provided data-types formats and bind them to prefix characters, or pairs of delimiters, ... to avoid having to tag your values like this, as done in JSON:
(published
(type std.iso.8601
value 2020-02-26))
Or like this, where "date:" is something only your application knows:
(published date:2020-02-26)
Instead, you could rely on an externally specified format
(prefix @ is std.iso.8601)
And use it in your file to parse text so that your application can build values of the proper datatype:
(published @2020-02-26)
The application language would register lexers for those formats or fail the parsing step.
You could have fake parsers that just skip over the defined syntax if you don't need to process it in your code.
Or, the way the syntax is defined could be such that it tells the lexer how to skip over a token even if the type is not useful for a tool's purpose (skip until a space, or parse exactly N characters, or "read until this delimiter with backslash being an escape character").
The program will get a list of two atoms, “published” and “2020-02-26”. It can complain if you’ve e.g. written “published” when you should have written “submission_date”, and it can complain if the next atom isn’t a valid date. You don’t need to tell the reader to parse something as a date because the reader isn’t best placed to know what should and shouldn’t be a date. Whereas the program should know exactly where it expects dates to be, so it might as well handle parsing them, and you don’t need to tag dates when you write your config.
Any human can read that date and know what it means so why bother tagging it if the machine can also know it should be a date.
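A minimal Python sketch of that division of labor (the key name and function here are hypothetical): the reader yields plain string atoms, and the program parses the fields it knows should be dates.

```python
from datetime import date

def read_published(pair):
    """The *program* knows this field is a date, so it does the parsing."""
    key, value = pair
    if key != "published":
        raise ValueError(f"unknown key {key!r}")
    return date.fromisoformat(value)

# The reader delivered two untagged string atoms; no @-prefix needed.
published = read_published(("published", "2020-02-26"))
print(published.year)  # 2020
```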
I see this happen a lot with python projects. In my opinion, a superior strategy is to simply import python config modules, e.g. config.py.
You can always write a dictionary into that config file, just like you would with normal yaml. It would actually be sort of nice if python had support for yaml style dictionaries exactly for this purpose.
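A sketch of the config.py pattern (the file and variable names are illustrative; in a real project config.py is simply checked in next to the code and imported with a plain `import config`):

```python
import importlib.util
import pathlib

# Write a tiny config module to disk so this example is self-contained.
pathlib.Path("example_config.py").write_text(
    'HOST = "localhost"\n'
    'PORT = 8080\n'
    'FEATURES = {"beta": False}\n'
)

# Load it as a module -- equivalent to `import example_config`.
spec = importlib.util.spec_from_file_location("example_config", "example_config.py")
cfg = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cfg)

print(cfg.HOST, cfg.PORT, cfg.FEATURES)  # localhost 8080 {'beta': False}
```

The obvious caveat, per the rest of the thread: importing config means executing it, so the config file is a full program whether you wanted one or not.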
That's why erlang has file:consult from the start. It just reads whatever data is in file, that data can be any valid erlang terms. That format is also used for typical system configuration files. Of course, you could make execution based on that configuration file, it happens that erlang already has "standardised" way of storing this as {M,F,A} (module, function, args), which you could send directly to apply:
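A rough Python analogue of that pattern (not Erlang itself): `ast.literal_eval` reads literal terms while refusing arbitrary code, and a `(module, function, args)` triple can then be applied explicitly, much like `{M,F,A}`.

```python
import ast
import importlib

# literal_eval accepts only data literals -- tuples, strings, numbers,
# etc. -- and raises on anything that would require evaluation.
entry = ast.literal_eval('("math", "gcd", (12, 18))')  # (module, function, args)

mod_name, fn_name, args = entry
result = getattr(importlib.import_module(mod_name), fn_name)(*args)
print(result)  # 6
```

The config stays pure data; the decision to treat one particular triple as something to apply is made by the application, not the reader.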
Same here is JS land. We have JSON (and JSONC) but it's easier to just have a `js` config file, with comments, importing the `dotenv` module for secrets values, or whatever else you need.
Wondering why there is no light-YAML, or a "JSON without commas and braces" sort of language, available? Or if there is one, why is it not popular yet? Almost everyone complains about the huge spec of YAML, and that pretty much seems to be the biggest complaint (other than it getting huge over time, but that can happen to all languages).
.ini or .properties files are still pretty straightforward.
That said, just because a language has these features doesn't mean you need to use them. A lot of these issues are in part due to the fact that they're there, but also that people don't take a stand / don't have the discipline NOT to use them.
It's why I'm now an opponent of Scala: it's too free and feature-rich.
There's a very specific, very common, use case that the people who write most markup parsers refuse to support...
A good config file format needs to:
- be human readable
- be machine modifiable while preserving comments and whitespace
- be unable to run code
- be unable to call a constructor on an object without being whitelisted
It's really sad CSON didn't catch on. We have a JSON Swagger spec at work, and balancing braces etc. is just painful.
So for a new project I tried writing a spec in YAML instead, and the indentation rules felt quite shaky for me as a beginner (i.e. I wasn't entirely sure where things ended up with all the modes), so it was a lot of trial and error for someone as unfamiliar with YAML as me (which explains why I've kinda stayed away from it: writing it was sometimes as hard as reading it).
CSON seems to have a sweet-spot between them with less clutter than JSON but more straightforward block rules than YAML.
HOCON is the "human optimised config object notation" and is a superset of JSON. All valid JSON is valid HOCON but then it goes and adds lots of other features specifically designed for writing usable config files.
> Wondering why there is no light-YAML or just JSON without comma and braces sort of language available?
Well, JSON without commas or braces would be restricted to single literal values. (Or, I guess, single-element lists, but what would the point of that be?)
If all you have are record separators, so that the only data you can send is a list of values distinguished by nothing other than the order in which they occur, then the information about the structure of the data must exist separately on both the sending side and the receiving side. Better hope they agree.
But regardless, I feel like that exists right now; doesn't this describe protocol buffers?
> Parent said no commas or braces. He didn't say "all you have are record separators".
No, he didn't, he said "JSON without commas or braces". That makes no sense.
You were the one who specified nothing but record separators:
>>> you could have multiple elemnts without commas or braces, just with whitespace and newlines as separators.
And sure, that approach has upsides and downsides compared to JSON, which is why it's already a widespread alternative to JSON.
Your proposal here is just JSON that looks slightly different. The reason we don't have that is that we do have that, with isomorphic syntax, and we call it "JSON". There's nothing at all interesting about the idea of "JSON without commas" if you satisfy it by saying "we've eliminated the comma by relabeling it as a 'hyphen' in some cases and a 'tab' in the rest".
>No, he didn't, he said "JSON without commas or braces". That makes no sense.
Strictly interpreted, no, it doesn't.
But I picked up his intention, and clarified it already as "Where JSON here is just a stand-in for 'simple format with few types', not about it being parseable as JSON or anything" in the comment you've responded to :-)
>Your proposal here is just JSON that looks slightly different. The reason we don't have that is that we do have that, with isomorphic syntax, and we call it "JSON". There's nothing at all interesting about the idea of "JSON without commas" if you satisfy it by saying "we've eliminated the comma by relabeling it as a 'hyphen' in some cases and a 'tab' in the rest".
Hey, one should at least try to infer what people mean from context. Not everybody is a native speaker or the best communicator.
What the parent asks for, and it is interesting, and we should have had it, is basically "minimal, sane, YAML subset".
I don't know about others, but I have a nice little Scheme-based DSL which compiles to YAML for AWS services. I've used it in anger for 2 years now and no one has noticed any difference at any place I've worked.
I was hoping the article would mention Cue. I find the syntax so much better.
Cue needs to be adoption ready and is getting close. The big one will be the 'cue mod' command, when that lands we should be in good shape. I think v0.4 will be when people start looking hard at Cue, but maybe a later 0.3.x
> INTERCAL is a bit unusual. For example, single quotes are called sparks, and double quotes are called rabbit ears, less than (<) is an angle, and a dash (-) is a worm.
Just took a quick look, and the first example does that hateful thing of starting lines with commas. Even if that's just a convention, it instantly makes me want to avoid it.
Just allow trailing commas at the end of lists. It's much less jarring and unfamiliar.
Trivial? Maybe. But my brain takes a while to adjust to new conventions and I do it way too frequently already.
This will not diff well if you add a new row at the beginning. It's the same problem with non-trailing-commas, but moved to the front instead of the end.
A few months ago I explored using Dhall and Jsonnet to re-write an Ansible playbook [0,1]. I wanted to like Dhall, but found the type system got in the way more than it helped, while Jsonnet was very productive and a huge improvement over YAML.
It's great if the system is built directly on top of it, e.g. spago. Not so much if you're trying to type yaml. Dhall is very opinionated, so trying to use it like TypeScript to type an untyped structure is... interesting.
> How did we get to this world of little programming languages embedded into YAML?
A configuration file is just a bad programming language (or a good one...).
Some people have forced YAML to be a configuration file because they're lazy and don't want to write a parser/lexer. But most people do it because they don't know the difference.
Despite what the article shows, YAML is NOT a programming language: constructs such as `if`, `when`, and others are not generic. Try to feed an Ansible playbook to Travis, or get those Jinja-like variable substitutions working with a YAML parser that doesn't process them, and things won't work as expected.
The problem with these languages embedded into YAML is they are all one-off implementations. TravisCI conditionals have a TravisCI specific syntax, usage, and features. You can't use Travis's concat function or conditional regex in the YAML configuration for your ansible playbooks.
Using traitlets for Jupyter configuration completely changed the game. It enhanced Jupyter's versatility and extendability so much. The only downside is that it is kinda unparsable, but that's because providing parsers for the most-used IDEs requires a lot of time...
"Writing control flow in a config file is like hammering in a screw."
Sometimes a screw is the correct or perhaps only fastener for a job and a hammer is the only screwdriver available or the only screwdriver capable of driving the screw. I've done it.
Unfortunately this explains most problems with evolving software applications. You see a problem, and you KNOW it would be much better solved with e.g. a rewrite into a proper programming language, but as a developer (and manager) you have to make the tradeoff; spend five minutes adding an IF to a yaml file, or five months rebuilding the whole CI to allow for conditionals.
I'm currently in tradeoff mode as well. Do I spend a month copy / pasting some shit code in the existing codebase so that I can spend the rest of the year on the project rebuilding things, or should I pause the rebuild project and instead clean up the existing one (effectively a rebuild-in-place).
Author here. My point is not that YAML is a programming language and that that is horrible. Instead, it is that YAML is sometimes used to embed a vendor-specific, unnamed programming language, and that is horrible.
People start with some YAML, and everything is fine. Then a simple condition is added, so ok, we are treating code as data, LISP style but in YAML. Then the logic and branching grows, and we introduce templates.
It's not that anyone wants to get where we've ended up. It's that each step along the way seems to make sense, until you end up trapped in complex templates and scripts that configure your config, and it's too late.
It is a vicious local optimum that everyone keeps falling into.
Otherwise, one could argue that some programming languages use newlines to separate commands, therefore the file type is NSV (newline separated values), and therefore NSV is a programming language, and any newline-separated file is a program. This is clearly nonsense.
Sometimes people confuse the medium with the language. YAML may be the medium in which programming instructions are conveyed, but it never makes YAML a programming language, irrespective of the file extension. If you put lines of bash into a YAML array, the YAML itself still only contains data. If you pass that data file to something that can take the bash lines out and make use of them, then great.
Essentially, it's a storage medium, just a slightly higher-level storage medium than we're used to thinking about. You could create a programming language syntax that is entwined with YAML, but then the language would be more correctly named something like Whatever-over-YAML (or Whatever for short).
I feel like a lot of people who were criticizing XML a few years back because "schemas are complicated" and "XML is too verbose" are slowly realizing these things were invented for a reason.
JSON-LD or XML are perfectly good candidates for data with strict schema. But it took devops startup fanboys some time to realize schemas were useful in the first place.
> a lot of people who were criticizing XML a few years back because "schemas are complicated" and "XML is too verbose" are slowly realizing these things were invented for a reason
Agreed, it's analogous to the way the programming language world has come around to realising the static type system folks had a point all along.
I don't have much experience with XML so I can't speak to whether the other criticisms of XML make sense, especially regarding complexity.
YAML sounds like the olde-time "hardy har har" solution to modern extensible languages, where pretty much "anything goes", to the point of conceiving full-blown tumor executables.
People can say what they want about JSON, but anything beats the old world of rolling your own serializers and convincing your boss about the hours you've spent reinterpreting some "sporadic flavor" like YAML.
> force Douglas Crockford at gunpoint to add comments to the spec.
You mean re-add them.
> I removed comments from JSON because I saw people were using them to hold parsing directives, a practice which would have destroyed interoperability.
Many JSON parsers have flags to enable comment parsing (and someone went ahead and made a "standard" called JWCC that seems to be more or less what these parsers accept).
I think that is a good tradeoff: JSON stays strict for data interop, with the possibility of enabling comments for those cases where people use it for configuration (and specifying it as JWCC).
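A hedged sketch of the "comments as an opt-in preprocessing step" idea: strip whole-line `//` comments before handing the text to a strict JSON parser. (This is deliberately naive; a real implementation must also handle `//` appearing inside string literals and trailing comments.)

```python
import json
import re

text = """
{
  // port the service listens on
  "port": 8080
}
"""

# Remove lines that contain only a //-comment, then parse strictly.
stripped = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
config = json.loads(stripped)
print(config)  # {'port': 8080}
```

The interop point stands: the file on disk is no longer valid JSON, so only tools that agree on the preprocessing step can read it.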