I think you're right that Salt/Puppet and to a lesser extent Chef take the wrong approach, but you make some confusing comments that make me suspect you might not understand what these existing approaches are about.
> IMHO, the overwhelming problem with salt/cfengine/puppet style solutions (which I will refer to as 'post-facto configuration tinkerers', or PFCT's) is that they potentially accrue vast amounts of undocumented/invisible state
I think you mean "post-facto" as in: "run after everything is done"? This is not the way that people would advocate Puppet should be used. Puppet should be used from the start, not added as an afterthought once you are done.
> IMHO, a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made
This isn't a cleaner solution; this is almost the solution you get when you use Puppet. With Puppet the development workflow is like this:
- Spin up a Vagrant VM
- Run your manifests against this VM to test them
- To check in, run your manifests against your staging environment
- To deploy, spin up new clean production VMs and run your Puppet manifests against them
- Use a reverse proxy to route all traffic to new production VMs. Terminate old production VMs
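The cutover step above can be sketched in plain shell. This is a minimal illustration, not a real deployment script: the hostnames, port and the "myapp" upstream name are made up, and the Vagrant/Puppet invocations are left as comments since they depend on your environment.

```shell
#!/bin/sh
# 1. Test locally:   vagrant up && vagrant provision
# 2. Stage:          puppet apply --environment staging manifests/site.pp
# 3. Deploy:         boot fresh VMs, run the same manifests against them,
#                    then point the reverse proxy at the new instances and
#                    terminate the old ones.

# Render the reverse-proxy upstream block for the freshly built VMs
# (nginx-style syntax, purely for illustration):
render_upstream() {
  name=$1; shift
  printf 'upstream %s {\n' "$name"
  for ip in "$@"; do
    printf '    server %s:8080;\n' "$ip"
  done
  printf '}\n'
}

render_upstream myapp 10.0.0.11 10.0.0.12
```

Regenerating the proxy config from the list of new VMs, rather than editing it in place, keeps the cutover itself a single reload.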
> PFCT's deployment paradigm tends to be relative slow and error prone.
This much is true. Puppet's slow speed is particularly galling, but maybe that's just because I use it at work.
> isn't a cleaner solution, this is almost the solution you get when you use Puppet
The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.
More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration. This is what I meant by configuration drift.
CSIEs, by contrast, are essentially the complete product of the entire generation process, thus ensuring that future instantiations are identical. A subtle difference, but an important one.
1) You have not dealt with large enough data, since you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data.
2) You haven't thought about what and how those initial CSIE configurations are generated. Do you hand-tweak everything (make && make install all the software onto a particular installation of a particular OS) and then just spawn clones of that? That approach belongs in the dustbin of history. You now essentially have a black box that someone somewhere tweaked, with no recipe for repeating it. If that person left the company, it might be tricky to understand what was installed, where, and at what version.
If you have "configuration" drift, there needs to be a fix to the configuration declaration; people shouldn't be hand-editing and messing with individual production servers. If network operations fail in the middle, then the configuration management system needs to have better transaction management (maybe use OS packages instead of just ./configure && make && make install) so that if an operation fails, it is rolled back.
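The transactional idea here can be sketched without any package manager at all: build into a throwaway staging directory and only swap it into place if every step succeeded, so a failed run never touches the live tree. The paths and the `cp` standing in for a real build are illustrative.

```shell
#!/bin/sh
# Install $1 (a built release tree) as $2, atomically: either the swap
# completes or the previous tree is left untouched.
install_release() {
  src=$1 dest=$2
  stage=$(mktemp -d "${dest}.stage.XXXXXX") || return 1
  # Run the whole "build" into the staging dir; any failure aborts cleanly.
  if cp -R "$src"/. "$stage"/; then
    rm -rf "${dest}.old"
    [ -e "$dest" ] && mv "$dest" "${dest}.old"
    mv "$stage" "$dest"        # rename: old tree out, new tree in
  else
    rm -rf "$stage"            # rollback: production tree never touched
    return 1
  fi
}
```

OS packages give you the same property (plus dependency tracking) for free, which is the point being made above.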
you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data
There are many ways to take an image of an environment, not only VMs or snapshots. But if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.
You haven't thought about what and how those initial CSIE configurations are generated.
On the contrary, generation should be automated. In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.
> In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.
A mix of the two. Use salt/puppet/chef etc. to bootstrap a known OS base image into a stable production platform VM, for example, then spawn clones of that. I would do that, and I can see how it would work very well with testing.
> if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.
Without disagreeing with your conclusion about the design process, it's useful to note that this situation simply isn't a problem for a conventional configuration management tool.
> On the contrary, generation should be automated.
One could argue that Puppet and Chef are ideal tools for performing that automation.
The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.
CSIEs are still a generative process. The difference with what you call a PFCT is that the generative process isn't swept under the rug; it is only codified into a versioned image when that's really necessary for performance reasons.
The result is that it's easy to maintain a clear distinction between machine state and human instructions. For a trivial example: a list of packages that humans decided are necessary for the system versus the final output of 'dpkg -l' after all dependencies have been resolved.
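That distinction can be shown mechanically: diff the human-declared list against the full installed set to see what arrived only as a dependency. The file names are illustrative; on a real Debian host the second list would come from something like `dpkg-query -W -f '${Package}\n'`.

```shell
#!/bin/sh
# Print packages present in the installed list ($2) but absent from the
# human-declared list ($1), i.e. everything pulled in as a dependency.
dependency_only() {
  sort "$1" > /tmp/wanted.$$
  sort "$2" > /tmp/installed.$$
  comm -13 /tmp/wanted.$$ /tmp/installed.$$
  rm -f /tmp/wanted.$$ /tmp/installed.$$
}
```

Only the first list is a human instruction worth version-controlling; the second is derived machine state.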
With chef/puppet/etc. the code used to generate instances represents a human-created description of what the environment is supposed to look like, with as much version-control and referenced documentation as is necessary. With a versioned-image approach, all you have is the one-dimensional history of the image in question.
I fully advocate the description of build steps for environments, just as PFCTs encourage. However, the use of PFCTs to prepare and manage environments seems suboptimal in terms of potential for issues. I suppose a PFCT could be useful as a means to automate the generation of environments, but IMHO it should not be used for the live instantiation/configuration of real infrastructure (which should be more atomic, from some versioned/known quantity). A subtle difference, but an important one.
I forgot to mention this before, it is strange that you credit yourself with defining this term when it has been well defined for some time in ops.
> More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration.
But this adds a significant weight over and above the "generative process" of running manifests. Yes, running manifests against your VMs can "fail or change in unexpected ways due to xyz" - don't do it against VMs that are currently in production! I'm not sure you've ended up with anything less error prone, and you're still going to need a way to get from a fresh VM image to your output images - which is where Puppet would come in.
I'd really rather not make the entire disk image my build artefact, for fairly obvious reasons (ie: size).
You might like this, which is written by a colleague of mine, except that it is not in "opposition" to Puppet/Chef/etc:
"Adequate" is an interesting word :-) We don't consider DRBD in anything other than synchronous replication mode to be reliable, which puts a fair performance penalty on it.
Perhaps. The problem is, if you use a PFCT on a bunch of hosts and something subtle changes that can cause issues, the granularity of the PFCT doesn't necessarily equate to that required for detecting the cause. With a CSIE-style atomic approach to deployment and a properly segregated monitoring system, you can, say, 'roll back' to the last known good version. PFCTs leak state, and will not always allow you this reverse pathway (random examples might include kernel feature or compiler version migration).
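The "roll back to last known good" step is trivial when releases are immutable, versioned directories: activation is an atomic symlink flip, and rollback is the same operation pointed at the previous version. A minimal sketch; the `releases/` layout is illustrative and `mv -T` assumes GNU coreutils.

```shell
#!/bin/sh
# Point "current" at an immutable, versioned release directory. The rename
# is atomic, so readers always see either the old or the new version.
activate() {
  release=$1
  ln -sfn "releases/$release" current.tmp && mv -T current.tmp current
}

# activate v42   # deploy
# activate v41   # roll back: identical operation, previous known-good image
```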