I think you're right that Salt/Puppet and to a lesser extent Chef take the wrong approach, but you make some confusing comments that make me suspect you might not understand what these existing approaches are about.
> IMHO, the overwhelming problem with salt/cfengine/puppet style solutions (which I will refer to as 'post-facto configuration tinkerers', or PFCT's) is that they potentially accrue vast amounts of undocumented/invisible state
I think you mean "post-facto" as in: "run after everything is done"? This is not the way that people would advocate Puppet should be used. Puppet should be used from the start, not added as an afterthought once you are done.
> IMHO, a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made
This isn't a cleaner solution; this is almost the solution you get when you use Puppet. With Puppet the development workflow is like this:
- Spin up a Vagrant VM
- Run your manifests against this VM to test them
- To check in, run your manifests against your staging environment
- To deploy, spin up new clean production VMs and run your Puppet manifests against them
- Use a reverse proxy to route all traffic to new production VMs. Terminate old production VMs
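The cutover step above can be sketched in plain shell. This is a minimal illustration, not a real deployment script: the hostnames, port and the "myapp" upstream name are made up, and the Vagrant/Puppet invocations are left as comments since they depend on your environment.

```shell
#!/bin/sh
# 1. Test locally:   vagrant up && vagrant provision
# 2. Stage:          puppet apply --environment staging manifests/site.pp
# 3. Deploy:         boot fresh VMs, run the same manifests against them,
#                    then point the reverse proxy at the new instances and
#                    terminate the old ones.

# Render the reverse-proxy upstream block for the freshly built VMs
# (nginx-style syntax, purely for illustration):
render_upstream() {
  name=$1; shift
  printf 'upstream %s {\n' "$name"
  for ip in "$@"; do
    printf '    server %s:8080;\n' "$ip"
  done
  printf '}\n'
}

render_upstream myapp 10.0.0.11 10.0.0.12
```

Regenerating the proxy config from the list of new VMs, rather than editing it in place, keeps the cutover itself a single reload.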
> PFCT's deployment paradigm tends to be relative slow and error prone.
This much is true. Puppet's slow speed is particularly galling, but maybe that's just because I use it at work.
> isn't a cleaner solution, this is almost the solution you get when you use Puppet
The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.
More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration. This is what I meant by configuration drift.
CSIEs, by contrast, are essentially the complete product of the entire generation process, thus ensuring that future instantiations are identical. A subtle difference, but an important one.
1) You have not dealt with large enough data, since you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data.
2) You haven't thought about what and how those initial CSIE configurations are generated. Do you hand-tweak everything (make && make install all the software onto a particular installation of a particular OS) and then just spawn clones of that? That approach belongs in the dustbin of history. You now essentially have a black box that someone somewhere tweaked, with no recipe for repeating it. If that person left the company, it might be tricky to understand what was installed, where, and at what version.
If you have "configuration" drift, there needs to be a fix to the configuration declaration; people shouldn't be hand-editing and messing with individual production servers. If network operations fail in the middle, then the configuration management system needs to have better transaction management (maybe use OS packages instead of just ./configure && make && make install) so that if an operation fails, it is rolled back.
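The transactional idea here can be sketched without any package manager at all: build into a throwaway staging directory and only swap it into place if every step succeeded, so a failed run never touches the live tree. The paths and the `cp` standing in for a real build are illustrative.

```shell
#!/bin/sh
# Install $1 (a built release tree) as $2, atomically: either the swap
# completes or the previous tree is left untouched.
install_release() {
  src=$1 dest=$2
  stage=$(mktemp -d "${dest}.stage.XXXXXX") || return 1
  # Run the whole "build" into the staging dir; any failure aborts cleanly.
  if cp -R "$src"/. "$stage"/; then
    rm -rf "${dest}.old"
    [ -e "$dest" ] && mv "$dest" "${dest}.old"
    mv "$stage" "$dest"        # rename: old tree out, new tree in
  else
    rm -rf "$stage"            # rollback: production tree never touched
    return 1
  fi
}
```

OS packages give you the same property (plus dependency tracking) for free, which is the point being made above.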
you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data
There are many ways to take an image of an environment, not only VMs or snapshots. But if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.
You haven't thought about what and how those initial CSIE configurations are generated.
On the contrary, generation should be automated. In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.
> In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.
A mix of the two. Use salt/puppet/chef etc. to bootstrap a known OS base image into a stable production platform VM, for example, then spawn clones of that. I would do that, and I can see how it would work very well with testing.
> if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.
Without disagreeing with your conclusion about the design process, it's useful to note that this situation simply isn't a problem for a conventional configuration management tool.
> On the contrary, generation should be automated.
One could argue that Puppet and Chef are ideal tools for performing that automation.
The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.
CSIEs are still a generative process. The difference with what you call a PFCT is that the generative process isn't swept under the rug; it is only codified into a versioned image when that's really necessary for performance reasons.
The result is that it's easy to maintain a clear distinction between machine state and human instructions. For a trivial example: a list of packages that humans decided are necessary for the system versus the final output of 'dpkg -l' after all dependencies have been resolved.
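That distinction can be shown mechanically: diff the human-declared list against the full installed set to see what arrived only as a dependency. The file names are illustrative; on a real Debian host the second list would come from something like `dpkg-query -W -f '${Package}\n'`.

```shell
#!/bin/sh
# Print packages present in the installed list ($2) but absent from the
# human-declared list ($1), i.e. everything pulled in as a dependency.
dependency_only() {
  sort "$1" > /tmp/wanted.$$
  sort "$2" > /tmp/installed.$$
  comm -13 /tmp/wanted.$$ /tmp/installed.$$
  rm -f /tmp/wanted.$$ /tmp/installed.$$
}
```

Only the first list is a human instruction worth version-controlling; the second is derived machine state.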
With chef/puppet/etc. the code used to generate instances represents a human-created description of what the environment is supposed to look like, with as much version-control and referenced documentation as is necessary. With a versioned-image approach, all you have is the one-dimensional history of the image in question.
I fully advocate the description of build steps for environments, just as PFCTs encourage. However, the use of PFCTs to prepare and manage environments seems suboptimal in terms of potential for issues. I suppose a PFCT could be useful as a means to automate the generation of environments, but IMHO it should not be used for the live instantiation/configuration of real infrastructure (which should be more atomic, from some versioned/known quantity). A subtle difference, but an important one.
I forgot to mention this before, it is strange that you credit yourself with defining this term when it has been well defined for some time in ops.
> More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration.
But this adds a significant weight over and above the "generative process" of running manifests. Yes, running manifests against your VMs can "fail or change in unexpected ways due to xyz" - don't do it against VMs that are currently in production! I'm not sure you've ended up with anything less error prone, and you're still going to need a way to get from a fresh VM image to your output images - which is where Puppet would come in.
I'd really rather not make the entire disk image my build artefact, for fairly obvious reasons (ie: size).
You might like this, which is written by a colleague of mine, except that it is not in "opposition" to Puppet/Chef/etc:
"Adequate" is an interesting word :-) We don't consider DRBD in anything other than synchronous replication mode to be reliable, which puts a fair performance penalty on it.
Perhaps. The problem is, if you use a PFCT on a bunch of hosts and something subtle changes that can cause issues, the granularity of the PFCT doesn't necessarily equate to that required for detecting the cause. With a CSIE-style atomic approach to deployment and a properly segregated monitoring system, you can, say, 'roll back' to the last known good version. PFCTs leak state, and will not always allow you this reverse pathway (random examples might include kernel feature or compiler version migration).
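The "roll back to last known good" step is trivial when releases are immutable, versioned directories: activation is an atomic symlink flip, and rollback is the same operation pointed at the previous version. A minimal sketch; the `releases/` layout is illustrative and `mv -T` assumes GNU coreutils.

```shell
#!/bin/sh
# Point "current" at an immutable, versioned release directory. The rename
# is atomic, so readers always see either the old or the new version.
activate() {
  release=$1
  ln -sfn "releases/$release" current.tmp && mv -T current.tmp current
}

# activate v42   # deploy
# activate v41   # roll back: identical operation, previous known-good image
```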