Yes, there's more to it sadly. The hardware isn't in our control and UPSes are out of the question. The machines aren't on premises or in a datacentre.
Puppet works okayish but when you have a large fleet of what are basically IoT devices, you start getting an unpleasant failure rate. Push one bad network config and you've 'bricked' thousands of machines.
I am being a bit vague intentionally, hope you can understand. But to help imagine, the scale of our problem is pretty big, you've probably even had some interaction with one of our machines.
This isn't a hard problem space, but treating an IoT device as a devops-managed fleet of hosts is asking for a bad time. Most modern devops situations assume you can, worst case scenario, replace the machine outright with relatively low friction. This isn't true with IoT. I wouldn't recommend puppet at all.
For a household-name IoT device, we did 10,000+ hours' worth of testing on an array of devices for every update candidate. Think a walk-in closet chock full of devices covering every surface. This included thousands of hours of pure power-interruption scenarios, all automated.
We had a "alpha" and "beta" branch for internal users for ~a month before updates hit customers. If we bricked a device we could replace it.
For all update channels, we set a percentage of devices to deterministically receive updates. We start with 1% (probably 0.0001% now...) and double the rollout every day or so if things look good.
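For anyone wondering how the deterministic part can work: it's basically hashing a stable device ID into a bucket, so the same devices stay in the rollout as the percentage grows. A minimal sketch (the function name and salting scheme here are mine, not necessarily what we shipped):

    import hashlib

    def in_rollout(device_id: str, rollout_percent: float, release: str) -> bool:
        # Salt with the release ID so a different subset of devices goes first each release.
        digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        return bucket < rollout_percent / 100.0

    # Day 1: in_rollout(serial, 1.0, "fw-2.3.1"), day 2: 2.0, day 3: 4.0, ...

Doubling the percentage strictly widens the set: any device whose bucket is under 1% is also under 2%, so devices never drop back out of a rollout.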
We had the ability to "roll back" to the last good firmware, which is stored on a partition on the device. This is almost never used.
Devices update in the background, unobtrusively. Once the update is complete, the device waits for a quiet window in which to reboot and try out the new firmware. Boot-time tests and a custom watchdog monitor the device to make sure everything works: networking comes up, the filesystem is healthy, and all services start normally. If there is an anomaly, the device reboots back to the previous firmware, and this is reported home.
After some period of stability (minutes to hours usually) the device marks the update as good and will keep using it. If the device crash loops three times in a row within some window, it reverts to the previous version.
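For a rough idea of what the "mark good or revert" logic can look like, here's a minimal sketch: the paths and thresholds are made up, the health check leans on systemd as an assumption, and the actual slot switch is stubbed out because it's device specific.

    import json, subprocess, time
    from pathlib import Path

    STATE = Path("/var/lib/updater/boot-state.json")   # hypothetical path
    MAX_CRASH_LOOPS = 3
    STABILITY_WINDOW_S = 30 * 60                        # "minutes to hours"

    def load_state() -> dict:
        try:
            return json.loads(STATE.read_text())
        except (FileNotFoundError, ValueError):
            return {"pending": False, "boot_attempts": 0}

    def save_state(state: dict) -> None:
        STATE.parent.mkdir(parents=True, exist_ok=True)
        STATE.write_text(json.dumps(state))

    def health_ok() -> bool:
        # Replace with real checks: network up, filesystem writable, services running.
        return subprocess.run(["systemctl", "is-system-running", "--wait"],
                              capture_output=True).returncode == 0

    def revert_to_previous_slot() -> None:
        # Device-specific: flip the bootloader's slot flag, report home, reboot.
        ...

    def on_boot() -> None:
        state = load_state()
        if not state["pending"]:
            return                      # already running a known-good image
        state["boot_attempts"] += 1
        save_state(state)               # persists across crash loops
        if state["boot_attempts"] > MAX_CRASH_LOOPS or not health_ok():
            revert_to_previous_slot()
            return
        time.sleep(STABILITY_WINDOW_S)  # or schedule a timer instead of blocking
        state.update({"pending": False, "boot_attempts": 0})
        save_state(state)               # new firmware is now "good"

Because the attempt counter is bumped before the health check and only cleared after the stability window, a crash at any point just increments it on the next boot until the revert fires.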
Yes, losing power sucks, and yes, devices will get bricked and filesystems will get corrupted, but you can do a lot to minimize it: redundant partitions, leaving ample room in flash storage for bad blocks to prolong the device's life, tuning log storage to avoid wasting write cycles and losing data on power loss, lots and lots of metrics, and lots and lots and lots of automated integration tests.
If you're really struggling, I can help. Lots of fun problem solving in this space.
My impression of GGP’s post is not that you can’t solve these problems, you clearly can, but that not [m]any distros are interested in tackling this problem (or if they are, have not effectively or cheaply accomplished such a goal). So, you have to build a bespoke solution (like you’ve done). To be fair, to qualify as a solved problem in the industry one would expect either a standard, possibly community-supported, software implementation of the important parts, or at least documentation such that others could read up, learn about your wins, and apply them consistently to their projects, or both.
Truthfully, the only "distro" that comes close to solving these problems is Yocto, which is really a distro-builder for embedded devices. Yocto & Mender work pretty nice. You could probably get Mender working with Ubuntu Core.
That's actually 90% of the solution these days: Use Mender. It wasn't around when I did this, but we likely would've used it, as we built essentially the same thing.
The other 90% is quality control. Distros can't really solve that for you.
> To be fair, to qualify as a solved problem in the industry one would expect either a standard, possibly community-supported, software implementation of the important parts, or at least documentation such that others could read up, learn about your wins, and apply them consistently to their projects, or both.
These are solved problems. Mender exists and documents almost everything I mentioned. Google has published incredibly thorough technical documents describing how Chromecasts and Chromebooks update, and at least the latter solution is Open Source.
If you look, you'd see these problems are not novel, which is why I declared it "not a hard problem space." The prior art is tremendous.
I got _super_ lucky at the startup where I was responsible for OS/updates/security for our IoT devices. Between the design stage and the production run, the price of 4GB SD cards dropped below the price of the 2GB cards on our BOM, so I had an entire spare partition to play with where I could keep a "spare" copy of the entire device image. And we had a "watchdog" microprocessor that could switch the main processor's boot config if it failed to boot. (We were basically running a RaspberryPi and an Arduino connected together. The prototypes were exactly that, the final hardware was an iMX233 and an Atmel328 on our own custom board.)
We used Arch Linux with our own pacman repo, so the devices all pulled their own updates automatically. (Also it was super low risk, these were xmas tree lights, so our problem was "we don't want to ruin anyone's xmas!" instead of "If we fuck this up people might go broke and/or die...")
Christmas tree lights with 4 GB of storage. I'm sure this was a superb product, and a result of sensible decisions, but there is nevertheless something hilarious about that.
In the case of bricking, a strategy I've seen is to have two partitions in your flash (I'm imagining your device has flash?); a watchdog can then verify the health of your deployment, and if the deployment is unhealthy, it can boot from the known-good partition.
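On U-Boot based devices this is often wired up with the bootcount mechanism (a sketch, assuming CONFIG_BOOTCOUNT_LIMIT is enabled and that boot_slot_a/boot_slot_b are your own slot-selection commands): U-Boot bumps bootcount on every boot and falls back to altbootcmd once bootlimit is exceeded, and userspace clears the counter only after its health checks pass.

    => setenv bootlimit 3
    => setenv bootcmd 'run boot_slot_a'
    => setenv altbootcmd 'run boot_slot_b'
    => saveenv

    $ fw_setenv bootcount 0   # from Linux (u-boot-tools), once the watchdog is happy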
Networking-specific: if you're using NetworkManager you can set up fallback network profiles that kick in if the primary connection fails. That could be a secondary IP, the last known-good profile, or plain DHCP.
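Something like this with keyfiles, for example (profile names and addresses are made up): give the primary profile a higher autoconnect-priority and a plain DHCP profile a lower one, so NetworkManager tries the fallback if the primary fails to activate.

    # /etc/NetworkManager/system-connections/primary-static.nmconnection
    [connection]
    id=primary-static
    type=ethernet
    interface-name=eth0
    autoconnect=true
    autoconnect-priority=10

    [ipv4]
    method=manual
    address1=192.0.2.10/24,192.0.2.1
    dns=192.0.2.1;

    # /etc/NetworkManager/system-connections/fallback-dhcp.nmconnection
    [connection]
    id=fallback-dhcp
    type=ethernet
    interface-name=eth0
    autoconnect=true
    autoconnect-priority=0

    [ipv4]
    method=auto

(Keyfiles need to be root-owned with 0600 permissions, and how quickly NM gives up on the primary depends on connection.autoconnect-retries.)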
With regards to bricking thousands of machines, with great power comes great responsibility. Is there no way to test the config before mass deployment? In batches or a canary deployment?