
I work in a different industry and am responsible for maintaining a fleet of bare metal OSes (we currently use Ubuntu).

Bare metal management really feels like an unsolved problem. Whilst everybody working with cloud environments is whisked away by the latest shiny tools like Docker and Ansible, those of us working with bare metal are still trying to find a way to keep machines up and running with an OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config.

The only existing candidate I've seen is Balena, but it only supports specific hardware and the cost is probably so high that we wouldn't be making a profit if we went with it.

At my current employer we are building a custom flavor of Ubuntu and provision it with Puppet, but we still get failures, and it's far from the immutable haven that DevOps guys would be used to.




> those of us working with bare metal are still trying to find a way to keep machines up and running with an OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config.

I'm guessing there's more to this story than you've summarised because those points are pretty easily solved with:

- UPS (if the power outs are that much of a problem then you might need to invest in a generator as well).

- iLO / IPMI (remote management). Though even just running a serial cable out the back of the server is good enough for a remote console in the event of a network failure.

As for managing their config, the usual tools like Ansible and Puppet work just as well (in some cases actually better, since they were initially designed for on-prem hardware). Likewise for Docker. So don't think you can't run those tools on bare metal Linux. But if you don't want the containerisation-like aspects of Docker but still want the deployment tooling, then you can go a long way with git and shell scripts.
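To give a rough idea of what I mean by "git and shell scripts" (the repo URL and paths here are made up for the example), something like this run from cron on each box covers a surprising amount of ground:

    #!/bin/bash
    # Pull-based deploy: fetch the latest config repo and apply it.
    # Everything in apply.sh should be idempotent so re-runs are harmless.
    set -eu
    REPO=https://git.example.com/ops/server-config.git   # placeholder repo
    DIR=/opt/server-config

    if [ -d "$DIR/.git" ]; then
        git -C "$DIR" pull --ff-only
    else
        git clone "$REPO" "$DIR"
    fi

    # apply.sh holds the actual steps: install packages, drop config files,
    # restart services that changed, etc.
    bash "$DIR/apply.sh"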

While DevOps really came into popularity with cloud hosting, there's nothing fundamentally new about a lot of the tooling that wasn't possible in the old days of bare metal UNIX and Linux. Us older sysadmins were doing a lot of the same stuff back then too, we just didn't give it trendy names.

10+ years ago I was installing Linux remotely on bare metal hardware via a console server. Then once I had one machine installed I'd take one HDD out of the mirrored RAID array and plug it into the 2nd server so the RAID controller would clone it. I could repeat the disk swapping as many times as I wanted as long as I remembered to change the IPs (again, via remote serial console). And this was the lazy way of deploying a fleet of servers quickly. More professional places would push the install onto the server via netboot cloning solutions.


Yes, there's more to it sadly. The hardware isn't in our control and UPSes are out of the question. The machines aren't on premises or in a datacentre.

Puppet works okayish but when you have a large fleet of what are basically IoT devices, you start getting an unpleasant failure rate. Push one bad network config and you've 'bricked' thousands of machines.

I am being a bit vague intentionally, hope you can understand. But to help imagine, the scale of our problem is pretty big, you've probably even had some interaction with one of our machines.


This isn't a hard problem space, but treating an IoT device as a devops-managed fleet of hosts is asking for a bad time. Most modern devops situations assume you can, worst case scenario, replace the machine outright with relatively low friction. This isn't true with IoT. I wouldn't recommend puppet at all.

For a household-name IoT device, we did 10,000+ hours' worth of testing on an array of devices for every update candidate. Think a walk-in closet chock full of devices covering every surface. This included thousands of hours of pure power-interruption scenarios, all automated.

We had a "alpha" and "beta" branch for internal users for ~a month before updates hit customers. If we bricked a device we could replace it.

For all update channels, we set a percentage of devices to deterministically receive updates. We start with 1% (probably 0.0001% now...) and double the rollout every day or so if things look good.
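For anyone curious, the deterministic part is nothing fancy: hash a stable device ID into a bucket and compare it against the current rollout percentage. A minimal sketch (the ID source and the hard-coded percentage are placeholders; in practice the percentage came from the update backend):

    #!/bin/bash
    # Decide locally whether this device is in the current rollout cohort.
    # The same device always lands in the same bucket, so raising the
    # percentage only ever adds devices, it never reshuffles them.
    DEVICE_ID=$(cat /etc/machine-id)   # any stable per-device ID works
    ROLLOUT_PERCENT=1                  # served by the update backend in reality

    # Hash the ID, take the first 8 hex chars, map into 0-99.
    BUCKET=$(( 0x$(printf '%s' "$DEVICE_ID" | sha256sum | cut -c1-8) % 100 ))

    if [ "$BUCKET" -lt "$ROLLOUT_PERCENT" ]; then
        echo "in rollout cohort, fetching update"
    else
        echo "not eligible yet"
    fi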

We had the ability to "roll back" to the last good firmware, which is stored on a partition on the device. This is almost never used.

Devices update in the background, unobtrusively. Once the update is complete, the device waits for a quiet window in which to reboot and try out the new firmware. Boot-time tests and a custom watchdog monitor the device to make sure everything works, including networking, filesystem, all services start up normally. If there is an anomaly, the device reboots back to the previous firmware, and this is reported home.

After some period of stability (minutes to hours usually) the device marks the update as good and will keep using it. If the device crash loops three times in a row within some window, it reverts to the previous version.
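The boot-time gate itself was a small watchdog service. A stripped-down sketch of the idea (the switch-to-previous-slot and mark-slot-good helpers and all paths here are hypothetical stand-ins, and the real checks were far more thorough):

    #!/bin/bash
    # Runs early at boot. If we're trying out new firmware, count the attempt,
    # roll back after three failed boots, and only commit the update once the
    # health checks pass.
    TRIAL_FLAG=/data/update-trial       # created by the updater before rebooting
    BOOT_COUNT_FILE=/data/boot-count

    if [ -f "$TRIAL_FLAG" ]; then
        count=$(( $(cat "$BOOT_COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
        echo "$count" > "$BOOT_COUNT_FILE"

        if [ "$count" -gt 3 ]; then
            switch-to-previous-slot     # hypothetical helper: flips the bootloader env
            reboot
            exit 1
        fi

        # Health checks: services came up, network works, we can reach home.
        if systemctl is-system-running --wait && ping -c 3 updates.example.com; then
            rm -f "$TRIAL_FLAG" "$BOOT_COUNT_FILE"
            mark-slot-good              # hypothetical helper: commits the new firmware
        fi
        # If the checks fail we leave the trial flag in place, so the next
        # crash/reboot bumps the counter and eventually triggers the rollback.
    fi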

Yes, losing power sucks, and yes devices will get bricked and the filesystem will be corrupt, but you can do lots to ensure this is minimized. Like redundant partitions, leaving ample room in flash storage for bad blocks to prolong the device life, tuning log storage to avoid wasting write cycles and losing data on power loss, lots and lots of metrics and lots and lots and lots of automated integration tests.

If you're really struggling, I can help. Lots of fun problem solving in this space.


My impression of GGP’s post is not that you can’t solve these problems, you clearly can, but that not [m]any distros are interested in tackling this problem (or if they are, have not effectively or cheaply accomplished such a goal). So, you have to build a bespoke solution (like you’ve done). To be fair, to qualify as a solved problem in the industry one would expect either a standard, possibly community-supported, software implementation of the important parts, or at least documentation such that others could read up, learn about your wins, and apply them consistently to their projects, or both.


Truthfully, the only "distro" that comes close to solving these problems is Yocto, which is really a distro-builder for embedded devices. Yocto & Mender work pretty nice. You could probably get Mender working with Ubuntu Core.

That's actually 90% of the solution these days: Use Mender. It wasn't around when I did this, but we likely would've used it, as we built essentially the same thing.

The other 90% is quality control. Distros can't really solve that for you.

> To be fair, to qualify as a solved problem in the industry one would expect either a standard, possibly community-supported, software implementation of the important parts, or at least documentation such that others could read up, learn about your wins, and apply them consistently to their projects, or both.

These are solved problems. Mender exists and documents almost everything I mentioned. Google has published incredibly thorough technical documents describing how Chromecasts and Chromebooks update, and at least the latter solution is Open Source.

If you look, you'd see these problems are not novel, which is why I declared it "not a hard problem space." The prior art is tremendous.

10, 15 years ago? Not so much.


> a large fleet of what are basically IoT devices

I got _super_ lucky at the startup where I was responsible for OS/updates/security for our IoT devices. Between the design stage and the production run, the price of 4GB SD cards dropped below the price of the 2GB cards on our BOM, so I had an entire spare partition to play with where I could keep a "spare" copy of the entire device image. And we had a "watchdog" microprocessor that could switch the main processor's boot config if it failed to boot. (We were basically running a RaspberryPi and an Arduino connected together. The prototypes were exactly that, the final hardware was an iMX233 and an Atmel328 on our own custom board.)

We used Arch Linux with our own pacman repo, so the devices all pulled their own updates automatically. (Also it was super low risk, these were xmas tree lights, so our problem was "we don't want to ruin anyone's xmas!" instead of "If we fuck this up people might go broke and/or die...")


Christmas tree lights with 4 GB of storage. I'm sure this was a superb product, and a result of sensible decisions, but there is nevertheless something hilarious about that.


In the case of bricking, a strategy I've seen is to have two partitions in your flash (I am imagining your device has flash?); then a watchdog can verify the health of your deployment, and if the deployment is unhealthy it can boot from the known-good partition.

Hopefully it makes some sense!


Networking-specific: if you're using NetworkManager you can set up fallback network profiles which kick in if the primary connection fails. That could be a secondary IP, the last good profile, or DHCP.
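Roughly along these lines with nmcli (interface, addresses and profile names are just examples; the behaviour hinges on the autoconnect-priority/retries properties, so check the NetworkManager docs for your version):

    # Primary profile: static IP, highest autoconnect priority, limited retries.
    nmcli con add type ethernet ifname eth0 con-name primary-static \
        ipv4.method manual ipv4.addresses 10.0.0.50/24 ipv4.gateway 10.0.0.1
    nmcli con modify primary-static connection.autoconnect-priority 20 \
        connection.autoconnect-retries 3

    # Fallback profile: plain DHCP, lower priority, picked up when the
    # primary profile gives up.
    nmcli con add type ethernet ifname eth0 con-name fallback-dhcp ipv4.method auto
    nmcli con modify fallback-dhcp connection.autoconnect-priority 10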

With regards to bricking thousands of machines, with great power comes great responsibility. Is there no way to test the config before mass deployment? In batches or a canary deployment?


ahhh yes IoT is another field entirely. I can appreciate your frustrations there.


I bet it is a Tesla CAR:)


That is an option with servers, but not with "desktop" computers (in my case, custom industrial PCs with touch interfaces).

There is no really great deployment story. In the very best case, you have to connect a keyboard, a PXE LAN cable, change some things in the BIOS, and select the OS you want to clone. But creating the image in the first place takes the better part of a day. Another option is to maybe clone the SSD beforehand. It gets more complicated if you have to flash a certain BIOS, change settings, and so on. Ideally, we'd just want to connect one cable bundle and "pressure tank" the new system with OS and configuration in a couple of minutes.

This is for the PCs that we sell, the story for the office laptops we use is even worse. There are good tools in the Windows world for deployment, but they all seem geared for installations of 1000s of computers. What if the office you manage (on the side of your normal work) just has 20? There is little point setting up SCCM, WSUS, or newer stuff like Autopilot (which seems pretty cool, but I couldn't figure out how to install MS office with the user's license, or how to install an ERP from Microsoft themselves... that should be 1-click or 1 line of code if you offer such a solution).

What I'd really like is a mixture of Ansible or Puppet with a stupid simple monitoring GUI. Then I'd be able to say

    choco install firefox,7zip,vlc
    install office,erp
    join-domain mycorp,$credentials
boot from USB, hit a few keys, and come back to a deployed PC later. AND be able to see the PC in a simple desktop app, where I can ping it and see who's logged in, what updates are missing etc..


You're talking about Windows though. This topic is about Linux. I don't pretend to specialise in automation on Windows like I do with Linux but what experiences I do have managing Windows instances have all been painful (regardless of whether they were a desktop or server) because the problems require a completely different mindset to solve and half the time those solutions are only semi-effective. So I do feel your pain there.

For what it's worth, I've had some success with Powershell for package management and domain management, and tools like Clonezilla / Norton Ghost for managing images on small to medium sized fleets of machines (again, both desktop and server). There are also a plethora of tools that can interrogate what machines are on a given network, the software installed and their patch levels -- but most of them are not going to be free. However there definitely are alternative options to SCCM and WSUS if they're too "enterprisey" for your needs (I've used a few different ones but I'm afraid I can't recall the names of the more effective solutions in terms of ease of use and features vs license fee).


I agree, Windows is the main difficulty here, but we also ship (Desktop) Ubuntu. It's much more amenable to command line tools, but probably still nowhere close to what people working with disposable VMs on the cloud are used to.


>As for managing the config of them, the usual tools like Ansible and Puppet work just as well (in some cases actually better since they were initially designed for on-prem hardware).

I hear this a lot, but do they really? As best I can tell they require writing everything from scratch. I went down this path just as an exercise to see what end-users were having to deal with. Everything in Ansible appears to be re-inventing the wheel.

Sure it's powerful, and it provides a robust framework to do everything, but out of the box it does NOTHING on its own. Want to update ilo(m)? Someone probably has a playbook somewhere you can build on, but you have to go find the playbook, and modify it to meet your needs. Same with updating an OS, or software package or X.

There's no "point at this server, scan the server, figure out the hardware model, os, application packages, and build a default list of things you might want to manage".

Cisco UCS handles a chunk of this for the physical hardware, but it has plenty of issues itself. Same with HPe and Synergy, and Dell/EMC with Openmanage (this may have been replaced, I honestly haven't had to deal with Dell recently).

Honestly it gets a bit frustrating when the response is always "there's this tool that can do it if you just spend 6 months customizing it to your environment!". I think what OP is asking for is some intelligence to automatically discover.

If Ansible can do that, I'm all ears, but I haven't figured out how.


I've used Ansible to manage thousands of bare metal servers; you do have to build some stuff yourself but it mostly just works. Ansible's fact gathering is pretty neat: you can do X or Y depending on facts from hosts, you can run it in pull mode or push mode, through AWX or Jenkins. A lot of freedom.
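Even the ad-hoc side is handy for a quick fleet survey, e.g. (the inventory path is just an example):

    # Dump a subset of the facts Ansible gathers from every box
    # (distribution, hardware model, interfaces, ...).
    ansible all -i inventory/production -m setup -a 'filter=ansible_distribution*'
    ansible all -i inventory/production -m setup -a 'filter=ansible_product_name'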

If you want something that just works automatically and is free, well, I don't think you will find it. Learning Ansible (or another configuration management tool) well will pay dividends through a lot of your career tho.


> I hear this a lot, but do they really? As best I can tell they require writing everything from scratch. I went down this path just an exercise to see what end-users were having to deal with. Everything in ansible appears to be re-inventing the wheel.

Other people's Ansible playbooks are often nice, because people mostly only publish things that stick to "best practice". I'll often use these if they exist and seem well thought out (and can do what I need done).

For things where there isn't an obvious "good choice of existing playbook", I'll sometimes write a "proper" one (we've got a quite good one here that got used a lot when we were deploying a lot of fundamentally similar Grails/Tomcat/Apache apps with a bunch of common dependencies), or I'll just resort to using Ansible to run remote commands over ssh. If I know how to do it on the server from the command line, it's trivial to do exactly the same thing using Ansible from my laptop (or our bastion or config host). The trick here is making sure you don't write Ansible/ssh that screws up if you run it twice - which mostly isn't too hard to avoid so long as you remember to do so.
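Concretely, the "safe to run twice" part mostly means preferring modules over raw shell, or guarding the shell command when you can't avoid it. The host group, package and script names below are just examples:

    # Idempotent by construction: the apt module only acts if nginx is missing.
    ansible webservers -b -m apt -a 'name=nginx state=present update_cache=yes'

    # When you need a raw command, guard it so a second run is a no-op.
    ansible webservers -b -m shell -a '/usr/local/bin/ssl-setup.sh creates=/etc/nginx/certs/issued.pem'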

> I think what OP is asking for is some intelligence to automatically discover.

> If Ansible can do that, I'm all ears, but I haven't figured out how.

I suspect this is a difference in approaches. It's easier in the cloud/vm world, but I still treat bare metal servers more like cattle than pets. If I get a problem like "OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config." I just get the box reimaged to a known state, then run the Ansible "update and deploy/configure from scratch" on it.

I don't think trying to build something to "intelligently auto discover" and repair a corrupted OS from hard powerdowns or a totally botched configuration change is a good use of my time... I'll just stand up another server from scratch using tried/tested automation.


Something else to consider is all the edge cases you'll hit with tools like Ansible and Puppet. I'm pretty sure neither of them handles apt getting into a locked state because an operation was interrupted. So you end up having to write some hacky script to detect that and fix it.
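The kind of hacky self-healing script I mean looks roughly like this (assumes a recent Ubuntu/Debian where the frontend lock lives at /var/lib/dpkg/lock-frontend; older releases use /var/lib/dpkg/lock):

    #!/bin/bash
    # Recover from an apt/dpkg run that was interrupted (e.g. by a power cut).
    # Only acts if nothing is actually holding the lock right now.
    if fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; then
        echo "apt/dpkg is genuinely running, leaving it alone"
        exit 0
    fi

    # Finish any half-configured packages and repair broken dependencies.
    dpkg --configure -a
    apt-get -f install -y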

Ansible is not even an option for IoT because it doesn't scale well and requires a stable connection. Puppet's agent model works better for large fleets especially if you're crossing multiple gateways.


No luck with Ansible pull? Sounds like it may be ideal for your use case. Git pulls are often incremental and compressed, and going over https may prove to be more stable than ssh over unstable connections. Since Ansible is executed locally, there's no need to have a persistent ssh connection to run your playbooks.
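A minimal setup is just a cron entry on each device, e.g. (repo URL and playbook name are placeholders; this is /etc/cron.d syntax, hence the user field):

    # Every device pulls and applies its own config every 30 minutes.
    # -o / --only-if-changed skips the run if the repo hasn't changed.
    */30 * * * * root ansible-pull -U https://git.example.com/ops/fleet-config.git -o local.yml >> /var/log/ansible-pull.log 2>&1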


Ansible-pull doesn’t require a stable connection. It’s just a way to download a playbook and execute it unattended.


Ansible can also run in pull-mode, just like Puppet. Is there anything that hasn't scaled well that you could share?


Neither Ansible nor Puppet takes 6 months to configure. Yes you have to write the code (and/or pull pre-existing modules) to make the config changes you want, but that's the point of them: they are there to automate the changes of config on servers. If you don't want to change the config of a base install then there is nothing to write. And given config is going to be specific to your environment then of course you're going to need to customise your config management code. No two companies are going to want their servers set up exactly the same.


>Neither Ansible nor Puppet takes 6 months to configure. Yes you have to write the code (and/or pull pre-existing modules) to make the config changes you want, but that's the point of them: they are there to automate the changes of config on servers.

So if I've got 40 different models of servers, with 10 different models of switches, 8 different operating systems, you're telling me right now there's a playbook for every possible hardware configuration I could have that will update all of that to the latest firmware without significant work on my part? Where is this magical repo?


I think you misunderstand how Ansible is supposed to work. You can put in the work (much less than ~ 6 months, I'm available for freelancing!) and have it working. By then you will be able to run your command :-)

You will also be able to share as much of the deployment as you want/can between the different models/OSes. Templating configuration variables based on their characteristics.

What you're asking to is akin to "Why cannot I download an app that does what $PRODUCT does but for my business for free? Why do I need to write my own software?"


Free? I’m not sure if that’s a joke. Ansible licensing is about as far from free as you can get.

Having to hire someone to customize ansible is the exact opposite of what op asked for and just reaffirms my point: ansible does nothing without significant customization.


Ansible licensing? I must be missing something! Do you mean Ansible Tower? Ansible is free and OSS. It has AWX which is the Tower equivalent but free.

Yes. Ansible is a framework with plugins. Just like Django or Ruby on Rails.


What other ansible licensing would I be talking about? AWX has no support, it is a non-starter in the enterprise.

So again: Ansible is not an answer for someone looking for a ready-made solution to managing bare metal, which was my entire premise. Telling people to just use Ansible or Puppet when they're asking for a solution to the problem is just barely more helpful than telling them you can do it with a bash script and SSH for a consulting fee. It's pedantic and misses the point entirely.


I've used Ansible to manage bare metal. It worked great. I've used Ansible in high availability enterprise environments. Again, it worked great.

I'm not as big of a fan of Puppet but actually puppet also works great for on-prem systems given that's what it was originally designed for. If anything, Puppet makes less sense in the cloud than it does on bare metal.

I'm not disputing you have a complex problem but that just means you need to spend a little more time tuning your solution (not less time like you seem to assume).

And if you want my advice about how to approach a daunting build: break your problem down. First start with delivering easy systems which will have the least impact if it goes wrong. This is to get your confidence up in working with the tool. Then start picking the harder targets that will give you the most reward, so even if your project ends up unfinished you've still fixed the biggest problems in your org. Then work your way backwards until everything is fixed. After a while, some of the easier deliveries will become background jobs you can fit in between support queries or half day sprint tickets (depending on whether you Kanban or Sprint). Before you know it, you'll have everything automated and realise it was far less painful than it appeared before you'd started the project.

Disclaimer: I'm a DevOps Manager who has transitioned several orgs through this process :)


AWX (and Tower) have a different use case than the base apps. And I think you'd be surprised how heavily it's used in the enterprise. You don't need to pay thousands to get battle-tested tools.


You don't need to use Tower to use Ansible.

You can use Ansible to manage bare metal - I've done it at two different companies.

Trust me, it works great.


> Having to hire someone to customize ansible is the exact opposite of what op asked for and just reaffirms my point: ansible does nothing without significant customization.

When you decide that your cattle is going to be pets and pets get individual care and feeding, you have to hire a lot of people to do the individual care and feeding.

I've done management of ~2k Linux systems across 17 generations with Ansible. It is not a big deal. You enforce conventions so you no longer have 2k different servers but rather a fleet of Betsys, a fleet of Franks, a dozen Marshas and a couple of Jacks. And you do not touch the boot configuration because even in 2020 you do not need to touch the boot/network configuration.


It's amazing how many places people want computers that cannot use UPSes or even small batteries to provide for a controlled shutdown.


I'm a little surprised we don't see small battery backup systems built into PCs or available as an add-on. One thing I really like about laptops is that they essentially come with a built-in UPS, and I would love (some of) the desktops I work with to have that functionality built-in as well.

Somebody please correct me if this thing exists and is available, I'd love to be wrong!


Managed 500 physical machines over the course of 20 years, no UPS, never had a problem where power loss caused data corruption or loss.


Lucky you. I have not been that lucky. Heck, I had one go up in flames even though it had a surge protector because of an up-down-up-down event. F'n video card.


Better hope you don’t use any security feature that requires a stored randomly generated number to be unique.


I did this for a few years with cobbler (https://cobbler.github.io/). cobbler, pxe, bootp, tftp, ansible and friends pretty much solve this problem. In fact, if you know the mac addresses (or ranges) you can fairly easily designate groups of machines, roles, and the like.

Years ago I did this to "rapidly" provision a couple of thousand machines we bought for the stock exchange. You can do most of your testing locally in vagrants, even simulating the networks you need to provision.

You can go a step further and trigger api updates at the end of your ansible runs so that cobbler updates collins (https://tumblr.github.io/collins/) so that you can track inventorying, cm changes and the like. At one point in time we reimaged entire subnets when they lost power - and ran tests against the hardware to ensure the machines functioned post-provisioning.

As far as I know the bourse still uses this system.


> Bare metal management really feels like an unsolved problem. Whilst everybody working with cloud environments is whisked away by the latest shiny tools like Docker and Ansible, those of us working with bare metal are still trying to find a way to keep machines up and running with an OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config.

It is and isn't solved. It usually takes a lot of work or custom scripts. One of the best is the Nerves Project, which is what I use for IoT deployments [0] or even simple cloud deployments.

Nerves is set up to run Elixir/Erlang, but it's really just a wrapper around buildroot, and Elixir can start programs in any language desired with some work. One of the core authors wrote a tool called `fwup` for doing immutable updates on Linux [1]. The ability to do an A/B update and have the device do an automatic rollback if an update fails is crucial.

A year or so ago they changed the default boot process to still allow networking and remote connections to work even if the main application crashes. Surprisingly it's all done using Erlang tooling, AFAICT. There are still rough edges, like limited ipv6 support. You can still get devices failing from dead SD cards -- even if your system boots from a read-only partition, power outages during a write on any partition can effectively destroy the SD card, so skip the SD cards.

0: https://www.nerves-project.org/

1: https://github.com/fwup-home/fwup


More and more guix or nixos seem like good practical choices for these kinds of use cases. I prefer guix as it's a little less finicky (though https://gitlab.com/nonguix/nonguix is probably required for all non-purists such as this lowly worm).

Ubuntu OS is now stinky doodoo, which is a shame as it used to be the cat's pajamas for ease of use. Snapd is a debacle.


I totally agree except that I prefer NixOS because it supports ZFS so well.


Snap definitely doesn't belong on servers - our flavor of Ubuntu doesn't have it. If you bootstrap from Ubuntu Base you can cut out a lot of that crap.


Have you considered FreeBSD? Many NAS devices use it, and those are sometimes updated remotely, etc. There's less churn. Keep most of your work in jails, which have been production quality since 10 years ago.


I'm running a few bare metal Fedora CoreOS nodes and it's been a dream with IPMI, ignition files, containers, and Ansible to orchestrate it all. Haven't used Flatcar Linux but I have seen the experience is similar.


A bit confused here. Isn't CoreOS deprecated since ~May 2020?


I understand your confusion.

CoreOS has indeed since been deprecated, however when Red Hat acquired CoreOS in Jan 2018 they merged CoreOS and Project Atomic to create Red Hat CoreOS. They also released Fedora CoreOS, which shares some technologies with vanilla Fedora and Silverblue.

Shortly after the acquisition, Flatcar Container Linux was released which is an updated derivative of the original CoreOS.


Wow, thank you for this, I had no idea. What a mess made by this acquisition, reusing the name of a deprecated product isn't really a great idea. Though I'm glad to know there are ways to go back to a CoreOS-like experience, that has been my favorite server setup for a while.

Thanks again!


Agreed, the original CoreOS is by far my favorite server OS so far. I was very sad to see it get sold to RedHat, though few other people shared my concern at the time[0]. I wish I could say my concern turned out to be unjustified, but alas here we are.

I do recommend Flatcar linux though, and strongly encourage everyone to check it out. Not as jazzed about the name, but what’s in a name?

[0] https://news.ycombinator.com/item?id=16270382


Have you played with NixOS? The initial learning curve is steep, but the payoffs are pretty nice.


I've had my eye on it but haven't played with it yet. From a bird's eye view and a discussion with a friend, it doesn't sound ready for production, but I'm not ruling it out.

I think it would be a battle to convince my colleagues and managers to try NixOS precisely because of the learning curve and lack of experts in the hiring pool out there.


I agree that it's completely unsolved. I work in mobile robots and it's just maddening the number of solutions which either won't work if you're headless or won't work if you'll need kernel upgrades, or won't work if you'll need seamless rollbacks, or won't work if you'll want OTA updates or won't work if you'll need to be able to reconfigure the network in-band (not by exiting your app for some outside management GUI).

We ended up with a mostly hand rolled system based on kexec, grub chainloading, and deploying rootfs images as big tarballs. It works, but I really wish there was something we could have just taken off the shelf.


Basically, you want read-only systems with separate partitions for persistent data. Have two images to boot from and overwrite just one of them when you upgrade. This is how Balena does it and it’s a pretty common setup for IoT stuff.
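A stripped-down sketch of the update side of that scheme (the partition names, URL, and the set-next-boot helper are all invented for the example):

    #!/bin/bash
    # Write the new image to whichever root partition is NOT currently booted,
    # verify it, then point the bootloader at it for the next boot.
    set -eu
    CURRENT=$(findmnt -n -o SOURCE /)                  # e.g. /dev/mmcblk0p2
    if [ "$CURRENT" = /dev/mmcblk0p2 ]; then
        TARGET=/dev/mmcblk0p3
    else
        TARGET=/dev/mmcblk0p2
    fi

    curl -fsSL https://updates.example.com/rootfs.img -o /tmp/rootfs.img
    curl -fsSL https://updates.example.com/rootfs.img.sha256 -o /tmp/rootfs.img.sha256
    ( cd /tmp && sha256sum -c rootfs.img.sha256 )      # refuse to flash a corrupt download

    dd if=/tmp/rootfs.img of="$TARGET" bs=4M conv=fsync
    set-next-boot "$TARGET"                            # hypothetical helper: grub-editenv, fw_setenv, ...
    reboot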



It’s also not difficult to implement from scratch. Pretty much any linux distro with a bootable USB installer already supports it out of the box.


Have you tried Fedora Silverblue? It uses OSTree technology to offer many of the features of Ubuntu Core including automatic updating, upgrades that work regardless of power loss, Flatpak confinement for all apps (without any arbitrary restrictions like Snaps), and it's free.

"Silverblue is a variant of Fedora Workstation. It looks, feels and behaves like a regular desktop operating system, and the experience is similar to what you find with using a standard Fedora Workstation.

However, unlike other operating systems, Silverblue is immutable. This means that every installation is identical to every other installation of the same version. The operating system that is on disk is exactly the same from one machine to the next, and it never changes as it is used.

Silverblue’s immutable design is intended to make it more stable, less prone to bugs, and easier to test and develop. Finally, Silverblue’s immutable design also makes it an excellent platform for containerized applications as well as container-based software development. In each case, applications (apps) and containers are kept separate from the host system, improving stability and reliability.

Silverblue’s core technologies have some other helpful features. OS updates are fast and there’s no waiting around for them to install: just reboot as normal to start using the next version. With Silverblue, it is also possible to roll back to the previous version of the operating system, if something goes wrong."

https://silverblue.fedoraproject.org


Fedora CoreOS might be better in this regard since that is designed for server use case. Silverblue took that technology and applied to the desktop.

Silverblue user here. Haven't tried CoreOS myself.


What about Fedora IoT?


That seems to be geared towards edge servers and IoT use cases. Otherwise, all of them use the same technologies.

Also one thing to note is that CoreOS as well as IoT is bootstrappable with a configuration file via Ignition (a tool for automation).


You might also look at Canonical MAAS (maas.io) which will get you some of the way there too.

Though that's more of the traditional server management approach; if properly set up, you can have it worry about the network configuration and push it to clients.

To me, it beats the entire IoT pull setup that Canonical is pushing with Ubuntu Core and snaps.


Out of curiosity, what sorts of issues are you having? I don't have much experience specifically with Ubuntu on bare metal, but I find kickstart easier to understand and use compared to preseed, for consistent bare-metal installations.

I think there's an OpenStack project for provisioning bare metal servers via an API from images; I wonder how that's doing nowadays.

Puppet is pretty good for configuration management when your systems do actually require the occasional change instead of just being continuously redeployed. Maybe there are people who rebuild their database servers every hour via CI, but I'd rather not.

"Immutable infra" is definitely not the default state of things in DevOps land either. Often people talking about how their infra is immutable just conveniently ignore the parts that aren't. The data has to live somewhere.


Not that I’m a Canonical stan, or anything. But have you looked at MAAS [1]? It works decently well in my small-scale lab testing.

1. https://maas.io


Correct me if I'm wrong, but isn't this just a tool to install a chosen OS (for example Ubuntu)?


I mean yes... but in bare metal there's a lot of things that need to be handled:

- Network topology management, IP address management (IPAM) and host associations (DNS, DHCP services or Static IP assignment)

- IPMI to power on/off machine and change boot orders, serial consoles

- Disk configuration (MD raid sets, LVM configuration, partitioning)

- Firmware of all the components

- Burn-in testing

- Passing configuration to the host OS for application/OS configuration

- Detecting hardware failures and tracking response.

- Install the OS

MAAS does most of these things, but this is all part of the lifecycle of bare metal. So "just a tool to install a chosen OS" undersells the complexity in this space.


https://github.com/Roblox/terraform-provider-maas

There's even a terraform provider in development.


It's what I think of when I hear "bare metal management".

It manages a fleet of physical servers and provides an API to provision servers, configure networks, and install operating systems. Basically an easier alternative to OpenStack for bare metal servers.

It supports IPMI and a bunch of other BMCs, uses PXE to deploy the operating systems, optionally runs its own DNS and DHCP servers, and integrates with stuff like SMART to verify server health before deploying operating systems.

I use it in my research group to manage a fleet of about two dozen servers.


There have been various different setups here. Have you looked at Matchbox?

> matchbox is a service that matches bare-metal machines to profiles that PXE boot and provision clusters. Machines are matched by labels like MAC or UUID during PXE and profiles specify a kernel/initrd, iPXE config, and Ignition config.

https://github.com/poseidon/matchbox


CoreOS was a good attempt in this direction, but seems to have died. I am also a fan of the ChromeOS partition layout, which is solving the same problem.


I still miss CoreOS, it was a really good solution.

Edit: as said in another comment in this thread, there are now alternatives to CoreOS!

- Fedora CoreOS: https://docs.fedoraproject.org/en-US/fedora-coreos/

- Flatcar Container Linux: https://kinvolk.io/flatcar-container-linux/


Perhaps I'm misunderstanding, but power failure resilience is a system design property. You shouldn't be trying to solve it at the orchestration/provisioning level. Add some caps, have durable memory, change FS, avoid writes to OS partitions, do atomic updates, etc.
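For the "atomic updates" part, even at the file level the pattern is just write-to-a-temp-file, flush, then rename onto the target on the same filesystem. A sketch (the config path and the generate-config command are hypothetical):

    #!/bin/bash
    # Replace a config file so a power cut leaves either the old or the new
    # version intact, never a half-written one.
    set -eu
    TMP=$(mktemp /etc/myapp/config.json.XXXXXX)   # same filesystem as the target
    generate-config > "$TMP"                      # hypothetical generator
    sync "$TMP"                                   # flush file data (coreutils sync accepts a file arg)
    mv "$TMP" /etc/myapp/config.json              # rename(2) is atomic within a filesystem
    sync /etc/myapp                               # flush the directory entry as well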

However, you might find QBee relevant for the other stuff.


The original author of busybox had a really good idea for this. I don't remember exactly but I think he partitioned the bootdisk in half and had the bootloader fail a partition (and not boot it) if it didn't boot all the way. Each partition would apply OTA updates depending on which was active.


Have you checked out Digital Rebar? I’d be interested in hearing anyone’s experiences with it.

https://rackn.com/rebar/


We use it for our bare metal management and like it for the most part.

It does require a lot of planning though.

But the company has been great to work with and super helpful in slack.


Bare metal for a reason? We found benefits from virtualizing on the same bare metal infrastructure and then running on VMs. Used VMWare for everything.

However, the Docker-based strategy is far ahead.


You'd still need to bootstrap the metal with a VM/container environment, anyway.


Is there any specific reason not to use something like SmartOS?


Check out Cockpit by the Red Hat guys. I think this might be useful for you.


Appreciate the suggestion but this looks like a provisioning tool. Puppet is more or less good enough, it's the OS and its mutability which is our problem.


Maybe I read it wrong, I thought you were looking for a fleet management tool for bare metal. Anyone have any good suggestions if Cockpit doesn't fit the bill? https://www.redhat.com/sysadmin/intro-cockpit


It's been some time since I've happened across Cockpit on Ubuntu, but previous experience showed it was several major updates behind and lacked a lot of the functionality you see in the screenshots.



