I work in a different industry and am responsible for maintaining a fleet of bare metal OSes (we currently use Ubuntu).
Bare metal management really feels like an unsolved problem. Whilst everybody working with cloud environments is whisked away by the latest shiny tools like Docker and Ansible, those of us working with bare metal are still trying to find a way to keep machines up and running with an OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config.
The only existing candidate I've seen is Balena, but it only supports specific hardware and the cost is probably so high that we wouldn't be making a profit if we went with it.
At my current employer we are building a custom flavor of Ubuntu and provision it with Puppet, but we still get failures, and it's far from the immutable haven that DevOps guys would be used to.
> those of us working with bare metal are still trying to find a way to keep machines up and running with an OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config.
I'm guessing there's more to this story than you've summarised because those points are pretty easily solved with:
- UPS (if power outages are that much of a problem then you might need to invest in a generator as well).
- iLO / IPMI (remote management). Though even just running a serial cable out the back of the server is good enough for a remote console in the event of a network failure.
As for managing the config of them, the usual tools like Ansible and Puppet work just as well (in some cases actually better since they were initially designed for on-prem hardware). Likewise for Docker. So don't think you can't run those tools on bare metal Linux. But if you don't want the containerisation-like aspects of Docker but still wanted the deployment tools then you can go a long way with git and shell scripts.
While DevOps really came into popularity with cloud hosting, there's nothing fundamentally new about a lot of the tooling that wasn't possible in the old days of bare metal UNIX and Linux. We older sysadmins were doing a lot of the same stuff back then too; we just didn't give it trendy names.
10+ years ago I was installing Linux remotely on bare metal hardware via a console server. Then once I had one machine installed I'd take one HDD out of the mirrored RAID array and plug it into the 2nd server so the RAID controller would clone it. I could repeat the disk swapping as many times as I wanted as long as I remembered to change the IPs (again, via remote serial console). And this was the lazy way of deploying a fleet of servers quickly. More professional places would push the install onto the server via netboot cloning solutions.
Yes, there's more to it sadly. The hardware isn't in our control and UPSes are out of the question. The machines aren't on premises or in a datacentre.
Puppet works okayish but when you have a large fleet of what are basically IoT devices, you start getting an unpleasant failure rate. Push one bad network config and you've 'bricked' thousands of machines.
I am being a bit vague intentionally, hope you can understand. But to help imagine, the scale of our problem is pretty big, you've probably even had some interaction with one of our machines.
This isn't a hard problem space, but treating an IoT device as a devops-managed fleet of hosts is asking for a bad time. Most modern devops situations assume you can, worst case scenario, replace the machine outright with relatively low friction. This isn't true with IoT. I wouldn't recommend puppet at all.
For a household-name IoT device, we did 10,000+ hours' worth of testing on an array of devices for every update candidate. Think a walk-in closet chock full of devices covering every surface. This included thousands of hours of pure power-interruption scenarios, all automated.
We had "alpha" and "beta" branches for internal users for about a month before updates hit customers. If we bricked a device we could replace it.
For all update channels, we set a percentage of devices to deterministically receive updates. We started at 1% (probably 0.0001% now...) and doubled the rollout every day or so as long as things looked good.
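The deterministic part is just bucketing devices by a hash of a stable ID; something like this sketch (paths and helper names are made up, not our actual code):

    # Map a stable device ID to a bucket 0-99 and only update devices whose
    # bucket is below the current rollout percentage. Deterministic, so the
    # same early devices stay in the first ring as the percentage grows.
    DEVICE_ID="$(cat /etc/device-id)"   # hypothetical location of a stable ID
    ROLLOUT_PERCENT=1                   # raised as confidence in the update grows
    BUCKET=$(( 0x$(printf '%s' "$DEVICE_ID" | sha256sum | cut -c1-8) % 100 ))
    if [ "$BUCKET" -lt "$ROLLOUT_PERCENT" ]; then
        fetch-and-stage-update          # hypothetical helper
    fi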
We had the ability to "roll back" to the last good firmware, which is stored on a partition on the device. This is almost never used.
Devices update in the background, unobtrusively. Once the update is complete, the device waits for a quiet window in which to reboot and try out the new firmware. Boot-time tests and a custom watchdog monitor the device to make sure everything works, including networking, filesystem, all services start up normally. If there is an anomaly, the device reboots back to the previous firmware, and this is reported home.
After some period of stability (minutes to hours usually) the device marks the update as good and will keep using it. If the device crash loops three times in a row within some window, it reverts to the previous version.
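The crash-loop/rollback logic itself is not much more than a counter on a persistent partition; a rough sketch of the shape of it (invented paths, not our actual code):

    # Runs early in boot. /data survives updates; the bootloader reads
    # /data/active-slot to decide whether to boot root partition A or B.
    STATE=/data/boot-attempts
    ACTIVE=$(cat /data/active-slot)                 # "a" or "b"
    TRIES=$(cat "$STATE" 2>/dev/null || echo 0)
    echo $((TRIES + 1)) > "$STATE"

    if [ $((TRIES + 1)) -ge 3 ]; then
        # Three bad boots in a row: flip back to the other slot and report home.
        if [ "$ACTIVE" = "a" ]; then echo b > /data/active-slot; else echo a > /data/active-slot; fi
        report-rollback "$ACTIVE"                   # hypothetical phone-home helper
        reboot
    fi

    # A separate health-check service clears the counter (and marks the firmware
    # good) only after networking, services and the watchdog have looked healthy
    # for the whole stability window:
    #   echo 0 > /data/boot-attempts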
Yes, losing power sucks, and yes devices will get bricked and filesystems will get corrupted, but you can do a lot to minimize it. Like redundant partitions, leaving ample room in flash storage for bad blocks to prolong the device life, tuning log storage to avoid wasting write cycles and losing data on power loss, lots and lots of metrics and lots and lots and lots of automated integration tests.
If you're really struggling, I can help. Lots of fun problem solving in this space.
My impression of GGP’s post is not that you can’t solve these problems (you clearly can), but that not [m]any distros are interested in tackling this problem (or, if they are, they have not accomplished it effectively or cheaply). So you have to build a bespoke solution, like you’ve done. To be fair, to qualify as a solved problem in the industry one would expect either a standard, possibly community-supported, software implementation of the important parts, or at least documentation such that others could read up, learn about your wins, and apply them consistently to their projects, or both.
Truthfully, the only "distro" that comes close to solving these problems is Yocto, which is really a distro-builder for embedded devices. Yocto & Mender work pretty nicely together. You could probably get Mender working with Ubuntu Core.
That's actually 90% of the solution these days: Use Mender. It wasn't around when I did this, but we likely would've used it, as we built essentially the same thing.
The other 90% is quality control. Distros can't really solve that for you.
> To be fair, to qualify as a solved problem in the industry one would expect either a standard, possibly community-supported, software implementation of the important parts, or at least documentation such that others could read up, learn about your wins, and apply them consistently to their projects, or both.
These are solved problems. Mender exists and documents almost everything I mentioned. Google has published incredibly thorough technical documents describing how Chromecasts and Chromebooks update, and at least the latter solution is Open Source.
If you look, you'd see these problems are not novel, which is why I declared it "not a hard problem space." The prior art is tremendous.
I got _super_ lucky at the startup where I was responsible for OS/updates/security for our IoT devices. Between the design stage and the production run, the price of 4GB SD cards dropped below the price of the 2GB cards on our BOM, so I had an entire spare partition to play with where I could keep a "spare" copy of the entire device image. And we had a "watchdog" microprocessor that could switch the main processor's boot config if it failed to boot. (We were basically running a RaspberryPi and an Arduino connected together. The prototypes were exactly that; the final hardware was an iMX233 and an Atmel328 on our own custom board.)
We used Arch Linux with our own pacman repo, so the devices all pulled their own updates automatically. (Also it was super low risk, these were xmas tree lights, so our problem was "we don't want to ruin anyone's xmas!" instead of "If we fuck this up people might go broke and/or die...")
Christmas tree lights with 4 GB of storage. I'm sure this was a superb product, and a result of sensible decisions, but there is nevertheless something hilarious about that.
In the case of bricking, a strategy I've seen is to have two partitions in your flash (I am imagining your device has flash storage?). A watchdog can then verify the health of your deployment, and if the deployment is unhealthy it can boot from the known-good partition.
On the networking side specifically, if you're using NetworkManager you can set up fallback network profiles which kick in if the primary connection fails. That could be a secondary IP, the last good profile, or DHCP.
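A rough nmcli sketch of that (profile names invented); the higher-priority profile is tried first and NetworkManager falls through to the plain-DHCP profile if it can't come up:

    # primary profile: static config, tried first, give up after a few attempts
    nmcli connection modify uplink-static connection.autoconnect yes \
        connection.autoconnect-priority 10 connection.autoconnect-retries 3
    # fallback profile: plain DHCP, lower priority, picked if the above fails
    nmcli connection modify uplink-dhcp connection.autoconnect yes \
        connection.autoconnect-priority 0 ipv4.method auto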
With regards to bricking thousands of machines, with great power comes great responsibility. Is there no way to test the config before mass deployment? In batches or a canary deployment?
That is an option with servers, but not with "desktop" computers (in my case, custom industrial PCs with touch interfaces).
There is no really great deployment story. In the very best case, you have to connect a keyboard, a PXE LAN cable, change some things in the BIOS, and select the OS you want to clone. But creating the image in the first place takes the better part of a day. Another option is to maybe clone the SSD beforehand. It gets more complicated if you have to flash a certain BIOS, change settings, and so on. Ideally, we'd just want to connect one cable bundle and "pressure tank" the new system with OS and configuration in a couple of minutes.
This is for the PCs that we sell; the story for the office laptops we use is even worse. There are good tools in the Windows world for deployment, but they all seem geared for installations of 1000s of computers. What if the office you manage (on the side of your normal work) just has 20? There is little point in setting up SCCM, WSUS, or newer stuff like Autopilot (which seems pretty cool, but I couldn't figure out how to install MS Office with the user's license, or how to install an ERP from Microsoft themselves... that should be 1-click or 1 line of code if you offer such a solution).
What I'd really like is a mixture of Ansible or Puppet with a stupid-simple monitoring GUI. Then I'd be able to boot from USB, hit a few keys, and come back to a deployed PC later - AND be able to see the PC in a simple desktop app, where I can ping it and see who's logged in, what updates are missing, etc.
You're talking about Windows though. This topic is about Linux. I don't pretend to specialise in automation on Windows like I do with Linux, but the experiences I have had managing Windows instances have all been painful (regardless of whether they were a desktop or server) because the problems require a completely different mindset to solve, and half the time those solutions are only semi-effective. So I do feel your pain there.
For what it's worth, I've had some success with Powershell for package management and domain management, and tools like Clonezilla / Norton Ghost for managing images on small to medium sized fleets of machines (again, both desktop and server). There are also a plethora of tools that can interrogate what machines are on a given network, the software installed and their patch levels -- but most of them are not going to be free. However there definitely are alternative options to SCCM and WSUS if they're too "enterprisey" for your needs (I've used a few different ones but I'm afraid I can't recall the names of the more effective solutions in terms of ease of use and features vs license fee).
I agree, Windows is the main difficulty here, but we also ship (Desktop) Ubuntu. It's much more amenable to command line tools, but probably still nowhere close to what people working with disposable VMs on the cloud are used to.
>As for managing the config of them, the usual tools like Ansible and Puppet work just as well (in some cases actually better since they were initially designed for on-prem hardware).
I hear this a lot, but do they really? As best I can tell they require writing everything from scratch. I went down this path just as an exercise to see what end-users were having to deal with. Everything in Ansible appears to be re-inventing the wheel.
Sure it's powerful, and it provides a robust framework to do everything, but out of the box it does NOTHING on its own. Want to update ilo(m)? Someone probably has a playbook somewhere you can build on, but you have to go find the playbook, and modify it to meet your needs. Same with updating an OS, or software package or X.
There's no "point at this server, scan the server, figure out the hardware model, os, application packages, and build a default list of things you might want to manage".
Cisco UCS handles a chunk of this for the physical hardware, but it has plenty of issues itself. Same with HPE and Synergy, and Dell/EMC with OpenManage (this may have been replaced, I honestly haven't had to deal with Dell recently).
Honestly it gets a bit frustrating when the response is always "there's this tool that can do it if you just spend 6 months customizing it to your environment!". I think what OP is asking for is some intelligence to automatically discover.
If Ansible can do that, I'm all ears, but I haven't figured out how.
I've used Ansible to manage thousands of bare metal servers; you do have to build some stuff yourself, but it mostly just works. The Ansible fact-gathering framework is pretty neat: you can do X or Y depending on facts from hosts, and you can run it in a pull model or push model, through AWX or Jenkins. A lot of freedom.
If you want something that just works automatically and is free, well, I don't think you will find one. Learning Ansible (or another configuration management tool) well will pay dividends throughout your career tho.
> I hear this a lot, but do they really? As best I can tell they require writing everything from scratch. I went down this path just an exercise to see what end-users were having to deal with. Everything in ansible appears to be re-inventing the wheel.
Other people's Ansible playbooks are often nice, because people mostly only publish things that stick to "best practice". I'll often use these if they exist and seem well thought out (and can do what I need done).
For things where there isn't an obvious "good choice of existing playbook", I'll sometimes write a "proper" one (we've got a quite good one here that got used a lot when we were deploying a lot of fundamentally similar Grails/Tomcat/Apache apps with a bunch of common dependencies), or I'll just resort to using Ansible to run remote commands over ssh. If I know how to do it on the server from the command line, it's trivial to do exactly the same thing using Ansible from my laptop (or our bastion or config host). The trick here is making sure you don't write Ansible/ssh that screws up if you run it twice - which mostly isn't too hard to avoid so long as you remember to do so.
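A trivial illustration of that idempotency point for raw commands (the file and setting here are arbitrary):

    # append the line only if it isn't already there, so reruns don't duplicate it
    grep -qxF 'vm.swappiness=10' /etc/sysctl.d/99-tuning.conf 2>/dev/null \
        || echo 'vm.swappiness=10' >> /etc/sysctl.d/99-tuning.conf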
> I think what OP is asking for is some intelligence to automatically discover.
> If Ansible can do that, I'm all ears, but I haven't figured out how.
I suspect this is a difference in approaches. It's easier in the cloud/VM world, but I still treat bare metal servers more like cattle than pets. If I get a problem like "OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config," I just get the box reimaged to a known state, then run the Ansible "update and deploy/configure from scratch" on it.
I don't think trying to build something to "intelligently auto discover" and repair a corrupted OS from hard powerdowns or a totally botched configuration change is a good use of my time... I'll just stand up another server from scratch using tried/tested automation.
Something else to consider is all the edge cases you'll hit with tools like Ansible and Puppet. I'm pretty sure neither of them handles apt getting into a locked state because an operation was interrupted. So you end up having to write some hacky script to detect that and fix it.
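The hacky script ends up looking something like this sketch (treat it as a band-aid, not a fix):

    # wait for any in-flight dpkg/apt operation to release its locks...
    while fuser /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock >/dev/null 2>&1; do
        sleep 5
    done
    # ...then repair anything a previous interrupted run left half-finished
    dpkg --configure -a
    apt-get -f install -y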
Ansible is not even an option for IoT because it doesn't scale well and requires a stable connection. Puppet's agent model works better for large fleets especially if you're crossing multiple gateways.
No luck with Ansible pull? Sounds like it may be ideal for your use case. Git pulls are often incremental and compressed, and going over https may prove to be more stable than ssh over unstable connections.
Since Ansible is executed locally, there's no need to have a persistent ssh connection to run your playbooks.
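A minimal ansible-pull setup is just a cron entry (or systemd timer) on every device, along these lines, assuming the repo keeps a local.yml playbook at its root (URL and paths are placeholders):

    # /etc/cron.d/fleet-config
    # -o (--only-if-changed): skip the run when the checkout is already up to date
    */30 * * * * root ansible-pull -U https://git.example.com/fleet-config.git -d /var/lib/fleet-config -o >> /var/log/ansible-pull.log 2>&1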
Neither Ansible nor Puppet takes 6 months to configure. Yes you have to write the code (and/or pull pre-existing modules) to make the config changes you want, but that's the point of them: they are there to automate the changes of config on servers. If you don't want to change the config of a base install then there is nothing to write. And given config is going to be specific to your environment then of course you're going to need to customise your config management code. No two companies are going to want their servers set up exactly the same.
>Neither Ansible nor Puppet takes 6 months to configure. Yes you have to write the code (and/or pull pre-existing modules) to make the config changes you want, but that's the point of them: they are there to automate the changes of config on servers.
So if I've got 40 different models of servers, with 10 different models of switches, 8 different operating systems, you're telling me right now there's a playbook for every possible hardware configuration I could have that will update all of that to the latest firmware without significant work on my part? Where is this magical repo?
I think you misunderstand how Ansible is supposed to work. You can put in the work (much less than ~ 6 months, I'm available for freelancing!) and have it working. By then you will be able to run your command :-)
You will also be able to share as much of the deployment as you want/can between the different models/OSes. Templating configuration variables based on their characteristics.
What you're asking for is akin to "Why can't I download an app that does what $PRODUCT does but for my business for free? Why do I need to write my own software?"
Free? I’m not sure if that’s a joke. Ansible licensing is about as far from free as you can get.
Having to hire someone to customize ansible is the exact opposite of what op asked for and just reaffirms my point: ansible does nothing without significant customization.
What other ansible licensing would I be talking about? AWX has no support, it is a non-starter in the enterprise.
So again: Ansible is not an answer for someone looking for a ready-made solution to managing bare metal, which was my entire premise. Telling people to just use Ansible or Puppet when they're asking for a solution to the problem is just barely more helpful than telling them you can do it with a bash script and SSH for a consulting fee. It's pedantic and misses the point entirely.
I've used Ansible to manage bare metal. It worked great. I've used Ansible in high availability enterprise environments. Again, it worked great.
I'm not as big of a fan of Puppet, but actually Puppet also works great for on-prem systems given that's what it was originally designed for. If anything, Puppet makes less sense in the cloud than it does on bare metal.
I'm not disputing you have a complex problem but that just means you need to spend a little more time tuning your solution (not less time like you seem to assume).
And if you want my advice about how to approach a daunting build: break your problem down. First start with delivering easy systems which will have the least impact if it goes wrong. This is to get your confidence up in working with the tool. Then start picking the harder targets that will give you the most reward, so even if your project ends up unfinished you've still fixed the biggest problems in your org. Then work your way backwards until everything is fixed. After a while, some of the easier deliveries will become background jobs you can fit in between support queries or half day sprint tickets (depending on whether you Kanban or Sprint). Before you know it, you'll have everything automated and realise it was far less painful than it appeared before you'd started the project.
Disclaimer: I'm a DevOps Manager who has transitioned several orgs through this process :)
AWX (and Tower) have a different use case than the base apps. And I think you'd be surprised how heavily it's used in the enterprise. You don't need to pay thousands to get battle-tested tools.
> Having to hire someone to customize ansible is the exact opposite of what op asked for and just reaffirms my point: ansible does nothing without significant customization.
When you decide that your cattle is going to be pets and pets get individual care and feeding, you have to hire a lot of people to do the individual care and feeding.
I've done management of ~2k Linux systems across 17 generations with Ansible. It is not a big deal. You enforce conventions so you no longer have 2k different servers but rather a fleet of Betsys, a fleet of Franks, a dozen Marshas and a couple of Jacks. And you do not touch the boot configuration, because even in 2020 you do not need to touch the boot/network configuration.
I'm a little surprised we don't see small battery backup systems built into PCs or available as an add-on. One thing I really like about laptops is that they essentially come with a built-in UPS, and I would love (some of) the desktops I work with to have that functionality built-in as well.
Somebody please correct me if this thing exists and is available, I'd love to be wrong!
Lucky you. I have not been that lucky. Heck, I had one go up in flames even though it had a surge protector because of an up-down-up-down event. F'n video card.
I did this for a few years with cobbler (https://cobbler.github.io/). cobbler, pxe, bootp, tftp, ansible and friends pretty much solve this problem. In fact, if you know the mac addresses (or ranges) you can fairly easily designate groups of machines, roles, and the like.
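The MAC-to-profile mapping is only a handful of commands; roughly this shape (names and paths are made up, and option names vary a bit between cobbler versions):

    cobbler import --name=baseos --path=/mnt/os-iso               # register a distro from install media
    cobbler profile add --name=app-node --distro=baseos-x86_64 \
        --kickstart=/var/lib/cobbler/kickstarts/app-node.ks       # a "role" is a profile
    cobbler system add --name=node01 --profile=app-node \
        --mac=AA:BB:CC:DD:EE:01 --ip-address=10.0.0.11            # pin a box by MAC
    cobbler sync                                                  # regenerate PXE/DHCP/TFTP config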
Years ago I did this to "rapidly" provision a couple of thousand machines we bought for the stock exchange. You can do most of your testing locally in vagrants, even simulating the networks you need to provision.
You can go a step further and trigger API updates at the end of your Ansible runs so that cobbler updates collins (https://tumblr.github.io/collins/), letting you track inventory, CM changes and the like. At one point in time we reimaged entire subnets when they lost power - and ran tests against the hardware to ensure the machines functioned post-provisioning.
As far as I know the bourse still uses this system.
> Bare metal management really feels like an unsolved problem. Whilst everybody working with cloud environments is whisked away by the latest shiny tools like Docker and Ansible, those of us working with bare metal are still trying to find a way to keep machines up and running with an OS that doesn't get corrupted from unexpected poweroffs or permanently cut itself off from the network because of a bad config.
It is and isn't solved. It usually takes a lot of work or custom scripts. One of the best is the Nerves Project, which is what I use for IoT deployments [0] or even simple cloud deployments.
Nerves is set up to run Elixir/Erlang, but it's really just a wrapper around Buildroot, and Elixir can start programs in any language desired with some work. One of the core authors wrote a tool called `fwup` for doing immutable updates on Linux [1]. The ability to do an A/B update and have the device automatically roll back if an update fails is crucial.
A year or so ago they changed the default boot process to still allow networking and remote connections to work even if the main application crashes. Surprisingly it's all done using Erlang tooling, AFAICT. There are still rough edges, like limited IPv6 support. You can still get devices failing from dead SD cards -- even if your system boots from a read-only partition, power outages during a write on any partition can effectively destroy the SD card, so skip SD cards if you can.
More and more guix or nixos seem like good practical choices for these kinds of use cases. I prefer guix as it's a little less finicky (though https://gitlab.com/nonguix/nonguix is probably required for all non-purists such as this lowly worm).
Ubuntu OS is now stinky doodoo, which is a shame as it used to be the cat's pajamas for ease of use. Snapd is a debacle.
Snap definitely doesn't belong on servers - our flavor of Ubuntu doesn't have it. If you bootstrap from Ubuntu Base you can cut out a lot of that crap.
Have you considered FreeBSD? Many NAS devices use it, and those are sometimes updated remotely, etc. There's less churn. Keep most of your work in jails, which have been production quality since 10 years ago.
I'm running a few bare metal Fedora CoreOS nodes and it's been a dream with IPMI, Ignition files, containers, and Ansible to orchestrate it all.
Haven't used Flatcar Linux but I have seen that the experience is similar.
CoreOS has indeed since been deprecated; however, when Red Hat acquired CoreOS in Jan 2018 they merged CoreOS and Project Atomic to create Red Hat CoreOS. They also released Fedora CoreOS, which shares some technologies with vanilla Fedora and Silverblue.
Shortly after the acquisition, Flatcar Container Linux was released which is an updated derivative of the original CoreOS.
Wow, thank you for this, I had no idea. What a mess made by this acquisition, reusing the name of a deprecated product isn't really a great idea. Though I'm glad to know there are ways to go back to a CoreOS-like experience, that has been my favorite server setup for a while.
Agreed, the original CoreOS is by far my favorite server OS so far. I was very sad to see it get sold to Red Hat, though few other people shared my concern at the time[0]. I wish I could say my concern turned out to be unjustified, but alas here we are.
I do recommend Flatcar Linux though, and strongly encourage everyone to check it out. Not as jazzed about the name, but what’s in a name?
I've had my eye on it but haven't played with it yet. From a bird's eye view and a discussion with a friend, it doesn't sound ready for production, but I'm not ruling it out.
I think it would be a battle to convince my colleagues and managers to try NixOS precisely because of the learning curve and lack of experts in the hiring pool out there.
I agree that it's completely unsolved. I work in mobile robots and it's just maddening the number of solutions which either won't work if you're headless or won't work if you'll need kernel upgrades, or won't work if you'll need seamless rollbacks, or won't work if you'll want OTA updates or won't work if you'll need to be able to reconfigure the network in-band (not by exiting your app for some outside management GUI).
We ended up with a mostly hand rolled system based on kexec, grub chainloading, and deploying rootfs images as big tarballs. It works, but I really wish there was something we could have just taken off the shelf.
Basically, you want read-only systems with separate partitions for persistent data. Have two images to boot from and overwrite just one of them when you upgrade. This is how Balena does it and it’s a pretty common setup for IoT stuff.
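On a PC-style system you can get the "write the inactive image, try it once, keep it only if it boots" behaviour with stock GRUB; a rough sketch, assuming GRUB_DEFAULT=saved and one menu entry per image (entry names invented):

    grub-reboot image-b && reboot       # boot the freshly written image exactly once
    # ...on the next boot a health-check unit runs; only if everything passes:
    grub-set-default image-b            # otherwise the following reboot lands back on image-a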
Have you tried Fedora Silverblue? It uses OSTree technology to offer many of the features of Ubuntu Core including automatic updating, upgrades that work regardless of power loss, Flatpak confinement for all apps (without any arbitrary restrictions like Snaps), and it's free.
"Silverblue is a variant of Fedora Workstation. It looks, feels and behaves like a regular desktop operating system, and the experience is similar to what you find with using a standard Fedora Workstation.
However, unlike other operating systems, Silverblue is immutable. This means that every installation is identical to every other installation of the same version. The operating system that is on disk is exactly the same from one machine to the next, and it never changes as it is used.
Silverblue’s immutable design is intended to make it more stable, less prone to bugs, and easier to test and develop. Finally, Silverblue’s immutable design also makes it an excellent platform for containerized applications as well as container-based software development. In each case, applications (apps) and containers are kept separate from the host system, improving stability and reliability.
Silverblue’s core technologies have some other helpful features. OS updates are fast and there’s no waiting around for them to install: just reboot as normal to start using the next version. With Silverblue, it is also possible to roll back to the previous version of the operating system, if something goes wrong."
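The day-to-day update/rollback mechanics on OSTree-based systems boil down to a few commands (roughly):

    rpm-ostree upgrade      # stage the new OS image as a separate deployment; the running system is untouched
    systemctl reboot        # boot into the new deployment when convenient
    rpm-ostree rollback     # point the bootloader back at the previous deployment (takes effect on the next reboot)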
You might also look at Canonical MAAS (maas.io) which will get you some of the way there too.
Though that's more of the traditional server management approach; if properly set up, you can have it worry about the network configuration and push it to clients.
To me, it beats the entire IoT pull setup that Canonical is pushing with Ubuntu Core and snaps.
Out of curiosity, what sorts of issues are you having? I don't have much experience specifically with Ubuntu on bare metal, but I find kickstart easier to understand and use compared to preseed, for consistent bare-metal installations.
I think there's an OpenStack project for provisioning bare metal servers via an API from images; I wonder how that's doing nowadays.
Puppet is pretty good for configuration management when your systems do actually require the occasional change instead of just being continuously redeployed. Maybe there are people who rebuild their database servers every hour via CI, but I'd rather not.
"Immutable infra" is definitely not the default state of things in DevOps land either. Often people talking about how their infra is immutable just conveniently ignore the parts that aren't. The data has to live somewhere.
It's what I think of when I hear "bare metal management".
It manages a fleet of physical servers and provides an API to provision servers, configure networks, and install operating systems. Basically an easier alternative to OpenStack for bare metal servers.
It supports IPMI and a bunch of other BMCs, uses PXE to deploy the operating systems, optionally runs its own DNS and DHCP servers, and integrates with stuff like SMART to verify server health before deploying operating systems.
I use it in my research group to manage a fleet of about two dozen servers.
There have been various different setups here. Have you looked at Matchbox?
> matchbox is a service that matches bare-metal machines to profiles that PXE boot and provision clusters. Machines are matched by labels like MAC or UUID during PXE and profiles specify a kernel/initrd, iPXE config, and Ignition config.
CoreOS was a good attempt in this direction, but seems to have died. I am also a fan of the ChromeOS partition layout, which is solving the same problem.
Perhaps I'm misunderstanding, but power failure resilience is a system design property. You shouldn't be trying to solve it at the orchestration/provisioning level. Add some caps, have durable memory, change FS, avoid writes to OS partitions, do atomic updates, etc.
However, you might find QBee relevant for the other stuff.
The original author of busybox had a really good idea for this.
I don't remember exactly, but I think he partitioned the boot disk in half and had the bootloader fail a partition (and not boot it) if it didn't boot all the way. OTA updates would be applied to one partition or the other, depending on which was active.
Bare metal for a reason? We found benefits from virtualizing on the same bare metal infrastructure and then running on VMs. Used VMWare for everything.
Appreciate the suggestion but this looks like a provisioning tool. Puppet is more or less good enough, it's the OS and its mutability which is our problem.
Maybe I read it wrong; I thought you were looking for fleet management for bare metal.
Anyone have any good suggestions if Cockpit doesn’t fit the bill?
https://www.redhat.com/sysadmin/intro-cockpit
It's been some time since I've happened across Cockpit on Ubuntu, but previous experience showed it was several major updates behind and lacked a lot of the functionality you see in the screenshots.
Do they want all the "features" for free? I went to https://ubuntu.com/core and it says 10 year security update commitment, but it doesn't say it's going to be free. How will Canonical make money?
I had that reaction as well. At the end of the article, though, they explain where Ubuntu Core makes sense and why their use case doesn't (it seems that the main reason is they don't have a subscription model, so they can't budget ongoing costs, or something like that).
In my mind, I'm here wondering -- if you really need the features, spending your own dev time for self-maintenance has to cost more than $30k/year... what are they thinking?
> Accordingly, the risk for NextBox users would be that at some point in the future, Canonical would revoke this privilege from us, making NextBox un-updatable from one day to the next, or at worst, unusable.
> In addition, it became apparent that we had not selected sufficiently strict according to open source criteria. Assuming Canonical would eventually cease to exist or discontinue Ubuntu Core, it would be nearly impossible with Ubuntu Core for the open-source community to ensure that NextBox would continue to be usable in a meaningful way.
It's about more than the monetary costs over the coming couple of years. Canonical pulling a RedHat here would be much worse for Ubuntu Core users than it is/was for CentOS users.
That's exactly it. I work at Canonical and was part of the internal conversation around this subject. We constantly walk that fine line where we want to encourage open source work and communities around it to flourish, while at the same time we need to pay for bandwidth and people's salaries to be working on that exact technology. The irony is that for the particular case at hand, they would probably get it for free because despite being a commercial project it's a small one at that, and we love to see such initiatives taking place. At the same time, we work with major industry players that are supposed to pay the bill, for their own benefit and for everybody else's too, otherwise we just go out of business and that's no good. It took time mainly because we need to set the exact terms without arbitrary discrimination.
We'll have a more clear form for that kind of application soon, so that we can streamline such requests, community or otherwise.
No, there always was an explicit warranty disclaimer and few, if any, distros offered continuous 10-year security updates constrained to a base version. Usually, you had to keep upgrading to keep getting security updates.
Source: spending my teenage years on installing various distros on my dad's computer, simultaneously pissing him off and ensuring I'd never have a girlfriend
It once was a Debian that just worked; nowadays it's some kind of a trap. As you get deeper in you start to notice non-standard things that get in the way more than they should, besides being utterly non-standard.
Things I've noticed so far:
- auto-update enabled by default. if I boot a vm it's going to be nearly unusable (can't install packages) because it's going to spend the first 30 minutes doing a full upgrade
- netplan -- not sure why that's there
- snaps. for everything. the last straw for me was realizing that gnome-calculator is packaged as a snap. it took almost 20 seconds to show the f-ing calculator. every time an app is slow i suspect that's because it's packaged as a snap.
- doing weird stuff with motd. why?
At this point my next reinstall will be a good old Debian.
The funny thing is that NixOS actually did solve the goddamn dependency hell problem in the best possible way, and yet there are n different ways that try to do something but fail. Linux should at least converge on this one actually good solution.
AppImage is the best, and I wish more developers would use it. At least for desktop/end-user software. Maybe Guix is a better choice for a server, though I don't have much experience with it.
AppImage is basically a Linux port of the Windows workflow where you download an exe file from random sites and run it. No update mechanism, no discovery, no install/remove mechanism, no sandboxing.
AppImage is super cool for being able to quickly test builds of stuff, but for software you actually use it's not great.
I have been using Fedora Silverblue for a few months now, using Flatpak to install every GUI tool, and it's been excellent. I use a tool called Flatseal which lets me tighten or loosen permissions for apps based on what I need, which has been awesome. I can just flat out disable networking on apps if I don't need the networked parts.
> Appimage is basically a linux port of the windows workflow where you download an exe file from random sites and run it. No update mechanism, no discovery, no install/remove mechanisms no sandboxing.
I use it on https://mudlet.org and it's amazing. No Linux users ever have issues with them - it's an effectively solved distribution problem.
The one drawback is that you need to use an ancient compiler, but for our purposes that ancient compiler supports C++17 so that is okay for the time being.
> - auto-update enabled by default. if I boot a vm it's going to be nearly unusable (can't install packages) because it's going to spend the first 30 minutes doing a full upgrade
There is something seriously wrong with the Ubuntu auto updater. It will literally run for hours at 100% CPU on a system that hasn't been updated in a few months, vs an apt update && apt upgrade that can do the same work in a few minutes. It acts almost like some sort of O(n^x) behavior where x > 2.
I just disabled it and moved on with my life as bugs like these in my experience get ignored forever when you report them.
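For anyone wanting to do the same, "disabled it" on a recent Ubuntu amounts to something like:

    # turn off the timers that drive the background apt runs...
    sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer
    # ...and/or zero the flags in /etc/apt/apt.conf.d/20auto-upgrades:
    #   APT::Periodic::Update-Package-Lists "0";
    #   APT::Periodic::Unattended-Upgrade "0";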
When it came out in an LTS, I was impressed by being able to declaratively describe networking. And there's even a way to test a configuration with auto rollback.
These are features that I find great in things like Juniper routers.
But when I went online to see what people thought, there was just annoyance.
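(The test-with-rollback I mean is `netplan try`: it applies the new config and reverts automatically unless you confirm it within the timeout, which saves you from locking yourself out of a remote box.)

    sudo netplan try --timeout 120   # apply /etc/netplan/*.yaml, revert if not confirmed within 120s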
I really think that netplan exists because "everyone knows" NetworkManager is bad. Ironically, I think it's for the same reason that netplan is now bad: it's a leaky abstraction which doesn't always support the feature you want. I've worked with people who disable NM as a matter of muscle memory on new systems, and I wonder how long it'll be until netplan goes that route (hehe).
Well, unless you used to do bridges with nmcli (and if you did, I'm really impressed), netplan does have some advantages.
And for all the swearing I did when I had to change the Packer conf, then the Ansible conf, I do think netplan is in fact easier to understand, read and change than brctl/bridge-utils.
netplan brings "generate" and "apply" and so on, but that's about all the usefulness it brought, while completely upending the configuration format and supported functionality. It seems like there could have been a less disruptive way to add that functionality, or at least netplan could have been more feature-complete when they switched to it.
Snaps pissed me off too, and I vowed to switch to another distro (probably Debian) when this installation stopped working. In particular, it annoys the hell out of me when they pull a bait-and-switch by automatically converting `apt install X` commands to `snap install X` for certain popular packages.
...but this Kubuntu system (my primary workstation) has been rock steady for the past ~5 years with no signs of problems. So as annoyed as I am, I gotta hand it to them for keeping things working smoothly.
I'm really enjoying Pop! OS. It's Ubuntu (and I installed the default ubuntu-desktop over Pop's), but it has its own "store" with .debs and flatpaks, no snaps. It's still possible to accidentally install a snap, but it's also easy to revert that. It's also very nice how effortlessly the graphics drivers and full-disk encryption work.
I haven't tried Debian on desktop, but it is great for servers.
When I have a new computer and want to install Linux on it, I reach for Ubuntu out of habit, because I have this vague idea that it will have the drivers that I might need for my screen, keyboard, WiFi, etc. to work. But, because of the issues you raise, I’d much rather install Debian. Is it practical to do so?
On all my home boxes I run vanilla Debian and it’s great. No fuss. I’ve started to dabble outside my comfort zone with a Fedora box and it’s also pretty great, despite the culture shock.
Part of losing that "shine" is them abandoning Unity and other Ubuntu-specific things. Back when it first got popular one of its selling points was how easy it was to install the proprietary nvidia driver. It always had something a little extra compared to other distros. Now it's using the same stuff as everything else and meanwhile I've ditched nvidia a long time ago and now even Gentoo is as easy to install (for me at least) as Ubuntu was back then.
“ if I boot a vm it's going to be nearly unusable (can't install packages) because it's going to spend the first 30 minutes doing a full upgrade” - wouldn’t be an issue if you install from the newest image. Also if your cloud provider is so crappy that it actually takes 30 minutes to apply a few updates maybe find a new one.
I'm sorry but this is neither on point nor good advice.
I can't be bothered making a new Packer image each time a new Ubuntu image is up. And people do use USB keys to let their friends and family try Ubuntu (or, you know, as a backup).
Also, even if your point is wrong, you're kinda right: you can disable auto-update in the seed file or during install. And minimal images are a good idea/good practice and won't ever take 30 minutes even if your image is really old.
I once spoke with a Canonical sales guy and he asked me why I had said, in the conversation we had prior, that snaps were not good.
Somehow nobody at Canonical thinks that snap performance is an issue at all; he was genuinely surprised. Snap is just unusable, full stop. If everything takes like 10x as long to open I cannot work with my system anymore. I'd rather work with Windows or a chalkboard instead.
I don't know who you talked to, but I can tell the story is richer and more interesting than that. Saying that snaps have performance issues is similar to saying that containers have performance issues. They do affect performance because doing something is always more expensive than doing nothing, and snaps do something in addition to just running a bare executable on your machine. At the same time, the kind of operation that snaps perform should not have a significant impact on a modern computer to the point of making it slow or annoying, because most of the operations are relatively simple and happen at a low level, and computers are fast.
At the same time, snaps are a new packaging format, and when you change the layout of applications to include things such as restrictions or making things read-only, suddenly all kinds of things can go wrong, and some of these can cause major performance impact.
Two easy and real examples from the snap world: early on there was a bug where .pyc files would be out of date, and the filesystem was read-only. This meant every single time the application was opened Python would recompile the entire application and fail to write its cache files in every case. Major performance impact. That was fixed.
Another one: fontconfig cache changed its format, and as a side effect applications running could not make use of the one in the system and had to rebuild their own copy every time. Extreme performance impact. That was fixed.
And the list goes on. So the point is: snaps are not slow, because there's nothing fundamental happening there to make them slow. But snap applications can be slow, of course, potentially by orders of magnitude. These are bugs, and we fix them when we see them.
In my experience everything I opened had significantly longer startup times - especially VS Code, which was a problem for me. I cannot explain why; I just experienced the symptoms, as did a few other people I talked to.
So if it is just an app problem, great; hopefully there will be a day when not every app with a UI I try has that problem.
I would have loved to see a comparison of Ubuntu Core against openSUSE MicroOS (https://en.opensuse.org/Portal:MicroOS), I wonder whether it would be up for the task they had in mind.
"In other words openSUSE MicroOS is an operating system you don't have to worry about. It's designed for but not limited to container hosts and edge devices.
...
- Read-only root filesystem to avoid accidental modifications of the OS
- Transactional Updates leveraging btrfs snapshots to apply updates without interfering with the running system
- health-checker to verify the OS is operational after updates. Automatically rolls back in case of trouble."
Not to mention, Suse is far more lenient when it comes to licensing in my experience. I remember them being extremely blase about one of my old jobs using a significantly higher number of licenses than had been purchased, with their sales team's response amounting to a particularly verbose "meh, you're good."
If I were them, I'd go with something that provides an immutable root FS (or at least immutable /usr) and atomic whole-system updates, the way Chromium OS and Flatcar Linux (former CoreOS) do. IIRC, Balena does this for the base system, but adds containers on top.
Canonical's engineers are very talented. And Canonical provides some truly innovative solutions (MaaS and JuJu come to mind. Snaps to a lesser extent). However, as anecdotes like these show, their user/customer hostile attitude makes one very hesitant to adopt their tech.
> Nevertheless, we were able to build a comparably lean system with Debian: The base systems of both Ubuntu Core and Debian each have a total size of about 1 GB.
This line made me somewhat sad. But it is possible that I am just being old and grumpy.
You can always opt for Alpine or Tiny Core or such if you're into minimalism. IIRC Debian and Ubuntu install lots of localizations by default, which takes up quite some space.
Minimalism? Minimalism is fitting the kernel and initrd on a single floppy ;) (to be fair, it was hard to get a full x11 in there.. Which was one reason the qnx demo floppy was so impressive)
It seems that Ubuntu's 10-year support (ESM, provided by the Ubuntu Advantage program) is $225/y for a server. I can't find a price for Core, but I expect the same or cheaper. IMO that's a fair price for maintaining an ancient distribution. For Desktop it's $25/y, cheap!
Don't get me wrong, but I don't really understand why, for server stuff, you would prefer Ubuntu over a rock-solid operating system like Debian.
Just one number: Debian Buster, for example, has LTS until June 2024, plus 2 more years after that.
And that's just one strength; I won't even go into the well-known ones such as the large community, full openness, among others.
They raise some good points.
However I think it is absurd to blame a company for putting paid features behind a feature wall.
You’re trying to use a paid feature for free, that’s uncool.
I see the point about long term support being deceptive, and also the mixed messaging being confusing.
Hardware-wise: use a BSD. FreeBSD. There are reasons why hardware vendors are almost 100% BSD-based.
It's rock solid and you can depend on it being engineered well.
Snaps, Docker, and various other solutions are symptoms of a problem, not a solution. The problem is that userland is a complete mess in Linux. Fix the kernel and userland together, and everything flows.
I have made OpenBSD based devices like routers that just work for years without failure. It is kind of funny to me hearing this guy complaining about managing bare metal Linux. Maybe I am missing something but I would say (1) find a configuration that does what you need it to do and don't update unless necessary (2) make important file systems read only (3) clone to flash.
For a laptop you're using as a luggable workstation, absolutely. For a laptop you're moving around with and unplugging frequently, you probably want a model that's popular with hackers (e.g. thinkpad) - getting suspend working is often fiddly, and there might be some models where it won't work at all. Most wifi chipsets are supported IME, but it might be worth checking that as well.
I have been using OpenBSD laptops for years. Everything I need works. My configuration is very lean and very fast even on an older machine. Thinkpads (older at least) are 100% hardware supported in my experience. Only thing I can think of that is not supported is hot swapping ultrabay drives. I am not sure if that is supported on Windows since I have not used Windows.