The difference is that when you buy an x86 system, the entire CPU bringup (incl. AGESA/openSIL on AMD) runs on proprietary and poorly documented firmware. You're entirely at the mercy of the vendor.
Oxide has put immense effort into writing open-source platform initialization code, and built their own open-source BMC/RoT solution.
So effectively I’m at the mercy of Oxide, at least as long as their system does not become some kind of standard.
Maybe not in theory, but in practice. Because as a customer, I would probably also need to put in immense effort to understand and maintain that software myself.
Their firmware is open source. You can pay whoever you want to maintain it. You can't do that with Dell, HP, Supermicro, and the unknowable rabbit hole of ODMs and sub-suppliers and contractors who actually make the hardware and firmware for these companies.
Until you've dealt with a malfunctioning Dell or HP server and had to live with being told "we don't know why it acts that way, we'll try to get the ODM to repro", I don't think you can appreciate how cool Oxide's offering seems.
If I have a server under maintenance with Dell or HP, they would replace the server or component for me in such a case.
Which would probably be a lot faster than trying to find someone who could maintain some non-standard firmware (as good as it might be).
Even if I had to replace the server at my own cost, it would probably still be cheaper. And it would be easy to replace because it's commodity hardware; that was kind of my point.
I have had experiences with tens of Dell servers with the same model NIC having the same fault. The servers were absolutely under maintenance. I fought with tech support for weeks before I was finally told it was a driver/firmware issue and that I had to work around it (and lose performance for the sake of reliability).
Maybe if I had hundreds of servers Dell would have helped me out. At the scale of tens, they told me to take what I got. The customer got a lower-performance solution, and nobody anywhere could help them for any amount of money, short of replacing the gear.
That's just a performance issue. I've heard horror stories about reliability, all the way down to disk firmware and RAID controllers. I consider myself lucky.
But how much effort (or money) do you think it would have taken to fix this issue if the NIC firmware was open source?
And with standard hardware, depending on the model, you might have had the option to add dedicated PCIe NICs for example. Not great, but at least something. Now try that with something proprietary (as in non-standard) like this Oxide system.
Replacing hardware? Sure, they'll help. What about debugging firmware though? I'm curious how much help you would get from Dell fixing and patching complicated firmware errors. A side benefit of the openness is that firmware issues can be discussed publicly, and the patches can be upstreamed into the main repo and made available to every customer (and even competitor). This gives you the kind of network effects that you'd never see in a locked-down ecosystem.
The CPUs are x64, but the architecture is not that of a PC: there is no BIOS, etc. You couldn't boot Windows or Linux on the bare metal. The hardware, firmware and hypervisor are custom-built for control, safety and observability. On top of that, the application OSes all run in VMs, which _do_ have a (virtual) PC architecture.
To me as a casual outside observer, the fact that they're using hardware virtualization at the top of the stack, after bcantrill gave so many talks about running containers on bare metal, is the most disappointing part. They could have had unbroken control and observability from the bottom of the stack all the way to the top. They got so close!
It's possible that the hypervisor can reserve you a full CPU or full cores for the guest OS to work with, so you still get most of that bare metal goodness.
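I don't know what Oxide's hypervisor or control plane actually exposes here, but for comparison, on kvm/libvirt (mentioned further down the thread) dedicating host cores to a guest is just vCPU pinning. A rough sketch in Python; the domain name and core layout are made up for illustration:

```python
# Sketch: pin a guest's vCPUs to dedicated host cores via libvirt/KVM.
# "guest-vm" and the core numbers are placeholders; this is not Oxide's
# control plane, just an illustration of reserving full cores for a guest.
import libvirt

DEDICATED_CORES = {0: 2, 1: 3}  # vCPU -> host core (hypothetical layout)

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest-vm")

host_cpus = conn.getInfo()[2]  # number of physical CPUs on the host
for vcpu, core in DEDICATED_CORES.items():
    # cpumap has one boolean per host CPU; only the dedicated core is True
    cpumap = tuple(i == core for i in range(host_cpus))
    dom.pinVcpu(vcpu, cpumap)

conn.close()
```

Combine that with keeping the host scheduler off those cores and the guest effectively owns them, which gets you most of the bare-metal performance story.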
The CPU is commodity, nothing else is. Custom mainboard and firmware without a BIOS, their own BMC-ish thing and their own Root of Trust. Same for their router: standard chip, everything else is custom.
Similar in some ways, different in others. In terms of not being a PC architecture, yes, it is similar. But in many other ways it's not at all like a mainframe.
It's similar to hyperscale infrastructure: it doesn't matter, as long as it looks like a PC architecture from the OS running inside a VM. The layers and layers of legacy abstraction (firmware, BMC, drivers, BIOS, hypervisors) you get with a typical on-premise Dell/HP/SuperMicro/... server motherboard are responsible for cold starts lasting 20 minutes, random failures, weird packet loss, SMART telemetry malfunctions, etc.
This is the type of "PC architecture" cruft many customers have been yearning to ditch for years.
I’m not in the bare metal/data center business anymore at the moment, but I was for more or less the last 25 years. I never had such issues. Maybe I was just lucky?
Maybe you were. :-) And maybe this is not for you or me (I haven't contacted their sales yet); it's not for everyone.
Personally, I have always been annoyed that the BIOS is clunky and every change requires a reboot, taking several minutes. As computers got faster over the years, this has gotten worse, not better. At the core of cloud economics is elasticity: don't pay for a service that you don't use. Wouldn't it be great to power down an idle server, knowing that it can be switched on seconds before you actually need it?
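I don't know how Oxide implements that, but on commodity gear the closest approximation I know of is Wake-on-LAN; a minimal sketch of sending the magic packet in Python, with the MAC and broadcast address as placeholders:

```python
# Sketch: wake a powered-down server via Wake-on-LAN.
# MAC and broadcast address are placeholders; the NIC and firmware must
# have WOL enabled for this to work at all.
import socket

def wake(mac: str, broadcast: str = "192.168.1.255", port: int = 9) -> None:
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    # Magic packet: 6 x 0xFF followed by the MAC repeated 16 times
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))

wake("aa:bb:cc:dd:ee:ff")
```

The catch on a traditional server is that you then still sit through the whole BIOS/POST dance before anything is usable, which is exactly the part I'm complaining about.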
> Wouldn't it be great to power down an idle server, knowing that it can be switched on seconds before you actually need it?
considering you would still need to boot the VMs then, once the Oxide system is up, I’m not sure if this is such a big win.
And at a certain scale you’d probably have something like multiple systems and VMware vMotion or alternatives anyway. So if the ESXi host (for example) takes a while to boot, I wouldn’t care too much.
And, economics of elasticity - you’d still have to buy the Oxide server, even if it’s idle.
> considering you would still need to boot the VMs then, once the Oxide system is up, I’m not sure if this is such a big win.
To be honest, I'm using containers most of the time these days, but even the full-blown Windows VMs I'm orchestrating boot in less than 20s, assuming the hypervisor is operational. I think that's about on par with public cloud, no?
> [...] vMotion [...] ESXi.
Is VMware still a thing? Started with virsh, kvm/qemu a decade ago and never looked back.
> And, economics of elasticity - you’d still have to buy the Oxide server, even if it’s idle.
That's a big part of the equation indeed. This is where hyperscalers have an advantage that Oxide at some point in the future might enjoy as well. Interesting to see how much of that they will be willing to share with their customers...
Re VMware, it’s certainly still a thing in enterprise environments. Can kvm do things like live migration in the meantime? For me it’s the other way round, haven’t looked into that for a while ;)
How do you mean Oxide might have that advantage as well in the future? As I understand, you have to buy hardware from them?
Ah yes, live migration, of course. We design "ephemeral" applications that scale horizontally and use load balancers to migrate. With 99% of traffic served from CDN cache, updates and migrations have a very different set of challenges.
As to your question, I meant to say that as their volumes and economies of scale increase, they can source materials far cheaper than regular shops, possibly similar to AWS, GCP, Azure, Akamai, etc. It would be nice if they were able and willing to translate some of those economies of scale into prices commensurate with comparable public cloud instances.
If you want more insight into all of the things that normally run on "PC architecture" - the 2.5 other kernels/operating systems running underneath the one you think you're running - https://www.youtube.com/watch?v=mUTx61t443A
Every PC has millions of lines of firmware code that often fails and causes problems. Case in point: pretty much all hyperscalers rip out the traditional vendor firmware and replace it with their own, often partially open source.
The BMC is often a huge problem; it's a sketchy architecture and extremely unsafe. Meta is paying for u-bmc development to have something a bit better.
Doing things like attestation of a whole rack's worth of firmware on stacked PCs is incredibly hard, so many companies simply don't do it. Doing it for the switch as well is even harder.
Sometimes the firmware runs during operation and takes over your computer, causing strange bugs (see SMM). If there is a bug anywhere in that stack, there are ten layers of closed-source vendors that don't care about you.
Customers don't care whether it's a PC or not, but they do care whether the machine is stable, the firmware is bug-free, and the computer is safe and not infected by viruses. Not being a PC enables that.
I imagine they are aware that this isn't a solution for many customers. A John Deere tractor makes a poor minivan. This isn't for you. That's fine. It's not for me either. That's ok. I don't need to poo-poo their efforts and sit and moan about how it's not for me.
I can buy these at my local electronics retailer. So pretty much commodity x86.